
Whisper Best Settings: Complete Guide to Optimal Configuration
Eric King
Author
Getting the best results from OpenAI Whisper requires understanding and configuring its various parameters correctly. While Whisper works well with default settings, optimizing these parameters can significantly improve accuracy, speed, and consistency.
This comprehensive guide covers all Whisper settings, explains what each parameter does, and provides optimal configurations for different use cases.
Understanding Whisper Parameters
Whisper's transcribe() function accepts many parameters that control transcription behavior. Here's an overview of the ones this guide covers:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
    audio="audio.mp3",
    verbose=False,
    temperature=0.0,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
    condition_on_previous_text=True,
    initial_prompt=None,
    word_timestamps=False,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,，!！?？:：”)]}、",
    # Decoding options (forwarded to the decoder)
    best_of=5,  # Only used when temperature > 0
    beam_size=5,  # Only used when temperature == 0
    patience=1.0,
    length_penalty=1.0,
    suppress_tokens="-1",
    suppress_blank=True,
    without_timestamps=False,
    max_initial_timestamp=1.0,
    language=None,
    task="transcribe",
    fp16=True,
    # temperature_increment_on_fallback is a command-line option only;
    # in Python, pass a tuple of temperatures instead.
)
Let's break down each parameter and its optimal settings.
Core Parameters
1. Model Size (model)
Most Important Setting
The model size has the biggest impact on accuracy and speed.
# Available models (from smallest to largest):
model = whisper.load_model("tiny") # Fastest, lowest accuracy
model = whisper.load_model("base") # Balanced
model = whisper.load_model("small") # Good accuracy
model = whisper.load_model("medium") # High accuracy
model = whisper.load_model("large") # Best accuracy, slowest
Model Selection Guide:
| Model | Accuracy | Speed | VRAM | Best For |
|---|---|---|---|---|
| tiny | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~1GB | Quick testing |
| base | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~1GB | General use |
| small | ⭐⭐⭐⭐ | ⭐⭐⭐ | ~2GB | Good balance |
| medium | ⭐⭐⭐⭐⭐ | ⭐⭐ | ~5GB | Recommended for most |
| large | ⭐⭐⭐⭐⭐⭐ | ⭐ | ~10GB | Maximum accuracy |
Best Practice:
- For most use cases: Use the medium model
- For speed-critical work: Use small or base
- For maximum accuracy: Use large
- For testing: Use tiny or base
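The selection guide can be turned into a small helper. This is only a sketch: the VRAM figures are copied from the table above and are approximate, and pick_model and headroom_gb are hypothetical names used for illustration, not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the model table above.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_vram_gb: float, headroom_gb: float = 0.5) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    headroom_gb is an assumed safety margin, not a Whisper setting.
    """
    budget = available_vram_gb - headroom_gb
    chosen = "tiny"  # fall back to the smallest model if nothing fits
    for name, need in MODEL_VRAM_GB:
        if need <= budget:
            chosen = name  # list is ordered small to large, so later wins
    return chosen


print(pick_model(8))   # medium
print(pick_model(24))  # large
print(pick_model(2))   # base
```

The returned name can be passed straight to whisper.load_model(). Measure real memory use on your own hardware before relying on these numbers.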
2. Temperature
Controls randomness in decoding
Lower temperature = more deterministic, higher accuracy.
result = model.transcribe(
audio,
temperature=0.0 # Most deterministic, best for accuracy
)
Temperature Values:
| Value | Behavior | Use Case |
|---|---|---|
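| 0.0 | Most deterministic | ✅ Best for accuracy |
| 0.2 | Slight randomness | Good balance |
| 0.6 | Noticeable randomness | Fallback only |
| 1.0+ | More creative | Not recommended for transcription |
Note: In the Python API, the default is not a single value but the fallback tuple (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Decoding starts at 0.0 and only moves to higher values when a result looks unreliable.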
Best Practice:
# For maximum accuracy
temperature=0.0
# Whisper also accepts a fallback schedule of temperatures
temperature=(0.0, 0.2, 0.4, 0.6, 0.8) # Starts at 0.0; moves to the next value only if the result fails the quality thresholds
Why Lower Temperature is Better:
- Reduces random variations
- More consistent results
- Better for technical terms
- Improves accuracy by 2-5%
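When temperature is a tuple, Whisper treats it as a fallback schedule rather than trying every value. The sketch below is a simplified model of that loop, based on my reading of the open-source openai-whisper package. The decode callable and the dict keys are hypothetical stand-ins for a single decoding pass, and the real implementation also considers the no-speech probability.

```python
from typing import Callable, Optional, Sequence


def decode_with_fallback(
    decode: Callable[[float], dict],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> Optional[dict]:
    """Try each temperature in order and stop at the first acceptable result.

    decode(t) is a hypothetical stand-in for one decoding pass. It must return
    a dict with "avg_logprob" and "compression_ratio" keys.
    """
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = result["compression_ratio"] > compression_ratio_threshold
        too_uncertain = result["avg_logprob"] < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # good enough, no need to raise the temperature
    return result  # the last attempt is returned even if every attempt failed


def fake_decode(t: float) -> dict:
    # Pretend the greedy pass got stuck in a loop and sampling fixed it.
    if t == 0.0:
        return {"text": "la la la la", "avg_logprob": -0.3,
                "compression_ratio": 3.1, "temperature": t}
    return {"text": "a normal sentence", "avg_logprob": -0.4,
            "compression_ratio": 1.3, "temperature": t}


best = decode_with_fallback(fake_decode)
print(best["temperature"])  # 0.2
```

With a single value such as temperature=0.0 there is nothing to fall back to, so the first result is always kept.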
3. Best Of
Number of decoding attempts
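Samples multiple decodings and keeps the best one. In the openai-whisper package this only applies when temperature is above 0; at temperature=0.0 decoding is deterministic and best_of is ignored.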
result = model.transcribe(
audio,
best_of=5 # Try 5 decodings, pick best
)
Best Of Values:
| Value | Accuracy Gain | Speed Impact | Recommendation |
|---|---|---|---|
| 1 (default) | Baseline | Fastest | General use |
| 3 | +2-4% | 3x slower | Good balance |
| 5 | +3-8% | 5x slower | ✅ Best for accuracy |
| 10 | +5-10% | 10x slower | Overkill for most |
Best Practice:
# For critical transcriptions
best_of=5 # Good balance of accuracy and speed
# For general use
best_of=1 # Fastest, still accurate
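How best_of interacts with temperature and beam_size is easy to miss. As far as I can tell from the openai-whisper source, transcribe() drops best_of for a pass at temperature 0 and drops beam_size and patience for passes above 0. The helper below is a hypothetical illustration of that rule, not a Whisper function.

```python
def effective_decode_options(temperature: float, **options) -> dict:
    """Return the options that apply to a single decode at this temperature."""
    active = dict(options)
    if temperature > 0:
        # Sampling pass: beam search settings do not apply.
        active.pop("beam_size", None)
        active.pop("patience", None)
    else:
        # Deterministic pass: nothing is sampled, so best_of is unused.
        active.pop("best_of", None)
    active["temperature"] = temperature
    return active


print(effective_decode_options(0.0, best_of=5, beam_size=5, patience=1.0))
# {'beam_size': 5, 'patience': 1.0, 'temperature': 0.0}
print(effective_decode_options(0.4, best_of=5, beam_size=5, patience=1.0))
# {'best_of': 5, 'temperature': 0.4}
```

In practice this means best_of only costs time when a fallback temperature above 0 is actually reached.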
Trade-off: best_of=5 improves accuracy by 3-8% but is 5x slower.
4. Beam Size
Beam search width
Controls how many candidate sequences are explored during decoding.
result = model.transcribe(
audio,
beam_size=5 # Explore 5 candidate sequences
)
Beam Size Values:
| Value | Accuracy | Speed | Use Case |
|---|---|---|---|
| 1 | Lowest | Fastest | Real-time (greedy) |
| 5 | High | Medium | ✅ Recommended |
| 10 | Very High | Slow | Maximum accuracy |
Best Practice:
# Standard configuration
beam_size=5 # Good balance
# For maximum accuracy
beam_size=5 # Works well with best_of=5
Note: Beam size works together with best_of. Use beam_size=5 with best_of=5 for optimal results.
5. Patience
Beam search patience
Controls how long beam search keeps going: decoding stops once roughly beam_size × patience finished candidates have been collected, so higher values search longer.
result = model.transcribe(
audio,
patience=1.0 # Default, good balance
)
Patience Values:
| Value | Behavior | Use Case |
|---|---|---|
| 0.0 | Aggressive pruning | Fast, less accurate |
| 1.0 | Default | ✅ Recommended |
| 2.0 | More patient | Slower, slightly better |
Best Practice: Keep at default 1.0 unless you need maximum accuracy.
6. Condition on Previous Text
Use context from previous segments
Improves accuracy by using context from earlier in the audio.
result = model.transcribe(
audio,
condition_on_previous_text=True # Use context
)
Settings:
| Value | Behavior | Impact |
|---|---|---|
| True | Uses previous context | ✅ Better accuracy |
| False | Independent segments | Faster, less accurate |
Best Practice:
# Always enable for better accuracy
condition_on_previous_text=True
Why This Helps:
- Maintains context across segments
- Better handling of pronouns and references
- Improves accuracy by 2-5%
7. Language
Specify language explicitly
Improves accuracy when language is known.
result = model.transcribe(
audio,
language="en" # English
)
Language Codes:
language="en" # English
language="zh" # Chinese
language="es" # Spanish
language="fr" # French
language="de" # German
language="ja" # Japanese
# ... and 90+ more languages
Best Practice:
# If language is known, always specify
language="en" # Improves accuracy by 5-10%
# For auto-detection
language=None # Slower, may misdetect
Why This Helps:
- Skips language detection step (faster)
- Prevents misdetection errors
- Better handling of accents
- Improves accuracy by 5-10%
8. Initial Prompt
Provide context about content
Helps Whisper understand domain-specific terms and context.
result = model.transcribe(
audio,
initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
Best Practice:
# Include:
# - Topic/domain
# - Speaker names
# - Technical terms
# - Context
initial_prompt="""
This is a business meeting about Q4 product planning.
Participants: Sarah (Product Manager), John (Engineer).
Topics: API endpoints, microservices, Kubernetes.
"""
Why This Helps:
- Better recognition of proper nouns
- Improved technical terminology
- Context-aware transcription
- Improves accuracy by 5-15% for domain-specific content
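The checklist above (topic, speakers, terms) can be assembled programmatically. A minimal sketch follows; build_initial_prompt and max_chars are hypothetical names. The character cap is an assumption on my part: Whisper only uses a limited number of prompt tokens, so a short prompt is the safer choice.

```python
def build_initial_prompt(topic, speakers=(), terms=(), max_chars=600):
    """Assemble a short context string suitable for initial_prompt.

    max_chars is an assumed, conservative cap, not a Whisper parameter.
    """
    parts = [f"This is {topic}."]
    if speakers:
        parts.append("Participants: " + ", ".join(speakers) + ".")
    if terms:
        parts.append("Topics: " + ", ".join(terms) + ".")
    return " ".join(parts)[:max_chars]


prompt = build_initial_prompt(
    "a business meeting about Q4 product planning",
    speakers=["Sarah (Product Manager)", "John (Engineer)"],
    terms=["API endpoints", "microservices", "Kubernetes"],
)
print(prompt)
```

The resulting string is passed as initial_prompt=prompt. Spelling names and jargon exactly as you want them to appear in the transcript is the main point of the exercise.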
9. Word Timestamps
Get word-level timestamps
Useful for subtitles, search, and detailed analysis.
result = model.transcribe(
audio,
word_timestamps=True # Get word-level timestamps
)
# Access word timestamps
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
Settings:
| Value | Output | Use Case |
|---|---|---|
| True | Word-level timestamps | Subtitles, search, analysis |
| False | Segment-level only | General transcription |
Best Practice:
# Enable for subtitles, search, or detailed analysis
word_timestamps=True
# Disable for simple transcription (faster)
word_timestamps=False
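For the subtitle use case, word-level timestamps can be grouped into SRT cues. This is a sketch that assumes the result dict has the shape shown above (segments containing words with word, start and end keys). words_to_srt and words_per_cue are hypothetical names, and the sample data is made up for illustration.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(result: dict, words_per_cue: int = 7) -> str:
    """Group word timestamps into numbered SRT cues of a fixed word count."""
    words = [w for seg in result["segments"] for w in seg.get("words", [])]
    cues = []
    for i in range(0, len(words), words_per_cue):
        group = words[i:i + words_per_cue]
        # Whisper's word strings usually carry their own leading space.
        text = "".join(w["word"] for w in group).strip()
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Made-up sample in the same shape as a transcribe() result.
sample = {"segments": [{"words": [
    {"word": " Hello", "start": 0.0, "end": 0.42},
    {"word": " world", "start": 0.42, "end": 0.9},
    {"word": " again", "start": 1.2, "end": 1.75},
]}]}
print(words_to_srt(sample, words_per_cue=2))
```

A fixed word count per cue is the simplest possible grouping. Splitting on pauses or punctuation usually reads better, but needs more logic than fits here.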
10. Task
Transcribe or translate
Controls whether to transcribe or translate to English.
# Transcribe (default)
result = model.transcribe(audio, task="transcribe")
# Translate to English
result = model.transcribe(audio, task="translate")
Settings:
| Value | Behavior | Use Case |
|---|---|---|
| "transcribe" | Original language | Standard transcription |
| "translate" | English translation | Multilingual content |
Best Practice:
# For transcription
task="transcribe"
# For translation to English
task="translate" # Useful for multilingual meetings
Threshold Parameters
11. No Speech Threshold
Detect silence/non-speech
Controls sensitivity for detecting non-speech segments.
result = model.transcribe(
audio,
no_speech_threshold=0.6 # Default
)
Values:
| Value | Behavior | Use Case |
|---|---|---|
| 0.2 | Very sensitive | Noisy audio |
| 0.6 | Default | ✅ Recommended |
| 0.8 | Less sensitive | Clean audio |
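The no-speech threshold does not act alone. Based on my reading of the openai-whisper source, a segment is treated as silence only when its no-speech probability is above no_speech_threshold and its average log probability is not above logprob_threshold. The function below is a hypothetical sketch of that rule, not Whisper's own API.

```python
def should_skip_segment(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: float = 0.6,
    logprob_threshold: float = -1.0,
) -> bool:
    """Return True if the segment would be treated as silence."""
    skip = no_speech_prob > no_speech_threshold
    if skip and avg_logprob > logprob_threshold:
        # The decoder is confident in its text, so keep the segment anyway.
        skip = False
    return skip


print(should_skip_segment(0.9, -1.5))  # True: likely silence, low confidence
print(should_skip_segment(0.9, -0.2))  # False: confident text is kept
print(should_skip_segment(0.3, -1.5))  # False: speech was detected
```

Lowering no_speech_threshold therefore makes Whisper drop more segments as silence, but only those where the decoded text is also low-confidence.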
Best Practice: Keep at default 0.6 unless you have specific needs.
12. Log Probability Threshold
Filter low-confidence segments
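Flags low-confidence decodes: when a segment's average log probability falls below this value, Whisper treats the result as failed and retries at the next temperature in the fallback schedule, if there is one.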
result = model.transcribe(
audio,
logprob_threshold=-1.0 # Default
)
Values:
| Value | Behavior | Use Case |
|---|---|---|
| -2.0 | More permissive | Noisy audio |
| -1.0 | Default | ✅ Recommended |
| 0.0 | Strict | Very clean audio |
Best Practice: Keep at default -1.0.
13. Compression Ratio Threshold
Detect repetition/hallucination
Filters out repetitive or hallucinated text.
result = model.transcribe(
audio,
compression_ratio_threshold=2.4 # Default
)
Values:
| Value | Behavior | Use Case |
|---|---|---|
| 2.0 | More strict | Very clean audio |
| 2.4 | Default | ✅ Recommended |
| 3.0 | More permissive | Noisy audio |
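To see why a compression ratio catches repetition, it helps to compute one. The sketch below uses zlib, which to my knowledge is also how openai-whisper measures it: raw UTF-8 length divided by compressed length. The two sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


normal = "We reviewed the Q4 roadmap and agreed to ship the new API in March."
looping = "Thank you. " * 40  # a typical hallucination loop

print(round(compression_ratio(normal), 2))
print(round(compression_ratio(looping), 2))
print(compression_ratio(looping) > 2.4)  # True: would trip the default threshold
```

Repeated text compresses extremely well, so its ratio climbs far above 2.4, while ordinary sentences stay well below it.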
Best Practice: Keep at default 2.4.
Optimal Settings by Use Case
Maximum Accuracy Configuration
Best for: Critical transcriptions, legal, medical, important meetings
import whisper
model = whisper.load_model("large") # or "medium"
result = model.transcribe(
audio,
language="en", # Specify if known
temperature=0.0, # Most deterministic
best_of=5, # Try multiple decodings
beam_size=5, # Beam search
patience=1.0, # Beam search patience
condition_on_previous_text=True, # Use context
word_timestamps=True, # Detailed output
initial_prompt="Context about your audio here...",
fp16=True # Half precision on GPU (Whisper falls back to FP32 on CPU)
)
Accuracy Gain: +20-30% vs defaults
Speed: 5-10x slower
Balanced Speed/Accuracy Configuration
Best for: General transcription, podcasts, most use cases
import whisper
model = whisper.load_model("medium") # or "small"
result = model.transcribe(
audio,
language="en", # Specify if known
temperature=0.0, # Deterministic
best_of=1, # Single decoding (faster)
beam_size=5, # Standard beam search
condition_on_previous_text=True, # Use context
initial_prompt="Brief context if helpful...",
fp16=True
)
Accuracy Gain: +10-15% vs defaults
Speed: 2-3x slower
Fast Configuration
Best for: Quick transcriptions, testing, real-time needs
import whisper
model = whisper.load_model("base") # or "small"
result = model.transcribe(
audio,
language="en", # Still specify for speed
temperature=0.0,
best_of=1, # Single decoding
beam_size=1, # Greedy decoding (fastest)
condition_on_previous_text=False, # Skip context
fp16=True
)
Accuracy: Similar to defaults
Speed: Fastest
Noisy Audio Configuration
Best for: Phone calls, poor quality recordings, background noise
import whisper
model = whisper.load_model("large") # Larger model for noise
result = model.transcribe(
audio,
language="en",
temperature=0.0,
best_of=5, # Multiple attempts help with noise
beam_size=5,
condition_on_previous_text=True,
initial_prompt="Phone call with background noise...",
no_speech_threshold=0.4, # More sensitive to speech
logprob_threshold=-1.5, # More permissive
fp16=True
)
Accuracy Gain: +15-25% for noisy audio
Speed: 5-10x slower
Multilingual Configuration
Best for: Multiple languages, code-switching
import whisper
model = whisper.load_model("medium")
# For transcription (keep original languages)
result = model.transcribe(
audio,
language=None, # Auto-detect
temperature=0.0,
best_of=3, # Helpful for language detection
condition_on_previous_text=True,
fp16=True
)
# For translation (to English)
result = model.transcribe(
audio,
task="translate", # Translate to English
temperature=0.0,
best_of=3,
fp16=True
)
Long-Form Audio Configuration
Best for: Podcasts, long meetings, interviews (1+ hours)
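import whisper
from whisper.audio import SAMPLE_RATE # 16000 samples per second
model = whisper.load_model("medium")
# Process in chunks with context
def transcribe_long_audio(audio_path, chunk_length=60):
    # Split audio into chunks of chunk_length seconds
    audio = whisper.load_audio(audio_path)
    chunk_samples = chunk_length * SAMPLE_RATE
    texts = []
    previous_context = None
    for start in range(0, len(audio), chunk_samples):
        chunk_audio = audio[start:start + chunk_samples]
        # Process each with context from previous
        result = model.transcribe(
            chunk_audio,
            temperature=0.0,
            best_of=1, # Use 1 for speed on long files
            beam_size=5,
            condition_on_previous_text=True, # Critical for long audio
            initial_prompt=previous_context, # Pass previous text
            fp16=True
        )
        texts.append(result["text"].strip())
        # Carry over the tail of this chunk (200 characters is an assumed length)
        previous_context = texts[-1][-200:] or None
    return " ".join(texts)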
Key: Use condition_on_previous_text=True and pass previous context via initial_prompt.
Complete Optimal Settings Template
For Most Use Cases (Recommended)
import whisper
def transcribe_optimal(audio_path, model_size="medium", language="en", context=None):
    """
    Optimal Whisper transcription settings for most use cases.
    Args:
        audio_path: Path to audio file
        model_size: Model size ("small", "medium", "large")
        language: Language code or None for auto-detect
        context: Optional context string for initial_prompt
    """
    # Load model
    model = whisper.load_model(model_size)
    # Prepare parameters
    transcribe_params = {
        "language": language,
        "temperature": 0.0, # Most deterministic
        "best_of": 5, # Try multiple decodings
        "beam_size": 5, # Beam search
        "patience": 1.0, # Beam search patience
        "condition_on_previous_text": True, # Use context
        "word_timestamps": True, # Detailed output
        "fp16": True, # Use GPU if available
    }
    # Add context if provided
    if context:
        transcribe_params["initial_prompt"] = context
    # Transcribe
    result = model.transcribe(audio_path, **transcribe_params)
    return result
# Usage
result = transcribe_optimal(
    "meeting.mp3",
    model_size="medium",
    language="en",
    context="Business meeting discussing Q4 product planning."
)
Settings Comparison Table
| Setting | Default | Optimal (Accuracy) | Optimal (Speed) |
|---|---|---|---|
| Model | base | large/medium | base/small |
| Temperature | (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) | 0.0 | 0.0 |
| Best Of | 1 | 5 | 1 |
| Beam Size | None in Python (greedy); 5 on the command line | 5 | 1 |
| Patience | 1.0 | 1.0 | 1.0 |
| Condition on Previous | True | True | False |
| Language | Auto | Specify | Specify |
| Initial Prompt | None | Provide | Optional |
| Word Timestamps | False | True | False |
Performance Impact Summary
| Setting Change | Accuracy Impact | Speed Impact |
|---|---|---|
| base → medium | +15-20% | -50% |
| base → large | +25-30% | -75% |
| best_of: 1 → 5 | +3-8% | -80% |
| temperature: 0.6 → 0.0 | +2-5% | No change |
| Specify language | +5-10% | +10% (faster) |
| Add initial_prompt | +5-15% | No change |
| condition_on_previous: True | +2-5% | -5% |
Combined optimal settings: +30-50% accuracy improvement vs defaults
Common Mistakes to Avoid
❌ Using Default Temperature
Problem: The default temperature is a fallback schedule that can rise as high as 1.0, which adds randomness on difficult segments.
Fix: Use temperature=0.0 when you need deterministic transcription.
❌ Not Specifying Language
Problem: Auto-detection can fail or be slow.
Fix: Always specify language when known.
❌ Using Tiny Model for Important Content
Problem: Tiny model has poor accuracy.
Fix: Use at least base, preferably small or medium.
❌ Disabling Context
Problem: condition_on_previous_text=False loses context.
Fix: Always enable for better accuracy.
❌ Not Using Initial Prompt
Problem: Missing domain-specific context.
Fix: Provide initial_prompt with relevant context.
Best Practices Summary
- ✅ Use the medium model for best balance (or large for maximum accuracy)
- ✅ Set temperature=0.0 for deterministic results
- ✅ Use best_of=5 for critical content (or 1 for speed)
- ✅ Set beam_size=5 for good accuracy
- ✅ Enable condition_on_previous_text=True for context
- ✅ Specify language when known (improves accuracy and speed)
- ✅ Provide initial_prompt with domain context
- ✅ Enable word_timestamps=True for detailed output
- ✅ Use fp16=True if GPU available (faster)
- ✅ Chunk long audio and use context between chunks
Quick Reference: Copy-Paste Configurations
Maximum Accuracy
model.transcribe(audio,
language="en", temperature=0.0, best_of=5, beam_size=5,
condition_on_previous_text=True, word_timestamps=True,
initial_prompt="Your context here")
Balanced
model.transcribe(audio,
language="en", temperature=0.0, best_of=1, beam_size=5,
condition_on_previous_text=True)
Fast
model.transcribe(audio,
language="en", temperature=0.0, beam_size=1,
condition_on_previous_text=False)
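If you switch between these configurations often, one option is to keep them in a dictionary and merge in per-file settings. This is a sketch only: PRESETS and get_settings are hypothetical names, and the values are copied from the three snippets above.

```python
PRESETS = {
    "accuracy": dict(temperature=0.0, best_of=5, beam_size=5,
                     condition_on_previous_text=True, word_timestamps=True),
    "balanced": dict(temperature=0.0, best_of=1, beam_size=5,
                     condition_on_previous_text=True),
    "fast": dict(temperature=0.0, beam_size=1,
                 condition_on_previous_text=False),
}


def get_settings(preset="balanced", language=None, initial_prompt=None, **overrides):
    """Return a fresh kwargs dict for model.transcribe()."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    settings = dict(PRESETS[preset])  # copy so the preset itself is not mutated
    if language:
        settings["language"] = language
    if initial_prompt:
        settings["initial_prompt"] = initial_prompt
    settings.update(overrides)  # explicit overrides win over preset values
    return settings


print(get_settings("fast", language="en"))
# Usage with a loaded model:
# result = model.transcribe("meeting.mp3", **get_settings("accuracy", language="en"))
```

Keeping the presets in one place makes it easier to compare runs and to change a recommendation without hunting through call sites.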
Conclusion
Optimizing Whisper settings can significantly improve transcription accuracy. The most impactful settings are:
- Model size (biggest impact)
- Language specification (easy win)
- Temperature (always use 0.0)
- Best of (for critical content)
- Initial prompt (for domain-specific content)
For most use cases, the balanced configuration provides excellent results. For critical transcriptions, use the maximum accuracy configuration.
Remember: Clear audio beats perfect settings. Even with optimal settings, poor audio quality will limit accuracy. Focus on good recording quality first, then optimize settings.
For production-ready transcription with optimized Whisper settings, platforms like SayToWords automatically apply best practices and handle configuration optimization for you.
