Whisper Best Settings: Complete Guide to Optimal Configuration

Eric King


Getting the best results from OpenAI Whisper requires understanding and configuring its various parameters correctly. While Whisper works well with default settings, optimizing these parameters can significantly improve accuracy, speed, and consistency.
This comprehensive guide covers all Whisper settings, explains what each parameter does, and provides optimal configurations for different use cases.

Understanding Whisper Parameters

Whisper's transcribe() function accepts many parameters that control transcription behavior. The call below lists the main ones with their library defaults:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    audio="audio.mp3",
    verbose=None,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Fallback schedule, starts at 0.0
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
    condition_on_previous_text=True,
    initial_prompt=None,
    word_timestamps=False,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,,!!??::”)]}、",
    # The remaining keyword arguments are forwarded to the decoder
    language=None,
    task="transcribe",
    beam_size=None,
    best_of=None,
    patience=None,
    length_penalty=None,
    suppress_tokens="-1",
    suppress_blank=True,
    without_timestamps=False,
    max_initial_timestamp=1.0,
    fp16=True,
)
# Note: temperature_increment_on_fallback is a command-line flag only.
# In Python, pass the full schedule as a tuple to temperature instead.
Let's break down each parameter and its optimal settings.

Core Parameters

1. Model Size (model)

Most Important Setting
The model size has the biggest impact on accuracy and speed.
# Available models (from smallest to largest):
model = whisper.load_model("tiny")    # Fastest, lowest accuracy
model = whisper.load_model("base")    # Balanced
model = whisper.load_model("small")   # Good accuracy
model = whisper.load_model("medium")  # High accuracy
model = whisper.load_model("large")   # Best accuracy, slowest
Model Selection Guide:
Model | Accuracy | Speed | VRAM | Best For
tiny | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~1GB | Quick testing
base | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~1GB | General use
small | ⭐⭐⭐⭐ | ⭐⭐⭐ | ~2GB | Good balance
medium | ⭐⭐⭐⭐⭐ | ⭐⭐ | ~5GB | Recommended for most
large | ⭐⭐⭐⭐⭐ | ⭐ | ~10GB | Maximum accuracy
Best Practice:
  • For most use cases: Use medium model
  • For speed-critical: Use small or base
  • For maximum accuracy: Use large
  • For testing: Use tiny or base
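
The selection guide can be turned into a small helper. The following is a minimal sketch, assuming the approximate VRAM figures from the table above; pick_model and headroom_gb are hypothetical names for illustration and are not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the model table above.
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_vram_gb: float, headroom_gb: float = 0.0) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    headroom_gb reserves memory for other processes (an illustrative knob).
    Falls back to "tiny" when nothing fits.
    """
    budget = available_vram_gb - headroom_gb
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= budget]
    return fitting[-1] if fitting else "tiny"


if __name__ == "__main__":
    print(pick_model(6))   # a 6 GB card
    print(pick_model(24))  # a 24 GB card
```

The returned name can be passed straight to whisper.load_model(). Treat the VRAM numbers as rough guidance; actual usage depends on audio length and decoding settings.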

2. Temperature

Controls randomness in decoding
Lower temperature = more deterministic, higher accuracy.
result = model.transcribe(
    audio,
    temperature=0.0  # Most deterministic, best for accuracy
)
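Temperature Values:
Value | Behavior | Use Case
0.0 | Deterministic (greedy or beam search) | Best for accuracy
0.2 | Slight randomness | First fallback step
0.6 | Noticeable randomness | Fallback for difficult segments
1.0+ | More creative | Not recommended for transcription
Note: The transcribe() default is not a single value but the tuple (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Decoding starts at 0.0 and moves to the next temperature only when a segment fails the compression-ratio or log-probability check.
Best Practice:
# For maximum accuracy and reproducibility
temperature=0.0

# Whisper also accepts a fallback schedule
temperature=(0.0, 0.2, 0.4, 0.6, 0.8)  # Tried in order until a result passes the quality checks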
Why Lower Temperature is Better:
  • Reduces random variations
  • More consistent results
  • Better for technical terms
  • Improves accuracy by 2-5%
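
When temperature is given as a tuple, the values are tried in order and the first result that passes the quality checks is kept. The sketch below illustrates that loop in simplified form. It is not Whisper's code: decode is a hypothetical callable standing in for the model, and the real implementation also takes the no-speech probability into account.

```python
from typing import Callable, Sequence, Tuple

# decode(temperature) -> (text, avg_logprob, compression_ratio)
DecodeFn = Callable[[float], Tuple[str, float, float]]


def decode_with_fallback(
    decode: DecodeFn,
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> Tuple[float, str]:
    """Try each temperature in order; stop at the first acceptable result.

    Returns the temperature that was used and the decoded text. If every
    attempt fails the checks, the last attempt is returned.
    """
    used, text = temperatures[0], ""
    for used in temperatures:
        text, avg_logprob, ratio = decode(used)
        too_repetitive = ratio > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # good enough, no need to raise the temperature
    return used, text


if __name__ == "__main__":
    # Fake decoder: repetitive output at 0.0, clean output from 0.2 upward.
    def fake_decode(t: float):
        if t < 0.2:
            return ("la la la la", -0.4, 3.1)
        return ("hello world", -0.5, 1.4)

    print(decode_with_fallback(fake_decode))
```

This is why a schedule that starts at 0.0 keeps most segments deterministic: higher temperatures are reached only for segments that fail a check.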

3. Best Of

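Number of sampled candidates
When sampling (temperature > 0), Whisper draws best_of candidates and keeps the highest-scoring one. At temperature=0.0 decoding is deterministic, so transcribe() drops best_of and relies on beam_size instead.
result = model.transcribe(
    audio,
    temperature=0.2,  # best_of needs a non-zero temperature
    best_of=5  # Sample 5 candidates, keep the best
)
Best Of Values:
Value | Accuracy Gain | Speed Impact | Recommendation
1 (effective default in transcribe()) | Baseline | Fastest | General use
3 | +2-4% | 3x slower | Good balance
5 | +3-8% | 5x slower | Best for accuracy
10 | +5-10% | 10x slower | Overkill for most
Best Practice:
# For critical transcriptions that rely on temperature fallback
best_of=5  # Good balance of accuracy and speed

# For general use
best_of=1  # Fastest, still accurate
Trade-off: best_of=5 improves accuracy by 3-8% but is 5x slower, and it only affects segments decoded with temperature > 0.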

4. Beam Size

Beam search width
Controls how many candidate sequences are explored during decoding.
result = model.transcribe(
    audio,
    beam_size=5  # Explore 5 candidate sequences
)
Beam Size Values:
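Value | Accuracy | Speed | Use Case
1 | Lowest | Fastest | Real-time (close to greedy)
5 | High | Medium | Recommended
10 | Very High | Slow | Maximum accuracy
Best Practice:
# Standard configuration
beam_size=5  # Good balance

# For maximum accuracy
beam_size=10  # Wider search, slower
Note: beam_size and best_of are alternatives rather than partners. Beam search applies at temperature=0.0, and best_of applies when sampling at temperature > 0. Passing both to transcribe() is safe because it keeps only the one that matches the current temperature. If beam_size is omitted, the Python API decodes greedily; the command-line tool defaults to 5.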
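
In the open-source implementation, beam_size and best_of are not applied together: beam search is used at temperature 0.0, and best_of is used only when sampling at a temperature above 0. The sketch below mirrors that selection logic as a standalone function. effective_decode_options is a hypothetical name for illustration, not a Whisper API.

```python
def effective_decode_options(temperature: float, **options) -> dict:
    """Return the decoding options that actually apply at this temperature.

    Mirrors, in simplified form, how transcribe() filters its keyword
    arguments before each decoding attempt.
    """
    opts = dict(options)
    if temperature > 0:
        # Sampling: beam search settings are dropped.
        opts.pop("beam_size", None)
        opts.pop("patience", None)
    else:
        # Deterministic decoding: best_of is dropped.
        opts.pop("best_of", None)
    opts["temperature"] = temperature
    return opts


if __name__ == "__main__":
    settings = {"beam_size": 5, "patience": 1.0, "best_of": 5}
    print(effective_decode_options(0.0, **settings))
    print(effective_decode_options(0.4, **settings))
```

In practice this means best_of has no effect in a configuration that fixes temperature=0.0, and beam_size has no effect on segments that fall back to a higher temperature.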

5. Patience

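Beam search patience
Controls how many finished candidates beam search collects before it stops. It only applies when beam_size is set.
result = model.transcribe(
    audio,
    beam_size=5,  # patience requires beam search
    patience=1.0  # Equivalent to standard beam search
)
Patience Values:
Value | Behavior | Use Case
0.5 | Stops earlier | Fast, less accurate
1.0 | Default | Recommended
2.0 | More patient | Slower, slightly better
Note: patience must be greater than 0. A value of 0.0 leaves beam search with no candidates to collect and is rejected.
Best Practice: Keep at default 1.0 unless you need maximum accuracy.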
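
To make the effect of patience concrete: in the reference implementation, beam search appears to stop once it has collected round(beam_size * patience) finished candidates. The helper below only reproduces that arithmetic; max_finished_candidates is a hypothetical name, and the formula is stated here as an assumption about the implementation rather than a documented guarantee.

```python
def max_finished_candidates(beam_size: int, patience: float = 1.0) -> int:
    """Number of finished sequences beam search waits for before stopping."""
    candidates = round(beam_size * patience)
    if candidates <= 0:
        raise ValueError(
            f"Invalid beam size ({beam_size}) or patience ({patience})"
        )
    return candidates


if __name__ == "__main__":
    print(max_finished_candidates(5, 1.0))  # standard beam search
    print(max_finished_candidates(5, 2.0))  # waits for twice as many
```

A larger value keeps the search running longer, which is where the extra decoding time comes from.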

6. Condition on Previous Text

Use context from previous segments
Improves accuracy by using context from earlier in the audio.
result = model.transcribe(
    audio,
    condition_on_previous_text=True  # Use context
)
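Settings:
Value | Behavior | Impact
True | Uses previous context | Better accuracy
False | Independent segments | Faster and less consistent, but less prone to repetition loops
Best Practice:
# Enable by default for better accuracy
condition_on_previous_text=True

# Disable if the output gets stuck repeating itself
condition_on_previous_text=False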
Why This Helps:
  • Maintains context across segments
  • Better handling of pronouns and references
  • Improves accuracy by 2-5%

7. Language

Specify language explicitly
Improves accuracy when language is known.
result = model.transcribe(
    audio,
    language="en"  # English
)
Language Codes:
language="en"  # English
language="zh"  # Chinese
language="es"  # Spanish
language="fr"  # French
language="de"  # German
language="ja"  # Japanese
# ... and 90+ more languages
Best Practice:
# If language is known, always specify
language="en"  # Improves accuracy by 5-10%

# For auto-detection
language=None  # Slower, may misdetect
Why This Helps:
  • Skips language detection step (faster)
  • Prevents misdetection errors
  • Better handling of accents
  • Improves accuracy by 5-10%

8. Initial Prompt

Provide context about content
Helps Whisper understand domain-specific terms and context.
result = model.transcribe(
    audio,
    initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
Best Practice:
# Include:
# - Topic/domain
# - Speaker names
# - Technical terms
# - Context

initial_prompt="""
This is a business meeting about Q4 product planning.
Participants: Sarah (Product Manager), John (Engineer).
Topics: API endpoints, microservices, Kubernetes.
"""
Why This Helps:
  • Better recognition of proper nouns
  • Improved technical terminology
  • Context-aware transcription
  • Improves accuracy by 5-15% for domain-specific content
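
A prompt like the one above can be assembled from structured inputs. This is a minimal sketch: build_initial_prompt and max_chars are hypothetical names, and the character budget is only a crude stand-in for Whisper's token-based prompt limit (the model uses a limited amount of prompt text, roughly a couple of hundred tokens).

```python
from typing import Sequence


def build_initial_prompt(
    topic: str,
    speakers: Sequence[str] = (),
    terms: Sequence[str] = (),
    max_chars: int = 800,
) -> str:
    """Combine topic, speaker names, and key terms into one prompt string."""
    parts = [f"This is {topic}."]
    if speakers:
        parts.append("Participants: " + ", ".join(speakers) + ".")
    if terms:
        parts.append("Topics: " + ", ".join(terms) + ".")
    prompt = " ".join(parts)
    # Crude length cap; an assumption, since the real limit is in tokens.
    return prompt[:max_chars]


if __name__ == "__main__":
    print(
        build_initial_prompt(
            "a business meeting about Q4 product planning",
            speakers=["Sarah (Product Manager)", "John (Engineer)"],
            terms=["API endpoints", "microservices", "Kubernetes"],
        )
    )
```

The resulting string is passed as initial_prompt. Putting the spellings you care about (names, product terms) directly in the prompt is the point; long background text adds little.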

9. Word Timestamps

Get word-level timestamps
Useful for subtitles, search, and detailed analysis.
result = model.transcribe(
    audio,
    word_timestamps=True  # Get word-level timestamps
)

# Access word timestamps
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
Settings:
Value | Output | Use Case
True | Word-level timestamps | Subtitles, search, analysis
False | Segment-level only | General transcription
Best Practice:
# Enable for subtitles, search, or detailed analysis
word_timestamps=True

# Disable for simple transcription (faster)
word_timestamps=False
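
For the subtitle use case, word timestamps can be grouped into short cues and written in SRT format. The sketch below works on a dictionary shaped like the result shown above (segments containing words with word, start, and end keys). The function names and the words_per_cue parameter are assumptions for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(result: dict, words_per_cue: int = 7) -> str:
    """Group word-level timestamps into numbered SRT cues."""
    words = [w for seg in result["segments"] for w in seg.get("words", [])]
    cues = []
    for i in range(0, len(words), words_per_cue):
        group = words[i:i + words_per_cue]
        # Whisper word strings usually carry their own leading space.
        text = "".join(w["word"] for w in group).strip()
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = {
        "segments": [
            {
                "words": [
                    {"word": " Hello", "start": 0.0, "end": 0.5},
                    {"word": " world", "start": 0.5, "end": 1.25},
                ]
            }
        ]
    }
    print(words_to_srt(sample))
```

Grouping by a fixed word count is the simplest option; splitting on punctuation or pauses between words usually reads better on screen.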

10. Task

Transcribe or translate
Controls whether to transcribe or translate to English.
# Transcribe (default)
result = model.transcribe(audio, task="transcribe")

# Translate to English
result = model.transcribe(audio, task="translate")
Settings:
Value | Behavior | Use Case
"transcribe" | Original language | Standard transcription
"translate" | English translation | Multilingual content
Best Practice:
# For transcription
task="transcribe"

# For translation to English
task="translate"  # Useful for multilingual meetings

Threshold Parameters

11. No Speech Threshold

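Detect silence/non-speech
Sets the no-speech probability above which a segment can be treated as silence. A segment is skipped only when its no-speech probability exceeds this value and its average log probability is also below logprob_threshold.
result = model.transcribe(
    audio,
    no_speech_threshold=0.6  # Default
)
Values:
Value | Behavior | Use Case
0.2 | Very sensitive (skips segments more readily) | Noisy audio
0.6 | Default | Recommended
0.8 | Less sensitive (rarely skips segments) | Clean audio
Best Practice: Keep at default 0.6 unless you have specific needs.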
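
The no-speech check does not act alone: a segment is treated as silent only when the no-speech probability is above no_speech_threshold and the decoded text is also low-confidence. The sketch below expresses that rule as a standalone function. is_silent is a hypothetical name, and the two probabilities would come from the model in a real run.

```python
from typing import Optional


def is_silent(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """Decide whether a segment should be skipped as non-speech."""
    if no_speech_threshold is None:
        return False  # the check is disabled
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # A confident decode overrides the silence signal.
        skip = False
    return skip


if __name__ == "__main__":
    print(is_silent(no_speech_prob=0.9, avg_logprob=-1.5))  # likely silence
    print(is_silent(no_speech_prob=0.9, avg_logprob=-0.3))  # confident speech
```

Lowering no_speech_threshold therefore makes skipping more likely, while a confident transcription still keeps the segment.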

12. Log Probability Threshold

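Filter low-confidence segments
Flags segments whose average token log probability falls below the threshold. A flagged segment is treated as a failed decode and retried at the next temperature in the fallback schedule.
result = model.transcribe(
    audio,
    logprob_threshold=-1.0  # Default
)
Values:
Value | Behavior | Use Case
-2.0 | More permissive | Noisy audio
-1.0 | Default | Recommended
0.0 | Strict (log probabilities are never above 0, so nearly every segment is flagged) | Very clean audio
Best Practice: Keep at default -1.0.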

13. Compression Ratio Threshold

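Detect repetition/hallucination
Flags segments whose text compresses too well, which is a sign of repetitive or hallucinated output. A flagged segment is treated as a failed decode and retried at the next temperature in the fallback schedule.
result = model.transcribe(
    audio,
    compression_ratio_threshold=2.4  # Default
)
Values:
Value | Behavior | Use Case
2.0 | More strict | Very clean audio
2.4 | Default | Recommended
3.0 | More permissive | Noisy audio
Best Practice: Keep at default 2.4.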
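
The compression ratio is simply the size of the text divided by its zlib-compressed size, so looping output scores far higher than ordinary sentences. The following sketch shows the calculation; looks_repetitive is a hypothetical convenience wrapper, not a Whisper function.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


if __name__ == "__main__":
    normal = "Whisper transcribes speech into text with timestamps."
    looping = "thank you for watching " * 40
    print(round(compression_ratio(normal), 2), looks_repetitive(normal))
    print(round(compression_ratio(looping), 2), looks_repetitive(looping))
```

Running the same check on finished transcripts is a cheap way to spot segments worth reviewing by hand.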

Optimal Settings by Use Case

Maximum Accuracy Configuration

Best for: Critical transcriptions, legal, medical, important meetings
import whisper

model = whisper.load_model("large")  # or "medium"

result = model.transcribe(
    audio,
    language="en",  # Specify if known
    temperature=0.0,  # Most deterministic
    best_of=5,  # Only used when sampling (temperature > 0); ignored at 0.0
    beam_size=5,  # Beam search
    patience=1.0,  # Beam search patience
    condition_on_previous_text=True,  # Use context
    word_timestamps=True,  # Detailed output
    initial_prompt="Context about your audio here...",
    fp16=True  # Half precision on GPU; Whisper falls back to FP32 on CPU
)
Accuracy Gain: +20-30% vs defaults
Speed: 5-10x slower

Balanced Speed/Accuracy Configuration

Best for: General transcription, podcasts, most use cases
import whisper

model = whisper.load_model("medium")  # or "small"

result = model.transcribe(
    audio,
    language="en",  # Specify if known
    temperature=0.0,  # Deterministic
    best_of=1,  # Single decoding (faster)
    beam_size=5,  # Standard beam search
    condition_on_previous_text=True,  # Use context
    initial_prompt="Brief context if helpful...",
    fp16=True
)
Accuracy Gain: +10-15% vs defaults
Speed: 2-3x slower

Fast Configuration

Best for: Quick transcriptions, testing, real-time needs
import whisper

model = whisper.load_model("base")  # or "small"

result = model.transcribe(
    audio,
    language="en",  # Still specify for speed
    temperature=0.0,
    best_of=1,  # Single decoding
    beam_size=1,  # Greedy decoding (fastest)
    condition_on_previous_text=False,  # Skip context
    fp16=True
)
Accuracy: Similar to defaults
Speed: Fastest

Noisy Audio Configuration

Best for: Phone calls, poor quality recordings, background noise
import whisper

model = whisper.load_model("large")  # Larger model for noise

result = model.transcribe(
    audio,
    language="en",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Retry at higher temperatures when a segment fails its checks
    best_of=5,  # Candidates sampled on each retry
    beam_size=5,
    condition_on_previous_text=True,
    initial_prompt="Phone call with background noise...",
    no_speech_threshold=0.4,  # Skip likely non-speech segments more readily
    logprob_threshold=-1.5,  # More permissive
    fp16=True
)
Accuracy Gain: +15-25% for noisy audio
Speed: 5-10x slower

Multilingual Configuration

Best for: Multiple languages, code-switching
import whisper

model = whisper.load_model("medium")

# For transcription (keep original languages)
result = model.transcribe(
    audio,
    language=None,  # Auto-detect
    temperature=0.0,
    best_of=3,  # Only used when sampling (temperature > 0)
    condition_on_previous_text=True,
    fp16=True
)

# For translation (to English)
result = model.transcribe(
    audio,
    task="translate",  # Translate to English
    temperature=0.0,
    best_of=3,
    fp16=True
)

Long-Form Audio Configuration

Best for: Podcasts, long meetings, interviews (1+ hours)
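import whisper

model = whisper.load_model("medium")

# Process in fixed-length chunks, passing the previous text as context
def transcribe_long_audio(audio_path, chunk_length=60):
    audio = whisper.load_audio(audio_path)  # 16 kHz mono float32 array
    chunk_samples = chunk_length * 16000
    texts, previous_context = [], None
    # Fixed-length chunks can cut a word at the boundary
    for start in range(0, len(audio), chunk_samples):
        chunk_audio = audio[start:start + chunk_samples]
        result = model.transcribe(
            chunk_audio,
            temperature=0.0,
            best_of=1,  # Use 1 for speed on long files
            beam_size=5,
            condition_on_previous_text=True,  # Critical for long audio
            initial_prompt=previous_context,  # Pass previous text
            fp16=True
        )
        texts.append(result["text"].strip())
        previous_context = texts[-1][-200:]  # Keep the prompt short
    return " ".join(texts)
Key: Use condition_on_previous_text=True and pass previous context via initial_prompt. Note that transcribe() already splits long files into 30-second windows internally, so manual chunking mainly helps to bound memory use or to process audio as it arrives.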

Complete Optimal Settings Template

import whisper

def transcribe_optimal(audio_path, model_size="medium", language="en", context=None):
    """
    Optimal Whisper transcription settings for most use cases.
    
    Args:
        audio_path: Path to audio file
        model_size: Model size ("small", "medium", "large")
        language: Language code or None for auto-detect
        context: Optional context string for initial_prompt
    """
    # Load model
    model = whisper.load_model(model_size)
    
    # Prepare parameters
    transcribe_params = {
        "language": language,
        "temperature": 0.0,  # Most deterministic
        "best_of": 5,  # Only used when sampling (temperature > 0)
        "beam_size": 5,  # Beam search
        "patience": 1.0,  # Beam search patience
        "condition_on_previous_text": True,  # Use context
        "word_timestamps": True,  # Detailed output
        "fp16": True,  # Half precision on GPU; falls back to FP32 on CPU
    }
    
    # Add context if provided
    if context:
        transcribe_params["initial_prompt"] = context
    
    # Transcribe
    result = model.transcribe(audio_path, **transcribe_params)
    
    return result

# Usage
result = transcribe_optimal(
    "meeting.mp3",
    model_size="medium",
    language="en",
    context="Business meeting discussing Q4 product planning."
)

Settings Comparison Table

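Setting | Default | Optimal (Accuracy) | Optimal (Speed)
Model | base | large/medium | base/small
Temperature | 0.0 with fallback up to 1.0 | 0.0 | 0.0
Best Of | 1 | 5 | 1
Beam Size | None (greedy); 5 in the CLI | 5 | 1
Patience | 1.0 | 1.0 | 1.0
Condition on Previous | True | True | False
Language | Auto | Specify | Specify
Initial Prompt | None | Provide | Optional
Word Timestamps | False | True | False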

Performance Impact Summary

Setting Change | Accuracy Impact | Speed Impact
base → medium | +15-20% | -50%
base → large | +25-30% | -75%
best_of: 1 → 5 (sampling only) | +3-8% | -80%
temperature: fixed 0.6 → 0.0 | +2-5% | No change
Specify language | +5-10% | +10% (faster)
Add initial_prompt | +5-15% | No change
condition_on_previous: True | +2-5% | -5%
Combined optimal settings: +30-50% accuracy improvement vs defaults
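
The percentages above will vary with the audio, so it is worth measuring on your own recordings. One common way is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into a reference, divided by the reference length. WER is not used elsewhere in this guide, and the function below is a minimal sketch with a hypothetical name, without the text normalization a careful evaluation would add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox"
    print(word_error_rate(reference, "the quick brown fox"))
    print(word_error_rate(reference, "the quick brown box"))
```

Transcribe the same file with two configurations, compare each result["text"] against a hand-checked reference, and keep the configuration with the lower WER.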

Common Mistakes to Avoid

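❌ Using a High Fixed Temperature

Problem: A fixed temperature such as 0.6 adds unnecessary randomness.
Fix: Use temperature=0.0, or keep the default schedule, which starts at 0.0 and raises the temperature only when a segment fails its quality checks.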

❌ Not Specifying Language

Problem: Auto-detection can fail or be slow.
Fix: Always specify language when known.

❌ Using Tiny Model for Important Content

Problem: Tiny model has poor accuracy.
Fix: Use at least base, preferably small or medium.

❌ Disabling Context

Problem: condition_on_previous_text=False loses context.
Fix: Always enable for better accuracy.

❌ Not Using Initial Prompt

Problem: Missing domain-specific context.
Fix: Provide initial_prompt with relevant context.

Best Practices Summary

  1. Use medium model for best balance (or large for maximum accuracy)
  2. Set temperature=0.0 for deterministic results
  3. Use best_of=5 for critical content when sampling with temperature > 0 (or 1 for speed)
  4. Set beam_size=5 for good accuracy
  5. Enable condition_on_previous_text=True for context
  6. Specify language when known (improves accuracy and speed)
  7. Provide initial_prompt with domain context
  8. Enable word_timestamps=True for detailed output
  9. Use fp16=True if GPU available (faster)
  10. Chunk long audio and use context between chunks

Quick Reference: Copy-Paste Configurations

Maximum Accuracy

model.transcribe(audio, 
    language="en", temperature=0.0, best_of=5, beam_size=5,
    condition_on_previous_text=True, word_timestamps=True,
    initial_prompt="Your context here")

Balanced

model.transcribe(audio,
    language="en", temperature=0.0, best_of=1, beam_size=5,
    condition_on_previous_text=True)

Fast

model.transcribe(audio,
    language="en", temperature=0.0, beam_size=1,
    condition_on_previous_text=False)

Conclusion

Optimizing Whisper settings can significantly improve transcription accuracy. The most impactful settings are:
  1. Model size (biggest impact)
  2. Language specification (easy win)
  3. Temperature (always use 0.0)
  4. Best of (for critical content)
  5. Initial prompt (for domain-specific content)
For most use cases, the balanced configuration provides excellent results. For critical transcriptions, use the maximum accuracy configuration.
Remember: Clear audio beats perfect settings. Even with optimal settings, poor audio quality will limit accuracy. Focus on good recording quality first, then optimize settings.

For production-ready transcription with optimized Whisper settings, platforms like SayToWords automatically apply best practices and handle configuration optimization for you.
