
Whisper Audio Requirements: Complete Guide to Supported Formats and Specifications

Eric King

Author


Understanding Whisper's audio requirements is crucial for achieving the best transcription accuracy. While Whisper is flexible and can handle many audio formats, following optimal specifications ensures maximum performance.
This comprehensive guide covers all audio requirements, supported formats, technical specifications, and best practices for preparing audio files for Whisper transcription.

Supported Audio Formats

Whisper decodes audio through FFmpeg, so it accepts a wide range of audio and video formats. The most common ones are listed below:

Audio Formats

Format | Extension | Notes
WAV | .wav | ✅ Preferred, lossless
MP3 | .mp3 | ✅ Most common, widely used
FLAC | .flac | ✅ Lossless, good compression
M4A | .m4a | ✅ Apple format, AAC codec
AAC | .aac | ✅ High quality compression
OGG | .ogg | ✅ Open source format
OPUS | .opus | ✅ Low latency, web-friendly
WMA | .wma | ⚠️ Less common
AMR | .amr | ⚠️ Low quality, phone recordings

Video Formats (Audio Extraction)

Format | Extension | Notes
MP4 | .mp4 | ✅ Most common video format
AVI | .avi | ✅ Older format, still supported
MKV | .mkv | ✅ Container format
MOV | .mov | ✅ QuickTime format
WebM | .webm | ✅ Web video format
FLV | .flv | ⚠️ Legacy Flash format
Important: Whisper automatically extracts audio from video files, so you can upload video files directly.
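
If you would rather extract the audio track yourself before transcription, the sketch below shows one way to do it. It assumes the `ffmpeg` command-line tool is installed and on your PATH. The function names `build_extract_command` and `extract_audio` are illustrative and not part of Whisper.

```python
import subprocess

def build_extract_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that writes the audio track as mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video (or audio) file
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz by default)
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM
        wav_path,
    ]

def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises subprocess.CalledProcessError if extraction fails."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
```

Calling `extract_audio("meeting.mp4", "meeting.wav")` (hypothetical file names) would produce a WAV file that already matches the mono 16 kHz layout Whisper uses internally.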

Sample Rate Requirements

Optimal Sample Rate: 16 kHz

Whisper internally resamples all audio to 16 kHz mono before processing. This is the optimal sample rate for speech recognition.

Supported Sample Rates

Whisper accepts audio at any sample rate, but here's what you should know:
Input Sample Rate | Whisper Processing | Recommendation
8 kHz | Resampled to 16 kHz | ✅ Phone calls, acceptable
16 kHz | Used directly | ✅ Optimal, no resampling
22.05 kHz | Resampled to 16 kHz | ✅ Good quality
44.1 kHz | Resampled to 16 kHz | ✅ CD quality, fine
48 kHz | Resampled to 16 kHz | ✅ Professional audio, fine
96 kHz | Resampled to 16 kHz | ⚠️ Unnecessary, larger files
Key Insight: Higher sample rates don't improve Whisper's accuracy. The model was trained on 16 kHz audio, so providing 16 kHz input avoids unnecessary resampling and keeps files small.

Best Practice

# Convert audio to 16 kHz before processing (optional optimization)
import ffmpeg

def convert_to_16khz(input_file, output_file):
    stream = ffmpeg.input(input_file)
    stream = ffmpeg.output(
        stream,
        output_file,
        acodec='pcm_s16le',
        ac=1,  # Mono
        ar=16000  # 16 kHz
    )
    ffmpeg.run(stream, overwrite_output=True)

Bit Depth Requirements

Supported Bit Depths

Bit Depth | Status | Notes
8-bit | ✅ Supported | Low quality, not recommended
16-bit | ✅ Recommended | Standard, sufficient
24-bit | ✅ Supported | Professional, larger files
32-bit float | ✅ Supported | Studio quality, overkill
Recommended: 16-bit PCM is the standard and provides excellent quality for speech recognition. Higher bit depths don't improve transcription accuracy.
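
To see how sample rate, bit depth, and channel count drive file size, here is a small worked calculation for uncompressed PCM audio. The helper name `pcm_size_bytes` is illustrative, and the result ignores the small WAV header.

```python
def pcm_size_bytes(duration_s, sample_rate=16000, bit_depth=16, channels=1):
    """Size of raw PCM audio data: seconds x samples/s x bytes/sample x channels."""
    return int(duration_s * sample_rate * (bit_depth // 8) * channels)

one_hour = 3600
print(pcm_size_bytes(one_hour) / 1e6)                # 115.2 (MB at 16 kHz, 16-bit, mono)
print(pcm_size_bytes(one_hour, 48000, 24, 2) / 1e6)  # 1036.8 (MB at 48 kHz, 24-bit, stereo)
```

One hour of speech is roughly nine times larger at 48 kHz, 24-bit stereo than at 16 kHz, 16-bit mono, with no benefit for transcription.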

Channel Configuration: Mono vs Stereo

Whisper processes audio as mono internally, so mono input is optimal.
Advantages:
  • Smaller file size
  • Faster processing
  • No channel mixing needed
  • Optimal for single-speaker content
Use mono for:
  • Single speaker recordings
  • Phone calls
  • Podcasts with one host
  • Most transcription tasks

Stereo (Supported)

Stereo files are automatically downmixed to mono when Whisper loads them through FFmpeg, which mixes both channels into one.
When stereo is useful:
  • Separate speakers on different channels (rare)
  • Original recording is stereo (conversion is automatic)
Best practice: Convert stereo to mono before processing if you have control:
import ffmpeg

# Convert stereo to mono
stream = ffmpeg.input('stereo_audio.wav')
stream = ffmpeg.output(
    stream,
    'mono_audio.wav',
    ac=1  # Mono channel
)
ffmpeg.run(stream, overwrite_output=True)

File Size Limits

Practical Limits

Whisper doesn't have a hard file size limit, but practical considerations apply:
File Size | Processing Time | Recommendation
< 10 MB | Seconds | ✅ Ideal
10-100 MB | Minutes | ✅ Good
100-500 MB | 10-30 minutes | ⚠️ Consider chunking
> 500 MB | 30+ minutes | ⚠️ Must chunk

Memory Considerations

VRAM use is driven mainly by model size, while the decoded audio of a long file is held in system RAM. Approximate VRAM per model:
  • Base model: ~1-2 GB VRAM
  • Small model: ~2-3 GB VRAM
  • Medium model: ~5-6 GB VRAM
  • Large model: ~10-12 GB VRAM
Best practice: For files > 100 MB, split into chunks (see below).

Duration Limits

Duration | Status | Notes
< 30 minutes | ✅ Optimal | Process directly
30-60 minutes | ✅ Good | May need chunking
1-2 hours | ⚠️ Chunk recommended | Better accuracy with chunks
> 2 hours | ⚠️ Must chunk | Required for stability
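
The size and duration thresholds from the two tables above can be folded into one decision helper. This is only a sketch of the guidance in this article: the function name `chunking_advice` and the returned labels are illustrative, and the thresholds are rules of thumb rather than limits enforced by Whisper.

```python
def chunking_advice(duration_min, size_mb):
    """Map duration (minutes) and file size (MB) to the article's chunking guidance."""
    if duration_min > 120 or size_mb > 500:
        return "must chunk"
    if duration_min > 60 or size_mb > 100:
        return "chunk recommended"
    if duration_min >= 30:
        return "process directly, chunk if memory is tight"
    return "process directly"

print(chunking_advice(duration_min=45, size_mb=40))
print(chunking_advice(duration_min=150, size_mb=90))
```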

Why Chunk Long Audio?

  1. Memory limits: Prevents out-of-memory errors
  2. Better accuracy: Errors such as repeated or hallucinated text are less likely to carry over across a long recording
  3. Faster processing: Parallel processing possible
  4. Error recovery: If one chunk fails, others succeed
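Chunking strategy:
import whisper

# Split long audio into 30-60 second chunks with 5-10 second overlap
def chunk_audio(audio_path, chunk_length=60, overlap=5):
    sr = whisper.audio.SAMPLE_RATE  # Whisper works at 16 kHz
    audio = whisper.load_audio(audio_path)  # mono float32 samples at 16 kHz
    size, step = chunk_length * sr, (chunk_length - overlap) * sr
    for start in range(0, len(audio), step):
        # Yield (offset in seconds, samples); transcribe each chunk separately
        # and merge the results by adding the offset to each timestamp
        yield start / sr, audio[start:start + size]
        if start + size >= len(audio):
            break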

Audio Quality Requirements

Minimum Quality Standards

For acceptable accuracy, your audio should meet these criteria:
Factor | Minimum | Optimal
Signal-to-Noise Ratio | > 10 dB | > 20 dB
Bitrate (MP3) | ≥ 64 kbps | ≥ 128 kbps
Volume level | Audible | Normalized to -3 dB
Background noise | Minimal | None
Echo/reverb | Minimal | None

Quality Checklist

Before transcribing, ensure:
  • ✅ Clear speech: Speakers are audible and understandable
  • ✅ Minimal noise: Background sounds don't overpower speech
  • ✅ Consistent volume: No sudden volume changes
  • ✅ No clipping: Audio isn't distorted or saturated
  • ✅ Good microphone: Quality recording equipment used
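
Two items on this checklist, volume level and clipping, can be checked numerically. The sketch below assumes mono samples already decoded to floats in the range [-1.0, 1.0]. The function name `level_report` and the `clip_level` threshold are illustrative choices, and the code uses plain Python rather than NumPy to stay dependency-free.

```python
import math

def level_report(samples, clip_level=0.999):
    """Summarize peak level, RMS level, and clipping for mono float samples."""
    if len(samples) == 0:
        raise ValueError("no samples to analyze")

    def to_dbfs(value):
        # 0 dBFS corresponds to full scale (1.0); digital silence maps to -inf
        return 20 * math.log10(value) if value > 0 else float("-inf")

    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    clipped = sum(1 for x in samples if abs(x) >= clip_level)
    return {
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        "clipped_fraction": clipped / len(samples),
    }
```

A peak close to 0 dBFS combined with a non-zero `clipped_fraction` suggests the recording was saturated. A very low peak suggests the audio would benefit from normalization.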

Codec Requirements

Codec | Format | Quality | Recommendation
PCM | WAV | Lossless | ✅ Best for accuracy
FLAC | FLAC | Lossless | ✅ Excellent, compressed
AAC | M4A, MP4 | High quality | ✅ Very good
MP3 | MP3 | Lossy | ✅ Good at ≥ 128 kbps
OGG Vorbis | OGG | Lossy | ✅ Good quality
OPUS | OPUS | Lossy | ✅ Good, low latency

Codec Best Practices

For maximum accuracy:
  • Use PCM (WAV) or FLAC (lossless)
For practical use:
  • Use AAC or MP3 at ≥ 128 kbps (excellent results)
Avoid:
  • Very low bitrate MP3 (< 64 kbps)
  • Highly compressed formats
  • Phone codecs (AMR, G.711) unless necessary
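
If you are unsure whether a compressed file meets these bitrate thresholds, a rough estimate can be derived from its size and duration. The helper name `average_bitrate_kbps` is illustrative. The figure is approximate because the file size also includes metadata and container overhead.

```python
def average_bitrate_kbps(file_size_bytes, duration_s):
    """Approximate average bitrate: total bits divided by duration, in kbps."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return file_size_bytes * 8 / duration_s / 1000

# A 960,000-byte file lasting 60 seconds averages 128 kbps
print(average_bitrate_kbps(960_000, 60))  # 128.0
```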

Audio Preprocessing Recommendations

Before Transcribing

While Whisper handles many issues automatically, preprocessing can improve results:

1. Normalize Volume

import numpy as np
from scipy.io import wavfile

def normalize_audio(audio_path, output_path, target_dB=-3.0):
    sr, audio = wavfile.read(audio_path)
    audio = audio.astype(np.float32)

    # Scale so the loudest sample sits at the target peak level
    max_val = np.max(np.abs(audio))
    if max_val == 0:
        raise ValueError("Audio is silent; nothing to normalize")
    target_linear = 10 ** (target_dB / 20)
    audio = audio * (target_linear / max_val)

    # Clip to prevent overflow when converting back to 16-bit integers
    audio = np.clip(audio, -1.0, 1.0)

    wavfile.write(output_path, sr, (audio * 32767).astype(np.int16))

2. Remove Silence

# Remove leading/trailing silence
# Helps with processing time and accuracy
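
One simple way to trim leading and trailing silence is an energy threshold over short frames. The sketch below is an assumption-laden illustration and not a Whisper feature: it expects mono samples as a list of floats in [-1.0, 1.0], and the name `trim_silence`, the -40 dB threshold, and the 20 ms frame size are all illustrative defaults. For long recordings a NumPy-based version would be faster.

```python
import math

def trim_silence(samples, sample_rate, threshold_db=-40.0, frame_ms=20):
    """Drop leading and trailing frames whose RMS level is below threshold_db."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10 ** (threshold_db / 20)  # convert dB to linear amplitude

    def is_loud(start):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        return rms >= threshold

    loud_starts = [s for s in range(0, len(samples), frame_len) if is_loud(s)]
    if not loud_starts:
        return samples[:0]  # everything is below the threshold
    # Keep everything from the first loud frame to the end of the last one
    return samples[loud_starts[0]:loud_starts[-1] + frame_len]
```

Silence in the middle of the recording is left untouched, so timestamps stay continuous apart from a constant shift equal to the trimmed lead-in.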

3. Noise Reduction (Optional)

For noisy recordings:
# Use noise reduction libraries
# librosa, noisereduce, or specialized tools
# Only if background noise is significant

4. Resample to 16 kHz (Optional)

If you want to optimize file size:
import ffmpeg

stream = ffmpeg.input('input.wav')
stream = ffmpeg.output(
    stream,
    'output_16k.wav',
    ar=16000  # Resample to 16 kHz
)
ffmpeg.run(stream, overwrite_output=True)

Common Audio Issues and Solutions

Issue 1: Very Low Sample Rate (8 kHz)

Problem: Phone recordings at 8 kHz may have reduced accuracy.
Solution:
  • Use Whisper medium or large model (better at low sample rates)
  • Upsample to 16 kHz if your pipeline needs a uniform format (Whisper also does this automatically, and it doesn't restore lost detail)

Issue 2: Stereo with Different Speakers

Problem: Two speakers on separate channels.
Solution:
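# Extract each channel separately
import torchaudio
import whisper

model = whisper.load_model("base")
audio, sr = torchaudio.load('stereo.wav')  # shape: (channels, samples)
# Whisper expects 16 kHz samples when given raw audio instead of a file path
audio = torchaudio.functional.resample(audio, sr, 16000)
speaker1 = audio[0]  # Left channel
speaker2 = audio[1]  # Right channel

# Transcribe each separately
result1 = model.transcribe(speaker1)
result2 = model.transcribe(speaker2)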

Issue 3: Variable Bitrate MP3

Problem: VBR MP3 may cause issues with some tools.
Solution:
  • Convert to constant bitrate (CBR) or WAV
  • Whisper handles VBR fine, but CBR is more predictable

Issue 4: Corrupted Audio Files

Problem: File plays but Whisper fails.
Solution:
# Re-encode the file
import ffmpeg

stream = ffmpeg.input('corrupted.mp3')
stream = ffmpeg.output(
    stream,
    'fixed.wav',
    acodec='pcm_s16le'
)
ffmpeg.run(stream, overwrite_output=True)

Issue 5: Very Long Audio Files

Problem: Out of memory or very slow processing.
Solution:
  • Split into 30-60 second chunks
  • Process chunks sequentially or in parallel
  • Merge results with timestamps
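
The final step, merging per-chunk results with timestamps, can be sketched as follows. This helper is illustrative and not part of Whisper: the name `merge_chunk_segments` and the `tolerance` parameter are assumptions. It expects each chunk's start offset in seconds together with a list of segments carrying `start`, `end`, and `text` keys, which is the shape of the entries in Whisper's `result["segments"]`.

```python
def merge_chunk_segments(chunks, tolerance=0.2):
    """Shift chunk-relative timestamps to absolute time and skip overlap duplicates.

    chunks: iterable of (offset_seconds, segments) pairs.
    """
    merged = []
    last_end = 0.0
    for offset, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg in segments:
            start = offset + seg["start"]
            end = offset + seg["end"]
            # A segment starting before the previously kept one ended lies in
            # the overlap region and was already covered by the earlier chunk
            if start < last_end - tolerance:
                continue
            merged.append({"start": start, "end": end, "text": seg["text"].strip()})
            last_end = end
    return merged
```

This keeps the earlier chunk's version of any overlapping speech. A production pipeline might instead compare the overlapping text and keep the more complete version.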

Format-Specific Recommendations

For Phone Calls

Parameter | Value | Reason
Sample rate | 8-16 kHz | Phone quality
Format | WAV or MP3 | Standard
Bitrate | ≥ 64 kbps | Phone codec quality
Channels | Mono | Standard for calls

For Meetings (Zoom, Teams)

Parameter | Value | Reason
Sample rate | 16-48 kHz | High quality
Format | MP4 (extract audio) | Video format
Bitrate | ≥ 128 kbps | Good quality
Channels | Mono or Stereo | Depends on setup

For Podcasts

Parameter | Value | Reason
Sample rate | 44.1-48 kHz | Professional quality
Format | MP3, WAV, or M4A | Common formats
Bitrate | ≥ 128 kbps | Good quality
Channels | Mono | Standard for podcasts

For Interviews

Parameter | Value | Reason
Sample rate | 16-48 kHz | High quality
Format | WAV or FLAC | Maximum accuracy
Bitrate | Lossless or ≥ 192 kbps | Professional
Channels | Mono | Standard

Whisper Audio Requirements Summary

Minimum Requirements

  • ✅ Format: Any FFmpeg-supported format
  • ✅ Sample rate: Any (8 kHz minimum recommended)
  • ✅ Bit depth: 8-bit or higher
  • ✅ Channels: Mono or stereo (mono preferred)
  • ✅ File size: No hard limit (chunk if > 100 MB)
  • ✅ Duration: No hard limit (chunk if > 1 hour)

Optimal Requirements

  • ✅ Format: WAV, FLAC, or MP3 (≥ 128 kbps)
  • ✅ Sample rate: 16 kHz (optimal, no resampling)
  • ✅ Bit depth: 16-bit PCM
  • ✅ Channels: Mono
  • ✅ Quality: Clear speech, minimal noise
  • ✅ Preprocessing: Normalized volume, no clipping

Quick Reference: Audio Preparation Checklist

Before transcribing with Whisper:
  • Format: WAV, MP3, FLAC, M4A, or other supported format
  • Sample rate: 16 kHz (optimal) or any supported rate
  • Bit depth: 16-bit (recommended)
  • Channels: Mono (preferred) or stereo
  • File size: < 100 MB (or plan to chunk)
  • Duration: < 1 hour (or plan to chunk)
  • Quality: Clear speech, minimal background noise
  • Volume: Normalized, no clipping
  • Codec: Lossless (WAV/FLAC) or high-quality lossy (MP3 ≥ 128 kbps)
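
For WAV files, several items on this checklist can be verified automatically from the file header. The sketch below uses Python's standard `wave` module, which reads PCM WAV files only, so other formats would need FFmpeg or a similar tool. The function name `check_wav` and the wording of the notes are illustrative.

```python
import wave

def check_wav(path):
    """Read a PCM WAV header and note deviations from the recommended specs."""
    with wave.open(path, "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    notes = []
    if info["sample_rate"] != 16000:
        notes.append("sample rate is not 16 kHz (Whisper will resample)")
    if info["bit_depth"] != 16:
        notes.append("bit depth is not 16-bit")
    if info["channels"] != 1:
        notes.append("not mono (Whisper will downmix)")
    if info["duration_s"] > 3600:
        notes.append("longer than 1 hour (plan to chunk)")
    return info, notes
```

An empty `notes` list means the file already matches the optimal layout. Any notes are advisory, since Whisper accepts the file either way.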

Testing Your Audio

Quick Test

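import whisper

# Load model
model = whisper.load_model("base")

try:
    # Decoding problems (missing file, unreadable format) raise an exception
    result = model.transcribe("your_audio.wav")
except Exception as exc:
    print(f"⚠️ Transcription failed - check audio format: {exc}")
else:
    # An empty transcript means the file decoded but no speech was recognized
    if result["text"].strip():
        print("✅ Audio format is compatible")
        print(f"Detected language: {result['language']}")
    else:
        print("⚠️ No speech detected - check audio content and volume")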

Common Error Messages

Error | Cause | Solution
"File not found" | Wrong path | Check file path
"Unsupported format" | Format not supported | Convert to WAV/MP3
"Out of memory" | File too large | Chunk the audio
"Empty audio" | Corrupted file | Re-encode the file

Best Practices Summary

  1. Use 16 kHz sample rate when possible (optimal for Whisper)
  2. Prefer mono over stereo (Whisper processes mono internally)
  3. Use lossless formats (WAV/FLAC) for maximum accuracy
  4. Chunk long files (> 1 hour) for better accuracy and stability
  5. Normalize audio to consistent volume levels
  6. Minimize background noise for best results
  7. Use appropriate model size (larger models handle poor audio better)
  8. Test with base model first before using larger models

Conclusion

Whisper is highly flexible and can handle a wide variety of audio formats and qualities. However, following optimal specifications ensures the best transcription accuracy:
  • Format: WAV, FLAC, or MP3 (≥ 128 kbps)
  • Sample rate: 16 kHz (optimal)
  • Bit depth: 16-bit PCM
  • Channels: Mono
  • Quality: Clear speech with minimal noise
Remember: Clear audio beats perfect format specifications. Even with optimal technical specs, poor recording quality will reduce accuracy. Focus on clear speech, minimal noise, and good microphone placement for the best results.
For production use, platforms like SayToWords automatically handle format conversion, resampling, and optimization, so you can focus on getting clear audio rather than technical specifications.

Need help preparing your audio for Whisper transcription? Check out our other guides on audio preprocessing, chunking strategies, and accuracy optimization.
