
Whisper Audio Requirements: Complete Guide to Supported Formats and Specifications
Eric King
Author
Understanding Whisper's audio requirements is crucial for achieving the best transcription accuracy. While Whisper is flexible and can handle many audio formats, following optimal specifications ensures maximum performance.
This comprehensive guide covers all audio requirements, supported formats, technical specifications, and best practices for preparing audio files for Whisper transcription.
Supported Audio Formats
Whisper supports a wide range of audio and video formats through FFmpeg. The most common ones are listed below:
Audio Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | ✅ Preferred, lossless |
| MP3 | .mp3 | ✅ Most common, widely used |
| FLAC | .flac | ✅ Lossless, good compression |
| M4A | .m4a | ✅ Apple format, AAC codec |
| AAC | .aac | ✅ High quality compression |
| OGG | .ogg | ✅ Open source format |
| OPUS | .opus | ✅ Low latency, web-friendly |
| WMA | .wma | ⚠️ Less common |
| AMR | .amr | ⚠️ Low quality, phone recordings |
Video Formats (Audio Extraction)
| Format | Extension | Notes |
|---|---|---|
| MP4 | .mp4 | ✅ Most common video format |
| AVI | .avi | ✅ Older format, still supported |
| MKV | .mkv | ✅ Container format |
| MOV | .mov | ✅ QuickTime format |
| WebM | .webm | ✅ Web video format |
| FLV | .flv | ⚠️ Legacy Flash format |
Important: Whisper relies on FFmpeg to decode the audio stream from video files, so you can pass video files to it directly without extracting the audio first.
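As a rough sketch of what that extraction amounts to, the helper below builds an FFmpeg command that drops the video stream and writes 16 kHz mono PCM. The function name `build_extract_command` and the file names are hypothetical; the flags (`-vn`, `-ac`, `-ar`, `-c:a`) are standard FFmpeg options.

```python
def build_extract_command(video_path, wav_path):
    """Build an FFmpeg command that extracts 16 kHz mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input video (or audio) file
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # encode as 16-bit PCM
        wav_path,
    ]


cmd = build_extract_command("meeting.mp4", "meeting.wav")
print(" ".join(cmd))
# To run it (requires ffmpeg on PATH and an existing input file):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Extracting the audio yourself is optional. It is mainly useful when you want to keep a small audio-only copy instead of the full video.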
Sample Rate Requirements
Optimal Sample Rate: 16 kHz
Whisper internally resamples all audio to 16 kHz mono before processing. This is the optimal sample rate for speech recognition.
Supported Sample Rates
Whisper accepts audio at any sample rate, but here's what you should know:
| Input Sample Rate | Whisper Processing | Recommendation |
|---|---|---|
| 8 kHz | Resampled to 16 kHz | ✅ Phone calls, acceptable |
| 16 kHz | Used directly | ✅ Optimal, no resampling |
| 22.05 kHz | Resampled to 16 kHz | ✅ Good quality |
| 44.1 kHz | Resampled to 16 kHz | ✅ CD quality, fine |
| 48 kHz | Resampled to 16 kHz | ✅ Professional audio, fine |
| 96 kHz | Resampled to 16 kHz | ⚠️ Unnecessary, larger files |
Key Insight: Higher sample rates don't improve Whisper's accuracy. The model was trained on 16 kHz audio, so providing 16 kHz input avoids unnecessary resampling and file size.
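To make the file size point concrete, here is a small worked calculation of uncompressed PCM size (sample rate × bytes per sample × channels × duration). The helper name `pcm_size_bytes` is illustrative, and container headers are ignored.

```python
def pcm_size_bytes(sample_rate_hz, bit_depth, channels, duration_s):
    """Size of uncompressed PCM audio data, ignoring container headers."""
    bytes_per_sample = bit_depth // 8
    return int(sample_rate_hz * bytes_per_sample * channels * duration_s)


one_hour = 3600
speech = pcm_size_bytes(16_000, 16, 1, one_hour)  # 115,200,000 bytes
studio = pcm_size_bytes(96_000, 24, 2, one_hour)  # 2,073,600,000 bytes

print(f"16 kHz / 16-bit / mono:   {speech / 1e6:.1f} MB per hour")
print(f"96 kHz / 24-bit / stereo: {studio / 1e6:.1f} MB per hour")
print(f"Ratio: {studio / speech:.0f}x")
```

An hour of 96 kHz, 24-bit stereo takes 18 times the space of the same hour at 16 kHz, 16-bit mono, and Whisper reduces both to the same 16 kHz mono signal before processing.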
Best Practice
# Convert audio to 16 kHz before processing (optional optimization)
import ffmpeg

def convert_to_16khz(input_file, output_file):
    stream = ffmpeg.input(input_file)
    stream = ffmpeg.output(
        stream,
        output_file,
        acodec='pcm_s16le',
        ac=1,      # Mono
        ar=16000   # 16 kHz
    )
    ffmpeg.run(stream, overwrite_output=True)
Bit Depth Requirements
Supported Bit Depths
| Bit Depth | Status | Notes |
|---|---|---|
| 8-bit | ✅ Supported | Low quality, not recommended |
| 16-bit | ✅ Recommended | Standard, sufficient |
| 24-bit | ✅ Supported | Professional, larger files |
| 32-bit float | ✅ Supported | Studio quality, overkill |
Recommended: 16-bit PCM is the standard and provides excellent quality for speech recognition. Higher bit depths don't improve transcription accuracy.
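If you are unsure what a WAV file actually contains, the standard library can report its sample rate, bit depth, and channel count. This is a minimal sketch: `describe_wav` is a hypothetical helper, and Python's `wave` module only reads integer PCM WAV files, so other formats need a tool such as ffprobe.

```python
import wave


def describe_wav(path):
    """Return sample rate, bit depth, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }

# Example (assumes the file exists):
#   print(describe_wav("your_audio.wav"))
```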
Channel Configuration: Mono vs Stereo
Mono (Recommended)
Whisper processes audio as mono internally, so mono input is optimal.
Advantages:
- Smaller file size
- Faster processing
- No channel mixing needed
- Optimal for single-speaker content
Use mono for:
- Single speaker recordings
- Phone calls
- Podcasts with one host
- Most transcription tasks
Stereo (Supported)
Stereo files are automatically downmixed to mono when Whisper decodes them through FFmpeg, so per-channel separation is lost.
When stereo is useful:
- Separate speakers on different channels (rare)
- Original recording is stereo (conversion is automatic)
Best practice: Convert stereo to mono before processing if you have control:
import ffmpeg
# Convert stereo to mono
stream = ffmpeg.input('stereo_audio.wav')
stream = ffmpeg.output(
    stream,
    'mono_audio.wav',
    ac=1  # Mono channel
)
ffmpeg.run(stream, overwrite_output=True)
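To illustrate what a downmix does to the samples, the sketch below averages interleaved left/right samples in plain Python. A plain average is one common way to downmix and is used here as an assumption; FFmpeg's exact mixing coefficients may differ. The name `downmix_to_mono` is hypothetical.

```python
def downmix_to_mono(interleaved):
    """Average interleaved stereo samples [L0, R0, L1, R1, ...] into mono."""
    if len(interleaved) % 2 != 0:
        raise ValueError("expected an even number of interleaved samples")
    left = interleaved[0::2]
    right = interleaved[1::2]
    return [(l + r) // 2 for l, r in zip(left, right)]


stereo = [1000, 3000, -2000, 2000, 500, 500]
print(downmix_to_mono(stereo))  # [2000, 0, 500]
```

The second pair shows why downmixing matters when speakers sit on separate channels: content that is out of phase between left and right cancels in the mono result.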
File Size Limits
Practical Limits
Whisper doesn't have a hard file size limit, but practical considerations apply:
| File Size | Processing Time | Recommendation |
|---|---|---|
| < 10 MB | Seconds | ✅ Ideal |
| 10-100 MB | Minutes | ✅ Good |
| 100-500 MB | 10-30 minutes | ⚠️ Consider chunking |
| > 500 MB | 30+ minutes | ⚠️ Must chunk |
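File size is only a proxy here: processing time follows audio duration, and the same 100 MB can hold very different amounts of speech depending on the bitrate. A rough estimate for constant-bitrate files (the helper name `estimate_duration_s` is illustrative, and metadata overhead is ignored):

```python
def estimate_duration_s(file_size_bytes, bitrate_kbps):
    """Approximate duration of a constant-bitrate file, ignoring metadata overhead."""
    return file_size_bytes * 8 / (bitrate_kbps * 1000)


size = 100 * 1000 * 1000  # 100 MB
for kbps in (64, 128, 256):
    minutes = estimate_duration_s(size, kbps) / 60
    print(f"100 MB at {kbps} kbps is about {minutes:.0f} minutes")
```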
Memory Considerations
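VRAM use depends mainly on the model size, while long files mostly add to system RAM for the decoded audio. Approximate VRAM per model: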
- Base model: ~1-2 GB VRAM
- Small model: ~2-3 GB VRAM
- Medium model: ~5-6 GB VRAM
- Large model: ~10-12 GB VRAM
Best practice: For files > 100 MB, split into chunks (see below).
Duration Limits
Recommended Duration
| Duration | Status | Notes |
|---|---|---|
| < 30 minutes | ✅ Optimal | Process directly |
| 30-60 minutes | ✅ Good | May need chunking |
| 1-2 hours | ⚠️ Chunk recommended | Better accuracy with chunks |
| > 2 hours | ⚠️ Must chunk | Required for stability |
Why Chunk Long Audio?
- Memory limits: Prevents out-of-memory errors
- Better accuracy: Errors such as repetition loops stay confined to one chunk instead of carrying forward
- Faster processing: Parallel processing possible
- Error recovery: If one chunk fails, others succeed
Chunking strategy:
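# Split long audio into 30-60 second chunks with 5-10 second overlap
import whisper

def chunk_audio(audio_path, chunk_length=60, overlap=5):
    """Return (start_seconds, samples) pairs of 16 kHz mono audio."""
    sr = whisper.audio.SAMPLE_RATE  # 16000
    audio = whisper.load_audio(audio_path)  # float32 NumPy array
    step = (chunk_length - overlap) * sr
    last_start = max(len(audio) - overlap * sr, 1)
    # Transcribe each chunk separately, then merge results using the start offsets
    return [(start / sr, audio[start:start + chunk_length * sr])
            for start in range(0, last_start, step)]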
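The window arithmetic and the merge step can be sketched without any audio libraries. In the example below, `chunk_windows` and `merge_segments` are hypothetical helpers. The segment dictionaries mirror the `start`, `end`, and `text` keys found in Whisper's `result["segments"]`, and the overlap handling is a simple heuristic, not an exact deduplication.

```python
def chunk_windows(total_s, chunk_length=60, overlap=5):
    """Return (start, end) windows in seconds that cover [0, total_s]."""
    if overlap >= chunk_length:
        raise ValueError("overlap must be smaller than chunk_length")
    total_s = float(total_s)
    step = chunk_length - overlap
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


def merge_segments(chunk_results):
    """Merge per-chunk segments into one timeline.

    chunk_results: list of (chunk_start_s, segments), where each segment is a
    dict with "start", "end" and "text" measured from the start of its chunk.
    Segments that begin before the previously kept segment ends are treated
    as duplicates from the overlap region and skipped.
    """
    merged = []
    last_end = 0.0
    for chunk_start, segments in chunk_results:
        for seg in segments:
            start = seg["start"] + chunk_start
            end = seg["end"] + chunk_start
            if merged and start < last_end:
                continue  # already covered by the previous chunk
            merged.append({"start": start, "end": end, "text": seg["text"]})
            last_end = end
    return merged


print(chunk_windows(130, chunk_length=60, overlap=5))
# [(0.0, 60.0), (55.0, 115.0), (110.0, 130.0)]
```

Skipping overlapped segments can drop a few words when a segment straddles a chunk boundary. A production pipeline would typically compare the overlapping text or cut at silences instead.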
Audio Quality Requirements
Minimum Quality Standards
For acceptable accuracy, your audio should meet these criteria:
| Factor | Minimum | Optimal |
|---|---|---|
| Signal-to-Noise Ratio | > 10 dB | > 20 dB |
| Bitrate (MP3) | ≥ 64 kbps | ≥ 128 kbps |
| Volume level | Audible | Peak normalized to -3 dBFS |
| Background noise | Minimal | None |
| Echo/reverb | Minimal | None |
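The signal-to-noise figures in the table can be estimated by comparing the RMS level of a stretch of speech with the RMS level of a noise-only stretch (for example, a pause between sentences). The sketch below is an approximation under that assumption: the speech excerpt still contains noise, and the helper names `rms` and `snr_db` are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Estimate SNR in dB from a speech excerpt and a noise-only excerpt."""
    speech = rms(speech_samples)
    noise = rms(noise_samples)
    if noise == 0:
        return math.inf   # no measurable noise
    if speech == 0:
        return -math.inf  # no measurable speech
    return 20 * math.log10(speech / noise)


speech = [1000, -1000, 1000, -1000]
noise = [100, -100, 100, -100]
print(f"{snr_db(speech, noise):.1f} dB")  # 20.0 dB
```

A tenfold amplitude ratio corresponds to 20 dB, which matches the table's optimal threshold. A ratio of about 3.2 corresponds to the 10 dB minimum.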
Quality Checklist
Before transcribing, ensure:
- ✅ Clear speech: Speakers are audible and understandable
- ✅ Minimal noise: Background sounds don't overpower speech
- ✅ Consistent volume: No sudden volume changes
- ✅ No clipping: Audio isn't distorted or saturated
- ✅ Good microphone: Quality recording equipment used
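The clipping item on this checklist can be checked numerically by counting how many samples sit at or next to full scale. This is a minimal sketch for 16-bit samples. The name `clipping_ratio` and the one-step tolerance are assumptions, and what counts as "too much" clipping is a judgment call.

```python
def clipping_ratio(samples, full_scale=32767, tolerance=1):
    """Fraction of samples at or within `tolerance` of full scale."""
    if not samples:
        return 0.0
    limit = full_scale - tolerance
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


samples = [0, 12000, 32767, 32767, -32768, 500, -400, 32766]
print(f"{clipping_ratio(samples):.0%} of samples are at full scale")
```

Clean recordings normally show a ratio at or near zero. Runs of consecutive full-scale samples are a stronger sign of clipping than isolated peaks.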
Codec Requirements
Recommended Codecs
| Codec | Format | Quality | Recommendation |
|---|---|---|---|
| PCM | WAV | Lossless | ✅ Best for accuracy |
| FLAC | FLAC | Lossless | ✅ Excellent, compressed |
| AAC | M4A, MP4 | High quality | ✅ Very good |
| MP3 | MP3 | Lossy | ✅ Good at ≥128 kbps |
| OGG Vorbis | OGG | Lossy | ✅ Good quality |
| OPUS | OPUS | Lossy | ✅ Good, low latency |
Codec Best Practices
For maximum accuracy:
- Use PCM (WAV) or FLAC (lossless)
For practical use:
- Use AAC or MP3 at ≥128 kbps (excellent results)
Avoid:
- Very low bitrate MP3 (< 64 kbps)
- Highly compressed formats
- Phone codecs (AMR, G.711) unless necessary
Audio Preprocessing Recommendations
Before Transcribing
While Whisper handles many issues automatically, preprocessing can improve results:
1. Normalize Volume
import numpy as np
from scipy.io import wavfile

def normalize_audio(audio_path, output_path, target_dB=-3.0):
    # Output is always written as 16-bit PCM
    sr, audio = wavfile.read(audio_path)
    audio = audio.astype(np.float32)
    # Peak-normalize to the target level (dB relative to full scale)
    max_val = np.max(np.abs(audio))
    if max_val == 0:
        raise ValueError("Audio is silent; nothing to normalize")
    target_linear = 10 ** (target_dB / 20)
    audio = audio * (target_linear / max_val)
    # Clip to prevent overflow
    audio = np.clip(audio, -1.0, 1.0)
    wavfile.write(output_path, sr, (audio * 32767).astype(np.int16))
2. Remove Silence
# Remove leading/trailing silence
# Helps with processing time and accuracy
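A minimal way to trim leading and trailing silence is to drop samples below an amplitude threshold at both ends. The sketch below works on a plain list of 16-bit sample values. The name `trim_silence` and the threshold of 500 are assumptions to tune per recording, and real tools usually measure level over short windows instead of single samples.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


audio = [0, 10, -20, 4000, -3500, 30, 2000, 5, 0]
print(trim_silence(audio))  # [4000, -3500, 30, 2000]
```

Quiet samples in the middle of the recording are kept, so pauses between words are not affected.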
3. Noise Reduction (Optional)
For noisy recordings:
# Use a noise reduction library such as noisereduce
# (librosa can load and inspect the audio), or specialized tools
# Only if background noise is significant
4. Resample to 16 kHz (Optional)
If you want to optimize file size:
import ffmpeg
stream = ffmpeg.input('input.wav')
stream = ffmpeg.output(
    stream,
    'output_16k.wav',
    ar=16000  # Resample to 16 kHz
)
ffmpeg.run(stream, overwrite_output=True)
Common Audio Issues and Solutions
Issue 1: Very Low Sample Rate (8 kHz)
Problem: Phone recordings at 8 kHz may have reduced accuracy.
Solution:
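- Use the Whisper medium or large model (larger models tend to cope better with low-bandwidth audio)
- Manual upsampling is not required: Whisper resamples to 16 kHz itself, and upsampling cannot restore the missing high frequencies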
Issue 2: Stereo with Different Speakers
Problem: Two speakers on separate channels.
Solution:
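# Extract each channel separately
import torchaudio
import whisper

model = whisper.load_model("base")
audio, sr = torchaudio.load('stereo.wav')  # shape: (channels, samples)
# transcribe() expects 16 kHz audio when given a tensor or array
audio = torchaudio.functional.resample(audio, sr, 16000)
speaker1 = audio[0]  # Left channel
speaker2 = audio[1]  # Right channel
# Transcribe each separately
result1 = model.transcribe(speaker1)
result2 = model.transcribe(speaker2)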
Issue 3: Variable Bitrate MP3
Problem: VBR MP3 may cause issues with some tools.
Solution:
- Convert to constant bitrate (CBR) or WAV
- Whisper handles VBR fine, but CBR is more predictable
Issue 4: Corrupted Audio Files
Problem: File plays but Whisper fails.
Solution:
# Re-encode the file
import ffmpeg
stream = ffmpeg.input('corrupted.mp3')
stream = ffmpeg.output(
    stream,
    'fixed.wav',
    acodec='pcm_s16le'
)
ffmpeg.run(stream, overwrite_output=True)
Issue 5: Very Long Audio Files
Problem: Out of memory or very slow processing.
Solution:
- Split into 30-60 second chunks
- Process chunks sequentially or in parallel
- Merge results with timestamps
Format-Specific Recommendations
For Phone Calls
| Parameter | Value | Reason |
|---|---|---|
| Sample rate | 8-16 kHz | Phone quality |
| Format | WAV or MP3 | Standard |
| Bitrate | ≥ 64 kbps | Phone codec quality |
| Channels | Mono | Standard for calls |
For Meetings (Zoom, Teams)
| Parameter | Value | Reason |
|---|---|---|
| Sample rate | 16-48 kHz | High quality |
| Format | MP4 (extract audio) | Video format |
| Bitrate | ≥ 128 kbps | Good quality |
| Channels | Mono or Stereo | Depends on setup |
For Podcasts
| Parameter | Value | Reason |
|---|---|---|
| Sample rate | 44.1-48 kHz | Professional quality |
| Format | MP3, WAV, or M4A | Common formats |
| Bitrate | ≥ 128 kbps | Good quality |
| Channels | Mono | Standard for podcasts |
For Interviews
| Parameter | Value | Reason |
|---|---|---|
| Sample rate | 16-48 kHz | High quality |
| Format | WAV or FLAC | Maximum accuracy |
| Bitrate | Lossless or ≥ 192 kbps | Professional |
| Channels | Mono | Standard |
Whisper Audio Requirements Summary
Minimum Requirements
- ✅ Format: Any FFmpeg-supported format
- ✅ Sample rate: Any (8 kHz minimum recommended)
- ✅ Bit depth: 8-bit or higher
- ✅ Channels: Mono or stereo (mono preferred)
- ✅ File size: No hard limit (chunk if > 100 MB)
- ✅ Duration: No hard limit (chunk if > 1 hour)
Optimal Requirements
- ✅ Format: WAV, FLAC, or MP3 (≥128 kbps)
- ✅ Sample rate: 16 kHz (optimal, no resampling)
- ✅ Bit depth: 16-bit PCM
- ✅ Channels: Mono
- ✅ Quality: Clear speech, minimal noise
- ✅ Preprocessing: Normalized volume, no clipping
Quick Reference: Audio Preparation Checklist
Before transcribing with Whisper:
- Format: WAV, MP3, FLAC, M4A, or other supported format
- Sample rate: 16 kHz (optimal) or any supported rate
- Bit depth: 16-bit (recommended)
- Channels: Mono (preferred) or stereo
- File size: < 100 MB (or plan to chunk)
- Duration: < 1 hour (or plan to chunk)
- Quality: Clear speech, minimal background noise
- Volume: Normalized, no clipping
- Codec: Lossless (WAV/FLAC) or high-quality lossy (MP3 ≥128 kbps)
Testing Your Audio
Quick Test
import whisper
# Load model
model = whisper.load_model("base")
# Test transcription (raises an exception if the file cannot be decoded)
result = model.transcribe("your_audio.wav")
# Check whether any speech was recognized
if result["text"].strip():
    print("✅ Audio format is compatible")
    print(f"Detected language: {result['language']}")
else:
    print("⚠️ No speech recognized - check audio content and quality")
Common Error Messages
| Error | Cause | Solution |
|---|---|---|
| "File not found" | Wrong path | Check file path |
| "Unsupported format" | Format not supported | Convert to WAV/MP3 |
| "Out of memory" | File too large | Chunk the audio |
| "Empty audio" | Corrupted file | Re-encode the file |
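Several of these failures can be caught before loading a model. The sketch below checks for the usual suspects: FFmpeg missing from PATH (Whisper calls it to decode audio), a wrong path, or an empty file. `preflight_check` is a hypothetical helper and does not validate the audio content itself.

```python
import os
import shutil


def preflight_check(audio_path):
    """Return a list of problems that would stop Whisper from reading the file."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not os.path.isfile(audio_path):
        problems.append(f"file not found: {audio_path}")
    elif os.path.getsize(audio_path) == 0:
        problems.append(f"file is empty: {audio_path}")
    return problems


issues = preflight_check("your_audio.wav")
print(issues if issues else "No obvious problems found")
```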
Best Practices Summary
- Use 16 kHz sample rate when possible (optimal for Whisper)
- Prefer mono over stereo (Whisper processes mono internally)
- Use lossless formats (WAV/FLAC) for maximum accuracy
- Chunk long files (> 1 hour) for better accuracy and stability
- Normalize audio to consistent volume levels
- Minimize background noise for best results
- Use appropriate model size (larger models handle poor audio better)
- Test with base model first before using larger models
Conclusion
Whisper is highly flexible and can handle a wide variety of audio formats and qualities. However, following optimal specifications ensures the best transcription accuracy:
- Format: WAV, FLAC, or MP3 (≥128 kbps)
- Sample rate: 16 kHz (optimal)
- Bit depth: 16-bit PCM
- Channels: Mono
- Quality: Clear speech with minimal noise
Remember: Clear audio beats perfect format specifications. Even with optimal technical specs, poor recording quality will reduce accuracy. Focus on clear speech, minimal noise, and good microphone placement for the best results.
For production use, platforms like SayToWords automatically handle format conversion, resampling, and optimization, so you can focus on getting clear audio rather than technical specifications.
Need help preparing your audio for Whisper transcription? Check out our other guides on audio preprocessing, chunking strategies, and accuracy optimization.
