Whisper Audio Requirements: Complete Guide to Supported Formats and Specifications

Published: 2026-01-12
Tags: SpeechToText, Whisper
Author: Eric King

Understanding Whisper's audio requirements helps you get the best transcription accuracy. Whisper is flexible and handles many audio formats, but audio that follows the specifications below transcribes more reliably and processes faster.

This guide covers supported formats, technical specifications, and best practices for preparing audio files for Whisper transcription.

Supported Audio Formats

Whisper decodes its input through FFmpeg, so it accepts a wide range of audio and video formats. The most common ones are:

Audio Formats

Format	Extension	Notes
WAV	`.wav`	✅ Preferred, lossless
MP3	`.mp3`	✅ Most common, widely used
FLAC	`.flac`	✅ Lossless, good compression
M4A	`.m4a`	✅ Apple format, AAC codec
AAC	`.aac`	✅ High quality compression
OGG	`.ogg`	✅ Open source format
OPUS	`.opus`	✅ Low latency, web-friendly
WMA	`.wma`	⚠️ Less common
AMR	`.amr`	⚠️ Low quality, phone recordings

Video Formats (Audio Extraction)

Format	Extension	Notes
MP4	`.mp4`	✅ Most common video format
AVI	`.avi`	✅ Older format, still supported
MKV	`.mkv`	✅ Container format
MOV	`.mov`	✅ QuickTime format
WebM	`.webm`	✅ Web video format
FLV	`.flv`	⚠️ Legacy Flash format

Important: FFmpeg extracts the audio track from video files automatically, so you can pass video files to Whisper directly.

Sample Rate Requirements

Optimal Sample Rate: 16 kHz

Whisper internally resamples all audio to 16 kHz mono before processing. This is the optimal sample rate for speech recognition.

Supported Sample Rates

Whisper accepts audio at any sample rate, but here's what you should know:

Input Sample Rate	Whisper Processing	Recommendation
8 kHz	Resampled to 16 kHz	✅ Phone calls, acceptable
16 kHz	Used directly	✅ Optimal, no resampling
22.05 kHz	Resampled to 16 kHz	✅ Good quality
44.1 kHz	Resampled to 16 kHz	✅ CD quality, fine
48 kHz	Resampled to 16 kHz	✅ Professional audio, fine
96 kHz	Resampled to 16 kHz	⚠️ Unnecessary, larger files

Key Insight: Higher sample rates don't improve Whisper's accuracy. The model was trained on 16 kHz audio, so providing 16 kHz input avoids unnecessary resampling and keeps files small.
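To put numbers on the file size point, here is a minimal sketch that computes the size of uncompressed PCM audio from its parameters. The helper name `pcm_size_bytes` is made up for this illustration, and the result ignores the small WAV header.

```python
def pcm_size_bytes(duration_s: float, sample_rate: int,
                   bit_depth: int = 16, channels: int = 1) -> int:
    """Size of raw PCM audio in bytes, ignoring the small WAV header."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * bytes_per_sample * channels)


one_minute_16k_mono = pcm_size_bytes(60, 16_000)                # 1,920,000 bytes
one_minute_48k_stereo = pcm_size_bytes(60, 48_000, channels=2)  # 11,520,000 bytes
print(one_minute_48k_stereo / one_minute_16k_mono)              # 6.0
```

One minute of 48 kHz stereo is six times larger than the same minute at 16 kHz mono, with no accuracy benefit once Whisper has resampled it.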

Best Practice

# Convert audio to 16 kHz before processing (optional optimization)
import ffmpeg

def convert_to_16khz(input_file, output_file):
    stream = ffmpeg.input(input_file)
    stream = ffmpeg.output(
        stream,
        output_file,
        acodec='pcm_s16le',
        ac=1,  # Mono
        ar=16000  # 16 kHz
    )
    ffmpeg.run(stream, overwrite_output=True)

Bit Depth Requirements

Supported Bit Depths

Bit Depth	Status	Notes
8-bit	✅ Supported	Low quality, not recommended
16-bit	✅ Recommended	Standard, sufficient
24-bit	✅ Supported	Professional, larger files
32-bit float	✅ Supported	Studio quality, overkill

Recommended: 16-bit PCM is the standard and provides excellent quality for speech recognition. Higher bit depths don't improve transcription accuracy.
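To check whether an existing WAV file already matches the recommended sample rate, bit depth, and channel count, something like the following sketch may help. It uses only Python's standard `wave` module, which reads plain PCM WAV files; other formats would need FFmpeg or a similar tool. The function names `describe_wav` and `matches_recommended_spec` are hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format properties from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_recommended_spec(info: dict) -> bool:
    """True when the file is already 16 kHz, 16-bit, mono."""
    return (info["sample_rate"] == 16_000
            and info["bit_depth"] == 16
            and info["channels"] == 1)
```

Calling `matches_recommended_spec(describe_wav("your_audio.wav"))` returns `True` when no conversion is needed; the file name is a placeholder.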

Channel Configuration: Mono vs Stereo

Mono (Recommended)

Whisper processes audio as mono internally, so mono input is optimal.

Advantages:

Smaller file size
Faster processing
No channel mixing needed
Optimal for single-speaker content

Use mono for:

Single speaker recordings
Phone calls
Podcasts with one host
Most transcription tasks

Stereo (Supported)

Stereo files are automatically downmixed to mono, with both channels mixed into a single signal.

When stereo is useful:

Separate speakers on different channels (rare)
Original recording is stereo (conversion is automatic)

Best practice: Convert stereo to mono before processing if you have control:

import ffmpeg

# Convert stereo to mono
stream = ffmpeg.input('stereo_audio.wav')
stream = ffmpeg.output(
    stream,
    'mono_audio.wav',
    ac=1  # Mono channel
)
ffmpeg.run(stream, overwrite_output=True)
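To illustrate what a stereo-to-mono downmix does to the signal, the sketch below models it as a plain average of the two channels. That is an assumption made for illustration; FFmpeg's exact mixing coefficients may differ. The function name `downmix_to_mono` is hypothetical.

```python
def downmix_to_mono(left: list[float], right: list[float]) -> list[float]:
    """Average two equally long channels into one."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    return [(l + r) / 2 for l, r in zip(left, right)]


# Two speakers on separate channels end up mixed into one signal.
print(downmix_to_mono([0.5, 0.0, -0.5], [0.0, 0.25, -0.5]))  # [0.25, 0.125, -0.5]

# Channels that are phase-inverted copies of each other cancel out.
print(downmix_to_mono([0.5, -0.25], [-0.5, 0.25]))  # [0.0, 0.0]
```

The first case shows why speakers recorded on separate channels are better transcribed per channel; the second shows a rare failure mode where a downmix can silence the recording.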

File Size Limits

Practical Limits

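Whisper doesn't have a hard file size limit, but practical considerations apply. Processing time depends on audio duration, model size, and hardware more than on file size, so treat these figures as rough guidance: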

File Size	Processing Time	Recommendation
< 10 MB	Seconds	✅ Ideal
10-100 MB	Minutes	✅ Good
100-500 MB	10-30 minutes	⚠️ Consider chunking
> 500 MB	30+ minutes	⚠️ Must chunk
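File size alone says little about how much audio a file holds. As a rough sketch, duration can be estimated from size and bitrate, assuming a constant bitrate and ignoring container overhead; the helper name `estimate_duration_seconds` is made up for this example.

```python
def estimate_duration_seconds(file_size_bytes: int, bitrate_kbps: float) -> float:
    """Approximate duration of a constant-bitrate file from its size."""
    bits = file_size_bytes * 8
    return bits / (bitrate_kbps * 1000)


size = 100_000_000  # 100 MB, using decimal megabytes
print(estimate_duration_seconds(size, 128) / 60)     # about 104 minutes of 128 kbps MP3
print(estimate_duration_seconds(size, 1411.2) / 60)  # about 9.4 minutes of CD-quality WAV
```

The same 100 MB can be under ten minutes or well over an hour of speech, which is why the duration limits in the next section are the more reliable guide.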

Memory Considerations

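VRAM usage is driven mainly by model size, because Whisper processes audio in 30-second windows. Long files mostly increase system RAM for the decoded audio and total processing time. Approximate VRAM by model: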

Base model: ~1-2 GB VRAM
Small model: ~2-3 GB VRAM
Medium model: ~5-6 GB VRAM
Large model: ~10-12 GB VRAM

Best practice: For files > 100 MB, split into chunks (see below).

Duration Limits

Recommended Duration

Duration	Status	Notes
< 30 minutes	✅ Optimal	Process directly
30-60 minutes	✅ Good	May need chunking
1-2 hours	⚠️ Chunk recommended	Better accuracy with chunks
> 2 hours	⚠️ Must chunk	Required for stability

Why Chunk Long Audio?

Memory limits: Prevents out-of-memory errors
Better accuracy: A repetition or hallucination loop stays inside one chunk instead of carrying over into the rest of the file
Faster processing: Parallel processing possible
Error recovery: If one chunk fails, others succeed

Chunking strategy:

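# Split long audio into 30-60 second chunks with 5-10 second overlap
import whisper

def chunk_audio(audio_path, chunk_length=60, overlap=5):
    """Return (start_seconds, samples) pairs; transcribe each, then merge by offset."""
    sr = 16000  # whisper.load_audio returns float32 mono audio at 16 kHz
    audio = whisper.load_audio(audio_path)
    size = chunk_length * sr                # samples per chunk
    step = (chunk_length - overlap) * sr    # hop between chunk starts
    last_start = max(len(audio) - overlap * sr, 1)  # skip a final overlap-only chunk
    return [(start / sr, audio[start:start + size])
            for start in range(0, last_start, step)]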
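The "merge results with timestamps" step can be sketched as follows. This assumes each chunk was transcribed separately and that you kept each chunk's start offset along with its segments, which are dicts with "start", "end", and "text" keys like the entries in a Whisper result's "segments" list. The function name and the overlap handling are illustrative: segments that begin inside already-covered audio are simply dropped, a crude heuristic that can still lose or repeat a few words at chunk boundaries.

```python
def merge_chunk_segments(chunks):
    """Combine per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment has "start", "end", and "text" keys relative to its chunk.
    """
    merged = []
    for chunk_start, segments in sorted(chunks, key=lambda c: c[0]):
        for seg in segments:
            start = seg["start"] + chunk_start
            end = seg["end"] + chunk_start
            # Skip segments that begin inside audio already covered by the
            # previous chunk (the overlap region).
            if merged and start < merged[-1]["end"]:
                continue
            merged.append({"start": start, "end": end, "text": seg["text"]})
    return merged


chunks = [
    (0, [{"start": 0.0, "end": 52.0, "text": "first"},
         {"start": 52.0, "end": 60.0, "text": "second"}]),
    (55, [{"start": 0.0, "end": 5.0, "text": "second again"},
          {"start": 5.0, "end": 20.0, "text": "third"}]),
]
print([s["text"] for s in merge_chunk_segments(chunks)])  # ['first', 'second', 'third']
```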

Audio Quality Requirements

Minimum Quality Standards

For acceptable accuracy, your audio should meet these criteria:

Factor	Minimum	Optimal
Signal-to-Noise Ratio	> 10 dB	> 20 dB
Bitrate (MP3)	≥ 64 kbps	≥ 128 kbps
Volume level	Audible	Peak normalized to -3 dBFS
Background noise	Minimal	None
Echo/reverb	Minimal	None
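The signal-to-noise figures in the table can be estimated from the recording itself. The sketch below assumes you can pick one stretch of samples containing speech and another containing only background noise; the function names are hypothetical and the inputs are plain sequences of sample values.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(mean_power(speech_samples) / mean_power(noise_samples))


# Speech with 10x the amplitude of the noise has 100x the power, i.e. 20 dB.
print(round(snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1]), 2))  # 20.0
```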

Quality Checklist

Before transcribing, ensure:

✅ Clear speech: Speakers are audible and understandable
✅ Minimal noise: Background sounds don't overpower speech
✅ Consistent volume: No sudden volume changes
✅ No clipping: Audio isn't distorted or saturated
✅ Good microphone: Quality recording equipment used
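The "no clipping" item can be checked mechanically. A minimal sketch, assuming 16-bit PCM samples as plain integers: it counts the fraction of samples that sit at full scale. The name `clipping_ratio` and the idea of using a simple ratio as the indicator are illustrative choices, not a standard metric.

```python
def clipping_ratio(samples, full_scale=32767):
    """Fraction of samples at or beyond full scale (16-bit PCM by default)."""
    if len(samples) == 0:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


clean = [0, 12000, -15000, 8000]
distorted = [32767, 32767, -32768, 1000]
print(clipping_ratio(clean))      # 0.0
print(clipping_ratio(distorted))  # 0.75
```

A ratio noticeably above zero suggests the recording gain was too high and the audio should be re-recorded if possible, since clipping cannot be undone by normalization.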

Codec Requirements

Recommended Codecs

Codec	Format	Quality	Recommendation
PCM	WAV	Lossless	✅ Best for accuracy
FLAC	FLAC	Lossless	✅ Excellent, compressed
AAC	M4A, MP4	High quality	✅ Very good
MP3	MP3	Lossy	✅ Good at ≥128 kbps
OGG Vorbis	OGG	Lossy	✅ Good quality
OPUS	OPUS	Lossy	✅ Good, low latency

Codec Best Practices

For maximum accuracy:

Use PCM (WAV) or FLAC (lossless)

For practical use:

Use AAC or MP3 at ≥128 kbps (excellent results)

Avoid:

Very low bitrate MP3 (< 64 kbps)
Highly compressed formats
Phone codecs (AMR, G.711) unless necessary

Audio Preprocessing Recommendations

Before Transcribing

While Whisper handles many issues automatically, preprocessing can improve results:

1. Normalize Volume

import numpy as np
from scipy.io import wavfile

def normalize_audio(audio_path, output_path, target_dB=-3.0):
    # Expects a PCM WAV file; the output is always written as 16-bit PCM
    sr, audio = wavfile.read(audio_path)
    audio = audio.astype(np.float32)

    # Scale so the loudest sample sits at target_dB relative to full scale
    max_val = np.max(np.abs(audio))
    if max_val == 0:
        raise ValueError("Audio is silent; nothing to normalize")
    target_linear = 10 ** (target_dB / 20)
    audio = audio * (target_linear / max_val)

    # Clip to prevent overflow
    audio = np.clip(audio, -1.0, 1.0)

    wavfile.write(output_path, sr, (audio * 32767).astype(np.int16))

2. Remove Silence

# Remove leading/trailing silence
# Helps with processing time and accuracy
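As a minimal sketch of the idea, leading and trailing silence can be trimmed by looking for the first and last sample above an amplitude threshold. The function name and the threshold of 500 (for 16-bit samples) are assumptions for illustration; dedicated tools usually measure energy over short windows rather than single samples.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


clip = [0, 3, -2, 4000, -3500, 10, 2800, 1, 0]
print(trim_silence(clip))  # [4000, -3500, 10, 2800]
```

Quiet samples between loud ones are kept, so pauses inside the speech are not affected.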

3. Noise Reduction (Optional)

For noisy recordings:

# Use noise reduction libraries
# librosa, noisereduce, or specialized tools
# Only if background noise is significant

4. Resample to 16 kHz (Optional)

If you want to optimize file size:

import ffmpeg

stream = ffmpeg.input('input.wav')
stream = ffmpeg.output(
    stream,
    'output_16k.wav',
    ar=16000  # Resample to 16 kHz
)
ffmpeg.run(stream, overwrite_output=True)

Common Audio Issues and Solutions

Issue 1: Very Low Sample Rate (8 kHz)

Problem: Phone recordings at 8 kHz may have reduced accuracy.

Solution:

Use Whisper medium or large model (better at low sample rates)
Skip manual upsampling: Whisper resamples to 16 kHz automatically, and upsampling cannot restore frequencies missing from the original recording

Issue 2: Stereo with Different Speakers

Problem: Two speakers on separate channels.

Solution:

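# Extract each channel separately
import torchaudio
import whisper

model = whisper.load_model("base")

audio, sr = torchaudio.load('stereo.wav')  # shape: (channels, samples)
# Raw samples passed to Whisper must already be at 16 kHz
audio = torchaudio.functional.resample(audio, sr, 16000)
speaker1 = audio[0]  # Left channel
speaker2 = audio[1]  # Right channel

# Transcribe each separately
result1 = model.transcribe(speaker1)
result2 = model.transcribe(speaker2)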

Issue 3: Variable Bitrate MP3

Problem: VBR MP3 may cause issues with some tools.

Solution:

Convert to constant bitrate (CBR) or WAV
Whisper handles VBR fine, but CBR is more predictable

Issue 4: Corrupted Audio Files

Problem: File plays but Whisper fails.

Solution:

# Re-encode the file
import ffmpeg

stream = ffmpeg.input('corrupted.mp3')
stream = ffmpeg.output(
    stream,
    'fixed.wav',
    acodec='pcm_s16le'
)
ffmpeg.run(stream, overwrite_output=True)

Issue 5: Very Long Audio Files

Problem: Out of memory or very slow processing.

Solution:

Split into 30-60 second chunks
Process chunks sequentially or in parallel
Merge results with timestamps

Format-Specific Recommendations

For Phone Calls

Parameter	Value	Reason
Sample rate	8-16 kHz	Phone quality
Format	WAV or MP3	Standard
Bitrate	≥ 64 kbps	Phone codec quality
Channels	Mono	Standard for calls

For Meetings (Zoom, Teams)

Parameter	Value	Reason
Sample rate	16-48 kHz	High quality
Format	MP4 (extract audio)	Video format
Bitrate	≥ 128 kbps	Good quality
Channels	Mono or Stereo	Depends on setup

For Podcasts

Parameter	Value	Reason
Sample rate	44.1-48 kHz	Professional quality
Format	MP3, WAV, or M4A	Common formats
Bitrate	≥ 128 kbps	Good quality
Channels	Mono	Standard for podcasts

For Interviews

Parameter	Value	Reason
Sample rate	16-48 kHz	High quality
Format	WAV or FLAC	Maximum accuracy
Bitrate	Lossless or ≥ 192 kbps	Professional
Channels	Mono	Standard

Whisper Audio Requirements Summary

Minimum Requirements

✅ Format: Any FFmpeg-supported format
✅ Sample rate: Any (8 kHz minimum recommended)
✅ Bit depth: 8-bit or higher
✅ Channels: Mono or stereo (mono preferred)
✅ File size: No hard limit (chunk if > 100 MB)
✅ Duration: No hard limit (chunk if > 1 hour)

Optimal Requirements

✅ Format: WAV, FLAC, or MP3 (≥128 kbps)
✅ Sample rate: 16 kHz (optimal, no resampling)
✅ Bit depth: 16-bit PCM
✅ Channels: Mono
✅ Quality: Clear speech, minimal noise
✅ Preprocessing: Normalized volume, no clipping

Quick Reference: Audio Preparation Checklist

Before transcribing with Whisper:

Format: WAV, MP3, FLAC, M4A, or other supported format
Sample rate: 16 kHz (optimal) or any supported rate
Bit depth: 16-bit (recommended)
Channels: Mono (preferred) or stereo
File size: < 100 MB (or plan to chunk)
Duration: < 1 hour (or plan to chunk)
Quality: Clear speech, minimal background noise
Volume: Normalized, no clipping
Codec: Lossless (WAV/FLAC) or high-quality lossy (MP3 ≥128 kbps)

Testing Your Audio

Quick Test

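import whisper

# Load model
model = whisper.load_model("base")

# Test transcription; Whisper raises RuntimeError when FFmpeg cannot decode the file
try:
    result = model.transcribe("your_audio.wav")
except RuntimeError as err:
    print(f"⚠️ Could not load audio - check audio format: {err}")
else:
    # An empty transcript means the file decoded but no speech was recognized
    if result["text"].strip():
        print("✅ Audio format is compatible")
        print(f"Detected language: {result['language']}")
    else:
        print("⚠️ File decoded, but no speech was recognized")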
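Before running a transcription test, it can help to rule out the two most basic failure causes: a missing FFmpeg executable and a wrong or empty file. The following sketch uses only the standard library; the function name `preflight` and its messages are made up for this example and are not Whisper output.

```python
import os
import shutil


def preflight(audio_path):
    """Return a list of problems that would stop Whisper from loading the file."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg executable not found on PATH")
    if not os.path.isfile(audio_path):
        problems.append(f"file not found: {audio_path}")
    elif os.path.getsize(audio_path) == 0:
        problems.append(f"file is empty: {audio_path}")
    return problems


print(preflight("your_audio.wav"))  # an empty list means both checks passed
```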

Common Error Messages

Error (paraphrased)	Cause	Solution
"File not found"	Wrong path	Check file path
"Unsupported format"	Format not supported	Convert to WAV/MP3
"Out of memory"	File too large	Chunk the audio
"Empty audio"	Corrupted file	Re-encode the file

Best Practices Summary

Use 16 kHz sample rate when possible (optimal for Whisper)
Prefer mono over stereo (Whisper processes mono internally)
Use lossless formats (WAV/FLAC) for maximum accuracy
Chunk long files (> 1 hour) for better accuracy and stability
Normalize audio to consistent volume levels
Minimize background noise for best results
Use appropriate model size (larger models handle poor audio better)
Test with base model first before using larger models

Conclusion

Whisper is highly flexible and can handle a wide variety of audio formats and qualities. However, following optimal specifications ensures the best transcription accuracy:

Format: WAV, FLAC, or MP3 (≥128 kbps)
Sample rate: 16 kHz (optimal)
Bit depth: 16-bit PCM
Channels: Mono
Quality: Clear speech with minimal noise

Remember: Clear audio beats perfect format specifications. Even with optimal technical specs, poor recording quality will reduce accuracy. Focus on clear speech, minimal noise, and good microphone placement for the best results.

For production use, platforms like SayToWords automatically handle format conversion, resampling, and optimization, so you can focus on getting clear audio rather than technical specifications.

Need help preparing your audio for Whisper transcription? Check out our other guides on audio preprocessing, chunking strategies, and accuracy optimization.
