
Whisper Chunk Size Best Practices: Optimal Settings for Accuracy and Latency
Eric King
Introduction
Choosing the right chunk size is one of the most important factors when using Whisper for speech-to-text.
A poor chunk size can lead to:
- Broken sentences
- Missing words
- Higher word error rate (WER)
- Unnecessary latency and cost
In this guide, we'll break down Whisper chunk size best practices and help you choose the optimal settings for different use cases.
Why Chunk Size Matters in Whisper
Whisper processes up to ~30 seconds of audio per inference.
When dealing with long or continuous audio, chunking is unavoidable.
Chunk size directly affects:
- Context awareness
- Transcription accuracy
- Latency
- System throughput
Recommended Whisper Chunk Sizes
Quick Reference Table
| Use Case | Chunk Size | Overlap |
|---|---|---|
| Batch transcription | 20–30s | 2–3s |
| Podcasts / YouTube | 25–30s | 3s |
| Meetings | 15–20s | 2s |
| Call recordings | 10–15s | 2s |
| Streaming / live | 2–5s | 0.5–1s |
Long Audio Transcription (Best Accuracy)
Recommended settings
- Chunk size: 20–30 seconds
- Overlap: 2–3 seconds
Why it works:
- Preserves sentence-level context
- Improves punctuation and capitalization
- Reduces mid-sentence cuts
⚠️ Avoid exceeding 30 seconds: Whisper may truncate the audio.
Short Chunks: When Lower Latency Matters
Short chunks are useful for:
- Real-time captions
- Live meetings
- Voice assistants
Recommended settings
- Chunk size: 2–5 seconds
- Overlap: 0.5–1 second
Trade-offs:
- Faster feedback
- Lower context
- Requires buffering or re-prompting (see the sketch below)
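A rough sketch of the buffering pattern, assuming 16 kHz mono float32 audio; the `on_audio` callback and model choice are illustrative, not a specific streaming API:

```python
import numpy as np
import whisper

SAMPLE_RATE = 16000
CHUNK_S, OVERLAP_S = 3, 0.5   # small chunks for low-latency feedback

model = whisper.load_model("base")
buffer = np.zeros(0, dtype=np.float32)

def on_audio(frame: np.ndarray) -> None:
    """Hypothetical callback fed by your audio capture loop."""
    global buffer
    buffer = np.concatenate([buffer, frame])
    if len(buffer) >= CHUNK_S * SAMPLE_RATE:
        chunk = buffer
        # Keep only the overlap tail so boundary words recur in the next chunk
        buffer = buffer[-int(OVERLAP_S * SAMPLE_RATE):]
        print(model.transcribe(chunk)["text"])
```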
Chunk Overlap: Don't Skip This
Overlap prevents word loss at boundaries.
Best practices
- Overlap ≈ 10–15% of chunk size
- Deduplicate overlapping text in post-processing
- Keep the higher-confidence transcription
Example:
- Chunk size: 20s
- Overlap: 2s
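A minimal sketch of fixed-length splitting with these settings, assuming `audio` is a mono float32 NumPy array at 16 kHz (the sample rate Whisper expects):

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_S, OVERLAP_S = 20, 2   # 2s overlap = 10% of a 20s chunk

def split_with_overlap(audio: np.ndarray) -> list[np.ndarray]:
    """Slice audio into fixed-length chunks that share an overlap region."""
    chunk_len = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - OVERLAP_S) * SAMPLE_RATE  # advance 18s per chunk
    return [audio[i:i + chunk_len] for i in range(0, len(audio), step)]
```

Deduplicating the overlapping text is then a post-processing step on the transcripts of adjacent chunks.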
Fixed-Length vs VAD-Based Chunk Sizes
Fixed-Length Chunking
- Simple
- Predictable
❌ May cut off sentences
❌ Worse for conversations
VAD-Based Chunking (Recommended)
Using Voice Activity Detection:
- Splits on silence
- Produces natural segments
- Improves readability
Popular VAD options:
- WebRTC VAD
- Silero VAD
- pyannote.audio
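As an illustration, a minimal sketch using Silero VAD to find natural speech segments before transcription; the file name is a placeholder, and segments longer than ~30 seconds should still be split further:

```python
import torch

# Load Silero VAD and its helper functions via torch.hub
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("recording.wav", sampling_rate=16000)
# Each entry holds 'start' and 'end' sample offsets of one speech region
segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Slice the waveform on silence boundaries to get natural chunks
speech_chunks = [wav[s["start"]:s["end"]] for s in segments]
```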
Adjusting Chunk Size by Audio Type
Podcasts & Monologues
- Larger chunks (25β30s)
- Minimal overlap
- High accuracy focus
Conversations & Calls
- Medium chunks (10β15s)
- VAD-based splitting
- Speaker-aware merging
Noisy Audio
- Smaller chunks (8β12s)
- More overlap
- Helps reduce error propagation
Prompting Between Chunks
Whisper does not keep memory across chunks.
To improve continuity:
```python
result = model.transcribe(
    chunk,
    initial_prompt=previous_text,
)
```
This simulates context carry-over and improves coherence.
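Putting it together, a minimal sketch of the carry-over loop, assuming `chunks` is a list of 16 kHz float32 audio arrays produced by one of the splitting strategies above:

```python
import whisper

model = whisper.load_model("base")

previous_text = ""
transcript_parts = []

for chunk in chunks:
    result = model.transcribe(chunk, initial_prompt=previous_text)
    transcript_parts.append(result["text"])
    # Feed this chunk's output forward as context for the next one
    previous_text = result["text"]

transcript = " ".join(transcript_parts)
```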
Performance & Cost Considerations
| Chunk Size | Accuracy | Latency | Cost |
|---|---|---|---|
| 2–5s | Medium | Very Low | High |
| 10–15s | High | Medium | Medium |
| 20–30s | Very High | Higher | Low |
💡 Larger chunks = fewer API calls and better cost efficiency.
Common Chunk Size Mistakes
❌ Avoid:
- Using maximum size everywhere
- No overlap between chunks
- Same chunk size for all audio types
- Ignoring silence detection
✅ Best practices:
- Tune chunk size per use case
- Always use overlap
- Test and measure WER (see the sketch below)
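For the WER check, a minimal sketch using the jiwer library; the reference and hypothesis strings are placeholders for a ground-truth transcript and your pipeline's output:

```python
import jiwer

reference = "choosing the right chunk size matters"   # ground-truth transcript
hypothesis = "choosing the right chunk size matter"   # pipeline output

# WER = (substitutions + deletions + insertions) / words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```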
Real-World Production Recommendation
For most speech-to-text platforms:
- Live preview → 3–5s chunks
- Final transcript → 20–30s chunks
- VAD + overlap everywhere
This hybrid approach balances:
- User experience
- Accuracy
- Cost
Final Thoughts
There is no universal "best" Whisper chunk size.
The optimal setup depends on:
- Audio length
- Latency requirements
- Accuracy expectations
- Infrastructure cost
By following these best practices, you can significantly improve transcription quality while keeping your system efficient and scalable.
If you want a production-ready solution that already applies these optimizations, tools like SayToWords handle chunk size, overlap, and post-processing automatically.
FAQ
Q: What is the maximum chunk size for Whisper?
A: About 30 seconds per inference.
Q: Is overlap really necessary?
A: Yes. Overlap prevents missing words at chunk boundaries.
Q: Should I use the same chunk size for streaming and batch?
A: No. Streaming favors small chunks; batch favors larger chunks.
