
Whisper for Call Transcription: Accurate Speech-to-Text for Phone Calls
Eric King
Author
Phone call transcription is one of the most common and highest-value use cases for speech-to-text. OpenAI Whisper is well suited to it because it is robust to noise, accents, and imperfect audio quality.
This article explains how to use Whisper for call transcription, including audio formats, speaker separation, accuracy optimization, and real-world deployment patterns.
Why Whisper for Call Transcription?
Compared to traditional ASR engines, Whisper performs well on:
- Low-quality phone audio (8kHz)
- Accents and non-native speakers
- Background noise
- Long conversations (10–120 minutes)
- Multilingual calls and code-switching
Typical use cases:
- Customer support call logs
- Sales call analysis
- QA & compliance
- Call summarization and insights
- CRM automation
Typical Call Transcription Pipeline
Call (PSTN / VoIP)
↓
Call Recording (WAV / MP3)
↓
Preprocessing (resample, channel split)
↓
Whisper Transcription
↓
Speaker Diarization (optional)
↓
Post-processing (punctuation, timestamps, summaries)
Audio Formats: What Works Best
Recommended Settings
| Parameter | Value |
|---|---|
| Sample rate | 8kHz or 16kHz |
| Channels | Mono or Stereo |
| Format | WAV (preferred), FLAC |
| Bit depth | 16-bit PCM |
Whisper resamples audio to 16 kHz mono when it loads a file, so a stereo recording is mixed down unless you split the channels first. Clean input still improves accuracy.
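Before a recording enters the pipeline, it can be checked against the settings in the table. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_call_wav` is an illustrative assumption, and the check only covers uncompressed WAV files.

```python
import wave


def check_call_wav(path):
    """Return a list of mismatches against the recommended settings table."""
    issues = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()  # sample width in bytes; 2 bytes = 16-bit
    if rate not in (8000, 16000):
        issues.append(f"sample rate is {rate} Hz, expected 8000 or 16000")
    if channels not in (1, 2):
        issues.append(f"{channels} channels, expected mono or stereo")
    if width != 2:
        issues.append(f"bit depth is {8 * width}-bit, expected 16-bit PCM")
    return issues
```

An empty list means the file matches the table. Anything else is a candidate for conversion before transcription.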
Mono vs Stereo Calls
Mono (Most Common)
- Both speakers mixed into one channel
- Easier pipeline
- Harder to separate speakers
Best for:
- Simple transcription
- Search and archiving
Stereo (Best Practice)
- Agent on left channel
- Customer on right channel
Advantages:
- Clear speaker separation
- No diarization needed
- Higher downstream accuracy
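# Split a stereo call into two mono tracks
import torchaudio
audio, sr = torchaudio.load("call.wav")  # audio has shape (channels, samples)
# Whisper assumes 16 kHz when it is given raw samples instead of a file path
audio = torchaudio.functional.resample(audio, sr, 16000)
agent = audio[0]     # left channel
customer = audio[1]  # right channel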
Then transcribe each channel separately.
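Once each channel has its own transcript, the two can be interleaved into a single conversation. The sketch below assumes each channel was transcribed with `model.transcribe(...)` and that you pass in the `"segments"` lists from the results, where every segment carries `start`, `end`, and `text`. The function name and the "Agent" / "Customer" labels are illustrative.

```python
def merge_channel_segments(agent_segments, customer_segments):
    """Interleave two per-channel segment lists into one speaker-labeled transcript."""
    # Copy each segment so the original Whisper results are left untouched.
    labeled = [dict(seg, speaker="Agent") for seg in agent_segments]
    labeled += [dict(seg, speaker="Customer") for seg in customer_segments]
    # Order by start time; ties are broken by end time.
    return sorted(labeled, key=lambda seg: (seg["start"], seg["end"]))
```

Because both channels share the same clock, sorting by start time is enough to recover the turn order. Overlapping speech simply appears as adjacent lines.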
Speaker Diarization with Whisper
Whisper does not natively support diarization, but you can combine it with:
- pyannote.audio
- WebRTC VAD + clustering
- Channel-based separation (preferred)
Typical approach:
- Run diarization model
- Split audio by speaker segments
- Transcribe each segment with Whisper
- Merge results with speaker labels
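The four steps above can be sketched as a small function. It assumes the diarization model has already produced speaker turns as `(start_sec, end_sec, speaker)` tuples, and it takes the transcription step as a callable, `transcribe_fn`, so the sketch does not depend on any particular library. Both names are hypothetical.

```python
def transcribe_by_turns(samples, sample_rate, turns, transcribe_fn):
    """Transcribe each diarized turn separately and label it with its speaker.

    samples: 1-D sequence of audio samples (list or NumPy array).
    turns: iterable of (start_sec, end_sec, speaker) from a diarization model.
    transcribe_fn: hypothetical callable mapping a slice of samples to text.
    """
    labeled = []
    for start, end, speaker in sorted(turns):
        first = int(start * sample_rate)
        last = int(end * sample_rate)
        text = transcribe_fn(samples[first:last]).strip()
        if text:  # skip turns where nothing was recognized
            labeled.append(
                {"start": start, "end": end, "speaker": speaker, "text": text}
            )
    return labeled
```

With the openai-whisper package, `transcribe_fn` could be something like `lambda chunk: model.transcribe(chunk)["text"]`, provided `samples` is a float32 NumPy array at 16 kHz. Very short turns give Whisper little context, so merging adjacent turns from the same speaker before transcribing is often worthwhile.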
Best Whisper Models for Calls
| Model | Accuracy | Speed | Recommended |
|---|---|---|---|
| base | Medium | Fast | Short calls |
| small | High | Medium | Most cases |
| medium | Very High | Slower | Compliance |
| large-v3 | Excellent | Slow | Legal / QA |
Recommended: small or medium for call centers
Handling Long Calls (30–120 Minutes)
For long calls, avoid feeding the full recording to Whisper in a single pass.
Best Practice
- Chunk audio into 2–5 minute segments
- Use small overlaps (5–10 seconds)
- Preserve timestamps
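result = model.transcribe(
    audio_chunk,
    condition_on_previous_text=True,  # carries context between 30-second windows inside this chunk
    initial_prompt=previous_text,  # end of the previous chunk's transcript; None for the first chunk
)
Within a single call, condition_on_previous_text only links Whisper's internal 30-second windows. Passing the end of the previous chunk's transcript as initial_prompt (here the illustrative variable previous_text) is what preserves context across chunks.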
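The chunking practice above comes down to two small calculations: where each chunk starts and ends, and how to map chunk-relative timestamps back onto the full call. The following is a minimal sketch; the function names and the default values (3-minute chunks, 5-second overlap) are assumptions chosen from the ranges listed above.

```python
def chunk_bounds(total_seconds, chunk_seconds=180.0, overlap_seconds=5.0):
    """Return (start, end) times in seconds for overlapping chunks covering the call."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds  # step back so consecutive chunks overlap
    return bounds


def shift_segments(segments, chunk_start):
    """Convert chunk-relative segment times to call-relative times."""
    return [
        dict(seg, start=seg["start"] + chunk_start, end=seg["end"] + chunk_start)
        for seg in segments
    ]
```

Audio inside an overlap is transcribed twice, so the merged transcript still needs a de-duplication step, for example dropping segments from the later chunk that start before the previous chunk ended.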
Improving Accuracy for Phone Calls
1. Normalize Audio
- Remove silence
- Normalize volume
- Apply noise reduction if needed
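The first two steps can be illustrated in a few lines. This is a deliberately simple sketch in plain Python, assuming samples are floats in the range [-1, 1]. The function names, the 0.9 target peak, and the 0.01 silence threshold are illustrative choices, and noise reduction is not covered here.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:  # empty or fully silent input: nothing to scale
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return list(samples[loud[0]:loud[-1] + 1])
```

Only silence at the start and end is removed. Pauses inside the call are kept so that timestamps stay aligned with the original recording. For full-length calls, the same logic would normally be written with NumPy arrays for speed.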
2. Use Language Hints
model.transcribe(audio, language="en")
3. Enable FP16 on GPU
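FP16 gives faster inference with lower GPU memory use. In the openai-whisper package it is already the default (fp16=True), and on CPU Whisper falls back to FP32 with a warning.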
4. Avoid Over-Chunking
Chunks that are too small reduce context and accuracy.
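The language hint and the FP16 setting can be combined into one call. The sketch below assumes the openai-whisper and PyTorch packages are installed. The helper names `transcribe_kwargs` and `transcribe_call` are hypothetical, and the imports sit inside the function so that the option-building helper can be used on its own.

```python
def transcribe_kwargs(language=None, cuda_available=False):
    """Build keyword arguments for model.transcribe from the tips above."""
    kwargs = {"fp16": cuda_available}  # FP16 only when a GPU is present
    if language is not None:
        kwargs["language"] = language  # skip auto-detection when the language is known
    return kwargs


def transcribe_call(path, model_name="small", language="en"):
    """Transcribe one recording with a language hint and GPU-aware precision."""
    import torch
    import whisper

    use_gpu = torch.cuda.is_available()
    model = whisper.load_model(model_name, device="cuda" if use_gpu else "cpu")
    return model.transcribe(path, **transcribe_kwargs(language, use_gpu))
```

In a real service the model would be loaded once and reused across calls rather than reloaded per recording. It is loaded inside the function here only to keep the example short.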
Real-Time vs Batch Call Transcription
| Mode | Use Case |
|---|---|
| Real-time | Live monitoring, alerts |
| Near real-time | QA dashboards |
| Batch | Analytics, archiving |
Most call centers use near real-time or batch for stability and cost efficiency.
Scaling Whisper for Call Centers
Small Scale (≤100 calls/day)
- Single GPU server
- Whisper small
Medium Scale (1k–10k calls/day)
- GPU pool
- Async job queue (RabbitMQ / Kafka)
- Chunk-based processing
Large Scale (Enterprise)
- Multiple GPU nodes
- Audio pre-processing service
- Transcription + summarization pipelines
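The job-queue pattern behind the medium and large tiers can be illustrated in-process with Python's standard library. This sketch is a stand-in for a real broker such as RabbitMQ or Kafka, not a replacement for one. `run_workers` and `transcribe_fn` are hypothetical names, and in production each worker would typically own one GPU-backed model.

```python
import queue
import threading


def run_workers(call_paths, transcribe_fn, num_workers=4):
    """Transcribe calls concurrently; returns {path: transcript or exception}."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            path = jobs.get()
            if path is None:  # sentinel: no more work for this worker
                break
            try:
                outcome = transcribe_fn(path)
            except Exception as exc:  # one bad recording must not stop the pool
                outcome = exc
            with lock:
                results[path] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for path in call_paths:
        jobs.put(path)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Failures are stored alongside successes so that a retry step can pick out the recordings whose result is an exception.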
Post-Processing & Value Extraction
After transcription, common steps include:
- Sentence punctuation
- Speaker tagging
- Keyword extraction
- Sentiment analysis
- Call summaries (LLMs)
- CRM integration
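Two of these steps, speaker tagging and keyword extraction, need nothing beyond the standard library. The sketch below assumes segments shaped like Whisper's output (`start`, `text`) with an optional `speaker` key added by an earlier step. All function names are illustrative, and the keyword matcher is a simple whole-word count for single-word keywords rather than a full NLP pipeline.

```python
import re
from collections import Counter


def format_timestamp(seconds):
    """Render seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_transcript(segments):
    """Render segments as '[HH:MM:SS] Speaker: text' lines."""
    lines = []
    for seg in segments:
        speaker = seg.get("speaker", "Speaker")
        stamp = format_timestamp(seg["start"])
        lines.append(f"[{stamp}] {speaker}: {seg['text'].strip()}")
    return "\n".join(lines)


def count_keywords(segments, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    words = Counter()
    for seg in segments:
        words.update(re.findall(r"[a-z']+", seg["text"].lower()))
    return {kw: words[kw.lower()] for kw in keywords}
```

The formatted transcript is a convenient input for an LLM summarization prompt, and the keyword counts can feed QA dashboards or CRM fields.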
Whisper vs Cloud Call Transcription APIs
| Feature | Whisper | Cloud APIs |
|---|---|---|
| Cost | Low (self-hosted) | High |
| Data privacy | Full control | Vendor-controlled |
| Accuracy | Very high | High |
| Customization | Full | Limited |
Whisper is ideal for teams that need privacy, cost control, and customization.
Conclusion
Whisper is a powerful choice for call transcription, especially for:
- Customer support
- Sales and QA
- Compliance-heavy industries
With proper audio handling, chunking, and optional diarization, Whisper can deliver production-grade call transcription at scale.
