
Whisper for Meetings: Accurate Transcription for Business Meetings
Eric King
Meeting transcription is one of the most valuable applications of speech-to-text technology. OpenAI Whisper excels at transcribing business meetings thanks to its ability to handle multiple speakers, background noise, accents, and long-form conversations.
This article explains how to use Whisper for meeting transcription, including audio preprocessing, speaker separation, accuracy optimization, and real-world deployment patterns for various meeting platforms.
Why Whisper for Meeting Transcription?
Compared to traditional ASR engines, Whisper performs exceptionally well on:
- Multiple speakers with varying voice characteristics
- Background noise from video calls and office environments
- Accents and non-native speakers in global teams
- Long meetings (30 minutes to several hours)
- Overlapping speech and interruptions
- Multilingual meetings and code-switching
- Variable audio quality from different devices and connections
Typical use cases:
- Corporate meeting minutes and documentation
- Team standups and retrospectives
- Client meetings and consultations
- Training sessions and webinars
- Board meetings and compliance records
- Interview transcription
- Knowledge base creation from recorded meetings
Typical Meeting Transcription Pipeline
Meeting Recording (Zoom / Teams / Local)
  ↓
Audio Extraction (WAV / MP3 / M4A)
  ↓
Preprocessing (normalize, denoise, resample)
  ↓
Speaker Diarization (optional but recommended)
  ↓
Whisper Transcription (chunked for long meetings)
  ↓
Post-processing (punctuation, speaker labels, timestamps)
  ↓
Formatting (minutes, summaries, searchable text)
Audio Formats: What Works Best for Meetings
Recommended Settings
| Parameter | Value | Notes |
|---|---|---|
| Sample rate | 16kHz or 48kHz | 16 kHz matches Whisper's internal rate; higher rates are fine |
| Channels | Mono or Stereo | Mono is fine for most cases |
| Format | WAV (preferred), FLAC, MP3 | Lossless preferred |
| Bit depth | 16-bit or 24-bit PCM | 16-bit is sufficient |
Important: Whisper automatically resamples internally, but clean, high-quality input significantly improves accuracy.
Handling Different Meeting Platforms
Zoom Recordings
Zoom typically exports audio as:
- MP4 (video) or M4A (audio-only)
- 48kHz sample rate (good quality)
- Stereo or mono depending on settings
Best practice:
```python
# Extract audio from a Zoom recording as 16 kHz mono WAV
import ffmpeg

def extract_audio_from_zoom(zoom_file, output_wav):
    stream = ffmpeg.input(zoom_file)
    stream = ffmpeg.output(
        stream,
        output_wav,
        acodec='pcm_s16le',  # 16-bit PCM
        ac=1,                # mono
        ar=16000             # 16 kHz
    )
    ffmpeg.run(stream, overwrite_output=True)
```
Microsoft Teams Recordings
Teams recordings are typically:
- MP4 format
- 48kHz audio
- May include multiple audio tracks
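If a Teams export does bundle several audio tracks, you can probe the file and extract a specific track before transcription. A minimal sketch with ffmpeg-python (file name and stream index are placeholders):

```python
import ffmpeg

# Inspect the recording for audio streams
info = ffmpeg.probe("teams_meeting.mp4")
audio_streams = [s for s in info["streams"] if s["codec_type"] == "audio"]
print(f"Found {len(audio_streams)} audio stream(s)")

# Extract the first audio stream (-map 0:a:0) as 16 kHz mono WAV
ffmpeg.input("teams_meeting.mp4").output(
    "teams_audio.wav",
    map="0:a:0",          # pick a different index for another track
    acodec="pcm_s16le",
    ac=1,
    ar=16000
).run(overwrite_output=True)
```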
Google Meet Recordings
- Usually MP4 or WebM
- Variable quality depending on connection
- May need audio extraction
Local Recordings
If recording locally:
- Use WAV format at 16kHz or higher
- Ensure good microphone placement
- Minimize background noise
Speaker Diarization for Meetings
One of the biggest challenges in meeting transcription is identifying who said what. Whisper does not natively support speaker diarization, but you can combine it with specialized tools.
Why Diarization Matters
- Meeting minutes require speaker attribution
- Action items need to be assigned to speakers
- Search and analysis by participant
- Compliance and record-keeping
Diarization Approaches
1. Pyannote.audio (Recommended)
```python
from pyannote.audio import Pipeline

# Load the diarization pipeline (requires a Hugging Face access token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_TOKEN"
)

# Run diarization
diarization = pipeline(audio_file)

# Print speaker segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"Speaker {speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```
Advantages:
- High accuracy
- Handles multiple speakers well
- Works with Whisper seamlessly
2. Channel-Based Separation
If your meeting recording has separate audio tracks per participant (rare but ideal):
```python
import torchaudio

audio, sr = torchaudio.load("meeting.wav")

# Resample to the 16 kHz rate Whisper expects for raw arrays
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Assuming stereo with a different speaker per channel
speaker1 = audio[0].numpy()
speaker2 = audio[1].numpy()

# Transcribe each channel separately
result1 = model.transcribe(speaker1)
result2 = model.transcribe(speaker2)
```
3. Simple VAD + Clustering
For basic scenarios with 2-3 speakers:
```python
# 1. Use Voice Activity Detection (VAD) to find speech segments
# 2. Cluster segments by acoustic similarity (e.g., speaker embeddings)
# 3. Assign speaker labels to the clusters
```
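As a concrete starting point, here is a minimal energy-based VAD sketch in plain NumPy. It finds speech spans only; the clustering step needs an embedding model, and the threshold is illustrative and must be tuned to your recordings:

```python
import numpy as np

def detect_speech_segments(audio, sr=16000, frame_ms=30, threshold=0.02):
    """Very simple energy-based VAD: returns (start_s, end_s) speech spans."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    # RMS energy per frame
    rms = np.array([
        np.sqrt(np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    voiced = rms > threshold
    segments, start = [], None
    for i, is_speech in enumerate(voiced):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```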
Combining Diarization with Whisper
Typical workflow:
- Run diarization to get speaker segments
- Split audio by speaker segments
- Transcribe each segment with Whisper
- Merge results with speaker labels and timestamps
```python
def transcribe_meeting_with_diarization(audio_path, model):
    # Step 1: Diarization
    diarization = pipeline(audio_path)

    # Step 2: Transcribe each speaker segment
    transcripts = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # Extract segment audio (helper sketched below)
        segment_audio = extract_segment(audio_path, turn.start, turn.end)

        # Transcribe with Whisper
        result = model.transcribe(segment_audio)

        # Add speaker label and timestamps
        transcripts.append({
            "speaker": speaker,
            "start": turn.start,
            "end": turn.end,
            "text": result["text"]
        })

    return transcripts
```
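The extract_segment helper above is not a Whisper or pyannote function; you supply it. A minimal sketch using torchaudio that returns the 16 kHz mono float array Whisper expects:

```python
import torchaudio

def extract_segment(audio_path, start, end, target_sr=16000):
    """Load [start, end] seconds of audio as a 16 kHz mono float32 array."""
    audio, sr = torchaudio.load(audio_path)
    segment = audio[:, int(start * sr):int(end * sr)]
    # Downmix to mono and resample to Whisper's expected rate
    segment = segment.mean(dim=0, keepdim=True)
    if sr != target_sr:
        segment = torchaudio.functional.resample(segment, sr, target_sr)
    return segment.squeeze(0).numpy()
```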
Best Whisper Models for Meetings
| Model | Accuracy | Speed | VRAM | Recommended For |
|---|---|---|---|---|
| base | Medium | Fast | ~1GB | Quick drafts |
| small | High | Medium | ~2GB | ✓ Most meetings |
| medium | Very High | Slower | ~5GB | ✓ Important meetings |
| large-v3 | Excellent | Slow | ~10GB | ✓ Critical/legal meetings |
Recommended:
- small for regular team meetings
- medium for client meetings and important discussions
- large-v3 for board meetings and compliance-critical recordings
Handling Long Meetings (30+ Minutes)
Long meetings require careful chunking to maintain accuracy and manage memory.
Best Practice: Smart Chunking
- Chunk size: 30β60 seconds
- Overlap: 5β10 seconds between chunks
- Preserve context across chunks
```python
import whisper

def transcribe_long_meeting(audio_path, model, chunk_length=60, overlap=5):
    sample_rate = 16000  # whisper.load_audio always returns 16 kHz audio
    audio = whisper.load_audio(audio_path)

    # Split into chunks with overlap (convert seconds to samples)
    chunk_samples = chunk_length * sample_rate
    overlap_samples = overlap * sample_rate
    chunks = []
    start = 0
    while start < len(audio):
        end = min(start + chunk_samples, len(audio))
        chunks.append((start, end))
        if end == len(audio):
            break  # final chunk reached; avoid looping on the tail
        start = end - overlap_samples  # overlap for context

    # Transcribe each chunk, feeding the previous text as context
    results = []
    previous_text = ""
    for start_sample, end_sample in chunks:
        chunk_audio = audio[start_sample:end_sample]
        result = model.transcribe(
            chunk_audio,
            condition_on_previous_text=True,
            initial_prompt=previous_text[-200:] if previous_text else None
        )
        results.append({
            "start": start_sample / sample_rate,  # seconds
            "end": end_sample / sample_rate,
            "text": result["text"]
        })
        previous_text = result["text"]

    return merge_transcripts(results)
```
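merge_transcripts is likewise left to you. A simple heuristic sketch that concatenates chunks while dropping words duplicated by the overlap (good enough for drafts; exact deduplication is harder):

```python
def merge_transcripts(results, max_overlap_words=15):
    """Concatenate chunk transcripts, trimming text repeated at boundaries."""
    merged = ""
    for chunk in results:
        words = chunk["text"].split()
        # Drop the longest prefix of this chunk that already ends the merged text
        for k in range(min(max_overlap_words, len(words)), 0, -1):
            if merged.endswith(" ".join(words[:k])):
                words = words[k:]
                break
        merged = (merged + " " + " ".join(words)).strip()
    return merged
```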
Why Overlap Matters
Overlap ensures that:
- Words at chunk boundaries aren't lost
- Context is preserved between segments
- Speaker transitions are captured correctly
Improving Accuracy for Meetings
1. Audio Preprocessing
Normalize audio:
```python
import numpy as np
from scipy.io import wavfile

def normalize_audio(audio_path, output_path):
    sr, audio = wavfile.read(audio_path)
    # Normalize to [-1, 1] (guard against all-silence input)
    audio = audio.astype(np.float32)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    # Optional: trim silence and apply noise reduction here
    wavfile.write(output_path, sr, audio)
```
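For the optional noise-reduction step, the noisereduce package (an assumption here, not a Whisper dependency) applies spectral gating in a couple of lines:

```python
import noisereduce as nr
import numpy as np
from scipy.io import wavfile

sr, audio = wavfile.read("meeting.wav")
audio = audio.astype(np.float32)

# Spectral-gating denoising; the noise profile is estimated from the signal itself
denoised = nr.reduce_noise(y=audio, sr=sr)
wavfile.write("meeting_denoised.wav", sr, denoised)
```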
2. Use Meeting-Specific Context
Provide context about the meeting topic:
```python
context_prompt = """
This is a business meeting about Q4 product planning.
Participants include: Sarah (Product Manager), John (Engineer), Lisa (Designer).
Topics discussed: feature roadmap, technical constraints, user research.
"""

result = model.transcribe(
    audio,
    initial_prompt=context_prompt,
    language="en"
)
```
3. Handle Technical Terms
For meetings with domain-specific terminology:
```python
# Seed domain vocabulary via the initial prompt (Whisper has no formal phrase boosting)
context = "This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
```
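Since there is no phrase-boosting API, a post-processing pass with a corrections dictionary is a common complement. A sketch with hypothetical patterns:

```python
import re

# Hypothetical domain dictionary: common misrecognitions -> canonical terms
CORRECTIONS = {
    r"\bkuber netes\b": "Kubernetes",
    r"\bC I C D\b": "CI/CD",
    r"\bmicro services\b": "microservices",
}

def fix_terminology(text):
    for pattern, replacement in CORRECTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```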
4. Enable Word Timestamps
Essential for meeting minutes and search:
```python
result = model.transcribe(
    audio,
    word_timestamps=True  # get word-level timestamps
)
```
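With word_timestamps=True, each segment in the result carries a words list with per-word timing; a short usage sketch:

```python
# Iterate word-level timings from the transcription result
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['start']:7.2f}s  {word['word']}")
```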
Real-Time vs Batch Meeting Transcription
Real-Time Transcription
Use cases:
- Live meeting captions
- Accessibility during meetings
- Real-time note-taking
Challenges:
- Lower accuracy (no full context)
- Higher latency requirements
- More complex implementation
Implementation:
```python
# 1. Stream audio in small chunks (1-5 seconds)
# 2. Transcribe incrementally
# 3. Update the display in real time
```
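A minimal polling-style sketch, assuming the sounddevice package for microphone capture (a production streaming system would add VAD and a rolling buffer instead of fixed windows):

```python
import sounddevice as sd
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000
WINDOW_SECONDS = 5

# Capture a short window of microphone audio and transcribe it repeatedly
while True:
    audio = sd.rec(int(WINDOW_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    result = model.transcribe(audio.flatten(), fp16=False)
    print(result["text"])
```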
Batch Transcription (Recommended)
Use cases:
- Meeting minutes and documentation
- Post-meeting analysis
- Knowledge base creation
Advantages:
- Higher accuracy (full context)
- Better speaker diarization
- More cost-effective
- Easier to implement
Typical workflow:
- Record meeting
- Process after meeting ends
- Generate transcript and summary
- Distribute to participants
Post-Processing Meeting Transcripts
After transcription, enhance the output for usability:
1. Format as Meeting Minutes
```python
def format_meeting_minutes(transcript, speakers, metadata):
    minutes = f"""# Meeting Minutes

**Date:** {metadata['date']}
**Participants:** {', '.join(speakers)}
**Duration:** {metadata['duration']}

## Transcript

"""
    for segment in transcript:
        minutes += f"**[{segment['speaker']}]** ({segment['start']:.0f}s): {segment['text']}\n\n"
    return minutes
```
2. Extract Action Items
```python
# Use an LLM or pattern matching to extract:
# - Action items
# - Decisions made
# - Next steps
# - Questions raised
```
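A naive pattern-matching sketch (the regexes are illustrative; an LLM pass is usually more robust):

```python
import re

# Simple cue phrases that often introduce commitments
ACTION_PATTERNS = [
    r"(?:I|we|you|\w+) (?:will|should|need to|must) (.+?)(?:\.|$)",
    r"action item[:\s]+(.+?)(?:\.|$)",
]

def extract_action_items(text):
    items = []
    for pattern in ACTION_PATTERNS:
        items.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return items
```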
3. Generate Summaries
```python
# Use an LLM (GPT-4, Claude, etc.) to summarize:
# - Key discussion points
# - Decisions and outcomes
# - Action items and owners
```
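A sketch using the OpenAI Python client (the model name and prompt are placeholders; any capable LLM works the same way):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_transcript(transcript_text):
    response = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever model you use
        messages=[
            {"role": "system",
             "content": "Summarize this meeting: key points, decisions, and action items with owners."},
            {"role": "user", "content": transcript_text},
        ],
    )
    return response.choices[0].message.content
```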
4. Create Searchable Index
```python
# Index the transcript for search
# Tag by speaker, topic, timestamp
# Enable full-text search
```
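SQLite's FTS5 extension is enough for a simple searchable index; a sketch assuming transcript segments carry speaker and timestamp fields:

```python
import sqlite3

conn = sqlite3.connect("meetings.db")
# FTS5 virtual table enables full-text search over transcript segments
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS segments
    USING fts5(meeting_id, speaker, start_time, text)
""")

def index_segment(meeting_id, speaker, start, text):
    conn.execute("INSERT INTO segments VALUES (?, ?, ?, ?)",
                 (meeting_id, speaker, str(start), text))
    conn.commit()

def search(query):
    return conn.execute(
        "SELECT meeting_id, speaker, start_time, text FROM segments WHERE segments MATCH ?",
        (query,)).fetchall()
```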
Integration with Meeting Platforms
Zoom Integration
```python
# After a Zoom meeting ends:
# 1. Download the recording via the Zoom API
# 2. Extract audio
# 3. Transcribe with Whisper
# 4. Upload the transcript back to Zoom or share via email
```
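A sketch of the download step against Zoom's v2 REST API (assumes an OAuth access token with recording scopes; verify the endpoint and field names against Zoom's documentation):

```python
import requests

def download_zoom_audio(meeting_id, access_token, out_path="zoom_audio.m4a"):
    # List recording files for a meeting
    resp = requests.get(
        f"https://api.zoom.us/v2/meetings/{meeting_id}/recordings",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    resp.raise_for_status()
    for f in resp.json().get("recording_files", []):
        if f.get("file_type") == "M4A":  # audio-only track
            audio = requests.get(f["download_url"],
                                 headers={"Authorization": f"Bearer {access_token}"})
            with open(out_path, "wb") as fh:
                fh.write(audio.content)
            return out_path
    return None
```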
Microsoft Teams Integration
```python
# Use the Microsoft Graph API to:
# 1. Access Teams meeting recordings
# 2. Download audio files
# 3. Process with Whisper
# 4. Store transcripts in SharePoint or OneDrive
```
Google Meet Integration
```python
# Use the Google Drive API to:
# 1. Access Meet recordings
# 2. Download and process
# 3. Store transcripts in Drive
```
Custom Integration
For custom meeting platforms:
```python
# Webhook-based workflow:
# 1. Meeting platform sends a recording URL
# 2. Download and transcribe
# 3. Send the transcript back via webhook
# 4. Update the meeting platform UI
```
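A minimal webhook receiver sketch with Flask (the endpoint name and payload shape are assumptions; a real deployment would queue the job rather than transcribe inline):

```python
from flask import Flask, request, jsonify
import requests
import whisper

app = Flask(__name__)
model = whisper.load_model("small")

@app.route("/transcribe", methods=["POST"])
def transcribe_webhook():
    # The meeting platform POSTs JSON containing a recording URL
    recording_url = request.json["recording_url"]
    audio = requests.get(recording_url)
    with open("recording.m4a", "wb") as f:
        f.write(audio.content)
    result = model.transcribe("recording.m4a")
    return jsonify({"transcript": result["text"]})
```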
Scaling Whisper for Enterprise Meetings
Small Scale (β€50 meetings/day)
- Single GPU server
- Whisper small or medium
- Simple queue system
Medium Scale (100β1000 meetings/day)
- GPU pool (2β4 GPUs)
- Async job queue (RabbitMQ, Redis)
- Chunk-based processing
- Load balancing
Large Scale (Enterprise)
- Multiple GPU nodes
- Distributed processing (Kubernetes)
- Audio preprocessing service
- Transcription + summarization pipelines
- Caching for repeated content
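At any of these scales, the core pattern is a shared job queue with one worker process per GPU. A bare-bones sketch using Redis as the queue (key names are placeholders):

```python
import json
import redis
import whisper

r = redis.Redis()
model = whisper.load_model("small")

# Worker loop: each GPU runs one of these, all popping from a shared queue
while True:
    _, raw = r.brpop("transcription_jobs")  # blocks until a job arrives
    job = json.loads(raw)
    result = model.transcribe(job["audio_path"])
    r.set(f"transcript:{job['id']}", result["text"])
```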
Common Challenges and Solutions
Challenge 1: Overlapping Speech
Problem: Multiple people talking at once
Solutions:
- Use better diarization models
- Post-process to identify overlaps
- Mark overlapping segments in transcript
Challenge 2: Background Noise
Problem: Office noise, typing, echoes
Solutions:
- Audio preprocessing (noise reduction)
- Use Whisper medium/large (better noise handling)
- Encourage better recording practices
Challenge 3: Accents and Non-Native Speakers
Problem: Lower accuracy for accented speech
Solutions:
- Use larger Whisper models
- Provide context about participants
- Fine-tune on accent-specific data (if needed)
Challenge 4: Technical Terminology
Problem: Domain-specific terms misrecognized
Solutions:
- Use initial prompts with terminology
- Post-process with custom dictionaries
- Fine-tune Whisper on domain data
Whisper vs Cloud Meeting Transcription Services
| Feature | Whisper (Self-hosted) | Cloud Services (Otter, Rev) |
|---|---|---|
| Cost | Low (one-time GPU cost) | High (per-minute pricing) |
| Data Privacy | Full control | Vendor-controlled |
| Accuracy | Very high | High |
| Customization | Full control | Limited |
| Speaker Diarization | Requires integration | Built-in |
| Integration | Custom | Pre-built connectors |
Whisper is ideal for:
- Organizations with privacy requirements
- High-volume meeting transcription
- Custom integration needs
- Cost-sensitive deployments
Best Practices Summary
- Use appropriate model size (small for most, medium for important)
- Enable speaker diarization for multi-speaker meetings
- Chunk long meetings (30β60s segments with overlap)
- Preprocess audio (normalize, denoise if needed)
- Provide context (participants, topics, terminology)
- Enable word timestamps for searchability
- Post-process transcripts (format, summarize, extract actions)
- Test with your meeting types before full deployment
Conclusion
Whisper is an excellent choice for meeting transcription, offering:
- High accuracy across diverse speakers and conditions
- Cost-effectiveness for high-volume use
- Full control over data and processing
- Flexibility for custom integrations
With proper audio handling, speaker diarization, and chunking strategies, Whisper can deliver production-grade meeting transcription that rivals or exceeds commercial services.
Whether you're transcribing team standups, client meetings, or board sessions, Whisper provides the accuracy and control needed for professional meeting documentation.
For production-ready meeting transcription with Whisper, consider platforms like SayToWords that provide scalable, enterprise-grade transcription services built on Whisper technology.
