
Whisper Python Example: Complete Guide to Speech-to-Text Transcription
Eric King
OpenAI Whisper is one of the most powerful open-source speech recognition models available today. In this comprehensive guide, you'll learn how to use Whisper with Python to transcribe audio files into text with high accuracy.
This tutorial is perfect for:
- Developers building speech-to-text features
- Data scientists working with audio data
- Anyone looking for a complete Whisper Python example
What Is OpenAI Whisper?
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio data. It can:
- Transcribe speech in 99+ languages
- Detect language automatically
- Translate speech to English
- Handle noisy audio and accents
- Process long-form audio files
Prerequisites
Before you start, ensure you have:
- Python 3.8+ installed
- pip package manager
- FFmpeg installed (for audio processing)
- (Optional) NVIDIA GPU for faster processing
Step 1: Install Whisper
Install the OpenAI Whisper package using pip:
```bash
pip install openai-whisper
```
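To sanity-check the install, you can import the package and print the model sizes it ships with (available_models() is part of the package's public API):

```bash
python -c "import whisper; print(whisper.available_models())"
```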
Install FFmpeg
macOS (using Homebrew):
```bash
brew install ffmpeg
```
Ubuntu/Debian:
```bash
sudo apt update
sudo apt install ffmpeg
```
Windows:
Download FFmpeg from ffmpeg.org and add it to your PATH.
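On any platform, you can confirm FFmpeg is reachable from your PATH before moving on:

```bash
ffmpeg -version
```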
Step 2: Basic Whisper Python Example
Here's a simple Python script to transcribe an audio file:
```python
import whisper

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print the transcription
print(result["text"])
```
Output:
```
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.
```
Step 3: Complete Python Example with Error Handling
Here's a more robust example with proper error handling:
```python
import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.

    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)

    Returns:
        dict: Transcription result with text and segments, or None on failure
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)

        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)

        return result

    except Exception as e:
        print(f"Error during transcription: {e}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "sample_audio.mp3"
    result = transcribe_audio(audio_file, model_size="base")

    if result:
        print("\nTranscription:")
        print(result["text"])
```
Step 4: Advanced Example with Language Detection
Whisper can automatically detect the language, but you can also specify it:
```python
import whisper

model = whisper.load_model("base")

# Auto-detect language
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Transcription: {result['text']}")

# Specify language explicitly
result_en = model.transcribe("audio.mp3", language="en")
result_zh = model.transcribe("audio.mp3", language="zh")
```
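If you only need the language, not a full transcription, the package also exposes a lower-level detection API (this follows the example in the openai-whisper README):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the model's 30-second window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language without decoding any text
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```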
Step 5: Get Timestamps and Segments
Whisper provides detailed segment information with timestamps:
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\nSegments with Timestamps:")
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```
Output:
```
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

Segments with Timestamps:
[0.00s - 2.50s] Hello everyone, welcome to today's meeting.
[2.50s - 5.80s] We will discuss the project timeline.
```
Step 6: Translate Audio to English
Whisper can translate non-English speech directly to English:
```python
import whisper

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print("Translated text:")
print(result["text"])
```
Step 7: Process Multiple Audio Files
Here's how to transcribe multiple files in batch:
```python
import whisper
import os
from pathlib import Path

def batch_transcribe(audio_directory, model_size="base", output_dir="transcriptions"):
    """
    Transcribe all audio files in a directory.

    Args:
        audio_directory (str): Directory containing audio files
        model_size (str): Whisper model size
        output_dir (str): Directory to save transcriptions
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Load model once
    model = whisper.load_model(model_size)

    # Supported audio formats
    audio_extensions = ['.mp3', '.wav', '.m4a', '.flac', '.ogg']

    # Process each audio file
    audio_files = [
        f for f in os.listdir(audio_directory)
        if any(f.lower().endswith(ext) for ext in audio_extensions)
    ]

    for audio_file in audio_files:
        audio_path = os.path.join(audio_directory, audio_file)
        print(f"\nProcessing: {audio_file}")

        try:
            result = model.transcribe(audio_path)

            # Save transcription to file
            output_file = os.path.join(
                output_dir,
                Path(audio_file).stem + ".txt"
            )
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(result["text"])
            print(f"✓ Saved: {output_file}")

        except Exception as e:
            print(f"✗ Error processing {audio_file}: {e}")

# Example usage
batch_transcribe("audio_files/", model_size="base")
```
Step 8: Export to SRT Subtitle Format
Create SRT subtitle files from transcriptions:
```python
import whisper

def transcribe_to_srt(audio_path, output_path, model_size="base"):
    """
    Transcribe audio and save as SRT subtitle file.

    Args:
        audio_path (str): Path to audio file
        output_path (str): Path to save SRT file
        model_size (str): Whisper model size
    """
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)

    # Generate SRT content
    srt_content = ""
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_timestamp(segment["start"])
        end_time = format_timestamp(segment["end"])
        text = segment["text"].strip()

        srt_content += f"{i}\n"
        srt_content += f"{start_time} --> {end_time}\n"
        srt_content += f"{text}\n\n"

    # Save SRT file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(srt_content)

    print(f"SRT file saved: {output_path}")

def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Example usage
transcribe_to_srt("video.mp4", "subtitles.srt", model_size="base")
```
Whisper Model Sizes Comparison
Choose the right model size based on your needs:
| Model | Parameters | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | ⭐⭐⭐⭐⭐ | ⭐⭐ | ~1GB | Fast testing, simple audio |
| base | 74M | ⭐⭐⭐⭐ | ⭐⭐⭐ | ~1GB | General purpose |
| small | 244M | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~2GB | Balanced performance |
| medium | 769M | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~5GB | High accuracy needed |
| large | 1550M | ⭐ | ⭐⭐⭐⭐⭐ | ~10GB | Best accuracy, noisy audio |
Best Practices for Whisper Python
1. Choose the Right Model Size
```python
# Fast and lightweight
model = whisper.load_model("tiny")    # Good for testing

# Balanced
model = whisper.load_model("base")    # Good for most cases

# High accuracy
model = whisper.load_model("medium")  # For important transcriptions
```
2. Handle Long Audio Files
For very long audio files, consider chunking:
```python
import os

import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=60000):
    """
    Transcribe long audio by splitting into chunks.

    Args:
        audio_path: Path to audio file
        chunk_length_ms: Length of each chunk in milliseconds
    """
    model = whisper.load_model("base")

    # Load audio
    audio = AudioSegment.from_file(audio_path)

    # Split into chunks
    chunks = []
    for i in range(0, len(audio), chunk_length_ms):
        chunks.append(audio[i:i + chunk_length_ms])

    # Transcribe each chunk
    full_text = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")

        result = model.transcribe(chunk_path)
        full_text.append(result["text"])

        # Clean up chunk file
        os.remove(chunk_path)

    return " ".join(full_text)
```
3. Use GPU for Faster Processing
If you have an NVIDIA GPU:
```python
import whisper

# Request the GPU explicitly; load_model also defaults to CUDA when available
model = whisper.load_model("base", device="cuda")
```
4. Specify Language for Better Accuracy
```python
# If you know the language, specify it to skip detection
result = model.transcribe("audio.mp3", language="en")
```
Common Use Cases
Podcast Transcription
```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3")

# Save transcript
with open("podcast_transcript.txt", "w") as f:
    f.write(result["text"])
```
Meeting Notes
```python
import whisper
from datetime import datetime

model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")

# Create formatted meeting notes
notes = f"""
Meeting Notes - {datetime.now().strftime('%Y-%m-%d')}
========================================

{result['text']}
"""

with open("meeting_notes.txt", "w") as f:
    f.write(notes)
```
Video Subtitles
```python
import whisper

def format_vtt_timestamp(seconds):
    """Convert seconds to VTT timestamp format (HH:MM:SS.mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

# Generate VTT subtitle file
vtt_content = "WEBVTT\n\n"
for segment in result["segments"]:
    start = format_vtt_timestamp(segment["start"])
    end = format_vtt_timestamp(segment["end"])
    text = segment["text"].strip()
    vtt_content += f"{start} --> {end}\n{text}\n\n"

with open("subtitles.vtt", "w") as f:
    f.write(vtt_content)
```
Troubleshooting Common Issues
Issue 1: FFmpeg Not Found
Error:
```
FileNotFoundError: ffmpeg
```
Solution:
```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows: download from ffmpeg.org and add it to your PATH
```
Issue 2: Out of Memory
Error:
```
RuntimeError: CUDA out of memory
```
Solution:
```python
# Use a smaller model
model = whisper.load_model("tiny")  # Instead of "large"

# Or use CPU
model = whisper.load_model("base", device="cpu")
```
Issue 3: Slow Processing
Solutions:
- Use a smaller model (tiny or base)
- Enable GPU acceleration
- Process audio in chunks
- Use multiprocessing for batch jobs (see the sketch below)
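As a rough sketch of the multiprocessing idea: each worker process loads its own model copy once (models can't be shared across processes), so this only pays off when you have enough RAM and many files. The file list and the "tiny" model size are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

import whisper

_model = None

def _init_worker(model_size):
    # Load one model per worker process, reused for all of its files
    global _model
    _model = whisper.load_model(model_size)

def transcribe_one(audio_path):
    result = _model.transcribe(audio_path)
    return audio_path, result["text"]

if __name__ == "__main__":
    files = ["a.mp3", "b.mp3", "c.mp3"]  # placeholder file list
    with ProcessPoolExecutor(
        max_workers=2, initializer=_init_worker, initargs=("tiny",)
    ) as pool:
        for path, text in pool.map(transcribe_one, files):
            print(f"{path}: {text[:60]}")
```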
Performance Tips
- Use GPU when available - 10-50x faster than CPU
- Choose appropriate model size - Don't use "large" for simple tasks
- Pre-process audio - Remove silence, normalize volume (example after this list)
- Batch process - Load model once, process multiple files
- Use threading - For I/O-bound operations
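For tip 3, here's a minimal pre-processing sketch using pydub (already a dependency of the chunking example above). The silence thresholds are illustrative starting points, not tuned values:

```python
from pydub import AudioSegment
from pydub.effects import normalize
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("raw_audio.mp3")

# Normalize volume so quiet recordings don't hurt recognition
audio = normalize(audio)

# Drop long silences (thresholds are illustrative; tune for your audio)
chunks = split_on_silence(audio, min_silence_len=1000, silence_thresh=-40)
cleaned = sum(chunks[1:], chunks[0]) if chunks else audio

cleaned.export("cleaned_audio.wav", format="wav")
```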
Whisper Python vs Other Solutions
| Feature | Whisper Python | Google Speech-to-Text | AssemblyAI |
|---|---|---|---|
| Cost | Free (local) | Paid per minute | Paid per minute |
| Offline | ✅ | ❌ | ❌ |
| Accuracy | High | High | High |
| Setup | Medium | Easy | Easy |
| Long audio | ✅ | ✅ | ✅ |
| Multilingual | ✅ | ✅ | ✅ |
Complete Example: Production-Ready Script
Here's a complete, production-ready example:
```python
#!/usr/bin/env python3
"""
Production-ready Whisper transcription script.
"""
import whisper
import argparse
import os
import json
from pathlib import Path

def transcribe_file(
    audio_path,
    model_size="base",
    language=None,
    output_format="txt",
    output_dir=None
):
    """
    Transcribe an audio file with comprehensive output options.

    Args:
        audio_path: Path to audio file
        model_size: Whisper model size
        language: Language code (optional, auto-detected if None)
        output_format: Output format (txt, json, srt, vtt)
        output_dir: Output directory (default: same as audio file)
    """
    # Validate input file
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Set output directory (fall back to "." for bare filenames)
    if output_dir is None:
        output_dir = os.path.dirname(audio_path) or "."
    os.makedirs(output_dir, exist_ok=True)

    # Load model
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)

    # Transcribe
    print(f"Transcribing: {audio_path}")
    transcribe_kwargs = {}
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio_path, **transcribe_kwargs)

    # Generate output filename
    base_name = Path(audio_path).stem
    output_path = os.path.join(output_dir, base_name)

    # Save based on format
    if output_format == "txt":
        with open(f"{output_path}.txt", "w", encoding="utf-8") as f:
            f.write(result["text"])
    elif output_format == "json":
        with open(f"{output_path}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    elif output_format == "srt":
        srt_content = generate_srt(result["segments"])
        with open(f"{output_path}.srt", "w", encoding="utf-8") as f:
            f.write(srt_content)
    elif output_format == "vtt":
        vtt_content = generate_vtt(result["segments"])
        with open(f"{output_path}.vtt", "w", encoding="utf-8") as f:
            f.write(vtt_content)

    print(f"✓ Transcription saved: {output_path}.{output_format}")
    print(f"  Language: {result['language']}")
    print(f"  Duration: {result['segments'][-1]['end']:.2f}s")

    return result

def generate_srt(segments):
    """Generate SRT subtitle content."""
    srt = ""
    for i, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt

def generate_vtt(segments):
    """Generate VTT subtitle content."""
    vtt = "WEBVTT\n\n"
    for segment in segments:
        start = format_vtt_timestamp(segment["start"])
        end = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt += f"{start} --> {end}\n{text}\n\n"
    return vtt

def format_timestamp(seconds):
    """Format timestamp for SRT (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_vtt_timestamp(seconds):
    """Format timestamp for VTT (HH:MM:SS.mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using OpenAI Whisper"
    )
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size"
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language code (e.g., 'en', 'zh', 'es')"
    )
    parser.add_argument(
        "--output-format",
        default="txt",
        choices=["txt", "json", "srt", "vtt"],
        help="Output format"
    )
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Output directory"
    )

    args = parser.parse_args()

    transcribe_file(
        args.audio,
        model_size=args.model,
        language=args.language,
        output_format=args.output_format,
        output_dir=args.output_dir
    )

if __name__ == "__main__":
    main()
```
Usage:
```bash
# Basic usage
python transcribe.py audio.mp3

# With options
python transcribe.py audio.mp3 --model medium --language en --output-format srt

# Save to specific directory
python transcribe.py audio.mp3 --output-dir ./transcriptions
```
Conclusion
This guide has covered everything you need to get started with speech-to-text transcription using OpenAI Whisper. Whether you're transcribing podcasts and meetings or creating subtitles, Whisper provides a powerful, free solution for converting audio to text.
Key Takeaways:
- Whisper is free and open-source
- Supports 99+ languages
- Works offline (no API calls needed)
- High accuracy for most use cases
- Easy to integrate into Python projects
For production use cases requiring real-time transcription or API access, consider using cloud-based solutions like SayToWords, which provides Whisper-powered transcription via API.
Ready to get started? Install Whisper and try transcribing your first audio file today!