
OpenAI Whisper Tutorial: Complete Guide to Speech-to-Text Transcription
Eric King
OpenAI Whisper is an open-source automatic speech recognition (ASR) model designed for speech-to-text transcription and speech translation. It supports multiple languages, handles accents and background noise well, and is widely used for podcasts, meetings, interviews, and video subtitles.
This comprehensive tutorial will guide you through everything you need to know to get started with Whisper, from installation to advanced usage.
What Is OpenAI Whisper?
Whisper is trained on 680,000 hours of multilingual audio data, making it especially strong for real-world, imperfect audio. It's one of the most accurate open-source speech recognition models available.
Key Features
- Multilingual support - 99+ languages
- Speech-to-text transcription - Convert audio to text
- Speech translation - Translate speech directly to English
- Language detection - Automatically detects spoken language
- Timestamp generation - Word and segment-level timestamps
- Open-source and free - MIT license, no API costs
- Offline capable - Runs locally on your machine
- Multiple formats - Supports various audio/video formats
Whisper Model Sizes Explained
Whisper provides multiple model sizes to balance speed and accuracy:
| Model | Parameters | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | Fastest | Lowest | ~1 GB | Fast testing, demos |
| base | 74M | Very fast | Low | ~1 GB | Simple audio, quick tasks |
| small | 244M | Moderate | Good | ~2 GB | General use, balanced |
| medium | 769M | Slow | High | ~5 GB | Noisy audio, high accuracy |
| large | 1550M | Slowest | Highest | ~10 GB | Best accuracy, production |
Recommendations:
- For speed: use tiny or base
- For balance: use small or medium
- For accuracy: use large or large-v3
- For production: most workflows use medium or large-v2
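If you are not sure which checkpoints your installed version ships with, you can list them from Python. A minimal sketch (the exact names returned depend on your whisper version):
import whisper

# List the checkpoint names this installation knows about
# (typically "tiny", "base", "small", "medium", "large-v2", "large-v3", plus ".en" variants)
print(whisper.available_models())

# Load one by name; by default the weights are downloaded to ~/.cache/whisper on first use
model = whisper.load_model("small")
print(f"Model loaded on device: {model.device}")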
Prerequisites
Before using Whisper, ensure you have:
- Python 3.8 or later (Python 3.9+ recommended)
- pip package manager
- FFmpeg installed (for audio/video processing)
- (Optional) NVIDIA GPU with CUDA for faster processing
- (Optional) 4GB+ RAM for base model, 10GB+ for large model
Step 1: Installation
Install Whisper
Install the OpenAI Whisper package using pip:
pip install openai-whisper
Or install a specific version:
pip install openai-whisper==20231117
Install FFmpeg
FFmpeg is required for decoding audio and video files.
macOS (using Homebrew):
brew install ffmpeg
Ubuntu / Debian:
sudo apt update
sudo apt install ffmpeg
Windows:
- Download FFmpeg from ffmpeg.org
- Extract and add to your system PATH
- Or use Chocolatey: choco install ffmpeg
Verify Installation:
ffmpeg -version
whisper --help
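The whisper CLI itself has no --version flag. You can check the installed package version and whether PyTorch can see a CUDA GPU from Python instead; a quick optional check:
import importlib.metadata

import torch
import whisper  # should import without errors

# Installed package version
print("openai-whisper:", importlib.metadata.version("openai-whisper"))

# Whisper runs on GPU when available and falls back to CPU otherwise
print("CUDA available:", torch.cuda.is_available())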
Step 2: Basic Usage - Python
Simple Transcription
Here's the simplest way to transcribe audio:
import whisper
# Load model (downloads automatically on first use)
model = whisper.load_model("base")
# Transcribe audio file
result = model.transcribe("audio.mp3")
# Print transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.
Complete Example with Error Handling
import whisper
import os
def transcribe_audio(audio_path, model_size="base"):
"""
Transcribe an audio file using Whisper.
Args:
audio_path (str): Path to the audio file
model_size (str): Whisper model size (tiny, base, small, medium, large)
Returns:
dict: Transcription result with text and segments
"""
try:
# Check if audio file exists
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
# Load the Whisper model
print(f"Loading Whisper model: {model_size}")
model = whisper.load_model(model_size)
# Transcribe the audio
print(f"Transcribing: {audio_path}")
result = model.transcribe(audio_path)
print(f"β Transcription complete!")
print(f" Language: {result['language']}")
print(f" Duration: {result['segments'][-1]['end']:.2f}s")
return result
except Exception as e:
print(f"Error during transcription: {str(e)}")
return None
# Example usage
if __name__ == "__main__":
audio_file = "meeting.mp3"
result = transcribe_audio(audio_file, model_size="base")
if result:
print("\n" + "="*50)
print("TRANSCRIPTION:")
print("="*50)
print(result["text"])
Step 3: Language Detection and Specification
Auto-Detect Language
Whisper automatically detects the language:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Language probability: {result.get('language_probability', 0):.2%}")
print(f"\nTranscription:\n{result['text']}")
Specify Language (Faster and More Accurate)
When you know the language, specifying it improves speed and accuracy:
import whisper
model = whisper.load_model("base")
# Specify language
result_en = model.transcribe("audio.mp3", language="en") # English
result_zh = model.transcribe("audio.mp3", language="zh") # Chinese
result_es = model.transcribe("audio.mp3", language="es") # Spanish
result_fr = model.transcribe("audio.mp3", language="fr") # French
result_de = model.transcribe("audio.mp3", language="de") # German
result_ja = model.transcribe("audio.mp3", language="ja") # Japanese
print(result_en["text"])
Supported Languages:
Whisper supports 99+ languages. Common language codes:
- en - English
- zh - Chinese
- es - Spanish
- fr - French
- de - German
- ja - Japanese
- ko - Korean
- pt - Portuguese
- ru - Russian
- it - Italian
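transcribe() only reports the single detected language, not the probabilities. If you want the full distribution, you can run the detection step yourself on the first 30 seconds of audio; a short sketch using Whisper's lower-level helpers (this mirrors the example in the Whisper README):
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper works on
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram on the model's device
# (for large-v3 on recent versions, pass n_mels=model.dims.n_mels)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns the detected language token and a {code: probability} dict
_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected} ({probs[detected]:.2%})")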
Step 4: Timestamps and Segments
Access Segments with Timestamps
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
# Print full transcription
print("Full Text:")
print(result["text"])
# Print segments with timestamps
print("\n" + "="*50)
print("Segments with Timestamps:")
print("="*50)
for segment in result["segments"]:
start = segment["start"]
end = segment["end"]
text = segment["text"].strip()
print(f"[{start:6.2f}s - {end:6.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.
==================================================
Segments with Timestamps:
==================================================
[ 0.00s - 5.20s] Hello everyone, welcome to today's meeting.
[ 5.20s - 12.50s] We will discuss the project timeline.
Format Timestamps as Timecode
def format_timestamp(seconds):
"""Format seconds to HH:MM:SS."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
return f"{hours:02d}:{minutes:02d}:{secs:02d}"
for segment in result["segments"]:
start_time = format_timestamp(segment["start"])
end_time = format_timestamp(segment["end"])
print(f"[{start_time} - {end_time}] {segment['text']}")
Word-Level Timestamps
Enable word-level timestamps for precise timing:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
"audio.mp3",
word_timestamps=True # Enable word-level timestamps
)
for segment in result["segments"]:
print(f"\n[{segment['start']:.2f}s - {segment['end']:.2f}s]")
print(f"Text: {segment['text']}")
# Word-level timestamps
if "words" in segment:
print("Words:")
for word in segment["words"]:
print(f" {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
Step 5: Speech Translation
Whisper can translate non-English speech directly to English:
import whisper
model = whisper.load_model("base")
# Translate to English (regardless of source language)
result = model.transcribe("spanish_audio.mp3", task="translate")
print("Translated to English:")
print(result["text"])
# Original transcription (in original language)
result_original = model.transcribe("spanish_audio.mp3", task="transcribe")
print("\nOriginal language transcription:")
print(result_original["text"])
Use cases:
- International meetings
- Multilingual content processing
- Content localization
- Language learning materials
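For batch work the same call scales naturally to a folder of recordings. A minimal sketch, assuming a hypothetical recordings/ directory of MP3 files:
import whisper
from pathlib import Path

model = whisper.load_model("base")

# Hypothetical input folder; adjust the path and extension to your files
for audio_file in sorted(Path("recordings").glob("*.mp3")):
    result = model.transcribe(str(audio_file), task="translate")
    out_file = audio_file.with_suffix(".en.txt")
    out_file.write_text(result["text"], encoding="utf-8")
    print(f"Translated {audio_file.name} -> {out_file.name}")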
Step 6: Advanced Parameters
Temperature and Beam Size
Control transcription quality and speed:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
"audio.mp3",
temperature=0.0, # Lower = more deterministic (0.0 recommended)
beam_size=5, # Higher = more accurate but slower (default: 5)
best_of=5, # Number of candidates to consider
patience=1.0, # Beam search patience
condition_on_previous_text=True, # Use context from previous segments
initial_prompt="This is a technical meeting about AI and machine learning." # Context prompt
)
Temperature Values
- temperature=0.0 - Most deterministic; recommended for transcription
- temperature=0.2-0.4 - Slightly more variation
- temperature=1.0 - Most random; generally less accurate
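transcribe() also accepts a tuple of temperatures: decoding starts at the first value and only retries at higher temperatures when a segment fails Whisper's quality heuristics (compression ratio and average log-probability). A sketch using the default-style fallback schedule and thresholds:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    # Start deterministic; retry at higher temperatures only if a segment fails the checks below
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,  # flag highly repetitive output as a failed decode
    logprob_threshold=-1.0,           # flag low-confidence output as a failed decode
    no_speech_threshold=0.6,          # skip segments that are probably silence
)
print(result["text"])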
Initial Prompt for Context
Provide context to improve accuracy:
result = model.transcribe(
"technical_meeting.mp3",
initial_prompt="This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
)
result = model.transcribe(
"medical_audio.mp3",
initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
Step 7: Command Line Interface (CLI)
Whisper provides a powerful command-line interface:
Basic CLI Usage
whisper audio.mp3
Specify Model
whisper audio.mp3 --model small
whisper audio.mp3 --model medium
whisper audio.mp3 --model large-v2
Specify Language
whisper audio.mp3 --language en
whisper audio.mp3 --language zh
Output Formats
# SRT subtitles
whisper audio.mp3 --output_format srt
# VTT subtitles
whisper audio.mp3 --output_format vtt
# Text file
whisper audio.mp3 --output_format txt
# JSON (with all metadata)
whisper audio.mp3 --output_format json
# TSV (tab-separated values)
whisper audio.mp3 --output_format tsv
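The writers behind these formats are also importable from Python, so you can save SRT/VTT/JSON without shelling out to the CLI. A sketch using whisper.utils.get_writer (note: whether the subtitle writer requires the options dict varies slightly between whisper versions):
import whisper
from pathlib import Path
from whisper.utils import get_writer

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

output_dir = "transcripts"
Path(output_dir).mkdir(exist_ok=True)

# get_writer returns a callable that writes `result` using the audio file's base name
srt_writer = get_writer("srt", output_dir)

# These keys mirror the CLI flags --max_line_width, --max_line_count, --highlight_words
srt_writer(result, "audio.mp3", {"max_line_width": None, "max_line_count": None, "highlight_words": False})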
Advanced CLI Options
# Full example with all options
whisper audio.mp3 \
--model medium \
--language en \
--task transcribe \
--output_format srt \
--output_dir ./transcripts \
--verbose True \
--temperature 0.0 \
--beam_size 5 \
--best_of 5 \
--fp16 True
CLI Options Reference
| Option | Description | Default |
|---|---|---|
| --model | Model size (tiny, base, small, medium, large) | small |
| --language | Language code (en, zh, es, etc.) | Auto-detect |
| --task | transcribe or translate | transcribe |
| --output_format | Output format (txt, srt, vtt, json, tsv, all) | all |
| --output_dir | Output directory | Current directory |
| --temperature | Temperature for sampling | 0.0 |
| --beam_size | Beam size for beam search | 5 |
| --best_of | Number of candidates when sampling | 5 |
| --fp16 | Use FP16 precision (GPU) | True |
| --verbose | Print progress and debug messages | True |
Step 8: Supported Audio & Video Formats
Whisper supports most common formats via FFmpeg:
Supported Formats
- Audio: MP3, WAV, M4A, FLAC, OGG, AAC, WMA
- Video: MP4, AVI, MKV, MOV, WebM, FLV
- Streaming: no native real-time mode; capture live audio to files (or chunks) before transcribing
Format Examples
import whisper
model = whisper.load_model("base")
# Audio formats
model.transcribe("audio.mp3")
model.transcribe("audio.wav")
model.transcribe("audio.m4a")
model.transcribe("audio.flac")
# Video formats (extracts audio automatically)
model.transcribe("video.mp4")
model.transcribe("video.mkv")
model.transcribe("video.webm")
Step 9: Complete Production Example
Here's a complete, production-ready example:
import whisper
import json
from pathlib import Path
from datetime import datetime
class WhisperTranscriber:
"""Production-ready Whisper transcription service."""
def __init__(self, model_size="base"):
"""Initialize transcriber with specified model."""
print(f"Loading Whisper model: {model_size}")
self.model = whisper.load_model(model_size)
print("β Model loaded successfully")
def transcribe_file(self, audio_path, output_dir="transcripts", **kwargs):
"""
Transcribe audio file and save results.
Args:
audio_path: Path to audio file
output_dir: Directory to save outputs
**kwargs: Additional transcribe parameters
"""
audio_path = Path(audio_path)
if not audio_path.exists():
raise FileNotFoundError(f"Audio file not found: {audio_path}")
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
print(f"\nTranscribing: {audio_path.name}")
# Transcribe
result = self.model.transcribe(
str(audio_path),
word_timestamps=True,
**kwargs
)
# Prepare output data
output_data = {
"file": str(audio_path),
"transcribed_at": datetime.now().isoformat(),
"language": result["language"],
"language_probability": result.get("language_probability", 0),
"duration": result["segments"][-1]["end"] if result["segments"] else 0,
"text": result["text"],
"segments": result["segments"]
}
# Save outputs
base_name = audio_path.stem
# Save as text
text_file = output_path / f"{base_name}.txt"
with open(text_file, "w", encoding="utf-8") as f:
f.write(result["text"])
# Save as JSON
json_file = output_path / f"{base_name}.json"
with open(json_file, "w", encoding="utf-8") as f:
json.dump(output_data, f, indent=2, ensure_ascii=False)
# Save as SRT
srt_file = output_path / f"{base_name}.srt"
self._save_srt(result["segments"], srt_file)
print(f"β Transcription saved:")
print(f" - Text: {text_file}")
print(f" - JSON: {json_file}")
print(f" - SRT: {srt_file}")
return output_data
def _save_srt(self, segments, output_path):
"""Save segments as SRT subtitle file."""
with open(output_path, "w", encoding="utf-8") as f:
for i, segment in enumerate(segments, start=1):
start = self._format_srt_time(segment["start"])
end = self._format_srt_time(segment["end"])
text = segment["text"].strip()
f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
def _format_srt_time(self, seconds):
"""Format seconds to SRT timestamp."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
millis = int((seconds % 1) * 1000)
return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
# Usage
if __name__ == "__main__":
transcriber = WhisperTranscriber(model_size="base")
result = transcriber.transcribe_file(
"meeting.mp3",
output_dir="transcripts",
language="en",
temperature=0.0
)
print(f"\nLanguage: {result['language']}")
print(f"Duration: {result['duration']:.2f}s")
print(f"\nTranscription preview:")
print(result['text'][:200] + "...")
Step 10: Best Practices
1. Choose the Right Model
# For speed (testing, demos)
model = whisper.load_model("tiny")
# For balance (general use)
model = whisper.load_model("base") # or "small"
# For accuracy (production)
model = whisper.load_model("medium") # or "large-v2"
2. Specify Language When Known
# Faster and more accurate
result = model.transcribe("audio.mp3", language="en")
# Instead of auto-detection
result = model.transcribe("audio.mp3") # Slower
3. Use Appropriate Temperature
# Recommended for most cases
result = model.transcribe("audio.mp3", temperature=0.0)
# For creative content (not recommended for transcription)
result = model.transcribe("audio.mp3", temperature=0.2)
4. Provide Context with Initial Prompt
# Technical content
result = model.transcribe(
"meeting.mp3",
initial_prompt="This meeting discusses software architecture, APIs, and deployment strategies."
)
# Medical content
result = model.transcribe(
"consultation.mp3",
initial_prompt="This is a medical consultation about patient symptoms and treatment."
)
5. Reuse Model Instances
# Load once, reuse multiple times
model = whisper.load_model("base")
# Process multiple files
for audio_file in ["file1.mp3", "file2.mp3", "file3.mp3"]:
result = model.transcribe(audio_file)
# Process result...
6. Handle Long Audio Files
For very long audio files, consider chunking:
import os

import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=600000):  # 10-minute chunks
    """Transcribe a long audio file by splitting it into chunks."""
    model = whisper.load_model("base")
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    duration_ms = len(audio)
    all_text = []
    all_segments = []
    # Process in chunks
    for i in range(0, duration_ms, chunk_length_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        result = model.transcribe(chunk_path)
        all_text.append(result["text"])
        # Shift segment timestamps by the chunk's offset within the full file
        offset_s = i / 1000.0
        for segment in result["segments"]:
            segment["start"] += offset_s
            segment["end"] += offset_s
            all_segments.append(segment)
        # Clean up chunk file
        os.remove(chunk_path)
    return {
        "text": " ".join(all_text),
        "segments": all_segments
    }
Common Issues and Solutions
Issue 1: FFmpeg Not Found
Error:
FileNotFoundError: ffmpeg
Solution:
# Install FFmpeg
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# Verify
ffmpeg -version
Issue 2: Out of Memory
Error:
RuntimeError: CUDA out of memory (or the system runs out of RAM)
Solutions:
# Use smaller model
model = whisper.load_model("base") # Instead of "large"
# Or use CPU
import torch
model = whisper.load_model("base", device="cpu")
# Or process in chunks (see above)
Issue 3: Slow Transcription
Problem: Transcription is very slow
Solutions:
# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
# Use smaller model
model = whisper.load_model("tiny") # or "base"
# Reduce beam size (faster but slightly less accurate)
result = model.transcribe("audio.mp3", beam_size=1)
Issue 4: Poor Accuracy
Problem: Transcription has many errors
Solutions:
# Use larger model
model = whisper.load_model("medium") # or "large"
# Specify language
result = model.transcribe("audio.mp3", language="en")
# Provide context
result = model.transcribe(
"audio.mp3",
initial_prompt="Context about the audio content..."
)
# Use optimal settings
result = model.transcribe(
"audio.mp3",
temperature=0.0,
beam_size=5,
best_of=5
)
Use Cases
1. Podcast Transcription
model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3", language="en")
# Save transcript
with open("podcast_transcript.txt", "w") as f:
f.write(result["text"])
2. YouTube Subtitle Generation
model = whisper.load_model("base")
result = model.transcribe("video.mp4", language="en")
# Generate SRT
# (Use CLI: whisper video.mp4 --output_format srt)
3. Meeting Notes
model = whisper.load_model("base")
result = model.transcribe(
"meeting.mp3",
language="en",
initial_prompt="This is a business meeting discussing project updates and deadlines."
)
# Save with timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.0f}s] {segment['text']}")
4. Interview Transcription
model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")
# Export for editing
with open("interview.txt", "w") as f:
for segment in result["segments"]:
f.write(f"[{segment['start']:.2f}s] {segment['text']}\n")
5. Multilingual Content Translation
model = whisper.load_model("base")
# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"]) # English translation
Whisper vs Alternatives
| Feature | Whisper | Cloud APIs | Faster-Whisper |
|---|---|---|---|
| Cost | Free | Paid per minute | Free |
| Offline | ✓ | ✗ | ✓ |
| Speed | Medium | Fast | Fast (2-4×) |
| Accuracy | High | High | High (same) |
| Setup | Easy | Very Easy | Easy |
| Real-time | ✗ | ✓ | ✗ |
| Privacy | ✓ Local | ✗ Cloud | ✓ Local |
Choose Whisper when:
- You want free, offline transcription
- Privacy is important
- You have control over infrastructure
- Processing batch files or archived content
Choose Cloud APIs when:
- You need real-time transcription
- You want managed infrastructure
- You have budget for API costs
- You need enterprise support
Next Steps
Now that you've learned the basics, explore:
- Whisper Python Example - More detailed Python examples
- Faster-Whisper Guide - 2-4× faster transcription
- Whisper Accuracy Tips - Improve transcription quality
- Whisper Transcript Formatting - Format outputs (SRT, VTT, JSON)
- Whisper for Meetings - Meeting-specific transcription
Conclusion
OpenAI Whisper is one of the most powerful open-source speech-to-text models available today. With strong multilingual support, high transcription accuracy, and complete offline capability, it's an excellent choice for developers and content creators who want full control over their transcription workflow.
Key takeaways:
- Whisper supports 99+ languages with high accuracy
- Choose the right model size for your needs
- Specify language when known for better performance
- Use word timestamps for precise timing
- Reuse model instances for multiple files
- Consider faster-whisper for production deployments
Whether you're transcribing podcasts, generating subtitles, or processing meeting recordings, Whisper provides a robust, free, and privacy-preserving solution for speech-to-text transcription.
Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.