
OpenAI Whisper Tutorial: Complete Guide to Speech-to-Text Transcription
Eric King
OpenAI Whisper is an open-source automatic speech recognition (ASR) model designed for speech-to-text transcription and speech translation. It supports multiple languages, handles accents and background noise well, and is widely used for podcasts, meetings, interviews, and video subtitles.
This comprehensive tutorial will guide you through everything you need to know to get started with Whisper, from installation to advanced usage.
What Is OpenAI Whisper?
Whisper is trained on 680,000 hours of multilingual audio data, making it especially strong for real-world, imperfect audio. It's one of the most accurate open-source speech recognition models available.
Key Features
- Multilingual support - 99+ languages
- Speech-to-text transcription - Convert audio to text
- Speech translation - Translate speech directly to English
- Language detection - Automatically detects spoken language
- Timestamp generation - Word and segment-level timestamps
- Open-source and free - MIT license, no API costs
- Offline capable - Runs locally on your machine
- Multiple formats - Supports various audio/video formats
Whisper Model Sizes Explained
Whisper provides multiple model sizes to balance speed and accuracy:
| Model | Parameters | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | Fastest | Lowest | ~1 GB | Fast testing, demos |
| base | 74M | Very fast | Low | ~1 GB | Simple audio, quick tasks |
| small | 244M | Moderate | Good | ~2 GB | General use, balanced |
| medium | 769M | Slow | High | ~5 GB | Noisy audio, high accuracy |
| large | 1550M | Slowest | Highest | ~10 GB | Best accuracy, production |
Recommendations:
- For speed: use tiny or base
- For balance: use small or medium
- For accuracy: use large or large-v3
- For production: most workflows use medium or large-v2
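If you are not sure which checkpoints your installed version ships with, you can list them from Python. A minimal sketch (the exact names returned depend on your whisper version):
import whisper

# List the checkpoint names this installation knows about
# (typically "tiny", "base", "small", "medium", "large-v2", "large-v3", plus ".en" variants)
print(whisper.available_models())

# Load one by name; by default the weights are downloaded to ~/.cache/whisper on first use
model = whisper.load_model("small")
print(f"Model loaded on device: {model.device}")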
Prerequisites
Before using Whisper, ensure you have:
- Python 3.8 or later (Python 3.9+ recommended)
- pip package manager
- FFmpeg installed (for audio/video processing)
- (Optional) NVIDIA GPU with CUDA for faster processing
- (Optional) 4GB+ RAM for base model, 10GB+ for large model
Step 1: Installation
Install Whisper
Install the OpenAI Whisper package using pip:
pip install openai-whisper
Or install a specific version:
pip install openai-whisper==20231117
Install FFmpeg
FFmpeg is required for decoding audio and video files.
macOS (using Homebrew):
brew install ffmpeg
Ubuntu / Debian:
sudo apt update
sudo apt install ffmpeg
Windows:
- Download FFmpeg from ffmpeg.org
- Extract and add to your system PATH
- Or use Chocolatey: choco install ffmpeg
Verify Installation:
ffmpeg -version
whisper --help
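The whisper CLI itself has no --version flag. You can check the installed package version and whether PyTorch can see a CUDA GPU from Python instead; a quick optional check:
import importlib.metadata

import torch
import whisper  # should import without errors

# Installed package version
print("openai-whisper:", importlib.metadata.version("openai-whisper"))

# Whisper runs on GPU when available and falls back to CPU otherwise
print("CUDA available:", torch.cuda.is_available())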
Step 2: Basic Usage - Python
Simple Transcription
Here's the simplest way to transcribe audio:
import whisper
# Load model (downloads automatically on first use)
model = whisper.load_model("base")
# Transcribe audio file
result = model.transcribe("audio.mp3")
# Print transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.
Complete Example with Error Handling
import whisper
import os
def transcribe_audio(audio_path, model_size="base"):
"""
Transcribe an audio file using Whisper.
Args:
audio_path (str): Path to the audio file
model_size (str): Whisper model size (tiny, base, small, medium, large)
Returns:
dict: Transcription result with text and segments
"""
try:
# Check if audio file exists
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
# Load the Whisper model
print(f"Loading Whisper model: {model_size}")
model = whisper.load_model(model_size)
# Transcribe the audio
print(f"Transcribing: {audio_path}")
result = model.transcribe(audio_path)
print(f"β Transcription complete!")
print(f" Language: {result['language']}")
print(f" Duration: {result['segments'][-1]['end']:.2f}s")
return result
except Exception as e:
print(f"Error during transcription: {str(e)}")
return None
# Example usage
if __name__ == "__main__":
audio_file = "meeting.mp3"
result = transcribe_audio(audio_file, model_size="base")
if result:
print("\n" + "="*50)
print("TRANSCRIPTION:")
print("="*50)
print(result["text"])
Step 3: Language Detection and Specification
Auto-Detect Language
Whisper automatically detects the language:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Language probability: {result.get('language_probability', 0):.2%}")
print(f"\nTranscription:\n{result['text']}")
Specify Language (Faster and More Accurate)
When you know the language, specifying it improves speed and accuracy:
import whisper
model = whisper.load_model("base")
# Specify language
result_en = model.transcribe("audio.mp3", language="en") # English
result_zh = model.transcribe("audio.mp3", language="zh") # Chinese
result_es = model.transcribe("audio.mp3", language="es") # Spanish
result_fr = model.transcribe("audio.mp3", language="fr") # French
result_de = model.transcribe("audio.mp3", language="de") # German
result_ja = model.transcribe("audio.mp3", language="ja") # Japanese
print(result_en["text"])
Supported Languages:
Whisper supports 99+ languages. Common language codes:
- en - English
- zh - Chinese
- es - Spanish
- fr - French
- de - German
- ja - Japanese
- ko - Korean
- pt - Portuguese
- ru - Russian
- it - Italian
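transcribe() only reports the single detected language, not the probabilities. If you want the full distribution, you can run the detection step yourself on the first 30 seconds of audio; a short sketch using Whisper's lower-level helpers (this mirrors the example in the Whisper README):
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper works on
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram on the model's device
# (for large-v3 on recent versions, pass n_mels=model.dims.n_mels)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns the detected language token and a {code: probability} dict
_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected} ({probs[detected]:.2%})")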
Step 4: Timestamps and Segments
Access Segments with Timestamps
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
# Print full transcription
print("Full Text:")
print(result["text"])
# Print segments with timestamps
print("\n" + "="*50)
print("Segments with Timestamps:")
print("="*50)
for segment in result["segments"]:
start = segment["start"]
end = segment["end"]
text = segment["text"].strip()
print(f"[{start:6.2f}s - {end:6.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.
==================================================
Segments with Timestamps:
==================================================
[ 0.00s - 5.20s] Hello everyone, welcome to today's meeting.
[ 5.20s - 12.50s] We will discuss the project timeline.
Format Timestamps as Timecode
def format_timestamp(seconds):
"""Format seconds to HH:MM:SS."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
return f"{hours:02d}:{minutes:02d}:{secs:02d}"
for segment in result["segments"]:
start_time = format_timestamp(segment["start"])
end_time = format_timestamp(segment["end"])
print(f"[{start_time} - {end_time}] {segment['text']}")
Word-Level Timestamps
Enable word-level timestamps for precise timing:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
"audio.mp3",
word_timestamps=True # Enable word-level timestamps
)
for segment in result["segments"]:
print(f"\n[{segment['start']:.2f}s - {segment['end']:.2f}s]")
print(f"Text: {segment['text']}")
# Word-level timestamps
if "words" in segment:
print("Words:")
for word in segment["words"]:
print(f" {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
Step 5: Speech Translation
Whisper can translate non-English speech directly to English:
import whisper
model = whisper.load_model("base")
# Translate to English (regardless of source language)
result = model.transcribe("spanish_audio.mp3", task="translate")
print("Translated to English:")
print(result["text"])
# Original transcription (in original language)
result_original = model.transcribe("spanish_audio.mp3", task="transcribe")
print("\nOriginal language transcription:")
print(result_original["text"])
Use cases:
- International meetings
- Multilingual content processing
- Content localization
- Language learning materials
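For batch work the same call scales naturally to a folder of recordings. A minimal sketch, assuming a hypothetical recordings/ directory of MP3 files:
import whisper
from pathlib import Path

model = whisper.load_model("base")

# Hypothetical input folder; adjust the path and extension to your files
for audio_file in sorted(Path("recordings").glob("*.mp3")):
    result = model.transcribe(str(audio_file), task="translate")
    out_file = audio_file.with_suffix(".en.txt")
    out_file.write_text(result["text"], encoding="utf-8")
    print(f"Translated {audio_file.name} -> {out_file.name}")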
Step 6: Advanced Parameters
Temperature and Beam Size
Control transcription quality and speed:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
"audio.mp3",
temperature=0.0, # Lower = more deterministic (0.0 recommended)
beam_size=5, # Higher = more accurate but slower (default: 5)
best_of=5, # Number of candidates to consider
patience=1.0, # Beam search patience
condition_on_previous_text=True, # Use context from previous segments
initial_prompt="This is a technical meeting about AI and machine learning." # Context prompt
)
Temperature Values
- temperature=0.0 - Most deterministic; recommended for transcription
- temperature=0.2-0.4 - Slightly more variation
- temperature=1.0 - Most random; generally less accurate
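transcribe() also accepts a tuple of temperatures: decoding starts at the first value and only retries at higher temperatures when a segment fails Whisper's quality heuristics (compression ratio and average log-probability). A sketch using the default-style fallback schedule and thresholds:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    # Start deterministic; retry at higher temperatures only if a segment fails the checks below
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,  # flag highly repetitive output as a failed decode
    logprob_threshold=-1.0,           # flag low-confidence output as a failed decode
    no_speech_threshold=0.6,          # skip segments that are probably silence
)
print(result["text"])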
Initial Prompt for Context
Provide context to improve accuracy:
result = model.transcribe(
"technical_meeting.mp3",
initial_prompt="This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
)
result = model.transcribe(
"medical_audio.mp3",
initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
Step 7: Command Line Interface (CLI)
Whisper provides a powerful command-line interface:
Basic CLI Usage
whisper audio.mp3
Specify Model
whisper audio.mp3 --model small
whisper audio.mp3 --model medium
whisper audio.mp3 --model large-v2
Specify Language
whisper audio.mp3 --language en
whisper audio.mp3 --language zh
Output Formats
# SRT subtitles
whisper audio.mp3 --output_format srt
# VTT subtitles
whisper audio.mp3 --output_format vtt
# Text file
whisper audio.mp3 --output_format txt
# JSON (with all metadata)
whisper audio.mp3 --output_format json
# TSV (tab-separated values)
whisper audio.mp3 --output_format tsv
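The writers behind these formats are also importable from Python, so you can save SRT/VTT/JSON without shelling out to the CLI. A sketch using whisper.utils.get_writer (note: whether the subtitle writer requires the options dict varies slightly between whisper versions):
import whisper
from pathlib import Path
from whisper.utils import get_writer

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

output_dir = "transcripts"
Path(output_dir).mkdir(exist_ok=True)

# get_writer returns a callable that writes `result` using the audio file's base name
srt_writer = get_writer("srt", output_dir)

# These keys mirror the CLI flags --max_line_width, --max_line_count, --highlight_words
srt_writer(result, "audio.mp3", {"max_line_width": None, "max_line_count": None, "highlight_words": False})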
Advanced CLI Options
# Full example with all options
whisper audio.mp3 \
--model medium \
--language en \
--task transcribe \
--output_format srt \
--output_dir ./transcripts \
--verbose True \
--temperature 0.0 \
--beam_size 5 \
--best_of 5 \
--fp16 True
CLI Options Reference
| Option | Description | Default |
|---|---|---|
| --model | Model size (tiny, base, small, medium, large) | small |
| --language | Language code (en, zh, es, etc.) | Auto-detect |
| --task | transcribe or translate | transcribe |
| --output_format | Output format (txt, srt, vtt, json, tsv, all) | all |
| --output_dir | Output directory | Current directory |
| --temperature | Temperature for sampling | 0.0 |
| --beam_size | Beam size for beam search | 5 |
| --best_of | Number of candidates when sampling | 5 |
| --fp16 | Use FP16 precision (GPU) | True |
| --verbose | Print progress and debug messages | True |
Step 8: Supported Audio & Video Formats
Whisper supports most common formats via FFmpeg:
Supported Formats
- Audio: MP3, WAV, M4A, FLAC, OGG, AAC, WMA
- Video: MP4, AVI, MKV, MOV, WebM, FLV
- Streaming: no native real-time mode; capture live audio to files (or chunks) before transcribing
Format Examples
import whisper
model = whisper.load_model("base")
# Audio formats
model.transcribe("audio.mp3")
model.transcribe("audio.wav")
model.transcribe("audio.m4a")
model.transcribe("audio.flac")
# Video formats (extracts audio automatically)
model.transcribe("video.mp4")
model.transcribe("video.mkv")
model.transcribe("video.webm")
Step 9: Complete Production Example
Here's a complete, production-ready example:
import whisper
import json
from pathlib import Path
from datetime import datetime
class WhisperTranscriber:
"""Production-ready Whisper transcription service."""
def __init__(self, model_size="base"):
"""Initialize transcriber with specified model."""
print(f"Loading Whisper model: {model_size}")
self.model = whisper.load_model(model_size)
print("β Model loaded successfully")
def transcribe_file(self, audio_path, output_dir="transcripts", **kwargs):
"""
Transcribe audio file and save results.
Args:
audio_path: Path to audio file
output_dir: Directory to save outputs
**kwargs: Additional transcribe parameters
"""
audio_path = Path(audio_path)
if not audio_path.exists():
raise FileNotFoundError(f"Audio file not found: {audio_path}")
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
print(f"\nTranscribing: {audio_path.name}")
# Transcribe
result = self.model.transcribe(
str(audio_path),
word_timestamps=True,
**kwargs
)
# Prepare output data
output_data = {
"file": str(audio_path),
"transcribed_at": datetime.now().isoformat(),
"language": result["language"],
"language_probability": result.get("language_probability", 0),
"duration": result["segments"][-1]["end"] if result["segments"] else 0,
"text": result["text"],
"segments": result["segments"]
}
# Save outputs
base_name = audio_path.stem
# Save as text
text_file = output_path / f"{base_name}.txt"
with open(text_file, "w", encoding="utf-8") as f:
f.write(result["text"])
# Save as JSON
json_file = output_path / f"{base_name}.json"
with open(json_file, "w", encoding="utf-8") as f:
json.dump(output_data, f, indent=2, ensure_ascii=False)
# Save as SRT
srt_file = output_path / f"{base_name}.srt"
self._save_srt(result["segments"], srt_file)
print(f"β Transcription saved:")
print(f" - Text: {text_file}")
print(f" - JSON: {json_file}")
print(f" - SRT: {srt_file}")
return output_data
def _save_srt(self, segments, output_path):
"""Save segments as SRT subtitle file."""
with open(output_path, "w", encoding="utf-8") as f:
for i, segment in enumerate(segments, start=1):
start = self._format_srt_time(segment["start"])
end = self._format_srt_time(segment["end"])
text = segment["text"].strip()
f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
def _format_srt_time(self, seconds):
"""Format seconds to SRT timestamp."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
millis = int((seconds % 1) * 1000)
return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
# Usage
if __name__ == "__main__":
transcriber = WhisperTranscriber(model_size="base")
result = transcriber.transcribe_file(
"meeting.mp3",
output_dir="transcripts",
language="en",
temperature=0.0
)
print(f"\nLanguage: {result['language']}")
print(f"Duration: {result['duration']:.2f}s")
print(f"\nTranscription preview:")
print(result['text'][:200] + "...")
Step 10: Best Practices
1. Choose the Right Model
# For speed (testing, demos)
model = whisper.load_model("tiny")
# For balance (general use)
model = whisper.load_model("base") # or "small"
# For accuracy (production)
model = whisper.load_model("medium") # or "large-v2"
2. Specify Language When Known
# Faster and more accurate
result = model.transcribe("audio.mp3", language="en")
# Instead of auto-detection
result = model.transcribe("audio.mp3") # Slower
3. Use Appropriate Temperature
# Recommended for most cases
result = model.transcribe("audio.mp3", temperature=0.0)
# For creative content (not recommended for transcription)
result = model.transcribe("audio.mp3", temperature=0.2)
4. Provide Context with Initial Prompt
# Technical content
result = model.transcribe(
"meeting.mp3",
initial_prompt="This meeting discusses software architecture, APIs, and deployment strategies."
)
# Medical content
result = model.transcribe(
"consultation.mp3",
initial_prompt="This is a medical consultation about patient symptoms and treatment."
)
5. Reuse Model Instances
# Load once, reuse multiple times
model = whisper.load_model("base")
# Process multiple files
for audio_file in ["file1.mp3", "file2.mp3", "file3.mp3"]:
result = model.transcribe(audio_file)
# Process result...
6. Handle Long Audio Files
For very long audio files, consider chunking:
import os

import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=600000):  # 10-minute chunks
    """Transcribe a long audio file by splitting it into chunks."""
    model = whisper.load_model("base")
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    duration_ms = len(audio)
    all_text = []
    all_segments = []
    # Process in chunks
    for i in range(0, duration_ms, chunk_length_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        result = model.transcribe(chunk_path)
        all_text.append(result["text"])
        # Shift segment timestamps by the chunk's offset within the full file
        offset_s = i / 1000.0
        for segment in result["segments"]:
            segment["start"] += offset_s
            segment["end"] += offset_s
            all_segments.append(segment)
        # Clean up chunk file
        os.remove(chunk_path)
    return {
        "text": " ".join(all_text),
        "segments": all_segments
    }
Common Issues and Solutions
Issue 1: FFmpeg Not Found
Error:
FileNotFoundError: ffmpeg
Solution:
# Install FFmpeg
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# Verify
ffmpeg -version
Issue 2: Out of Memory
Error:
RuntimeError: CUDA out of memory (or the system runs out of RAM)
Solutions:
# Use smaller model
model = whisper.load_model("base") # Instead of "large"
# Or use CPU
import torch
model = whisper.load_model("base", device="cpu")
# Or process in chunks (see above)
Issue 3: Slow Transcription
Problem: Transcription is very slow
Solutions:
# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
# Use smaller model
model = whisper.load_model("tiny") # or "base"
# Reduce beam size (faster but slightly less accurate)
result = model.transcribe("audio.mp3", beam_size=1)
Issue 4: Poor Accuracy
Problem: Transcription has many errors
Solutions:
# Use larger model
model = whisper.load_model("medium") # or "large"
# Specify language
result = model.transcribe("audio.mp3", language="en")
# Provide context
result = model.transcribe(
"audio.mp3",
initial_prompt="Context about the audio content..."
)
# Use optimal settings
result = model.transcribe(
"audio.mp3",
temperature=0.0,
beam_size=5,
best_of=5
)
Use Cases
1. Podcast Transcription
model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3", language="en")
# Save transcript
with open("podcast_transcript.txt", "w") as f:
f.write(result["text"])
2. YouTube Subtitle Generation
model = whisper.load_model("base")
result = model.transcribe("video.mp4", language="en")
# Generate SRT
# (Use CLI: whisper video.mp4 --output_format srt)
3. Meeting Notes
model = whisper.load_model("base")
result = model.transcribe(
"meeting.mp3",
language="en",
initial_prompt="This is a business meeting discussing project updates and deadlines."
)
# Save with timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.0f}s] {segment['text']}")
4. Interview Transcription
model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")
# Export for editing
with open("interview.txt", "w") as f:
for segment in result["segments"]:
f.write(f"[{segment['start']:.2f}s] {segment['text']}\n")
5. Multilingual Content Translation
model = whisper.load_model("base")
# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"]) # English translation
Whisper vs Alternatives
| Feature | Whisper | Cloud APIs | Faster-Whisper |
|---|---|---|---|
| Cost | Free | Paid per minute | Free |
| Offline | ✓ | ✗ | ✓ |
| Speed | Medium | Fast | Fast (2-4×) |
| Accuracy | High | High | High (same) |
| Setup | Easy | Very Easy | Easy |
| Real-time | ✗ | ✓ | ✗ |
| Privacy | ✓ Local | ✗ Cloud | ✓ Local |
Choose Whisper when:
- You want free, offline transcription
- Privacy is important
- You have control over infrastructure
- Processing batch files or archived content
Choose Cloud APIs when:
- You need real-time transcription
- You want managed infrastructure
- You have budget for API costs
- You need enterprise support
Next Steps
Now that you've learned the basics, explore:
- Whisper Python Example - More detailed Python examples
- Faster-Whisper Guide - 2-4× faster transcription
- Whisper Accuracy Tips - Improve transcription quality
- Whisper Transcript Formatting - Format outputs (SRT, VTT, JSON)
- Whisper for Meetings - Meeting-specific transcription
Conclusion
OpenAI Whisper is one of the most powerful open-source speech-to-text models available today. With strong multilingual support, high transcription accuracy, and complete offline capability, it's an excellent choice for developers and content creators who want full control over their transcription workflow.
Key takeaways:
- Whisper supports 99+ languages with high accuracy
- Choose the right model size for your needs
- Specify language when known for better performance
- Use word timestamps for precise timing
- Reuse model instances for multiple files
- Consider faster-whisper for production deployments
Whether you're transcribing podcasts, generating subtitles, or processing meeting recordings, Whisper provides a robust, free, and privacy-preserving solution for speech-to-text transcription.
Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.