OpenAI Whisper Tutorial: Complete Guide to Speech-to-Text Transcription

Eric King

OpenAI Whisper is an open-source automatic speech recognition (ASR) model designed for speech-to-text transcription and speech translation. It supports multiple languages, handles accents and background noise well, and is widely used for podcasts, meetings, interviews, and video subtitles.
This comprehensive tutorial will guide you through everything you need to know to get started with Whisper, from installation to advanced usage.

What Is OpenAI Whisper?

Whisper is trained on 680,000 hours of multilingual audio data, making it especially strong for real-world, imperfect audio. It's one of the most accurate open-source speech recognition models available.

Key Features

  • Multilingual support - 99+ languages
  • Speech-to-text transcription - Convert audio to text
  • Speech translation - Translate speech directly to English
  • Language detection - Automatically detects spoken language
  • Timestamp generation - Word and segment-level timestamps
  • Open-source and free - MIT license, no API costs
  • Offline capable - Runs locally on your machine
  • Multiple formats - Supports various audio/video formats

Whisper Model Sizes Explained

Whisper provides multiple model sizes to balance speed and accuracy:
| Model  | Parameters | Speed | Accuracy | Memory | Use Case                   |
|--------|------------|-------|----------|--------|----------------------------|
| tiny   | 39M        | ⭐⭐⭐⭐⭐ | ⭐⭐       | ~1 GB  | Fast testing, demos        |
| base   | 74M        | ⭐⭐⭐⭐  | ⭐⭐⭐      | ~1 GB  | Simple audio, quick tasks  |
| small  | 244M       | ⭐⭐⭐   | ⭐⭐⭐⭐     | ~2 GB  | General use, balanced      |
| medium | 769M       | ⭐⭐    | ⭐⭐⭐⭐⭐    | ~5 GB  | Noisy audio, high accuracy |
| large  | 1550M      | ⭐⭐    | ⭐⭐⭐⭐⭐    | ~10 GB | Best accuracy, production  |
Recommendations:
  • For speed: Use tiny or base
  • For balance: Use small or medium
  • For accuracy: Use large or large-v3
  • For production: most deployments use medium or large-v2

Prerequisites

Before using Whisper, ensure you have:
  • Python 3.8 or later (Python 3.9+ recommended)
  • pip package manager
  • FFmpeg installed (for audio/video processing)
  • (Optional) NVIDIA GPU with CUDA for faster processing
  • 4 GB+ RAM for the base model, 10 GB+ for the large model
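To sanity-check these prerequisites programmatically, a quick script like the following sketch can help (the PyTorch check assumes the openai-whisper install in Step 1, which pulls in torch, has already run):
import shutil
import sys

# Python 3.8+ is required; 3.9+ recommended
print(f"Python: {sys.version.split()[0]}")

# FFmpeg must be on the PATH for Whisper to decode audio/video
print(f"FFmpeg found: {shutil.which('ffmpeg') is not None}")

# Optional: check for a CUDA-capable GPU
try:
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
except ImportError:
    print("PyTorch not installed yet (it ships with openai-whisper)")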

Step 1: Installation

Install Whisper

Install the OpenAI Whisper package using pip:
pip install openai-whisper
Or pin a specific version:
pip install openai-whisper==20231117

Install FFmpeg

FFmpeg is required for decoding audio and video files.
macOS (using Homebrew):
brew install ffmpeg
Ubuntu / Debian:
sudo apt update
sudo apt install ffmpeg
Windows:
  1. Download FFmpeg from ffmpeg.org
  2. Extract and add to your system PATH
  3. Or use: choco install ffmpeg (with Chocolatey)
Verify Installation:
ffmpeg -version
whisper --help

Step 2: Basic Usage - Python

Simple Transcription

Here's the simplest way to transcribe audio:
import whisper

# Load model (downloads automatically on first use)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.

Complete Example with Error Handling

import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)
    
    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)
        
        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)
        
        print(f"βœ“ Transcription complete!")
        print(f"  Language: {result['language']}")
        print(f"  Duration: {result['segments'][-1]['end']:.2f}s")
        
        return result
    
    except Exception as e:
        print(f"Error during transcription: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "meeting.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    
    if result:
        print("\n" + "="*50)
        print("TRANSCRIPTION:")
        print("="*50)
        print(result["text"])

Step 3: Language Detection and Specification

Auto-Detect Language

Whisper automatically detects the language:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

print(f"Detected language: {result['language']}")
print(f"Language probability: {result.get('language_probability', 0):.2%}")
print(f"\nTranscription:\n{result['text']}")

Specify Language (Faster and More Accurate)

When you know the language, specifying it improves speed and accuracy:
import whisper

model = whisper.load_model("base")

# Specify language
result_en = model.transcribe("audio.mp3", language="en")  # English
result_zh = model.transcribe("audio.mp3", language="zh")  # Chinese
result_es = model.transcribe("audio.mp3", language="es")  # Spanish
result_fr = model.transcribe("audio.mp3", language="fr")  # French
result_de = model.transcribe("audio.mp3", language="de")  # German
result_ja = model.transcribe("audio.mp3", language="ja")  # Japanese

print(result_en["text"])
Supported Languages: Whisper supports 99+ languages. Common language codes:
  • en - English
  • zh - Chinese
  • es - Spanish
  • fr - French
  • de - German
  • ja - Japanese
  • ko - Korean
  • pt - Portuguese
  • ru - Russian
  • it - Italian
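To enumerate every supported code programmatically, the package exposes a LANGUAGES mapping (code to name) in whisper.tokenizer:
from whisper.tokenizer import LANGUAGES

# LANGUAGES maps ISO codes to lowercase language names, e.g. "en" -> "english"
print(f"{len(LANGUAGES)} languages supported")
for code, name in sorted(LANGUAGES.items())[:5]:
    print(f"{code}: {name}")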

Step 4: Timestamps and Segments

Access Segments with Timestamps

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\n" + "="*50)
print("Segments with Timestamps:")
print("="*50)

for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:6.2f}s - {end:6.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

==================================================
Segments with Timestamps:
==================================================
[  0.00s -   5.20s] Hello everyone, welcome to today's meeting.
[  5.20s -  12.50s] We will discuss the project timeline.

Format Timestamps as Timecode

def format_timestamp(seconds):
    """Format seconds to HH:MM:SS."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

for segment in result["segments"]:
    start_time = format_timestamp(segment["start"])
    end_time = format_timestamp(segment["end"])
    print(f"[{start_time} - {end_time}] {segment['text']}")

Word-Level Timestamps

Enable word-level timestamps for precise timing:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    word_timestamps=True  # Enable word-level timestamps
)

for segment in result["segments"]:
    print(f"\n[{segment['start']:.2f}s - {segment['end']:.2f}s]")
    print(f"Text: {segment['text']}")
    
    # Word-level timestamps
    if "words" in segment:
        print("Words:")
        for word in segment["words"]:
            print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")

Step 5: Speech Translation

Whisper can translate non-English speech directly to English:
import whisper

model = whisper.load_model("base")

# Translate to English (regardless of source language)
result = model.transcribe("spanish_audio.mp3", task="translate")

print("Translated to English:")
print(result["text"])

# Original transcription (in original language)
result_original = model.transcribe("spanish_audio.mp3", task="transcribe")
print("\nOriginal language transcription:")
print(result_original["text"])
Use cases:
  • International meetings
  • Multilingual content processing
  • Content localization
  • Language learning materials

Step 6: Advanced Parameters

Temperature and Beam Size

Control transcription quality and speed:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    temperature=0.0,        # Lower = more deterministic (0.0 recommended)
    beam_size=5,            # Higher = more accurate but slower (default: 5)
    best_of=5,              # Number of candidates to consider
    patience=1.0,           # Beam search patience
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."  # Context prompt
)

Temperature Values

  • temperature=0.0 - Most deterministic, recommended
  • temperature=0.2-0.4 - Slightly more variation
  • temperature=1.0 - More creative, less accurate
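transcribe also accepts a tuple of temperatures used as a fallback schedule: decoding starts at the first value and retries at progressively higher temperatures only when a segment fails the internal compression-ratio or log-probability checks. The tuple below is the library default:
import whisper

model = whisper.load_model("base")

# Fallback schedule: retry hotter only when decoding quality checks fail
result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)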

Initial Prompt for Context

Provide context to improve accuracy:
result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
)

result = model.transcribe(
    "medical_audio.mp3",
    initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)

Step 7: Command Line Interface (CLI)

Whisper provides a powerful command-line interface:

Basic CLI Usage

whisper audio.mp3

Specify Model

whisper audio.mp3 --model small
whisper audio.mp3 --model medium
whisper audio.mp3 --model large-v2

Specify Language

whisper audio.mp3 --language en
whisper audio.mp3 --language zh

Output Formats

# SRT subtitles
whisper audio.mp3 --output_format srt

# VTT subtitles
whisper audio.mp3 --output_format vtt

# Text file
whisper audio.mp3 --output_format txt

# JSON (with all metadata)
whisper audio.mp3 --output_format json

# TSV (tab-separated values)
whisper audio.mp3 --output_format tsv
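
# All formats at once (txt, srt, vtt, json, tsv)
whisper audio.mp3 --output_format all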

Advanced CLI Options

# Full example with all options
whisper audio.mp3 \
  --model medium \
  --language en \
  --task transcribe \
  --output_format srt \
  --output_dir ./transcripts \
  --verbose True \
  --temperature 0.0 \
  --beam_size 5 \
  --best_of 5 \
  --fp16 True

CLI Options Reference

| Option          | Description                                   | Default           |
|-----------------|-----------------------------------------------|-------------------|
| --model         | Model size (tiny, base, small, medium, large) | base              |
| --language      | Language code (en, zh, es, etc.)              | Auto-detect       |
| --task          | transcribe or translate                       | transcribe        |
| --output_format | Output format (txt, srt, vtt, json, tsv)      | txt               |
| --output_dir    | Output directory                              | Current directory |
| --temperature   | Temperature for sampling                      | 0.0               |
| --beam_size     | Beam size for beam search                     | 5                 |
| --best_of       | Number of candidates                          | 5                 |
| --fp16          | Use FP16 precision (GPU)                      | True              |
| --verbose       | Print verbose output                          | False             |

Step 8: Supported Audio & Video Formats

Whisper supports most common formats via FFmpeg:

Supported Formats

  • Audio: MP3, WAV, M4A, FLAC, OGG, AAC, WMA
  • Video: MP4, AVI, MKV, MOV, WebM, FLV
  • Note: real-time streaming is not supported natively; Whisper processes complete files

Format Examples

import whisper

model = whisper.load_model("base")

# Audio formats
model.transcribe("audio.mp3")
model.transcribe("audio.wav")
model.transcribe("audio.m4a")
model.transcribe("audio.flac")

# Video formats (extracts audio automatically)
model.transcribe("video.mp4")
model.transcribe("video.mkv")
model.transcribe("video.webm")

Step 9: Complete Production Example

Here's a complete, production-ready example:
import whisper
import json
from pathlib import Path
from datetime import datetime

class WhisperTranscriber:
    """Production-ready Whisper transcription service."""
    
    def __init__(self, model_size="base"):
        """Initialize transcriber with specified model."""
        print(f"Loading Whisper model: {model_size}")
        self.model = whisper.load_model(model_size)
        print("βœ“ Model loaded successfully")
    
    def transcribe_file(self, audio_path, output_dir="transcripts", **kwargs):
        """
        Transcribe audio file and save results.
        
        Args:
            audio_path: Path to audio file
            output_dir: Directory to save outputs
            **kwargs: Additional transcribe parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        output_path = Path(output_dir)
        output_path.mkdir(exist_ok=True)
        
        print(f"\nTranscribing: {audio_path.name}")
        
        # Transcribe
        result = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )
        
        # Prepare output data
        output_data = {
            "file": str(audio_path),
            "transcribed_at": datetime.now().isoformat(),
            "language": result["language"],
            "language_probability": result.get("language_probability", 0),
            "duration": result["segments"][-1]["end"] if result["segments"] else 0,
            "text": result["text"],
            "segments": result["segments"]
        }
        
        # Save outputs
        base_name = audio_path.stem
        
        # Save as text
        text_file = output_path / f"{base_name}.txt"
        with open(text_file, "w", encoding="utf-8") as f:
            f.write(result["text"])
        
        # Save as JSON
        json_file = output_path / f"{base_name}.json"
        with open(json_file, "w", encoding="utf-8") as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)
        
        # Save as SRT
        srt_file = output_path / f"{base_name}.srt"
        self._save_srt(result["segments"], srt_file)
        
        print(f"βœ“ Transcription saved:")
        print(f"  - Text: {text_file}")
        print(f"  - JSON: {json_file}")
        print(f"  - SRT: {srt_file}")
        
        return output_data
    
    def _save_srt(self, segments, output_path):
        """Save segments as SRT subtitle file."""
        with open(output_path, "w", encoding="utf-8") as f:
            for i, segment in enumerate(segments, start=1):
                start = self._format_srt_time(segment["start"])
                end = self._format_srt_time(segment["end"])
                text = segment["text"].strip()
                f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
    
    def _format_srt_time(self, seconds):
        """Format seconds to SRT timestamp."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Usage
if __name__ == "__main__":
    transcriber = WhisperTranscriber(model_size="base")
    
    result = transcriber.transcribe_file(
        "meeting.mp3",
        output_dir="transcripts",
        language="en",
        temperature=0.0
    )
    
    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"\nTranscription preview:")
    print(result['text'][:200] + "...")

Step 10: Best Practices

1. Choose the Right Model

# For speed (testing, demos)
model = whisper.load_model("tiny")

# For balance (general use)
model = whisper.load_model("base")  # or "small"

# For accuracy (production)
model = whisper.load_model("medium")  # or "large-v2"

2. Specify Language When Known

# Faster and more accurate
result = model.transcribe("audio.mp3", language="en")

# Instead of auto-detection
result = model.transcribe("audio.mp3")  # Slower

3. Use Appropriate Temperature

# Recommended for most cases
result = model.transcribe("audio.mp3", temperature=0.0)

# For creative content (not recommended for transcription)
result = model.transcribe("audio.mp3", temperature=0.2)

4. Provide Context with Initial Prompt

# Technical content
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="This meeting discusses software architecture, APIs, and deployment strategies."
)

# Medical content
result = model.transcribe(
    "consultation.mp3",
    initial_prompt="This is a medical consultation about patient symptoms and treatment."
)

5. Reuse Model Instances

# Load once, reuse multiple times
model = whisper.load_model("base")

# Process multiple files
for audio_file in ["file1.mp3", "file2.mp3", "file3.mp3"]:
    result = model.transcribe(audio_file)
    # Process result...

6. Handle Long Audio Files

For very long audio files, consider chunking:
import os

import whisper
from pydub import AudioSegment  # pip install pydub

def transcribe_long_audio(audio_path, chunk_length_ms=600000):  # 10 minutes
    """Transcribe long audio by splitting it into chunks."""
    model = whisper.load_model("base")
    
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    duration_ms = len(audio)
    
    all_text = []
    all_segments = []
    
    # Process in chunks (note: fixed boundaries can cut words mid-sentence)
    for i in range(0, duration_ms, chunk_length_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        result = model.transcribe(chunk_path)
        all_text.append(result["text"])
        
        # Shift timestamps from chunk-relative to absolute positions
        offset = i / 1000  # milliseconds -> seconds
        for segment in result["segments"]:
            segment["start"] += offset
            segment["end"] += offset
        all_segments.extend(result["segments"])
        
        # Clean up chunk file
        os.remove(chunk_path)
    
    return {
        "text": " ".join(all_text),
        "segments": all_segments
    }
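A usage sketch (the file name is just a placeholder):
result = transcribe_long_audio("lecture.mp3")
print(f"Transcribed {len(result['segments'])} segments")
print(result["text"][:200])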

Common Issues and Solutions

Issue 1: FFmpeg Not Found

Error: FileNotFoundError: ffmpeg
Solution:
# Install FFmpeg
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Verify
ffmpeg -version

Issue 2: Out of Memory

Error: RuntimeError: CUDA out of memory or system runs out of RAM
Solutions:
# Use smaller model
model = whisper.load_model("base")  # Instead of "large"

# Or use CPU
model = whisper.load_model("base", device="cpu")

# Or process in chunks (see above)

Issue 3: Slow Transcription

Problem: Transcription is very slow
Solutions:
# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# Use smaller model
model = whisper.load_model("tiny")  # or "base"

# Reduce beam size (faster but slightly less accurate)
result = model.transcribe("audio.mp3", beam_size=1)

Issue 4: Poor Accuracy

Problem: Transcription has many errors
Solutions:
# Use larger model
model = whisper.load_model("medium")  # or "large"

# Specify language
result = model.transcribe("audio.mp3", language="en")

# Provide context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="Context about the audio content..."
)

# Use optimal settings
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,
    beam_size=5,
    best_of=5
)

Use Cases

1. Podcast Transcription

model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3", language="en")

# Save transcript
with open("podcast_transcript.txt", "w") as f:
    f.write(result["text"])

2. YouTube Subtitle Generation

model = whisper.load_model("base")
result = model.transcribe("video.mp4", language="en")

# Generate SRT
# (Use CLI: whisper video.mp4 --output_format srt)
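To generate the SRT from Python instead of the CLI, the package ships writer helpers in whisper.utils. A sketch, with the caveat that the writer call signature has varied slightly across openai-whisper versions, so verify against your installed release:
from whisper.utils import get_writer

# Write result (from model.transcribe above) as video.srt in the current directory
srt_writer = get_writer("srt", ".")
srt_writer(result, "video.mp4")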

3. Meeting Notes

model = whisper.load_model("base")
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="This is a business meeting discussing project updates and deadlines."
)

# Save with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.0f}s] {segment['text']}")

4. Interview Transcription

model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Export for editing
with open("interview.txt", "w") as f:
    for segment in result["segments"]:
        f.write(f"[{segment['start']:.2f}s] {segment['text']}\n")

5. Multilingual Content Translation

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"])  # English translation

Whisper vs Alternatives

| Feature   | Whisper  | Cloud APIs      | Faster-Whisper |
|-----------|----------|-----------------|----------------|
| Cost      | Free     | Paid per minute | Free           |
| Offline   | βœ…       | ❌              | βœ…             |
| Speed     | Medium   | Fast            | Fast (2-4Γ—)    |
| Accuracy  | High     | High            | High (same)    |
| Setup     | Easy     | Very easy       | Easy           |
| Real-time | ❌       | βœ…              | ❌             |
| Privacy   | βœ… Local | ❌ Cloud        | βœ… Local       |
Choose Whisper when:
  • You want free, offline transcription
  • Privacy is important
  • You have control over infrastructure
  • Processing batch files or archived content
Choose Cloud APIs when:
  • You need real-time transcription
  • You want managed infrastructure
  • You have budget for API costs
  • You need enterprise support

Conclusion

OpenAI Whisper is one of the most powerful open-source speech-to-text models available today. With strong multilingual support, high transcription accuracy, and complete offline capability, it's an excellent choice for developers and content creators who want full control over their transcription workflow.
Key takeaways:
  • Whisper supports 99+ languages with high accuracy
  • Choose the right model size for your needs
  • Specify language when known for better performance
  • Use word timestamps for precise timing
  • Reuse model instances for multiple files
  • Consider faster-whisper for production deployments
Whether you're transcribing podcasts, generating subtitles, or processing meeting recordings, Whisper provides a robust, free, and privacy-preserving solution for speech-to-text transcription.

Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.

Try It Free Now

Try our AI audio and video service! Beyond high-precision speech-to-text transcription, multilingual translation, and intelligent speaker diarization, it offers automatic video subtitle generation, intelligent audio and video content editing, and synchronized audio-visual analysis. It covers scenarios from meeting recordings to short-video creation and podcast production. Start your free trial now!
