πŸŽ‰ We're live! All services are free during our trial periodβ€”pricing plans coming soon.

Whisper Python Example: Complete Guide to Speech-to-Text Transcription

Whisper Python Example: Complete Guide to Speech-to-Text Transcription

Eric King

Eric King

Author


Whisper Python Example: Complete Guide to Speech-to-Text Transcription

OpenAI Whisper is one of the most powerful open-source speech recognition models available today. In this comprehensive guide, you'll learn how to use Whisper with Python to transcribe audio files into text with high accuracy.
This tutorial is perfect for:
  • Developers building speech-to-text features
  • Data scientists working with audio data
  • Anyone looking for a complete Whisper Python example

What Is OpenAI Whisper?

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio data. It can:
  • Transcribe speech in 99+ languages
  • Detect language automatically
  • Translate speech to English
  • Handle noisy audio and accents
  • Process long-form audio files

Prerequisites

Before you start, ensure you have:
  • Python 3.8+ installed
  • pip package manager
  • FFmpeg installed (for audio processing)
  • (Optional) NVIDIA GPU for faster processing

Step 1: Install Whisper

Install the OpenAI Whisper package using pip:
pip install openai-whisper

Install FFmpeg

macOS (using Homebrew):
brew install ffmpeg
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
Windows: Download FFmpeg from ffmpeg.org and add it to your PATH.

Step 2: Basic Whisper Python Example

Here's a simple Python script to transcribe an audio file:
import whisper

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print the transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.

Step 3: Complete Python Example with Error Handling

Here's a more robust example with proper error handling:
import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)
    
    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)
        
        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)
        
        return result
    
    except Exception as e:
        print(f"Error during transcription: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "sample_audio.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    
    if result:
        print("\nTranscription:")
        print(result["text"])

Step 4: Advanced Example with Language Detection

Whisper can automatically detect the language, but you can also specify it:
import whisper

model = whisper.load_model("base")

# Auto-detect language
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Transcription: {result['text']}")

# Specify language explicitly
result_en = model.transcribe("audio.mp3", language="en")
result_zh = model.transcribe("audio.mp3", language="zh")

Step 5: Get Timestamps and Segments

Whisper provides detailed segment information with timestamps:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\nSegments with Timestamps:")
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

Segments with Timestamps:
[0.00s - 2.50s] Hello everyone, welcome to today's meeting.
[2.50s - 5.80s] We will discuss the project timeline.

Step 6: Translate Audio to English

Whisper can translate non-English speech directly to English:
import whisper

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")

print("Translated text:")
print(result["text"])

Step 7: Process Multiple Audio Files

Here's how to transcribe multiple files in batch:
import whisper
import os
from pathlib import Path

def batch_transcribe(audio_directory, model_size="base", output_dir="transcriptions"):
    """
    Transcribe all audio files in a directory.
    
    Args:
        audio_directory (str): Directory containing audio files
        model_size (str): Whisper model size
        output_dir (str): Directory to save transcriptions
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model once
    model = whisper.load_model(model_size)
    
    # Supported audio formats
    audio_extensions = ['.mp3', '.wav', '.m4a', '.flac', '.ogg']
    
    # Process each audio file
    audio_files = [
        f for f in os.listdir(audio_directory)
        if any(f.lower().endswith(ext) for ext in audio_extensions)
    ]
    
    for audio_file in audio_files:
        audio_path = os.path.join(audio_directory, audio_file)
        print(f"\nProcessing: {audio_file}")
        
        try:
            result = model.transcribe(audio_path)
            
            # Save transcription to file
            output_file = os.path.join(
                output_dir,
                Path(audio_file).stem + ".txt"
            )
            
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(result["text"])
            
            print(f"βœ“ Saved: {output_file}")
            
        except Exception as e:
            print(f"βœ— Error processing {audio_file}: {str(e)}")

# Example usage
batch_transcribe("audio_files/", model_size="base")

Step 8: Export to SRT Subtitle Format

Create SRT subtitle files from transcriptions:
import whisper

def transcribe_to_srt(audio_path, output_path, model_size="base"):
    """
    Transcribe audio and save as SRT subtitle file.
    
    Args:
        audio_path (str): Path to audio file
        output_path (str): Path to save SRT file
        model_size (str): Whisper model size
    """
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    
    # Generate SRT content
    srt_content = ""
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_timestamp(segment["start"])
        end_time = format_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_content += f"{i}\n"
        srt_content += f"{start_time} --> {end_time}\n"
        srt_content += f"{text}\n\n"
    
    # Save SRT file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(srt_content)
    
    print(f"SRT file saved: {output_path}")

def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Example usage
transcribe_to_srt("video.mp4", "subtitles.srt", model_size="base")

Whisper Model Sizes Comparison

Choose the right model size based on your needs:
ModelParametersSpeedAccuracyMemoryUse Case
tiny39M⭐⭐⭐⭐⭐⭐⭐~1GBFast testing, simple audio
base74M⭐⭐⭐⭐⭐⭐⭐~1GBGeneral purpose
small244M⭐⭐⭐⭐⭐⭐⭐~2GBBalanced performance
medium769M⭐⭐⭐⭐⭐⭐⭐~5GBHigh accuracy needed
large1550M⭐⭐⭐⭐⭐⭐⭐~10GBBest accuracy, noisy audio

Best Practices for Whisper Python

1. Choose the Right Model Size

# Fast and lightweight
model = whisper.load_model("tiny")  # Good for testing

# Balanced
model = whisper.load_model("base")  # Good for most cases

# High accuracy
model = whisper.load_model("medium")  # For important transcriptions

2. Handle Long Audio Files

For very long audio files, consider chunking:
import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=60000):
    """
    Transcribe long audio by splitting into chunks.
    
    Args:
        audio_path: Path to audio file
        chunk_length_ms: Length of each chunk in milliseconds
    """
    model = whisper.load_model("base")
    
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(audio), chunk_length_ms):
        chunks.append(audio[i:i + chunk_length_ms])
    
    # Transcribe each chunk
    full_text = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        result = model.transcribe(chunk_path)
        full_text.append(result["text"])
        
        # Clean up chunk file
        os.remove(chunk_path)
    
    return " ".join(full_text)

3. Use GPU for Faster Processing

If you have an NVIDIA GPU:
import whisper

# Whisper will automatically use GPU if available
model = whisper.load_model("base", device="cuda")

4. Specify Language for Better Accuracy

# If you know the language, specify it
result = model.transcribe("audio.mp3", language="en")

Common Use Cases

Podcast Transcription

import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3")

# Save transcript
with open("podcast_transcript.txt", "w") as f:
    f.write(result["text"])

Meeting Notes

import whisper
from datetime import datetime

model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")

# Create formatted meeting notes
notes = f"""
Meeting Notes - {datetime.now().strftime('%Y-%m-%d')}
========================================

{result["text"]}
"""

with open("meeting_notes.txt", "w") as f:
    f.write(notes)

Video Subtitles

import whisper

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

# Generate VTT subtitle file
vtt_content = "WEBVTT\n\n"
for segment in result["segments"]:
    start = format_vtt_timestamp(segment["start"])
    end = format_vtt_timestamp(segment["end"])
    text = segment["text"].strip()
    vtt_content += f"{start} --> {end}\n{text}\n\n"

with open("subtitles.vtt", "w") as f:
    f.write(vtt_content)

Troubleshooting Common Issues

Issue 1: FFmpeg Not Found

Error: FileNotFoundError: ffmpeg
Solution:
# Install FFmpeg
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
# Download from ffmpeg.org and add to PATH

Issue 2: Out of Memory

Error: RuntimeError: CUDA out of memory
Solution:
# Use a smaller model
model = whisper.load_model("tiny")  # Instead of "large"

# Or use CPU
model = whisper.load_model("base", device="cpu")

Issue 3: Slow Processing

Solutions:
  • Use a smaller model (tiny or base)
  • Enable GPU acceleration
  • Process audio in chunks
  • Use multiprocessing for batch jobs

Performance Tips

  1. Use GPU when available - 10-50x faster than CPU
  2. Choose appropriate model size - Don't use "large" for simple tasks
  3. Pre-process audio - Remove silence, normalize volume
  4. Batch process - Load model once, process multiple files
  5. Use threading - For I/O-bound operations

Whisper Python vs Other Solutions

FeatureWhisper PythonGoogle Speech-to-TextAssemblyAI
CostFree (local)Paid per minutePaid per minute
Offlineβœ…βŒβŒ
AccuracyHighHighHigh
SetupMediumEasyEasy
Long audioβœ…βœ…βœ…
Multilingualβœ…βœ…βœ…

Complete Example: Production-Ready Script

Here's a complete, production-ready example:
#!/usr/bin/env python3
"""
Production-ready Whisper transcription script.
"""

import whisper
import argparse
import os
import json
from pathlib import Path
from datetime import datetime

def transcribe_file(
    audio_path,
    model_size="base",
    language=None,
    output_format="txt",
    output_dir=None
):
    """
    Transcribe an audio file with comprehensive output options.
    
    Args:
        audio_path: Path to audio file
        model_size: Whisper model size
        language: Language code (optional, auto-detected if None)
        output_format: Output format (txt, json, srt, vtt)
        output_dir: Output directory (default: same as audio file)
    """
    # Validate input file
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    
    # Set output directory
    if output_dir is None:
        output_dir = os.path.dirname(audio_path)
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)
    
    # Transcribe
    print(f"Transcribing: {audio_path}")
    transcribe_kwargs = {}
    if language:
        transcribe_kwargs["language"] = language
    
    result = model.transcribe(audio_path, **transcribe_kwargs)
    
    # Generate output filename
    base_name = Path(audio_path).stem
    output_path = os.path.join(output_dir, base_name)
    
    # Save based on format
    if output_format == "txt":
        with open(f"{output_path}.txt", "w", encoding="utf-8") as f:
            f.write(result["text"])
    
    elif output_format == "json":
        with open(f"{output_path}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    
    elif output_format == "srt":
        srt_content = generate_srt(result["segments"])
        with open(f"{output_path}.srt", "w", encoding="utf-8") as f:
            f.write(srt_content)
    
    elif output_format == "vtt":
        vtt_content = generate_vtt(result["segments"])
        with open(f"{output_path}.vtt", "w", encoding="utf-8") as f:
            f.write(vtt_content)
    
    print(f"βœ“ Transcription saved: {output_path}.{output_format}")
    print(f"  Language: {result['language']}")
    print(f"  Duration: {result['segments'][-1]['end']:.2f}s")
    
    return result

def generate_srt(segments):
    """Generate SRT subtitle content."""
    srt = ""
    for i, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt

def generate_vtt(segments):
    """Generate VTT subtitle content."""
    vtt = "WEBVTT\n\n"
    for segment in segments:
        start = format_vtt_timestamp(segment["start"])
        end = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt += f"{start} --> {end}\n{text}\n\n"
    return vtt

def format_timestamp(seconds):
    """Format timestamp for SRT."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_vtt_timestamp(seconds):
    """Format timestamp for VTT."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using OpenAI Whisper"
    )
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size"
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language code (e.g., 'en', 'zh', 'es')"
    )
    parser.add_argument(
        "--output-format",
        default="txt",
        choices=["txt", "json", "srt", "vtt"],
        help="Output format"
    )
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Output directory"
    )
    
    args = parser.parse_args()
    
    transcribe_file(
        args.audio,
        model_size=args.model,
        language=args.language,
        output_format=args.output_format,
        output_dir=args.output_dir
    )

if __name__ == "__main__":
    main()
Usage:
# Basic usage
python transcribe.py audio.mp3

# With options
python transcribe.py audio.mp3 --model medium --language en --output-format srt

# Save to specific directory
python transcribe.py audio.mp3 --output-dir ./transcriptions

Conclusion

This comprehensive Whisper Python example guide covers everything you need to get started with speech-to-text transcription using OpenAI Whisper. Whether you're transcribing podcasts, meetings, or creating subtitles, Whisper provides a powerful, free solution for converting audio to text.
Key Takeaways:
  • Whisper is free and open-source
  • Supports 99+ languages
  • Works offline (no API calls needed)
  • High accuracy for most use cases
  • Easy to integrate into Python projects
For production use cases requiring real-time transcription or API access, consider using cloud-based solutions like SayToWords, which provides Whisper-powered transcription via API.

Ready to get started? Install Whisper and try transcribing your first audio file today!

Try It Free Now

Try our AI audio and video service! You can not only enjoy high-precision speech-to-text transcription, multilingual translation, and intelligent speaker diarization, but also realize automatic video subtitle generation, intelligent audio and video content editing, and synchronized audio-visual analysis. It covers all scenarios such as meeting recordings, short video creation, and podcast productionβ€”start your free trial now!

Convert MP3 to TextConvert Voice Recording to TextVoice Typing OnlineVoice to Text with TimestampsVoice to Text Real TimeVoice to Text for Long AudioVoice to Text for VideoVoice to Text for YouTubeVoice to Text for Video EditingVoice to Text for SubtitlesVoice to Text for PodcastsVoice to Text for InterviewsInterview Audio to TextVoice to Text for RecordingsVoice to Text for MeetingsVoice to Text for LecturesVoice to Text for NotesVoice to Text Multi LanguageVoice to Text AccurateVoice to Text FastPremiere Pro Voice to Text AlternativeDaVinci Voice to Text AlternativeVEED Voice to Text AlternativeInVideo Voice to Text AlternativeOtter.ai Voice to Text AlternativeDescript Voice to Text AlternativeTrint Voice to Text AlternativeRev Voice to Text AlternativeSonix Voice to Text AlternativeHappy Scribe Voice to Text AlternativeZoom Voice to Text AlternativeGoogle Meet Voice to Text AlternativeMicrosoft Teams Voice to Text AlternativeFireflies.ai Voice to Text AlternativeFathom Voice to Text AlternativeFlexClip Voice to Text AlternativeKapwing Voice to Text AlternativeCanva Voice to Text AlternativeSpeech to Text for Long AudioAI Voice to TextVoice to Text FreeVoice to Text No AdsVoice to Text for Noisy AudioVoice to Text with TimeGenerate Subtitles from AudioPodcast Transcription OnlineTranscribe Customer CallsTikTok Voice to TextTikTok Audio to TextYouTube Voice to TextYouTube Audio to TextMemo Voice to TextWhatsApp Voice Message to TextTelegram Voice to TextDiscord Call TranscriptionTwitch Voice to TextSkype Voice to TextMessenger Voice to TextLINE Voice Message to TextTranscribe Vlogs to TextConvert Sermon Audio to TextConvert Talking to WritingTranslate Audio to TextTurn Audio Notes to TextVoice TypingVoice Typing for MeetingsVoice Typing for YouTubeSpeak to TypeHands-Free TypingVoice to WordsSpeech to WordsSpeech to Text OnlineSpeech to Text for MeetingsFast Speech to TextTikTok Speech to TextTikTok Sound to TextTalking to WordsTalk to TextAudio to TypingSound to TextVoice Writing ToolSpeech Writing ToolVoice DictationLegal Transcription ToolMedical Voice Dictation ToolJapanese Audio TranscriptionKorean Meeting TranscriptionMeeting Transcription ToolMeeting Audio to TextLecture to Text ConverterLecture Audio to TextVideo to Text TranscriptionSubtitle Generator for TikTokCall Center TranscriptionReels Audio to Text ToolTranscribe MP3 to TextTranscribe WAV File to TextCapCut Voice to TextCapCut Speech to TextVoice to Text in EnglishAudio to Text EnglishVoice to Text in SpanishVoice to Text in FrenchAudio to Text FrenchVoice to Text in GermanAudio to Text GermanVoice to Text in JapaneseAudio to Text JapaneseVoice to Text in KoreanAudio to Text KoreanVoice to Text in PortugueseVoice to Text in ArabicVoice to Text in ChineseVoice to Text in HindiVoice to Text in RussianWeb Voice Typing ToolVoice Typing Website