
Faster-Whisper Guide: Faster Speech-to-Text with CTranslate2
Eric King
Faster-whisper is a high-performance reimplementation of OpenAI's Whisper model using CTranslate2, a fast transformer inference engine. It provides 2-4× faster transcription with similar accuracy, making it ideal for production deployments and batch processing.
This comprehensive guide covers everything you need to know about faster-whisper, including installation, usage examples, performance optimization, and when to choose it over the standard OpenAI Whisper.
What Is Faster-Whisper?
Faster-whisper is an optimized implementation of OpenAI Whisper that uses CTranslate2 for faster inference. It maintains the same accuracy as the original Whisper while significantly improving speed and reducing memory usage.
Key Features
- 2-4× faster inference compared to OpenAI Whisper
- Lower memory usage with quantization support
- Same accuracy as original Whisper models
- GPU and CPU support with optimized backends
- Batch processing for multiple files
- Word-level timestamps support
- Quantization options (FP32, FP16, INT8, INT8_FLOAT16)
- Voice activity detection (VAD) filtering
How It Works
Faster-whisper converts Whisper models to CTranslate2 format, which uses optimized C++ code for inference. This provides:
- Faster matrix operations with optimized BLAS libraries
- Better memory management with reduced overhead
- Quantization support for lower memory usage
- Batch processing for throughput optimization
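The conversion itself happens ahead of time with CTranslate2's Transformers converter, which is also what you would use for a fine-tuned Whisper checkpoint. A minimal sketch, assuming ctranslate2 and transformers are installed; converter options may differ slightly between CTranslate2 versions:
import ctranslate2

# Sketch: convert a Hugging Face Whisper checkpoint (official or fine-tuned) to CTranslate2 format.
# The exact converter parameters may vary by CTranslate2 version.
converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-base",  # model ID or path to a local checkpoint
    copy_files=["tokenizer.json", "preprocessor_config.json"]  # keep tokenizer files next to the weights
)
converter.convert("whisper-base-ct2", quantization="float16")

# The output directory can then be passed straight to WhisperModel("whisper-base-ct2").
The standard model sizes ("tiny" through "large-v2") are downloaded already converted, so this step is only needed for custom or fine-tuned models.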
Faster-Whisper vs OpenAI Whisper
Performance Comparison
| Feature | OpenAI Whisper | Faster-Whisper |
|---|---|---|
| Speed | Baseline | 2-4× faster |
| Memory Usage | Higher | Lower (with quantization) |
| Accuracy | High | Same (identical models) |
| GPU Support | Yes | Yes (optimized) |
| CPU Support | Yes | Yes (optimized) |
| Quantization | Limited | Full support (INT8, FP16) |
| Batch Processing | Manual | Built-in support |
| Installation | Simple | Simple (includes CTranslate2) |
When to Use Faster-Whisper
Choose faster-whisper when:
- You need faster transcription for production workloads
- You're processing multiple files in batch
- You're running on resource-constrained systems (use INT8)
- You're building real-time or near-real-time applications
- You need lower memory usage for deployment
Stick with OpenAI Whisper when:
- You need maximum compatibility with existing code
- You're using fine-tuned models (faster-whisper requires converting them first)
- You prefer a simpler API (though faster-whisper's is similar)
- You're working with experimental features that land in OpenAI Whisper first
Installation
Prerequisites
- Python 3.9+ (required)
- FFmpeg (optional - faster-whisper decodes audio with PyAV, which bundles FFmpeg, so a system install is usually unnecessary)
- NVIDIA GPU (optional, for GPU acceleration)
Basic Installation
Install faster-whisper using pip:
pip install faster-whisper
This automatically installs:
- faster-whisper package
- ctranslate2 (CTranslate2 inference engine)
- PyAV (audio decoding, replaces the FFmpeg dependency)
GPU Installation (NVIDIA CUDA)
For GPU acceleration, you need CUDA libraries:
CUDA 12 (Recommended):
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
Set the library path:
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
CUDA 11 (Legacy):
If you have CUDA 11, use an older CTranslate2 version:
pip install ctranslate2==3.24.0 faster-whisper
Verify Installation
from faster_whisper import WhisperModel
# Test basic import
print("Faster-whisper installed successfully!")
Basic Usage
Simple Transcription
from faster_whisper import WhisperModel
# Load model (automatically downloads if not present)
model = WhisperModel("base", device="cpu", compute_type="int8")
# Transcribe audio
segments, info = model.transcribe("audio.mp3")
# Print detected language
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
# Print transcription
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Get Full Text
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe("audio.mp3")
# Collect all text
full_text = " ".join([segment.text for segment in segments])
print(full_text)
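One thing to keep in mind: segments is a lazy generator, so the transcription only runs as you iterate over it, and it can be consumed just once. If you need to go over the results more than once, materialize them first; a small sketch:
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3")
segments = list(segments)  # materializing runs the full transcription and makes the results reusable

print(f"Transcribed {len(segments)} segments")
full_text = " ".join(segment.text for segment in segments)
print(full_text)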
With Word Timestamps
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio.mp3",
    word_timestamps=True,
    beam_size=5
)

for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
    # Word-level timestamps
    for word in segment.words:
        print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")
Device and Compute Type Options
Device Options
device="cpu"- CPU inference (works everywhere)device="cuda"- GPU inference (requires NVIDIA GPU and CUDA)
Compute Types
Choose based on your hardware and speed/accuracy trade-offs:
| Compute Type | Speed | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| int8 | Fastest | Lowest | Slightly lower | CPU, resource-constrained |
| int8_float16 | Very fast | Low | High | GPU with limited VRAM |
| float16 | Fast | Medium | High | GPU (recommended) |
| float32 | Slowest | Highest | Highest | Maximum accuracy |
Examples by Hardware
CPU (Intel/AMD):
# Best for CPU: INT8
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU (NVIDIA):
# Best for GPU: FP16
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with Limited VRAM:
# Use INT8_FLOAT16 for large models
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
Maximum Accuracy:
# Use FP32 (slower but most accurate)
model = WhisperModel("large-v2", device="cuda", compute_type="float32")
Advanced Features
1. Batch Processing
Process multiple audio files efficiently:
from faster_whisper import WhisperModel
from pathlib import Path
model = WhisperModel("base", device="cuda", compute_type="float16")
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
for audio_file in audio_files:
    print(f"Transcribing: {audio_file}")
    segments, info = model.transcribe(audio_file)
    text = " ".join([seg.text for seg in segments])
    print(f"Result: {text[:100]}...")
    print()
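The loop above reuses one model and handles files one at a time. Recent faster-whisper releases also ship a BatchedInferencePipeline that splits a single audio file into chunks and decodes them in parallel, which is typically what "batch=8" style benchmark figures refer to. A minimal sketch, assuming a version of faster-whisper that includes this pipeline (check your installed version):
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("base", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel
segments, info = batched_model.transcribe("audio.mp3", batch_size=8)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")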
2. Voice Activity Detection (VAD)
Filter out silence and non-speech segments:
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # Enable VAD filtering
    vad_parameters=dict(
        min_silence_duration_ms=500,  # Minimum silence duration
        threshold=0.5  # VAD threshold
    )
)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
3. Language Specification
Specify language to improve accuracy and speed:
from faster_whisper import WhisperModel
model = WhisperModel("base")
# Specify language (faster and more accurate)
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # English
)
# Or let it auto-detect
segments, info = model.transcribe("audio.mp3") # Auto-detect
print(f"Detected: {info.language}")
4. Beam Size and Other Parameters
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,  # Higher = more accurate but slower (default: 5)
    best_of=5,  # Number of candidates to consider
    temperature=0.0,  # Lower = more deterministic
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."
)
5. Custom Model Paths
Use local models or custom converted models:
from faster_whisper import WhisperModel
# Use local model directory
model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
    download_root="./models"  # Custom download directory
)

# Or specify full path to converted model
model = WhisperModel(
    "/path/to/converted/model",
    device="cuda",
    compute_type="float16"
)
Performance Benchmarks
GPU Performance (NVIDIA RTX 3070 Ti)
Transcribing ~13 minutes of audio:
| Setup | Time | VRAM Usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP16, beam=5) | ~2m 23s | ~4708 MB | Baseline |
| Faster-whisper (FP16, beam=5) | ~1m 03s | ~4525 MB | 2.3× faster |
| Faster-whisper (INT8, beam=5) | ~59s | ~2926 MB | 2.4× faster |
| Faster-whisper (FP16, batch=8) | ~17s | ~6090 MB | 8.4× faster |
| Faster-whisper (INT8, batch=8) | ~16s | ~4500 MB | 8.9× faster |
CPU Performance (Intel Core i7-12700K)
| Setup | Time | RAM Usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP32, beam=5) | ~6m 58s | ~2335 MB | Baseline |
| Faster-whisper (FP32, beam=5) | ~2m 37s | ~2257 MB | 2.7× faster |
| Faster-whisper (INT8, beam=5) | ~1m 42s | ~1477 MB | 4.1× faster |
| Faster-whisper (FP32, batch=8) | ~1m 06s | ~4230 MB | 6.3× faster |
| Faster-whisper (INT8, batch=8) | ~51s | ~3608 MB | 8.2× faster |
Key Insights
- Batch processing provides the biggest speedup (8×+ on GPU)
- INT8 quantization cuts memory use by roughly a third with minimal accuracy loss
- GPU acceleration is essential for large models and batch processing
- CPU with INT8 is viable for smaller models and single-file processing
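These numbers depend heavily on hardware, model size, and decoding settings, so it is worth measuring on your own audio. A minimal timing sketch:
import time
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

start = time.perf_counter()
segments, info = model.transcribe("audio.mp3")
text = " ".join(segment.text for segment in segments)  # iterating the segments triggers the actual decoding
elapsed = time.perf_counter() - start

print(f"Transcribed {info.duration:.1f}s of audio in {elapsed:.1f}s "
      f"({info.duration / elapsed:.1f}x real time)")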
Complete Example: Production-Ready Transcription
from faster_whisper import WhisperModel
from pathlib import Path
import json


class TranscriptionService:
    """Production-ready transcription service using faster-whisper."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        """Initialize the transcription service."""
        print(f"Loading model: {model_size} on {device} ({compute_type})")
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print("Model loaded successfully!")

    def transcribe_file(self, audio_path, output_format="txt", **kwargs):
        """
        Transcribe an audio file.

        Args:
            audio_path: Path to audio file
            output_format: Output format (txt, json, srt, vtt)
            **kwargs: Additional transcription parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        print(f"Transcribing: {audio_path.name}")

        # Transcribe
        segments, info = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )

        # Collect results
        result = {
            "file": str(audio_path),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }

        full_text_parts = []
        for segment in segments:
            segment_data = {
                "start": segment.start,
                "end": segment.end,
                "text": segment.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end,
                        "probability": word.probability
                    }
                    for word in segment.words
                ]
            }
            result["segments"].append(segment_data)
            full_text_parts.append(segment.text)

        result["text"] = " ".join(full_text_parts)

        # Save based on format
        output_path = audio_path.parent / f"{audio_path.stem}_transcript"
        if output_format == "txt":
            self._save_txt(result, output_path.with_suffix(".txt"))
        elif output_format == "json":
            self._save_json(result, output_path.with_suffix(".json"))
        elif output_format == "srt":
            self._save_srt(result, output_path.with_suffix(".srt"))
        elif output_format == "vtt":
            self._save_vtt(result, output_path.with_suffix(".vtt"))

        print(f"✓ Transcription saved: {output_path}.{output_format}")
        return result

    def _save_txt(self, result, path):
        """Save as plain text."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(result["text"])

    def _save_json(self, result, path):
        """Save as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)

    def _save_srt(self, result, path):
        """Save as SRT subtitles."""
        with open(path, "w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                start = self._format_srt_time(seg["start"])
                end = self._format_srt_time(seg["end"])
                f.write(f"{i}\n{start} --> {end}\n{seg['text']}\n\n")

    def _save_vtt(self, result, path):
        """Save as WebVTT."""
        with open(path, "w", encoding="utf-8") as f:
            f.write("WEBVTT\n\n")
            for seg in result["segments"]:
                start = self._format_vtt_time(seg["start"])
                end = self._format_vtt_time(seg["end"])
                f.write(f"{start} --> {end}\n{seg['text']}\n\n")

    def _format_srt_time(self, seconds):
        """Format time for SRT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

    def _format_vtt_time(self, seconds):
        """Format time for VTT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


# Usage
if __name__ == "__main__":
    # Initialize service
    service = TranscriptionService(
        model_size="base",
        device="cpu",  # Change to "cuda" for GPU
        compute_type="int8"  # Use "float16" for GPU
    )

    # Transcribe file
    result = service.transcribe_file(
        "meeting.mp3",
        output_format="json",
        beam_size=5,
        language="en"
    )

    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"Text: {result['text'][:200]}...")
Best Practices
1. Choose the Right Model Size
# For speed (CPU)
model = WhisperModel("tiny", device="cpu", compute_type="int8")
# For balance
model = WhisperModel("base", device="cpu", compute_type="int8")
# For accuracy (GPU recommended)
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
2. Optimize for Your Hardware
CPU-only systems:
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU with sufficient VRAM:
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with limited VRAM:
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")
3. Use Batch Processing for Multiple Files
# Process multiple files efficiently
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
model = WhisperModel("base", device="cuda", compute_type="float16")
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
    # Process results...
4. Enable VAD for Noisy Audio
segments, info = model.transcribe(
    "noisy_audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=1000,
        threshold=0.5
    )
)
5. Specify Language When Known
# Faster and more accurate when language is known
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # Specify instead of auto-detect
)
6. Reuse Model Instances
# Load model once, reuse for multiple files
model = WhisperModel("base")
# Process multiple files with same model
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
Migration from OpenAI Whisper
Code Comparison
OpenAI Whisper:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
text = " ".join([seg.text for seg in segments])
print(text)
Key Differences
- Model Loading: WhisperModel() instead of whisper.load_model()
- Return Format: Returns a (segments, info) tuple instead of a dict
- Segments: An iterator of segment objects instead of a list
- Device/Compute Type: device and compute_type are specified explicitly (both have defaults)
- Text Access: Join the segments to get the full text
Migration Helper Function
def convert_to_whisper_format(segments, info):
    """Convert faster-whisper output to OpenAI Whisper format."""
    segments = list(segments)  # segments is a generator; materialize it so it can be iterated twice
    return {
        "text": " ".join([seg.text for seg in segments]),
        "language": info.language,
        "segments": [
            {
                "id": i,
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end
                    }
                    for word in seg.words
                ] if seg.words else []  # words is None unless word_timestamps=True
            }
            for i, seg in enumerate(segments)
        ]
    }

# Usage
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
result = convert_to_whisper_format(segments, info)
# Now compatible with OpenAI Whisper format
Troubleshooting
Issue 1: CUDA Out of Memory
Problem: GPU runs out of memory with large models.
Solutions:
# Use smaller model
model = WhisperModel("base", device="cuda", compute_type="float16")
# Or use INT8 quantization
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
# Or use CPU
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
Issue 2: Slow CPU Performance
Problem: Transcription is slow on CPU.
Solutions:
# Use INT8 quantization
model = WhisperModel("base", device="cpu", compute_type="int8")
# Use smaller model
model = WhisperModel("tiny", device="cpu", compute_type="int8")
# Reduce beam size
segments, info = model.transcribe("audio.mp3", beam_size=1)
Issue 3: CUDA Libraries Not Found
Problem: RuntimeError: CUDA runtime not found
Solution:
# Install CUDA libraries
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
# Set library path
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
Issue 4: Model Download Fails
Problem: Model download times out or fails.
Solution:
# Specify download directory
model = WhisperModel(
    "base",
    download_root="./models",  # Custom directory
    local_files_only=False
)
# Or download manually from Hugging Face
# Then use local path
model = WhisperModel("/path/to/local/model")
When to Use Faster-Whisper
Use Faster-Whisper When:
✅ Production deployments requiring speed
✅ Batch processing multiple files
✅ Resource-constrained environments (use INT8)
✅ Real-time or near-real-time applications
✅ GPU acceleration is available
✅ Lower memory usage is important
Use OpenAI Whisper When:
✅ Maximum compatibility with existing code
✅ Fine-tuned models (easier integration)
✅ Simpler API preference
✅ Experimental features first available in OpenAI Whisper
✅ Learning/development (more documentation/examples)
Conclusion
Faster-whisper provides significant performance improvements over OpenAI Whisper while maintaining the same accuracy. With proper configuration, you can expect a 2-4× speedup for single-file transcription and roughly 8× with batch processing, on both CPU and GPU.
Key takeaways:
- Use INT8 for CPU and resource-constrained systems
- Use FP16 for GPU with sufficient VRAM
- Enable batch processing for multiple files
- Specify language when known for better performance
- Reuse model instances for multiple transcriptions
For more information about Whisper transcription, check out our guides on Whisper Python Example, Whisper Accuracy Tips, and Whisper Transcript Formatting.
Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.
