
Faster-Whisper Guide: Faster Speech-to-Text with CTranslate2
Eric King
Faster-whisper is a high-performance reimplementation of OpenAI's Whisper model using CTranslate2, a fast transformer inference engine. It provides 2-4× faster transcription with similar accuracy, making it ideal for production deployments and batch processing.
This comprehensive guide covers everything you need to know about faster-whisper, including installation, usage examples, performance optimization, and when to choose it over the standard OpenAI Whisper.
What Is Faster-Whisper?
Faster-whisper is an optimized implementation of OpenAI Whisper that uses CTranslate2 for faster inference. It maintains the same accuracy as the original Whisper while significantly improving speed and reducing memory usage.
Key Features
- 2-4× faster inference compared to OpenAI Whisper
- Lower memory usage with quantization support
- Same accuracy as original Whisper models
- GPU and CPU support with optimized backends
- Batch processing for multiple files
- Word-level timestamps support
- Quantization options (FP32, FP16, INT8, INT8_FLOAT16)
- Voice activity detection (VAD) filtering
How It Works
Faster-whisper converts Whisper models to CTranslate2 format, which uses optimized C++ code for inference. This provides:
- Faster matrix operations with optimized BLAS libraries
- Better memory management with reduced overhead
- Quantization support for lower memory usage
- Batch processing for throughput optimization
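The conversion itself happens ahead of time with CTranslate2's Transformers converter, which is also what you would use for a fine-tuned Whisper checkpoint. A minimal sketch, assuming ctranslate2 and transformers are installed; converter options may differ slightly between CTranslate2 versions:
import ctranslate2

# Sketch: convert a Hugging Face Whisper checkpoint (official or fine-tuned) to CTranslate2 format.
# The exact converter parameters may vary by CTranslate2 version.
converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-base",  # model ID or path to a local checkpoint
    copy_files=["tokenizer.json", "preprocessor_config.json"]  # keep tokenizer files next to the weights
)
converter.convert("whisper-base-ct2", quantization="float16")

# The output directory can then be passed straight to WhisperModel("whisper-base-ct2").
The standard model sizes ("tiny" through "large-v2") are downloaded already converted, so this step is only needed for custom or fine-tuned models.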
Faster-Whisper vs OpenAI Whisper
Performance Comparison
| Feature | OpenAI Whisper | Faster-Whisper |
|---|---|---|
| Speed | Baseline | 2-4× faster |
| Memory Usage | Higher | Lower (with quantization) |
| Accuracy | High | Same (identical models) |
| GPU Support | Yes | Yes (optimized) |
| CPU Support | Yes | Yes (optimized) |
| Quantization | Limited | Full support (INT8, FP16) |
| Batch Processing | Manual | Built-in support |
| Installation | Simple | Simple (includes CTranslate2) |
When to Use Faster-Whisper
Choose faster-whisper when:
- You need faster transcription for production workloads
- You're processing multiple files in batch
- You're running on resource-constrained systems (use INT8)
- You're building real-time or near-real-time applications
- You need lower memory usage for deployment
Stick with OpenAI Whisper when:
- You need maximum compatibility with existing code
- You're using fine-tuned models (faster-whisper requires converting them first)
- You prefer a simpler API (though faster-whisper's is similar)
- You're working with experimental features that land in OpenAI Whisper first
Installation
Prerequisites
- Python 3.9+ (required)
- FFmpeg (optional - faster-whisper decodes audio with PyAV, which bundles FFmpeg, so a system install is usually unnecessary)
- NVIDIA GPU (optional, for GPU acceleration)
Basic Installation
Install faster-whisper using pip:
pip install faster-whisper
This automatically installs:
- faster-whisper package
- ctranslate2 (CTranslate2 inference engine)
- PyAV (audio decoding, replaces the FFmpeg dependency)
GPU Installation (NVIDIA CUDA)
For GPU acceleration, you need CUDA libraries:
CUDA 12 (Recommended):
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
Set the library path:
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
CUDA 11 (Legacy):
If you have CUDA 11, use an older CTranslate2 version:
pip install ctranslate2==3.24.0 faster-whisper
Verify Installation
from faster_whisper import WhisperModel
# Test basic import
print("Faster-whisper installed successfully!")
Basic Usage
Simple Transcription
from faster_whisper import WhisperModel
# Load model (automatically downloads if not present)
model = WhisperModel("base", device="cpu", compute_type="int8")
# Transcribe audio
segments, info = model.transcribe("audio.mp3")
# Print detected language
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
# Print transcription
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Get Full Text
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe("audio.mp3")
# Collect all text
full_text = " ".join([segment.text for segment in segments])
print(full_text)
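One thing to keep in mind: segments is a lazy generator, so the transcription only runs as you iterate over it, and it can be consumed just once. If you need to go over the results more than once, materialize them first; a small sketch:
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3")
segments = list(segments)  # materializing runs the full transcription and makes the results reusable

print(f"Transcribed {len(segments)} segments")
full_text = " ".join(segment.text for segment in segments)
print(full_text)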
With Word Timestamps
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio.mp3",
    word_timestamps=True,
    beam_size=5
)

for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
    # Word-level timestamps
    for word in segment.words:
        print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")
Device and Compute Type Options
Device Options
device="cpu"- CPU inference (works everywhere)device="cuda"- GPU inference (requires NVIDIA GPU and CUDA)
Compute Types
Choose based on your hardware and speed/accuracy trade-offs:
| Compute Type | Speed | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| int8 | Fastest | Lowest | Slightly lower | CPU, resource-constrained |
| int8_float16 | Very fast | Low | High | GPU with limited VRAM |
| float16 | Fast | Medium | High | GPU (recommended) |
| float32 | Slowest | Highest | Highest | Maximum accuracy |
Examples by Hardware
CPU (Intel/AMD):
# Best for CPU: INT8
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU (NVIDIA):
# Best for GPU: FP16
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with Limited VRAM:
# Use INT8_FLOAT16 for large models
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
Maximum Accuracy:
# Use FP32 (slower but most accurate)
model = WhisperModel("large-v2", device="cuda", compute_type="float32")
Advanced Features
1. Batch Processing
Process multiple audio files efficiently:
from faster_whisper import WhisperModel
from pathlib import Path
model = WhisperModel("base", device="cuda", compute_type="float16")
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
for audio_file in audio_files:
    print(f"Transcribing: {audio_file}")
    segments, info = model.transcribe(audio_file)
    text = " ".join([seg.text for seg in segments])
    print(f"Result: {text[:100]}...")
    print()
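The loop above reuses one model and handles files one at a time. Recent faster-whisper releases also ship a BatchedInferencePipeline that splits a single audio file into chunks and decodes them in parallel, which is typically what "batch=8" style benchmark figures refer to. A minimal sketch, assuming a version of faster-whisper that includes this pipeline (check your installed version):
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("base", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel
segments, info = batched_model.transcribe("audio.mp3", batch_size=8)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")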
2. Voice Activity Detection (VAD)
Filter out silence and non-speech segments:
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # Enable VAD filtering
    vad_parameters=dict(
        min_silence_duration_ms=500,  # Minimum silence duration
        threshold=0.5  # VAD threshold
    )
)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
3. Language Specification
Specify language to improve accuracy and speed:
from faster_whisper import WhisperModel
model = WhisperModel("base")
# Specify language (faster and more accurate)
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # English
)
# Or let it auto-detect
segments, info = model.transcribe("audio.mp3") # Auto-detect
print(f"Detected: {info.language}")
4. Beam Size and Other Parameters
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,  # Higher = more accurate but slower (default: 5)
    best_of=5,  # Number of candidates to consider
    temperature=0.0,  # Lower = more deterministic
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."
)
5. Custom Model Paths
Use local models or custom converted models:
from faster_whisper import WhisperModel
# Use local model directory
model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
    download_root="./models"  # Custom download directory
)

# Or specify full path to converted model
model = WhisperModel(
    "/path/to/converted/model",
    device="cuda",
    compute_type="float16"
)
Performance Benchmarks
GPU Performance (NVIDIA RTX 3070 Ti)
Transcribing ~13 minutes of audio:
| Setup | Time | VRAM Usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP16, beam=5) | ~2m 23s | ~4708 MB | Baseline |
| Faster-whisper (FP16, beam=5) | ~1m 03s | ~4525 MB | 2.3× faster |
| Faster-whisper (INT8, beam=5) | ~59s | ~2926 MB | 2.4× faster |
| Faster-whisper (FP16, batch=8) | ~17s | ~6090 MB | 8.4× faster |
| Faster-whisper (INT8, batch=8) | ~16s | ~4500 MB | 8.9× faster |
CPU Performance (Intel Core i7-12700K)
| Setup | Time | RAM Usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP32, beam=5) | ~6m 58s | ~2335 MB | Baseline |
| Faster-whisper (FP32, beam=5) | ~2m 37s | ~2257 MB | 2.7× faster |
| Faster-whisper (INT8, beam=5) | ~1m 42s | ~1477 MB | 4.1× faster |
| Faster-whisper (FP32, batch=8) | ~1m 06s | ~4230 MB | 6.3× faster |
| Faster-whisper (INT8, batch=8) | ~51s | ~3608 MB | 8.2× faster |
Key Insights
- Batch processing provides the biggest speedup (8×+ on GPU)
- INT8 quantization cuts memory use by roughly a third with minimal accuracy loss
- GPU acceleration is essential for large models and batch processing
- CPU with INT8 is viable for smaller models and single-file processing
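These numbers depend heavily on hardware, model size, and decoding settings, so it is worth measuring on your own audio. A minimal timing sketch:
import time
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

start = time.perf_counter()
segments, info = model.transcribe("audio.mp3")
text = " ".join(segment.text for segment in segments)  # iterating the segments triggers the actual decoding
elapsed = time.perf_counter() - start

print(f"Transcribed {info.duration:.1f}s of audio in {elapsed:.1f}s "
      f"({info.duration / elapsed:.1f}x real time)")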
Complete Example: Production-Ready Transcription
from faster_whisper import WhisperModel
from pathlib import Path
import json


class TranscriptionService:
    """Production-ready transcription service using faster-whisper."""

    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        """Initialize the transcription service."""
        print(f"Loading model: {model_size} on {device} ({compute_type})")
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print("Model loaded successfully!")

    def transcribe_file(self, audio_path, output_format="txt", **kwargs):
        """
        Transcribe an audio file.

        Args:
            audio_path: Path to audio file
            output_format: Output format (txt, json, srt, vtt)
            **kwargs: Additional transcription parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        print(f"Transcribing: {audio_path.name}")

        # Transcribe
        segments, info = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )

        # Collect results
        result = {
            "file": str(audio_path),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }

        full_text_parts = []
        for segment in segments:
            segment_data = {
                "start": segment.start,
                "end": segment.end,
                "text": segment.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end,
                        "probability": word.probability
                    }
                    for word in segment.words
                ]
            }
            result["segments"].append(segment_data)
            full_text_parts.append(segment.text)

        result["text"] = " ".join(full_text_parts)

        # Save based on format
        output_path = audio_path.parent / f"{audio_path.stem}_transcript"
        if output_format == "txt":
            self._save_txt(result, output_path.with_suffix(".txt"))
        elif output_format == "json":
            self._save_json(result, output_path.with_suffix(".json"))
        elif output_format == "srt":
            self._save_srt(result, output_path.with_suffix(".srt"))
        elif output_format == "vtt":
            self._save_vtt(result, output_path.with_suffix(".vtt"))

        print(f"✓ Transcription saved: {output_path}.{output_format}")
        return result

    def _save_txt(self, result, path):
        """Save as plain text."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(result["text"])

    def _save_json(self, result, path):
        """Save as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)

    def _save_srt(self, result, path):
        """Save as SRT subtitles."""
        with open(path, "w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                start = self._format_srt_time(seg["start"])
                end = self._format_srt_time(seg["end"])
                f.write(f"{i}\n{start} --> {end}\n{seg['text']}\n\n")

    def _save_vtt(self, result, path):
        """Save as WebVTT."""
        with open(path, "w", encoding="utf-8") as f:
            f.write("WEBVTT\n\n")
            for seg in result["segments"]:
                start = self._format_vtt_time(seg["start"])
                end = self._format_vtt_time(seg["end"])
                f.write(f"{start} --> {end}\n{seg['text']}\n\n")

    def _format_srt_time(self, seconds):
        """Format time for SRT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

    def _format_vtt_time(self, seconds):
        """Format time for VTT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


# Usage
if __name__ == "__main__":
    # Initialize service
    service = TranscriptionService(
        model_size="base",
        device="cpu",  # Change to "cuda" for GPU
        compute_type="int8"  # Use "float16" for GPU
    )

    # Transcribe file
    result = service.transcribe_file(
        "meeting.mp3",
        output_format="json",
        beam_size=5,
        language="en"
    )

    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"Text: {result['text'][:200]}...")
Best Practices
1. Choose the Right Model Size
# For speed (CPU)
model = WhisperModel("tiny", device="cpu", compute_type="int8")
# For balance
model = WhisperModel("base", device="cpu", compute_type="int8")
# For accuracy (GPU recommended)
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
2. Optimize for Your Hardware
CPU-only systems:
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU with sufficient VRAM:
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with limited VRAM:
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")
3. Use Batch Processing for Multiple Files
# Process multiple files efficiently
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
model = WhisperModel("base", device="cuda", compute_type="float16")
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
    # Process results...
4. Enable VAD for Noisy Audio
segments, info = model.transcribe(
    "noisy_audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=1000,
        threshold=0.5
    )
)
5. Specify Language When Known
# Faster and more accurate when language is known
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # Specify instead of auto-detect
)
6. Reuse Model Instances
# Load model once, reuse for multiple files
model = WhisperModel("base")
# Process multiple files with same model
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
Migration from OpenAI Whisper
Code Comparison
OpenAI Whisper:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
text = " ".join([seg.text for seg in segments])
print(text)
Key Differences
- Model Loading: WhisperModel() instead of whisper.load_model()
- Return Format: Returns a (segments, info) tuple instead of a dict
- Segments: An iterator of segment objects instead of a list
- Device/Compute Type: device and compute_type are specified explicitly (both have defaults)
- Text Access: Join the segments to get the full text
Migration Helper Function
def convert_to_whisper_format(segments, info):
    """Convert faster-whisper output to OpenAI Whisper format."""
    segments = list(segments)  # segments is a generator; materialize it so it can be iterated twice
    return {
        "text": " ".join([seg.text for seg in segments]),
        "language": info.language,
        "segments": [
            {
                "id": i,
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end
                    }
                    for word in seg.words
                ] if seg.words else []  # words is None unless word_timestamps=True
            }
            for i, seg in enumerate(segments)
        ]
    }

# Usage
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
result = convert_to_whisper_format(segments, info)
# Now compatible with OpenAI Whisper format
Troubleshooting
Issue 1: CUDA Out of Memory
Problem: GPU runs out of memory with large models.
Solutions:
# Use smaller model
model = WhisperModel("base", device="cuda", compute_type="float16")
# Or use INT8 quantization
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
# Or use CPU
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
Issue 2: Slow CPU Performance
Problem: Transcription is slow on CPU.
Solutions:
# Use INT8 quantization
model = WhisperModel("base", device="cpu", compute_type="int8")
# Use smaller model
model = WhisperModel("tiny", device="cpu", compute_type="int8")
# Reduce beam size
segments, info = model.transcribe("audio.mp3", beam_size=1)
Issue 3: CUDA Libraries Not Found
Problem: RuntimeError: CUDA runtime not found
Solution:
# Install CUDA libraries
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
# Set library path
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
Issue 4: Model Download Fails
Problem: Model download times out or fails.
Solution:
# Specify download directory
model = WhisperModel(
    "base",
    download_root="./models",  # Custom directory
    local_files_only=False
)
# Or download manually from Hugging Face
# Then use local path
model = WhisperModel("/path/to/local/model")
When to Use Faster-Whisper
Use Faster-Whisper When:
✅ Production deployments requiring speed
✅ Batch processing multiple files
✅ Resource-constrained environments (use INT8)
✅ Real-time or near-real-time applications
✅ GPU acceleration is available
✅ Lower memory usage is important
Use OpenAI Whisper When:
✅ Maximum compatibility with existing code
✅ Fine-tuned models (easier integration)
✅ Simpler API preference
✅ Experimental features first available in OpenAI Whisper
✅ Learning/development (more documentation/examples)
Conclusion
Faster-whisper provides significant performance improvements over OpenAI Whisper while maintaining the same accuracy. With proper configuration, you can expect a 2-4× speedup for single-file transcription and roughly 8× with batch processing, on both CPU and GPU.
Key takeaways:
- Use INT8 for CPU and resource-constrained systems
- Use FP16 for GPU with sufficient VRAM
- Enable batch processing for multiple files
- Specify language when known for better performance
- Reuse model instances for multiple transcriptions
For more information about Whisper transcription, check out our guides on Whisper Python Example, Whisper Accuracy Tips, and Whisper Transcript Formatting.
Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.
