
Faster-Whisper Guide: Faster Speech-to-Text with CTranslate2

Faster-whisper is a high-performance reimplementation of OpenAI's Whisper model using CTranslate2, a fast transformer inference engine. It provides 2-4× faster transcription with similar accuracy, making it ideal for production deployments and batch processing.
This comprehensive guide covers everything you need to know about faster-whisper, including installation, usage examples, performance optimization, and when to choose it over the standard OpenAI Whisper.

What Is Faster-Whisper?

Faster-whisper is an optimized implementation of OpenAI Whisper that uses CTranslate2 for faster inference. It maintains the same accuracy as the original Whisper while significantly improving speed and reducing memory usage.

Key Features

  • 2-4× faster inference compared to OpenAI Whisper
  • Lower memory usage with quantization support
  • Same accuracy as original Whisper models
  • GPU and CPU support with optimized backends
  • Batch processing for multiple files
  • Word-level timestamps support
  • Quantization options (FP32, FP16, INT8, INT8_FLOAT16)
  • Voice activity detection (VAD) filtering

How It Works

Faster-whisper converts Whisper models to CTranslate2 format, which uses optimized C++ code for inference. This provides:
  • Faster matrix operations with optimized BLAS libraries
  • Better memory management with reduced overhead
  • Quantization support for lower memory usage
  • Batch processing for throughput optimization
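
Converting a model yourself is only necessary for custom or fine-tuned checkpoints; the standard sizes are downloaded pre-converted from Hugging Face. As a sketch, CTranslate2's converter CLI handles the conversion (the model name and output directory below are example values, and the transformers package must be installed):
# Convert a Hugging Face Whisper checkpoint to CTranslate2 format
ct2-transformers-converter --model openai/whisper-base \
    --output_dir whisper-base-ct2 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16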

Faster-Whisper vs OpenAI Whisper

Performance Comparison

| Feature          | OpenAI Whisper | Faster-Whisper                |
| ---------------- | -------------- | ----------------------------- |
| Speed            | Baseline       | 2-4× faster                   |
| Memory Usage     | Higher         | Lower (with quantization)     |
| Accuracy         | High           | Same (identical models)       |
| GPU Support      | Yes            | Yes (optimized)               |
| CPU Support      | Yes            | Yes (optimized)               |
| Quantization     | Limited        | Full support (INT8, FP16)     |
| Batch Processing | Manual         | Built-in support              |
| Installation     | Simple         | Simple (includes CTranslate2) |

When to Use Faster-Whisper

Choose faster-whisper when:
  • You need faster transcription for production workloads
  • You are processing multiple files in batch
  • You are running on resource-constrained systems (use INT8)
  • You are building real-time or near-real-time applications
  • You need lower memory usage for deployment
Stick with OpenAI Whisper when:
  • You need maximum compatibility with existing code
  • You are using fine-tuned models (faster-whisper requires conversion)
  • You prefer a simpler API (though faster-whisper's is similar)
  • You are working with experimental features that land in OpenAI Whisper first

Installation

Prerequisites

  • Python 3.9+ (required)
  • FFmpeg not required - faster-whisper decodes audio with PyAV, which bundles the FFmpeg libraries
  • NVIDIA GPU (optional, for GPU acceleration)

Basic Installation

Install faster-whisper using pip:
pip install faster-whisper
This automatically installs:
  • faster-whisper package
  • ctranslate2 (CTranslate2 inference engine)
  • pyav (audio decoding, replaces FFmpeg dependency)

GPU Installation (NVIDIA CUDA)

For GPU acceleration, you need CUDA libraries:
CUDA 12 (Recommended):
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
Set the library path:
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
CUDA 11 (Legacy):
If you have CUDA 11, use an older CTranslate2 version:
pip install ctranslate2==3.24.0 faster-whisper

Verify Installation

from faster_whisper import WhisperModel

# Test basic import
print("Faster-whisper installed successfully!")

Basic Usage

Simple Transcription

from faster_whisper import WhisperModel

# Load model (automatically downloads if not present)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe audio
segments, info = model.transcribe("audio.mp3")

# Print detected language
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

# Print transcription
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Get Full Text

from faster_whisper import WhisperModel

model = WhisperModel("base")
segments, info = model.transcribe("audio.mp3")

# Collect all text
full_text = " ".join([segment.text for segment in segments])
print(full_text)
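
One caveat: transcribe() returns the segments as a lazy generator, so transcription only runs while you iterate, and the generator is exhausted after one pass. If you need the segments more than once, materialize them into a list first:
segments, info = model.transcribe("audio.mp3")
segments = list(segments)  # runs the full transcription; now safe to reuse

full_text = " ".join(segment.text for segment in segments)
print(f"{len(segments)} segments, {len(full_text)} characters")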

With Word Timestamps

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "audio.mp3",
    word_timestamps=True,
    beam_size=5
)

for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
    
    # Word-level timestamps
    for word in segment.words:
        print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")

Device and Compute Type Options

Device Options

  • device="cpu" - CPU inference (works everywhere)
  • device="cuda" - GPU inference (requires NVIDIA GPU and CUDA)

Compute Types

Choose based on your hardware and speed/accuracy trade-offs:
| Compute Type | Speed     | Memory  | Accuracy       | Use Case                  |
| ------------ | --------- | ------- | -------------- | ------------------------- |
| int8         | Fastest   | Lowest  | Slightly lower | CPU, resource-constrained |
| int8_float16 | Very fast | Low     | High           | GPU with limited VRAM     |
| float16      | Fast      | Medium  | High           | GPU (recommended)         |
| float32      | Slowest   | Highest | Highest        | Maximum accuracy          |

Examples by Hardware

CPU (Intel/AMD):
# Best for CPU: INT8
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU (NVIDIA):
# Best for GPU: FP16
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with Limited VRAM:
# Use INT8_FLOAT16 for large models
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
Maximum Accuracy:
# Use FP32 (slower but most accurate)
model = WhisperModel("large-v2", device="cuda", compute_type="float32")

Advanced Features

1. Batch Processing

Transcribe multiple audio files by reusing a single model instance (for true batched inference, see the sketch after this example):
from faster_whisper import WhisperModel
from pathlib import Path

model = WhisperModel("base", device="cuda", compute_type="float16")

audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in audio_files:
    print(f"Transcribing: {audio_file}")
    segments, info = model.transcribe(audio_file)
    
    text = " ".join([seg.text for seg in segments])
    print(f"Result: {text[:100]}...")
    print()
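
The loop above reuses one model but still decodes files sequentially. The batch=8 rows in the benchmarks below come from batched inference, which recent faster-whisper releases (1.1+) expose as BatchedInferencePipeline; a minimal sketch, assuming a version that includes it:
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("base", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel
segments, info = batched_model.transcribe("audio.mp3", batch_size=8)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")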

2. Voice Activity Detection (VAD)

Filter out silence and non-speech segments:
from faster_whisper import WhisperModel

model = WhisperModel("base")

segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # Enable VAD filtering
    vad_parameters=dict(
        min_silence_duration_ms=500,  # Minimum silence duration
        threshold=0.5  # VAD threshold
    )
)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")

3. Language Specification

Specify language to improve accuracy and speed:
from faster_whisper import WhisperModel

model = WhisperModel("base")

# Specify language (faster and more accurate)
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # English
)

# Or let it auto-detect
segments, info = model.transcribe("audio.mp3")  # Auto-detect
print(f"Detected: {info.language}")

4. Beam Size and Other Parameters

from faster_whisper import WhisperModel

model = WhisperModel("base")

segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,  # Higher = more accurate but slower (default: 5)
    best_of=5,    # Number of sampling candidates (used when temperature > 0)
    temperature=0.0,  # Lower = more deterministic (0 = greedy decoding)
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."
)

5. Custom Model Paths

Use local models or custom converted models:
from faster_whisper import WhisperModel

# Download models into a custom directory
model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
    download_root="./models"  # Custom download directory
)

# Or specify full path to converted model
model = WhisperModel(
    "/path/to/converted/model",
    device="cuda",
    compute_type="float16"
)

Performance Benchmarks

GPU Performance (NVIDIA RTX 3070 Ti)

Transcribing ~13 minutes of audio:
| Setup                          | Time    | VRAM Usage | Speedup     |
| ------------------------------ | ------- | ---------- | ----------- |
| OpenAI Whisper (FP16, beam=5)  | ~2m 23s | ~4708 MB   | Baseline    |
| Faster-whisper (FP16, beam=5)  | ~1m 03s | ~4525 MB   | 2.3× faster |
| Faster-whisper (INT8, beam=5)  | ~59s    | ~2926 MB   | 2.4× faster |
| Faster-whisper (FP16, batch=8) | ~17s    | ~6090 MB   | 8.4× faster |
| Faster-whisper (INT8, batch=8) | ~16s    | ~4500 MB   | 8.9× faster |

CPU Performance (Intel Core i7-12700K)

| Setup                          | Time    | RAM Usage | Speedup     |
| ------------------------------ | ------- | --------- | ----------- |
| OpenAI Whisper (FP32, beam=5)  | ~6m 58s | ~2335 MB  | Baseline    |
| Faster-whisper (FP32, beam=5)  | ~2m 37s | ~2257 MB  | 2.7× faster |
| Faster-whisper (INT8, beam=5)  | ~1m 42s | ~1477 MB  | 4.1× faster |
| Faster-whisper (FP32, batch=8) | ~1m 06s | ~4230 MB  | 6.3× faster |
| Faster-whisper (INT8, batch=8) | ~51s    | ~3608 MB  | 8.2× faster |

Key Insights

  • Batch processing provides the biggest speedup (8×+ on GPU)
  • INT8 quantization reduces memory by ~40% with minimal accuracy loss
  • GPU acceleration is essential for large models and batch processing
  • CPU with INT8 is viable for smaller models and single-file processing

Complete Example: Production-Ready Transcription

from faster_whisper import WhisperModel
from pathlib import Path
import json

class TranscriptionService:
    """Production-ready transcription service using faster-whisper."""
    
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        """Initialize the transcription service."""
        print(f"Loading model: {model_size} on {device} ({compute_type})")
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print("Model loaded successfully!")
    
    def transcribe_file(self, audio_path, output_format="txt", **kwargs):
        """
        Transcribe an audio file.
        
        Args:
            audio_path: Path to audio file
            output_format: Output format (txt, json, srt, vtt)
            **kwargs: Additional transcription parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        print(f"Transcribing: {audio_path.name}")
        
        # Transcribe
        segments, info = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )
        
        # Collect results
        result = {
            "file": str(audio_path),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }
        
        full_text_parts = []
        for segment in segments:
            segment_data = {
                "start": segment.start,
                "end": segment.end,
                "text": segment.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end,
                        "probability": word.probability
                    }
                    for word in segment.words
                ]
            }
            result["segments"].append(segment_data)
            full_text_parts.append(segment.text)
        
        result["text"] = " ".join(full_text_parts)
        
        # Save based on format
        output_path = audio_path.parent / f"{audio_path.stem}_transcript"
        
        if output_format == "txt":
            self._save_txt(result, output_path.with_suffix(".txt"))
        elif output_format == "json":
            self._save_json(result, output_path.with_suffix(".json"))
        elif output_format == "srt":
            self._save_srt(result, output_path.with_suffix(".srt"))
        elif output_format == "vtt":
            self._save_vtt(result, output_path.with_suffix(".vtt"))
        else:
            raise ValueError(f"Unsupported output format: {output_format}")
        
        print(f"✓ Transcription saved: {output_path.with_suffix('.' + output_format)}")
        return result
    
    def _save_txt(self, result, path):
        """Save as plain text."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(result["text"])
    
    def _save_json(self, result, path):
        """Save as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    
    def _save_srt(self, result, path):
        """Save as SRT subtitles."""
        with open(path, "w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                start = self._format_srt_time(seg["start"])
                end = self._format_srt_time(seg["end"])
                f.write(f"{i}\n{start} --> {end}\n{seg['text']}\n\n")
    
    def _save_vtt(self, result, path):
        """Save as WebVTT."""
        with open(path, "w", encoding="utf-8") as f:
            f.write("WEBVTT\n\n")
            for seg in result["segments"]:
                start = self._format_vtt_time(seg["start"])
                end = self._format_vtt_time(seg["end"])
                f.write(f"{start} --> {end}\n{seg['text']}\n\n")
    
    def _format_srt_time(self, seconds):
        """Format time for SRT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
    
    def _format_vtt_time(self, seconds):
        """Format time for VTT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

# Usage
if __name__ == "__main__":
    # Initialize service
    service = TranscriptionService(
        model_size="base",
        device="cpu",  # Change to "cuda" for GPU
        compute_type="int8"  # Use "float16" for GPU
    )
    
    # Transcribe file
    result = service.transcribe_file(
        "meeting.mp3",
        output_format="json",
        beam_size=5,
        language="en"
    )
    
    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"Text: {result['text'][:200]}...")

Best Practices

1. Choose the Right Model Size

# For speed (CPU)
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# For balance
model = WhisperModel("base", device="cpu", compute_type="int8")

# For accuracy (GPU recommended)
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

2. Optimize for Your Hardware

CPU-only systems:
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU with sufficient VRAM:
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with limited VRAM:
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")

3. Use Batch Processing for Multiple Files

# Process multiple files efficiently
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
model = WhisperModel("base", device="cuda", compute_type="float16")

for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
    # Process results...

4. Enable VAD for Noisy Audio

segments, info = model.transcribe(
    "noisy_audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=1000,
        threshold=0.5
    )
)

5. Specify Language When Known

# Faster and more accurate when language is known
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # Specify instead of auto-detect
)

6. Reuse Model Instances

# Load model once, reuse for multiple files
model = WhisperModel("base")

# Process multiple files with same model
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)

Migration from OpenAI Whisper

Code Comparison

OpenAI Whisper:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Faster-whisper:
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
text = " ".join([seg.text for seg in segments])
print(text)

Key Differences

  1. Model Loading: WhisperModel() instead of whisper.load_model()
  2. Return Format: Returns (segments, info) tuple instead of dict
  3. Segments: Lazy generator of segment objects (consumed once) instead of a list
  4. Device/Compute Type: Must specify device and compute_type
  5. Text Access: Need to join segments for full text

Migration Helper Function

def convert_to_whisper_format(segments, info):
    """Convert faster-whisper output to OpenAI Whisper format."""
    # Materialize the generator first: it can only be consumed once,
    # and it is iterated twice below (for "text" and for "segments").
    segments = list(segments)
    return {
        "text": " ".join([seg.text for seg in segments]),
        "language": info.language,
        "segments": [
            {
                "id": i,
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end
                    }
                    for word in seg.words
                ] if seg.words else []  # words is None without word_timestamps=True
            }
            for i, seg in enumerate(segments)
        ]
    }

# Usage
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
result = convert_to_whisper_format(segments, info)
# Now compatible with OpenAI Whisper format

Troubleshooting

Issue 1: CUDA Out of Memory

Problem: GPU runs out of memory with large models.
Solutions:
# Use smaller model
model = WhisperModel("base", device="cuda", compute_type="float16")

# Or use INT8 quantization
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")

# Or use CPU
model = WhisperModel("large-v2", device="cpu", compute_type="int8")

Issue 2: Slow CPU Performance

Problem: Transcription is slow on CPU.
Solutions:
# Use INT8 quantization
model = WhisperModel("base", device="cpu", compute_type="int8")

# Use smaller model
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Reduce beam size
segments, info = model.transcribe("audio.mp3", beam_size=1)

Issue 3: CUDA Libraries Not Found

Problem: RuntimeError: CUDA runtime not found
Solution:
# Install CUDA libraries
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

# Set library path
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')

Issue 4: Model Download Fails

Problem: Model download times out or fails.
Solution:
# Specify download directory
model = WhisperModel(
    "base",
    download_root="./models",  # Custom directory
    local_files_only=False
)

# Or download manually from Hugging Face
# Then use local path
model = WhisperModel("/path/to/local/model")

When to Use Faster-Whisper

Use Faster-Whisper When:

  • Production deployments requiring speed
  • Batch processing multiple files
  • Resource-constrained environments (use INT8)
  • Real-time or near-real-time applications
  • GPU acceleration is available
  • Lower memory usage is important

Use OpenAI Whisper When:

  • Maximum compatibility with existing code
  • Fine-tuned models (easier integration)
  • Simpler API preference
  • Experimental features first available in OpenAI Whisper
  • Learning/development (more documentation/examples)

Conclusion

Faster-whisper provides significant performance improvements over OpenAI Whisper while maintaining the same accuracy. With proper configuration, you can achieve 2-4× speedup on CPU and up to 8× speedup on GPU with batch processing.
Key takeaways:
  • Use INT8 for CPU and resource-constrained systems
  • Use FP16 for GPU with sufficient VRAM
  • Enable batch processing for multiple files
  • Specify language when known for better performance
  • Reuse model instances for multiple transcriptions
For more information about Whisper transcription, check out our guides on Whisper Python Example, Whisper Accuracy Tips, and Whisper Transcript Formatting.

Looking for a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform with optimized performance and multiple output formats.

