Faster-Whisper Guide: Accelerating Speech-to-Text with CTranslate2

Faster-whisper is a high-performance reimplementation of the OpenAI Whisper model built on CTranslate2, a fast inference engine for Transformer models. It transcribes 2–4× faster at comparable accuracy, making it well suited to production workloads and batch processing.
This guide covers installing faster-whisper, usage examples, performance tuning, and how to choose between it and standard OpenAI Whisper.

What is Faster-whisper?

Faster-whisper is an optimized implementation of OpenAI Whisper that uses CTranslate2 to accelerate inference, delivering significantly higher speed and lower memory usage while matching the accuracy of the original.

Key features

  • 2–4× faster inference than OpenAI Whisper
  • Quantization support for a lower memory footprint
  • Accuracy identical to the original Whisper models
  • Optimized backends for both GPU and CPU
  • Batch processing of multiple files
  • Word-level timestamps
  • Quantization options (FP32, FP16, INT8, INT8_FLOAT16)
  • Voice activity detection (VAD) filtering

How It Works

Faster-whisper converts Whisper models to the CTranslate2 format and runs them with inference-optimized C++ code, which brings:
  • Faster matrix operations via optimized BLAS
  • Better memory management with lower overhead
  • Quantization to reduce memory
  • Batching for higher throughput
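
Pretrained sizes such as "base" are downloaded already converted, but a custom or fine-tuned Hugging Face Whisper checkpoint has to be converted yourself with CTranslate2's converter tool. A rough sketch (the model name and output directory here are placeholders):

```shell
# Convert a Hugging Face Whisper model to CTranslate2 format.
# ct2-transformers-converter ships with the ctranslate2 package.
ct2-transformers-converter \
  --model openai/whisper-base \
  --output_dir whisper-base-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16
```

The resulting directory can then be passed directly to WhisperModel() as a model path.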

Faster-whisper vs OpenAI Whisper

Performance comparison

| Feature | OpenAI Whisper | Faster-whisper |
|---|---|---|
| Speed | baseline | 2–4× faster |
| Memory | higher | lower (with quantization) |
| Accuracy | same | same (identical models) |
| GPU | supported | supported (optimized) |
| CPU | supported | supported (optimized) |
| Quantization | limited | full (INT8, FP16, etc.) |
| Batching | manual | built-in |
| Installation | simple | simple (includes CTranslate2) |

When to Choose Faster-whisper

Faster-whisper is a good fit when you:
  • Run production workloads that need faster transcription
  • Need to batch-process many files
  • Operate in resource-constrained environments (use INT8)
  • Build real-time or near-real-time applications
  • Want a lower memory footprint in deployment
Stick with OpenAI Whisper when you:
  • Need maximum compatibility with existing code
  • Use fine-tuned models (faster-whisper requires conversion)
  • Prefer a simpler API (though faster-whisper's is close)
  • Need experimental features that land in OpenAI Whisper first

Installation

Prerequisites

  • Python 3.9+ (required)
  • FFmpeg (optional: faster-whisper decodes audio with PyAV, but some formats may still need FFmpeg)
  • NVIDIA GPU (optional, for GPU acceleration)

Basic installation

Install faster-whisper with pip:
pip install faster-whisper
This automatically installs:
  • faster-whisper
  • ctranslate2 (the CTranslate2 inference engine)
  • pyav (audio decoding, replacing the FFmpeg dependency)

GPU Installation (NVIDIA CUDA)

GPU acceleration requires the CUDA libraries.
CUDA 12 (recommended):
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
Set the library path:
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
CUDA 11 (legacy):
If you are on CUDA 11, install an older CTranslate2 release:
pip install ctranslate2==3.24.0 faster-whisper

Verify the installation

from faster_whisper import WhisperModel

# Test basic import
print("Faster-whisper installed successfully!")

Basic Usage

Simple transcription

from faster_whisper import WhisperModel

# Load model (automatically downloads if not present)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe audio
segments, info = model.transcribe("audio.mp3")

# Print detected language
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

# Print transcription
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Getting the full text

from faster_whisper import WhisperModel

model = WhisperModel("base")
segments, info = model.transcribe("audio.mp3")

# Collect all text
full_text = " ".join([segment.text for segment in segments])
print(full_text)

Word-level timestamps

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "audio.mp3",
    word_timestamps=True,
    beam_size=5
)

for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
    
    # Word-level timestamps
    for word in segment.words:
        print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")
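
Word-level timestamps make it easy to re-segment the output on your own terms, for example grouping words into fixed-length caption chunks. A minimal sketch of that grouping logic, using plain (word, start, end) tuples in place of faster-whisper's word objects (the data and function name are hypothetical, not part of the library API):

```python
def chunk_words(words, max_duration=3.0):
    """Group (word, start, end) tuples into caption chunks no longer
    than max_duration seconds each."""
    chunks, current = [], []
    for word, start, end in words:
        # Start a new chunk when adding this word would exceed the budget
        if current and end - current[0][1] > max_duration:
            chunks.append(current)
            current = []
        current.append((word, start, end))
    if current:
        chunks.append(current)
    # Render each chunk as (chunk_start, chunk_end, text)
    return [(c[0][1], c[-1][2], " ".join(w for w, _, _ in c)) for c in chunks]

words = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9),
         ("this", 2.8, 3.1), ("is", 3.2, 3.4), ("a", 3.5, 3.6), ("test", 3.7, 4.1)]
print(chunk_words(words))
```

The same idea works with real segment.words output: feed in (word.word, word.start, word.end) for each word.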

Device and Compute Types

Device options

  • device="cpu" — CPU inference (works everywhere)
  • device="cuda" — GPU inference (requires an NVIDIA GPU and CUDA)

Compute types

Choose based on your hardware and the speed/accuracy trade-off:
| Compute type | Speed | Memory | Accuracy | Best for |
|---|---|---|---|---|
| int8 | fastest | lowest | slightly lower | CPU, tight resources |
| int8_float16 | very fast | low | near FP16 | GPUs with limited VRAM |
| float16 | fast | moderate | high | GPU (recommended) |
| float32 | slowest | highest | highest | maximum accuracy |
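
The table above can be encoded as a small helper that picks a compute_type from the device and available memory. This is an illustrative heuristic, not part of the faster-whisper API (the function name and thresholds are made up):

```python
def pick_compute_type(device: str, mem_gb: float) -> str:
    """Heuristic mapping of hardware to a compute_type string.
    Thresholds are illustrative; tune them for your models."""
    if device == "cpu":
        return "int8"          # fastest and smallest on CPU
    if device == "cuda":
        if mem_gb >= 8:
            return "float16"   # recommended when VRAM is plentiful
        return "int8_float16"  # INT8 weights, FP16 compute for tight VRAM
    raise ValueError(f"unknown device: {device}")

print(pick_compute_type("cpu", 16))   # int8
print(pick_compute_type("cuda", 12))  # float16
print(pick_compute_type("cuda", 6))   # int8_float16
```

The returned string can be passed straight to WhisperModel(..., compute_type=...).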

Examples by hardware

CPU (Intel/AMD):
# Best for CPU: INT8
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU (NVIDIA):
# Best for GPU: FP16
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
GPU with limited VRAM:
# Use INT8_FLOAT16 for large models
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
Maximum accuracy:
# Use FP32 (slower but most accurate)
model = WhisperModel("large-v2", device="cuda", compute_type="float32")

Advanced Features

1. Batch processing

Process multiple audio files efficiently:
from faster_whisper import WhisperModel
from pathlib import Path

model = WhisperModel("base", device="cuda", compute_type="float16")

audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in audio_files:
    print(f"Transcribing: {audio_file}")
    segments, info = model.transcribe(audio_file)
    
    text = " ".join([seg.text for seg in segments])
    print(f"Result: {text[:100]}...")
    print()

2. Voice Activity Detection (VAD)

Filter out silence and non-speech segments:
from faster_whisper import WhisperModel

model = WhisperModel("base")

segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # Enable VAD filtering
    vad_parameters=dict(
        min_silence_duration_ms=500,  # Minimum silence duration
        threshold=0.5  # VAD threshold
    )
)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
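
Conceptually, min_silence_duration_ms controls how aggressively neighboring speech regions are merged: gaps shorter than the threshold are treated as part of the same utterance. A standalone sketch of that merging step (not faster-whisper's actual VAD, which uses a Silero model internally):

```python
def merge_speech_regions(regions, min_silence_ms=500):
    """Merge (start_ms, end_ms) speech regions separated by silences
    shorter than min_silence_ms."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < min_silence_ms:
            # Gap is too short to count as silence: extend the last region
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

regions = [(0, 1200), (1400, 2600), (4000, 5000)]
print(merge_speech_regions(regions))  # [(0, 2600), (4000, 5000)]
```

Raising min_silence_duration_ms therefore yields fewer, longer segments; lowering it splits the audio more finely.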

3. Specifying the language

Specifying the language improves both accuracy and speed:
from faster_whisper import WhisperModel

model = WhisperModel("base")

# Specify language (faster and more accurate)
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # English
)

# Or let it auto-detect
segments, info = model.transcribe("audio.mp3")  # Auto-detect
print(f"Detected: {info.language}")

4. Beam size and other parameters

from faster_whisper import WhisperModel

model = WhisperModel("base")

segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,  # Higher = more accurate but slower (default: 5)
    best_of=5,    # Number of candidates to consider
    temperature=0.0,  # Lower = more deterministic
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."
)

5. Custom model paths

Use a local or pre-converted model:
from faster_whisper import WhisperModel

# Use local model directory
model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
    download_root="./models"  # Custom download directory
)

# Or specify full path to converted model
model = WhisperModel(
    "/path/to/converted/model",
    device="cuda",
    compute_type="float16"
)

Performance Benchmarks

GPU (NVIDIA RTX 3070 Ti)

Transcribing ~13 minutes of audio:
| Configuration | Time | VRAM usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP16, beam=5) | ~2m 23s | ~4708 MB | baseline |
| Faster-whisper (FP16, beam=5) | ~1m 03s | ~4525 MB | 2.3× faster |
| Faster-whisper (INT8, beam=5) | ~59s | ~2926 MB | 2.4× faster |
| Faster-whisper (FP16, batch=8) | ~17s | ~6090 MB | 8.4× faster |
| Faster-whisper (INT8, batch=8) | ~16s | ~4500 MB | 8.9× faster |

CPU (Intel Core i7-12700K)

| Configuration | Time | Memory usage | Speedup |
|---|---|---|---|
| OpenAI Whisper (FP32, beam=5) | ~6m 58s | ~2335 MB | baseline |
| Faster-whisper (FP32, beam=5) | ~2m 37s | ~2257 MB | 2.7× faster |
| Faster-whisper (INT8, beam=5) | ~1m 42s | ~1477 MB | 4.1× faster |
| Faster-whisper (FP32, batch=8) | ~1m 06s | ~4230 MB | 6.3× faster |
| Faster-whisper (INT8, batch=8) | ~51s | ~3608 MB | 8.2× faster |

Key takeaways

  • Batching delivers the largest speedups (often over 8× on GPU)
  • INT8 quantization saves roughly 40% memory with minimal accuracy loss
  • GPU acceleration matters most for large models and batch jobs
  • CPU + INT8 is perfectly workable for small models and single files
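
The speedup and saving figures follow directly from the timing and memory columns. A quick check of the GPU table's FP16 row and the INT8 memory saving (numbers copied from the tables above):

```python
# Baseline: OpenAI Whisper FP16 at 2m 23s and 4708 MB VRAM
baseline_s, baseline_mb = 2 * 60 + 23, 4708
# Faster-whisper FP16 time (1m 03s) and INT8 VRAM (2926 MB)
fw_fp16_s, fw_int8_mb = 63, 2926

speedup = baseline_s / fw_fp16_s
mem_saving = 1 - fw_int8_mb / baseline_mb
print(f"{speedup:.1f}x faster")         # 2.3x faster
print(f"{mem_saving:.0%} less memory")  # 38% less memory
```

The INT8 row works out to ~38% less VRAM, consistent with the "roughly 40%" takeaway above.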

Complete Example: Production-Ready Transcription

from faster_whisper import WhisperModel
from pathlib import Path
import json
from datetime import datetime

class TranscriptionService:
    """Production-ready transcription service using faster-whisper."""
    
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        """Initialize the transcription service."""
        print(f"Loading model: {model_size} on {device} ({compute_type})")
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print("Model loaded successfully!")
    
    def transcribe_file(self, audio_path, output_format="txt", **kwargs):
        """
        Transcribe an audio file.
        
        Args:
            audio_path: Path to audio file
            output_format: Output format (txt, json, srt, vtt)
            **kwargs: Additional transcription parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        print(f"Transcribing: {audio_path.name}")
        
        # Transcribe
        segments, info = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )
        
        # Collect results
        result = {
            "file": str(audio_path),
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }
        
        full_text_parts = []
        for segment in segments:
            segment_data = {
                "start": segment.start,
                "end": segment.end,
                "text": segment.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end,
                        "probability": word.probability
                    }
                    for word in segment.words
                ]
            }
            result["segments"].append(segment_data)
            full_text_parts.append(segment.text)
        
        result["text"] = " ".join(full_text_parts)
        
        # Save based on format
        output_path = audio_path.parent / f"{audio_path.stem}_transcript"
        
        if output_format == "txt":
            self._save_txt(result, output_path.with_suffix(".txt"))
        elif output_format == "json":
            self._save_json(result, output_path.with_suffix(".json"))
        elif output_format == "srt":
            self._save_srt(result, output_path.with_suffix(".srt"))
        elif output_format == "vtt":
            self._save_vtt(result, output_path.with_suffix(".vtt"))
        
        print(f"✓ Transcription saved: {output_path}.{output_format}")
        return result
    
    def _save_txt(self, result, path):
        """Save as plain text."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(result["text"])
    
    def _save_json(self, result, path):
        """Save as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    
    def _save_srt(self, result, path):
        """Save as SRT subtitles."""
        with open(path, "w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                start = self._format_srt_time(seg["start"])
                end = self._format_srt_time(seg["end"])
                f.write(f"{i}\n{start} --> {end}\n{seg['text']}\n\n")
    
    def _save_vtt(self, result, path):
        """Save as WebVTT."""
        with open(path, "w", encoding="utf-8") as f:
            f.write("WEBVTT\n\n")
            for seg in result["segments"]:
                start = self._format_vtt_time(seg["start"])
                end = self._format_vtt_time(seg["end"])
                f.write(f"{start} --> {end}\n{seg['text']}\n\n")
    
    def _format_srt_time(self, seconds):
        """Format time for SRT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
    
    def _format_vtt_time(self, seconds):
        """Format time for VTT."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

# Usage
if __name__ == "__main__":
    # Initialize service
    service = TranscriptionService(
        model_size="base",
        device="cpu",  # Change to "cuda" for GPU
        compute_type="int8"  # Use "float16" for GPU
    )
    
    # Transcribe file
    result = service.transcribe_file(
        "meeting.mp3",
        output_format="json",
        beam_size=5,
        language="en"
    )
    
    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"Text: {result['text'][:200]}...")

Best Practices

1. Pick an appropriate model size

# For speed (CPU)
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# For balance
model = WhisperModel("base", device="cpu", compute_type="int8")

# For accuracy (GPU recommended)
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

2. Optimize for your hardware

CPU only:
model = WhisperModel("base", device="cpu", compute_type="int8")
GPU with plenty of VRAM:
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
Limited VRAM:
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")

3. Batch multiple files

# Process multiple files efficiently
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
model = WhisperModel("base", device="cuda", compute_type="float16")

for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)
    # Process results...

4. Enable VAD for noisy audio

segments, info = model.transcribe(
    "noisy_audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=1000,
        threshold=0.5
    )
)

5. Specify the language when known

# Faster and more accurate when language is known
segments, info = model.transcribe(
    "audio.mp3",
    language="en"  # Specify instead of auto-detect
)

6. Reuse model instances

# Load model once, reuse for multiple files
model = WhisperModel("base")

# Process multiple files with same model
for audio_file in audio_files:
    segments, info = model.transcribe(audio_file)

Migrating from OpenAI Whisper

Code comparison

OpenAI Whisper:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Faster-whisper:
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
text = " ".join([seg.text for seg in segments])
print(text)

Key differences

  1. Model loading: WhisperModel() vs whisper.load_model()
  2. Return value: a (segments, info) tuple vs a dict
  3. Segments: an iterator of segment objects vs a list
  4. Device/compute type: device and compute_type must be set explicitly
  5. Full text: you must join the segments yourself

Migration helper

def convert_to_whisper_format(segments, info):
    """Convert faster-whisper output to OpenAI Whisper format."""
    return {
        "text": " ".join([seg.text for seg in segments]),
        "language": info.language,
        "segments": [
            {
                "id": i,
                "start": seg.start,
                "end": seg.end,
                "text": seg.text,
                "words": [
                    {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end
                    }
                    for word in seg.words
                ] if hasattr(seg, 'words') else []
            }
            for i, seg in enumerate(segments)
        ]
    }

# Usage
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
result = convert_to_whisper_format(segments, info)
# Now compatible with OpenAI Whisper format

Troubleshooting

Issue 1: CUDA out of memory

Symptom: The GPU runs out of memory with large models.
Fix:
# Use smaller model
model = WhisperModel("base", device="cuda", compute_type="float16")

# Or use INT8 quantization
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")

# Or use CPU
model = WhisperModel("large-v2", device="cpu", compute_type="int8")

Issue 2: Slow on CPU

Symptom: Transcription is slow on the CPU.
Fix:
# Use INT8 quantization
model = WhisperModel("base", device="cpu", compute_type="int8")

# Use smaller model
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Reduce beam size
segments, info = model.transcribe("audio.mp3", beam_size=1)

Issue 3: CUDA libraries not found

Symptom: RuntimeError: CUDA runtime not found
Fix:
# Install CUDA libraries
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

# Set library path
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')

Issue 4: Model download fails

Symptom: Downloads time out or fail.
Fix:
# Specify download directory
model = WhisperModel(
    "base",
    download_root="./models",  # Custom directory
    local_files_only=False
)

# Or download manually from Hugging Face
# Then use local path
model = WhisperModel("/path/to/local/model")

Which Should You Choose?

Use Faster-whisper when:

✅ Production deployments where speed matters
✅ Batch processing of many files
✅ Resource-constrained environments (with INT8)
✅ Real-time or near-real-time applications
✅ GPU acceleration is available
✅ A lower memory footprint matters

Use OpenAI Whisper when:

✅ You need maximum compatibility
✅ You use fine-tuned models (simpler integration)
✅ You prefer a simpler API
✅ You need experimental features that land there first
✅ You are learning or prototyping (more docs and examples)

Summary

Faster-whisper matches OpenAI Whisper's accuracy while delivering significantly better performance. With a sensible configuration, expect roughly 2–4× speedups on CPU and up to ~8× on GPU with batching.
Key points:
  • Use INT8 on CPU and in constrained environments
  • Use FP16 on GPUs with sufficient VRAM
  • Enable batching for multiple files
  • Specify the language when it is known
  • Reuse model instances across transcriptions

Need a professional speech-to-text solution? Visit SayToWords to learn about our AI transcription platform, optimized for performance and supporting multiple output formats.
