
OpenAI Whisper Tutorial: A Complete Guide to Speech-to-Text Transcription
Eric King
Author
OpenAI Whisper is an open-source **automatic speech recognition (ASR)** model for speech-to-text transcription and speech translation. It supports many languages, is robust to accents and background noise, and is widely used for podcasts, meetings, interviews, and video subtitles.
This tutorial walks through everything from installation to advanced usage so you can get fully up to speed with Whisper.
What Is OpenAI Whisper?
Whisper was trained on 680,000 hours of multilingual audio, which makes it especially strong on real-world, imperfect audio. It is one of the most accurate open-source speech recognition models available.
Key Features
- Multilingual — supports 99+ languages
- Speech-to-text — converts audio into text
- Speech translation — translates speech directly into English
- Language detection — automatically identifies the spoken language
- Timestamps — word-level and segment-level timestamps
- Free and open source — MIT license, no API fees
- Offline capable — runs on your local machine
- Many formats — handles a wide range of audio and video formats
Whisper Model Sizes
Whisper ships in several model sizes that trade speed against accuracy:
| Model | Parameters | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ⭐⭐⭐⭐⭐ | ⭐⭐ | ~1 GB | Quick tests, demos |
| base | 74M | ⭐⭐⭐⭐ | ⭐⭐⭐ | ~1 GB | Simple audio, lightweight jobs |
| small | 244M | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~2 GB | General use, balanced |
| medium | 769M | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~5 GB | Noisy audio, high accuracy |
| large | 1550M | ⭐ | ⭐⭐⭐⭐⭐ | ~10 GB | Highest accuracy, production |
Recommendations:
- For speed: use `tiny` or `base`
- For balance: use `small` or `medium`
- For accuracy: use `large` or `large-v3`
- For production: use `medium` or `large-v2` in most scenarios
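These recommendations can be folded into a small helper so the rest of your code never hard-codes a model name. A minimal sketch; the priority labels are illustrative, not part of the Whisper API:

```python
# Map a plain-language priority to a Whisper model size.
# The priority names ("speed", "balanced", ...) are our own convention.
MODEL_BY_PRIORITY = {
    "speed": "tiny",
    "balanced": "small",
    "accuracy": "large-v3",
    "production": "medium",
}

def pick_model(priority="balanced"):
    """Return the recommended Whisper model size for a given priority."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"Unknown priority: {priority!r}") from None

print(pick_model("speed"))       # tiny
print(pick_model("production"))  # medium
```

The returned string can be passed straight to `whisper.load_model(...)`.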
Prerequisites
Before using Whisper, make sure you have:
- Python 3.8 or newer (Python 3.9+ recommended)
- The pip package manager
- FFmpeg installed (for audio/video processing)
- (Optional) An NVIDIA GPU with CUDA for acceleration
- 4 GB+ RAM recommended for the `base` model, 10 GB+ for `large`
Step 1: Installation
Install Whisper
Install OpenAI Whisper with pip:
pip install openai-whisper
Or pin a specific version:
pip install openai-whisper==20231117
Install FFmpeg
FFmpeg decodes audio and video files and is a required dependency.
macOS (Homebrew):
brew install ffmpeg
Ubuntu / Debian:
sudo apt update
sudo apt install ffmpeg
Windows:
- Download FFmpeg from ffmpeg.org
- Extract it and add it to your system PATH
- Or use Chocolatey: choco install ffmpeg
Verify the installation:
ffmpeg -version
whisper --version
Step 2: Basic Usage — Python
Simple Transcription
The simplest way to transcribe an audio file:
import whisper
# Load model (downloads automatically on first use)
model = whisper.load_model("base")
# Transcribe audio file
result = model.transcribe("audio.mp3")
# Print transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.
A Complete Example with Error Handling
import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.

    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)

    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)

        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)

        print("✓ Transcription complete!")
        print(f"  Language: {result['language']}")
        print(f"  Duration: {result['segments'][-1]['end']:.2f}s")

        return result
    except Exception as e:
        print(f"Error during transcription: {e}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "meeting.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    if result:
        print("\n" + "=" * 50)
        print("TRANSCRIPTION:")
        print("=" * 50)
        print(result["text"])
Step 3: Language Detection and Selection
Automatic Language Detection
Whisper detects the spoken language automatically:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Language probability: {result.get('language_probability', 0):.2%}")
print(f"\nTranscription:\n{result['text']}")
Specifying the Language (Faster and More Accurate)
When you already know the language, specifying it explicitly improves both speed and accuracy:
import whisper
model = whisper.load_model("base")
# Specify language
result_en = model.transcribe("audio.mp3", language="en") # English
result_zh = model.transcribe("audio.mp3", language="zh") # Chinese
result_es = model.transcribe("audio.mp3", language="es") # Spanish
result_fr = model.transcribe("audio.mp3", language="fr") # French
result_de = model.transcribe("audio.mp3", language="de") # German
result_ja = model.transcribe("audio.mp3", language="ja") # Japanese
print(result_en["text"])
Supported languages:
Whisper supports 99+ languages. Common language codes:
- `en` — English
- `zh` — Chinese
- `es` — Spanish
- `fr` — French
- `de` — German
- `ja` — Japanese
- `ko` — Korean
- `pt` — Portuguese
- `ru` — Russian
- `it` — Italian
Step 4: Timestamps and Segments
Accessing Timestamped Segments
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
# Print full transcription
print("Full Text:")
print(result["text"])
# Print segments with timestamps
print("\n" + "="*50)
print("Segments with Timestamps:")
print("="*50)
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:6.2f}s - {end:6.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.
==================================================
Segments with Timestamps:
==================================================
[ 0.00s - 5.20s] Hello everyone, welcome to today's meeting.
[ 5.20s - 12.50s] We will discuss the project timeline.
Formatting Timestamps as Timecodes
def format_timestamp(seconds):
    """Format seconds to HH:MM:SS."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

for segment in result["segments"]:
    start_time = format_timestamp(segment["start"])
    end_time = format_timestamp(segment["end"])
    print(f"[{start_time} - {end_time}] {segment['text']}")
Word-Level Timestamps
Enable word-level timestamps for finer-grained alignment:
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    word_timestamps=True  # Enable word-level timestamps
)

for segment in result["segments"]:
    print(f"\n[{segment['start']:.2f}s - {segment['end']:.2f}s]")
    print(f"Text: {segment['text']}")
    # Word-level timestamps
    if "words" in segment:
        print("Words:")
        for word in segment["words"]:
            print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
Step 5: Speech Translation
Whisper can translate non-English speech directly into English:
import whisper
model = whisper.load_model("base")
# Translate to English (regardless of source language)
result = model.transcribe("spanish_audio.mp3", task="translate")
print("Translated to English:")
print(result["text"])
# Original transcription (in original language)
result_original = model.transcribe("spanish_audio.mp3", task="transcribe")
print("\nOriginal language transcription:")
print(result_original["text"])
Typical use cases:
- International meetings
- Multilingual content processing
- Content localization
- Language-learning materials
Step 6: Advanced Parameters
Temperature and Beam Size
These parameters control the trade-off between transcription quality and speed:
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,                  # Lower = more deterministic (0.0 recommended)
    beam_size=5,                      # Higher = more accurate but slower (default: 5)
    best_of=5,                        # Number of candidates to consider
    patience=1.0,                     # Beam search patience
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."  # Context prompt
)
Temperature Values
- `temperature=0.0` — most deterministic; recommended
- `temperature=0.2-0.4` — slightly more random
- `temperature=1.0` — more "creative"; accuracy usually drops
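Internally, Whisper does not commit to a single temperature: `transcribe()` also accepts a tuple of temperatures (the default is `(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)`) and retries at higher values when a decode looks degenerate. Here is a pure-Python sketch of that fallback ladder; `decode_fn` and the toy thresholds below are illustrative stand-ins, not Whisper's actual implementation:

```python
def transcribe_with_fallback(decode_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             compression_ratio_threshold=2.4):
    """Try the most deterministic temperature first; retry hotter on failure.

    A very high compression ratio suggests repetitive, degenerate output,
    so we accept the first decode whose ratio is below the threshold.
    """
    result = None
    for t in temperatures:
        result = decode_fn(t)
        if result["compression_ratio"] <= compression_ratio_threshold:
            return result
    return result  # last attempt, even if still imperfect

# Toy decoder: pretends decoding only stabilizes at temperature 0.4
def fake_decode(temperature):
    ratio = 3.0 if temperature < 0.4 else 1.5
    return {"temperature": temperature, "compression_ratio": ratio, "text": "..."}

best = transcribe_with_fallback(fake_decode)
print(best["temperature"])  # 0.4
```

Passing a tuple to the real `transcribe()` gives you this behavior for free.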
Providing Context with initial_prompt
Giving Whisper context about the content helps accuracy:
result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
)

result = model.transcribe(
    "medical_audio.mp3",
    initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
Step 7: Command Line (CLI)
Whisper ships with a full-featured command-line interface:
Basic CLI
whisper audio.mp3
Choosing a Model
whisper audio.mp3 --model small
whisper audio.mp3 --model medium
whisper audio.mp3 --model large-v2
Specifying the Language
whisper audio.mp3 --language en
whisper audio.mp3 --language zh
Output Formats
# SRT subtitles
whisper audio.mp3 --output_format srt
# VTT subtitles
whisper audio.mp3 --output_format vtt
# Text file
whisper audio.mp3 --output_format txt
# JSON (with all metadata)
whisper audio.mp3 --output_format json
# TSV (tab-separated values)
whisper audio.mp3 --output_format tsv
Advanced CLI Options
# Full example with all options
whisper audio.mp3 \
--model medium \
--language en \
--task transcribe \
--output_format srt \
--output_dir ./transcripts \
--verbose True \
--temperature 0.0 \
--beam_size 5 \
--best_of 5 \
--fp16 True
CLI Options Reference
| Option | Description | Default |
|---|---|---|
| --model | Model size (tiny, base, small, medium, large) | base |
| --language | Language code (en, zh, es, ...) | auto-detect |
| --task | transcribe or translate | transcribe |
| --output_format | Output format (txt, srt, vtt, json, tsv) | txt |
| --output_dir | Output directory | current directory |
| --temperature | Sampling temperature | 0.0 |
| --beam_size | Beam search width | 5 |
| --best_of | Number of candidates | 5 |
| --fp16 | Use FP16 precision (GPU) | True |
| --verbose | Verbose logging | False |
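The CLI is easy to script for batch jobs. A minimal POSIX shell sketch (the file names are placeholders); the `echo` makes it a dry run that prints each command, so remove it to actually transcribe:

```shell
#!/bin/sh
# transcribe_all: print (dry run) one whisper command per input file.
# Remove 'echo' to actually run the transcriptions.
transcribe_all() {
  for f in "$@"; do
    echo whisper "$f" --model small --language en --output_format srt --output_dir ./transcripts
  done
}

transcribe_all meeting1.mp3 meeting2.mp3 interview.mp3
```

Each file gets its own SRT in `./transcripts`, and the model is only a flag away from being swapped.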
Step 8: Supported Audio and Video Formats
Thanks to FFmpeg, Whisper handles most common formats:
Supported Formats
- Audio: MP3, WAV, M4A, FLAC, OGG, AAC, WMA
- Video: MP4, AVI, MKV, MOV, WebM, FLV
- Streams: audio streams can also be processed
Format Examples
import whisper
model = whisper.load_model("base")
# Audio formats
model.transcribe("audio.mp3")
model.transcribe("audio.wav")
model.transcribe("audio.m4a")
model.transcribe("audio.flac")
# Video formats (extracts audio automatically)
model.transcribe("video.mp4")
model.transcribe("video.mkv")
model.transcribe("video.webm")
Step 9: A Complete Production Example
Here is a complete example ready for production use:
import whisper
import json
from pathlib import Path
from datetime import datetime

class WhisperTranscriber:
    """Production-ready Whisper transcription service."""

    def __init__(self, model_size="base"):
        """Initialize transcriber with specified model."""
        print(f"Loading Whisper model: {model_size}")
        self.model = whisper.load_model(model_size)
        print("✓ Model loaded successfully")

    def transcribe_file(self, audio_path, output_dir="transcripts", **kwargs):
        """
        Transcribe audio file and save results.

        Args:
            audio_path: Path to audio file
            output_dir: Directory to save outputs
            **kwargs: Additional transcribe parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        output_path = Path(output_dir)
        output_path.mkdir(exist_ok=True)

        print(f"\nTranscribing: {audio_path.name}")

        # Transcribe
        result = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )

        # Prepare output data
        output_data = {
            "file": str(audio_path),
            "transcribed_at": datetime.now().isoformat(),
            "language": result["language"],
            "language_probability": result.get("language_probability", 0),
            "duration": result["segments"][-1]["end"] if result["segments"] else 0,
            "text": result["text"],
            "segments": result["segments"]
        }

        # Save outputs
        base_name = audio_path.stem

        # Save as text
        text_file = output_path / f"{base_name}.txt"
        with open(text_file, "w", encoding="utf-8") as f:
            f.write(result["text"])

        # Save as JSON
        json_file = output_path / f"{base_name}.json"
        with open(json_file, "w", encoding="utf-8") as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)

        # Save as SRT
        srt_file = output_path / f"{base_name}.srt"
        self._save_srt(result["segments"], srt_file)

        print("✓ Transcription saved:")
        print(f"  - Text: {text_file}")
        print(f"  - JSON: {json_file}")
        print(f"  - SRT: {srt_file}")

        return output_data

    def _save_srt(self, segments, output_path):
        """Save segments as SRT subtitle file."""
        with open(output_path, "w", encoding="utf-8") as f:
            for i, segment in enumerate(segments, start=1):
                start = self._format_srt_time(segment["start"])
                end = self._format_srt_time(segment["end"])
                text = segment["text"].strip()
                f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

    def _format_srt_time(self, seconds):
        """Format seconds to SRT timestamp."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Usage
if __name__ == "__main__":
    transcriber = WhisperTranscriber(model_size="base")
    result = transcriber.transcribe_file(
        "meeting.mp3",
        output_dir="transcripts",
        language="en",
        temperature=0.0
    )
    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print("\nTranscription preview:")
    print(result["text"][:200] + "...")
Step 10: Best Practices
1. Choose the Right Model
# For speed (testing, demos)
model = whisper.load_model("tiny")
# For balance (general use)
model = whisper.load_model("base") # or "small"
# For accuracy (production)
model = whisper.load_model("medium") # or "large-v2"
2. Specify the Language When You Know It
# Faster and more accurate
result = model.transcribe("audio.mp3", language="en")
# Instead of auto-detection
result = model.transcribe("audio.mp3") # Slower
3. Use an Appropriate Temperature
# Recommended for most cases
result = model.transcribe("audio.mp3", temperature=0.0)
# For creative content (not recommended for transcription)
result = model.transcribe("audio.mp3", temperature=0.2)
4. Provide Context with initial_prompt
# Technical content
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="This meeting discusses software architecture, APIs, and deployment strategies."
)

# Medical content
result = model.transcribe(
    "consultation.mp3",
    initial_prompt="This is a medical consultation about patient symptoms and treatment."
)
5. Reuse the Model Instance
# Load once, reuse multiple times
model = whisper.load_model("base")

# Process multiple files
for audio_file in ["file1.mp3", "file2.mp3", "file3.mp3"]:
    result = model.transcribe(audio_file)
    # Process result...
6. Handling Very Long Audio
For extremely long recordings, consider processing in chunks:
import os
import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=600000):  # 10 minutes
    """Transcribe long audio by splitting into chunks."""
    model = whisper.load_model("base")

    # Load audio
    audio = AudioSegment.from_file(audio_path)
    duration_ms = len(audio)

    all_text = []
    all_segments = []

    # Process in chunks
    for i in range(0, duration_ms, chunk_length_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")

        result = model.transcribe(chunk_path)
        all_text.append(result["text"])

        # Shift segment timestamps so they are relative to the full file,
        # not to the start of the current chunk
        offset = i / 1000.0
        for segment in result["segments"]:
            segment["start"] += offset
            segment["end"] += offset
        all_segments.extend(result["segments"])

        # Clean up chunk file
        os.remove(chunk_path)

    return {
        "text": " ".join(all_text),
        "segments": all_segments
    }
Troubleshooting
Issue 1: FFmpeg Not Found
Error:
FileNotFoundError: ffmpeg
Fix:
# Install FFmpeg
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# Verify
ffmpeg -version
Issue 2: Out of Memory
Error:
RuntimeError: CUDA out of memory (or system RAM exhausted)
Fix:
# Use smaller model
model = whisper.load_model("base") # Instead of "large"
# Or use CPU
import torch
model = whisper.load_model("base", device="cpu")
# Or process in chunks (see above)
Issue 3: Slow Transcription
Symptom: transcription runs very slowly
Fix:
# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
# Use smaller model
model = whisper.load_model("tiny") # or "base"
# Reduce beam size (faster but slightly less accurate)
result = model.transcribe("audio.mp3", beam_size=1)
Issue 4: Low Accuracy
Symptom: frequent transcription errors
Fix:
# Use larger model
model = whisper.load_model("medium")  # or "large"

# Specify language
result = model.transcribe("audio.mp3", language="en")

# Provide context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="Context about the audio content..."
)

# Use optimal settings
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,
    beam_size=5,
    best_of=5
)
Use Cases
1. Podcast Transcription
model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3", language="en")
# Save transcript
with open("podcast_transcript.txt", "w") as f:
    f.write(result["text"])
2. Generating YouTube Subtitles
model = whisper.load_model("base")
result = model.transcribe("video.mp4", language="en")
# Generate SRT
# (Use CLI: whisper video.mp4 --output_format srt)
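If you'd rather stay in Python than shell out to the CLI, the segments can be rendered as SRT directly. A minimal sketch; the hand-written segments below simply mirror the shape of what `model.transcribe` returns:

```python
def to_srt(segments):
    """Render Whisper-style segments as an SRT subtitle string."""
    def stamp(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        millis = int(round(seconds * 1000))
        h, rem = divmod(millis, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"

# Example with hand-written segments
segments = [
    {"start": 0.0, "end": 5.2, "text": " Hello everyone."},
    {"start": 5.2, "end": 12.5, "text": " Welcome to the channel."},
]
print(to_srt(segments))
```

Write the returned string to a `.srt` file and most video players and YouTube will accept it as-is.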
3. Meeting Notes
model = whisper.load_model("base")
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="This is a business meeting discussing project updates and deadlines."
)

# Save with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.0f}s] {segment['text']}")
4. Interview Transcription
model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Export for editing
with open("interview.txt", "w") as f:
    for segment in result["segments"]:
        f.write(f"[{segment['start']:.2f}s] {segment['text']}\n")
5. Translating Multilingual Content
model = whisper.load_model("base")
# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"]) # English translation
Whisper vs. Alternatives
| Feature | Whisper | Cloud APIs | Faster-Whisper |
|---|---|---|---|
| Cost | Free | Billed per minute | Free |
| Offline | ✅ | ❌ | ✅ |
| Speed | Moderate | Fast | Fast (~2-4x) |
| Accuracy | High | High | High (comparable) |
| Deployment | Easy | Very easy | Easy |
| Real-time | ❌ | ✅ | ❌ |
| Privacy | ✅ Local | ❌ Cloud | ✅ Local |
Choose Whisper when you:
- Need free, offline transcription
- Have strict privacy requirements
- Want full control over your infrastructure
- Process files in batches or work with archival content
Choose a cloud API when you:
- Need real-time transcription
- Prefer managed infrastructure
- Have budget for API fees
- Need enterprise-grade support
Further Reading
Once you have the basics down, continue with:
- Whisper Python Examples — more detailed Python examples
- Faster-Whisper Guide — transcription at roughly 2-4x speed
- Whisper Accuracy Tips — improving transcription quality
- Formatting Whisper Transcripts — formatted output (SRT, VTT, JSON)
- Whisper for Meetings — meeting-focused transcription
Conclusion
OpenAI Whisper is one of the most capable open-source speech-to-text models available today. With strong multilingual support, high transcription accuracy, and full offline capability, it is an excellent fit for developers and content creators who want complete control over their transcription pipeline.
Key takeaways:
- Whisper supports 99+ languages with solid accuracy
- Pick a model size that matches your needs
- Specifying the language when you know it improves results
- Use word-level timestamps for fine-grained alignment
- Reuse the model instance when processing multiple files
- Consider faster-whisper for production deployments
Whether you are transcribing podcasts, generating subtitles, or processing meeting recordings, Whisper offers a robust, free, privacy-friendly speech-to-text solution.
Need a professional speech-to-text solution? Visit SayToWords to explore our AI transcription platform, with optimized performance and support for multiple output formats.