Whisper Python Example: A Complete Guide to Speech-to-Text

Eric King


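OpenAI Whisper is one of the most capable open-source speech recognition models available. In this guide, you will learn how to use Whisper with Python to transcribe audio files into text with high accuracy.
This tutorial is for:
  • Developers building speech-to-text features
  • Data scientists working with audio data
  • Readers who need a complete Whisper Python example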

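What Is OpenAI Whisper?

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio. It can:
  • Transcribe speech in 99+ languages
  • Detect the spoken language automatically
  • Translate speech into English
  • Handle noisy audio and accents
  • Process long audio files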

Prerequisites

Before you begin, make sure you have:
  • Python 3.8+ installed
  • The pip package manager
  • FFmpeg installed (for audio processing)
  • (Optional) An NVIDIA GPU for acceleration

Step 1: Install Whisper

Install the OpenAI Whisper package with pip:
pip install openai-whisper

Install FFmpeg

macOS (using Homebrew):
brew install ffmpeg
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
Windows: Download FFmpeg from ffmpeg.org and add it to your PATH.
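
Before moving on, it can help to confirm that Python can actually find FFmpeg. The following is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is illustrative and not part of Whisper.

```python
import shutil
import subprocess

def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    # Show the first line of `ffmpeg -version` as a quick sanity check
    lines = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=False
    ).stdout.splitlines()
    print(lines[0] if lines else "ffmpeg found")
else:
    print("ffmpeg not found on PATH; install it before running Whisper")
```

If the check fails right after installation, opening a new terminal usually picks up the updated PATH.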

Step 2: Basic Whisper Python Example

Here is a simple Python script that transcribes an audio file:
import whisper

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print the transcription
print(result["text"])
Output:
Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.

Step 3: Complete Python Example with Error Handling

Here is a more robust example that includes error handling:
import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)
    
    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)
        
        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)
        
        return result
    
    except Exception as e:
        print(f"Error during transcription: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "sample_audio.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    
    if result:
        print("\nTranscription:")
        print(result["text"])

Step 4: Advanced Example with Language Detection

Whisper can detect the language automatically, or you can specify it manually:
import whisper

model = whisper.load_model("base")

# Auto-detect language
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Transcription: {result['text']}")

# Specify language explicitly
result_en = model.transcribe("audio.mp3", language="en")
result_zh = model.transcribe("audio.mp3", language="zh")
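
If you want to see how confident the detection is, Whisper also exposes per-language probabilities. The sketch below follows the lower-level usage shown in the Whisper README; `top_languages` and `detect_language_probs` are illustrative helper names, the sample probabilities are made up, and the `n_mels` argument is an assumption that may not be accepted by older releases.

```python
def top_languages(probs, k=3):
    """Return the k most likely (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

def detect_language_probs(audio_path, model_size="base"):
    """Run Whisper's language detection on the first 30 seconds of a file."""
    import whisper  # imported lazily so top_languages works without Whisper

    model = whisper.load_model(model_size)
    # Language detection works on a single 30-second window
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    return probs

# Made-up probabilities, shaped like the dict that detect_language returns
sample_probs = {"en": 0.91, "de": 0.04, "nl": 0.03, "zh": 0.01}
for code, p in top_languages(sample_probs):
    print(f"{code}: {p:.2f}")
```

With a real file, `top_languages(detect_language_probs("audio.mp3"))` would list the most likely languages; a low top probability is a hint that specifying `language=` explicitly may be worthwhile.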

Step 5: Get Timestamps and Segment Information

Whisper returns detailed segment information with timestamps:
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\nSegments with Timestamps:")
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
Output:
Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

Segments with Timestamps:
[0.00s - 2.50s] Hello everyone, welcome to today's meeting.
[2.50s - 5.80s] We will discuss the project timeline.
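
Because each segment is a plain dictionary with `start`, `end`, and `text`, the result can be post-processed without running the model again. As one possible sketch, the helper below merges consecutive segments separated by short pauses into paragraph-like groups; `group_segments` and the `max_gap` threshold are illustrative choices, and the sample data is made up.

```python
def group_segments(segments, max_gap=1.0):
    """Merge consecutive segments whose pause is at most max_gap seconds."""
    groups = []
    for seg in segments:
        text = seg["text"].strip()
        if groups and seg["start"] - groups[-1]["end"] <= max_gap:
            # Short pause: extend the current group
            groups[-1]["end"] = seg["end"]
            groups[-1]["text"] += " " + text
        else:
            # Long pause (or first segment): start a new group
            groups.append({"start": seg["start"], "end": seg["end"], "text": text})
    return groups

# Sample data shaped like result["segments"] (values are made up)
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone, welcome to today's meeting."},
    {"start": 2.5, "end": 5.8, "text": " We will discuss the project timeline."},
    {"start": 9.0, "end": 11.2, "text": " First, the design phase."},
]

for g in group_segments(sample_segments):
    print(f"[{g['start']:.2f}s - {g['end']:.2f}s] {g['text']}")
```

Here the first two segments are merged because they are back to back, while the third starts a new group after a pause longer than one second.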

Step 6: Translate Audio to English

Whisper can translate non-English speech directly into English:
import whisper

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")

print("Translated text:")
print(result["text"])

Step 7: Batch Process Multiple Audio Files

Here is how to transcribe multiple files in a batch:
import whisper
import os
from pathlib import Path

def batch_transcribe(audio_directory, model_size="base", output_dir="transcriptions"):
    """
    Transcribe all audio files in a directory.
    
    Args:
        audio_directory (str): Directory containing audio files
        model_size (str): Whisper model size
        output_dir (str): Directory to save transcriptions
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model once
    model = whisper.load_model(model_size)
    
    # Supported audio formats
    audio_extensions = ['.mp3', '.wav', '.m4a', '.flac', '.ogg']
    
    # Process each audio file
    audio_files = [
        f for f in os.listdir(audio_directory)
        if any(f.lower().endswith(ext) for ext in audio_extensions)
    ]
    
    for audio_file in audio_files:
        audio_path = os.path.join(audio_directory, audio_file)
        print(f"\nProcessing: {audio_file}")
        
        try:
            result = model.transcribe(audio_path)
            
            # Save transcription to file
            output_file = os.path.join(
                output_dir,
                Path(audio_file).stem + ".txt"
            )
            
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(result["text"])
            
            print(f"✓ Saved: {output_file}")
            
        except Exception as e:
            print(f"✗ Error processing {audio_file}: {str(e)}")

# Example usage
batch_transcribe("audio_files/", model_size="base")
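
When a batch job is interrupted, re-running it repeats work that has already been done. One way to avoid that, sketched here under the assumption that transcripts are saved as `<stem>.txt` as in the function above, is to list only the files that still lack a transcript. `pending_audio_files` is a hypothetical helper, and the demo uses temporary empty files instead of real audio.

```python
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def pending_audio_files(audio_directory, output_dir):
    """List audio files that do not yet have a matching .txt transcript."""
    audio_dir, out_dir = Path(audio_directory), Path(output_dir)
    pending = []
    for path in sorted(audio_dir.iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # not an audio file we handle
        if (out_dir / f"{path.stem}.txt").exists():
            continue  # already transcribed in an earlier run
        pending.append(path)
    return pending

# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    audio_dir, out_dir = Path(tmp, "audio"), Path(tmp, "out")
    audio_dir.mkdir()
    out_dir.mkdir()
    for name in ("a.mp3", "b.wav", "notes.pdf"):
        (audio_dir / name).touch()
    (out_dir / "a.txt").touch()  # pretend a.mp3 was already transcribed
    demo_pending = [p.name for p in pending_audio_files(audio_dir, out_dir)]

print(demo_pending)
```

In the batch function, the list comprehension that builds `audio_files` could be replaced by a call like this one.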

Step 8: Export to SRT Subtitle Format

Create an SRT subtitle file from the transcription:
import whisper

def transcribe_to_srt(audio_path, output_path, model_size="base"):
    """
    Transcribe audio and save as SRT subtitle file.
    
    Args:
        audio_path (str): Path to audio file
        output_path (str): Path to save SRT file
        model_size (str): Whisper model size
    """
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    
    # Generate SRT content
    srt_content = ""
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_timestamp(segment["start"])
        end_time = format_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_content += f"{i}\n"
        srt_content += f"{start_time} --> {end_time}\n"
        srt_content += f"{text}\n\n"
    
    # Save SRT file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(srt_content)
    
    print(f"SRT file saved: {output_path}")

def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Example usage
transcribe_to_srt("video.mp4", "subtitles.srt", model_size="base")
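
Some segments are long enough that a player shows them as one very wide line. A small sketch of line wrapping with the standard library follows; `wrap_subtitle_text` is an illustrative helper, and the 42-character limit is an assumed guideline, not something Whisper or the SRT format requires.

```python
import textwrap

def wrap_subtitle_text(text, max_chars=42):
    """Break subtitle text into lines of at most max_chars characters."""
    return "\n".join(textwrap.wrap(text.strip(), width=max_chars))

sample = "We will discuss the project timeline and upcoming milestones for the next quarter."
wrapped = wrap_subtitle_text(sample)
print(wrapped)
```

This wraps on whitespace, so it suits languages written with spaces; text in languages such as Chinese or Japanese would need a different splitting rule.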

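Whisper Model Size Comparison

Choose a model size that fits your needs:
Model | Parameters | Speed | Accuracy | Memory | Best for
tiny | 39M | ⭐⭐⭐⭐⭐ | ⭐⭐ | ~1GB | Quick tests, simple audio
base | 74M | ⭐⭐⭐⭐ | ⭐⭐⭐ | ~1GB | General use
small | 244M | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~2GB | Balanced
medium | 769M | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~5GB | High accuracy needed
large | 1550M | ⭐ | ⭐⭐⭐⭐⭐ | ~10GB | Best accuracy, noisy environments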
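
The memory column can be turned into a simple selection rule. The sketch below picks the largest model whose approximate memory need fits a given budget; the numbers are copied from the table above, and `largest_model_that_fits` is an illustrative helper, not a Whisper API.

```python
# Approximate memory needs in GB, copied from the comparison table
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def largest_model_that_fits(available_gb):
    """Return the largest model whose approximate need fits, or None."""
    # Dict order runs from smallest to largest model, so the last match wins
    fitting = [name for name, need in MODEL_MEMORY_GB.items() if need <= available_gb]
    return fitting[-1] if fitting else None

for budget in (0.5, 4, 8, 16):
    print(f"{budget} GB -> {largest_model_that_fits(budget)}")
```

The figures are rough, so leaving some headroom below the actual amount of free memory is a reasonable precaution.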

Whisper Python Best Practices

1. Choose the Right Model Size

# Fast and lightweight
model = whisper.load_model("tiny")  # Good for testing

# Balanced
model = whisper.load_model("base")  # Good for most cases

# High accuracy
model = whisper.load_model("medium")  # For important transcriptions

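2. Handle Long Audio

For very long audio, consider processing it in chunks (this example also requires pydub, installable with pip install pydub):
import os

import whisper
from pydub import AudioSegment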

def transcribe_long_audio(audio_path, chunk_length_ms=60000):
    """
    Transcribe long audio by splitting into chunks.
    
    Args:
        audio_path: Path to audio file
        chunk_length_ms: Length of each chunk in milliseconds
    """
    model = whisper.load_model("base")
    
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(audio), chunk_length_ms):
        chunks.append(audio[i:i + chunk_length_ms])
    
    # Transcribe each chunk
    full_text = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        result = model.transcribe(chunk_path)
        full_text.append(result["text"])
        
        # Clean up chunk file
        os.remove(chunk_path)
    
    return " ".join(full_text)
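
One caveat with chunking: timestamps in each chunk's result start again from zero. If you need segment timings for the whole recording, each chunk's segments have to be shifted by that chunk's start time. A minimal sketch, with `offset_segments` as a hypothetical helper and made-up segment data:

```python
def offset_segments(segments, offset_seconds):
    """Return copies of segments with start/end shifted by offset_seconds."""
    return [
        {**seg, "start": seg["start"] + offset_seconds, "end": seg["end"] + offset_seconds}
        for seg in segments
    ]

chunk_length_ms = 60000
chunk_index = 1  # the second chunk, which starts 60 seconds into the recording

# Made-up segment, timed relative to the start of its own chunk
chunk_segments = [{"start": 0.0, "end": 3.2, "text": " Second chunk starts here."}]

shifted = offset_segments(chunk_segments, chunk_index * chunk_length_ms / 1000)
print(f"{shifted[0]['start']:.2f}s - {shifted[0]['end']:.2f}s")
```

Inside the chunk loop, the same call could be applied to `result["segments"]` with `i * chunk_length_ms / 1000` as the offset. Cutting at fixed positions can also split a word in half, so accuracy near chunk boundaries may suffer.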

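3. Use GPU Acceleration

If you have an NVIDIA GPU:
import whisper

# load_model() already picks the GPU by default when CUDA is available;
# device="cuda" requests it explicitly and fails if no GPU is usable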
model = whisper.load_model("base", device="cuda")
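
If the same script has to run on machines with and without a GPU, the device can be chosen at runtime. This is a small sketch built on PyTorch's `torch.cuda.is_available()`; `pick_device` is an illustrative name.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of openai-whisper
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")

# With Whisper installed, the result could then be used like this:
#   model = whisper.load_model("base", device=device)
#   result = model.transcribe("audio.mp3", fp16=(device == "cuda"))
```

Passing `fp16=False` on CPU avoids the warning Whisper prints when half precision is requested on a device that does not support it.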

4. Specify the Language to Improve Accuracy

# If you know the language, specify it
result = model.transcribe("audio.mp3", language="en")

Common Use Cases

Podcast Transcription

import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3")

# Save transcript
with open("podcast_transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

Meeting Notes

import whisper
from datetime import datetime

model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")

# Create formatted meeting notes
notes = f"""
Meeting Notes - {datetime.now().strftime('%Y-%m-%d')}
========================================

{result["text"]}
"""

with open("meeting_notes.txt", "w", encoding="utf-8") as f:
    f.write(notes)

Video Subtitles

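import whisper

def format_vtt_timestamp(seconds):
    """Convert seconds to VTT timestamp format (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

# Generate VTT subtitle file
vtt_content = "WEBVTT\n\n"
for segment in result["segments"]:
    start = format_vtt_timestamp(segment["start"])
    end = format_vtt_timestamp(segment["end"])
    text = segment["text"].strip()
    vtt_content += f"{start} --> {end}\n{text}\n\n"

with open("subtitles.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)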

Troubleshooting

Issue 1: FFmpeg Not Found

Error: FileNotFoundError: ffmpeg
Solution:
# Install FFmpeg
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
# Download from ffmpeg.org and add to PATH

Issue 2: Out of GPU Memory

Error: RuntimeError: CUDA out of memory
Solution:
# Use a smaller model
model = whisper.load_model("tiny")  # Instead of "large"

# Or use CPU
model = whisper.load_model("base", device="cpu")
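
The two workarounds can also be combined into an automatic fallback. The sketch below tries a list of configurations in order and returns the first one that succeeds. `load_with_fallback` is a hypothetical helper, and `fake_loader` only simulates the failure so the example runs without a GPU; with Whisper installed, `whisper.load_model` could be passed as the loader instead.

```python
def load_with_fallback(loader, attempts):
    """Call loader(**kwargs) for each kwargs dict; return the first success."""
    last_error = None
    for kwargs in attempts:
        try:
            return loader(**kwargs), kwargs
        except RuntimeError as err:
            # PyTorch reports CUDA out-of-memory as a RuntimeError (or subclass)
            print(f"Failed with {kwargs}: {err}")
            last_error = err
    raise RuntimeError("All attempts failed") from last_error

def fake_loader(name, device):
    """Stand-in for whisper.load_model that fails on the GPU."""
    if device == "cuda":
        raise RuntimeError("CUDA out of memory")  # simulated failure
    return f"{name} on {device}"

model, used = load_with_fallback(
    fake_loader,
    [{"name": "base", "device": "cuda"}, {"name": "base", "device": "cpu"}],
)
print(model, used)
```

Keep in mind that memory can also run out later, during `transcribe()`, so the same pattern may be needed around the transcription call.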

Issue 3: Slow Processing

Solutions:
  • Use a smaller model (tiny or base)
  • Enable GPU acceleration
  • Process audio in chunks
  • Use multiprocessing for batch jobs

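Performance Tips

  1. Use a GPU when possible: it can be 10–50x faster than a CPU
  2. Choose the right model: simple tasks do not need "large"
  3. Preprocess audio: remove silence and normalize volume
  4. Process in batches: load the model once and reuse it for many files
  5. Use threads: suitable for I/O-bound operations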
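
When combining batch processing with multiple worker processes, one simple approach is to give every worker its own share of the files so that each worker loads the model only once. The sketch below shows only the splitting step; `split_round_robin` is an illustrative helper and the file names are placeholders.

```python
def split_round_robin(items, n_workers):
    """Distribute items across n_workers lists, round-robin."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [items[i::n_workers] for i in range(n_workers)]

files = ["a.mp3", "b.mp3", "c.mp3", "d.mp3", "e.mp3"]
batches = split_round_robin(files, 2)
print(batches)
```

Each batch could then be handed to a worker, for example through `concurrent.futures.ProcessPoolExecutor`, where the worker function loads the model once and loops over its batch. On a single GPU, several workers compete for the same memory, so the number of workers is worth tuning.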

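Whisper Python vs. Other Options

Feature | Whisper Python | Google Speech-to-Text | AssemblyAI
Cost | Free (local) | Per-minute billing | Per-minute billing
Offline | | |
Accuracy | | |
Deployment difficulty | Medium | Easy | Easy
Long audio | | |
Multilingual | | |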

Complete Example: Production-Ready Script

Below is a complete, production-ready example:
#!/usr/bin/env python3
"""
Production-ready Whisper transcription script.
"""

import whisper
import argparse
import os
import json
from pathlib import Path
from datetime import datetime

def transcribe_file(
    audio_path,
    model_size="base",
    language=None,
    output_format="txt",
    output_dir=None
):
    """
    Transcribe an audio file with comprehensive output options.
    
    Args:
        audio_path: Path to audio file
        model_size: Whisper model size
        language: Language code (optional, auto-detected if None)
        output_format: Output format (txt, json, srt, vtt)
        output_dir: Output directory (default: same as audio file)
    """
    # Validate input file
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    
    # Set output directory (dirname is "" for a bare filename, so fall back to ".")
    if output_dir is None:
        output_dir = os.path.dirname(audio_path) or "."
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)
    
    # Transcribe
    print(f"Transcribing: {audio_path}")
    transcribe_kwargs = {}
    if language:
        transcribe_kwargs["language"] = language
    
    result = model.transcribe(audio_path, **transcribe_kwargs)
    
    # Generate output filename
    base_name = Path(audio_path).stem
    output_path = os.path.join(output_dir, base_name)
    
    # Save based on format
    if output_format == "txt":
        with open(f"{output_path}.txt", "w", encoding="utf-8") as f:
            f.write(result["text"])
    
    elif output_format == "json":
        with open(f"{output_path}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    
    elif output_format == "srt":
        srt_content = generate_srt(result["segments"])
        with open(f"{output_path}.srt", "w", encoding="utf-8") as f:
            f.write(srt_content)
    
    elif output_format == "vtt":
        vtt_content = generate_vtt(result["segments"])
        with open(f"{output_path}.vtt", "w", encoding="utf-8") as f:
            f.write(vtt_content)
    
    print(f"✓ Transcription saved: {output_path}.{output_format}")
    print(f"  Language: {result['language']}")
    # Guard against audio with no detected speech (empty segment list)
    if result["segments"]:
        print(f"  Duration: {result['segments'][-1]['end']:.2f}s")
    
    return result

def generate_srt(segments):
    """Generate SRT subtitle content."""
    srt = ""
    for i, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt

def generate_vtt(segments):
    """Generate VTT subtitle content."""
    vtt = "WEBVTT\n\n"
    for segment in segments:
        start = format_vtt_timestamp(segment["start"])
        end = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt += f"{start} --> {end}\n{text}\n\n"
    return vtt

def format_timestamp(seconds):
    """Format timestamp for SRT (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_vtt_timestamp(seconds):
    """Format timestamp for VTT (HH:MM:SS.mmm)."""
    # Same as SRT, but VTT uses a dot before the milliseconds
    return format_timestamp(seconds).replace(",", ".")

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using OpenAI Whisper"
    )
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size"
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language code (e.g., 'en', 'zh', 'es')"
    )
    parser.add_argument(
        "--output-format",
        default="txt",
        choices=["txt", "json", "srt", "vtt"],
        help="Output format"
    )
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Output directory"
    )
    
    args = parser.parse_args()
    
    transcribe_file(
        args.audio,
        model_size=args.model,
        language=args.language,
        output_format=args.output_format,
        output_dir=args.output_dir
    )

if __name__ == "__main__":
    main()
Usage:
# Basic usage
python transcribe.py audio.mp3

# With options
python transcribe.py audio.mp3 --model medium --language en --output-format srt

# Save to specific directory
python transcribe.py audio.mp3 --output-dir ./transcriptions

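Conclusion

This Whisper Python example guide covers what you need to get started with speech-to-text using OpenAI Whisper. Whether you are working on podcasts, meetings, or subtitles, Whisper offers a powerful, free way to turn audio into text.
Key takeaways:
  • Whisper is free and open source
  • It supports 99+ languages
  • It runs offline (no API calls required)
  • It is highly accurate in most scenarios
  • It is easy to integrate into Python projects
If you need real-time transcription or API access in production, consider a cloud option such as SayToWords, which offers Whisper-based transcription through an API.

Ready to get started? Install Whisper and transcribe your first audio file today.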
