OpenAI Whisper Tutorial: A Complete Guide to Audio Transcription

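OpenAI Whisper is an open-source **automatic speech recognition (ASR)** model for transcribing and translating speech. It supports many languages, copes well with accents and background noise, and is widely used for podcasts, meetings, interviews, and video subtitles.

This tutorial walks through what you need to get started with Whisper, from installation to advanced usage.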

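What Is OpenAI Whisper?

Whisper was trained on more than 680,000 hours of multilingual audio, which makes it notably robust to imperfect, real-world recordings. It ranks among the most accurate open-source speech recognition models available.

Key Features

Multilingual: 99+ languages
Speech-to-text: converts audio into text
Speech translation: translates speech directly into English
Language detection: automatically detects the spoken language
Timestamps: word level and segment level
Open source and free: MIT license, no API fees
Offline capable: runs on your local machine
Many formats: works with a wide range of audio and video formats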

Whisper Model Sizes

Whisper ships in several model sizes so you can trade speed against accuracy.

Model	Parameters	Speed	Accuracy	Memory	Use case
tiny	39M	⭐⭐⭐⭐⭐	⭐⭐	~1 GB	Quick tests, demos
base	74M	⭐⭐⭐⭐	⭐⭐⭐	~1 GB	Clean audio, short tasks
small	244M	⭐⭐⭐	⭐⭐⭐⭐	~2 GB	General purpose, balanced
medium	769M	⭐⭐	⭐⭐⭐⭐⭐	~5 GB	Noisy audio, high accuracy
large	1550M	⭐	⭐⭐⭐⭐⭐⭐	~10 GB	Highest accuracy, production

Recommendations:

Speed first: tiny or base
Balanced: small or medium
Accuracy first: large or large-v3
Production: usually medium or large-v2
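
The selection rule above can be sketched as a small helper. This is a minimal illustration, assuming the approximate memory figures from the table; `MODEL_MEMORY_GB` and `pick_model` are hypothetical names, not part of the Whisper package.

```python
# Approximate memory needs in GB, copied from the model size table.
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(available_gb: float) -> str:
    """Return the largest model whose approximate memory need fits the budget."""
    fitting = [name for name, need in MODEL_MEMORY_GB.items() if need <= available_gb]
    if not fitting:
        raise ValueError(f"No model fits in {available_gb} GB")
    # The dict is ordered from smallest to largest model, so the last fit is the largest.
    return fitting[-1]

print(pick_model(6))  # medium
```

Memory is only one constraint. If latency matters more than accuracy, a smaller model than the one returned here may still be the better choice.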

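Prerequisites

Before using Whisper, make sure you have the following:

Python 3.8 or later (3.9+ recommended)
The pip package manager
FFmpeg installed (for audio and video decoding)
(Optional) A CUDA-capable NVIDIA GPU for faster processing
(Optional) At least 4 GB of RAM for the base model, 10 GB or more for large

Step 1: Installation

Installing Whisper

Install the OpenAI Whisper package with pip.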

pip install openai-whisper

To pin a specific version:

pip install openai-whisper==20231117

Installing FFmpeg

Whisper needs FFmpeg to decode audio and video files.

macOS (Homebrew):

brew install ffmpeg

Ubuntu / Debian:

sudo apt update
sudo apt install ffmpeg

Windows:

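Download FFmpeg from ffmpeg.org
Extract it and add it to the system PATH
Or install it with Chocolatey: choco install ffmpeg

Verify the installation (the whisper CLI has no --version flag, so use --help to confirm it is on the PATH):

ffmpeg -version
whisper --help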
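
The same checks can be run from Python, which is convenient at the start of a script. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, and it only tests that the pieces can be found, not that they work.

```python
import importlib.util
import shutil
import sys

def check_environment() -> dict:
    """Report whether each prerequisite appears to be available."""
    return {
        "python_ok": sys.version_info >= (3, 8),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "whisper_installed": importlib.util.find_spec("whisper") is not None,
    }

for name, ok in check_environment().items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```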

Step 2: Basic Usage in Python

Simple Transcription

The simplest way to transcribe an audio file:

import whisper

# Load model (downloads automatically on first use)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print transcription
print(result["text"])

Output:

Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.

Complete Example with Error Handling

import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)
    
    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)
        
        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)
        
        print(f"✓ Transcription complete!")
        print(f"  Language: {result['language']}")
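        # Silent or empty audio yields no segments, so guard the lookup
        duration = result["segments"][-1]["end"] if result["segments"] else 0.0
        print(f"  Duration: {duration:.2f}s")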
        
        return result
    
    except Exception as e:
        print(f"Error during transcription: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "meeting.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    
    if result:
        print("\n" + "="*50)
        print("TRANSCRIPTION:")
        print("="*50)
        print(result["text"])

Step 3: Detecting and Specifying the Language

Automatic Language Detection

Whisper detects the spoken language automatically.

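import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# transcribe() returns the detected language code, but no probability
print(f"Detected language: {result['language']}")
print(f"\nTranscription:\n{result['text']}")

# For a probability, run language detection on the first 30 seconds
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
_, probs = model.detect_language(mel)
best = max(probs, key=probs.get)
print(f"Language probability: {best} ({probs[best]:.2%})")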

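Specifying the Language (Faster and More Accurate)

If you already know the language, specify it. Whisper then skips the detection pass and cannot misdetect the language on short or noisy clips.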

import whisper

model = whisper.load_model("base")

# Specify language
result_en = model.transcribe("audio.mp3", language="en")  # English
result_zh = model.transcribe("audio.mp3", language="zh")   # Chinese
result_es = model.transcribe("audio.mp3", language="es")  # Spanish
result_fr = model.transcribe("audio.mp3", language="fr")  # French
result_de = model.transcribe("audio.mp3", language="de")  # German
result_ja = model.transcribe("audio.mp3", language="ja")   # Japanese

print(result_en["text"])

Supported languages: Whisper supports 99+ languages. Commonly used language codes:

en: English
zh: Chinese
es: Spanish
fr: French
de: German
ja: Japanese
ko: Korean
pt: Portuguese
ru: Russian
it: Italian

Step 4: Timestamps and Segments

Accessing Segments with Timestamps

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\n" + "="*50)
print("Segments with Timestamps:")
print("="*50)

for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:6.2f}s - {end:6.2f}s] {text}")

Output:

Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

==================================================
Segments with Timestamps:
==================================================
[  0.00s -   5.20s] Hello everyone, welcome to today's meeting.
[  5.20s -  12.50s] We will discuss the project timeline.

Formatting Timestamps as Timecodes

def format_timestamp(seconds):
    """Format seconds to HH:MM:SS."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

for segment in result["segments"]:
    start_time = format_timestamp(segment["start"])
    end_time = format_timestamp(segment["end"])
    print(f"[{start_time} - {end_time}] {segment['text']}")

Word-Level Timestamps

Enable word-level timestamps when you need precise timing.

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    word_timestamps=True  # Enable word-level timestamps
)

for segment in result["segments"]:
    print(f"\n[{segment['start']:.2f}s - {segment['end']:.2f}s]")
    print(f"Text: {segment['text']}")
    
    # Word-level timestamps
    if "words" in segment:
        print("Words:")
        for word in segment["words"]:
            print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
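
One common use of word timings is to regroup words into short caption lines. The sketch below assumes word dictionaries with `word`, `start`, and `end` keys, as in the loop above; `words_to_captions` and the `max_chars` limit are hypothetical, and joining with spaces only suits languages that separate words with spaces.

```python
def words_to_captions(words, max_chars=32):
    """Group word dicts (word/start/end) into captions of at most max_chars."""
    captions, current = [], None
    for w in words:
        token = w["word"].strip()
        if current and len(current["text"]) + 1 + len(token) <= max_chars:
            # The word still fits: extend the current caption
            current["text"] += " " + token
            current["end"] = w["end"]
        else:
            # Start a new caption at this word
            current = {"start": w["start"], "end": w["end"], "text": token}
            captions.append(current)
    return captions

# Hand-written sample in the shape of segment["words"]
sample = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " everyone,", "start": 0.4, "end": 0.9},
    {"word": " welcome", "start": 1.0, "end": 1.5},
]
for caption in words_to_captions(sample, max_chars=15):
    print(caption)
```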

Step 5: Speech Translation

Whisper can translate non-English speech directly into English.

import whisper

model = whisper.load_model("base")

# Translate to English (regardless of source language)
result = model.transcribe("spanish_audio.mp3", task="translate")

print("Translated to English:")
print(result["text"])

# Original transcription (in original language)
result_original = model.transcribe("spanish_audio.mp3", task="transcribe")
print("\nOriginal language transcription:")
print(result_original["text"])

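Example uses:

International meetings
Processing multilingual content
Content localization
Language-learning materials

Step 6: Advanced Parameters

Temperature and Beam Size

These parameters tune transcription quality and speed.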

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    temperature=0.0,        # Lower = more deterministic; a single value disables temperature fallback
    beam_size=5,            # Higher = more accurate but slower (Python default: None, i.e. greedy; CLI default: 5)
    best_of=5,              # Number of sampled candidates; only used when temperature > 0
    patience=1.0,           # Beam search patience
    condition_on_previous_text=True,  # Use context from previous segments
    initial_prompt="This is a technical meeting about AI and machine learning."  # Context prompt
)

Temperature Values

temperature=0.0: most deterministic, recommended
temperature=0.2-0.4: slightly more variation
temperature=1.0: more varied output, usually at the cost of accuracy
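
When no temperature is passed, `transcribe()` uses a schedule of temperatures and retries a window at the next value if the output looks too repetitive or too unlikely. The sketch below illustrates that idea in simplified form. It is not the library's code: `decode_with_fallback` and `fake_decode` are hypothetical, and the thresholds mirror the documented defaults of `compression_ratio_threshold` and `logprob_threshold`.

```python
def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """Retry decoding at higher temperatures until the output looks acceptable."""
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = result["compression_ratio"] > compression_ratio_threshold
        too_unlikely = result["avg_logprob"] < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # good enough, stop escalating
    return result

# Hypothetical decoder: output stays repetitive until temperature reaches 0.4
def fake_decode(t):
    return {
        "temperature": t,
        "compression_ratio": 3.0 if t < 0.4 else 1.5,
        "avg_logprob": -0.3,
    }

print(decode_with_fallback(fake_decode)["temperature"])  # 0.4
```

Passing a single value such as `temperature=0.0` leaves only one entry in the schedule, so no retry happens. That is more deterministic, but a window that degenerates into repetition is kept as is.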

Initial Prompt for Context

Provide context to improve accuracy, especially for domain-specific vocabulary.

result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This meeting discusses API endpoints, microservices, Kubernetes, and CI/CD pipelines."
)

result = model.transcribe(
    "medical_audio.mp3",
    initial_prompt="This is a medical consultation discussing patient symptoms and treatment options."
)
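
Prompts like the ones above can be assembled from a topic sentence and a glossary of terms you expect to hear. This is a rough sketch: `build_initial_prompt` is a hypothetical helper, and `max_chars` is an assumed character budget chosen because Whisper only uses a limited number of prompt tokens, so very long prompts are cut off.

```python
def build_initial_prompt(topic, terms, max_chars=600):
    """Combine a topic sentence and a glossary into one prompt string."""
    prompt = topic.strip()
    kept = []
    for term in terms:
        candidate = f"{prompt} Glossary: {', '.join(kept + [term])}."
        if len(candidate) > max_chars:
            break  # stop before exceeding the assumed budget
        kept.append(term)
    return f"{prompt} Glossary: {', '.join(kept)}." if kept else prompt

prompt = build_initial_prompt(
    "This meeting discusses deployment.",
    ["Kubernetes", "CI/CD", "microservices"],
)
print(prompt)
# Assumed usage: model.transcribe("technical_meeting.mp3", initial_prompt=prompt)
```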

Step 7: Command-Line Interface (CLI)

Whisper also ships with a command-line interface.

Basic CLI Usage

whisper audio.mp3

Choosing a Model

whisper audio.mp3 --model small
whisper audio.mp3 --model medium
whisper audio.mp3 --model large-v2

Specifying the Language

whisper audio.mp3 --language en
whisper audio.mp3 --language zh

Output Formats

# SRT subtitles
whisper audio.mp3 --output_format srt

# VTT subtitles
whisper audio.mp3 --output_format vtt

# Text file
whisper audio.mp3 --output_format txt

# JSON (with all metadata)
whisper audio.mp3 --output_format json

# TSV (tab-separated values)
whisper audio.mp3 --output_format tsv
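
When working from Python rather than the CLI, a subtitle file can also be produced directly from the segment list. The sketch below renders WebVTT using only the standard library; `format_vtt_time` and `segments_to_vtt` are hypothetical helpers that assume segments with `start`, `end`, and `text` keys.

```python
def format_vtt_time(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def segments_to_vtt(segments):
    """Render segment dicts (start/end/text) as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{format_vtt_time(seg['start'])} --> {format_vtt_time(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line ends the cue
    return "\n".join(lines)

print(segments_to_vtt([{"start": 0.0, "end": 5.2, "text": " Hello everyone."}]))
```

SRT differs mainly in numbering each cue and using a comma before the milliseconds.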

Advanced CLI Options

# Full example with all options
whisper audio.mp3 \
  --model medium \
  --language en \
  --task transcribe \
  --output_format srt \
  --output_dir ./transcripts \
  --verbose True \
  --temperature 0.0 \
  --beam_size 5 \
  --best_of 5 \
  --fp16 True

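CLI Option Reference

Option	Description	Default
`--model`	Model name (tiny, base, small, medium, large, ...)	`small` (as of release 20231117; run `whisper --help` to check your version)
`--language`	Language code (en, zh, es, ...)	Auto-detect
`--task`	`transcribe` or `translate`	`transcribe`
`--output_format`	Output format (txt, srt, vtt, json, tsv, all)	`all`
`--output_dir`	Output directory	Current directory
`--temperature`	Sampling temperature	`0`
`--beam_size`	Beam size for beam search	`5`
`--best_of`	Number of candidates when sampling	`5`
`--fp16`	Use FP16 precision (GPU)	`True`
`--verbose`	Print progress and decoded text	`True`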
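
To drive the CLI from a script, it is safer to build the argument list in code than to concatenate a shell string, because file names with spaces then need no manual quoting. A minimal sketch using options from the table above; `build_whisper_command` is a hypothetical helper and its defaults are arbitrary choices for illustration.

```python
import shlex

def build_whisper_command(audio, model="medium", language=None,
                          output_format="srt", output_dir="./transcripts"):
    """Return the argument list for a whisper CLI call."""
    cmd = ["whisper", audio, "--model", model,
           "--output_format", output_format, "--output_dir", output_dir]
    if language:
        # Omit the flag entirely to let Whisper auto-detect the language
        cmd += ["--language", language]
    return cmd

cmd = build_whisper_command("team meeting.mp3", language="en")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```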

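Step 8: Supported Audio and Video Formats

Because Whisper decodes input through FFmpeg, it handles most common formats.

Supported Formats

Audio: MP3, WAV, M4A, FLAC, OGG, AAC, WMA
Video: MP4, AVI, MKV, MOV, WebM, FLV
Streaming: not built in; an audio stream can be processed by capturing it in chunks and transcribing each chunk

Format Examples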

import whisper

model = whisper.load_model("base")

# Audio formats
model.transcribe("audio.mp3")
model.transcribe("audio.wav")
model.transcribe("audio.m4a")
model.transcribe("audio.flac")

# Video formats (extracts audio automatically)
model.transcribe("video.mp4")
model.transcribe("video.mkv")
model.transcribe("video.webm")
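
Whisper resamples its input to 16 kHz mono internally, so when the same large video will be transcribed several times it can help to extract the audio once up front. The sketch below only builds the FFmpeg argument list; `build_ffmpeg_command` is a hypothetical helper, and the file names are placeholders.

```python
def build_ffmpeg_command(src, dst):
    """Return ffmpeg arguments that write src's audio as 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it exists
        "-i", src,
        "-vn",                 # drop any video stream
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM
        dst,
    ]

print(" ".join(build_ffmpeg_command("video.mp4", "audio_16k.wav")))
# To execute it: subprocess.run(build_ffmpeg_command(...), check=True)
```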

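Step 9: A Complete Production-Oriented Example

A fuller example that loads the model once and saves text, JSON, and SRT output for each file.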

import whisper
import json
from pathlib import Path
from datetime import datetime

class WhisperTranscriber:
    """Production-ready Whisper transcription service."""
    
    def __init__(self, model_size="base"):
        """Initialize transcriber with specified model."""
        print(f"Loading Whisper model: {model_size}")
        self.model = whisper.load_model(model_size)
        print("✓ Model loaded successfully")
    
    def transcribe_file(self, audio_path, output_dir="transcripts", **kwargs):
        """
        Transcribe audio file and save results.
        
        Args:
            audio_path: Path to audio file
            output_dir: Directory to save outputs
            **kwargs: Additional transcribe parameters
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        
        print(f"\nTranscribing: {audio_path.name}")
        
        # Transcribe
        result = self.model.transcribe(
            str(audio_path),
            word_timestamps=True,
            **kwargs
        )
        
        # Prepare output data
        # transcribe() returns "text", "segments", and "language" (no probability)
        output_data = {
            "file": str(audio_path),
            "transcribed_at": datetime.now().isoformat(),
            "language": result["language"],
            "duration": result["segments"][-1]["end"] if result["segments"] else 0,
            "text": result["text"],
            "segments": result["segments"]
        }
        
        # Save outputs
        base_name = audio_path.stem
        
        # Save as text
        text_file = output_path / f"{base_name}.txt"
        with open(text_file, "w", encoding="utf-8") as f:
            f.write(result["text"])
        
        # Save as JSON
        json_file = output_path / f"{base_name}.json"
        with open(json_file, "w", encoding="utf-8") as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)
        
        # Save as SRT
        srt_file = output_path / f"{base_name}.srt"
        self._save_srt(result["segments"], srt_file)
        
        print(f"✓ Transcription saved:")
        print(f"  - Text: {text_file}")
        print(f"  - JSON: {json_file}")
        print(f"  - SRT: {srt_file}")
        
        return output_data
    
    def _save_srt(self, segments, output_path):
        """Save segments as SRT subtitle file."""
        with open(output_path, "w", encoding="utf-8") as f:
            for i, segment in enumerate(segments, start=1):
                start = self._format_srt_time(segment["start"])
                end = self._format_srt_time(segment["end"])
                text = segment["text"].strip()
                f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
    
    def _format_srt_time(self, seconds):
        """Format seconds to SRT timestamp."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Usage
if __name__ == "__main__":
    transcriber = WhisperTranscriber(model_size="base")
    
    result = transcriber.transcribe_file(
        "meeting.mp3",
        output_dir="transcripts",
        language="en",
        temperature=0.0
    )
    
    print(f"\nLanguage: {result['language']}")
    print(f"Duration: {result['duration']:.2f}s")
    print(f"\nTranscription preview:")
    print(result['text'][:200] + "...")

Step 10: Best Practices

1. Choose the Right Model

# For speed (testing, demos)
model = whisper.load_model("tiny")

# For balance (general use)
model = whisper.load_model("base")  # or "small"

# For accuracy (production)
model = whisper.load_model("medium")  # or "large-v2"

2. Specify the Language When You Know It

# Faster and more accurate
result = model.transcribe("audio.mp3", language="en")

# Instead of auto-detection
result = model.transcribe("audio.mp3")  # Slower

3. Use an Appropriate Temperature

# Recommended for most cases
result = model.transcribe("audio.mp3", temperature=0.0)

# For creative content (not recommended for transcription)
result = model.transcribe("audio.mp3", temperature=0.2)

4. Provide Context with an Initial Prompt

# Technical content
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="This meeting discusses software architecture, APIs, and deployment strategies."
)

# Medical content
result = model.transcribe(
    "consultation.mp3",
    initial_prompt="This is a medical consultation about patient symptoms and treatment."
)

5. Reuse the Model Instance

# Load once, reuse multiple times
model = whisper.load_model("base")

# Process multiple files
for audio_file in ["file1.mp3", "file2.mp3", "file3.mp3"]:
    result = model.transcribe(audio_file)
    # Process result...
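
For a whole folder of recordings, the same idea extends to a loop that keeps going when one file fails. In this sketch the transcription function is passed in, so a single loaded model can be reused; `transcribe_folder` and `AUDIO_EXTENSIONS` are hypothetical names, and the extension list is an assumption to adjust as needed.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

def transcribe_folder(folder, transcribe):
    """Apply transcribe(path) to each audio file; collect texts and errors."""
    texts, errors = {}, {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # skip non-audio files
        try:
            texts[path.name] = transcribe(str(path))["text"]
        except Exception as exc:  # record the failure and continue
            errors[path.name] = str(exc)
    return texts, errors

# Assumed usage with Whisper:
#   model = whisper.load_model("base")
#   texts, errors = transcribe_folder("recordings", model.transcribe)
```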

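6. Handling Long Audio Files

For very long recordings, consider splitting the audio into chunks. Whisper already works through audio in 30-second windows internally, so chunking mainly helps bound memory use. A fixed-length split can cut a word at a chunk boundary.

import os

import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=600000):  # 10 minutes
    """Transcribe long audio by splitting into chunks."""
    model = whisper.load_model("base")
    
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    duration_ms = len(audio)
    
    all_text = []
    all_segments = []
    
    # Process in chunks
    for i in range(0, duration_ms, chunk_length_ms):
        chunk = audio[i:i + chunk_length_ms]
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        try:
            result = model.transcribe(chunk_path)
        finally:
            # Clean up chunk file even if transcription fails
            os.remove(chunk_path)
        
        all_text.append(result["text"].strip())
        
        # Shift chunk-relative timestamps to positions in the full recording
        offset = i / 1000.0
        for segment in result["segments"]:
            segment["start"] += offset
            segment["end"] += offset
            all_segments.append(segment)
    
    return {
        "text": " ".join(all_text),
        "segments": all_segments
    }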

Common Problems and Fixes

Problem 1: FFmpeg Not Found

Error: FileNotFoundError: ffmpeg

Fix:

# Install FFmpeg
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Verify
ffmpeg -version

Problem 2: Out of Memory

Error: RuntimeError: CUDA out of memory, or the system runs out of RAM

Fix:

# Use smaller model
model = whisper.load_model("base")  # Instead of "large"

# Or use CPU
import torch
model = whisper.load_model("base", device="cpu")

# Or process in chunks (see above)

Problem 3: Slow Transcription

Symptom: transcription takes too long

Fix:

# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# Use smaller model
model = whisper.load_model("tiny")  # or "base"

# Reduce beam size (faster but slightly less accurate)
result = model.transcribe("audio.mp3", beam_size=1)

Problem 4: Low Accuracy

Symptom: many recognition errors

Fix:

# Use larger model
model = whisper.load_model("medium")  # or "large"

# Specify language
result = model.transcribe("audio.mp3", language="en")

# Provide context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="Context about the audio content..."
)

# Use optimal settings
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,
    beam_size=5,
    best_of=5
)

Use Cases

1. Podcast Transcription

model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3", language="en")

# Save transcript
with open("podcast_transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

2. Generating Subtitles for YouTube

model = whisper.load_model("base")
result = model.transcribe("video.mp4", language="en")

# Generate SRT
# (Use CLI: whisper video.mp4 --output_format srt)

3. Meeting Notes

model = whisper.load_model("base")
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="This is a business meeting discussing project updates and deadlines."
)

# Save with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.0f}s] {segment['text']}")
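
For readable notes it often helps to merge consecutive segments into paragraphs and start a new one at each longer pause. The following sketch assumes segments with `start`, `end`, and `text` keys; `group_by_pause` and the two-second `max_gap` are hypothetical choices, and a pause is only a rough proxy for a change of topic or speaker.

```python
def group_by_pause(segments, max_gap=2.0):
    """Merge consecutive segments into paragraphs, splitting at long pauses."""
    paragraphs = []
    for seg in segments:
        text = seg["text"].strip()
        if paragraphs and seg["start"] - paragraphs[-1]["end"] <= max_gap:
            # Short gap: continue the current paragraph
            paragraphs[-1]["text"] += " " + text
            paragraphs[-1]["end"] = seg["end"]
        else:
            paragraphs.append({"start": seg["start"], "end": seg["end"], "text": text})
    return paragraphs

# Hand-written sample in the shape of result["segments"]
sample_segments = [
    {"start": 0.0, "end": 5.2, "text": " Hello everyone."},
    {"start": 5.2, "end": 12.5, "text": " We will discuss the timeline."},
    {"start": 16.0, "end": 20.0, "text": " Next item."},
]
for p in group_by_pause(sample_segments):
    print(f"[{p['start']:.0f}s] {p['text']}")
```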

4. Interview Transcription

model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Export for editing
with open("interview.txt", "w", encoding="utf-8") as f:
    for segment in result["segments"]:
        f.write(f"[{segment['start']:.2f}s] {segment['text']}\n")

5. Translating Multilingual Content into English

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"])  # English translation

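Whisper Compared with Alternatives

Aspect	Whisper	Cloud API	Faster-Whisper
Cost	Free	Billed per minute	Free
Offline	✅	❌	✅
Speed	Moderate	Fast	Fast (2-4x)
Accuracy	High	High	High (comparable)
Setup	Easy	Very easy	Easy
Real-time	❌	✅	❌
Privacy	✅ Local	❌ Cloud	✅ Local

Choose Whisper when:

You want free, offline transcription
Privacy matters
You want to manage your own infrastructure
You process batches or archived content

Choose a cloud API when:

You need real-time transcription
You want managed infrastructure
You have a budget for API costs
You need enterprise support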

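Next Steps

Once you have the basics down, see also:

Whisper Python Example: more detailed Python examples
Faster-Whisper Guide: transcription that runs 2-4x faster
Whisper Accuracy Tips: tips for improving quality
Whisper Transcript Formatting: formatting output (SRT, VTT, JSON)
Whisper for Meetings: transcription for meetings

Summary

OpenAI Whisper is one of the strongest open-source speech transcription models currently available. Its multilingual support, high accuracy, and fully offline operation make it a good fit for developers and creators who want control over their transcription workflow.

Key takeaways:

Whisper supports 99+ languages with high accuracy
Choose the model size to match your use case
Specify the language when you know it to improve performance
Use word timestamps when you need precise timing
Reuse the model instance across multiple files
Consider faster-whisper for production deployments

For podcasts, subtitles, meeting recordings, and more, Whisper is a robust, free, and privacy-preserving option for speech transcription.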

Looking for professional-grade audio transcription? SayToWords offers an AI transcription platform with optimized performance and multiple output formats.
