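Whisper Python Example: A Complete Guide to Speech-to-Text

OpenAI Whisper is one of the most capable open-source speech recognition models available today. This guide shows how to use Whisper from Python to transcribe audio files to text with high accuracy.

This tutorial is for:

Developers building speech-to-text features
Data scientists working with audio data
Anyone looking for a complete Whisper Python example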

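What Is OpenAI Whisper?

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio. It can:

Transcribe speech in 99+ languages
Detect the spoken language automatically
Translate speech into English
Handle noisy audio and accents
Process long-form audio files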

Prerequisites

Before you start, make sure you have:

Python 3.8+ installed
The pip package manager
FFmpeg installed for audio decoding
(Optional) An NVIDIA GPU for faster processing

Step 1: Install Whisper

Install the OpenAI Whisper package with pip:

pip install openai-whisper

Install FFmpeg

macOS (using Homebrew):

brew install ffmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

Windows: Download FFmpeg from ffmpeg.org and add it to your PATH.
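
Before moving on, it can help to confirm that Python can actually find FFmpeg, since Whisper calls the ffmpeg executable to decode audio. The following is a minimal sketch using only the standard library; the helper name ffmpeg_version is made up for this guide and is not part of Whisper.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not usable."""
    exe = shutil.which("ffmpeg")  # looks up the executable on PATH
    if exe is None:
        return None
    try:
        out = subprocess.run(
            [exe, "-version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = out.stdout.splitlines()
    return lines[0] if lines else None
```

If ffmpeg_version() returns None, revisit the installation steps above before loading a model.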

Step 2: Basic Whisper Python Example

Here is a simple Python script that transcribes an audio file:

import whisper

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Print the transcription
print(result["text"])

Example output:

Hello everyone, welcome to today's meeting. We will discuss the project timeline and upcoming milestones.

Step 3: Complete Python Example with Error Handling

Here is a more robust example with proper error handling:

import whisper
import os

def transcribe_audio(audio_path, model_size="base"):
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path (str): Path to the audio file
        model_size (str): Whisper model size (tiny, base, small, medium, large)
    
    Returns:
        dict: Transcription result with text and segments
    """
    try:
        # Check if audio file exists
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Load the Whisper model
        print(f"Loading Whisper model: {model_size}")
        model = whisper.load_model(model_size)
        
        # Transcribe the audio
        print(f"Transcribing: {audio_path}")
        result = model.transcribe(audio_path)
        
        return result
    
    except Exception as e:
        print(f"Error during transcription: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    audio_file = "sample_audio.mp3"
    result = transcribe_audio(audio_file, model_size="base")
    
    if result:
        print("\nTranscription:")
        print(result["text"])
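
The function above catches every exception and returns None, so a caller cannot easily tell a missing file from a decoding failure. One possible refinement is to validate the path before loading the model and raise specific exceptions. This is only a sketch: validate_audio_path and SUPPORTED_EXTENSIONS are illustrative names, and the extension set is an assumption based on the formats used in this guide (FFmpeg can decode many more).

```python
from pathlib import Path

# Assumed extension set for illustration; FFmpeg can decode many more formats.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".mp4"}


def validate_audio_path(audio_path):
    """Return a resolved Path, or raise an exception that names the problem."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported extension: {path.suffix!r}")
    if path.stat().st_size == 0:
        raise ValueError(f"Audio file is empty: {path}")
    return path.resolve()
```

Calling this first means the comparatively slow model load only happens for files that have a chance of being transcribed.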

Step 4: Advanced Example with Language Detection

Whisper detects the language automatically, but you can also specify it yourself:

import whisper

model = whisper.load_model("base")

# Auto-detect language
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Transcription: {result['text']}")

# Specify language explicitly
result_en = model.transcribe("audio.mp3", language="en")
result_zh = model.transcribe("audio.mp3", language="zh")
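
If you want to see how confident the detection is rather than only the winning language, Whisper's lower-level API exposes per-language probabilities. The sketch below follows the usage shown in the Whisper README (load_audio, pad_or_trim, log_mel_spectrogram, detect_language); the helper names top_languages and detect_language_probs are made up for this guide, and passing n_mels=model.dims.n_mels assumes a reasonably recent openai-whisper release.

```python
def top_languages(probs, n=3):
    """Return the n most likely (language_code, probability) pairs, highest first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]


def detect_language_probs(audio_path, model_size="base"):
    """Run Whisper's language detection on the first 30 seconds of a file."""
    import whisper  # imported lazily so top_languages works without Whisper installed

    model = whisper.load_model(model_size)
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)  # dict mapping language code to probability
    return probs
```

For example, top_languages(detect_language_probs("audio.mp3")) would list the three most likely languages. A low top probability is a hint that passing language= explicitly may be worthwhile.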

Step 5: Get Timestamps and Segments

Whisper returns detailed segment information with timestamps:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Print full transcription
print("Full Text:")
print(result["text"])

# Print segments with timestamps
print("\nSegments with Timestamps:")
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {text}")

Example output:

Full Text:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.

Segments with Timestamps:
[0.00s - 2.50s] Hello everyone, welcome to today's meeting.
[2.50s - 5.80s] We will discuss the project timeline.
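
Because each segment is a plain dictionary with "start", "end", and "text" keys, you can slice a transcript by time without any extra libraries. Here is a small sketch; segments_between is a hypothetical helper name, and the sample data mirrors the example output above.

```python
def segments_between(segments, start_s, end_s):
    """Return the segments that overlap the time window [start_s, end_s)."""
    return [
        seg for seg in segments
        if seg["start"] < end_s and seg["end"] > start_s
    ]


# Sample data shaped like result["segments"] from the example above.
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone, welcome to today's meeting."},
    {"start": 2.5, "end": 5.8, "text": " We will discuss the project timeline."},
]
for seg in segments_between(sample, 2.5, 4.0):
    print(seg["text"].strip())
```

With the sample data, only the second segment overlaps the 2.5 to 4.0 second window.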

Step 6: Translate Audio to English

Whisper can translate non-English speech directly into English:

import whisper

model = whisper.load_model("base")

# Translate to English
result = model.transcribe("spanish_audio.mp3", task="translate")

print("Translated text:")
print(result["text"])

Step 7: Process Multiple Audio Files

Here is how to transcribe several files in a batch:

import whisper
import os
from pathlib import Path

def batch_transcribe(audio_directory, model_size="base", output_dir="transcriptions"):
    """
    Transcribe all audio files in a directory.
    
    Args:
        audio_directory (str): Directory containing audio files
        model_size (str): Whisper model size
        output_dir (str): Directory to save transcriptions
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model once
    model = whisper.load_model(model_size)
    
    # Supported audio formats
    audio_extensions = ['.mp3', '.wav', '.m4a', '.flac', '.ogg']
    
    # Process each audio file
    audio_files = [
        f for f in os.listdir(audio_directory)
        if any(f.lower().endswith(ext) for ext in audio_extensions)
    ]
    
    for audio_file in audio_files:
        audio_path = os.path.join(audio_directory, audio_file)
        print(f"\nProcessing: {audio_file}")
        
        try:
            result = model.transcribe(audio_path)
            
            # Save transcription to file
            output_file = os.path.join(
                output_dir,
                Path(audio_file).stem + ".txt"
            )
            
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(result["text"])
            
            print(f"✓ Saved: {output_file}")
            
        except Exception as e:
            print(f"✗ Error processing {audio_file}: {str(e)}")

# Example usage
batch_transcribe("audio_files/", model_size="base")
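
For large folders, it can be useful to search subdirectories as well and to skip files that already have a transcript, so an interrupted batch can resume. The following is one way to sketch that with pathlib; pending_audio_files and AUDIO_EXTENSIONS are illustrative names, not part of Whisper.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def pending_audio_files(audio_directory, output_dir):
    """Yield (audio_path, transcript_path) pairs that still need transcribing."""
    audio_root = Path(audio_directory)
    out_root = Path(output_dir)
    for path in sorted(audio_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        # Mirror the input folder layout so that files with the same name
        # in different folders do not overwrite each other.
        target = (out_root / path.relative_to(audio_root)).with_suffix(".txt")
        if not target.exists():
            yield path, target
```

Inside the batch loop you would create target.parent with mkdir(parents=True, exist_ok=True) before writing result["text"] to target.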

Step 8: Export to SRT Subtitle Format

Create an SRT subtitle file from the transcription result:

import whisper

def transcribe_to_srt(audio_path, output_path, model_size="base"):
    """
    Transcribe audio and save as SRT subtitle file.
    
    Args:
        audio_path (str): Path to audio file
        output_path (str): Path to save SRT file
        model_size (str): Whisper model size
    """
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    
    # Generate SRT content
    srt_content = ""
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_timestamp(segment["start"])
        end_time = format_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_content += f"{i}\n"
        srt_content += f"{start_time} --> {end_time}\n"
        srt_content += f"{text}\n\n"
    
    # Save SRT file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(srt_content)
    
    print(f"SRT file saved: {output_path}")

def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Example usage
transcribe_to_srt("video.mp4", "subtitles.srt", model_size="base")
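
One detail worth knowing: format_timestamp above computes milliseconds with int((seconds % 1) * 1000), which truncates, so floating-point error can drop a millisecond (a value such as 1.001 can come out as ,000). If that matters for your subtitles, a variant that rounds to whole milliseconds first is sketched below. The name format_srt_timestamp is an assumption chosen to avoid clashing with the function above.

```python
def format_srt_timestamp(seconds):
    """Convert seconds to HH:MM:SS,mmm, rounding to the nearest millisecond."""
    total_ms = max(0, round(float(seconds) * 1000))  # work in integer milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


print(format_srt_timestamp(3661.5))  # 01:01:01,500
```

Working in integer milliseconds keeps all of the arithmetic exact after the single rounding step.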

Whisper Model Size Comparison

Choose the model size that fits your needs.

Model	Parameters	Speed	Accuracy	Memory	Use case
tiny	39M	⭐⭐⭐⭐⭐	⭐⭐	~1GB	Quick tests, simple audio
base	74M	⭐⭐⭐⭐	⭐⭐⭐	~1GB	General use
small	244M	⭐⭐⭐	⭐⭐⭐⭐	~2GB	Balanced
medium	769M	⭐⭐	⭐⭐⭐⭐⭐	~5GB	High accuracy required
large	1550M	⭐	⭐⭐⭐⭐⭐⭐	~10GB	Best accuracy, noisy environments
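
If you would rather pick a model programmatically, the memory column of the table can be turned into a simple lookup. This is a rough sketch only: the figures are the approximate values from the table, largest_model_that_fits is a made-up helper, and the 1 GB headroom default is an assumption, not a measured requirement.

```python
# Approximate memory needs in GB, copied from the comparison table above.
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(available_gb, headroom_gb=1.0):
    """Return the largest model whose approximate memory need fits, or None."""
    budget = available_gb - headroom_gb
    fitting = [name for name, need in MODEL_MEMORY_GB.items() if need <= budget]
    # Dicts keep insertion order, so the last fitting entry is the largest model.
    return fitting[-1] if fitting else None


print(largest_model_that_fits(8))  # medium
```

Actual memory use depends on audio length, precision, and hardware, so treat the result as a starting point.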

Whisper Python Best Practices

1. Choose the Right Model Size

# Fast and lightweight
model = whisper.load_model("tiny")  # Good for testing

# Balanced
model = whisper.load_model("base")  # Good for most cases

# High accuracy
model = whisper.load_model("medium")  # For important transcriptions

2. Handle Long Audio Files

For very long audio, consider splitting it into chunks. This example uses the pydub package (pip install pydub):

import os

import whisper
from pydub import AudioSegment

def transcribe_long_audio(audio_path, chunk_length_ms=60000):
    """
    Transcribe long audio by splitting into chunks.
    
    Args:
        audio_path: Path to audio file
        chunk_length_ms: Length of each chunk in milliseconds
    """
    model = whisper.load_model("base")
    
    # Load audio
    audio = AudioSegment.from_file(audio_path)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(audio), chunk_length_ms):
        chunks.append(audio[i:i + chunk_length_ms])
    
    # Transcribe each chunk
    full_text = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        result = model.transcribe(chunk_path)
        full_text.append(result["text"])
        
        # Clean up chunk file
        os.remove(chunk_path)
    
    return " ".join(full_text)
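
The function above joins only the text, so segment timestamps from each chunk are lost (and each chunk's timestamps restart at zero). If you need timestamps across the whole recording, one approach is to shift each chunk's segments by the chunk's start offset. The sketch below assumes chunk_results is a list of Whisper result dictionaries in chunk order and that every chunk except the last has exactly chunk_length_ms of audio, as in the slicing loop above; merge_chunk_segments is a hypothetical helper name.

```python
def merge_chunk_segments(chunk_results, chunk_length_ms=60000):
    """Combine per-chunk Whisper results into one list with absolute timestamps."""
    merged = []
    for index, result in enumerate(chunk_results):
        offset_s = index * chunk_length_ms / 1000  # where this chunk starts, in seconds
        for seg in result["segments"]:
            merged.append({
                "start": seg["start"] + offset_s,
                "end": seg["end"] + offset_s,
                "text": seg["text"].strip(),
            })
    return merged
```

Keep in mind that fixed-length cuts can split a word at a chunk boundary, so text near the boundaries may be less accurate than in a single pass.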

3. Use a GPU for Faster Processing

If you have an NVIDIA GPU:

import whisper

# load_model uses the GPU by default when CUDA is available;
# passing device="cuda" requests it explicitly and fails if CUDA is missing
model = whisper.load_model("base", device="cuda")
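
If the same script has to run on machines with and without a GPU, you can choose the device at runtime. Here is a minimal sketch; pick_device is an illustrative helper, and it falls back to "cpu" when PyTorch is not importable.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

You would then call whisper.load_model("base", device=pick_device()). On CPU, Whisper warns that FP16 is not supported and falls back to FP32; passing fp16=False to transcribe avoids the warning.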

4. Specify the Language to Improve Accuracy

# If you know the language, specify it
result = model.transcribe("audio.mp3", language="en")

Common Use Cases

Podcast Transcription

import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3")

# Save transcript
with open("podcast_transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

Meeting Notes

import whisper
from datetime import datetime

model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")

# Create formatted meeting notes
notes = f"""
Meeting Notes - {datetime.now().strftime('%Y-%m-%d')}
========================================

{result["text"]}
"""

with open("meeting_notes.txt", "w", encoding="utf-8") as f:
    f.write(notes)
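
Meeting notes are often easier to scan when each line carries a timestamp. Since result["segments"] already contains start times, a small formatter can produce that layout. This is a sketch under the assumption that a "[MM:SS] text" line per segment is what you want; format_notes is a made-up helper name.

```python
def format_notes(segments, title="Meeting Notes"):
    """Render Whisper segments as one "[MM:SS] text" line per segment."""
    lines = [title, "=" * len(title), ""]
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines) + "\n"
```

You could write format_notes(result["segments"]) to meeting_notes.txt instead of the plain text block above.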

Video Subtitles

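import whisper

def format_vtt_timestamp(seconds):
    """Convert seconds to VTT timestamp format (HH:MM:SS.mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

# Generate VTT subtitle file
vtt_content = "WEBVTT\n\n"
for segment in result["segments"]:
    start = format_vtt_timestamp(segment["start"])
    end = format_vtt_timestamp(segment["end"])
    text = segment["text"].strip()
    vtt_content += f"{start} --> {end}\n{text}\n\n"

with open("subtitles.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)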

Troubleshooting Common Issues

Issue 1: FFmpeg Not Found

Error: FileNotFoundError: ffmpeg

Solution:

# Install FFmpeg
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
# Download from ffmpeg.org and add to PATH

Issue 2: Out of Memory

Error: RuntimeError: CUDA out of memory

Solution:

# Use a smaller model
model = whisper.load_model("tiny")  # Instead of "large"

# Or use CPU
model = whisper.load_model("base", device="cpu")
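
The same idea can be automated: try the preferred model first and step down to smaller ones when loading runs out of memory. The following is a hedged sketch, not a guaranteed recovery strategy. load_with_fallback and its loader parameter are assumptions introduced here so the logic can be exercised without a GPU, and the check relies on the error message containing "out of memory", as in the error shown above.

```python
def load_with_fallback(sizes=("medium", "small", "base", "tiny"), loader=None):
    """Try each model size in order; return (size, model) for the first that loads."""
    if loader is None:
        import whisper  # lazy import so the helper can be tested with a fake loader
        loader = whisper.load_model

    last_error = None
    for size in sizes:
        try:
            return size, loader(size)
        except RuntimeError as exc:
            # Only swallow out-of-memory errors; re-raise anything else unchanged.
            if "out of memory" not in str(exc).lower():
                raise
            last_error = exc
    raise RuntimeError("No model size fit in memory") from last_error
```

Note that out-of-memory errors can also occur later, during transcribe, which this sketch does not cover.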

Issue 3: Slow Processing

Solution:

Use a smaller model (tiny or base)
Use GPU acceleration
Process audio in chunks
Use multiprocessing for batch jobs

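Performance Tips

Use a GPU when possible: it can be 10-50x faster than CPU
Pick an appropriate model size: avoid "large" for simple tasks
Preprocess audio: remove silence, normalize volume
Batch processing: load the model only once
Threading: useful for I/O-bound work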
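
The "load the model only once" tip can be enforced with a small cache, so helper functions scattered across a codebase share one loaded model per size. This is a sketch only: ModelCache is a hypothetical class, and its loader argument is an assumption that lets you swap in a stand-in loader for tests.

```python
class ModelCache:
    """Keep one loaded model per size so repeated calls reuse it."""

    def __init__(self, loader=None):
        self._loader = loader
        self._models = {}

    def get(self, size="base"):
        if size not in self._models:
            loader = self._loader
            if loader is None:
                import whisper  # lazy import; only needed when no loader is injected
                loader = whisper.load_model
            self._models[size] = loader(size)
        return self._models[size]
```

A module-level instance (for example, models = ModelCache()) lets every caller use models.get("base") without paying the load cost again.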

Whisper Python vs. Other Solutions

Feature	Whisper Python	Google Speech-to-Text	AssemblyAI
Cost	Free (local)	Per-minute billing	Per-minute billing
Offline	✅	❌	❌
Accuracy	High	High	High
Setup	Moderate	Easy	Easy
Long audio	✅	✅	✅
Multilingual	✅	✅	✅

Complete Example: Production-Ready Script

Here is a complete example that you can use in production:

#!/usr/bin/env python3
"""
Production-ready Whisper transcription script.
"""

import whisper
import argparse
import os
import json
from pathlib import Path
from datetime import datetime

def transcribe_file(
    audio_path,
    model_size="base",
    language=None,
    output_format="txt",
    output_dir=None
):
    """
    Transcribe an audio file with comprehensive output options.
    
    Args:
        audio_path: Path to audio file
        model_size: Whisper model size
        language: Language code (optional, auto-detected if None)
        output_format: Output format (txt, json, srt, vtt)
        output_dir: Output directory (default: same as audio file)
    """
    # Validate input file
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    
    # Set output directory (fall back to the current directory for bare filenames)
    if output_dir is None:
        output_dir = os.path.dirname(audio_path) or "."
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)
    
    # Transcribe
    print(f"Transcribing: {audio_path}")
    transcribe_kwargs = {}
    if language:
        transcribe_kwargs["language"] = language
    
    result = model.transcribe(audio_path, **transcribe_kwargs)
    
    # Generate output filename
    base_name = Path(audio_path).stem
    output_path = os.path.join(output_dir, base_name)
    
    # Save based on format
    if output_format == "txt":
        with open(f"{output_path}.txt", "w", encoding="utf-8") as f:
            f.write(result["text"])
    
    elif output_format == "json":
        with open(f"{output_path}.json", "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)
    
    elif output_format == "srt":
        srt_content = generate_srt(result["segments"])
        with open(f"{output_path}.srt", "w", encoding="utf-8") as f:
            f.write(srt_content)
    
    elif output_format == "vtt":
        vtt_content = generate_vtt(result["segments"])
        with open(f"{output_path}.vtt", "w", encoding="utf-8") as f:
            f.write(vtt_content)
    
    print(f"✓ Transcription saved: {output_path}.{output_format}")
    print(f"  Language: {result['language']}")
    # Silent or empty audio can yield no segments, so guard the duration lookup
    if result["segments"]:
        print(f"  Duration: {result['segments'][-1]['end']:.2f}s")
    
    return result

def generate_srt(segments):
    """Generate SRT subtitle content."""
    srt = ""
    for i, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt

def generate_vtt(segments):
    """Generate VTT subtitle content."""
    vtt = "WEBVTT\n\n"
    for segment in segments:
        start = format_vtt_timestamp(segment["start"])
        end = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt += f"{start} --> {end}\n{text}\n\n"
    return vtt

def format_timestamp(seconds):
    """Format timestamp for SRT."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_vtt_timestamp(seconds):
    """Format timestamp for VTT."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using OpenAI Whisper"
    )
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size"
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language code (e.g., 'en', 'zh', 'es')"
    )
    parser.add_argument(
        "--output-format",
        default="txt",
        choices=["txt", "json", "srt", "vtt"],
        help="Output format"
    )
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Output directory"
    )
    
    args = parser.parse_args()
    
    transcribe_file(
        args.audio,
        model_size=args.model,
        language=args.language,
        output_format=args.output_format,
        output_dir=args.output_dir
    )

if __name__ == "__main__":
    main()

Usage:

# Basic usage
python transcribe.py audio.mp3

# With options
python transcribe.py audio.mp3 --model medium --language en --output-format srt

# Save to specific directory
python transcribe.py audio.mp3 --output-dir ./transcriptions

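Conclusion

This Whisper Python example guide covers what you need to get started with speech-to-text using OpenAI Whisper. Whether you are working on podcasts, meetings, or subtitles, Whisper is a powerful, free solution for turning audio into text.

Key takeaways:

Whisper is free and open source
It supports 99+ languages
It works offline (no API calls required)
High accuracy for most use cases
Easy to integrate into Python projects

For production environments that need real-time transcription or API access, consider a cloud solution that offers a Whisper-based API, such as SayToWords.

Ready to get started? Install Whisper and transcribe your first audio file today.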
