
How to Remove Background Noise for STT: A Complete Guide to Noise Reduction for Speech-to-Text
Eric King
Background noise is one of the most common challenges when transcribing audio recordings. Whether it is traffic, keyboard clatter, air conditioning, or crowd chatter, removing background noise before running speech-to-text (STT) can significantly improve transcription accuracy.
This guide covers practical methods for removing background noise for STT, from simple software solutions to advanced audio processing techniques.
Why Remove Background Noise for STT?
Background noise degrades speech-to-text accuracy in several ways:
- Lower signal-to-noise ratio (SNR), making speech harder for the model to distinguish
- Frequency masking, where noise overlaps with speech frequencies
- Model confusion, especially when noise patterns resemble speech
- Lower confidence scores, leading to more transcription errors
- Longer processing times, since noisy input is harder for models to handle
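The SNR figure mentioned above is conventionally measured in decibels as the ratio of signal power to noise power:

$$\text{SNR}_{\text{dB}} = 10 \log_{10}\left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)$$

For example, speech at 100 times the noise power gives 20 dB, and each doubling of the noise power costs about 3 dB.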
Benefits of noise reduction:
- ✅ Higher transcription accuracy (typically a 10-30% improvement)
- ✅ Better word recognition, especially for technical terms
- ✅ Faster processing thanks to cleaner audio
- ✅ More reliable timestamps and segmentation
- ✅ Better handling of quiet speech
Understanding Types of Background Noise
Different noise types call for different removal strategies:
1. Constant noise (stationary)
- Examples: air conditioning, fan hum, electrical hum, white noise
- Characteristics: relatively stable frequency and amplitude
- Removal: easiest to remove with spectral subtraction or filtering
2. Variable noise (non-stationary)
- Examples: traffic, crowd chatter, keyboard typing, paper rustling
- Characteristics: changes over time with unpredictable patterns
- Removal: requires more advanced techniques such as deep learning models
3. Impulsive noise
- Examples: clicks, pops, door slams, phone rings
- Characteristics: short, sudden bursts
- Removal: requires detection followed by replacement/interpolation
4. Periodic noise
- Examples: beeps, alarms, repetitive sounds
- Characteristics: regular patterns at specific frequencies
- Removal: can be filtered out with notch filters
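As a minimal sketch of the notch-filter approach to periodic noise (the frequencies and Q factor here are illustrative), a 60 Hz mains hum could be removed with SciPy like this:

```python
import numpy as np
from scipy import signal

def notch_filter(audio, sample_rate, notch_freq=60.0, quality=30.0):
    """Remove a narrow-band periodic noise component (e.g., mains hum)."""
    # Design a second-order IIR notch filter centered on notch_freq
    b, a = signal.iirnotch(notch_freq, quality, fs=sample_rate)
    # Zero-phase filtering avoids adding phase distortion
    return signal.filtfilt(b, a, audio)

# Example: a 440 Hz tone buried under 60 Hz hum
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
hum = 0.5 * np.sin(2 * np.pi * 60 * t)
cleaned = notch_filter(tone + hum, sr)
```

Because the notch is narrow (bandwidth ≈ notch_freq / quality, here 2 Hz), speech frequencies well away from the notch pass through essentially untouched.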
Method 1: Audio Editing Software
Audacity (free, open source)
Audacity is a powerful free audio editor with built-in noise reduction:
Steps:
- Open your audio file in Audacity
- Select a segment containing only noise (no speech)
- Go to Effect → Noise Reduction
- Click Get Noise Profile
- Select the entire track
- Go to Effect → Noise Reduction again
- Adjust the settings:
- Noise reduction (dB): 12-24 dB (start with 15)
- Sensitivity: 6.0 (default)
- Frequency smoothing (bands): 3 (default)
- Click OK to apply
Best practices:
- Use a 0.5-2 second noise sample
- Choose a representative noise segment
- Start with moderate settings and increase only if needed
- Preview before applying to the whole track
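The same "capture a noise profile, then subtract it" workflow can be scripted. Below is a minimal spectral-subtraction sketch (not Audacity's exact algorithm), assuming the first half second of the recording contains only noise:

```python
import numpy as np
from scipy import signal

def spectral_subtract(audio, sample_rate, noise_seconds=0.5, reduction=1.0):
    """Subtract an averaged noise spectrum estimated from a noise-only lead-in."""
    # STFT of the full recording (hop = nperseg // 2 by default)
    f, t, stft = signal.stft(audio, fs=sample_rate, nperseg=512)
    magnitude = np.abs(stft)
    phase = np.angle(stft)
    # Noise profile: average magnitude over the noise-only frames
    noise_frames = int(noise_seconds * sample_rate / (512 // 2))
    noise_profile = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the profile, clamping at zero to avoid negative magnitudes
    cleaned_mag = np.maximum(magnitude - reduction * noise_profile, 0.0)
    # Reconstruct with the original phase
    _, cleaned = signal.istft(cleaned_mag * np.exp(1j * phase), fs=sample_rate)
    return cleaned

# Example: two seconds of synthetic white noise (stand-in for a noisy recording)
rng = np.random.default_rng(0)
sr = 16000
noisy = 0.1 * rng.standard_normal(2 * sr)
cleaned = spectral_subtract(noisy, sr)
```

The `reduction` parameter plays the same role as Audacity's reduction strength: values below 1.0 subtract only part of the noise profile, which is gentler on speech.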
Adobe Audition
Adobe Audition offers professional-grade noise reduction:
- Open the audio file
- Select a noise-only segment
- Go to Effects → Noise Reduction/Restoration → Capture Noise Print
- Select the entire track
- Go to Effects → Noise Reduction/Restoration → Noise Reduction (process)
- Adjust:
- Noise Reduction: 40-80% (start with 60%)
- Reduce by: 6-12 dB
- High Frequency Transition: 4000-8000 Hz
- Click Apply
Method 2: Python Audio Processing Libraries
Using the noisereduce Library
The noisereduce library provides easy-to-use noise reduction:

```python
import noisereduce as nr
import soundfile as sf

# Load audio file
audio_data, sample_rate = sf.read("noisy_audio.wav")

# Method 1: Stationary noise reduction (for constant noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=True,
    prop_decrease=0.8  # Reduce noise by 80%
)

# Method 2: Non-stationary noise reduction (for variable noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=False,
    prop_decrease=0.8
)

# Save cleaned audio
sf.write("cleaned_audio.wav", reduced_noise, sample_rate)
```

Installation:

```bash
pip install noisereduce soundfile
```
Spectral Gating with librosa

```python
import librosa
import numpy as np
import soundfile as sf

def spectral_gate(audio_path, threshold_db=-40):
    """Remove noise using spectral gating."""
    # Load audio
    y, sr = librosa.load(audio_path, sr=None)
    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y)
    magnitude = np.abs(stft)
    phase = np.angle(stft)
    # Convert to dB
    magnitude_db = librosa.amplitude_to_db(magnitude)
    # Apply threshold (silence frequencies below threshold)
    magnitude_db_cleaned = np.where(
        magnitude_db > threshold_db,
        magnitude_db,
        -80  # Silence very quiet parts
    )
    # Convert back to linear scale
    magnitude_cleaned = librosa.db_to_amplitude(magnitude_db_cleaned)
    # Reconstruct audio
    stft_cleaned = magnitude_cleaned * np.exp(1j * phase)
    y_cleaned = librosa.istft(stft_cleaned)
    return y_cleaned, sr

# Usage
cleaned_audio, sample_rate = spectral_gate("noisy_audio.wav", threshold_db=-35)
sf.write("cleaned_audio.wav", cleaned_audio, sample_rate)
```
High-Pass Filtering with scipy
Remove low-frequency noise such as rumble and wind:

```python
from scipy import signal
import soundfile as sf

def high_pass_filter(audio_path, cutoff_freq=80):
    """Remove low-frequency noise with a high-pass filter."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    # Design a 4th-order high-pass Butterworth filter
    nyquist = sample_rate / 2
    normalized_cutoff = cutoff_freq / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    # Apply the filter (zero-phase)
    filtered_audio = signal.filtfilt(b, a, audio_data)
    return filtered_audio, sample_rate

# Usage
cleaned_audio, sr = high_pass_filter("noisy_audio.wav", cutoff_freq=100)
sf.write("cleaned_audio.wav", cleaned_audio, sr)
```
Method 3: Deep Learning-Based Noise Reduction
Using RNNoise
RNNoise is a deep learning model built specifically for noise suppression. Note that Python bindings for RNNoise vary by package; the snippet below assumes a wrapper exposing an RNNoise class with a process() method:

```python
import rnnoise
import numpy as np
import soundfile as sf

def rnnoise_denoise(audio_path):
    """Remove noise using the RNNoise model."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    # RNNoise operates on 48 kHz mono audio (480-sample frames = 10 ms)
    if sample_rate != 48000:
        import librosa
        audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=48000)
        sample_rate = 48000
    # Convert to mono if stereo
    if len(audio_data.shape) > 1:
        audio_data = np.mean(audio_data, axis=1)
    # Process in chunks (RNNoise processes 480 samples at a time)
    chunk_size = 480
    denoised_audio = []
    denoiser = rnnoise.RNNoise()
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)))
        denoised_chunk = denoiser.process(chunk)
        denoised_audio.extend(denoised_chunk)
    return np.array(denoised_audio), sample_rate

# Usage
cleaned_audio, sr = rnnoise_denoise("noisy_audio.wav")
sf.write("cleaned_audio.wav", cleaned_audio, sr)
```

Installation (the exact package name depends on which wrapper you choose):

```bash
pip install rnnoise
```
Using Facebook's Demucs
Demucs can separate speech from background noise:

```python
import soundfile as sf
import torch
from demucs.pretrained import get_model
from demucs.audio import AudioFile

def demucs_separation(audio_path):
    """Separate speech from noise using Demucs."""
    # Load pre-trained model
    model = get_model('htdemucs')
    model.eval()
    # Load audio (AudioFile.read returns a torch tensor)
    wav = AudioFile(audio_path).read(
        streams=0, samplerate=model.samplerate, channels=model.audio_channels
    )
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    # Separate sources
    with torch.no_grad():
        sources = model(wav[None])
    sources = sources * ref.std() + ref.mean()
    # Extract vocals (speech) - for htdemucs the source order is
    # drums, bass, other, vocals, so vocals are at index 3
    speech = sources[0, 3].mean(0).cpu().numpy()
    return speech, model.samplerate

# Usage
speech_audio, sr = demucs_separation("noisy_audio.wav")
sf.write("speech_only.wav", speech_audio, sr)
```
Method 4: Online Noise Reduction Tools
1. Audacity Online (cloud version)
- Free, browser-based
- Good for quick jobs
- Stricter file size limits
2. Adobe Podcast Enhance
- AI-powered noise reduction
- Limited free tier
- Excellent results on speech
3. Krisp.ai
- Real-time noise suppression
- Offers an API for integration
- Good for live audio
4. Cleanvoice.ai
- Automatic noise removal
- Handles many noise types
- Supports batch processing
Complete Workflow: Preprocessing Audio for STT
Here is a complete Python script that combines several techniques:

```python
import librosa
import noisereduce as nr
import soundfile as sf
from scipy import signal
import numpy as np

def preprocess_audio_for_stt(audio_path, output_path):
    """Complete audio preprocessing pipeline for STT."""
    # Step 1: Load audio
    print("Loading audio...")
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    # Step 2: Remove DC offset
    print("Removing DC offset...")
    y = y - np.mean(y)
    # Step 3: High-pass filter (remove low-frequency noise)
    print("Applying high-pass filter...")
    nyquist = sr / 2
    normalized_cutoff = 80 / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    y = signal.filtfilt(b, a, y)
    # Step 4: Normalize volume
    print("Normalizing volume...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95  # Normalize to 95% to avoid clipping
    # Step 5: Noise reduction
    print("Reducing noise...")
    y = nr.reduce_noise(
        y=y,
        sr=sr,
        stationary=False,  # Use non-stationary for variable noise
        prop_decrease=0.8  # Reduce noise by 80%
    )
    # Step 6: Final normalization
    print("Final normalization...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95
    # Step 7: Save processed audio
    print(f"Saving to {output_path}...")
    sf.write(output_path, y, sr)
    print("Preprocessing complete!")
    return y, sr

# Usage
preprocess_audio_for_stt("noisy_recording.wav", "cleaned_for_stt.wav")
```
Noise Reduction Best Practices
1. Choose the right method
- Constant noise: use spectral subtraction or stationary noise suppression
- Variable noise: use non-stationary suppression or deep learning models
- Impulsive noise: use click removal or interpolation
- Mixed noise: combine multiple techniques
2. Preserve speech quality
- Don't over-process (it introduces artifacts)
- Use moderate reduction settings (60-80%)
- Preserve the voice frequency range (80-8000 Hz)
- Keep speech sounding natural
3. Test and iterate
- Always preview before applying to the whole track
- Compare the original and processed audio
- Test transcription accuracy on both versions
- Tune parameters based on the results
4. Consider your STT model
- Some models (such as Whisper) are already quite noise-robust
- Preprocessing is not always necessary
- Run comparisons with and without preprocessing
- Larger models are usually more noise-tolerant
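To make the with/without comparison concrete, word error rate (WER) is the standard metric. Libraries such as jiwer compute it for you; here is a minimal self-contained sketch (the transcript strings are hypothetical examples):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Compare transcripts of raw vs. denoised audio against a known reference
reference = "remove background noise before transcription"
raw_wer = word_error_rate(reference, "remove background voice before transcription")
clean_wer = word_error_rate(reference, "remove background noise before transcription")
```

If the denoised version does not lower the WER on your material, skip the preprocessing step.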
Common Mistakes to Avoid
❌ Over-aggressive noise reduction
- Can delete speech frequencies
- Introduces artifacts and distortion
- Makes speech sound robotic
❌ Removing too much low end
- Can remove important speech components
- Makes speech sound thin or tinny
- Hurts naturalness
❌ Not testing with your STT model
- Preprocessing doesn't always improve accuracy
- Some models do better on raw audio
- Always A/B test
❌ Ignoring the audio format
- Use an appropriate sample rate (16 kHz recommended)
- Use lossless formats where possible
- Avoid re-compressing already compressed audio
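As a minimal sketch of getting audio into the recommended format (16 kHz mono) without a lossy round trip, using only SciPy's polyphase resampler:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k_mono(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz for STT."""
    # Downmix stereo to mono by averaging channels
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sample_rate == 16000:
        return audio
    # Polyphase resampling by the rational factor 16000 / sample_rate
    g = np.gcd(16000, sample_rate)
    return resample_poly(audio, 16000 // g, sample_rate // g)

# Example: one second of 44.1 kHz stereo becomes one second at 16 kHz
stereo = np.zeros((44100, 2))
mono16k = to_16k_mono(stereo, 44100)
```

Saving the result as WAV or FLAC (e.g., with soundfile) avoids adding a second round of lossy compression.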
Integrating with Speech-to-Text Systems
Using with OpenAI Whisper

```python
import whisper
import noisereduce as nr
import soundfile as sf

def transcribe_with_noise_reduction(audio_path):
    """Transcribe audio with noise reduction preprocessing."""
    # Step 1: Reduce noise
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.75
    )
    # Save a temporary cleaned file
    temp_path = "temp_cleaned.wav"
    sf.write(temp_path, cleaned_audio, sr)
    # Step 2: Transcribe with Whisper
    model = whisper.load_model("base")
    result = model.transcribe(temp_path)
    # Clean up
    import os
    os.remove(temp_path)
    return result["text"]

# Usage
transcription = transcribe_with_noise_reduction("noisy_audio.wav")
print(transcription)
```
Using with the SayToWords API

```python
import requests
import noisereduce as nr
import soundfile as sf

def transcribe_with_saytowords(audio_path):
    """Preprocess and transcribe with SayToWords."""
    # Preprocess audio
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.8
    )
    # Save cleaned audio
    cleaned_path = "cleaned_for_api.wav"
    sf.write(cleaned_path, cleaned_audio, sr)
    # Upload and transcribe
    with open(cleaned_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(
            'https://api.saytowords.com/transcribe',
            files=files,
            headers={'Authorization': 'Bearer YOUR_API_KEY'}
        )
    return response.json()
```
Measuring Noise Reduction Effectiveness
Before/After Comparison

```python
import librosa
import numpy as np

def measure_snr(audio_path):
    """Estimate the signal-to-noise ratio."""
    y, sr = librosa.load(audio_path, sr=None)
    # Simple SNR estimate: overall power vs. an estimated noise floor
    signal_power = np.mean(y ** 2)
    noise_floor = np.percentile(np.abs(y), 10) ** 2
    snr_db = 10 * np.log10(signal_power / noise_floor) if noise_floor > 0 else 0
    return snr_db

# Compare before and after
original_snr = measure_snr("noisy_audio.wav")
cleaned_snr = measure_snr("cleaned_audio.wav")
print(f"Original SNR: {original_snr:.2f} dB")
print(f"Cleaned SNR: {cleaned_snr:.2f} dB")
print(f"Improvement: {cleaned_snr - original_snr:.2f} dB")
```
Conclusion
Removing background noise before speech-to-text processing can significantly improve transcription accuracy. The best approach depends on:
- Noise type (constant vs. variable)
- Audio quality (sample rate, bit depth)
- Available tools (software vs. programmatic)
- Your STT model (some are more noise-robust than others)
Quick recommendations:
- For quick jobs: use Audacity or an online tool
- For automation: use Python libraries such as noisereduce
- For best results: combine multiple techniques
- For production: test with your specific STT model
Remember: not all audio needs preprocessing. Modern STT models such as Whisper are already fairly robust to noise. Always compare the original and processed audio to confirm which works better for your specific use case.
Additional Resources
Need help reducing noise in your specific audio? Try SayToWords Speech-to-Text, which has built-in noise handling and preprocessing options.