
How to Remove Background Noise for STT: A Complete Guide to Noise Reduction for Speech-to-Text
Eric King
Background noise is one of the most common challenges when transcribing audio recordings. Whether it is traffic, keyboard clatter, air conditioning, or crowd chatter, removing background noise before running speech-to-text (STT) can significantly improve transcription accuracy.
This guide covers practical methods for removing background noise for STT, from simple software solutions to advanced audio processing techniques.
Why Remove Background Noise for STT?
Background noise degrades speech-to-text accuracy in several ways:
- Lower signal-to-noise ratio (SNR), making speech harder for the model to distinguish
- Frequency masking, where noise overlaps with speech frequencies
- Model confusion, especially when noise patterns resemble speech
- Lower confidence scores, leading to more transcription errors
- Longer processing times, since noisy input is harder for models to handle
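The SNR figure mentioned above is conventionally measured in decibels as the ratio of signal power to noise power:

$$\text{SNR}_{\text{dB}} = 10 \log_{10}\left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)$$

For example, speech at 100 times the noise power gives 20 dB, and each doubling of the noise power costs about 3 dB.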
Benefits of noise reduction:
- ✅ Higher transcription accuracy (typically a 10-30% improvement)
- ✅ Better word recognition, especially for technical terms
- ✅ Faster processing thanks to cleaner audio
- ✅ More reliable timestamps and segmentation
- ✅ Better handling of quiet speech
Understanding Types of Background Noise
Different noise types call for different removal strategies:
1. Constant noise (stationary)
- Examples: air conditioning, fan hum, electrical hum, white noise
- Characteristics: relatively stable frequency and amplitude
- Removal: easiest to remove with spectral subtraction or filtering
2. Variable noise (non-stationary)
- Examples: traffic, crowd chatter, keyboard typing, paper rustling
- Characteristics: changes over time with unpredictable patterns
- Removal: requires more advanced techniques such as deep learning models
3. Impulsive noise
- Examples: clicks, pops, door slams, phone rings
- Characteristics: short, sudden bursts
- Removal: requires detection followed by replacement/interpolation
4. Periodic noise
- Examples: beeps, alarms, repetitive sounds
- Characteristics: regular patterns at specific frequencies
- Removal: can be filtered out with notch filters
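As a minimal sketch of the notch-filter approach to periodic noise (the frequencies and Q factor here are illustrative), a 60 Hz mains hum could be removed with SciPy like this:

```python
import numpy as np
from scipy import signal

def notch_filter(audio, sample_rate, notch_freq=60.0, quality=30.0):
    """Remove a narrow-band periodic noise component (e.g., mains hum)."""
    # Design a second-order IIR notch filter centered on notch_freq
    b, a = signal.iirnotch(notch_freq, quality, fs=sample_rate)
    # Zero-phase filtering avoids adding phase distortion
    return signal.filtfilt(b, a, audio)

# Example: a 440 Hz tone buried under 60 Hz hum
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
hum = 0.5 * np.sin(2 * np.pi * 60 * t)
cleaned = notch_filter(tone + hum, sr)
```

Because the notch is narrow (bandwidth ≈ notch_freq / quality, here 2 Hz), speech frequencies well away from the notch pass through essentially untouched.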
Method 1: Audio Editing Software
Audacity (free, open source)
Audacity is a powerful free audio editor with built-in noise reduction:
Steps:
- Open your audio file in Audacity
- Select a segment containing only noise (no speech)
- Go to Effect → Noise Reduction
- Click Get Noise Profile
- Select the entire track
- Go to Effect → Noise Reduction again
- Adjust the settings:
- Noise reduction (dB): 12-24 dB (start with 15)
- Sensitivity: 6.0 (default)
- Frequency smoothing (bands): 3 (default)
- Click OK to apply
Best practices:
- Use a 0.5-2 second noise sample
- Choose a representative noise segment
- Start with moderate settings and increase only if needed
- Preview before applying to the whole track
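The same "capture a noise profile, then subtract it" workflow can be scripted. Below is a minimal spectral-subtraction sketch (not Audacity's exact algorithm), assuming the first half second of the recording contains only noise:

```python
import numpy as np
from scipy import signal

def spectral_subtract(audio, sample_rate, noise_seconds=0.5, reduction=1.0):
    """Subtract an averaged noise spectrum estimated from a noise-only lead-in."""
    # STFT of the full recording (hop = nperseg // 2 by default)
    f, t, stft = signal.stft(audio, fs=sample_rate, nperseg=512)
    magnitude = np.abs(stft)
    phase = np.angle(stft)
    # Noise profile: average magnitude over the noise-only frames
    noise_frames = int(noise_seconds * sample_rate / (512 // 2))
    noise_profile = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the profile, clamping at zero to avoid negative magnitudes
    cleaned_mag = np.maximum(magnitude - reduction * noise_profile, 0.0)
    # Reconstruct with the original phase
    _, cleaned = signal.istft(cleaned_mag * np.exp(1j * phase), fs=sample_rate)
    return cleaned

# Example: two seconds of synthetic white noise (stand-in for a noisy recording)
rng = np.random.default_rng(0)
sr = 16000
noisy = 0.1 * rng.standard_normal(2 * sr)
cleaned = spectral_subtract(noisy, sr)
```

The `reduction` parameter plays the same role as Audacity's reduction strength: values below 1.0 subtract only part of the noise profile, which is gentler on speech.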
Adobe Audition
Adobe Audition offers professional-grade noise reduction:
- Open the audio file
- Select a noise-only segment
- Go to Effects → Noise Reduction/Restoration → Capture Noise Print
- Select the entire track
- Go to Effects → Noise Reduction/Restoration → Noise Reduction (process)
- Adjust:
- Noise Reduction: 40-80% (start with 60%)
- Reduce by: 6-12 dB
- High Frequency Transition: 4000-8000 Hz
- Click Apply
Method 2: Python Audio Processing Libraries
Using the noisereduce Library
The noisereduce library provides easy-to-use noise reduction:

```python
import noisereduce as nr
import soundfile as sf

# Load audio file
audio_data, sample_rate = sf.read("noisy_audio.wav")

# Method 1: Stationary noise reduction (for constant noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=True,
    prop_decrease=0.8  # Reduce noise by 80%
)

# Method 2: Non-stationary noise reduction (for variable noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=False,
    prop_decrease=0.8
)

# Save cleaned audio
sf.write("cleaned_audio.wav", reduced_noise, sample_rate)
```

Installation:

```bash
pip install noisereduce soundfile
```
Spectral Gating with librosa

```python
import librosa
import numpy as np
import soundfile as sf

def spectral_gate(audio_path, threshold_db=-40):
    """Remove noise using spectral gating."""
    # Load audio
    y, sr = librosa.load(audio_path, sr=None)
    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y)
    magnitude = np.abs(stft)
    phase = np.angle(stft)
    # Convert to dB
    magnitude_db = librosa.amplitude_to_db(magnitude)
    # Apply threshold (silence frequencies below threshold)
    magnitude_db_cleaned = np.where(
        magnitude_db > threshold_db,
        magnitude_db,
        -80  # Silence very quiet parts
    )
    # Convert back to linear scale
    magnitude_cleaned = librosa.db_to_amplitude(magnitude_db_cleaned)
    # Reconstruct audio
    stft_cleaned = magnitude_cleaned * np.exp(1j * phase)
    y_cleaned = librosa.istft(stft_cleaned)
    return y_cleaned, sr

# Usage
cleaned_audio, sample_rate = spectral_gate("noisy_audio.wav", threshold_db=-35)
sf.write("cleaned_audio.wav", cleaned_audio, sample_rate)
```
High-Pass Filtering with scipy
Remove low-frequency noise such as rumble and wind:

```python
from scipy import signal
import soundfile as sf

def high_pass_filter(audio_path, cutoff_freq=80):
    """Remove low-frequency noise with a high-pass filter."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    # Design a 4th-order high-pass Butterworth filter
    nyquist = sample_rate / 2
    normalized_cutoff = cutoff_freq / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    # Apply the filter (zero-phase)
    filtered_audio = signal.filtfilt(b, a, audio_data)
    return filtered_audio, sample_rate

# Usage
cleaned_audio, sr = high_pass_filter("noisy_audio.wav", cutoff_freq=100)
sf.write("cleaned_audio.wav", cleaned_audio, sr)
```
Method 3: Deep Learning-Based Noise Reduction
Using RNNoise
RNNoise is a deep learning model built specifically for noise suppression. Note that Python bindings for RNNoise vary by package; the snippet below assumes a wrapper exposing an RNNoise class with a process() method:

```python
import rnnoise
import numpy as np
import soundfile as sf

def rnnoise_denoise(audio_path):
    """Remove noise using the RNNoise model."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    # RNNoise operates on 48 kHz mono audio (480-sample frames = 10 ms)
    if sample_rate != 48000:
        import librosa
        audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=48000)
        sample_rate = 48000
    # Convert to mono if stereo
    if len(audio_data.shape) > 1:
        audio_data = np.mean(audio_data, axis=1)
    # Process in chunks (RNNoise processes 480 samples at a time)
    chunk_size = 480
    denoised_audio = []
    denoiser = rnnoise.RNNoise()
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)))
        denoised_chunk = denoiser.process(chunk)
        denoised_audio.extend(denoised_chunk)
    return np.array(denoised_audio), sample_rate

# Usage
cleaned_audio, sr = rnnoise_denoise("noisy_audio.wav")
sf.write("cleaned_audio.wav", cleaned_audio, sr)
```

Installation (the exact package name depends on which wrapper you choose):

```bash
pip install rnnoise
```
Using Facebook's Demucs
Demucs can separate speech from background noise:

```python
import soundfile as sf
import torch
from demucs.pretrained import get_model
from demucs.audio import AudioFile

def demucs_separation(audio_path):
    """Separate speech from noise using Demucs."""
    # Load pre-trained model
    model = get_model('htdemucs')
    model.eval()
    # Load audio (AudioFile.read returns a torch tensor)
    wav = AudioFile(audio_path).read(
        streams=0, samplerate=model.samplerate, channels=model.audio_channels
    )
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    # Separate sources
    with torch.no_grad():
        sources = model(wav[None])
    sources = sources * ref.std() + ref.mean()
    # Extract vocals (speech) - for htdemucs the source order is
    # drums, bass, other, vocals, so vocals are at index 3
    speech = sources[0, 3].mean(0).cpu().numpy()
    return speech, model.samplerate

# Usage
speech_audio, sr = demucs_separation("noisy_audio.wav")
sf.write("speech_only.wav", speech_audio, sr)
```
Method 4: Online Noise Reduction Tools
1. Audacity Online (cloud version)
- Free, browser-based
- Good for quick jobs
- Stricter file size limits
2. Adobe Podcast Enhance
- AI-powered noise reduction
- Limited free tier
- Excellent results on speech
3. Krisp.ai
- Real-time noise suppression
- Offers an API for integration
- Good for live audio
4. Cleanvoice.ai
- Automatic noise removal
- Handles many noise types
- Supports batch processing
Complete Workflow: Preprocessing Audio for STT
Here is a complete Python script that combines several techniques:

```python
import librosa
import noisereduce as nr
import soundfile as sf
from scipy import signal
import numpy as np

def preprocess_audio_for_stt(audio_path, output_path):
    """Complete audio preprocessing pipeline for STT."""
    # Step 1: Load audio
    print("Loading audio...")
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    # Step 2: Remove DC offset
    print("Removing DC offset...")
    y = y - np.mean(y)
    # Step 3: High-pass filter (remove low-frequency noise)
    print("Applying high-pass filter...")
    nyquist = sr / 2
    normalized_cutoff = 80 / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    y = signal.filtfilt(b, a, y)
    # Step 4: Normalize volume
    print("Normalizing volume...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95  # Normalize to 95% to avoid clipping
    # Step 5: Noise reduction
    print("Reducing noise...")
    y = nr.reduce_noise(
        y=y,
        sr=sr,
        stationary=False,  # Use non-stationary for variable noise
        prop_decrease=0.8  # Reduce noise by 80%
    )
    # Step 6: Final normalization
    print("Final normalization...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95
    # Step 7: Save processed audio
    print(f"Saving to {output_path}...")
    sf.write(output_path, y, sr)
    print("Preprocessing complete!")
    return y, sr

# Usage
preprocess_audio_for_stt("noisy_recording.wav", "cleaned_for_stt.wav")
```
Noise Reduction Best Practices
1. Choose the right method
- Constant noise: use spectral subtraction or stationary noise suppression
- Variable noise: use non-stationary suppression or deep learning models
- Impulsive noise: use click removal or interpolation
- Mixed noise: combine multiple techniques
2. Preserve speech quality
- Don't over-process (it introduces artifacts)
- Use moderate reduction settings (60-80%)
- Preserve the voice frequency range (80-8000 Hz)
- Keep speech sounding natural
3. Test and iterate
- Always preview before applying to the whole track
- Compare the original and processed audio
- Test transcription accuracy on both versions
- Tune parameters based on the results
4. Consider your STT model
- Some models (such as Whisper) are already quite noise-robust
- Preprocessing is not always necessary
- Run comparisons with and without preprocessing
- Larger models are usually more noise-tolerant
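To make the with/without comparison concrete, word error rate (WER) is the standard metric. Libraries such as jiwer compute it for you; here is a minimal self-contained sketch (the transcript strings are hypothetical examples):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Compare transcripts of raw vs. denoised audio against a known reference
reference = "remove background noise before transcription"
raw_wer = word_error_rate(reference, "remove background voice before transcription")
clean_wer = word_error_rate(reference, "remove background noise before transcription")
```

If the denoised version does not lower the WER on your material, skip the preprocessing step.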
Common Mistakes to Avoid
❌ Over-aggressive noise reduction
- Can delete speech frequencies
- Introduces artifacts and distortion
- Makes speech sound robotic
❌ Removing too much low end
- Can remove important speech components
- Makes speech sound thin or tinny
- Hurts naturalness
❌ Not testing with your STT model
- Preprocessing doesn't always improve accuracy
- Some models do better on raw audio
- Always A/B test
❌ Ignoring the audio format
- Use an appropriate sample rate (16 kHz recommended)
- Use lossless formats where possible
- Avoid re-compressing already compressed audio
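As a minimal sketch of getting audio into the recommended format (16 kHz mono) without a lossy round trip, using only SciPy's polyphase resampler:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k_mono(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz for STT."""
    # Downmix stereo to mono by averaging channels
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sample_rate == 16000:
        return audio
    # Polyphase resampling by the rational factor 16000 / sample_rate
    g = np.gcd(16000, sample_rate)
    return resample_poly(audio, 16000 // g, sample_rate // g)

# Example: one second of 44.1 kHz stereo becomes one second at 16 kHz
stereo = np.zeros((44100, 2))
mono16k = to_16k_mono(stereo, 44100)
```

Saving the result as WAV or FLAC (e.g., with soundfile) avoids adding a second round of lossy compression.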
Integrating with Speech-to-Text Systems
Using with OpenAI Whisper

```python
import whisper
import noisereduce as nr
import soundfile as sf

def transcribe_with_noise_reduction(audio_path):
    """Transcribe audio with noise reduction preprocessing."""
    # Step 1: Reduce noise
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.75
    )
    # Save a temporary cleaned file
    temp_path = "temp_cleaned.wav"
    sf.write(temp_path, cleaned_audio, sr)
    # Step 2: Transcribe with Whisper
    model = whisper.load_model("base")
    result = model.transcribe(temp_path)
    # Clean up
    import os
    os.remove(temp_path)
    return result["text"]

# Usage
transcription = transcribe_with_noise_reduction("noisy_audio.wav")
print(transcription)
```
Using with the SayToWords API

```python
import requests
import noisereduce as nr
import soundfile as sf

def transcribe_with_saytowords(audio_path):
    """Preprocess and transcribe with SayToWords."""
    # Preprocess audio
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.8
    )
    # Save cleaned audio
    cleaned_path = "cleaned_for_api.wav"
    sf.write(cleaned_path, cleaned_audio, sr)
    # Upload and transcribe
    with open(cleaned_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(
            'https://api.saytowords.com/transcribe',
            files=files,
            headers={'Authorization': 'Bearer YOUR_API_KEY'}
        )
    return response.json()
```
Measuring Noise Reduction Effectiveness
Before/After Comparison

```python
import librosa
import numpy as np

def measure_snr(audio_path):
    """Estimate the signal-to-noise ratio."""
    y, sr = librosa.load(audio_path, sr=None)
    # Simple SNR estimate: overall power vs. an estimated noise floor
    signal_power = np.mean(y ** 2)
    noise_floor = np.percentile(np.abs(y), 10) ** 2
    snr_db = 10 * np.log10(signal_power / noise_floor) if noise_floor > 0 else 0
    return snr_db

# Compare before and after
original_snr = measure_snr("noisy_audio.wav")
cleaned_snr = measure_snr("cleaned_audio.wav")
print(f"Original SNR: {original_snr:.2f} dB")
print(f"Cleaned SNR: {cleaned_snr:.2f} dB")
print(f"Improvement: {cleaned_snr - original_snr:.2f} dB")
```
Conclusion
Removing background noise before speech-to-text processing can significantly improve transcription accuracy. The best approach depends on:
- Noise type (constant vs. variable)
- Audio quality (sample rate, bit depth)
- Available tools (software vs. programmatic)
- Your STT model (some are more noise-robust than others)
Quick recommendations:
- For quick jobs: use Audacity or an online tool
- For automation: use Python libraries such as noisereduce
- For best results: combine multiple techniques
- For production: test with your specific STT model
Remember: not all audio needs preprocessing. Modern STT models such as Whisper are already fairly robust to noise. Always compare the original and processed audio to confirm which works better for your specific use case.
Additional Resources
Need help reducing noise in your specific audio? Try SayToWords Speech-to-Text, which has built-in noise handling and preprocessing options.