
How to Remove Background Noise for STT: Complete Guide to Noise Reduction for Speech-to-Text
Eric King
Author
Background noise is one of the most common challenges when transcribing audio recordings. Whether it's traffic sounds, keyboard typing, air conditioning, or crowd noise, removing background noise before speech-to-text processing can significantly improve transcription accuracy.
This comprehensive guide covers practical methods for removing background noise for STT, from simple software solutions to advanced audio processing techniques.
Why Remove Background Noise for STT?
Background noise negatively impacts speech-to-text accuracy in several ways:
- Reduced signal-to-noise ratio (SNR) makes it harder for models to distinguish speech
- Frequency masking where noise overlaps with speech frequencies
- Model confusion when noise patterns resemble speech
- Lower confidence scores leading to more transcription errors
- Increased processing time as models struggle with noisy input
Benefits of noise removal:
- ✅ Improved transcription accuracy (often 10-30% better)
- ✅ Better word recognition, especially for technical terms
- ✅ Faster processing with cleaner audio
- ✅ More reliable timestamps and segmentation
- ✅ Better handling of quiet speech
Understanding Background Noise Types
Different noise types require different removal strategies:
1. Constant Noise (Stationary)
- Examples: Air conditioning, fan hum, electrical hum, white noise
- Characteristics: Consistent frequency and amplitude
- Removal: Easier to remove with spectral subtraction or filtering
2. Variable Noise (Non-Stationary)
- Examples: Traffic, crowd chatter, keyboard typing, paper rustling
- Characteristics: Changes over time, unpredictable patterns
- Removal: Requires more advanced techniques like deep learning models
3. Impulsive Noise
- Examples: Clicks, pops, door slams, phone rings
- Characteristics: Short, sudden bursts
- Removal: Requires detection and replacement/interpolation
4. Periodic Noise
- Examples: Beeping, alarms, repetitive sounds
- Characteristics: Regular patterns at specific frequencies
- Removal: Can be filtered with notch filters
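For periodic noise at a known frequency, the notch-filter approach mentioned above can be sketched in a few lines of SciPy. This is a minimal illustration; the 1000 Hz beep frequency and Q factor are example values, not recommendations:

```python
import numpy as np
from scipy import signal

def notch_filter(audio_data, sample_rate, hum_freq=1000.0, q=30.0):
    """Attenuate a narrow band around hum_freq with an IIR notch filter."""
    # q controls the notch width: bandwidth is roughly hum_freq / q
    b, a = signal.iirnotch(hum_freq, q, fs=sample_rate)
    # filtfilt runs the filter forward and backward (zero phase shift)
    return signal.filtfilt(b, a, audio_data)

# Example: remove a synthetic 1000 Hz beep
sr = 16000
t = np.arange(sr) / sr
beep = np.sin(2 * np.pi * 1000 * t)
cleaned = notch_filter(beep, sr, hum_freq=1000.0)
```

For mains hum you would typically notch 50 Hz or 60 Hz (and possibly its harmonics with one notch per harmonic).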
Method 1: Using Audio Editing Software
Audacity (Free, Open Source)
Audacity is a powerful, free audio editor with built-in noise reduction:
Steps:
- Open your audio file in Audacity
- Select a section with only noise (no speech)
- Go to Effect → Noise Reduction
- Click Get Noise Profile
- Select the entire audio track
- Go to Effect → Noise Reduction again
- Adjust settings:
  - Noise reduction (dB): 12-24 dB (start with 15)
  - Sensitivity: 6.0 (default)
  - Frequency smoothing (bands): 3 (default)
- Click OK to apply
Best practices:
- Use a noise sample of 0.5-2 seconds
- Choose a section with representative noise
- Start with moderate settings and increase if needed
- Preview before applying to full track
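The noise-profile workflow above can be approximated in code with classic spectral subtraction: average the spectrum of a noise-only clip, then subtract it from every frame of the recording. The sketch below illustrates the idea only; the window size, over-subtraction factor, and spectral floor are illustrative choices, not Audacity's actual parameters:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(audio, noise_clip, sample_rate, over_sub=1.5, floor=0.02):
    """Subtract an averaged noise spectrum (the "noise profile") from audio."""
    # Analyze both signals with the same STFT parameters
    _, _, S = stft(audio, fs=sample_rate, nperseg=512)
    _, _, N = stft(noise_clip, fs=sample_rate, nperseg=512)
    # The noise profile: average magnitude per frequency bin
    noise_profile = np.abs(N).mean(axis=1, keepdims=True)
    magnitude, phase = np.abs(S), np.angle(S)
    # Subtract the profile, keeping a small spectral floor to limit artifacts
    cleaned_mag = np.maximum(magnitude - over_sub * noise_profile, floor * magnitude)
    _, y = istft(cleaned_mag * np.exp(1j * phase), fs=sample_rate, nperseg=512)
    return y

# Example: attenuate white noise using a noise-only sample as the profile
rng = np.random.default_rng(0)
noisy = rng.normal(0, 0.1, 16000)
noise_sample = rng.normal(0, 0.1, 8000)  # the "noise-only section"
cleaned = spectral_subtract(noisy, noise_sample, 16000)
```

As with Audacity, the result depends heavily on how representative the noise-only clip is of the noise in the rest of the recording.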
Adobe Audition
Adobe Audition offers professional noise reduction:
- Open audio file
- Select noise-only section
- Go to Effects → Noise Reduction/Restoration → Capture Noise Print
- Select entire track
- Go to Effects → Noise Reduction/Restoration → Noise Reduction (process)
- Adjust:
  - Noise Reduction: 40-80% (start with 60%)
  - Reduce by: 6-12 dB
  - High Frequency Transition: 4000-8000 Hz
- Click Apply
Method 2: Python Audio Processing Libraries
Using noisereduce Library
The noisereduce library provides easy-to-use noise reduction:

```python
import noisereduce as nr
import soundfile as sf

# Load audio file
audio_data, sample_rate = sf.read("noisy_audio.wav")

# Method 1: Stationary noise reduction (for constant noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=True,
    prop_decrease=0.8  # Reduce noise by 80%
)

# Method 2: Non-stationary noise reduction (for variable noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=False,
    prop_decrease=0.8
)

# Save cleaned audio
sf.write("cleaned_audio.wav", reduced_noise, sample_rate)
```

Installation:

```shell
pip install noisereduce soundfile
```
Using librosa for Spectral Gating
```python
import librosa
import numpy as np
import soundfile as sf

def spectral_gate(audio_path, threshold_db=-40):
    """Remove noise using spectral gating."""
    # Load audio
    y, sr = librosa.load(audio_path, sr=None)
    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y)
    magnitude = np.abs(stft)
    phase = np.angle(stft)
    # Convert to dB
    magnitude_db = librosa.amplitude_to_db(magnitude)
    # Apply threshold (remove frequencies below threshold)
    magnitude_db_cleaned = np.where(
        magnitude_db > threshold_db,
        magnitude_db,
        -80  # Silence very quiet parts
    )
    # Convert back to linear scale
    magnitude_cleaned = librosa.db_to_amplitude(magnitude_db_cleaned)
    # Reconstruct audio
    stft_cleaned = magnitude_cleaned * np.exp(1j * phase)
    y_cleaned = librosa.istft(stft_cleaned)
    return y_cleaned, sr

# Usage
cleaned_audio, sample_rate = spectral_gate("noisy_audio.wav", threshold_db=-35)
sf.write("cleaned_audio.wav", cleaned_audio, sample_rate)
```
Using scipy for High-Pass Filtering
Remove low-frequency noise (like rumble, wind):
```python
from scipy import signal
import soundfile as sf

def high_pass_filter(audio_path, cutoff_freq=80):
    """Remove low-frequency noise with a high-pass filter."""
    # Load audio: shape is (samples,) for mono, (samples, channels) for stereo
    audio_data, sample_rate = sf.read(audio_path)
    # Design high-pass Butterworth filter
    nyquist = sample_rate / 2
    normalized_cutoff = cutoff_freq / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    # Apply filter along the time axis (axis=0 also handles stereo correctly)
    filtered_audio = signal.filtfilt(b, a, audio_data, axis=0)
    return filtered_audio, sample_rate

# Usage
cleaned_audio, sr = high_pass_filter("noisy_audio.wav", cutoff_freq=100)
sf.write("cleaned_audio.wav", cleaned_audio, sr)
Method 3: Deep Learning-Based Noise Reduction
Using RNNoise
RNNoise is a deep learning model specifically designed for noise reduction:
Note that RNNoise operates on 48 kHz mono audio in 480-sample (10 ms) frames, and the available Python bindings differ; the binding API below (an `RNNoise` class with a `process()` method) is an assumption — check the documentation of whichever binding you install.

```python
import numpy as np
import soundfile as sf

def rnnoise_denoise(audio_path):
    """Remove noise using the RNNoise model."""
    import rnnoise  # assumed binding API; varies between packages

    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    # RNNoise expects 48 kHz mono audio
    if sample_rate != 48000:
        import librosa
        audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=48000)
        sample_rate = 48000
    # Convert to mono if stereo
    if audio_data.ndim > 1:
        audio_data = np.mean(audio_data, axis=1)
    # Process in 480-sample (10 ms) frames
    chunk_size = 480
    denoised_audio = []
    denoiser = rnnoise.RNNoise()
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)))
        denoised_chunk = denoiser.process(chunk)
        denoised_audio.extend(denoised_chunk)
    return np.array(denoised_audio), sample_rate

# Usage
cleaned_audio, sr = rnnoise_denoise("noisy_audio.wav")
sf.write("cleaned_audio.wav", cleaned_audio, sr)
```

Installation (the exact package name depends on which binding you choose):

```shell
pip install rnnoise
```
Using Facebook's Demucs
Demucs can separate speech from background noise:
```python
import torch
import soundfile as sf
from demucs.pretrained import get_model
from demucs.apply import apply_model
from demucs.audio import AudioFile

def demucs_separation(audio_path):
    """Separate speech (vocals) from background using Demucs."""
    # Load pre-trained model
    model = get_model('htdemucs')
    model.eval()
    # Load audio as a (channels, samples) tensor at the model's sample rate
    wav = AudioFile(audio_path).read(
        streams=0, samplerate=model.samplerate, channels=model.audio_channels
    )
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    # Separate sources
    with torch.no_grad():
        sources = apply_model(model, wav[None])
    sources = sources * ref.std() + ref.mean()
    # Extract the vocals (speech) stem by name rather than by index
    vocals_idx = model.sources.index('vocals')
    speech = sources[0, vocals_idx].mean(0).cpu().numpy()  # mix down to mono
    return speech, model.samplerate

# Usage
speech_audio, sr = demucs_separation("noisy_audio.wav")
sf.write("speech_only.wav", speech_audio, sr)
```
Method 4: Online Noise Reduction Tools
1. Audacity Online (Cloud Version)
- Free, browser-based
- Good for quick processing
- Limited file size
2. Adobe Podcast Enhance
- AI-powered noise reduction
- Free for limited use
- Excellent results for speech
3. Krisp.ai
- Real-time noise suppression
- API available for integration
- Good for live audio
4. Cleanvoice.ai
- Automatic noise removal
- Handles multiple noise types
- Batch processing available
Complete Workflow: Preprocessing Audio for STT
Here's a complete Python script that combines multiple techniques:
```python
import librosa
import noisereduce as nr
import soundfile as sf
from scipy import signal
import numpy as np

def preprocess_audio_for_stt(audio_path, output_path):
    """Complete audio preprocessing pipeline for STT."""
    # Step 1: Load audio at 16 kHz mono
    print("Loading audio...")
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    # Step 2: Remove DC offset
    print("Removing DC offset...")
    y = y - np.mean(y)
    # Step 3: High-pass filter (remove low-frequency noise)
    print("Applying high-pass filter...")
    nyquist = sr / 2
    normalized_cutoff = 80 / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    y = signal.filtfilt(b, a, y)
    # Step 4: Normalize volume
    print("Normalizing volume...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95  # Normalize to 95% to avoid clipping
    # Step 5: Noise reduction
    print("Reducing noise...")
    y = nr.reduce_noise(
        y=y,
        sr=sr,
        stationary=False,  # Use non-stationary for variable noise
        prop_decrease=0.8  # Reduce noise by 80%
    )
    # Step 6: Final normalization
    print("Final normalization...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95
    # Step 7: Save processed audio
    print(f"Saving to {output_path}...")
    sf.write(output_path, y, sr)
    print("Preprocessing complete!")
    return y, sr

# Usage
preprocess_audio_for_stt("noisy_recording.wav", "cleaned_for_stt.wav")
```
Best Practices for Noise Removal
1. Choose the Right Method
- Constant noise: Use spectral subtraction or stationary noise reduction
- Variable noise: Use non-stationary reduction or deep learning models
- Impulsive noise: Use click removal or interpolation
- Multiple noise types: Combine multiple techniques
2. Preserve Speech Quality
- Don't over-process (can introduce artifacts)
- Use moderate noise reduction settings (60-80%)
- Preserve frequency range of human speech (80-8000 Hz)
- Maintain natural speech characteristics
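One way to honor the 80-8000 Hz guideline in code is a band-pass filter that keeps the speech band and discards everything outside it. A minimal SciPy sketch, where the filter order and exact cutoffs are illustrative rather than prescriptive:

```python
import numpy as np
from scipy import signal

def speech_band_filter(audio_data, sample_rate, low=80.0, high=8000.0):
    """Keep only the 80-8000 Hz band where most speech energy lives."""
    nyquist = sample_rate / 2
    # The upper cutoff must stay below Nyquist for a valid filter design
    high = min(high, 0.99 * nyquist)
    b, a = signal.butter(4, [low / nyquist, high / nyquist], btype='band')
    # Zero-phase filtering avoids smearing speech transients
    return signal.filtfilt(b, a, audio_data)

# Example: 50 Hz rumble is attenuated, a 1 kHz speech-band tone passes
sr = 44100
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 50 * t)
tone = np.sin(2 * np.pi * 1000 * t)
filtered_rumble = speech_band_filter(rumble, sr)
filtered_tone = speech_band_filter(tone, sr)
```

A moderate filter order like this rolls off gently, which tends to sound more natural than a brick-wall cutoff.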
3. Test and Iterate
- Always preview before applying to full track
- Compare original vs. processed audio
- Test transcription accuracy with both versions
- Adjust settings based on results
4. Consider Your STT Model
- Some models (like Whisper) handle noise well
- Preprocessing may not always be necessary
- Test with and without preprocessing
- Larger models are more noise-robust
Common Mistakes to Avoid
❌ Over-aggressive noise reduction
- Can remove speech frequencies
- Creates artifacts and distortion
- Makes speech sound robotic
❌ Removing too much low frequency
- Can remove important speech components
- Makes speech sound thin or tinny
- Affects naturalness
❌ Not testing with your STT model
- Preprocessing may not improve accuracy
- Some models work better with original audio
- Always A/B test
❌ Ignoring audio format
- Ensure proper sample rate (16kHz recommended)
- Use lossless formats when possible
- Avoid double compression
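If your audio is not already at the sample rate your STT model expects, resample it before transcription. A small sketch using SciPy's polyphase resampler, which applies an anti-aliasing filter internally:

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio_data, sample_rate):
    """Resample audio to 16 kHz, a common STT input rate."""
    if sample_rate == 16000:
        return audio_data, 16000
    # Reduce the up/down ratio, e.g. 44100 -> 16000 becomes 160/441
    g = np.gcd(16000, sample_rate)
    resampled = resample_poly(audio_data, 16000 // g, sample_rate // g)
    return resampled, 16000

# Example: one second of 44.1 kHz audio becomes ~16000 samples
audio = np.random.default_rng(0).normal(0, 0.1, 44100)
resampled, new_sr = resample_to_16k(audio, 44100)
```

Resampling like this is preferable to letting a lossy re-encode (e.g. MP3 to MP3) do the conversion, which stacks compression artifacts.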
Integration with Speech-to-Text
Using with OpenAI Whisper
```python
import os
import whisper
import noisereduce as nr
import soundfile as sf

def transcribe_with_noise_reduction(audio_path):
    """Transcribe audio with noise reduction preprocessing."""
    # Step 1: Reduce noise
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.75
    )
    # Save temporary cleaned file
    temp_path = "temp_cleaned.wav"
    sf.write(temp_path, cleaned_audio, sr)
    # Step 2: Transcribe with Whisper
    model = whisper.load_model("base")
    result = model.transcribe(temp_path)
    # Clean up
    os.remove(temp_path)
    return result["text"]

# Usage
transcription = transcribe_with_noise_reduction("noisy_audio.wav")
print(transcription)
```
Using with SayToWords API
```python
import requests
import noisereduce as nr
import soundfile as sf

def transcribe_with_saytowords(audio_path):
    """Preprocess and transcribe with SayToWords."""
    # Preprocess audio
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.8
    )
    # Save cleaned audio
    cleaned_path = "cleaned_for_api.wav"
    sf.write(cleaned_path, cleaned_audio, sr)
    # Upload and transcribe
    with open(cleaned_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(
            'https://api.saytowords.com/transcribe',
            files=files,
            headers={'Authorization': 'Bearer YOUR_API_KEY'}
        )
    return response.json()
```
Measuring Noise Reduction Effectiveness
Before/After Comparison
```python
import librosa
import numpy as np

def measure_snr(audio_path):
    """Estimate signal-to-noise ratio."""
    y, sr = librosa.load(audio_path, sr=None)
    # Simple SNR estimation: total power vs. an estimated noise floor
    signal_power = np.mean(y ** 2)
    noise_floor = np.percentile(np.abs(y), 10) ** 2
    snr_db = 10 * np.log10(signal_power / noise_floor) if noise_floor > 0 else 0
    return snr_db

# Compare before and after
original_snr = measure_snr("noisy_audio.wav")
cleaned_snr = measure_snr("cleaned_audio.wav")
print(f"Original SNR: {original_snr:.2f} dB")
print(f"Cleaned SNR: {cleaned_snr:.2f} dB")
print(f"Improvement: {cleaned_snr - original_snr:.2f} dB")
```
Conclusion
Removing background noise before speech-to-text processing can significantly improve transcription accuracy. The best approach depends on:
- Noise type (constant vs. variable)
- Audio quality (sample rate, bit depth)
- Available tools (software vs. programming)
- STT model (some handle noise better than others)
Quick recommendations:
- For quick processing: Use Audacity or online tools
- For automation: Use Python libraries like noisereduce
- For best results: Combine multiple techniques
- For production: Test with your specific STT model
Remember: Not all audio needs preprocessing. Some modern STT models like Whisper are quite robust to noise. Always test both original and processed audio to see which gives better results for your specific use case.
Additional Resources
Need help with noise reduction for your specific audio? Try SayToWords Speech-to-Text which includes built-in noise handling and preprocessing options.