
How to Remove Background Noise for STT: Complete Guide to Noise Reduction for Speech-to-Text

Eric King

Author


Background noise is one of the most common challenges when transcribing audio recordings. Whether it's traffic sounds, keyboard typing, air conditioning, or crowd noise, removing background noise before speech-to-text processing can significantly improve transcription accuracy.
This comprehensive guide covers practical methods for removing background noise for STT, from simple software solutions to advanced audio processing techniques.

Why Remove Background Noise for STT?

Background noise negatively impacts speech-to-text accuracy in several ways:
  • Reduced signal-to-noise ratio (SNR) makes it harder for models to distinguish speech
  • Frequency masking where noise overlaps with speech frequencies
  • Model confusion when noise patterns resemble speech
  • Lower confidence scores leading to more transcription errors
  • Increased processing time as models struggle with noisy input
Benefits of noise removal:
  • ✅ Improved transcription accuracy (often 10-30% better)
  • ✅ Better word recognition, especially for technical terms
  • ✅ Faster processing with cleaner audio
  • ✅ More reliable timestamps and segmentation
  • ✅ Better handling of quiet speech
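
To make the signal-to-noise ratio mentioned above concrete, the sketch below mixes a synthetic tone with white noise at a chosen SNR and measures it back. The helper names (`snr_db`, `mix_at_snr`) and the 440 Hz tone standing in for speech are illustrative assumptions, not part of any library.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10 * log10(signal power / noise power)."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def mix_at_snr(signal, noise, target_db):
    """Scale the noise so that signal + noise has the requested SNR."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Required noise power is signal_power / 10^(target_db / 10)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (target_db / 10)))
    scaled_noise = noise * scale
    return signal + scaled_noise, scaled_noise

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for speech
rng = np.random.default_rng(0)
white = rng.standard_normal(sr)            # stand-in for background noise

noisy, scaled_noise = mix_at_snr(tone, white, target_db=10)
print(f"SNR: {snr_db(tone, scaled_noise):.1f} dB")  # prints "SNR: 10.0 dB"
```

Mixing clean speech with recorded noise at a few known SNR levels in this way is one option for building a small test set before tuning any noise reduction settings.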

Understanding Background Noise Types

Different noise types require different removal strategies:

1. Constant Noise (Stationary)

  • Examples: Air conditioning, fan hum, electrical hum, white noise
  • Characteristics: Consistent frequency and amplitude
  • Removal: Easier to remove with spectral subtraction or filtering
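
As a rough illustration of spectral subtraction for stationary noise, the sketch below estimates an average noise spectrum from a noise-only clip and subtracts it from every frame. The function name `spectral_subtract`, the 5% spectral floor, and the synthetic test signal are assumptions made for this example; production tools use more refined variants.

```python
import numpy as np
from scipy import signal

def spectral_subtract(y, sr, noise_clip, nperseg=512, floor=0.05):
    """Subtract the average noise magnitude spectrum from every frame."""
    # Average magnitude per frequency bin over the noise-only clip
    _, _, noise_stft = signal.stft(noise_clip, fs=sr, nperseg=nperseg)
    noise_mag = np.abs(noise_stft).mean(axis=1, keepdims=True)

    _, _, stft = signal.stft(y, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(stft), np.angle(stft)

    # Subtract the noise estimate, keeping a small floor to limit artifacts
    cleaned_mag = np.maximum(mag - noise_mag, floor * mag)

    # Rebuild the waveform using the original (noisy) phase
    _, cleaned = signal.istft(cleaned_mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return cleaned[:len(y)]

# Synthetic demo: a 440 Hz tone stands in for speech
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(2 * sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = tone + 0.05 * rng.standard_normal(t.size)
noise_clip = 0.05 * rng.standard_normal(sr // 2)  # 0.5 s of noise only

cleaned = spectral_subtract(noisy, sr, noise_clip)
print("error power before:", np.mean((noisy - tone) ** 2))
print("error power after: ", np.mean((cleaned - tone) ** 2))
```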

2. Variable Noise (Non-Stationary)

  • Examples: Traffic, crowd chatter, keyboard typing, paper rustling
  • Characteristics: Changes over time, unpredictable patterns
  • Removal: Requires more advanced techniques like deep learning models

3. Impulsive Noise

  • Examples: Clicks, pops, door slams, phone rings
  • Characteristics: Short, sudden bursts
  • Removal: Requires detection and replacement/interpolation
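
A minimal sketch of the detect-and-interpolate idea for impulsive noise follows. It flags samples that deviate strongly from a median-filtered copy and replaces them by linear interpolation. The function name `remove_clicks`, the kernel size, and the threshold are assumptions chosen for this synthetic example; real recordings usually need tuning and a more robust detector.

```python
import numpy as np
from scipy.signal import medfilt

def remove_clicks(y, kernel_size=5, threshold=5.0):
    """Replace samples that deviate sharply from a median-smoothed copy."""
    diff = y - medfilt(y, kernel_size=kernel_size)
    # Flag samples whose deviation is far above the typical deviation
    clicks = np.abs(diff) > threshold * diff.std()

    cleaned = y.copy()
    good = np.flatnonzero(~clicks)
    bad = np.flatnonzero(clicks)
    # Linear interpolation across flagged samples from clean neighbours
    cleaned[bad] = np.interp(bad, good, y[good])
    return cleaned, clicks

# Synthetic demo: a slightly noisy 200 Hz tone with five injected clicks
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(sr)
click_positions = np.array([2000, 5000, 8000, 11000, 14000])
noisy = clean.copy()
noisy[click_positions] += 0.8

cleaned, clicks = remove_clicks(noisy)
print("samples flagged:", int(clicks.sum()))
print("max error at clicks:",
      np.max(np.abs(cleaned[click_positions] - clean[click_positions])))
```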

4. Periodic Noise

  • Examples: Beeping, alarms, repetitive sounds
  • Characteristics: Regular patterns at specific frequencies
  • Removal: Can be filtered with notch filters
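
For periodic noise at a known frequency, a notch filter can be sketched with SciPy as below. The 50 Hz hum, the quality factor of 30, and the function name `notch_filter` are assumptions for illustration; mains hum is 60 Hz in some regions and often has harmonics that need their own notches.

```python
import numpy as np
from scipy import signal

def notch_filter(y, sr, freq=50.0, quality=30.0):
    """Attenuate a narrow band around `freq` Hz with an IIR notch filter."""
    b, a = signal.iirnotch(freq, quality, fs=sr)
    # filtfilt runs the filter forwards and backwards for zero phase shift
    return signal.filtfilt(b, a, y, axis=0)

# Synthetic demo: a 300 Hz tone stands in for speech, plus 50 Hz hum
sr = 16000
t = np.arange(3 * sr) / sr
voice = 0.5 * np.sin(2 * np.pi * 300 * t)
hum = 0.2 * np.sin(2 * np.pi * 50 * t)

filtered = notch_filter(voice + hum, sr, freq=50.0)
```

A higher quality factor gives a narrower notch that removes less of the surrounding speech, at the cost of a longer settling time at the start and end of the recording.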

Method 1: Using Audio Editing Software

Audacity (Free, Open Source)

Audacity is a powerful, free audio editor with built-in noise reduction:
Steps:
  1. Open your audio file in Audacity
  2. Select a section with only noise (no speech)
  3. Go to Effect → Noise Reduction
  4. Click Get Noise Profile
  5. Select the entire audio track
  6. Go to Effect → Noise Reduction again
  7. Adjust settings:
    • Noise reduction (dB): 12-24 dB (start with 15)
    • Sensitivity: 6.0 (default)
    • Frequency smoothing (bands): 3 (default)
  8. Click OK to apply
Best practices:
  • Use a noise sample of 0.5-2 seconds
  • Choose a section with representative noise
  • Start with moderate settings and increase if needed
  • Preview before applying to full track

Adobe Audition

Adobe Audition offers professional noise reduction:
  1. Open audio file
  2. Select noise-only section
  3. Go to Effects → Noise Reduction/Restoration → Capture Noise Print
  4. Select entire track
  5. Go to Effects → Noise Reduction/Restoration → Noise Reduction (process)
  6. Adjust:
    • Noise Reduction: 40-80% (start with 60%)
    • Reduce by: 6-12 dB
    • High Frequency Transition: 4000-8000 Hz
  7. Click Apply

Method 2: Python Audio Processing Libraries

Using noisereduce Library

The noisereduce library provides easy-to-use noise reduction:
import noisereduce as nr
import soundfile as sf

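# Load audio file (soundfile returns stereo as shape (frames, channels))
audio_data, sample_rate = sf.read("noisy_audio.wav")

# Downmix to mono; noisereduce expects 1-D or (channels, frames) input
if audio_data.ndim > 1:
    audio_data = audio_data.mean(axis=1)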

# Method 1: Stationary noise reduction (for constant noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=True,
    prop_decrease=0.8  # Reduce noise by 80%
)

# Method 2: Non-stationary noise reduction (for variable noise)
reduced_noise = nr.reduce_noise(
    y=audio_data,
    sr=sample_rate,
    stationary=False,
    prop_decrease=0.8
)

# Save cleaned audio
sf.write("cleaned_audio.wav", reduced_noise, sample_rate)
Installation:
pip install noisereduce soundfile

Using librosa for Spectral Gating

import librosa
import numpy as np
import soundfile as sf

def spectral_gate(audio_path, threshold_db=-40):
    """Remove noise using spectral gating."""
    # Load audio at its native sample rate
    y, sr = librosa.load(audio_path, sr=None)
    
    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y)
    magnitude = np.abs(stft)
    
    # Convert to dB relative to the loudest bin, so threshold_db is
    # measured down from the peak rather than from an amplitude of 1.0
    magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
    
    # Gate: keep bins above the threshold, zero out the rest
    mask = magnitude_db > threshold_db
    stft_cleaned = stft * mask
    
    # Reconstruct audio, trimmed to the original length
    y_cleaned = librosa.istft(stft_cleaned, length=len(y))
    
    return y_cleaned, sr

# Usage
cleaned_audio, sample_rate = spectral_gate("noisy_audio.wav", threshold_db=-35)
sf.write("cleaned_audio.wav", cleaned_audio, sample_rate)

Using scipy for High-Pass Filtering

Remove low-frequency noise (like rumble, wind):
from scipy import signal
import soundfile as sf

def high_pass_filter(audio_path, cutoff_freq=80):
    """Remove low-frequency noise with high-pass filter."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    
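    # Design a 4th-order Butterworth high-pass filter; second-order
    # sections stay numerically stable at low cutoff frequencies
    sos = signal.butter(4, cutoff_freq, btype='high', fs=sample_rate, output='sos')
    
    # Apply zero-phase filtering along the time axis (axis 0), so stereo
    # files shaped (frames, channels) are filtered per channel
    filtered_audio = signal.sosfiltfilt(sos, audio_data, axis=0)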
    
    return filtered_audio, sample_rate

# Usage
cleaned_audio, sr = high_pass_filter("noisy_audio.wav", cutoff_freq=100)
sf.write("cleaned_audio.wav", cleaned_audio, sr)

Method 3: Deep Learning-Based Noise Reduction

Using RNNoise

RNNoise is a deep learning model specifically designed for noise reduction:
import rnnoise
import numpy as np
import soundfile as sf

def rnnoise_denoise(audio_path):
    """Remove noise using RNNoise model."""
    # Load audio
    audio_data, sample_rate = sf.read(audio_path)
    
    # Convert to mono first (soundfile returns stereo as (frames, channels))
    if audio_data.ndim > 1:
        audio_data = np.mean(audio_data, axis=1)
    
    # RNNoise operates on 48 kHz mono audio
    if sample_rate != 48000:
        import librosa
        audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=48000)
        sample_rate = 48000
    
    # Process in chunks (RNNoise processes 480 samples, i.e. 10 ms at 48 kHz)
    chunk_size = 480
    denoised_audio = []
    
    # NOTE: the class and method names below depend on which Python
    # binding is installed; check your package's documentation
    denoiser = rnnoise.RNNoise()
    
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i+chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)))
        
        denoised_chunk = denoiser.process(chunk)
        denoised_audio.extend(denoised_chunk)
    
    # Drop the zero padding added to the final chunk
    return np.array(denoised_audio)[:len(audio_data)], sample_rate

# Usage
cleaned_audio, sr = rnnoise_denoise("noisy_audio.wav")
sf.write("cleaned_audio.wav", cleaned_audio, sr)
Installation (several Python bindings for RNNoise exist; the package name and API depend on the one you choose):
pip install rnnoise

Using Facebook's Demucs

Demucs is a music source-separation model; its vocals stem can serve as an estimate of the speech in a noisy recording:
import soundfile as sf
import torch
from demucs.apply import apply_model
from demucs.audio import AudioFile
from demucs.pretrained import get_model

def demucs_separation(audio_path):
    """Separate speech from noise using Demucs."""
    # Load pre-trained model
    model = get_model('htdemucs')
    model.eval()
    
    # Load audio as a (channels, samples) tensor at the model's sample rate
    wav = AudioFile(audio_path).read(
        streams=0, samplerate=model.samplerate, channels=model.audio_channels
    )
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    
    # Separate sources; apply_model handles chunking and bagged models
    with torch.no_grad():
        sources = apply_model(model, wav[None])[0]
    sources = sources * ref.std() + ref.mean()
    
    # Pick the vocals stem by name rather than by a hard-coded index
    speech = sources[model.sources.index('vocals')].cpu().numpy()
    
    # Transpose to (samples, channels), the layout soundfile expects
    return speech.T, model.samplerate

# Usage
speech_audio, sr = demucs_separation("noisy_audio.wav")
sf.write("speech_only.wav", speech_audio, sr)

Method 4: Online Noise Reduction Tools

1. Audacity Online (Cloud Version)

  • Free, browser-based
  • Good for quick processing
  • Limited file size

2. Adobe Podcast Enhance

  • AI-powered noise reduction
  • Free for limited use
  • Excellent results for speech

3. Krisp.ai

  • Real-time noise suppression
  • API available for integration
  • Good for live audio

4. Cleanvoice.ai

  • Automatic noise removal
  • Handles multiple noise types
  • Batch processing available

Complete Workflow: Preprocessing Audio for STT

Here's a complete Python script that combines multiple techniques:
import librosa
import noisereduce as nr
import soundfile as sf
from scipy import signal
import numpy as np

def preprocess_audio_for_stt(audio_path, output_path):
    """Complete audio preprocessing pipeline for STT."""
    
    # Step 1: Load audio
    print("Loading audio...")
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    
    # Step 2: Remove DC offset
    print("Removing DC offset...")
    y = y - np.mean(y)
    
    # Step 3: High-pass filter (remove low-frequency noise)
    print("Applying high-pass filter...")
    nyquist = sr / 2
    normalized_cutoff = 80 / nyquist
    b, a = signal.butter(4, normalized_cutoff, btype='high')
    y = signal.filtfilt(b, a, y)
    
    # Step 4: Normalize volume
    print("Normalizing volume...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95  # Normalize to 95% to avoid clipping
    
    # Step 5: Noise reduction
    print("Reducing noise...")
    y = nr.reduce_noise(
        y=y,
        sr=sr,
        stationary=False,  # Use non-stationary for variable noise
        prop_decrease=0.8  # Reduce noise by 80%
    )
    
    # Step 6: Final normalization
    print("Final normalization...")
    max_val = np.max(np.abs(y))
    if max_val > 0:
        y = y / max_val * 0.95
    
    # Step 7: Save processed audio
    print(f"Saving to {output_path}...")
    sf.write(output_path, y, sr)
    
    print("Preprocessing complete!")
    return y, sr

# Usage
preprocess_audio_for_stt("noisy_recording.wav", "cleaned_for_stt.wav")

Best Practices for Noise Removal

1. Choose the Right Method

  • Constant noise: Use spectral subtraction or stationary noise reduction
  • Variable noise: Use non-stationary reduction or deep learning models
  • Impulsive noise: Use click removal or interpolation
  • Multiple noise types: Combine multiple techniques

2. Preserve Speech Quality

  • Don't over-process (can introduce artifacts)
  • Use moderate noise reduction settings (60-80%)
  • Preserve frequency range of human speech (80-8000 Hz)
  • Maintain natural speech characteristics

3. Test and Iterate

  • Always preview before applying to full track
  • Compare original vs. processed audio
  • Test transcription accuracy with both versions
  • Adjust settings based on results
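
One way to compare transcription accuracy between the original and processed versions is word error rate (WER) against a hand-checked reference transcript. The sketch below is a plain-Python implementation; the function name `word_error_rate` and the three example transcripts are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transcripts for one clip
reference = "please remove the background noise"
from_original = "please move the back ground noise"
from_cleaned = "please remove the background noise"

print(word_error_rate(reference, from_original))  # 0.6
print(word_error_rate(reference, from_cleaned))   # 0.0
```

If the processed audio does not lower WER on a handful of representative clips, the preprocessing step is probably not worth keeping for that model.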

4. Consider Your STT Model

  • Some models (like Whisper) handle noise well
  • Preprocessing may not always be necessary
  • Test with and without preprocessing
  • Larger models tend to be more noise-robust

Common Mistakes to Avoid

❌ Over-aggressive noise reduction
  • Can remove speech frequencies
  • Creates artifacts and distortion
  • Makes speech sound robotic
❌ Removing too much low frequency
  • Can remove important speech components
  • Makes speech sound thin or tinny
  • Affects naturalness
❌ Not testing with your STT model
  • Preprocessing may not improve accuracy
  • Some models work better with original audio
  • Always A/B test
❌ Ignoring audio format
  • Ensure proper sample rate (16kHz recommended)
  • Use lossless formats when possible
  • Avoid double compression
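
The sample-rate point above can be handled in code before any noise reduction. The sketch below downmixes and resamples to 16 kHz with SciPy; the function name `to_16k_mono` is hypothetical, and the input is assumed to be a NumPy array shaped (frames,) or (frames, channels), as returned by soundfile.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k_mono(audio, sr, target_sr=16000):
    """Downmix (frames, channels) audio to mono and resample to target_sr."""
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sr == target_sr:
        return audio, sr
    g = gcd(int(sr), target_sr)
    # resample_poly applies an anti-aliasing filter while resampling
    resampled = resample_poly(audio, target_sr // g, int(sr) // g)
    return resampled, target_sr

# Synthetic demo: one second of 44.1 kHz stereo noise
sr_in = 44100
rng = np.random.default_rng(0)
stereo = 0.1 * rng.standard_normal((sr_in, 2))

mono_16k, sr_out = to_16k_mono(stereo, sr_in)
print(mono_16k.shape, sr_out)
```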

Integration with Speech-to-Text

Using with OpenAI Whisper

import os
import tempfile

import noisereduce as nr
import soundfile as sf
import whisper

def transcribe_with_noise_reduction(audio_path):
    """Transcribe audio with noise reduction preprocessing."""
    
    # Step 1: Reduce noise (downmix first: soundfile returns stereo as
    # (frames, channels), noisereduce expects 1-D or (channels, frames))
    audio_data, sr = sf.read(audio_path)
    if audio_data.ndim > 1:
        audio_data = audio_data.mean(axis=1)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.75
    )
    
    # Save cleaned audio to a unique temporary file
    fd, temp_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    
    try:
        sf.write(temp_path, cleaned_audio, sr)
        
        # Step 2: Transcribe with Whisper
        model = whisper.load_model("base")
        result = model.transcribe(temp_path)
    finally:
        # Clean up even if transcription fails
        os.remove(temp_path)
    
    return result["text"]

# Usage
transcription = transcribe_with_noise_reduction("noisy_audio.wav")
print(transcription)

Using with SayToWords API

import requests
import noisereduce as nr
import soundfile as sf

def transcribe_with_saytowords(audio_path):
    """Preprocess and transcribe with SayToWords."""
    
    # Preprocess audio
    audio_data, sr = sf.read(audio_path)
    cleaned_audio = nr.reduce_noise(
        y=audio_data,
        sr=sr,
        stationary=False,
        prop_decrease=0.8
    )
    
    # Save cleaned audio
    cleaned_path = "cleaned_for_api.wav"
    sf.write(cleaned_path, cleaned_audio, sr)
    
    # Upload and transcribe
    with open(cleaned_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(
            'https://api.saytowords.com/transcribe',
            files=files,
            headers={'Authorization': 'Bearer YOUR_API_KEY'}
        )
    
    # Fail loudly on HTTP errors instead of parsing an error body
    response.raise_for_status()
    return response.json()

Measuring Noise Reduction Effectiveness

Before/After Comparison

import librosa
import numpy as np

def measure_snr(audio_path):
    """Estimate signal-to-noise ratio."""
    y, sr = librosa.load(audio_path, sr=None)
    
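    # Rough SNR estimate: treat the 10th-percentile sample amplitude
    # as the noise floor (only meaningful for before/after comparisons)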
    signal_power = np.mean(y ** 2)
    noise_floor = np.percentile(np.abs(y), 10) ** 2
    snr_db = 10 * np.log10(signal_power / noise_floor) if noise_floor > 0 else 0
    
    return snr_db

# Compare before and after
original_snr = measure_snr("noisy_audio.wav")
cleaned_snr = measure_snr("cleaned_audio.wav")

print(f"Original SNR: {original_snr:.2f} dB")
print(f"Cleaned SNR: {cleaned_snr:.2f} dB")
print(f"Improvement: {cleaned_snr - original_snr:.2f} dB")
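
An alternative rough estimate works on short frames rather than individual samples and treats the quietest frames as noise. The sketch below assumes the recording contains some pauses and that the noise level is roughly constant; the function name `estimate_snr_frames` and its defaults are illustrative choices, not a standard.

```python
import numpy as np

def estimate_snr_frames(y, sr, frame_ms=20, quiet_pct=10):
    """Rough SNR estimate (dB) from short-frame energies.

    Assumes the quietest frames contain only noise and that the
    noise power is roughly constant across the recording.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)

    # Noise power: mean energy of the quietest frames
    noise_power = np.mean(energy[energy <= np.percentile(energy, quiet_pct)])
    # Signal power: average power minus the estimated noise power
    signal_power = max(np.mean(energy) - noise_power, 1e-12)
    return 10 * np.log10(signal_power / max(noise_power, 1e-12))

# Synthetic demo: one second of noise only, then one second of tone plus noise
sr = 16000
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(2 * sr)
t = np.arange(sr) / sr
tone = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 300 * t)])

estimate = estimate_snr_frames(tone + noise, sr)
print(f"Estimated SNR: {estimate:.1f} dB")
```

Because the quietest frames slightly underestimate the true noise power, the result tends to read a little high; it is best used to compare two versions of the same recording rather than as an absolute figure.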

Conclusion

Removing background noise before speech-to-text processing can significantly improve transcription accuracy. The best approach depends on:
  • Noise type (constant vs. variable)
  • Audio quality (sample rate, bit depth)
  • Available tools (software vs. programming)
  • STT model (some handle noise better than others)
Quick recommendations:
  • For quick processing: Use Audacity or online tools
  • For automation: Use Python libraries like noisereduce
  • For best results: Combine multiple techniques
  • For production: Test with your specific STT model
Remember: Not all audio needs preprocessing. Some modern STT models like Whisper are quite robust to noise. Always test both original and processed audio to see which gives better results for your specific use case.


Need help with noise reduction for your specific audio? Try SayToWords Speech-to-Text which includes built-in noise handling and preprocessing options.
