Understanding Speech-to-Text Quality: WER and CER Explained

Eric King

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), has become a core capability in modern AI applications—powering voice assistants, call-center analytics, smart devices, automated captioning, and more.
As adoption grows across industries, one question often arises:
How do we measure the quality of Speech-to-Text output?
Two metrics dominate the field:
  • WER (Word Error Rate)
  • CER (Character Error Rate)
Despite their simplicity, these metrics directly influence how we evaluate models, compare engines, and monitor production performance. This article breaks down what they mean, when to use each, and how to interpret them in real-world scenarios.

What Is WER (Word Error Rate)?

WER is the most widely used metric for evaluating speech recognition in languages with clear word boundaries such as English, Spanish, German, or French.
It measures how many mistakes appear in the transcribed text compared to a reference transcript.

Formula

WER = (S + D + I) / N
Where:
  • S — Substitutions (a word is replaced with an incorrect one)
  • D — Deletions (a word from the reference is missing in the hypothesis)
  • I — Insertions (an extra word is added in the hypothesis that isn't in the reference)
  • N — Total number of words in the reference text
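In practice, S, D, and I come from the minimum-cost alignment between the reference and the hypothesis, which is exactly the word-level Levenshtein (edit) distance. Below is a minimal, self-contained Python sketch of that computation; the function names are illustrative rather than taken from any particular library:

```python
def edit_distance(ref_tokens: list[str], hyp_tokens: list[str]) -> int:
    """Minimum total substitutions, deletions, and insertions (S + D + I)
    needed to turn hyp_tokens into ref_tokens (word-level Levenshtein)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    # dp[i][j] = edit distance between ref_tokens[:i] and hyp_tokens[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[m][n]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, where N is the reference word count."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```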

WER Thresholds for Interpretation

  • 0% → perfect transcription
  • 10–20% → acceptable for many industrial tasks
  • 20–40% → typical for noisy environments or accented speech
  • 40%+ → poor recognition quality

Example

Reference: "The quick brown fox jumps over the lazy dog"
Hypothesis: "The quick brown fox jump over lazy dog"
Errors:
  • Substitution ("jumps" → "jump")
  • Deletion (the second "the", before "lazy", is missing)
  • 0 Insertions
Calculation:
WER = (1 + 1 + 0) / 9 = 22.2%
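Running the word_error_rate sketch from above on this example (with both strings lowercased so case is not counted as an error) reproduces the same figure:

```python
reference = "the quick brown fox jumps over the lazy dog"   # N = 9 words
hypothesis = "the quick brown fox jump over lazy dog"

# 1 substitution ("jumps" -> "jump") + 1 deletion ("the") = 2 errors
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```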

What Is CER (Character Error Rate)?

CER evaluates transcription accuracy at the character level rather than the word level.
This metric is especially important for:
  • Chinese, Japanese, Korean (languages without natural word spacing)
  • OCR (image text recognition)
  • Models requiring extremely fine-grained evaluation

Formula

CER = (S + D + I) / N_characters
Where the components (S, D, I) refer to character-level substitutions, deletions, and insertions, and N_characters is the total number of characters in the reference text.
Because it measures each character individually, CER can highlight errors that WER may hide—particularly in languages where a missing character changes the meaning completely.
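A matching character-level sketch, reusing the edit_distance helper from the WER example above. Whether spaces count as characters is a convention that varies between toolkits; this sketch drops them, which is one common choice:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (S + D + I) / N_characters, using the edit_distance function
    from the WER sketch, applied to characters instead of words."""
    # Dropping spaces is a convention, not a universal rule.
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted character out of seven: CER = 1/7
print(f"CER: {character_error_rate('今日は晴れです', '今日は腫れです'):.1%}")  # CER: 14.3%
```

Here a single substituted character turns 晴れ ("sunny") into 腫れ ("swelling"): exactly the kind of meaning-changing error the paragraph above describes.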

WER vs CER: When to Choose Which?

| Scenario | Recommended Metric | Why |
| --- | --- | --- |
| English, Spanish, French, etc. | WER | Words are natural semantic units |
| Chinese / Japanese / Korean | CER | No spaces; characters carry core meaning |
| OCR text recognition | CER | Requires detailed character-level accuracy |
| Mixed-language content | Both | Provides complementary semantic and granular insights |
| Noisy, multi-speaker datasets | WER | Better reflects semantic errors that impact usability |

Why Evaluation Matters in Speech-to-Text

Modern STT systems—such as Whisper, Deepgram, Google ASR, or custom fine-tuned models—are increasingly accurate. But without consistent evaluation metrics, it becomes impossible to answer critical questions like:
  • Which model performs best on my domain-specific data?
  • Does transcription accuracy degrade over time in production?
  • Did a new model update improve (or harm) transcription quality?
  • How significant is the impact of background noise or accent variation?
WER and CER give teams an objective way to measure improvements and track production quality at scale.

Practical Tips for Using WER / CER

1. Always normalize text

Before calculating metrics, apply these preprocessing steps to avoid inflating error rates with trivial differences:
  • Case folding (convert all text to lowercase/uppercase)
  • Punctuation removal
  • Unicode normalization (standardize special characters)
  • Consistent tokenization (align word/character boundaries)
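A minimal normalization sketch covering these four steps; NFKC and the regular expressions below are common choices rather than a universal standard:

```python
import re
import unicodedata


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = text.lower()                         # case folding
    text = re.sub(r"[^\w\s]", "", text)         # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()    # consistent whitespace/tokenization
    return text


print(normalize("The QUICK, brown fox!"))  # -> "the quick brown fox"
```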

2. Use segment-level evaluation

Instead of comparing entire paragraphs, evaluate accuracy by smaller units:
  • Sentences
  • Time-aligned audio segments
  • Speaker turns
This approach pinpoints exactly where errors occur (e.g., noisy audio clips, fast speech) for targeted model optimization.
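A short sketch of what segment-level scoring can look like, assuming the word_error_rate function from earlier and a hypothetical list of time-aligned segments (the id/ref/hyp field names are illustrative):

```python
# Hypothetical segments, e.g. one per speaker turn or audio chunk.
segments = [
    {"id": "utt-001", "ref": "turn left at the next junction",
     "hyp": "turn left at the next junction"},
    {"id": "utt-002", "ref": "the meeting starts at nine",
     "hyp": "the meeting starts at night"},
]

for seg in segments:
    wer = word_error_rate(seg["ref"], seg["hyp"])  # sketch defined earlier
    flag = "  <- review" if wer > 0.15 else ""
    print(f'{seg["id"]}: WER {wer:.1%}{flag}')

# utt-001: WER 0.0%
# utt-002: WER 20.0%  <- review
```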

3. Don't obsess over absolute numbers

A small numerical difference in WER/CER does not always translate into a noticeable difference in real-world usability. For example:
  • Model A: 7.1% WER
  • Model B: 6.5% WER
The 0.6-percentage-point gap is negligible in practice; always listen to sample outputs and assess semantic meaning before choosing a model. WER/CER are approximations, not full measures of meaning preservation.

The Future of Speech-to-Text Metrics

As LLM-driven STT systems become more capable, traditional WER/CER will remain foundational, but new evaluation models are emerging to address their limitations:
  • Semantic Error Rate (SER): Focuses on meaning rather than surface-level text (e.g., whether "the cat chased the mouse" and "the mouse was chased by the cat" are deemed equivalent)
  • Entity Error Rate: Measures accuracy of high-value terms (names, phone numbers, product SKUs, keywords); a minimal sketch follows this list
  • Task Success Rate: Evaluates how well transcriptions support downstream workflows (e.g., call-center ticket routing, caption accessibility)
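Entity Error Rate has no single agreed-upon formula yet. One simple reading, sketched below, is the fraction of reference entities that never appear in the hypothesis; a production system would use NER and fuzzy matching rather than the exact substring check used here:

```python
def entity_error_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Fraction of high-value reference terms missing from the hypothesis.
    A deliberately simple interpretation, for illustration only."""
    hyp = hypothesis.lower()
    missed = [e for e in reference_entities if e.lower() not in hyp]
    return len(missed) / len(reference_entities)


# "SKU-4471" is counted as missed because of the hyphen, which is exactly
# why real implementations need normalization and fuzzy matching.
print(entity_error_rate(["Acme Corp", "SKU-4471"],
                        "please ship sku 4471 to acme corp"))  # 0.5
```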
However, WER and CER will continue to be the industry-standard metrics for benchmarking audio transcription and comparing STT engines due to their simplicity and universality.

Conclusion

WER and CER are simple but powerful tools for evaluating Speech-to-Text systems. Whether you're building your own ASR engine, integrating a commercial API, or monitoring production transcriptions, these metrics provide a clear, objective way to measure accuracy and track improvements over time.
Understanding WER and CER is essential for anyone working with audio data, natural language processing, or AI-driven automation—they are the backbone of reliable STT system validation and optimization.

Try It Free Now

Try our AI audio and video service! It offers high-precision speech-to-text transcription, multilingual translation, and intelligent speaker diarization, along with automatic video subtitle generation, intelligent audio and video content editing, and synchronized audio-visual analysis. It covers scenarios such as meeting recordings, short-video creation, and podcast production. Start your free trial now!