
Understanding Speech-to-Text Quality: WER and CER Explained
Eric King
Author
Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), has become a core capability in modern AI applications, powering voice assistants, call-center analytics, smart devices, automated captioning, and more.
As adoption grows across industries, one question often arises:
How do we measure the quality of Speech-to-Text output?
Two metrics dominate the field:
- WER (Word Error Rate)
- CER (Character Error Rate)
Despite their simplicity, these metrics directly influence how we evaluate models, compare engines, and monitor production performance. This article breaks down what they mean, when to use each, and how to interpret them in real-world scenarios.
What Is WER (Word Error Rate)?
WER is the most widely used metric for evaluating speech recognition in languages with clear word boundaries such as English, Spanish, German, or French.
It measures how many mistakes appear in the transcribed text compared to a reference transcript.
Formula
WER = (S + D + I) / N
Where:
- S – Substitutions (a word is replaced with an incorrect one)
- D – Deletions (a word from the reference is missing in the hypothesis)
- I – Insertions (an extra word is added in the hypothesis that isn't in the reference)
- N – Total number of words in the reference text
WER Thresholds for Interpretation
- 0% – perfect transcription
- 10–20% – acceptable for many industry applications
- 20–40% – typical for noisy environments or accented speech
- 40%+ – poor recognition quality
Example
Reference: "The quick brown fox jumps over the lazy dog"
Hypothesis: "The quick brown fox jump over lazy dog"
Errors:
- 1 Substitution ("jumps" → "jump")
- 1 Deletion ("the")
- 0 Insertions
Calculation:
WER = (1 + 1 + 0) / 9 = 2 / 9 ≈ 22.2%
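To make the arithmetic concrete, here is a minimal from-scratch sketch in Python (function names are illustrative; libraries such as jiwer provide the same computation out of the box). It computes the word-level Levenshtein distance between reference and hypothesis and divides by the number of reference words, reproducing the 22.2% above:

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token sequences (total S + D + I)."""
    # dp[i][j] = minimum edits turning ref_tokens[:i] into hyp_tokens[:j]
    dp = [[0] * (len(hyp_tokens) + 1) for _ in range(len(ref_tokens) + 1)]
    for i in range(len(ref_tokens) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(hyp_tokens) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(ref_tokens) + 1):
        for j in range(1, len(hyp_tokens) + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "The quick brown fox jump over lazy dog"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")  # WER = 22.2%
```

A fuller implementation would also backtrack through the table to report S, D, and I separately; the total distance alone is enough to compute the rate itself.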
What Is CER (Character Error Rate)?
CER evaluates transcription accuracy at the character level rather than the word level.
This metric is especially important for:
- Chinese, Japanese, Korean (languages without natural word spacing)
- OCR (image text recognition)
- Models requiring extremely fine-grained evaluation
Formula
CER = (S + D + I) / N_characters
Where the components (S, D, I) refer to character-level substitutions, deletions, and insertions, and N_characters is the total number of characters in the reference text.
Because it measures each character individually, CER can highlight errors that WER may hide, particularly in languages where a missing character changes the meaning completely.
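Reusing the edit_distance helper from the WER sketch above, a minimal CER computation looks the same, with characters instead of words as the unit of comparison (whether to count spaces is a project-level convention):

```python
def character_error_rate(reference, hypothesis):
    # Same Levenshtein alignment as WER, but over characters.
    ref_chars = list(reference)
    return edit_distance(ref_chars, list(hypothesis)) / len(ref_chars)


# One substituted character out of six (天气 "weather" misrecognized as 天汽).
print(f"CER = {character_error_rate('今天天气很好', '今天天汽很好'):.1%}")  # CER = 16.7%
```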
WER vs CER: When to Choose Which?
| Scenario | Recommended Metric | Why |
|---|---|---|
| English, Spanish, French, etc. | WER | Words are natural semantic units |
| Chinese / Japanese / Korean | CER | No spaces; characters carry core meaning |
| OCR text recognition | CER | Requires detailed character-level accuracy |
| Mixed-language content | Both | Provides complementary semantic and granular insights |
| Noisy, multi-speaker datasets | WER | Better reflects semantic errors that impact usability |
Why Evaluation Matters in Speech-to-Text
Modern STT systems, such as Whisper, Deepgram, Google ASR, or custom fine-tuned models, are increasingly accurate. But without consistent evaluation metrics, it becomes impossible to answer critical questions like:
- Which model performs best on my domain-specific data?
- Does transcription accuracy degrade over time in production?
- Did a new model update improve (or harm) transcription quality?
- How significant is the impact of background noise or accent variation?
WER and CER give teams an objective way to measure improvements and track production quality at scale.
Practical Tips for Using WER / CER
1. Always normalize text
Before calculating metrics, apply these preprocessing steps to avoid inflating error rates with trivial differences (a minimal code sketch follows the list):
- Case folding (convert all text to lowercase/uppercase)
- Punctuation removal
- Unicode normalization (standardize special characters)
- Consistent tokenization (align word/character boundaries)
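Here is one way to do this using only the Python standard library; the exact rules (keeping digits, apostrophes, or accents, for example) are a project-level decision rather than a fixed standard:

```python
import re
import string
import unicodedata


def normalize(text):
    text = unicodedata.normalize("NFKC", text)    # standardize Unicode variants
    text = text.lower()                           # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


print(normalize("  Héllo,   WORLD!  "))  # "héllo world"
```

Apply the same normalization to both the reference and the hypothesis before scoring; normalizing only one side inflates the error rate.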
2. Use segment-level evaluation
Instead of comparing entire paragraphs, evaluate accuracy by smaller units:
- Sentences
- Time-aligned audio segments
- Speaker turns
This approach pinpoints exactly where errors occur (e.g., noisy audio clips, fast speech) for targeted model optimization.
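As a minimal sketch (reusing the word_error_rate function from earlier, with illustrative segments), scoring each segment separately and sorting by WER surfaces the worst-performing clips:

```python
# Each pair is (reference, hypothesis) for one segment,
# e.g. a sentence, a time-aligned chunk, or a speaker turn.
segments = [
    ("turn left at the next intersection", "turn left at the next intersection"),
    ("the meeting starts at nine thirty", "the meeting starts at nine hundred thirty"),
    ("please confirm the order number", "please confirm order number"),
]

scored = [(word_error_rate(ref, hyp), ref, hyp) for ref, hyp in segments]
for wer, ref, hyp in sorted(scored, reverse=True):
    print(f"{wer:6.1%}  REF: {ref}")
    print(f"         HYP: {hyp}")
```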
3. Don't obsess over absolute numbers
A small numerical difference in WER/CER does not always translate into a real-world usability difference. For example:
- Model A: 7.1% WER
- Model B: 6.5% WER
The 0.6 percentage-point gap is often negligible in practice; always listen to sample outputs and assess semantic meaning before choosing a model. WER and CER are approximations, not full measures of meaning preservation.
The Future of Speech-to-Text Metrics
As LLM-driven STT systems become more capable, traditional WER and CER will remain foundational, but new evaluation approaches are emerging to address their limitations:
- Semantic Error Rate (SER): Focuses on meaning rather than surface-level text (e.g., whether "the cat chased the mouse" and "the mouse was chased by the cat" are deemed equivalent)
- Entity Error Rate: Measures accuracy of high-value terms (names, phone numbers, product SKUs, keywords); a toy sketch appears after this list
- Task Success Rate: Evaluates how well transcriptions support downstream workflows (e.g., call-center ticket routing, caption accessibility)
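None of these emerging metrics has a single canonical definition yet. Purely as an illustration, a naive entity-level check might count how many reference entities survive verbatim in the hypothesis; the entity list and exact-substring matching below are hypothetical simplifications (real systems typically combine NER with normalization and fuzzy matching):

```python
def entity_error_rate(reference_entities, hypothesis_text):
    """Naive illustration: fraction of reference entities missing from the hypothesis."""
    hyp = hypothesis_text.lower()
    missed = [e for e in reference_entities if e.lower() not in hyp]
    return len(missed) / len(reference_entities)


entities = ["Acme Corp", "SKU-4471", "Maria Lopez"]
hypothesis = "maria lopez asked about the acme corp order for sku 4471"
print(f"Entity error rate: {entity_error_rate(entities, hypothesis):.0%}")  # 33%
```

Even this toy example shows where the naive version breaks down: "SKU-4471" is counted as missed only because the hyphen became a space, which is exactly the kind of mismatch normalization and fuzzy matching are meant to absorb.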
However, WER and CER will continue to be the industry-standard metrics for benchmarking audio transcription and comparing STT engines due to their simplicity and universality.
Conclusion
WER and CER are simple but powerful tools for evaluating Speech-to-Text systems. Whether you're building your own ASR engine, integrating a commercial API, or monitoring production transcriptions, these metrics provide a clear, objective way to measure accuracy and track improvements over time.
Understanding WER and CER is essential for anyone working with audio data, natural language processing, or AI-driven automation; they are the backbone of reliable STT system validation and optimization.


