Understanding Speech-to-Text Quality: WER and CER Explained

Eric King

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), has become a core capability in modern AI applications—powering voice assistants, call-center analytics, smart devices, automated captioning, and more.
As adoption grows across industries, one question often arises:
How do we measure the quality of Speech-to-Text output?
Two metrics dominate the field:
  • WER (Word Error Rate)
  • CER (Character Error Rate)
Despite their simplicity, these metrics directly influence how we evaluate models, compare engines, and monitor production performance. This article breaks down what they mean, when to use each, and how to interpret them in real-world scenarios.

What Is WER (Word Error Rate)?

WER is the most widely used metric for evaluating speech recognition in languages with clear word boundaries such as English, Spanish, German, or French.
It measures how many mistakes appear in the transcribed text compared to a reference transcript.

Formula

WER = (S + D + I) / N
Where:
  • S — Substitutions (a word is replaced with an incorrect one)
  • D — Deletions (a word from the reference is missing in the hypothesis)
  • I — Insertions (an extra word is added in the hypothesis that isn't in the reference)
  • N — Total number of words in the reference text
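In practice, S, D, and I come from the minimum-cost alignment between the reference and the hypothesis, which is exactly the word-level Levenshtein (edit) distance. Below is a minimal, self-contained Python sketch of that computation; the function names are illustrative rather than taken from any particular library:

```python
def edit_distance(ref_tokens: list[str], hyp_tokens: list[str]) -> int:
    """Minimum total substitutions, deletions, and insertions (S + D + I)
    needed to turn hyp_tokens into ref_tokens (word-level Levenshtein)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    # dp[i][j] = edit distance between ref_tokens[:i] and hyp_tokens[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[m][n]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, where N is the reference word count."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```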

WER Thresholds for Interpretation

  • 0% → perfect transcription
  • 10–20% → acceptable for many industrial tasks
  • 20–40% → typical for noisy environments or accented speech
  • 40%+ → poor recognition quality

Example

Reference: "The quick brown fox jumps over the lazy dog"
Hypothesis: "The quick brown fox jump over lazy dog"
Errors:
  • Substitution ("jumps" → "jump")
  • Deletion (the second "the", before "lazy", is missing)
  • 0 Insertions
Calculation:
WER = (1 + 1 + 0) / 9 = 22.2%
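Running the word_error_rate sketch from above on this example (with both strings lowercased so case is not counted as an error) reproduces the same figure:

```python
reference = "the quick brown fox jumps over the lazy dog"   # N = 9 words
hypothesis = "the quick brown fox jump over lazy dog"

# 1 substitution ("jumps" -> "jump") + 1 deletion ("the") = 2 errors
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```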

What Is CER (Character Error Rate)?

CER evaluates transcription accuracy at the character level rather than the word level.
This metric is especially important for:
  • Chinese, Japanese, Korean (languages without natural word spacing)
  • OCR (image text recognition)
  • Models requiring extremely fine-grained evaluation

Formula

CER = (S + D + I) / N_characters
Where the components (S, D, I) refer to character-level substitutions, deletions, and insertions, and N_characters is the total number of characters in the reference text.
Because it measures each character individually, CER can highlight errors that WER may hide—particularly in languages where a missing character changes the meaning completely.
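A matching character-level sketch, reusing the edit_distance helper from the WER example above. Whether spaces count as characters is a convention that varies between toolkits; this sketch drops them, which is one common choice:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (S + D + I) / N_characters, using the edit_distance function
    from the WER sketch, applied to characters instead of words."""
    # Dropping spaces is a convention, not a universal rule.
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted character out of seven: CER = 1/7
print(f"CER: {character_error_rate('今日は晴れです', '今日は腫れです'):.1%}")  # CER: 14.3%
```

Here a single substituted character turns 晴れ ("sunny") into 腫れ ("swelling"): exactly the kind of meaning-changing error the paragraph above describes.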

WER vs CER: When to Choose Which?

| Scenario | Recommended Metric | Why |
| --- | --- | --- |
| English, Spanish, French, etc. | WER | Words are natural semantic units |
| Chinese / Japanese / Korean | CER | No spaces; characters carry core meaning |
| OCR text recognition | CER | Requires detailed character-level accuracy |
| Mixed-language content | Both | Provides complementary semantic and granular insights |
| Noisy, multi-speaker datasets | WER | Better reflects semantic errors that impact usability |

Why Evaluation Matters in Speech-to-Text

Modern STT systems—such as Whisper, Deepgram, Google ASR, or custom fine-tuned models—are increasingly accurate. But without consistent evaluation metrics, it becomes impossible to answer critical questions like:
  • Which model performs best on my domain-specific data?
  • Does transcription accuracy degrade over time in production?
  • Did a new model update improve (or harm) transcription quality?
  • How significant is the impact of background noise or accent variation?
WER and CER give teams an objective way to measure improvements and track production quality at scale.

Practical Tips for Using WER / CER

1. Always normalize text

Before calculating metrics, apply these preprocessing steps to avoid inflating error rates with trivial differences:
  • Case folding (convert all text to lowercase/uppercase)
  • Punctuation removal
  • Unicode normalization (standardize special characters)
  • Consistent tokenization (align word/character boundaries)
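A minimal normalization sketch covering these four steps; NFKC and the regular expressions below are common choices rather than a universal standard:

```python
import re
import unicodedata


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = text.lower()                         # case folding
    text = re.sub(r"[^\w\s]", "", text)         # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()    # consistent whitespace/tokenization
    return text


print(normalize("The QUICK, brown fox!"))  # -> "the quick brown fox"
```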

2. Use segment-level evaluation

Instead of comparing entire paragraphs, evaluate accuracy by smaller units:
  • Sentences
  • Time-aligned audio segments
  • Speaker turns
This approach pinpoints exactly where errors occur (e.g., noisy audio clips, fast speech) for targeted model optimization.
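A short sketch of what segment-level scoring can look like, assuming the word_error_rate function from earlier and a hypothetical list of time-aligned segments (the id/ref/hyp field names are illustrative):

```python
# Hypothetical segments, e.g. one per speaker turn or audio chunk.
segments = [
    {"id": "utt-001", "ref": "turn left at the next junction",
     "hyp": "turn left at the next junction"},
    {"id": "utt-002", "ref": "the meeting starts at nine",
     "hyp": "the meeting starts at night"},
]

for seg in segments:
    wer = word_error_rate(seg["ref"], seg["hyp"])  # sketch defined earlier
    flag = "  <- review" if wer > 0.15 else ""
    print(f'{seg["id"]}: WER {wer:.1%}{flag}')

# utt-001: WER 0.0%
# utt-002: WER 20.0%  <- review
```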

3. Don't obsess over absolute numbers

A small numerical difference in WER/CER does not always translate into a noticeable difference in real-world usability. For example:
  • Model A: 7.1% WER
  • Model B: 6.5% WER
The 0.6-percentage-point gap is negligible in practice; always listen to sample outputs and assess semantic meaning before choosing a model. WER/CER are approximations, not full measures of meaning preservation.

The Future of Speech-to-Text Metrics

As LLM-driven STT systems become more capable, traditional WER/CER will remain foundational, but new evaluation models are emerging to address their limitations:
  • Semantic Error Rate (SER): Focuses on meaning rather than surface-level text (e.g., whether "the cat chased the mouse" and "the mouse was chased by the cat" are deemed equivalent)
  • Entity Error Rate: Measures accuracy of high-value terms (names, phone numbers, product SKUs, keywords); a minimal sketch follows this list
  • Task Success Rate: Evaluates how well transcriptions support downstream workflows (e.g., call-center ticket routing, caption accessibility)
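Entity Error Rate has no single agreed-upon formula yet. One simple reading, sketched below, is the fraction of reference entities that never appear in the hypothesis; a production system would use NER and fuzzy matching rather than the exact substring check used here:

```python
def entity_error_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Fraction of high-value reference terms missing from the hypothesis.
    A deliberately simple interpretation, for illustration only."""
    hyp = hypothesis.lower()
    missed = [e for e in reference_entities if e.lower() not in hyp]
    return len(missed) / len(reference_entities)


# "SKU-4471" is counted as missed because of the hyphen, which is exactly
# why real implementations need normalization and fuzzy matching.
print(entity_error_rate(["Acme Corp", "SKU-4471"],
                        "please ship sku 4471 to acme corp"))  # 0.5
```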
However, WER and CER will continue to be the industry-standard metrics for benchmarking audio transcription and comparing STT engines due to their simplicity and universality.

Conclusion

WER and CER are simple but powerful tools for evaluating Speech-to-Text systems. Whether you're building your own ASR engine, integrating a commercial API, or monitoring production transcriptions, these metrics provide a clear, objective way to measure accuracy and track improvements over time.
Understanding WER and CER is essential for anyone working with audio data, natural language processing, or AI-driven automation—they are the backbone of reliable STT system validation and optimization.

Try It Free Now

Try our AI audio and video service! It offers high-precision speech-to-text transcription, multilingual translation, and intelligent speaker diarization, along with automatic video subtitle generation, intelligent audio and video content editing, and synchronized audio-visual analysis. It covers scenarios such as meeting recordings, short-video creation, and podcast production. Start your free trial now!