
Whisper Medium on English YouTube Audio — March 30, 2026 Benchmark (WER, CER, RTF)
2026-03-30
Eric King
This post records one fixed-configuration run on English YouTube audio with Whisper medium. From result.json, the strict score is WER = 68.23% and accuracy = 31.77%, with a strongly deletion-heavy profile (D = 8,718, S = 131, I = 0). In plain terms, this looks less like isolated word confusions and more like a coverage mismatch between the reference captions and the generated transcript, so the output should be interpreted as a reproducible baseline rather than a standalone quality claim.

Video and reference text. The source is this YouTube video. The reference file (ref.vtt) comes from the caption track provided with that video, and model.vtt is the output from this Whisper run. That means the benchmark measures agreement with platform captions (useful in production workflows), not with a manually curated linguistic gold transcript.

1. Why This Benchmark Matters
Long-form YouTube audio is a practical stress case for ASR because it mixes natural pacing shifts, edits, names, and topic changes in a way short demos do not. If your downstream workflow is subtitle QA, search indexing, content repurposing, or draft summarization, this scenario reflects real operational constraints better than clean lab speech.
Using the platform caption track as the reference creates a realistic “what users already see vs what our ASR pipeline outputs” comparison. It is not perfect ground truth, but it is highly relevant for product teams who need consistency checks and repeatable tracking over time.
2. Testing Setup
Values below come directly from other.yaml and result.json in this case folder.

| Field | Value |
|---|---|
| Source | YouTube video |
| Date (processing window) | 2026-03-30 (processtime-at → completed-at) |
| Language | English |
| Whisper model | medium |
| Audio duration (YAML label) | 22:44 |
| Audio duration (scorer / YAML parsed) | 1,364 s (≈ 22.73 minutes) |
| STT processing time | 365 s |
| RTF | 0.2676 |
Wall-clock timestamps: 2026-03-30 19:49:57 → 2026-03-30 19:56:02, consistent with 365 seconds of processing.
3. Evaluation Methodology
The evaluation is produced by scripts/evaluate-vtt-metrics.js. The script reads ref.vtt and model.vtt, extracts plain cue text, normalizes tokens, then aligns reference and hypothesis with Levenshtein dynamic programming.

Word-level alignment

At word level, backtracking over the DP matrix yields substitutions (S), deletions (D), and insertions (I) against reference size N.
\[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
\]
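The DP-and-backtrack step described above can be sketched as follows. This is a minimal illustration, not the actual implementation in scripts/evaluate-vtt-metrics.js; the function name `alignWords` is an assumption.

```javascript
// Minimal sketch of word-level Levenshtein alignment (hypothetical helper
// name; the real logic lives in scripts/evaluate-vtt-metrics.js).
// Builds the DP matrix over tokens, then backtracks to count substitutions
// (S), deletions (D), and insertions (I) against reference size N.
function alignWords(refTokens, hypTokens) {
  const n = refTokens.length, m = hypTokens.length;
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = refTokens[i - 1] === hypTokens[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + cost, // match or substitution
        dp[i - 1][j] + 1,        // deletion (reference word missing from hypothesis)
        dp[i][j - 1] + 1         // insertion (extra hypothesis word)
      );
    }
  }
  // Backtrack to classify the edits.
  let S = 0, D = 0, I = 0, i = n, j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && dp[i][j] === dp[i - 1][j - 1] &&
        refTokens[i - 1] === hypTokens[j - 1]) { i--; j--; }          // match
    else if (i > 0 && j > 0 && dp[i][j] === dp[i - 1][j - 1] + 1) { S++; i--; j--; }
    else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) { D++; i--; }
    else { I++; j--; }
  }
  const wer = (S + D + I) / n;
  return { S, D, I, wer, accuracy: 1 - wer };
}
```

A dropped reference word shows up as a deletion, which is exactly the signal dominating this run.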
Character Error Rate (CER)
Whitespace is removed first. Character edit distance is then computed by Levenshtein at character level.
\[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}.
\]
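In code, the CER step reduces to a character-level edit distance after stripping whitespace. The sketch below is an assumption about shape, not the script's actual code; `charErrorRate` is a hypothetical name.

```javascript
// Minimal CER sketch (hypothetical helper name; the authoritative logic is
// in scripts/evaluate-vtt-metrics.js). Whitespace is removed first, then a
// character-level Levenshtein distance is divided by reference length.
function charErrorRate(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  // Two-row Levenshtein over characters to keep memory small.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  const distance = prev[hyp.length];
  return { distance, refChars: ref.length, cer: distance / ref.length };
}
```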
Real-Time Factor (RTF)
\[
\mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}.
\]
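Plugging in this run's numbers from the setup table: the wall-clock window 19:49:57 → 19:56:02 spans 365 seconds, matching the reported STT processing time, and RTF = 365 / 1364 ≈ 0.2676.

```javascript
// Arithmetic check on the values reported in other.yaml / result.json.
const start = new Date("2026-03-30T19:49:57");
const end = new Date("2026-03-30T19:56:02");
const processingSeconds = (end.getTime() - start.getTime()) / 1000; // 365
const audioSeconds = 1364;
const rtf = processingSeconds / audioSeconds; // ≈ 0.2676
```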
The script now outputs two scoring views:

- strictMetrics: default normalization (punctuation/case normalized, word-level strictness preserved)
- relaxedMetrics: additional normalization (quote removal, looser numeric formatting)
This dual reporting helps distinguish “formatting mismatch” from deeper lexical/coverage mismatch.
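A minimal sketch of what such a dual normalization might look like. The function names and exact rules below are assumptions for illustration; the authoritative rules live in scripts/evaluate-vtt-metrics.js (for brevity, the looser numeric formatting pass is noted but not implemented here).

```javascript
// Hypothetical normalizers mirroring the strict/relaxed split described above.
function normalizeStrict(text) {
  return text
    .toLowerCase()
    .replace(/[.,!?;:]/g, "") // drop common punctuation
    .replace(/\s+/g, " ")     // collapse whitespace
    .trim();
}

function normalizeRelaxed(text) {
  // Relaxed view: everything strict does, plus quote removal.
  // (The script also loosens numeric formatting; omitted in this sketch.)
  return normalizeStrict(text).replace(/["'“”‘’]/g, "");
}
```

Scoring the same VTT pair under both normalizers is what lets the report separate formatting noise from genuine lexical or coverage mismatch.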
4. Model Overview
Whisper medium is a general-purpose Whisper checkpoint often used when teams want a practical trade-off between speed and recognition quality on commodity hardware. It is commonly sufficient for draft transcription, indexing, and downstream NLP preprocessing, but still requires validation for verbatim publishing or compliance-sensitive use cases.
Only one configuration is evaluated here (model and language from other.yaml). No decoder hyperparameter sweep, no custom post-correction, and no domain lexicon boosting were applied in this run.

5. Results (From result.json)
Strict metrics (metrics / strictMetrics)

- Reference word count (N): 12,970
- Substitutions (S): 131
- Deletions (D): 8,718
- Insertions (I): 0
- WER: 0.6822667694680031
- Accuracy: 0.31773323053199687
- Reference character count: 51,745
- Character edit distance: 34,683
- CER: 0.6702676587109866
- Audio duration (seconds): 1,364
- STT processing time (seconds): 365
- RTF: 0.26759530791788855
- Eval script runtime (seconds): 149.07
Relaxed metrics (relaxedMetrics)

- WER: 0.682112567463377
- Accuracy: 0.317887432536623
- CER: 0.6700148518721175
- Character edit distance: 34,286
- Reference character count: 51,172
Rounded interpretation
- Strict WER ≈ 68.23%, Accuracy ≈ 31.77%, CER ≈ 67.03%
- Relaxed WER ≈ 68.21%, Accuracy ≈ 31.79%, CER ≈ 67.00%
- Difference between strict and relaxed is small, suggesting the mismatch is not mainly punctuation/formatting noise.
- RTF ≈ 0.268 (about 3.7× faster than real time)
6. Error Pattern Analysis
Two signals stand out immediately:
- Insertion = 0
- Deletion >> substitution (8,718 vs 131)
That pattern usually indicates that many reference words do not find aligned counterparts in the hypothesis. In practice, this can happen due to large coverage differences (different subtitle segmentation, truncated hypothesis, reference including non-speech captions, or timing-window mismatch), not only “wrongly recognized words.”
The strict/relaxed gap is tiny, which further supports this interpretation: normalization tweaks barely moved scores, so the dominant issue is likely alignment/coverage rather than punctuation or quote formatting.
7. Key Insights
- Speed: With RTF ≈ 0.268, processing is clearly faster than real time and usable for batch pipelines.
- Accuracy signal: ~68% WER is too high for quote-level publishing without review.
- Error profile: Deletion dominance points to coverage mismatch first; optimize pairing/segmentation checks before model tuning.
- Method robustness: Strict and relaxed metrics are almost identical, so the result is not driven by superficial formatting differences.
- Representativeness: ~22.7 minutes is a meaningful long-form sample, but still one clip and one configuration.
8. Best Model for This Scenario
Under the narrow scope “Whisper medium, this exact YouTube clip, this exact reference caption source,” the run serves as a transparent baseline. It gives a stable throughput anchor (RTF) and two consistent text-agreement views (strict/relaxed WER/CER) for future A/B comparisons.
It does not imply Whisper medium is universally best for English YouTube ASR; it simply defines a reproducible checkpoint for your own evaluation ladder.
9. Neutral Final Verdict
For drafting, rough indexing, and topic extraction, this setup may still be useful because throughput is practical and outputs are deterministic under the same script.
For verbatim publishing, legal/compliance records, or accessibility-critical subtitles, the current agreement level (about 31.8% accuracy) and deletion-heavy profile imply that manual correction or stronger setup changes are required.
Most importantly, keep the evaluation method fixed (scripts/evaluate-vtt-metrics.js) when iterating models. Consistent methodology is what makes improvements measurable.

Source Materials
Case folder name: {case-name} = 20260330.

- Original audio (video): https://www.youtube.com/watch?v=EatCzpKNTMs — reference subtitles are the caption track from this video (exported to ref.vtt).
- Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
- Model transcript (VTT): test-transcripts/{case-name}/model.vtt
- Run metadata: test-transcripts/{case-name}/other.yaml
- Precomputed evaluation metrics: test-transcripts/{case-name}/result.json

Evaluation script used: scripts/evaluate-vtt-metrics.js

For long transcripts, run Node with a higher heap limit when needed (for example: NODE_OPTIONS=--max-old-space-size=8192).