
Whisper Medium on English YouTube Audio — March 30, 2026 Benchmark (WER, CER, RTF)
2026-03-30
Eric King
This post records one fixed-configuration run on English YouTube audio with Whisper medium. From result.json, the strict score is WER = 68.23% and accuracy = 31.77%, with a strongly deletion-heavy profile (D = 8,718, S = 131, I = 0). In plain terms, this looks less like isolated word confusions and more like a coverage mismatch between the reference captions and the generated transcript, so the output should be interpreted as a reproducible baseline rather than a standalone quality claim.

Video and reference text. The source is this YouTube video. The reference file (ref.vtt) comes from the caption track provided with that video, and model.vtt is the output from this Whisper run. That means the benchmark measures agreement with platform captions (useful in production workflows), not with a manually curated linguistic gold transcript.

1. Why This Benchmark Matters
Long-form YouTube audio is a practical stress case for ASR because it mixes natural pacing shifts, edits, names, and topic changes in a way short demos do not. If your downstream workflow is subtitle QA, search indexing, content repurposing, or draft summarization, this scenario reflects real operational constraints better than clean lab speech.
Using the platform caption track as the reference creates a realistic “what users already see vs what our ASR pipeline outputs” comparison. It is not perfect ground truth, but it is highly relevant for product teams who need consistency checks and repeatable tracking over time.
2. Testing Setup
Values below come directly from other.yaml and result.json in this case folder.

| Field | Value |
|---|---|
| Source | YouTube video |
| Date (processing window) | 2026-03-30 (processtime-at → completed-at) |
| Language | English |
| Whisper model | medium |
| Audio duration (YAML label) | 22:44 |
| Audio duration (scorer / YAML parsed) | 1,364 s (≈ 22.73 minutes) |
| STT processing time | 365 s |
| RTF | 0.2676 |
Wall-clock timestamps: 2026-03-30 19:49:57 → 2026-03-30 19:56:02, consistent with 365 seconds of processing.
3. Evaluation Methodology
The evaluation is produced by scripts/evaluate-vtt-metrics.js. The script reads ref.vtt and model.vtt, extracts plain cue text, normalizes tokens, then aligns reference and hypothesis with Levenshtein dynamic programming.

Word-level alignment

At word level, backtracking over the DP matrix yields substitutions (S), deletions (D), and insertions (I) against reference size N.
\[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
\]
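The DP-and-backtrack step described above can be sketched as follows. This is a minimal illustration, not the actual implementation in scripts/evaluate-vtt-metrics.js; the function name `alignWords` is an assumption.

```javascript
// Minimal sketch of word-level Levenshtein alignment (hypothetical helper
// name; the real logic lives in scripts/evaluate-vtt-metrics.js).
// Builds the DP matrix over tokens, then backtracks to count substitutions
// (S), deletions (D), and insertions (I) against reference size N.
function alignWords(refTokens, hypTokens) {
  const n = refTokens.length, m = hypTokens.length;
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = refTokens[i - 1] === hypTokens[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + cost, // match or substitution
        dp[i - 1][j] + 1,        // deletion (reference word missing from hypothesis)
        dp[i][j - 1] + 1         // insertion (extra hypothesis word)
      );
    }
  }
  // Backtrack to classify the edits.
  let S = 0, D = 0, I = 0, i = n, j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && dp[i][j] === dp[i - 1][j - 1] &&
        refTokens[i - 1] === hypTokens[j - 1]) { i--; j--; }          // match
    else if (i > 0 && j > 0 && dp[i][j] === dp[i - 1][j - 1] + 1) { S++; i--; j--; }
    else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) { D++; i--; }
    else { I++; j--; }
  }
  const wer = (S + D + I) / n;
  return { S, D, I, wer, accuracy: 1 - wer };
}
```

A dropped reference word shows up as a deletion, which is exactly the signal dominating this run.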
Character Error Rate (CER)
Whitespace is removed first. Character edit distance is then computed by Levenshtein at character level.
\[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}.
\]
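In code, the CER step reduces to a character-level edit distance after stripping whitespace. The sketch below is an assumption about shape, not the script's actual code; `charErrorRate` is a hypothetical name.

```javascript
// Minimal CER sketch (hypothetical helper name; the authoritative logic is
// in scripts/evaluate-vtt-metrics.js). Whitespace is removed first, then a
// character-level Levenshtein distance is divided by reference length.
function charErrorRate(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  // Two-row Levenshtein over characters to keep memory small.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  const distance = prev[hyp.length];
  return { distance, refChars: ref.length, cer: distance / ref.length };
}
```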
Real-Time Factor (RTF)
\[
\mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}.
\]
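Plugging in this run's numbers from the setup table: the wall-clock window 19:49:57 → 19:56:02 spans 365 seconds, matching the reported STT processing time, and RTF = 365 / 1364 ≈ 0.2676.

```javascript
// Arithmetic check on the values reported in other.yaml / result.json.
const start = new Date("2026-03-30T19:49:57");
const end = new Date("2026-03-30T19:56:02");
const processingSeconds = (end.getTime() - start.getTime()) / 1000; // 365
const audioSeconds = 1364;
const rtf = processingSeconds / audioSeconds; // ≈ 0.2676
```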
The script now outputs two scoring views:

- strictMetrics: default normalization (punctuation/case normalized, word-level strictness preserved)
- relaxedMetrics: additional normalization (quote removal, looser numeric formatting)
This dual reporting helps distinguish “formatting mismatch” from deeper lexical/coverage mismatch.
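A minimal sketch of what such a dual normalization might look like. The function names and exact rules below are assumptions for illustration; the authoritative rules live in scripts/evaluate-vtt-metrics.js (for brevity, the looser numeric formatting pass is noted but not implemented here).

```javascript
// Hypothetical normalizers mirroring the strict/relaxed split described above.
function normalizeStrict(text) {
  return text
    .toLowerCase()
    .replace(/[.,!?;:]/g, "") // drop common punctuation
    .replace(/\s+/g, " ")     // collapse whitespace
    .trim();
}

function normalizeRelaxed(text) {
  // Relaxed view: everything strict does, plus quote removal.
  // (The script also loosens numeric formatting; omitted in this sketch.)
  return normalizeStrict(text).replace(/["'“”‘’]/g, "");
}
```

Scoring the same VTT pair under both normalizers is what lets the report separate formatting noise from genuine lexical or coverage mismatch.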
4. Model Overview
Whisper medium is a general-purpose Whisper checkpoint often used when teams want a practical trade-off between speed and recognition quality on commodity hardware. It is commonly sufficient for draft transcription, indexing, and downstream NLP preprocessing, but still requires validation for verbatim publishing or compliance-sensitive use cases.
Only one configuration is evaluated here (model and language from other.yaml). No decoder hyperparameter sweep, no custom post-correction, and no domain lexicon boosting were applied in this run.

5. Results (From result.json)
Strict metrics (metrics / strictMetrics)

- Reference word count (N): 12,970
- Substitutions (S): 131
- Deletions (D): 8,718
- Insertions (I): 0
- WER: 0.6822667694680031
- Accuracy: 0.31773323053199687
- Reference character count: 51,745
- Character edit distance: 34,683
- CER: 0.6702676587109866
- Audio duration (seconds): 1,364
- STT processing time (seconds): 365
- RTF: 0.26759530791788855
- Eval script runtime (seconds): 149.07
Relaxed metrics (relaxedMetrics)

- WER: 0.682112567463377
- Accuracy: 0.317887432536623
- CER: 0.6700148518721175
- Character edit distance: 34,286
- Reference character count: 51,172
Rounded interpretation
- Strict WER ≈ 68.23%, Accuracy ≈ 31.77%, CER ≈ 67.03%
- Relaxed WER ≈ 68.21%, Accuracy ≈ 31.79%, CER ≈ 67.00%
- Difference between strict and relaxed is small, suggesting the mismatch is not mainly punctuation/formatting noise.
- RTF ≈ 0.268 (about 3.7× faster than real time)
6. Error Pattern Analysis
Two signals stand out immediately:
- Insertion = 0
- Deletion >> substitution (8,718 vs 131)
That pattern usually indicates that many reference words do not find aligned counterparts in the hypothesis. In practice, this can happen due to large coverage differences (different subtitle segmentation, truncated hypothesis, reference including non-speech captions, or timing-window mismatch), not only “wrongly recognized words.”
The strict/relaxed gap is tiny, which further supports this interpretation: normalization tweaks barely moved scores, so the dominant issue is likely alignment/coverage rather than punctuation or quote formatting.
7. Key Insights
- Speed: With RTF ≈ 0.268, processing is clearly faster than real time and usable for batch pipelines.
- Accuracy signal: ~68% WER is too high for quote-level publishing without review.
- Error profile: Deletion dominance points to coverage mismatch first; optimize pairing/segmentation checks before model tuning.
- Method robustness: Strict and relaxed metrics are almost identical, so the result is not driven by superficial formatting differences.
- Representativeness: ~22.7 minutes is a meaningful long-form sample, but still one clip and one configuration.
8. Best Model for This Scenario
Under the narrow scope “Whisper medium, this exact YouTube clip, this exact reference caption source,” the run serves as a transparent baseline. It gives a stable throughput anchor (RTF) and two consistent text-agreement views (strict/relaxed WER/CER) for future A/B comparisons.
It does not imply Whisper medium is universally best for English YouTube ASR; it simply defines a reproducible checkpoint for your own evaluation ladder.
9. Neutral Final Verdict
For drafting, rough indexing, and topic extraction, this setup may still be useful because throughput is practical and outputs are deterministic under the same script.
For verbatim publishing, legal/compliance records, or accessibility-critical subtitles, the current agreement level (about 31.8% accuracy) and deletion-heavy profile imply that manual correction or stronger setup changes are required.
Most importantly, keep the evaluation method fixed (scripts/evaluate-vtt-metrics.js) when iterating models. Consistent methodology is what makes improvements measurable.

Source Materials
Case folder name: {case-name} = 20260330.

- Original audio (video): https://www.youtube.com/watch?v=EatCzpKNTMs — reference subtitles are the caption track from this video (exported to ref.vtt).
- Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
- Model transcript (VTT): test-transcripts/{case-name}/model.vtt
- Run metadata: test-transcripts/{case-name}/other.yaml
- Precomputed evaluation metrics: test-transcripts/{case-name}/result.json

Evaluation script used: scripts/evaluate-vtt-metrics.js

For long transcripts, run Node with a higher heap limit when needed (for example: NODE_OPTIONS=--max-old-space-size=8192).