
Whisper Large v3 Turbo on English YouTube Audio — March 29, 2026 Benchmark (WER, CER, RTF)
2026-03-29
Eric King
This note captures one fixed-configuration run on roughly 18 minutes of English audio drawn from a YouTube upload. The scorer reports WER ≈ 67.6%, with deletions dominating: 6,449 deletions against 60 substitutions and 0 insertions. That profile suggests the hypothesis transcript covers the reference poorly under alignment, a pattern often seen when the reference is the platform caption track and the ASR output reflects a different segmentation or length. The figures should therefore be read as diagnostic, not as a polished “accuracy score” in isolation.
Video and reference text. The reference WebVTT (ref.vtt) is the caption text supplied with the source video (exported as WebVTT). The hypothesis (model.vtt) is Whisper large-v3-turbo on the same underlying audio. Metrics compare those captions to this ASR run, a practical baseline for “how close is our pipeline to what YouTube ships,” not a claim about human-verified ground truth.
1. Why This Benchmark Matters
YouTube-style speech is everywhere in real workflows: variable mic quality, music beds, cuts, and long monologues or dialogues. Evaluating ASR on actual platform captions as the reference answers a concrete question: if we run our own Whisper-based stack on the same audio, how far does the text drift from what viewers already see as subtitles? That is useful for caption QA, repurposing content, and search indexing—domains where “good enough” depends on the product, but the numbers must be reproducible.
2. Testing Setup
Values below come from other.yaml and result.json for this case (directory mode, so YAML metadata is attached to the scorer output).
| Field | Value |
|---|---|
| Source | YouTube video (audio aligned to that upload) |
| Date (processing window) | 2026-03-29 (processtime-at → completed-at in other.yaml) |
| Language | English |
| Whisper model | large-v3-turbo |
| Audio duration (YAML label) | 17:39 |
| Audio duration (scorer, from VTT) | 1059.88 s (≈ 17.7 minutes) |
| STT processing time | 175 s (sttProcessingTimeSeconds in result.json, from YAML timestamps) |
| RTF | 0.165 (from result.json) |
Wall-clock interval in YAML: 2026-03-29 16:04:37 → 2026-03-29 16:07:32 (consistent with 175 s processing time).
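As a sanity check, that interval can be recomputed directly from the two timestamps (a minimal sketch; the timestamps are copied from other.yaml for this run):

```javascript
// Recompute the wall-clock processing interval from the YAML timestamps.
// It should match the reported 175 s STT processing time.
const start = new Date("2026-03-29T16:04:37");
const end = new Date("2026-03-29T16:07:32");
const seconds = (end - start) / 1000;
console.log(seconds); // 175
```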
3. Evaluation Methodology
Reference and hypothesis are WebVTT files. Cue text is extracted, then normalized (case, punctuation, light cleanup) before scoring.
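A minimal sketch of that extraction-and-normalization step (illustrative only; the exact rules live in scripts/evaluate-vtt-metrics.js and may differ in detail):

```javascript
// Pull cue text out of a WebVTT string: drop the header, cue numbers,
// blank lines, and timing lines; keep only the payload text.
function extractCueText(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&
      line.trim() !== "WEBVTT" &&
      !/^\d+$/.test(line.trim()) &&   // cue sequence numbers
      !line.includes("-->"))          // timing lines
    .join(" ");
}

// Light normalization: lowercase, strip inline tags and punctuation,
// collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/<[^>]+>/g, " ")              // inline tags like <i>…</i>
    .replace(/[^\p{L}\p{N}\s']/gu, " ")    // punctuation (keep apostrophes)
    .replace(/\s+/g, " ")
    .trim();
}

const sample = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nHello, <i>world</i>!";
console.log(normalize(extractCueText(sample))); // "hello world"
```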
Word-level alignment
Token sequences are aligned with a Levenshtein-style dynamic program; backtracking yields substitutions (S), deletions (D), and insertions (I) versus reference length N.
[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
]
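The alignment described above can be sketched as follows (an illustrative dynamic program, not the repo scorer; the tie-breaking order during backtracking is one possible choice):

```javascript
// Word-level Levenshtein DP over token arrays, with backtracking to
// classify edits into substitutions (S), deletions (D), insertions (I).
function alignCounts(ref, hyp) {
  const m = ref.length, n = hyp.length;
  const d = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      d[i][j] = Math.min(
        d[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1), // match/sub
        d[i - 1][j] + 1,                                       // deletion
        d[i][j - 1] + 1);                                      // insertion
  // Backtrack to count each edit type.
  let i = m, j = n, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        d[i][j] === d[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)) {
      if (ref[i - 1] !== hyp[j - 1]) S++;
      i--; j--;
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) { D++; i--; }
    else { I++; j--; }
  }
  return { S, D, I, WER: (S + D + I) / m };
}

const ref = "the quick brown fox jumps".split(" ");
const hyp = "the quick fox jumped".split(" ");
console.log(alignCounts(ref, hyp)); // { S: 1, D: 1, I: 0, WER: 0.4 }
```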
Character Error Rate (CER)
Whitespace is stripped; character edit distance is Levenshtein distance at character level.
[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}.
]
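The character-level counterpart, under the stated normalization (whitespace stripped, plain Levenshtein distance over the remaining characters), might look like this sketch:

```javascript
// Character edit distance with a rolling two-row DP (memory-friendly
// for long transcripts).
function charEditDistance(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const cur = [i];
    for (let j = 1; j <= n; j++)
      cur[j] = Math.min(
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // match/sub
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1);                               // insertion
    prev = cur;
  }
  return prev[n];
}

// CER = character edit distance / reference character count (no spaces).
function cer(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  return charEditDistance(ref, hyp) / ref.length;
}

console.log(cer("hello world", "helo world")); // 0.1 (1 edit / 10 chars)
```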
Real-Time Factor (RTF)
[
\mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}.
]
RTF below 1 means decoding faster than real time on this run.
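Plugging in this run's numbers from result.json:

```javascript
// 175 s of STT processing over 1059.88 s of audio.
const rtf = 175 / 1059.881;
console.log(rtf.toFixed(3)); // "0.165" — about 6x faster than real time
```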
4. Model Overview
Whisper large-v3-turbo targets strong quality with improved throughput relative to heavier “large” variants (behavior depends on implementation and hardware). It is a general-purpose multilingual ASR model, suited to draft transcripts, search, and caption drafts where verbatim perfection is not assumed. This benchmark uses one decode configuration recorded in other.yaml; it does not sweep decoding options, VAD, or post-processing.
5. Results (From result.json)
Exact metrics:
- Reference word count (N): 9627
- Substitutions (S): 60
- Deletions (D): 6449
- Insertions (I): 0
- WER: 0.6761192479484782
- Accuracy: 0.3238807520515218
- Reference character count: 38334
- Character edit distance: 25696
- CER: 0.6703187770647467
- Audio duration (seconds): 1059.8809999999999
- STT processing time (seconds): 175
- RTF: 0.16511287587946197
- Eval script runtime (seconds): 25.612
Rounded for reading:
- WER ≈ 67.6%; accuracy ≈ 32.4%
- CER ≈ 67.0%
- ~25.7k character edits on ~38.3k reference characters
- RTF ≈ 0.165× (about 6× faster than real time)
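The rounded figures can be cross-checked directly from the raw counts reported in result.json:

```javascript
// Recompute WER, accuracy, and CER from the exact counts above.
const N = 9627, S = 60, D = 6449, I = 0;
const wer = (S + D + I) / N;
const accuracy = 1 - wer;
const cerRate = 25696 / 38334; // character edits / reference characters
console.log(wer.toFixed(4), accuracy.toFixed(4), cerRate.toFixed(4));
// 0.6761 0.3239 0.6703
```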
6. Error Pattern Analysis
Insertions are zero and deletions dwarf substitutions (6449 vs 60). That is not the usual “noisy ASR with extra filler words” profile; it points to large spans of reference text not matched by the hypothesis under this alignment—consistent with length mismatch, different segmentation, or reference spanning more content than the ASR saw (e.g., caption file vs. audio segment). CER ≈ 67% reinforces that the gap is broad, not a handful of word swaps.
For product teams: do not interpret this as “Whisper misheard 68% of words” in the colloquial sense until you confirm same audio window, same language, and comparable text normalization between caption export and model output.
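One cheap triage step is to compare raw word counts of the two caption files before tuning the model (illustrative helper operating on VTT strings; in practice read test-transcripts/20260329/ref.vtt and model.vtt with fs.readFileSync and compare the counts):

```javascript
// Count payload words in a WebVTT string, skipping header, cue numbers,
// and timing lines. A large ref/hyp gap is consistent with the
// deletion-heavy profile described above.
function wordCount(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(l => l.trim() &&
                 l.trim() !== "WEBVTT" &&
                 !l.includes("-->") &&
                 !/^\d+$/.test(l.trim()))
    .join(" ")
    .split(/\s+/)
    .filter(Boolean).length;
}

// Toy example: a 3-word reference cue vs. a 1-word hypothesis cue.
const refVtt = "WEBVTT\n\n00:00.000 --> 00:02.000\nhello there world";
const hypVtt = "WEBVTT\n\n00:00.000 --> 00:02.000\nhello";
console.log(wordCount(refVtt), wordCount(hypVtt)); // 3 1
```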
7. Key Insights
- Speed: RTF ≈ 0.17 is attractive for batch processing long clips.
- Accuracy: ~68% WER is not publication-ready without review if you need faithful quotes.
- Error shape: Deletion-heavy, zero insertions—audit pairing and coverage before tuning model knobs.
- Scenario realism: ~18 minutes of continuous English from a real YouTube source is more representative than toy clips, but still one video and one model setting.
- Reference choice: Using platform captions anchors the test to a viewer-visible baseline, which may differ from a human re-transcribe.
8. Best Model for This Scenario
Under the narrow scope “large-v3-turbo on this clip, with YouTube captions as reference,” the run is a documented baseline: it fixes throughput (RTF) and quantitative mismatch (WER/CER) for later comparison. It is not a claim that this is the best model for all English YouTube content.
9. Neutral Final Verdict
For internal drafts, topic tagging, or rough search, low RTF may make this stack usable if stakeholders accept error rates and validate critical passages.
For verbatim quoting, compliance, or accessibility-critical subtitles, ~32% word accuracy and deletion-heavy errors mean human review or alignment fixes remain mandatory. Rerun the scorer after any change to inputs; methodology stays comparable.
Source Materials
Case folder name: {case-name} = 20260329 (mirror under test-transcripts/ in the repo when you publish assets).
- Original video (audio source): https://www.youtube.com/watch?v=E73XCmLAFe8 (the reference subtitles are the captions provided with this video, exported as ref.vtt)
- Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
- Model transcript (VTT): test-transcripts/{case-name}/model.vtt
- Run metadata: test-transcripts/{case-name}/other.yaml
- Precomputed evaluation metrics: test-transcripts/{case-name}/result.json
Scoring uses scripts/evaluate-vtt-metrics.js in this repository. For long transcripts, run Node with a raised heap limit if needed (e.g. NODE_OPTIONS=--max-old-space-size=8192).