
Whisper Large v3 Turbo on English YouTube Audio — March 29, 2026 Benchmark (WER, CER, RTF)
2026-03-29
Eric King
This note captures one fixed-configuration run on roughly 18 minutes of English audio drawn from a YouTube upload. The scorer reports WER ≈ 67.6%, with deletions dominating: 6,449 deletions against 60 substitutions and 0 insertions. That profile suggests the hypothesis transcript covers the reference poorly under alignment, a pattern often seen when the reference is the platform caption track and the ASR output reflects a different segmentation or length. The figures should therefore be read as diagnostic, not as a polished “accuracy score” in isolation.
Video and reference text. The reference WebVTT (ref.vtt) is the caption text supplied with the source video (exported as WebVTT). The hypothesis (model.vtt) is Whisper large-v3-turbo on the same underlying audio. Metrics compare those captions to this ASR run, a practical baseline for “how close is our pipeline to what YouTube ships,” not a claim about human-verified ground truth.
1. Why This Benchmark Matters
YouTube-style speech is everywhere in real workflows: variable mic quality, music beds, cuts, and long monologues or dialogues. Evaluating ASR on actual platform captions as the reference answers a concrete question: if we run our own Whisper-based stack on the same audio, how far does the text drift from what viewers already see as subtitles? That is useful for caption QA, repurposing content, and search indexing—domains where “good enough” depends on the product, but the numbers must be reproducible.
2. Testing Setup
Values below come from other.yaml and result.json for this case (directory mode, so YAML metadata is attached to the scorer output).
| Field | Value |
|---|---|
| Source | YouTube video (audio aligned to that upload) |
| Date (processing window) | 2026-03-29 (processtime-at → completed-at in other.yaml) |
| Language | English |
| Whisper model | large-v3-turbo |
| Audio duration (YAML label) | 17:39 |
| Audio duration (scorer, from VTT) | 1059.88 s (≈ 17.7 minutes) |
| STT processing time | 175 s (sttProcessingTimeSeconds in result.json, from YAML timestamps) |
| RTF | 0.165 (from result.json) |
Wall-clock interval in YAML: 2026-03-29 16:04:37 → 2026-03-29 16:07:32 (consistent with 175 s processing time).
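As a sanity check, that interval can be recomputed directly from the two timestamps (a minimal sketch; the timestamps are copied from other.yaml for this run):

```javascript
// Recompute the wall-clock processing interval from the YAML timestamps.
// It should match the reported 175 s STT processing time.
const start = new Date("2026-03-29T16:04:37");
const end = new Date("2026-03-29T16:07:32");
const seconds = (end - start) / 1000;
console.log(seconds); // 175
```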
3. Evaluation Methodology
Reference and hypothesis are WebVTT files. Cue text is extracted, then normalized (case, punctuation, light cleanup) before scoring.
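A minimal sketch of that extraction-and-normalization step (illustrative only; the exact rules live in scripts/evaluate-vtt-metrics.js and may differ in detail):

```javascript
// Pull cue text out of a WebVTT string: drop the header, cue numbers,
// blank lines, and timing lines; keep only the payload text.
function extractCueText(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&
      line.trim() !== "WEBVTT" &&
      !/^\d+$/.test(line.trim()) &&   // cue sequence numbers
      !line.includes("-->"))          // timing lines
    .join(" ");
}

// Light normalization: lowercase, strip inline tags and punctuation,
// collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/<[^>]+>/g, " ")              // inline tags like <i>…</i>
    .replace(/[^\p{L}\p{N}\s']/gu, " ")    // punctuation (keep apostrophes)
    .replace(/\s+/g, " ")
    .trim();
}

const sample = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nHello, <i>world</i>!";
console.log(normalize(extractCueText(sample))); // "hello world"
```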
Word-level alignment
Token sequences are aligned with a Levenshtein-style dynamic program; backtracking yields substitutions (S), deletions (D), and insertions (I) versus reference length N.
[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
]
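The alignment described above can be sketched as follows (an illustrative dynamic program, not the repo scorer; the tie-breaking order during backtracking is one possible choice):

```javascript
// Word-level Levenshtein DP over token arrays, with backtracking to
// classify edits into substitutions (S), deletions (D), insertions (I).
function alignCounts(ref, hyp) {
  const m = ref.length, n = hyp.length;
  const d = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      d[i][j] = Math.min(
        d[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1), // match/sub
        d[i - 1][j] + 1,                                       // deletion
        d[i][j - 1] + 1);                                      // insertion
  // Backtrack to count each edit type.
  let i = m, j = n, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        d[i][j] === d[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)) {
      if (ref[i - 1] !== hyp[j - 1]) S++;
      i--; j--;
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) { D++; i--; }
    else { I++; j--; }
  }
  return { S, D, I, WER: (S + D + I) / m };
}

const ref = "the quick brown fox jumps".split(" ");
const hyp = "the quick fox jumped".split(" ");
console.log(alignCounts(ref, hyp)); // { S: 1, D: 1, I: 0, WER: 0.4 }
```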
Character Error Rate (CER)
Whitespace is stripped; character edit distance is Levenshtein distance at character level.
[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}.
]
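The character-level counterpart, under the stated normalization (whitespace stripped, plain Levenshtein distance over the remaining characters), might look like this sketch:

```javascript
// Character edit distance with a rolling two-row DP (memory-friendly
// for long transcripts).
function charEditDistance(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const cur = [i];
    for (let j = 1; j <= n; j++)
      cur[j] = Math.min(
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // match/sub
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1);                               // insertion
    prev = cur;
  }
  return prev[n];
}

// CER = character edit distance / reference character count (no spaces).
function cer(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  return charEditDistance(ref, hyp) / ref.length;
}

console.log(cer("hello world", "helo world")); // 0.1 (1 edit / 10 chars)
```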
Real-Time Factor (RTF)
[
\mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}.
]
RTF below 1 means decoding faster than real time on this run.
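Plugging in this run's numbers from result.json:

```javascript
// 175 s of STT processing over 1059.88 s of audio.
const rtf = 175 / 1059.881;
console.log(rtf.toFixed(3)); // "0.165" — about 6x faster than real time
```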
4. Model Overview
Whisper large-v3-turbo targets strong quality with improved throughput relative to heavier “large” variants (behavior depends on implementation and hardware). It is a general-purpose multilingual ASR model, suited to draft transcripts, search, and caption drafts where verbatim perfection is not assumed. This benchmark uses one decode configuration recorded in other.yaml; it does not sweep decoding options, VAD, or post-processing.
5. Results (From result.json)
Exact metrics:
- Reference word count (N): 9627
- Substitutions (S): 60
- Deletions (D): 6449
- Insertions (I): 0
- WER: 0.6761192479484782
- Accuracy: 0.3238807520515218
- Reference character count: 38334
- Character edit distance: 25696
- CER: 0.6703187770647467
- Audio duration (seconds): 1059.8809999999999
- STT processing time (seconds): 175
- RTF: 0.16511287587946197
- Eval script runtime (seconds): 25.612
Rounded for reading:
- WER ≈ 67.6%; accuracy ≈ 32.4%
- CER ≈ 67.0%
- ~25.7k character edits on ~38.3k reference characters
- RTF ≈ 0.165× (about 6× faster than real time)
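The rounded figures can be cross-checked directly from the raw counts reported in result.json:

```javascript
// Recompute WER, accuracy, and CER from the exact counts above.
const N = 9627, S = 60, D = 6449, I = 0;
const wer = (S + D + I) / N;
const accuracy = 1 - wer;
const cerRate = 25696 / 38334; // character edits / reference characters
console.log(wer.toFixed(4), accuracy.toFixed(4), cerRate.toFixed(4));
// 0.6761 0.3239 0.6703
```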
6. Error Pattern Analysis
Insertions are zero and deletions dwarf substitutions (6449 vs 60). That is not the usual “noisy ASR with extra filler words” profile; it points to large spans of reference text not matched by the hypothesis under this alignment—consistent with length mismatch, different segmentation, or reference spanning more content than the ASR saw (e.g., caption file vs. audio segment). CER ≈ 67% reinforces that the gap is broad, not a handful of word swaps.
For product teams: do not interpret this as “Whisper misheard 68% of words” in the colloquial sense until you confirm same audio window, same language, and comparable text normalization between caption export and model output.
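One cheap triage step is to compare raw word counts of the two caption files before tuning the model (illustrative helper operating on VTT strings; in practice read test-transcripts/20260329/ref.vtt and model.vtt with fs.readFileSync and compare the counts):

```javascript
// Count payload words in a WebVTT string, skipping header, cue numbers,
// and timing lines. A large ref/hyp gap is consistent with the
// deletion-heavy profile described above.
function wordCount(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(l => l.trim() &&
                 l.trim() !== "WEBVTT" &&
                 !l.includes("-->") &&
                 !/^\d+$/.test(l.trim()))
    .join(" ")
    .split(/\s+/)
    .filter(Boolean).length;
}

// Toy example: a 3-word reference cue vs. a 1-word hypothesis cue.
const refVtt = "WEBVTT\n\n00:00.000 --> 00:02.000\nhello there world";
const hypVtt = "WEBVTT\n\n00:00.000 --> 00:02.000\nhello";
console.log(wordCount(refVtt), wordCount(hypVtt)); // 3 1
```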
7. Key Insights
- Speed: RTF ≈ 0.17 is attractive for batch processing long clips.
- Accuracy: ~68% WER is not publication-ready without review if you need faithful quotes.
- Error shape: Deletion-heavy, zero insertions—audit pairing and coverage before tuning model knobs.
- Scenario realism: ~18 minutes of continuous English from a real YouTube source is more representative than toy clips, but still one video and one model setting.
- Reference choice: Using platform captions anchors the test to a viewer-visible baseline, which may differ from a human re-transcribe.
8. Best Model for This Scenario
Under the narrow scope “large-v3-turbo on this clip, with YouTube captions as reference,” the run is a documented baseline: it fixes throughput (RTF) and quantitative mismatch (WER/CER) for later comparison. It is not a claim that this is the best model for all English YouTube content.
9. Neutral Final Verdict
For internal drafts, topic tagging, or rough search, low RTF may make this stack usable if stakeholders accept error rates and validate critical passages.
For verbatim quoting, compliance, or accessibility-critical subtitles, ~32% word accuracy and deletion-heavy errors mean human review or alignment fixes remain mandatory. Rerun the scorer after any change to inputs; methodology stays comparable.
Source Materials
Case folder name: {case-name} = 20260329 (mirror under test-transcripts/ in the repo when you publish assets).
- Original video (audio source): https://www.youtube.com/watch?v=E73XCmLAFe8 (the reference subtitles are the captions provided with this video, exported as ref.vtt)
- Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
- Model transcript (VTT): test-transcripts/{case-name}/model.vtt
- Run metadata: test-transcripts/{case-name}/other.yaml
- Precomputed evaluation metrics: test-transcripts/{case-name}/result.json
Scoring uses scripts/evaluate-vtt-metrics.js in this repository. For long transcripts, run Node with a raised heap limit if needed (e.g. NODE_OPTIONS=--max-old-space-size=8192).