
Whisper Large v3 Turbo on an English Interview — March 28, 2026 Benchmark (WER, CER, RTF)
2026-03-28
Eric King
This note documents a single, fixed-configuration run on English interview-style audio (~8.5 minutes). The scorer reports a word error rate near 69%, with deletions dominating the error budget (2,192 deletions vs. 188 substitutions and 0 insertions). That pattern usually means the hypothesis text covers far less of the reference than a typical “noisy but complete” transcript would, so the numbers should be read as diagnostic, alongside a manual check that the model output and the reference describe the same recording and segmentation.
Video and reference text
The audio under test comes from one source video (link below). The reference WebVTT (ref.vtt) is the caption track supplied with that video, exported or saved as WebVTT from the platform’s subtitles, not an independently authored “gold” transcript. The hypothesis (model.vtt) is Whisper large-v3-turbo ASR output on the same audio. Metrics therefore compare platform-provided captions to this ASR run, which is a practical baseline but not the same as scoring against hand-curated research transcripts.
1. Why This Benchmark Matters
Interview audio stresses ASR with overlapping speech, uneven pacing, names, and numbers—conditions common in editorial and research work. Publishing model id, language, duration, timestamps, and standard metrics keeps the run comparable to reruns or other pipelines; the aim is transparency, not a product claim.
2. Testing Setup
Unless stated otherwise, values below come from other.yaml and result.json for this case.

| Field | Value |
|---|---|
| Date (processing window) | 2026-03-28 (see processtime-at / completed-at in other.yaml) |
| Scenario | English interview-style content (language tag: English) |
| Whisper model | large-v3-turbo (whisper-model in other.yaml) |
| Audio duration (YAML) | 08:25 (8 min 25 s wall-clock label) |
| Audio duration (scorer) | 506.88 s (from reference VTT cue span in result.json) |
| Wall-clock processing interval | processtime-at: 2026-03-28 09:56:40.204 → completed-at: 2026-03-28 09:57:57.000 |
| Derived STT processing time | ≈ 76.8 s (difference between the two timestamps above; not stored in result.json because this run used explicit VTT mode without YAML attached to the scorer output) |
| Derived RTF | ≈ 0.151 (processing time ÷ 506.88 s audio duration) |
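The two derived values above can be reproduced from the YAML timestamps directly. A minimal sketch (timestamp strings copied from the table; this is not the repository scorer):

```javascript
// Recompute derived STT processing time and RTF from the two
// wall-clock timestamps reported in other.yaml.
const startedAt = new Date("2026-03-28T09:56:40.204");
const completedAt = new Date("2026-03-28T09:57:57.000");
const audioDurationSec = 506.88; // reference VTT cue span from result.json

const processingSec = (completedAt - startedAt) / 1000; // 76.796 s
const rtf = processingSec / audioDurationSec;           // ≈ 0.15

console.log(processingSec.toFixed(1)); // "76.8"
```

Both timestamps come from the same clock on the same machine, so the subtraction is meaningful; across machines or time zones this shortcut would not be safe.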
Note: result.json lists "yamlMeta": null for this explicit two-file run; RTF there is null. Processing time and RTF in this article are recomputed from other.yaml for consistency with the methodology section.
3. Evaluation Methodology
Reference and hypothesis are WebVTT files. Plain text is extracted from cues (timestamps and indices stripped), then normalized (casing, punctuation, and simple typography) before scoring.
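A minimal sketch of that extraction and normalization step (illustrative only; function names and the exact normalization rules are assumptions, not the repository's evaluate-vtt-metrics.js):

```javascript
// Extract plain text from a WebVTT string: drop the header line,
// numeric cue identifiers, timestamp lines, and blank lines.
function vttToText(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&
      !line.startsWith("WEBVTT") &&
      !line.startsWith("NOTE") &&
      !line.includes("-->") &&      // timestamp lines
      !/^\d+$/.test(line.trim())    // numeric cue identifiers
    )
    .join(" ");
}

// Normalize before scoring: lowercase, map typographic apostrophes,
// drop remaining punctuation, collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/’/g, "'")                 // typographic apostrophe → ASCII
    .replace(/[^\p{L}\p{N}'\s]/gu, " ") // drop punctuation/symbols
    .replace(/\s+/g, " ")
    .trim();
}

const sample = [
  "WEBVTT",
  "",
  "1",
  "00:00:00.000 --> 00:00:02.000",
  "Hello, world!",
].join("\n");
console.log(normalize(vttToText(sample))); // "hello world"
```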
Word-level alignment
Reference and hypothesis are aligned as token sequences. A standard Levenshtein–style dynamic program finds a minimum-cost path between the two word sequences; backtracking yields counts of substitutions (S), deletions (D), and insertions (I) relative to the reference length N.
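The alignment step can be sketched as follows (a textbook Levenshtein dynamic program with backtracking; a sketch of the technique, not the scorer's actual code):

```javascript
// Word-level Levenshtein alignment with backtracking to count
// substitutions (S), deletions (D), and insertions (I) relative
// to the reference. Unit cost for each edit operation.
function alignCounts(refWords, hypWords) {
  const n = refWords.length, m = hypWords.length;
  // dp[i][j] = minimum edits to align ref[0..i) with hyp[0..j)
  const dp = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const diag = dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  // Backtrack one optimal path and classify each step.
  let i = n, j = m, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    const diag = i > 0 && j > 0
      ? dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1)
      : Infinity;
    if (dp[i][j] === diag) {
      if (refWords[i - 1] !== hypWords[j - 1]) S++; // substitution (or match)
      i--; j--;
    } else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) {
      D++; i--;                                     // deletion from reference
    } else {
      I++; j--;                                     // insertion in hypothesis
    }
  }
  return { S, D, I, N: n };
}

console.log(alignCounts(
  "the cat sat on the mat".split(" "),
  "the cat on mat".split(" ")
)); // { S: 0, D: 2, I: 0, N: 6 }
```

When several optimal paths exist, the S/D/I split can differ between tools even though the total edit count is identical; that is normal for tied alignments.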
Word Error Rate (WER) and accuracy
Let (S), (D), and (I) be substitution, deletion, and insertion counts, and (N) the number of reference words.
[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
]
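Plugging this run's exact counts (Section 5) into the formula reproduces the reported values:

```javascript
// WER and accuracy from this run's counts in result.json (Section 5).
const S = 188, D = 2192, I = 0, N = 3442;
const wer = (S + D + I) / N;  // 2380 / 3442
const accuracy = 1 - wer;
console.log(wer.toFixed(4), accuracy.toFixed(4)); // "0.6915" "0.3085"
```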
Character Error Rate (CER)
Whitespace is removed from the normalized strings. Character edit distance is the Levenshtein distance at the character level; reference character count is the length of the reference string without spaces.
[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count}}.
]
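A compact sketch of the character-level computation (memory-efficient two-row variant; illustrative, not the scorer's code):

```javascript
// Character-level Levenshtein distance using two rolling rows.
function charEditDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = Math.min(
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution / match
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1                                 // insertion
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// CER: whitespace is removed before the character-level comparison.
function cer(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  return charEditDistance(ref, hyp) / ref.length;
}

console.log(cer("the cat", "the bat")); // 1 edit over 6 reference characters
```

For this run, the reported CER is simply 10494 / 15790 ≈ 0.6646 using the precomputed values from result.json.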
Real-Time Factor (RTF)
[
\mathrm{RTF} = \frac{\text{Processing time (seconds)}}{\text{Audio duration (seconds)}}.
]
RTF below 1 means decoding was faster than real time on this hardware/run.
4. Model Overview
Whisper large-v3-turbo sits in the “large” family and trades some accuracy for higher throughput versus the full large checkpoints (exact behavior depends on implementation and hardware). It is a general-purpose multilingual ASR model suitable for drafts and search indexing where perfect fidelity is not assumed. This run tests one configuration from other.yaml; there was no sweep of temperature, chunking, or VAD settings.
5. Results (From result.json)
Exact values from the precomputed metrics object:
- Reference word count (N): 3442
- Substitutions (S): 188
- Deletions (D): 2192
- Insertions (I): 0
- WER: 0.6914584543869843
- Accuracy: 0.3085415456130157
- Reference character count: 15790
- Character edit distance: 10494
- CER: 0.664597846738442
- Audio duration (seconds): 506.88
- STT processing time (in JSON): null (see Section 2 for the YAML-derived duration)
- RTF (in JSON): null (derived RTF ≈ 0.151 using YAML timestamps)
- Eval script runtime: 3.11 s
Rounded for reading
- WER ≈ 69.1%; accuracy ≈ 30.9%
- CER ≈ 66.5%
- ~10.5k character edits on ~15.8k reference characters
- RTF ≈ 0.15× (faster than real time on this clip, using YAML-derived processing time)
6. Error Pattern Analysis
With I = 0, the hypothesis never adds spurious words relative to this alignment; almost all word-level error mass is deletions and substitutions, and deletions are an order of magnitude larger than substitutions (2192 vs 188).
Interpretation for practice:
- Deletion-heavy profiles often indicate missing spans in the hypothesis (silence handling, early stop, different clip length, or reference longer than the audio actually transcribed).
- Zero insertions rarely appear in messy real-world ASR; when it happens together with extreme WER, it is a signal to verify data pairing (same file, same language, same edit of the reference) before attributing the score to “model quality” alone.
CER ~66% is consistent with large stretches of text that do not match between reference and hypothesis—not only occasional word swaps.
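A small pre-reporting check can flag this profile automatically. The following sketch is illustrative (the thresholds and function name are assumptions, not part of the scorer):

```javascript
// Flag deletion-heavy runs before reporting: if deletions carry most
// of the error mass and WER is extreme, recommend verifying the data
// pairing first. Thresholds here are illustrative, not tuned.
function diagnose({ S, D, I, N }) {
  const errors = S + D + I;
  const wer = errors / N;
  const deletionShare = errors > 0 ? D / errors : 0;
  if (wer > 0.5 && deletionShare > 0.8) {
    return "deletion-heavy: verify file pairing, language, and reference edit";
  }
  return "no coverage red flag from counts alone";
}

// This run: 2192 / 2380 ≈ 92% of the error mass is deletions.
console.log(diagnose({ S: 188, D: 2192, I: 0, N: 3442 }));
```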
7. Key Insights
- Speed: Derived RTF ≈ 0.15 suggests the stack finished in a fraction of real time for this clip—useful where latency matters, independent of raw WER.
- Accuracy: ~69% WER is not sufficient for publishable quotes or legal-grade transcripts without heavy human review.
- Error shape: Deletions dominate; prioritize investigating coverage and segment alignment before tuning decoding hyperparameters.
- Single-sample limits: One interview and one model configuration do not define expected production performance across accents, codecs, or noise.
- Reproducibility: Keeping all four artifacts (ref.vtt, model.vtt, other.yaml, result.json) together preserves a frozen, reproducible snapshot of the run.
8. Best Model for This Scenario
For this clip and reference only, Whisper large-v3-turbo is a documented baseline: timestamps describe throughput; WER/CER describe mismatch versus your reference. It is not argued to be the best model for all English interviews.
9. Neutral Final Verdict
For draft notes, internal search, or rough indexing where errors are acceptable and speed matters, a low RTF and a stored transcript may still be usable with clear disclaimers.
For quoting participants, compliance-sensitive workflows, or archival publication, this run’s ~31% word accuracy and deletion-heavy error profile imply that human proofreading or a different capture/reference alignment should be assumed until scores improve on validated pairs.
Rerun the scorer after fixing data issues; the methodology stays comparable.
Source Materials
Case folder name: 20260328 (repository path prefix: test-transcripts/20260328/).
- Original video (audio source): add the canonical URL of the video whose captions were used as the reference (e.g. a YouTube watch link). The audio processed for ASR should correspond to this upload.
- Reference transcript (VTT): test-transcripts/20260328/ref.vtt (subtitles/captions provided with the source video, stored as WebVTT for scoring)
- Model transcript (VTT): test-transcripts/20260328/model.vtt (Whisper large-v3-turbo output on that audio)
- Run metadata: test-transcripts/20260328/other.yaml
- Precomputed evaluation metrics: test-transcripts/20260328/result.json
Evaluation was produced with scripts/evaluate-vtt-metrics.js in this repository. Place the files above under test-transcripts/20260328/ to reproduce the quoted numbers.