
Whisper Large v3 Turbo on an English Interview — March 28, 2026 Benchmark (WER, CER, RTF)
2026-03-28
Eric King
This note documents a single, fixed-configuration run on English interview-style audio (~8.5 minutes). The scorer reports a word error rate near 69%, with deletions dominating the error budget (2,192 deletions vs. 188 substitutions and 0 insertions). That pattern usually means the hypothesis text covers far less of the reference than a typical “noisy but complete” transcript would, so the numbers should be read as diagnostic, alongside a manual check that the model output and the reference describe the same recording and segmentation.
Video and reference text
The audio under test comes from one source video (link below). The reference WebVTT (ref.vtt) is the caption track supplied with that video, exported or saved as WebVTT from the platform’s subtitles, not an independently authored “gold” transcript. The hypothesis (model.vtt) is Whisper large-v3-turbo ASR output on the same audio. Metrics therefore compare platform-provided captions to this ASR run, which is a practical baseline but not the same as scoring against hand-curated research transcripts.
1. Why This Benchmark Matters
Interview audio stresses ASR with overlapping speech, uneven pacing, names, and numbers—conditions common in editorial and research work. Publishing model id, language, duration, timestamps, and standard metrics keeps the run comparable to reruns or other pipelines; the aim is transparency, not a product claim.
2. Testing Setup
Unless stated otherwise, values below come from other.yaml and result.json for this case.

| Field | Value |
|---|---|
| Date (processing window) | 2026-03-28 (see processtime-at / completed-at in other.yaml) |
| Scenario | English interview-style content (language tag: English) |
| Whisper model | large-v3-turbo (whisper-model in other.yaml) |
| Audio duration (YAML) | 08:25 (8 min 25 s wall-clock label) |
| Audio duration (scorer) | 506.88 s (from reference VTT cue span in result.json) |
| Wall-clock processing interval | processtime-at: 2026-03-28 09:56:40.204 → completed-at: 2026-03-28 09:57:57.000 |
| Derived STT processing time | ≈ 76.8 s (difference between the two timestamps above; not stored in result.json because this run used explicit VTT mode without YAML attached to the scorer output) |
| Derived RTF | ≈ 0.151 (processing time ÷ 506.88 s audio duration) |
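The two derived values above can be reproduced from the YAML timestamps directly. A minimal sketch (timestamp strings copied from the table; this is not the repository scorer):

```javascript
// Recompute derived STT processing time and RTF from the two
// wall-clock timestamps reported in other.yaml.
const startedAt = new Date("2026-03-28T09:56:40.204");
const completedAt = new Date("2026-03-28T09:57:57.000");
const audioDurationSec = 506.88; // reference VTT cue span from result.json

const processingSec = (completedAt - startedAt) / 1000; // 76.796 s
const rtf = processingSec / audioDurationSec;           // ≈ 0.15

console.log(processingSec.toFixed(1)); // "76.8"
```

Both timestamps come from the same clock on the same machine, so the subtraction is meaningful; across machines or time zones this shortcut would not be safe.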
Note: result.json lists "yamlMeta": null for this explicit two-file run; RTF there is null. Processing time and RTF in this article are recomputed from other.yaml for consistency with the methodology section.
3. Evaluation Methodology
Reference and hypothesis are WebVTT files. Plain text is extracted from cues (timestamps and indices stripped), then normalized (casing, punctuation, and simple typography) before scoring.
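A minimal sketch of that extraction and normalization step (illustrative only; function names and the exact normalization rules are assumptions, not the repository's evaluate-vtt-metrics.js):

```javascript
// Extract plain text from a WebVTT string: drop the header line,
// numeric cue identifiers, timestamp lines, and blank lines.
function vttToText(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&
      !line.startsWith("WEBVTT") &&
      !line.startsWith("NOTE") &&
      !line.includes("-->") &&      // timestamp lines
      !/^\d+$/.test(line.trim())    // numeric cue identifiers
    )
    .join(" ");
}

// Normalize before scoring: lowercase, map typographic apostrophes,
// drop remaining punctuation, collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/’/g, "'")                 // typographic apostrophe → ASCII
    .replace(/[^\p{L}\p{N}'\s]/gu, " ") // drop punctuation/symbols
    .replace(/\s+/g, " ")
    .trim();
}

const sample = [
  "WEBVTT",
  "",
  "1",
  "00:00:00.000 --> 00:00:02.000",
  "Hello, world!",
].join("\n");
console.log(normalize(vttToText(sample))); // "hello world"
```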
Word-level alignment
Reference and hypothesis are aligned as token sequences. A standard Levenshtein–style dynamic program finds a minimum-cost path between the two word sequences; backtracking yields counts of substitutions (S), deletions (D), and insertions (I) relative to the reference length N.
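The alignment step can be sketched as follows (a textbook Levenshtein dynamic program with backtracking; a sketch of the technique, not the scorer's actual code):

```javascript
// Word-level Levenshtein alignment with backtracking to count
// substitutions (S), deletions (D), and insertions (I) relative
// to the reference. Unit cost for each edit operation.
function alignCounts(refWords, hypWords) {
  const n = refWords.length, m = hypWords.length;
  // dp[i][j] = minimum edits to align ref[0..i) with hyp[0..j)
  const dp = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const diag = dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  // Backtrack one optimal path and classify each step.
  let i = n, j = m, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    const diag = i > 0 && j > 0
      ? dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1)
      : Infinity;
    if (dp[i][j] === diag) {
      if (refWords[i - 1] !== hypWords[j - 1]) S++; // substitution (or match)
      i--; j--;
    } else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) {
      D++; i--;                                     // deletion from reference
    } else {
      I++; j--;                                     // insertion in hypothesis
    }
  }
  return { S, D, I, N: n };
}

console.log(alignCounts(
  "the cat sat on the mat".split(" "),
  "the cat on mat".split(" ")
)); // { S: 0, D: 2, I: 0, N: 6 }
```

When several optimal paths exist, the S/D/I split can differ between tools even though the total edit count is identical; that is normal for tied alignments.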
Word Error Rate (WER) and accuracy
Let (S), (D), and (I) be substitution, deletion, and insertion counts, and (N) the number of reference words.
[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}.
]
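Plugging this run's exact counts (Section 5) into the formula reproduces the reported values:

```javascript
// WER and accuracy from this run's counts in result.json (Section 5).
const S = 188, D = 2192, I = 0, N = 3442;
const wer = (S + D + I) / N;  // 2380 / 3442
const accuracy = 1 - wer;
console.log(wer.toFixed(4), accuracy.toFixed(4)); // "0.6915" "0.3085"
```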
Character Error Rate (CER)
Whitespace is removed from the normalized strings. Character edit distance is the Levenshtein distance at the character level; reference character count is the length of the reference string without spaces.
[
\mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count}}.
]
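A compact sketch of the character-level computation (memory-efficient two-row variant; illustrative, not the scorer's code):

```javascript
// Character-level Levenshtein distance using two rolling rows.
function charEditDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = Math.min(
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution / match
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1                                 // insertion
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// CER: whitespace is removed before the character-level comparison.
function cer(refText, hypText) {
  const ref = refText.replace(/\s+/g, "");
  const hyp = hypText.replace(/\s+/g, "");
  return charEditDistance(ref, hyp) / ref.length;
}

console.log(cer("the cat", "the bat")); // 1 edit over 6 reference characters
```

For this run, the reported CER is simply 10494 / 15790 ≈ 0.6646 using the precomputed values from result.json.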
Real-Time Factor (RTF)
[
\mathrm{RTF} = \frac{\text{Processing time (seconds)}}{\text{Audio duration (seconds)}}.
]
RTF below 1 means decoding was faster than real time on this hardware/run.
4. Model Overview
Whisper large-v3-turbo sits in the “large” family and trades some accuracy for higher throughput versus the full large checkpoints (exact behavior depends on implementation and hardware). It is a general-purpose multilingual ASR model suitable for drafts and search indexing where perfect fidelity is not assumed. This run tests one configuration from other.yaml; there was no sweep of temperature, chunking, or VAD settings.
5. Results (From result.json)
Exact values from the precomputed metrics object:
- Reference word count (N): 3442
- Substitutions (S): 188
- Deletions (D): 2192
- Insertions (I): 0
- WER: 0.6914584543869843
- Accuracy: 0.3085415456130157
- Reference character count: 15790
- Character edit distance: 10494
- CER: 0.664597846738442
- Audio duration (seconds): 506.88
- STT processing time (in JSON): null (see Section 2 for the YAML-derived duration)
- RTF (in JSON): null (derived RTF ≈ 0.151 using YAML timestamps)
- Eval script runtime: 3.11 s
Rounded for reading
- WER ≈ 69.1%; accuracy ≈ 30.9%
- CER ≈ 66.5%
- ~10.5k character edits on ~15.8k reference characters
- RTF ≈ 0.15× (faster than real time on this clip, using YAML-derived processing time)
6. Error Pattern Analysis
With I = 0, the hypothesis never adds spurious words relative to this alignment; almost all word-level error mass is deletions and substitutions, and deletions are an order of magnitude larger than substitutions (2192 vs 188).
Interpretation for practice:
- Deletion-heavy profiles often indicate missing spans in the hypothesis (silence handling, early stop, different clip length, or reference longer than the audio actually transcribed).
- Zero insertions rarely appear in messy real-world ASR; when it happens together with extreme WER, it is a signal to verify data pairing (same file, same language, same edit of the reference) before attributing the score to “model quality” alone.
CER ~66% is consistent with large stretches of text that do not match between reference and hypothesis—not only occasional word swaps.
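A small pre-reporting check can flag this profile automatically. The following sketch is illustrative (the thresholds and function name are assumptions, not part of the scorer):

```javascript
// Flag deletion-heavy runs before reporting: if deletions carry most
// of the error mass and WER is extreme, recommend verifying the data
// pairing first. Thresholds here are illustrative, not tuned.
function diagnose({ S, D, I, N }) {
  const errors = S + D + I;
  const wer = errors / N;
  const deletionShare = errors > 0 ? D / errors : 0;
  if (wer > 0.5 && deletionShare > 0.8) {
    return "deletion-heavy: verify file pairing, language, and reference edit";
  }
  return "no coverage red flag from counts alone";
}

// This run: 2192 / 2380 ≈ 92% of the error mass is deletions.
console.log(diagnose({ S: 188, D: 2192, I: 0, N: 3442 }));
```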
7. Key Insights
- Speed: Derived RTF ≈ 0.15 suggests the stack finished in a fraction of real time for this clip—useful where latency matters, independent of raw WER.
- Accuracy: ~69% WER is not sufficient for publishable quotes or legal-grade transcripts without heavy human review.
- Error shape: Deletions dominate; prioritize investigating coverage and segment alignment before tuning decoding hyperparameters.
- Single-sample limits: One interview and one model configuration do not define expected production performance across accents, codecs, or noise.
- Reproducibility: Keeping all four artifacts (ref.vtt, model.vtt, other.yaml, result.json) together preserves a frozen, reproducible snapshot of the run.
8. Best Model for This Scenario
For this clip and reference only, Whisper large-v3-turbo is a documented baseline: timestamps describe throughput; WER/CER describe mismatch versus your reference. It is not argued to be the best model for all English interviews.
9. Neutral Final Verdict
For draft notes, internal search, or rough indexing where errors are acceptable and speed matters, a low RTF and a stored transcript may still be usable with clear disclaimers.
For quoting participants, compliance-sensitive workflows, or archival publication, this run’s ~31% word accuracy and deletion-heavy error profile imply that human proofreading or a different capture/reference alignment should be assumed until scores improve on validated pairs.
Rerun the scorer after fixing data issues; the methodology stays comparable.
Source Materials
Case folder name: 20260328 (repository path prefix: test-transcripts/20260328/).
- Original video (audio source): add the canonical URL of the video whose captions were used as the reference (e.g. a YouTube watch link). The audio processed for ASR should correspond to this upload.
- Reference transcript (VTT): test-transcripts/20260328/ref.vtt (subtitles/captions provided with the source video, stored as WebVTT for scoring)
- Model transcript (VTT): test-transcripts/20260328/model.vtt (Whisper large-v3-turbo output on that audio)
- Run metadata: test-transcripts/20260328/other.yaml
- Precomputed evaluation metrics: test-transcripts/20260328/result.json
Evaluation was produced with scripts/evaluate-vtt-metrics.js in this repository. Place the files above under test-transcripts/20260328/ to reproduce the quoted numbers.