
Whisper Large v3 Turbo on an English Interview — March 28, 2026 Benchmark (WER, CER, RTF)

2026-03-28 · Test

Eric King, Author

This note documents a single, fixed-configuration run on English interview-style audio (~8.5 minutes). The scorer reports a word error rate near 69%, with deletions dominating the error budget (2,192 deletions vs 188 substitutions, 0 insertions). That pattern usually means the hypothesis text covers far less of the reference than a typical “noisy but complete” transcript would—so the numbers should be read as diagnostic, alongside a manual check that the model output and the reference describe the same recording and segmentation.
Video and reference text. The audio under test comes from one source video (link below). The reference WebVTT (ref.vtt) is the caption track supplied with that video—exported or saved as WebVTT from the platform’s subtitles—not an independently authored “gold” transcript. The hypothesis (model.vtt) is Whisper large-v3-turbo ASR on the same audio. Metrics therefore compare platform-provided captions to this ASR run, which is a practical baseline but not the same as scoring against hand-curated research transcripts.

1. Why This Benchmark Matters

Interview audio stresses ASR with overlapping speech, uneven pacing, names, and numbers—conditions common in editorial and research work. Publishing model id, language, duration, timestamps, and standard metrics keeps the run comparable to reruns or other pipelines; the aim is transparency, not a product claim.

2. Testing Setup

Unless stated otherwise, values below come from other.yaml and result.json for this case.
  • Date (processing window): 2026-03-28 (see processtime-at / completed-at in other.yaml)
  • Scenario: English interview-style content (language tag: English)
  • Whisper model: large-v3-turbo (whisper-model in other.yaml)
  • Audio duration (YAML): 08:25 (8 min 25 s wall-clock label)
  • Audio duration (scorer): 506.88 s (from reference VTT cue span in result.json)
  • Wall-clock processing interval: processtime-at: 2026-03-28 09:56:40.204 to completed-at: 2026-03-28 09:57:57.000
  • Derived STT processing time: 76.8 s (difference between the two timestamps above; not stored in result.json because this run used explicit VTT mode without YAML attached to the scorer output)
  • Derived RTF: 0.151 (processing time ÷ 506.88 s audio duration)
Note: result.json lists "yamlMeta": null for this explicit two-file run; RTF there is null. Processing time and RTF in this article are recomputed from other.yaml for reporting consistency with the methodology section.
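The derived processing time and RTF can be reproduced directly from the two other.yaml timestamps and the scorer's audio duration. A minimal sketch (timezone treated as UTC purely for arithmetic; other.yaml does not state one):

```javascript
// Recompute derived STT processing time and RTF from the timestamps
// quoted above. The audio duration comes from result.json.
const processtimeAt = new Date("2026-03-28T09:56:40.204Z");
const completedAt   = new Date("2026-03-28T09:57:57.000Z");

const processingSeconds = (completedAt - processtimeAt) / 1000; // 76.796
const audioSeconds = 506.88;                                    // scorer-reported duration
const rtf = processingSeconds / audioSeconds;

console.log(processingSeconds.toFixed(1)); // "76.8"
console.log(rtf.toFixed(2));               // "0.15"
```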

3. Evaluation Methodology

Reference and hypothesis are WebVTT files. Plain text is extracted from cues (timestamps and indices stripped), then normalized (casing, punctuation, and simple typography) before scoring.
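As a rough illustration of the extraction-and-normalization step (the actual logic lives in scripts/evaluate-vtt-metrics.js and may differ in detail — this is a sketch, not the repository implementation):

```javascript
// Strip WEBVTT header, cue indices, and timestamp lines, then apply the
// simple normalization described above (casing, punctuation, typography).
function vttToNormalizedText(vtt) {
  return vtt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&
      line.trim() !== "WEBVTT" &&
      !/^\d+$/.test(line.trim()) &&   // cue index lines
      !line.includes("-->"))          // timestamp lines
    .join(" ")
    .toLowerCase()                                  // casing
    .replace(/[“”]/g, '"').replace(/[‘’]/g, "'")    // simple typography
    .replace(/[.,!?;:"'()]/g, "")                   // punctuation
    .replace(/\s+/g, " ")
    .trim();
}

const sample = `WEBVTT

1
00:00:00.000 --> 00:00:02.500
Hello, world!`;

console.log(vttToNormalizedText(sample)); // "hello world"
```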
Word-level alignment
Reference and hypothesis are aligned as token sequences. A standard Levenshtein–style dynamic program finds a minimum-cost path between the two word sequences; backtracking yields counts of substitutions (S), deletions (D), and insertions (I) relative to the reference length N.
Word Error Rate (WER) and accuracy
Let S, D, and I be the substitution, deletion, and insertion counts, and N the number of reference words.
WER = (S + D + I) ÷ N        Accuracy = 1 − WER
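The alignment-and-counting procedure above can be sketched as follows; the repository scorer may tokenize, normalize, and break DP ties differently, so treat this as illustrative:

```javascript
// Levenshtein-style DP over word sequences, with backtracking to
// attribute each edit to a substitution (S), deletion (D), or insertion (I).
function wordErrorCounts(refWords, hypWords) {
  const n = refWords.length, m = hypWords.length;
  // dp[i][j] = min edits aligning first i ref words with first j hyp words
  const dp = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const sub = dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  // Backtrack the minimum-cost path to recover S/D/I counts
  let S = 0, D = 0, I = 0, i = n, j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && refWords[i - 1] === hypWords[j - 1] &&
        dp[i][j] === dp[i - 1][j - 1]) { i--; j--; }                    // match
    else if (i > 0 && j > 0 && dp[i][j] === dp[i - 1][j - 1] + 1) { S++; i--; j--; }
    else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) { D++; i--; }
    else { I++; j--; }
  }
  return { S, D, I, wer: (S + D + I) / refWords.length };
}

const ref = "the quick brown fox jumps".split(" ");
const hyp = "the quick fox jumped".split(" ");
console.log(wordErrorCounts(ref, hyp)); // { S: 1, D: 1, I: 0, wer: 0.4 }
```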
Character Error Rate (CER)
Whitespace is removed from the normalized strings. Character edit distance is the Levenshtein distance at the character level; reference character count is the length of the reference string without spaces.
CER = Character edit distance ÷ Reference character count
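A matching sketch for CER, under the same caveat that the repository scorer may differ in normalization details:

```javascript
// CER per the definition above: strip whitespace from both strings, take
// the character-level Levenshtein distance, divide by the reference length.
function charErrorRate(refText, hypText) {
  const r = refText.replace(/\s+/g, "");
  const h = hypText.replace(/\s+/g, "");
  const dp = Array.from({ length: r.length + 1 },
                        () => new Array(h.length + 1).fill(0));
  for (let i = 0; i <= r.length; i++) dp[i][0] = i;
  for (let j = 0; j <= h.length; j++) dp[0][j] = j;
  for (let i = 1; i <= r.length; i++) {
    for (let j = 1; j <= h.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + (r[i - 1] === h[j - 1] ? 0 : 1), // substitute / match
        dp[i - 1][j] + 1,                                    // delete
        dp[i][j - 1] + 1);                                   // insert
    }
  }
  return dp[r.length][h.length] / r.length;
}

console.log(charErrorRate("hello world", "helo world")); // 1 edit / 10 chars = 0.1
```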
Real-Time Factor (RTF)
RTF = Processing time (seconds) ÷ Audio duration (seconds)
RTF below 1 means decoding was faster than real time on this hardware/run.

4. Model Overview

Whisper large-v3-turbo sits in the “large” family and trades some compute for throughput versus full large checkpoints (exact behavior depends on implementation and hardware). It is a general-purpose multilingual ASR suitable for drafts and search indexing where perfect fidelity is not assumed. This run tests one configuration from other.yaml; no sweep of temperature, chunking, or VAD.

5. Results (From result.json)

Exact values from the precomputed metrics object:
  • Reference word count (N): 3442
  • Substitutions (S): 188
  • Deletions (D): 2192
  • Insertions (I): 0
  • WER: 0.6914584543869843
  • Accuracy: 0.3085415456130157
  • Reference character count: 15790
  • Character edit distance: 10494
  • CER: 0.664597846738442
  • Audio duration (seconds): 506.88
  • STT processing time (in JSON): null (see Section 2 for YAML-derived duration)
  • RTF (in JSON): null (derived RTF ≈ 0.151 using YAML timestamps)
  • Eval script runtime: 3.11 s
Rounded for reading
  • WER ≈ 69.1%; accuracy ≈ 30.9%
  • CER ≈ 66.5%
  • ~10.5k character edits on ~15.8k reference characters
  • RTF ≈ 0.15× (faster than real time on this clip, using YAML-derived processing time)

6. Error Pattern Analysis

With I = 0, the hypothesis never adds spurious words relative to this alignment; almost all word-level error mass is deletions and substitutions, and deletions are an order of magnitude larger than substitutions (2192 vs 188).
Interpretation for practice:
  • Deletion-heavy profiles often indicate missing spans in the hypothesis (silence handling, early stop, different clip length, or reference longer than the audio actually transcribed).
  • Zero insertions rarely appear in messy real-world ASR; when it happens together with extreme WER, it is a signal to verify data pairing (same file, same language, same edit of the reference) before attributing the score to “model quality” alone.
CER ~66% is consistent with large stretches of text that do not match between reference and hypothesis—not only occasional word swaps.
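The alignment identity hypothesis-words = N − D + I makes the coverage gap concrete. Using only the counts quoted in Section 5 (the hypothesis word count itself is not stored in result.json, so this is derived arithmetic):

```javascript
// Coverage sanity check from the published alignment counts:
// matches + S + D = N (reference side), matches + S + I = hyp length.
const N = 3442, S = 188, D = 2192, I = 0;  // from result.json
const hypWords = N - D + I;                // words the hypothesis contributed
const coverage = hypWords / N;

console.log(hypWords);             // 1250
console.log(coverage.toFixed(3));  // "0.363" — hypothesis spans ~36% of the reference
```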

7. Key Insights

  • Speed: Derived RTF ≈ 0.15 suggests the stack finished in a fraction of real time for this clip—useful where latency matters, independent of raw WER.
  • Accuracy: ~69% WER is not sufficient for publishable quotes or legal-grade transcripts without heavy human review.
  • Error shape: Deletions dominate; prioritize investigating coverage and segment alignment before tuning decoding hyperparameters.
  • Single-sample limits: One interview and one model configuration do not define expected production performance across accents, codecs, or noise.
  • Reproducibility: Keeping all four artifacts together preserves a frozen snapshot.

8. Best Model for This Scenario

For this clip and reference only, Whisper large-v3-turbo is a documented baseline: timestamps describe throughput; WER/CER describe mismatch versus your reference. It is not argued to be the best model for all English interviews.

9. Neutral Final Verdict

For draft notes, internal search, or rough indexing where errors are acceptable and speed matters, a low RTF and a stored transcript may still be usable with clear disclaimers.
For quoting participants, compliance-sensitive workflows, or archival publication, this run’s ~31% word accuracy and deletion-heavy error profile imply that human proofreading or a different capture/reference alignment should be assumed until scores improve on validated pairs.
Rerun the scorer after fixing data issues; the methodology stays comparable.

Source Materials

Case folder name: 20260328 (repository path prefix: test-transcripts/20260328/).
  • Original video (audio source): Add the canonical URL to the same video whose captions were used as the reference (e.g. YouTube watch link). The audio processed for ASR should correspond to this upload.
  • Reference transcript (VTT): test-transcripts/20260328/ref.vtt — subtitles/captions provided with the source video, stored as WebVTT for scoring.
  • Model transcript (VTT): test-transcripts/20260328/model.vtt — Whisper large-v3-turbo output on that audio.
  • Run metadata: test-transcripts/20260328/other.yaml
  • Precomputed evaluation metrics: test-transcripts/20260328/result.json
Evaluation was produced with scripts/evaluate-vtt-metrics.js in this repository. Place the files above under test-transcripts/20260328/ to reproduce the quoted numbers.
