
Whisper Large v3 Turbo on English YouTube Audio — March 29, 2026 Benchmark (WER, CER, RTF)

2026-03-29 · Test
Eric King

Author


This note captures one fixed-configuration run on English audio of roughly 18 minutes drawn from a YouTube upload. The scorer reports WER ≈ 67.6% with deletions dominating (6,449 deletions vs 60 substitutions and 0 insertions). That profile suggests the hypothesis transcript aligns poorly in coverage against the reference—often seen when the reference is the platform caption track and the ASR output reflects a different segmentation or length—so the figures should be read as diagnostic, not as a polished “accuracy score” in isolation.
Video and reference text. The reference WebVTT (ref.vtt) is the caption text supplied with the source video (exported as WebVTT). The hypothesis (model.vtt) is Whisper large-v3-turbo on the same underlying audio. Metrics compare those captions to this ASR run—a practical baseline for “how close is our pipeline to what YouTube ships,” not a claim about human-verified ground truth.

1. Why This Benchmark Matters

YouTube-style speech is everywhere in real workflows: variable mic quality, music beds, cuts, and long monologues or dialogues. Evaluating ASR on actual platform captions as the reference answers a concrete question: if we run our own Whisper-based stack on the same audio, how far does the text drift from what viewers already see as subtitles? That is useful for caption QA, repurposing content, and search indexing—domains where “good enough” depends on the product, but the numbers must be reproducible.

2. Testing Setup

Values below come from other.yaml and result.json for this case (directory mode so YAML metadata is attached to the scorer output).
  • Source: YouTube video (audio aligned to that upload)
  • Date (processing window): 2026-03-29 (processtime-at to completed-at in other.yaml)
  • Language: English
  • Whisper model: large-v3-turbo
  • Audio duration (YAML label): 17:39
  • Audio duration (scorer, from VTT): 1059.88 s (≈ 17.7 minutes)
  • STT processing time: 175 s (sttProcessingTimeSeconds in result.json, from YAML timestamps)
  • RTF: 0.165 (from result.json)
Wall-clock interval in YAML: 2026-03-29 16:04:37 → 2026-03-29 16:07:32 (consistent with the 175 s processing time).

3. Evaluation Methodology

Reference and hypothesis are WebVTT files. Cue text is extracted, then normalized (case, punctuation, light cleanup) before scoring.
Word-level alignment
Token sequences are aligned with a Levenshtein-style dynamic program; backtracking yields substitutions (S), deletions (D), and insertions (I) versus reference length N.
\[ \mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}. \]
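A minimal version of that alignment, with backtracking to count S, D, and I, might look like this (an illustrative sketch, not the repo's scorer):

```javascript
// Word-level Levenshtein alignment; backtracking classifies each edit as
// a substitution (S), deletion (D), or insertion (I) against the reference.
function alignWords(refWords, hypWords) {
  const n = refWords.length, m = hypWords.length;
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: n + 1 }, (_, i) =>
    Array.from({ length: m + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = refWords[i - 1] === hypWords[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  let i = n, j = m, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        dp[i][j] === dp[i - 1][j - 1] + (refWords[i - 1] === hypWords[j - 1] ? 0 : 1)) {
      if (refWords[i - 1] !== hypWords[j - 1]) S++;  // substitution or match
      i--; j--;
    } else if (i > 0 && dp[i][j] === dp[i - 1][j] + 1) {
      D++; i--;                                      // reference word dropped
    } else {
      I++; j--;                                      // extra hypothesis word
    }
  }
  return { S, D, I, WER: (S + D + I) / n };
}
```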
Character Error Rate (CER)
Whitespace is stripped; character edit distance is Levenshtein distance at character level.
\[ \mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}. \]
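The same dynamic program at character level, with whitespace stripped first, gives CER. A rolling-array sketch (again illustrative, not the repo's implementation):

```javascript
// Character error rate: Levenshtein distance over space-stripped strings,
// divided by the reference character count (no spaces).
function cer(ref, hyp) {
  const a = ref.replace(/\s+/g, ""), b = hyp.replace(/\s+/g, "");
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  return prev[b.length] / a.length;  // edits / reference chars (no spaces)
}
```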
Real-Time Factor (RTF)
\[ \mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}. \]
RTF below 1 means decoding faster than real time on this run.
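Plugging in this run's own numbers (from result.json, quoted in section 5) reproduces the reported RTF:

```javascript
// Reproducing the reported RTF from the run's own figures.
const sttSeconds = 175;          // sttProcessingTimeSeconds in result.json
const audioSeconds = 1059.881;   // scorer-derived audio duration from the VTT
const rtf = sttSeconds / audioSeconds;
const speedup = 1 / rtf;         // how many times faster than real time
console.log(`RTF ${rtf.toFixed(3)}, ~${speedup.toFixed(1)}x real time`);
// → "RTF 0.165, ~6.1x real time"
```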

4. Model Overview

Whisper large-v3-turbo targets strong quality with improved throughput relative to heavier “large” variants (behavior depends on implementation and hardware). It is a general-purpose multilingual ASR, suited to draft transcripts, search, and caption drafts where verbatim perfection is not assumed. This benchmark uses one decode configuration recorded in other.yaml; it does not sweep decoding options, VAD, or post-processing.

5. Results (From result.json)

Exact metrics:
  • Reference word count (N): 9627
  • Substitutions (S): 60
  • Deletions (D): 6449
  • Insertions (I): 0
  • WER: 0.6761192479484782
  • Accuracy: 0.3238807520515218
  • Reference character count: 38334
  • Character edit distance: 25696
  • CER: 0.6703187770647467
  • Audio duration (seconds): 1059.8809999999999
  • STT processing time (seconds): 175
  • RTF: 0.16511287587946197
  • Eval script runtime (seconds): 25.612
Rounded for reading
  • WER ≈ 67.6%; accuracy ≈ 32.4%
  • CER ≈ 67.0%
  • ~25.7k character edits on ~38.3k reference characters
  • RTF ≈ 0.165× (about 6× faster than real time)

6. Error Pattern Analysis

Insertions are zero and deletions dwarf substitutions (6449 vs 60). That is not the usual “noisy ASR with extra filler words” profile; it points to large spans of reference text not matched by the hypothesis under this alignment—consistent with length mismatch, different segmentation, or reference spanning more content than the ASR saw (e.g., caption file vs. audio segment). CER ≈ 67% reinforces that the gap is broad, not a handful of word swaps.
For product teams: do not interpret this as “Whisper misheard 68% of words” in the colloquial sense until you confirm same audio window, same language, and comparable text normalization between caption export and model output.
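One cheap audit along those lines: compare cue counts, end times, and word counts of the two VTT files before blaming the acoustic model. A sketch (the helper name and timestamp regex are illustrative):

```javascript
// Coverage audit: do ref.vtt and model.vtt even span comparable content?
function vttStats(vttText) {
  const cueRe = /(\d{2}:)?\d{2}:\d{2}\.\d{3}\s*-->\s*((\d{2}:)?\d{2}:\d{2}\.\d{3})/g;
  const cues = [...vttText.matchAll(cueRe)];
  const toSec = (t) => t.split(":").reduce((acc, p) => acc * 60 + parseFloat(p), 0);
  const lastEndSeconds = cues.length ? toSec(cues[cues.length - 1][2]) : 0;
  const words = vttText
    .split(/\r?\n/)
    .filter((l) => l.trim() && !l.includes("-->") &&
                   l.trim() !== "WEBVTT" && !/^\d+$/.test(l.trim()))
    .join(" ")
    .split(/\s+/)
    .filter(Boolean).length;
  return { cues: cues.length, lastEndSeconds, words };
}
// If the reference reports ~9.6k words but the hypothesis far fewer, or the
// files' end times differ by minutes, a deletion-dominated WER is a
// pairing/coverage problem before it is an acoustic one.
```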

7. Key Insights

  • Speed: RTF ≈ 0.17 is attractive for batch processing long clips.
  • Accuracy: at ~68% WER, transcripts are not publication-ready without review if you need faithful quotes.
  • Error shape: Deletion-heavy, zero insertions—audit pairing and coverage before tuning model knobs.
  • Scenario realism: ~18 minutes of continuous English from a real YouTube source is more representative than toy clips, but still one video and one model setting.
  • Reference choice: Using platform captions anchors the test to a viewer-visible baseline, which may differ from a human re-transcribe.

8. Best Model for This Scenario

Under the narrow scope “large-v3-turbo on this clip, with YouTube captions as reference,” the run is a documented baseline: it fixes throughput (RTF) and quantitative mismatch (WER/CER) for later comparison. It is not a claim that this is the best model for all English YouTube content.

9. Neutral Final Verdict

For internal drafts, topic tagging, or rough search, low RTF may make this stack usable if stakeholders accept error rates and validate critical passages.
For verbatim quoting, compliance, or accessibility-critical subtitles, ~32% word accuracy and deletion-heavy errors mean human review or alignment fixes remain mandatory. Rerun the scorer after any change to inputs so the methodology stays comparable.

Source Materials

Case folder name {case-name} = 20260329 (mirror under test-transcripts/ in the repo when you publish assets).
  • Original video (audio source): https://www.youtube.com/watch?v=E73XCmLAFe8 — the reference subtitles are the captions provided with this video (exported as ref.vtt).
  • Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
  • Model transcript (VTT): test-transcripts/{case-name}/model.vtt
  • Run metadata: test-transcripts/{case-name}/other.yaml
  • Precomputed evaluation metrics: test-transcripts/{case-name}/result.json
Scoring uses scripts/evaluate-vtt-metrics.js in this repository. For long transcripts, run Node with a raised heap limit if needed (e.g. NODE_OPTIONS=--max-old-space-size=8192).
