
Whisper Medium on English YouTube Audio — March 30, 2026 Benchmark (WER, CER, RTF)


2026-03-30 · Test

Eric King, Author

This post records one fixed-configuration run on English YouTube audio with Whisper medium. From result.json, the strict score is WER = 68.23% and accuracy = 31.77%, with a strongly deletion-heavy profile (D = 8,718, S = 131, I = 0). In plain terms, this looks less like isolated word confusions and more like a coverage mismatch between the reference captions and the generated transcript, so the output should be interpreted as a reproducible baseline rather than a standalone quality claim.
Video and reference text. The source is this YouTube video. The reference file (ref.vtt) comes from the caption track provided with that video, and model.vtt is the output from this Whisper run. That means the benchmark measures agreement with platform captions (useful in production workflows), not with a manually curated linguistic gold transcript.

1. Why This Benchmark Matters

Long-form YouTube audio is a practical stress case for ASR because it mixes natural pacing shifts, edits, names, and topic changes in a way short demos do not. If your downstream workflow is subtitle QA, search indexing, content repurposing, or draft summarization, this scenario reflects real operational constraints better than clean lab speech.
Using the platform caption track as the reference creates a realistic “what users already see vs what our ASR pipeline outputs” comparison. It is not perfect ground truth, but it is highly relevant for product teams who need consistency checks and repeatable tracking over time.

2. Testing Setup

Values below come directly from other.yaml and result.json in this case folder.
  • Source: YouTube video
  • Date (processing window): 2026-03-30 (process-at to completed-at)
  • Language: English
  • Whisper model: medium
  • Audio duration (YAML label): 22:44
  • Audio duration (scorer / YAML parsed): 1,364 s (≈ 22.73 minutes)
  • STT processing time: 365 s
  • RTF: 0.2676
Wall-clock timestamps: 2026-03-30 19:49:57 to 2026-03-30 19:56:02, consistent with 365 seconds of processing.

3. Evaluation Methodology

The evaluation is produced by:
  • scripts/evaluate-vtt-metrics.js
The script reads ref.vtt and model.vtt, extracts plain cue text, normalizes tokens, then aligns reference and hypothesis with Levenshtein dynamic programming.
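The cue-text extraction step can be sketched as below. This is a hypothetical helper, not the script's actual code, and it assumes a simple WebVTT layout (timing lines contain `-->`, numeric cue identifiers sit on their own lines); real-world files with NOTE or STYLE blocks would need more handling.

```javascript
// Minimal sketch: pull plain cue text out of a WebVTT string by dropping
// the header, timing lines, and numeric cue identifiers.
function extractCueText(vtt) {
  return vtt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) =>
      line !== "" &&
      line !== "WEBVTT" &&
      !line.includes("-->") &&   // drop timing lines
      !/^\d+$/.test(line)        // drop numeric cue identifiers
    )
    .join(" ");
}

const sample = `WEBVTT

1
00:00:00.000 --> 00:00:02.000
hello world

2
00:00:02.000 --> 00:00:04.000
from whisper`;

console.log(extractCueText(sample)); // "hello world from whisper"
```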
Word-level alignment
At word level, backtracking over the DP matrix yields substitutions (S), deletions (D), and insertions (I) against reference size N.
\[ \mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}. \]
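The DP-and-backtracking step above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the code in scripts/evaluate-vtt-metrics.js; tie-breaking between edit types can differ between implementations.

```javascript
// Word-level Levenshtein DP plus backtracking to count
// substitutions (S), deletions (D), and insertions (I) against reference size N.
function wordErrorCounts(refWords, hypWords) {
  const n = refWords.length, m = hypWords.length;
  const d = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) d[i][0] = i;
  for (let j = 0; j <= m; j++) d[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = refWords[i - 1] === hypWords[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j - 1] + cost, // match or substitution
        d[i - 1][j] + 1,        // deletion (reference word missing from hypothesis)
        d[i][j - 1] + 1         // insertion (extra hypothesis word)
      );
    }
  }
  // Backtrack through the DP matrix to classify each edit.
  let i = n, j = m, S = 0, D = 0, I = 0;
  while (i > 0 || j > 0) {
    const subCost = i > 0 && j > 0 && refWords[i - 1] === hypWords[j - 1] ? 0 : 1;
    if (i > 0 && j > 0 && d[i][j] === d[i - 1][j - 1] + subCost) {
      if (subCost === 1) S++;
      i--; j--;
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) {
      D++; i--;
    } else {
      I++; j--;
    }
  }
  return { S, D, I, N: n, WER: (S + D + I) / n };
}

const r = wordErrorCounts(["the", "cat", "sat", "down"], ["the", "dog", "sat"]);
console.log(r); // { S: 1, D: 1, I: 0, N: 4, WER: 0.5 }
```

Note how a missing reference word ("down") surfaces as a deletion; a transcript that covers far less material than the reference will pile up deletions exactly as seen in this run.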
Character Error Rate (CER)
Whitespace is removed first. Character edit distance is then computed by Levenshtein at character level.
\[ \mathrm{CER} = \frac{\text{Character edit distance}}{\text{Reference character count (no spaces)}}. \]
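A minimal sketch of the CER computation, assuming the same whitespace-stripping rule described above (function name and details are illustrative, not the script's code):

```javascript
// CER sketch: remove all whitespace, then divide the character-level
// Levenshtein distance by the reference character count.
// Uses a two-row DP to keep memory linear in the hypothesis length.
function cer(ref, hyp) {
  const a = ref.replace(/\s+/g, "");
  const b = hyp.replace(/\s+/g, "");
  let prev = new Array(b.length + 1);
  let curr = new Array(b.length + 1);
  for (let j = 0; j <= b.length; j++) prev[j] = j;
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1);
    }
    [prev, curr] = [curr, prev];
  }
  return prev[b.length] / a.length;
}

console.log(cer("the cat sat", "the cat sit")); // 1 edit over 9 chars ≈ 0.111
```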
Real-Time Factor (RTF)
\[ \mathrm{RTF} = \frac{\text{STT processing time}}{\text{Audio duration}}. \]
The script now outputs two scoring views:
  • strictMetrics: default normalization (punctuation/case normalized, word-level strictness preserved)
  • relaxedMetrics: additional normalization (quote removal, looser numeric formatting)
This dual reporting helps distinguish “formatting mismatch” from deeper lexical/coverage mismatch.
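The two normalization levels can be sketched like this. The rule details here are assumptions for illustration (the actual rules live in scripts/evaluate-vtt-metrics.js), and the "looser numeric formatting" rule is not reproduced:

```javascript
// Strict view: lowercase, strip punctuation, collapse whitespace.
function strictNormalize(text) {
  return text
    .toLowerCase()
    .replace(/[.,!?;:]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Relaxed view: strict normalization plus quote removal
// (straight and curly quotes, apostrophes).
function relaxedNormalize(text) {
  return strictNormalize(text).replace(/["'\u201C\u201D\u2018\u2019]/g, "");
}

console.log(strictNormalize('He said, "It\'s fine."'));  // he said "it's fine"
console.log(relaxedNormalize('He said, "It\'s fine."')); // he said its fine
```

If strict and relaxed scores diverge sharply, formatting is a factor; if they track each other closely, as in this run, the mismatch lies elsewhere.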

4. Model Overview

Whisper medium is a general-purpose Whisper checkpoint often used when teams want a practical trade-off between speed and recognition quality on commodity hardware. It is commonly sufficient for draft transcription, indexing, and downstream NLP preprocessing, but still requires validation for verbatim publishing or compliance-sensitive use cases.
Only one configuration is evaluated here (model and language from other.yaml). No decoder hyperparameter sweep, no custom post-correction, and no domain lexicon boosting were applied in this run.

5. Results (From result.json)

Strict metrics (metrics / strictMetrics)
  • Reference word count (N): 12,970
  • Substitutions (S): 131
  • Deletions (D): 8,718
  • Insertions (I): 0
  • WER: 0.6822667694680031
  • Accuracy: 0.31773323053199687
  • Reference character count: 51,745
  • Character edit distance: 34,683
  • CER: 0.6702676587109866
  • Audio duration (seconds): 1,364
  • STT processing time (seconds): 365
  • RTF: 0.26759530791788855
  • Eval script runtime (seconds): 149.07
Relaxed metrics (relaxedMetrics)
  • WER: 0.682112567463377
  • Accuracy: 0.317887432536623
  • CER: 0.6700148518721175
  • Character edit distance: 34,286
  • Reference character count: 51,172
Rounded interpretation
  • Strict WER ≈ 68.23%, Accuracy ≈ 31.77%, CER ≈ 67.03%
  • Relaxed WER ≈ 68.21%, Accuracy ≈ 31.79%, CER ≈ 67.00%
  • Difference between strict and relaxed is small, suggesting the mismatch is not mainly punctuation/formatting noise.
  • RTF ≈ 0.268 (about 3.7× faster than real time)
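As a sanity check, the headline numbers can be recomputed directly from the raw strict counts above:

```javascript
// Recompute the rounded strict metrics from the raw result.json counts.
const S = 131, D = 8718, I = 0, N = 12970;
const WER = (S + D + I) / N;

console.log(WER.toFixed(4));              // 0.6823  (strict WER)
console.log((1 - WER).toFixed(4));        // 0.3177  (accuracy)
console.log((34683 / 51745).toFixed(4));  // 0.6703  (CER: edit distance / ref chars)
console.log((365 / 1364).toFixed(4));     // 0.2676  (RTF: processing / audio seconds)
console.log((1364 / 365).toFixed(2));     // 3.74    (speedup vs real time)
```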

6. Error Pattern Analysis

Two signals stand out immediately:
  • Insertion = 0
  • Deletion >> substitution (8,718 vs 131)
That pattern usually indicates that many reference words do not find aligned counterparts in the hypothesis. In practice, this can happen due to large coverage differences (different subtitle segmentation, truncated hypothesis, reference including non-speech captions, or timing-window mismatch), not only “wrongly recognized words.”
The strict/relaxed gap is tiny, which further supports this interpretation: normalization tweaks barely moved scores, so the dominant issue is likely alignment/coverage rather than punctuation or quote formatting.
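One cheap way to test the coverage hypothesis before touching the model is to compare gross statistics of the two VTT files. The sketch below is a hypothetical diagnostic, not part of evaluate-vtt-metrics.js, and it assumes HH:MM:SS.mmm timestamps (WebVTT also allows MM:SS.mmm):

```javascript
// Count words and summed cue duration in a WebVTT string.
// If ref.vtt covers far more words or seconds than model.vtt, the
// deletion-heavy WER is a coverage gap, not word-level confusion.
function vttStats(vtt) {
  let words = 0, seconds = 0;
  for (const raw of vtt.split(/\r?\n/)) {
    const line = raw.trim();
    const m = line.match(/^(\d+):(\d+):(\d+)\.\d+ --> (\d+):(\d+):(\d+)\.\d+/);
    if (m) {
      seconds += (+m[4] * 3600 + +m[5] * 60 + +m[6]) -
                 (+m[1] * 3600 + +m[2] * 60 + +m[3]);
    } else if (line && line !== "WEBVTT" && !/^\d+$/.test(line)) {
      words += line.split(/\s+/).length;
    }
  }
  return { words, seconds };
}

const stats = vttStats(`WEBVTT

1
00:00:00.000 --> 00:00:02.000
hello world out there`);
console.log(stats); // { words: 4, seconds: 2 }
```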

7. Key Insights

  • Speed: With RTF ≈ 0.268, processing is clearly faster than real time and usable for batch pipelines.
  • Accuracy signal: ~68% WER is too high for quote-level publishing without review.
  • Error profile: Deletion dominance points to coverage mismatch first; optimize pairing/segmentation checks before model tuning.
  • Method robustness: Strict and relaxed metrics are almost identical, so the result is not driven by superficial formatting differences.
  • Representativeness: ~22.7 minutes is a meaningful long-form sample, but still one clip and one configuration.

8. Best Model for This Scenario

Under the narrow scope “Whisper medium, this exact YouTube clip, this exact reference caption source,” the run serves as a transparent baseline. It gives a stable throughput anchor (RTF) and two consistent text-agreement views (strict/relaxed WER/CER) for future A/B comparisons.
It does not imply Whisper medium is universally best for English YouTube ASR; it simply defines a reproducible checkpoint for your own evaluation ladder.

9. Neutral Final Verdict

For drafting, rough indexing, and topic extraction, this setup may still be useful because throughput is practical and outputs are deterministic under the same script.
For verbatim publishing, legal/compliance records, or accessibility-critical subtitles, the current agreement level (about 31.8% accuracy) and deletion-heavy profile imply that manual correction or stronger setup changes are required.
Most importantly, keep the evaluation method fixed (scripts/evaluate-vtt-metrics.js) when iterating models. Consistent methodology is what makes improvements measurable.

Source Materials

Case folder name {case-name} = 20260330.
  • Original audio (video): https://www.youtube.com/watch?v=EatCzpKNTMs — reference subtitles are the caption track from this video (exported to ref.vtt).
  • Reference transcript (VTT): test-transcripts/{case-name}/ref.vtt
  • Model transcript (VTT): test-transcripts/{case-name}/model.vtt
  • Run metadata: test-transcripts/{case-name}/other.yaml
  • Precomputed evaluation metrics: test-transcripts/{case-name}/result.json
Evaluation script used: scripts/evaluate-vtt-metrics.js
For long transcripts, run Node with a higher heap limit when needed (for example: NODE_OPTIONS=--max-old-space-size=8192).

Try It Free Now

Try our AI audio and video service: high-precision speech-to-text transcription, multilingual translation, and intelligent speaker diarization, plus automatic subtitle generation, AI-assisted audio/video editing, and synchronized audio-visual analysis. It covers meeting recordings, short-video creation, podcast production, and more. Start your free trial now!
