I Tested AI Transcription on an English Interview — February 26, 2026 Results (Whisper BASE, ~11‑Minute Audio)

2026-02-26 · Test
Author: Eric King


1. Why This Interview Benchmark Matters

For real interviews, transcription accuracy is not optional. It decides whether you can safely quote guests, search for key topics, and build downstream analysis without misrepresenting what was said. A dropped qualifier, a misheard number, or a mangled proper noun can change the meaning of an answer.
In this benchmark, I ran an English “Bill interview” clip through a Whisper‑based transcription stack and evaluated it on standard ASR metrics. The goal is not marketing, but a concrete, reproducible snapshot of how the system performs on a real, moderately long interview.
The original interview audio corresponds to a YouTube video, which you can reference here for context:
Source interview video on YouTube.

Source Materials

All inputs used for this benchmark live in the repository and can be inspected directly. These files are the only sources used to derive the numbers and conclusions in this post.

Screenshots from this run

SayToWords transcription dashboard — metrics overview
SayToWords transcription dashboard — transcript view

2. Testing Setup

For this run, I used the following configuration (all values are taken from the precomputed metadata and result.json):
  • Date of run: 2026‑02‑26 (derived from processing timestamps)
  • Scenario: English interview (test-transcripts/bill-interview)
  • Language: English
  • Audio duration:
    • audioDurationSeconds = 653.2934375
    • 10.89 minutes of material
  • Processing time:
    • sttProcessingTimeSeconds = 85.476
    • 1.42 minutes end‑to‑end decoding time
  • Model / mode:
    • whisper-model: BASE
    • saytowords-mode: base
Recording conditions, microphone type, and speech density are not explicitly documented in the metadata, so they are left out rather than guessed. All alignment and scoring were completed before this report was generated; the numbers below are read directly from test-transcripts/bill-interview/result.json.

3. Evaluation Methodology

Both the human transcript (ref.vtt) and the model output (model.vtt) are stored in WebVTT format. The evaluation pipeline first extracts plain text from these files, aligns the reference and hypothesis, and then computes error metrics.
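The text-extraction step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name vtt_to_text and the choice to strip inline tags such as <v Speaker> are assumptions, since the post does not describe how cue text is cleaned.

```python
import re

# A cue timing line looks like "00:00:01.000 --> 00:00:04.000" (hours are optional).
TIMING = re.compile(
    r"^\s*(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d+:)?\d{2}:\d{2}\.\d{3}"
)
# Inline cue tags such as <v Bill>, <i>, </i>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Return the cue payload text of a WebVTT document as one plain string."""
    # Drop a byte order mark if present and normalize line endings.
    vtt = vtt.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    pieces = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", vtt):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        for i, ln in enumerate(lines):
            if TIMING.match(ln):
                # Everything after the timing line is spoken text.
                pieces.extend(TAG.sub("", p).strip() for p in lines[i + 1:])
                break
        # Blocks without a timing line (header, NOTE, STYLE) are skipped.
    return " ".join(p for p in pieces if p)


sample = (
    "WEBVTT\n\n"
    "1\n00:00:00.000 --> 00:00:02.000\n<v Bill>Hello there,\nhow are you?\n\n"
    "00:02.000 --> 00:04.000 align:start\nFine.\n"
)
print(vtt_to_text(sample))
```

Running the same function over ref.vtt and model.vtt would yield the two strings that feed the alignment step.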
Word Error Rate (WER)
After tokenizing into word sequences, we count:
  • \(S\): substitutions
  • \(D\): deletions
  • \(I\): insertions
  • \(N\): number of reference words
The word error rate is:
\[ \text{WER} = \frac{S + D + I}{N} \]
Word‑level accuracy is then:
\[ \text{Accuracy} = 1 - \text{WER} \]
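To make the S/D/I counting concrete, here is a small pure-Python sketch of a word-level alignment. It is not the scorer used for this post (which is described as running on a C++ backend); the function names and the lowercase-and-split tokenization are assumptions for illustration.

```python
def word_errors(ref: list[str], hyp: list[str]) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner, classifying each step.
    s = d = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            s += mismatch          # a match costs 0, a substitution costs 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1                 # reference word missing from the hypothesis
            i -= 1
        else:
            ins += 1               # extra word in the hypothesis
            j -= 1
    return s, d, ins


def wer(ref_text: str, hyp_text: str) -> float:
    """WER = (S + D + I) / N, with N the number of reference words."""
    ref = ref_text.lower().split()   # assumed normalization
    hyp = hyp_text.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    s, d, ins = word_errors(ref, hyp)
    return (s + d + ins) / len(ref)


print(word_errors("a b c d".split(), "a x c".split()))  # (1, 1, 0)
print(wer("a b c d", "a x c"))                          # 0.5
```

When several alignments share the same minimum cost, the S/D/I split can differ between implementations even though the total (and therefore WER) does not.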
Character Error Rate (CER)
At character level, whitespace is stripped and a Levenshtein edit distance is computed:
  • Character edit distance: total insertions, deletions, substitutions
  • Total characters: number of reference characters (without spaces)
\[ \text{CER} = \frac{\text{Character edit distance}}{\text{Total characters}} \]
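A minimal sketch of the character-level calculation, assuming whitespace is removed with a plain split-and-join and that no other normalization is applied (the post does not say). The helper names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def cer(ref_text: str, hyp_text: str) -> float:
    """CER = character edit distance / reference characters, spaces removed."""
    ref = "".join(ref_text.split())
    hyp = "".join(hyp_text.split())
    if not ref:
        raise ValueError("reference transcript is empty")
    return levenshtein(ref, hyp) / len(ref)


print(levenshtein("kitten", "sitting"))  # 3
print(cer("ab cd", "ab d"))              # 0.25
```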
Real‑Time Factor (RTF)
Throughput is measured with the real‑time factor:
\[ \text{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}} \]
Here, processing time comes from the difference between processtime-at and completed-at in other.yaml, and audio duration is taken from audio-duration in the same file.
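As a rough sketch of that calculation: the field names processtime-at, completed-at, and audio-duration come from the description above, but the ISO 8601 timestamp format and the timestamp values below are assumptions. The two hypothetical timestamps are simply chosen to be 85.476 seconds apart so the result matches the reported processing time.

```python
from datetime import datetime


def real_time_factor(meta: dict) -> float:
    """RTF = (completed-at - processtime-at) / audio-duration."""
    started = datetime.fromisoformat(meta["processtime-at"])
    finished = datetime.fromisoformat(meta["completed-at"])
    processing_seconds = (finished - started).total_seconds()
    return processing_seconds / float(meta["audio-duration"])


# Hypothetical metadata: only the audio duration is a real value from this run.
meta = {
    "processtime-at": "2026-02-26T10:00:00.000",
    "completed-at": "2026-02-26T10:01:25.476",
    "audio-duration": 653.2934375,
}
print(f"RTF = {real_time_factor(meta):.4f}")  # RTF = 0.1308
```

Because RTF here depends only on wall-clock timestamps, it includes any queueing or I/O that happened between the two marks, not just model decoding.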
Implementation notes
  • All metrics are computed via transcript alignment between reference and hypothesis.
  • Edit distances (word‑ and character‑level) use a high‑performance Levenshtein implementation.
  • The alignment engine runs on a C++‑optimized backend.
  • Time complexity of alignment is O(nm) for sequences of length \(n\) and \(m\).
  • All values in result.json are deterministic and reproducible: given the same inputs, the scorer always produces the same numbers.

4. Model Overview

Only one model configuration was evaluated in this run:
  • Whisper BASE (saytowords-mode: base)
    A general‑purpose speech‑to‑text model; base is the second‑smallest of the standard Whisper sizes (about 74M parameters). In this benchmark, it is used as‑is (no fine‑tuning, no manual correction) to show raw behavior on a real interview.
Future comparisons could add smaller or larger Whisper variants and non‑Whisper systems, but this post focuses on characterizing this single baseline.

5. Results (From result.json)

The following values are taken exactly from test-transcripts/bill-interview/result.json:
  • Audio duration (s): 653.2934375
  • Processing time (s): 85.476
  • Reference words (N): 1846
  • Substitutions (S): 67
  • Deletions (D): 178
  • Insertions (I): 23
  • WER: 0.14517876489707476
  • Accuracy: 0.8548212351029252
  • Reference characters: 7335
  • Character edit distance: 825
  • CER: 0.11247443762781185
  • RTF: 0.13083860191079907
For convenience:
  • WER ≈ 14.52%
  • Accuracy ≈ 85.48%
  • CER ≈ 11.25%
  • RTF ≈ 0.13, i.e. roughly 7.6× faster than real time.
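The derived metrics can be re-checked from the raw counts with a few lines of arithmetic. The snippet below only re-applies the formulas from the methodology section to the values listed above; the variable names are illustrative.

```python
import math

# Raw values as listed above.
N, S, D, I = 1846, 67, 178, 23
ref_chars, char_edits = 7335, 825
audio_s, processing_s = 653.2934375, 85.476

wer = (S + D + I) / N
accuracy = 1 - wer
cer = char_edits / ref_chars
rtf = processing_s / audio_s
speedup = 1 / rtf  # how many times faster than real time

# Compare against the reported values.
assert math.isclose(wer, 0.14517876489707476, rel_tol=1e-6)
assert math.isclose(accuracy, 0.8548212351029252, rel_tol=1e-6)
assert math.isclose(cer, 0.11247443762781185, rel_tol=1e-6)
assert math.isclose(rtf, 0.13083860191079907, rel_tol=1e-6)

print(f"WER {wer:.2%}  Accuracy {accuracy:.2%}  CER {cer:.2%}")
print(f"RTF {rtf:.2f}  ({speedup:.1f}x faster than real time)")
```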

6. Error Pattern Analysis

No per‑error segment markers or timestamps were provided for targeted inspection, so this analysis is based purely on the aggregate counts.
  • Dominant error type: deletions
    • Deletions: D = 178
    • Substitutions: S = 67
    • Insertions: I = 23
      Deletions make up about two thirds of word‑level errors (178 of 268). This indicates that the model mostly drops words rather than hallucinating extra content. In the context of an interview, this typically translates to missing function words, trailing words in fast speech, or pieces of overlapping speech that the model resolves by omission.
  • Substitutions are present but secondary
    With S = 67, substitutions represent roughly a quarter of all errors. These usually correspond to lexical confusions: similar‑sounding words, misrecognized names, or domain terms the model has not seen often enough.
  • Insertions are relatively rare
    Only I = 23 insertions were observed, about 9% of all errors. This is consistent with a model that is conservative about hallucinating content: it errs more by omission than by adding spurious words.
At the character level:
  • Character edit distance = 825 over 7335 characters, yielding CER ≈ 11.25%.
    Compared to the WER of ~14.5%, this lower CER suggests that, when errors occur, there is often partial character overlap—e.g., minor inflections, small spelling differences, or broken compounds—rather than completely unrelated strings.
Without timestamp‑level error markers, we can’t point to specific moments in the interview where the model failed. However, the S/D/I breakdown already gives a usable profile: this system is more likely to under‑transcribe than to invent passages that aren’t there.
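The error profile can be expressed as shares of the total error count and as rates per reference word. This is plain arithmetic on the reported S/D/I counts; nothing here inspects the transcripts themselves.

```python
N = 1846  # reference words
errors = {"substitutions": 67, "deletions": 178, "insertions": 23}
total = sum(errors.values())  # 268 word-level errors

# Share of all errors, and contribution to WER, for each error type.
shares = {kind: count / total for kind, count in errors.items()}
rates = {kind: count / N for kind, count in errors.items()}

for kind in errors:
    print(f"{kind:<13} {errors[kind]:>4}  "
          f"{shares[kind]:6.1%} of errors  {rates[kind]:6.2%} of reference words")
```

The per-word rates sum to the overall WER, so deletions alone account for roughly 9.6 of the 14.5 percentage points.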

7. Key Insights

Based strictly on the numerical metrics:
  1. Speed vs. accuracy is well balanced for interviews
    With RTF ≈ 0.13, the system processes ~10.9 minutes of audio in ~1.4 minutes while keeping WER ≈ 14.5% and CER ≈ 11.2%. For bulk processing of interviews, this is a practical operating point.
  2. Errors are heavily skewed toward deletions
    Deletions (178) dominate over substitutions (67) and insertions (23). In practice, that means you’re more likely to lose small chunks of content than to see the model fabricate phrases wholesale.
  3. Character‑level stability is better than word‑level
    CER being lower than WER indicates that many incorrect words are still close to the reference at the character level. This is good news for tasks like search and topic clustering that can tolerate mild lexical variation.
  4. Evaluation is based on a non‑trivial amount of speech
    With 1846 reference words and 7335 characters, this is closer to a real interview than a toy example. The metrics represent sustained behavior across roughly 11 minutes of spontaneous speech.

8. Best Model for This Scenario

In this benchmark, only Whisper BASE (base mode) was tested, so it is simultaneously:
  • The strongest model in this run, and
  • The only point of reference.
Within that constraint, it delivers:
  • WER ≈ 14.5%, Accuracy ≈ 85.5% on ~11 minutes of interview audio.
  • RTF ≈ 0.13, i.e. 7–8× faster‑than‑real‑time decoding.
For workflows that need quick, reasonably accurate interview transcripts—especially for browsing, search, or rough quoting—this configuration is numerically adequate. For use cases where every word must be perfect, these metrics also make clear that manual review or a stronger model would still be required.

9. Neutral Final Verdict

On this specific English interview, processed on February 26, 2026, Whisper BASE in “base” mode shows:
  • A deletion‑heavy error profile with relatively few insertions.
  • Mid‑teens WER and low‑teens CER, backed by a non‑trivial reference transcript.
  • A Real‑Time Factor around 0.13, making it suitable for large‑scale batch processing.
The behavior is numerically consistent, reproducible, and fast enough for daily benchmarking. For an independent evaluator, the takeaway is straightforward: this setup is a viable baseline for interview transcription, but not yet a replacement for human review in highly sensitive domains.
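For a sense of what an RTF of about 0.13 means for batch work, the estimate below scales the measured ratio to a larger backlog. It assumes the RTF observed on this single ~11-minute clip holds for other audio and that work splits evenly across workers; neither was tested here, and the function name is illustrative.

```python
def estimated_processing_hours(audio_hours: float, rtf: float, workers: int = 1) -> float:
    """Estimate wall-clock hours to transcribe a backlog at a fixed RTF."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return audio_hours * rtf / workers


RTF = 85.476 / 653.2934375  # measured on this run

print(f"{estimated_processing_hours(100, RTF):.1f} h")             # 13.1 h
print(f"{estimated_processing_hours(100, RTF, workers=4):.1f} h")  # 3.3 h
```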
