
Whisper JavaScript Example: Speech to Text with Node.js
Eric King
Whisper is a powerful speech-to-text model widely used for voice to text, audio transcription, and long-form speech recognition.
In this article, you'll learn how to use Whisper with JavaScript (Node.js) to convert audio files into text.
This guide is suitable for:
- Developers building voice to text features
- SaaS products using audio transcription
- Anyone looking for a Whisper JavaScript example
What Is Whisper?
Whisper is an automatic speech recognition (ASR) model that can:
- Transcribe speech into text
- Detect spoken language automatically
- Handle long audio files
- Work well with noisy recordings
It's commonly used for:
- Podcasts
- Meetings
- Interviews
- Video subtitles
Prerequisites
Before starting, make sure you have:
- Node.js 18+
- An audio file (mp3, wav, m4a, etc.)
- An API key for speech-to-text (Whisper-compatible)
Install dependencies:
npm install openai
Because the example below uses ES module import syntax, also add "type": "module" to your package.json (or use the .mjs file extension).
Basic Whisper JavaScript Example
Below is a minimal Node.js example that sends an audio file to Whisper and returns the transcription.
Project Structure
project/
├── audio/
│   └── sample.mp3
├── transcribe.js
└── package.json
JavaScript Code: Audio to Text
import fs from "fs";
import OpenAI from "openai";

// Reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function transcribeAudio() {
  // Stream the audio file to the transcription endpoint
  const response = await openai.audio.transcriptions.create({
    file: fs.createReadStream("./audio/sample.mp3"),
    model: "whisper-1"
  });

  console.log("Transcription result:");
  console.log(response.text);
}

transcribeAudio().catch(console.error);
Run the script (with OPENAI_API_KEY set in your environment):
node transcribe.js
Output example:
Hello everyone, welcome to today's meeting. We will discuss the project timeline.
Transcribing Long Audio Files
Whisper works well with long recordings, such as:
- Podcasts
- Lectures
- Interviews
For very large files, common best practices include:
- Splitting audio into chunks
- Transcribing asynchronously
- Merging results afterward
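The steps above can be sketched in Node.js. This is a minimal sketch, not a production implementation: it assumes ffmpeg is installed and on your PATH, and splitAudio and mergeTranscripts are hypothetical helper names. Each chunk would then be transcribed with the same transcriptions call shown earlier, and the results merged in order.

```javascript
import { execFileSync } from "child_process";
import fs from "fs";

// Hypothetical helper: split a long recording into fixed-length chunks
// using ffmpeg's segment muxer (assumes ffmpeg is on PATH).
function splitAudio(inputPath, outDir, chunkSeconds = 600) {
  fs.mkdirSync(outDir, { recursive: true });
  execFileSync("ffmpeg", [
    "-i", inputPath,
    "-f", "segment",
    "-segment_time", String(chunkSeconds),
    "-c", "copy",
    `${outDir}/chunk-%03d.mp3`
  ]);
  // Return chunk file names in playback order
  return fs.readdirSync(outDir).filter(f => f.startsWith("chunk-")).sort();
}

// Hypothetical helper: merge per-chunk transcripts back into one text,
// dropping empty chunks and normalizing whitespace.
function mergeTranscripts(chunkTexts) {
  return chunkTexts.map(t => t.trim()).filter(Boolean).join(" ");
}
```

Transcribing the chunks concurrently (for example with Promise.all over a small batch) speeds things up, but keep the merge ordered by chunk index so the final transcript reads correctly.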
Getting Timestamps (Optional)
Some Whisper-based systems support timestamps at the sentence or word level.
This is useful for:
- Subtitles (SRT / VTT)
- Video editing
- Searchable transcripts
Example output format:
[00:00:01] Hello everyone
[00:00:05] Welcome to today's meeting
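With the OpenAI API, you can request segment timing by passing response_format: "verbose_json" to the same transcriptions call shown earlier; the response then includes a segments array. The sketch below assumes that segment shape (start/end in seconds plus text) with hard-coded illustrative data, and formats it into the timestamped output above.

```javascript
// Format a duration in seconds as [HH:MM:SS]
function formatTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const hh = String(Math.floor(s / 3600)).padStart(2, "0");
  const mm = String(Math.floor((s % 3600) / 60)).padStart(2, "0");
  const ss = String(s % 60).padStart(2, "0");
  return `[${hh}:${mm}:${ss}]`;
}

// Illustrative segments, shaped like those in a verbose_json response
const segments = [
  { start: 1.0, end: 4.0, text: " Hello everyone" },
  { start: 5.0, end: 9.0, text: " Welcome to today's meeting" }
];

for (const seg of segments) {
  console.log(`${formatTimestamp(seg.start)} ${seg.text.trim()}`);
}
// prints:
// [00:00:01] Hello everyone
// [00:00:05] Welcome to today's meeting
```

The same loop can emit SRT or VTT cues by also formatting each segment's end time.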
Supported Audio Formats
Whisper supports most common formats:
- MP3
- WAV
- M4A
- MP4
- WEBM
For best accuracy:
- Use clear audio
- Avoid heavy background noise
- Prefer WAV or high-bitrate MP3
Common Use Cases
- Voice to text for meetings
- Podcast transcription
- YouTube video subtitles
- Interview transcription
- Research and academic transcription
Whisper vs Other Speech-to-Text Tools
| Feature | Whisper |
|---|---|
| Long audio support | ✅ |
| Multi-language | ✅ |
| Open-source model | ✅ |
| JavaScript support | ✅ |
| Timestamp support | ✅ |
Whisper is especially strong for long-form voice to text compared to many real-time-only solutions.
Conclusion
This Whisper JavaScript example shows how easy it is to build a voice to text feature using Node.js.
With just a few lines of code, you can transcribe audio files accurately and scale it for real-world applications.
If you're building a speech-to-text SaaS, Whisper is a solid foundation for:
- Long audio transcription
- Multilingual voice to text
- Timestamped transcripts
