
Whisper Transcript Formatting: Complete Guide to Formatting Speech-to-Text Output
Eric King
Author
When using OpenAI Whisper for speech-to-text transcription, the raw output is just the beginning. Formatting your transcripts properly makes them more useful, readable, and compatible with different applications and workflows.
This guide covers how to format Whisper transcripts, with code examples for several output formats, best practices, and real-world use cases.
Why Format Whisper Transcripts?
Raw Whisper output provides the transcribed text, but formatted transcripts offer:
- Better readability with proper structure and timestamps
- Subtitle compatibility (SRT, VTT) for video platforms
- Structured data (JSON) for programmatic processing
- Professional presentation (DOCX, PDF) for documentation
- Search and navigation with timestamps and segments
- Speaker identification and diarization formatting
Understanding Whisper Output Structure
Whisper returns a dictionary with the following structure:
{
    "text": "Full transcription text...",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 5.2,
            "text": "Segment text...",
            "tokens": [1234, 5678, ...],
            "temperature": 0.0,
            "avg_logprob": -0.5,
            "compression_ratio": 1.2,
            "no_speech_prob": 0.1
        },
        ...
    ],
    "language": "en"
}
Key fields:
- text: Complete transcription as a single string
- segments: List of time-stamped segments
- language: Detected language code
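As a rough illustration of how these fields can be used, the sketch below drops segments that are probably silence or noise, based on `no_speech_prob`. The helper name `speech_segments` and the 0.6 threshold are assumptions made for this example, and the sample dictionary is hand-written in the shape shown above rather than real model output.

```python
def speech_segments(result, max_no_speech_prob=0.6):
    """Return segments whose no_speech_prob is at or below the threshold.

    The default threshold is an illustrative assumption; tune it for your audio.
    """
    return [
        seg for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]


# Hand-written sample in the same shape as a Whisper result
sample_result = {
    "text": " Hello everyone. [noise]",
    "segments": [
        {"id": 0, "start": 0.0, "end": 5.2, "text": " Hello everyone.", "no_speech_prob": 0.1},
        {"id": 1, "start": 5.2, "end": 7.0, "text": " [noise]", "no_speech_prob": 0.9},
    ],
    "language": "en",
}

kept = speech_segments(sample_result)
print([seg["id"] for seg in kept])
```

Filtering before formatting keeps empty or hallucinated cues out of every output format that follows.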
Format 1: Plain Text (TXT)
The simplest format, suitable for basic documentation and reading.
Basic Text Formatting
import whisper

def format_as_text(result):
    """Format Whisper output as plain text."""
    return result["text"]

# Usage
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
formatted_text = format_as_text(result)

# Save to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(formatted_text)
Enhanced Text Formatting with Timestamps
def format_text_with_timestamps(result):
    """Format with timestamps for each segment."""
    formatted = []
    for segment in result["segments"]:
        start_time = format_time(segment["start"])
        end_time = format_time(segment["end"])
        text = segment["text"].strip()
        formatted.append(f"[{start_time} - {end_time}] {text}")
    return "\n\n".join(formatted)

def format_time(seconds):
    """Format seconds to HH:MM:SS."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

# Usage
formatted = format_text_with_timestamps(result)
with open("transcript_timestamped.txt", "w", encoding="utf-8") as f:
    f.write(formatted)
Output example:
[00:00:00 - 00:00:05] Hello everyone, welcome to today's meeting.
[00:00:05 - 00:00:12] We will discuss the project timeline and upcoming milestones.
Format 2: SRT (SubRip Subtitle)
SRT is the most common subtitle format, compatible with YouTube, Vimeo, and most video players.
SRT Formatting Function
def format_as_srt(result):
    """Format Whisper output as SRT subtitles."""
    srt_content = []
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_srt_timestamp(segment["start"])
        end_time = format_srt_timestamp(segment["end"])
        text = segment["text"].strip()
        srt_content.append(f"{i}")
        srt_content.append(f"{start_time} --> {end_time}")
        srt_content.append(text)
        srt_content.append("")  # Empty line between entries
    return "\n".join(srt_content)

def format_srt_timestamp(seconds):
    """Format seconds to SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds so floating-point error cannot
    # truncate a value such as 1.001 s to 000 ms.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Usage
model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=False)
srt_content = format_as_srt(result)
with open("transcript.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
SRT Output example:
1
00:00:00,000 --> 00:00:05,200
Hello everyone, welcome to today's meeting.

2
00:00:05,200 --> 00:00:12,500
We will discuss the project timeline and upcoming milestones.
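When checking generated subtitles, it can help to read SRT timing lines back into seconds. The following is a minimal sketch of such a parser; the names `parse_srt_timestamp` and `parse_srt_timing_line` are hypothetical helpers for this example, and the pattern assumes the `HH:MM:SS,mmm` layout shown above.

```python
import re

# HH:MM:SS,mmm (hours may have more than two digits for very long files)
SRT_TIME = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def parse_srt_timestamp(stamp):
    """Convert an SRT timestamp string to seconds as a float."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Not an SRT timestamp: {stamp!r}")
    hours, minutes, secs, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def parse_srt_timing_line(line):
    """Split 'start --> end' into a (start_seconds, end_seconds) tuple."""
    start, end = line.split("-->")
    return parse_srt_timestamp(start), parse_srt_timestamp(end)


start, end = parse_srt_timing_line("00:00:05,200 --> 00:00:12,500")
print(start, end)
```

Comparing the parsed values with the original segment times is a quick way to confirm that formatting did not shift any cue.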
Advanced SRT with Word-Level Timestamps
def format_srt_with_words(result):
    """Create SRT with word-level timing for better synchronization."""
    if not result.get("segments") or not result["segments"][0].get("words"):
        # Fallback to segment-level if word timestamps not available
        return format_as_srt(result)

    srt_content = []
    subtitle_index = 1
    current_subtitle_words = []
    current_start = None
    current_end = None

    for segment in result["segments"]:
        words = segment.get("words", [])
        for word_info in words:
            word = word_info["word"].strip()
            start = word_info["start"]
            end = word_info["end"]
            if current_start is None:
                current_start = start
            current_subtitle_words.append(word)
            current_end = end

            # Create subtitle every ~3 seconds or 10 words
            if (end - current_start > 3.0) or (len(current_subtitle_words) >= 10):
                text = " ".join(current_subtitle_words)
                srt_content.append(f"{subtitle_index}")
                srt_content.append(f"{format_srt_timestamp(current_start)} --> {format_srt_timestamp(current_end)}")
                srt_content.append(text)
                srt_content.append("")
                subtitle_index += 1
                current_subtitle_words = []
                current_start = None
                current_end = None

        # Handle remaining words in segment
        if current_subtitle_words:
            text = " ".join(current_subtitle_words)
            srt_content.append(f"{subtitle_index}")
            srt_content.append(f"{format_srt_timestamp(current_start)} --> {format_srt_timestamp(current_end)}")
            srt_content.append(text)
            srt_content.append("")
            subtitle_index += 1
            current_subtitle_words = []
            current_start = None
            current_end = None

    return "\n".join(srt_content)

# Usage with word timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
srt_content = format_srt_with_words(result)
Format 3: VTT (WebVTT)
WebVTT is the web standard for subtitles, used by HTML5 video players and web applications.
VTT Formatting Function
def format_as_vtt(result):
    """Format Whisper output as WebVTT subtitles."""
    vtt_content = ["WEBVTT", ""]  # VTT header
    for segment in result["segments"]:
        start_time = format_vtt_timestamp(segment["start"])
        end_time = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt_content.append(f"{start_time} --> {end_time}")
        vtt_content.append(text)
        vtt_content.append("")  # Empty line between entries
    return "\n".join(vtt_content)

def format_vtt_timestamp(seconds):
    """Format seconds to VTT timestamp (HH:MM:SS.mmm)."""
    # Work in whole milliseconds so floating-point error cannot
    # truncate a value such as 1.001 s to 000 ms.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

# Usage
vtt_content = format_as_vtt(result)
with open("transcript.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)
VTT Output example:
WEBVTT

00:00:00.000 --> 00:00:05.200
Hello everyone, welcome to today's meeting.

00:00:05.200 --> 00:00:12.500
We will discuss the project timeline and upcoming milestones.
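Because the two formats differ mainly in the header, the cue numbers, and the millisecond separator, an existing SRT file can often be converted to VTT directly. The sketch below shows one way to do that; `srt_to_vtt` is a hypothetical helper, and it assumes well-formed SRT input like the example in the previous section.

```python
import re

# Matches HH:MM:SS,mmm so the comma can be swapped for a dot
_SRT_STAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT text."""
    lines = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        if not block.strip():
            continue
        block_lines = block.split("\n")
        # Drop the numeric index line; cue identifiers are optional in VTT
        if block_lines[0].strip().isdigit():
            block_lines = block_lines[1:]
        if not block_lines:
            continue
        # Only the timing line is rewritten; commas in the text are kept
        block_lines[0] = _SRT_STAMP.sub(r"\1.\2", block_lines[0])
        lines.extend(block_lines)
        lines.append("")
    return "\n".join(lines)


sample_srt = (
    "1\n"
    "00:00:00,000 --> 00:00:05,200\n"
    "Hello everyone, welcome to today's meeting.\n"
    "\n"
    "2\n"
    "00:00:05,200 --> 00:00:12,500\n"
    "We will discuss the project timeline and upcoming milestones.\n"
)
print(srt_to_vtt(sample_srt))
```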
Enhanced VTT with Styling
def format_vtt_with_styling(result, title="Transcription"):
    """Create VTT with a title and metadata header lines."""
    vtt_content = [
        # Text after "WEBVTT" on the first line is allowed by the format
        f"WEBVTT - {title}",
        # Metadata lines in the style of YouTube caption exports;
        # players are expected to skip them
        "Kind: captions",
        f"Language: {result.get('language', 'en')}",
        ""
    ]
    for segment in result["segments"]:
        start_time = format_vtt_timestamp(segment["start"])
        end_time = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        vtt_content.append(f"{start_time} --> {end_time}")
        vtt_content.append(text)
        vtt_content.append("")
    return "\n".join(vtt_content)
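WebVTT cue text can also carry voice spans of the form `<v Name>`, which players may use to style or label speakers. The sketch below is one possible way to emit them, assuming each segment optionally carries a `speaker` key (as in the speaker-label examples later in this guide). The names `format_vtt_with_voices`, `_vtt_time`, and `_escape_cue_text` are hypothetical helpers for this example.

```python
def _vtt_time(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def _escape_cue_text(text):
    """Escape characters that have special meaning in WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def format_vtt_with_voices(segments):
    """Build WebVTT text, prefixing cues with a voice span when a speaker is set."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        text = _escape_cue_text(seg["text"].strip())
        speaker = seg.get("speaker")
        if speaker:
            text = f"<v {speaker}>{text}"
        lines.append(f"{_vtt_time(seg['start'])} --> {_vtt_time(seg['end'])}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)


# Hand-written sample segments
sample_segments = [
    {"start": 0.0, "end": 5.2, "text": " Hello everyone.", "speaker": "Alice"},
    {"start": 5.2, "end": 12.5, "text": " Q&A comes later."},
]
vtt_text = format_vtt_with_voices(sample_segments)
print(vtt_text)
```

How voice spans are rendered depends on the player, so test the output in the player you target.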
Format 4: JSON (Structured Data)
JSON format preserves all Whisper metadata and is ideal for programmatic processing.
Basic JSON Formatting
import json

def format_as_json(result, pretty=True):
    """Format Whisper output as JSON."""
    if pretty:
        return json.dumps(result, indent=2, ensure_ascii=False)
    else:
        return json.dumps(result, ensure_ascii=False)

# Usage
json_content = format_as_json(result)
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_content)
Custom JSON Structure
def format_custom_json(result, metadata=None):
    """Create custom JSON structure with additional metadata."""
    custom_result = {
        "metadata": {
            "language": result.get("language", "unknown"),
            "duration": result["segments"][-1]["end"] if result.get("segments") else 0,
            "segment_count": len(result.get("segments", [])),
            **(metadata or {})
        },
        "transcription": {
            "full_text": result["text"],
            "segments": [
                {
                    "id": seg["id"],
                    "start": seg["start"],
                    "end": seg["end"],
                    "text": seg["text"].strip(),
                    "duration": seg["end"] - seg["start"]
                }
                for seg in result.get("segments", [])
            ]
        }
    }
    return json.dumps(custom_result, indent=2, ensure_ascii=False)

# Usage with metadata
metadata = {
    "source_file": "meeting_audio.mp3",
    "transcribed_at": "2026-01-15T10:30:00Z",
    "model": "whisper-base"
}
json_content = format_custom_json(result, metadata)
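One benefit of the structured form is that a saved transcript can be searched without re-running the model. As a minimal sketch, the hypothetical helper `find_segments` below loads JSON in the custom structure above and returns the start time and text of every segment that mentions a keyword. The sample data is hand-written for illustration.

```python
import json


def find_segments(transcript_json, keyword):
    """Return (start, text) pairs for segments containing the keyword."""
    data = json.loads(transcript_json)
    needle = keyword.casefold()
    return [
        (seg["start"], seg["text"])
        for seg in data["transcription"]["segments"]
        if needle in seg["text"].casefold()
    ]


# Hand-written sample in the custom structure
sample_json = json.dumps({
    "metadata": {"language": "en", "duration": 12.5, "segment_count": 2},
    "transcription": {
        "full_text": "Hello everyone. We will discuss the project timeline.",
        "segments": [
            {"id": 0, "start": 0.0, "end": 5.2, "text": "Hello everyone.", "duration": 5.2},
            {"id": 1, "start": 5.2, "end": 12.5,
             "text": "We will discuss the project timeline.", "duration": 7.3},
        ],
    },
})

hits = find_segments(sample_json, "Timeline")
print(hits)
```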
Format 5: DOCX (Microsoft Word)
For professional documents and reports, DOCX format provides rich formatting options.
DOCX Formatting with python-docx
# Install: pip install python-docx
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH

def format_as_docx(result, output_path="transcript.docx", title="Transcription"):
    """Format Whisper output as DOCX document."""
    doc = Document()

    # Add title
    title_para = doc.add_heading(title, 0)
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

    # Add metadata
    doc.add_paragraph(f"Language: {result.get('language', 'Unknown')}")
    doc.add_paragraph(f"Total Segments: {len(result.get('segments', []))}")
    doc.add_paragraph("")  # Empty line

    # Add full transcription
    doc.add_heading("Full Transcription", level=1)
    full_text_para = doc.add_paragraph(result["text"])
    full_text_para.style = 'Normal'

    # Add segmented transcription with timestamps
    doc.add_heading("Segmented Transcription", level=1)
    for segment in result.get("segments", []):
        start_time = format_time(segment["start"])
        end_time = format_time(segment["end"])
        text = segment["text"].strip()

        # Timestamp paragraph
        time_para = doc.add_paragraph()
        time_run = time_para.add_run(f"[{start_time} - {end_time}]")
        time_run.bold = True
        time_run.font.color.rgb = RGBColor(0, 100, 200)

        # Text paragraph
        text_para = doc.add_paragraph(text)
        text_para.style = 'List Paragraph'

    # Save document
    doc.save(output_path)
    print(f"✓ DOCX saved: {output_path}")

# Usage
format_as_docx(result, "transcript.docx", "Meeting Transcription")
Enhanced DOCX with Speaker Labels
def format_docx_with_speakers(result, speakers=None, output_path="transcript.docx"):
    """Create DOCX with speaker identification."""
    doc = Document()
    doc.add_heading("Meeting Transcription", 0)
    if speakers:
        doc.add_paragraph(f"Participants: {', '.join(speakers)}")
        doc.add_paragraph("")  # Empty line

    for segment in result.get("segments", []):
        start_time = format_time(segment["start"])
        speaker = segment.get("speaker", "Unknown")
        text = segment["text"].strip()

        # Speaker and timestamp
        header_para = doc.add_paragraph()
        header_run = header_para.add_run(f"{speaker} [{start_time}]")
        header_run.bold = True
        header_run.font.size = Pt(11)

        # Text
        text_para = doc.add_paragraph(text)
        text_para.style = 'List Paragraph'

        # An empty run does not create a blank line, so add an empty paragraph
        doc.add_paragraph("")

    doc.save(output_path)
Format 6: CSV (Spreadsheet Format)
CSV format is useful for data analysis and spreadsheet applications.
CSV Formatting
import csv

def format_as_csv(result, output_path="transcript.csv"):
    """Format Whisper output as CSV."""
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header
        writer.writerow(["Segment ID", "Start Time", "End Time", "Duration", "Text"])
        # Data rows
        for segment in result.get("segments", []):
            segment_id = segment.get("id", 0)
            start = segment["start"]
            end = segment["end"]
            duration = end - start
            text = segment["text"].strip()
            writer.writerow([segment_id, start, end, duration, text])
    print(f"✓ CSV saved: {output_path}")

# Usage
format_as_csv(result)
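To show the kind of analysis the CSV enables, the sketch below reads rows back with the standard `csv` module and totals the speaking time. The helper name `load_csv_rows` is an assumption for this example, and the sample text is hand-written with the same header row that `format_as_csv` produces.

```python
import csv
import io


def load_csv_rows(csv_text):
    """Parse transcript CSV text into dicts with numeric time columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "id": int(row["Segment ID"]),
            "start": float(row["Start Time"]),
            "end": float(row["End Time"]),
            "duration": float(row["Duration"]),
            "text": row["Text"],
        })
    return rows


# Hand-written sample; the quoted field shows that commas in text survive
sample_csv = (
    "Segment ID,Start Time,End Time,Duration,Text\r\n"
    '0,0.0,5.2,5.2,"Hello everyone, welcome."\r\n'
    "1,5.2,12.5,7.3,We will discuss the timeline.\r\n"
)

rows = load_csv_rows(sample_csv)
total_seconds = sum(row["duration"] for row in rows)
print(f"{len(rows)} segments, {total_seconds:.1f}s of speech")
```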
Complete Formatting Utility Class
Here's a comprehensive utility class that handles all formats:
import csv
import io
import json
from pathlib import Path

import whisper


class WhisperFormatter:
    """Utility class for formatting Whisper transcription results."""

    # Format name -> file name ending. "txt_ts" gets its own ending so it
    # does not overwrite the plain "txt" export.
    SUFFIXES = {
        "txt": ".txt",
        "txt_ts": "_timestamped.txt",
        "srt": ".srt",
        "vtt": ".vtt",
        "json": ".json",
        "csv": ".csv",
    }

    def __init__(self, result):
        self.result = result
        self.segments = result.get("segments", [])
        self.language = result.get("language", "unknown")

    def to_text(self, include_timestamps=False):
        """Convert to plain text."""
        if include_timestamps:
            lines = []
            for seg in self.segments:
                start = self._format_time(seg["start"])
                end = self._format_time(seg["end"])
                text = seg["text"].strip()
                lines.append(f"[{start} - {end}] {text}")
            return "\n\n".join(lines)
        return self.result["text"]

    def to_srt(self):
        """Convert to SRT subtitle format."""
        srt_lines = []
        for i, seg in enumerate(self.segments, start=1):
            start = self._format_timestamp(seg["start"], ",")
            end = self._format_timestamp(seg["end"], ",")
            text = seg["text"].strip()
            srt_lines.append(f"{i}\n{start} --> {end}\n{text}\n")
        return "\n".join(srt_lines)

    def to_vtt(self):
        """Convert to WebVTT format."""
        vtt_lines = ["WEBVTT", ""]
        for seg in self.segments:
            start = self._format_timestamp(seg["start"], ".")
            end = self._format_timestamp(seg["end"], ".")
            text = seg["text"].strip()
            vtt_lines.append(f"{start} --> {end}\n{text}\n")
        return "\n".join(vtt_lines)

    def to_json(self, pretty=True):
        """Convert to JSON format."""
        if pretty:
            return json.dumps(self.result, indent=2, ensure_ascii=False)
        return json.dumps(self.result, ensure_ascii=False)

    def to_csv(self):
        """Convert to CSV format."""
        output = io.StringIO()
        writer = csv.writer(output)
        writer.writerow(["ID", "Start", "End", "Duration", "Text"])
        for seg in self.segments:
            writer.writerow([
                seg.get("id", 0),
                seg["start"],
                seg["end"],
                seg["end"] - seg["start"],
                seg["text"].strip()
            ])
        return output.getvalue()

    def save(self, output_path, format="txt"):
        """Save transcription in the specified format.

        output_path is a base path without an extension; the ending for
        the chosen format is appended to it.
        """
        format = format.lower()
        converters = {
            "txt": self.to_text,
            "txt_ts": lambda: self.to_text(include_timestamps=True),
            "srt": self.to_srt,
            "vtt": self.to_vtt,
            "json": self.to_json,
            "csv": self.to_csv,
        }
        if format not in converters:
            raise ValueError(f"Unsupported format: {format}")
        content = converters[format]()

        # Append rather than replace the suffix, so a base name that
        # contains a dot (e.g. "my.meeting") is kept intact
        output_path = Path(output_path)
        file_path = output_path.with_name(output_path.name + self.SUFFIXES[format])

        # newline="" keeps the line endings produced by csv.writer intact
        with open(file_path, "w", encoding="utf-8", newline="") as f:
            f.write(content)
        print(f"✓ Saved: {file_path}")
        return file_path

    def _format_time(self, seconds):
        """Format seconds to HH:MM:SS."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"

    def _format_timestamp(self, seconds, separator):
        """Format seconds to HH:MM:SS<separator>mmm ("," for SRT, "." for VTT)."""
        # Work in whole milliseconds to avoid float truncation errors
        total_ms = int(round(seconds * 1000))
        hours, remainder = divmod(total_ms, 3_600_000)
        minutes, remainder = divmod(remainder, 60_000)
        secs, millis = divmod(remainder, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


# Usage example
model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)
formatter = WhisperFormatter(result)

# Save in multiple formats
formatter.save("transcript", format="txt")
formatter.save("transcript", format="srt")
formatter.save("transcript", format="vtt")
formatter.save("transcript", format="json")
formatter.save("transcript", format="csv")
Best Practices for Transcript Formatting
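1. Enable Word Timestamps for Finer Timing
# Word-level timestamps do not change the transcribed text; they add
# per-word start and end times, which allow shorter, better-aligned cues.
result = model.transcribe(
    "audio.mp3",
    word_timestamps=True  # Needed only for word-level cues; segment-level SRT/VTT works without it
)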
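With word timestamps enabled, each segment carries a `words` list whose entries include `word`, `start`, and `end`, as used by the word-level SRT function earlier. The sketch below flattens those entries into one sequence; `iter_words` is a hypothetical helper, and the sample result is hand-written rather than real model output.

```python
def iter_words(result):
    """Yield (word, start, end) for every word in every segment."""
    for seg in result.get("segments", []):
        # Segments have no "words" key when word_timestamps was off
        for word_info in seg.get("words", []):
            yield word_info["word"].strip(), word_info["start"], word_info["end"]


# Hand-written sample in the shape of a result with word timestamps
sample_result = {
    "segments": [
        {
            "start": 0.0,
            "end": 1.1,
            "text": " Hello everyone",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4},
                {"word": " everyone", "start": 0.4, "end": 1.1},
            ],
        },
        {"start": 1.1, "end": 2.0, "text": " (no word data)"},
    ]
}

words = list(iter_words(sample_result))
print(words)
```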
2. Handle Long Segments
def split_long_segments(segments, max_duration=5.0):
    """Split segments longer than max_duration."""
    split_segments = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration > max_duration:
            # Split into smaller chunks
            words = seg.get("words", [])
            if words:
                chunk_start = seg["start"]
                chunk_words = []
                for word_info in words:
                    chunk_words.append(word_info["word"].strip())
                    if word_info["end"] - chunk_start > max_duration:
                        split_segments.append({
                            "start": chunk_start,
                            "end": word_info["end"],
                            "text": " ".join(chunk_words)
                        })
                        chunk_start = word_info["end"]
                        chunk_words = []
                # Add remaining words
                if chunk_words:
                    split_segments.append({
                        "start": chunk_start,
                        "end": seg["end"],
                        "text": " ".join(chunk_words)
                    })
            else:
                # No word timestamps available, so keep the segment whole
                split_segments.append(seg)
        else:
            split_segments.append(seg)
    return split_segments
3. Clean and Normalize Text
import re

def clean_transcript_text(text):
    """Clean and normalize transcript text."""
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove stray spaces around punctuation
    text = text.replace(" ' ", "'")
    text = text.replace(" ,", ",")
    text = text.replace(" .", ".")
    text = text.replace(" ?", "?")
    text = text.replace(" !", "!")

    # Capitalize the first letter of each sentence. str.capitalize() is
    # avoided because it lowercases the rest of the sentence, which would
    # damage names and acronyms.
    sentences = re.split(r'([.!?]\s+)', text)
    text = ''.join(s[:1].upper() + s[1:] if i % 2 == 0 else s
                   for i, s in enumerate(sentences))
    return text.strip()

# Apply cleaning
for segment in result["segments"]:
    segment["text"] = clean_transcript_text(segment["text"])
4. Add Speaker Labels
def add_speaker_labels(result, speakers=None):
    """Add speaker identification to segments."""
    if not speakers:
        speakers = ["Speaker 1", "Speaker 2"]
    # Simple round-robin assignment (use proper diarization in production)
    for i, segment in enumerate(result["segments"]):
        speaker_index = i % len(speakers)
        segment["speaker"] = speakers[speaker_index]
    return result
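Round-robin labels are only a placeholder. If a separate diarization step provides speaker turns, one common approach is to give each segment the speaker whose turn overlaps it the most. The sketch below assumes turns arrive as `(start, end, speaker)` tuples; that input shape and the helper name `assign_speakers_by_overlap` are assumptions for illustration, not the output format of any particular diarization library.

```python
def assign_speakers_by_overlap(segments, turns, default="Unknown"):
    """Label each segment with the speaker whose turn overlaps it most.

    turns: iterable of (start, end, speaker) tuples (assumed input shape).
    Returns new segment dicts; the input list is not modified.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = default, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time interval shared by segment and turn
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Hand-written sample data
sample_segments = [
    {"start": 0.0, "end": 5.2, "text": "Hello everyone."},
    {"start": 5.2, "end": 12.5, "text": "Thanks, let's begin."},
    {"start": 20.0, "end": 22.0, "text": "(no matching turn)"},
]
sample_turns = [(0.0, 5.0, "Alice"), (5.0, 13.0, "Bob")]

labeled = assign_speakers_by_overlap(sample_segments, sample_turns)
print([seg["speaker"] for seg in labeled])
```

Segments that overlap no turn keep the default label, which makes gaps in the diarization easy to spot in the formatted output.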
5. Validate Format Output
import re

def validate_srt(srt_content):
    """Validate SRT format. Returns (is_valid, message)."""
    # Entries are separated by blank lines; splitting on them allows
    # subtitle text that spans more than one line.
    blocks = re.split(r'\n\s*\n', srt_content.strip())
    for entry_num, block in enumerate(blocks, start=1):
        lines = block.split('\n')

        # Check sequence number
        try:
            seq_num = int(lines[0])
        except ValueError:
            return False, f"Expected sequence number in entry {entry_num}"
        if seq_num <= 0:
            return False, f"Invalid sequence number in entry {entry_num}"

        # Check timestamp
        if len(lines) < 2:
            return False, f"Missing timestamp line in entry {entry_num}"
        if '-->' not in lines[1]:
            return False, f"Invalid timestamp format in entry {entry_num}"

        # Everything after the timestamp line is subtitle text
        if len(lines) < 3:
            return False, f"Missing text line in entry {entry_num}"
    return True, "Valid SRT format"
Use Cases for Different Formats
TXT Format
- Use for: Simple documentation, reading, archiving
- Best when: You need plain text without timestamps
- Example: Meeting notes, interview transcripts
SRT Format
- Use for: Video subtitles, YouTube, Vimeo
- Best when: You need subtitle files for video content
- Example: Video transcription, podcast subtitles
VTT Format
- Use for: Web video players, HTML5 video
- Best when: Building web applications with video
- Example: Online course transcripts, webinars
JSON Format
- Use for: Programmatic processing, APIs, data analysis
- Best when: You need structured data with metadata
- Example: Automated workflows, data pipelines
DOCX Format
- Use for: Professional documents, reports, sharing
- Best when: You need formatted documents for review
- Example: Legal transcripts, medical notes, reports
CSV Format
- Use for: Data analysis, spreadsheets, databases
- Best when: You need tabular data for analysis
- Example: Content analysis, keyword extraction
Complete Example: Multi-Format Export
import whisper
from pathlib import Path

def transcribe_and_export_all_formats(audio_path, output_dir="output"):
    """Transcribe audio and export in all common formats."""
    # Create output directory (including missing parent directories)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Transcribe
    print("Transcribing audio...")
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"
    )
    base_name = Path(audio_path).stem

    # Initialize formatter
    formatter = WhisperFormatter(result)

    # Export all formats
    print("Exporting formats...")
    formatter.save(output_path / base_name, format="txt")
    formatter.save(output_path / base_name, format="txt_ts")
    formatter.save(output_path / base_name, format="srt")
    formatter.save(output_path / base_name, format="vtt")
    formatter.save(output_path / base_name, format="json")
    formatter.save(output_path / base_name, format="csv")

    # Guard against audio with no detected speech (empty segment list)
    segments = result.get("segments", [])
    duration = segments[-1]["end"] if segments else 0.0
    print(f"\n✓ All formats exported to: {output_path}")
    print(f"  Language: {result['language']}")
    print(f"  Duration: {duration:.2f}s")
    print(f"  Segments: {len(segments)}")
    return result

# Usage
result = transcribe_and_export_all_formats("meeting.mp3", "transcripts")
Troubleshooting Common Issues
Issue 1: Timestamps Not Aligning
Problem: SRT/VTT timestamps don't match video playback.
Solution:
# Ensure word_timestamps is enabled
result = model.transcribe("audio.mp3", word_timestamps=True)

# Use word-level timing for subtitles
def create_precise_srt(result):
    # Use word timestamps instead of segment timestamps for better
    # synchronization; this reuses format_srt_with_words from the
    # SRT section above.
    return format_srt_with_words(result)
Issue 2: Text Formatting Issues
Problem: Extra spaces, missing punctuation.
Solution:
# Apply text cleaning
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.replace(" ' ", "'")
    return text.strip()

for segment in result["segments"]:
    segment["text"] = clean_text(segment["text"])
Issue 3: Long Segments in Subtitles
Problem: Subtitles are too long for display.
Solution:
# Split long segments
def split_subtitle_text(text, max_length=42):
    """Split text into subtitle-friendly chunks."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        # The extra 1 is the joining space, needed only when the chunk
        # already holds a word
        added = len(word) + (1 if current_chunk else 0)
        if current_chunk and current_length + added > max_length:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word)
        else:
            current_chunk.append(word)
            current_length += added
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
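An alternative to hand-rolled splitting is the standard library's `textwrap` module. The sketch below wraps text to a maximum line width and then groups the lines into cues of at most two lines each. The helper name `wrap_subtitle` and the two-line limit are assumptions for this example; the 42-character width follows the default used above.

```python
import textwrap


def wrap_subtitle(text, max_chars=42, max_lines=2):
    """Wrap text into cues of at most max_lines lines of max_chars characters."""
    lines = textwrap.wrap(text, width=max_chars)
    return [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]


sample_text = (
    "We will discuss the project timeline and upcoming milestones "
    "before moving on to questions from the team about the budget."
)
cues = wrap_subtitle(sample_text)
for cue in cues:
    print(cue)
    print()
```

Each returned cue still needs its own timing, for example by dividing the segment's duration in proportion to the text length, or by using word timestamps.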
Conclusion
Properly formatting Whisper transcripts makes them more useful and compatible with different applications. Whether you need subtitles for video, structured data for processing, or professional documents for sharing, the right format makes all the difference.
Key takeaways:
- Use SRT/VTT for video subtitles
- Use JSON for programmatic processing
- Use TXT for simple documentation
- Use DOCX for professional documents
- Use CSV for data analysis
- Enable word_timestamps when you need word-level timing for subtitles
- Clean and normalize text for better readability
