Whisper Transcript Formatting: Complete Guide to Formatting Speech-to-Text Output

2026-01-17 | SpeechToText, Whisper, Tutorial

Author: Eric King

When using OpenAI Whisper for speech-to-text transcription, the raw output is just the beginning. Formatting your transcripts properly makes them more useful, readable, and compatible with different applications and workflows.

This guide covers six output formats (TXT, SRT, VTT, JSON, DOCX, and CSV) with code examples for each, along with best practices, typical use cases, and fixes for common problems.

Why Format Whisper Transcripts?

Raw Whisper output provides the transcribed text, but formatted transcripts offer:

Better readability with proper structure and timestamps
Subtitle compatibility (SRT, VTT) for video platforms
Structured data (JSON) for programmatic processing
Professional presentation (DOCX, PDF) for documentation
Search and navigation with timestamps and segments
Speaker identification and diarization formatting

Understanding Whisper Output Structure

model.transcribe() returns a dictionary with the following structure. With word_timestamps=True, each segment also contains a "words" list:

{
    "text": "Full transcription text...",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 5.2,
            "text": "Segment text...",
            "tokens": [1234, 5678, ...],
            "temperature": 0.0,
            "avg_logprob": -0.5,
            "compression_ratio": 1.2,
            "no_speech_prob": 0.1
        },
        ...
    ],
    "language": "en"
}

Key fields:

text: Complete transcription as a single string
segments: List of time-stamped segments
language: Detected language code
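
The per-segment quality fields shown above (avg_logprob, no_speech_prob) can be used to flag segments worth a manual review before formatting. The following is a minimal sketch, assuming a result dictionary shaped like the one above; the function name flag_suspect_segments and both threshold values are illustrative assumptions and should be tuned for your audio:

```python
def flag_suspect_segments(result, max_no_speech=0.6, min_avg_logprob=-1.0):
    """Return ids of segments that look like silence or low-confidence text."""
    flagged = []
    for seg in result.get("segments", []):
        likely_silence = seg.get("no_speech_prob", 0.0) > max_no_speech
        low_confidence = seg.get("avg_logprob", 0.0) < min_avg_logprob
        if likely_silence or low_confidence:
            flagged.append(seg["id"])
    return flagged


# Hand-written sample in the shape shown above (not real Whisper output)
sample_result = {
    "text": " Hello everyone. Thanks.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 5.2, "text": " Hello everyone.",
         "avg_logprob": -0.3, "no_speech_prob": 0.05},
        {"id": 1, "start": 5.2, "end": 9.0, "text": " Thanks.",
         "avg_logprob": -1.4, "no_speech_prob": 0.72},
    ],
    "language": "en",
}
suspect_ids = flag_suspect_segments(sample_result)
```

Flagged segments can then be skipped, highlighted, or re-checked by hand, depending on the target format.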

Format 1: Plain Text (TXT)

The simplest format, suitable for basic documentation and reading.

Basic Text Formatting

import whisper

def format_as_text(result):
    """Format Whisper output as plain text."""
    return result["text"]

# Usage
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
formatted_text = format_as_text(result)

# Save to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(formatted_text)

Enhanced Text Formatting with Timestamps

def format_text_with_timestamps(result):
    """Format with timestamps for each segment."""
    formatted = []
    for segment in result["segments"]:
        start_time = format_time(segment["start"])
        end_time = format_time(segment["end"])
        text = segment["text"].strip()
        formatted.append(f"[{start_time} - {end_time}] {text}")
    
    return "\n\n".join(formatted)

def format_time(seconds):
    """Format seconds to HH:MM:SS."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

# Usage
formatted = format_text_with_timestamps(result)
with open("transcript_timestamped.txt", "w", encoding="utf-8") as f:
    f.write(formatted)

Output example:

[00:00:00 - 00:00:05] Hello everyone, welcome to today's meeting.

[00:00:05 - 00:00:12] We will discuss the project timeline and upcoming milestones.

Format 2: SRT (SubRip Subtitle)

SRT is the most common subtitle format, compatible with YouTube, Vimeo, and most video players.

SRT Formatting Function

def format_as_srt(result):
    """Format Whisper output as SRT subtitles."""
    srt_content = []
    
    for i, segment in enumerate(result["segments"], start=1):
        start_time = format_srt_timestamp(segment["start"])
        end_time = format_srt_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_content.append(f"{i}")
        srt_content.append(f"{start_time} --> {end_time}")
        srt_content.append(text)
        srt_content.append("")  # Empty line between entries
    
    return "\n".join(srt_content)

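def format_srt_timestamp(seconds):
    """Format seconds to SRT timestamp (HH:MM:SS,mmm)."""
    # Round once to whole milliseconds. Truncating the fractional part
    # can lose a millisecond to floating-point error (1.001 -> ",000").
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"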

# Usage
model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=False)
srt_content = format_as_srt(result)

with open("transcript.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)

SRT Output example:

1
00:00:00,000 --> 00:00:05,200
Hello everyone, welcome to today's meeting.

2
00:00:05,200 --> 00:00:12,500
We will discuss the project timeline and upcoming milestones.
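
When SRT files need to be checked or post-processed, it helps to read them back into structured cues. The sketch below parses SRT text like the example above into dictionaries with integer millisecond times. The names srt_timestamp_to_ms and parse_srt are illustrative, and the parser assumes well-formed input with blank lines between entries:

```python
import re

SRT_TIME = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def srt_timestamp_to_ms(stamp):
    """Convert 'HH:MM:SS,mmm' to an integer number of milliseconds."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(srt_text):
    """Parse SRT text into a list of cue dictionaries."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # Skip malformed or empty blocks
        start, end = (part.strip() for part in lines[1].split("-->"))
        cues.append({
            "index": int(lines[0]),
            "start_ms": srt_timestamp_to_ms(start),
            "end_ms": srt_timestamp_to_ms(end),
            "text": "\n".join(lines[2:]),  # Keep multi-line cue text
        })
    return cues


sample_srt = """1
00:00:00,000 --> 00:00:05,200
Hello everyone, welcome to today's meeting.

2
00:00:05,200 --> 00:00:12,500
We will discuss the project timeline and upcoming milestones.
"""
cues = parse_srt(sample_srt)
```

Working in integer milliseconds avoids floating-point drift when cues are shifted or compared.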

Advanced SRT with Word-Level Timestamps

def format_srt_with_words(result):
    """Create SRT with word-level timing for better synchronization."""
    if not result.get("segments") or not result["segments"][0].get("words"):
        # Fallback to segment-level if word timestamps not available
        return format_as_srt(result)
    
    srt_content = []
    subtitle_index = 1
    current_subtitle_words = []
    current_start = None
    current_end = None
    
    for segment in result["segments"]:
        words = segment.get("words", [])
        
        for word_info in words:
            word = word_info["word"].strip()
            start = word_info["start"]
            end = word_info["end"]
            
            if current_start is None:
                current_start = start
            
            current_subtitle_words.append(word)
            current_end = end
            
            # Create subtitle every ~3 seconds or 10 words
            if (end - current_start > 3.0) or (len(current_subtitle_words) >= 10):
                text = " ".join(current_subtitle_words)
                srt_content.append(f"{subtitle_index}")
                srt_content.append(f"{format_srt_timestamp(current_start)} --> {format_srt_timestamp(current_end)}")
                srt_content.append(text)
                srt_content.append("")
                
                subtitle_index += 1
                current_subtitle_words = []
                current_start = None
                current_end = None
        
        # Handle remaining words in segment
        if current_subtitle_words:
            text = " ".join(current_subtitle_words)
            srt_content.append(f"{subtitle_index}")
            srt_content.append(f"{format_srt_timestamp(current_start)} --> {format_srt_timestamp(current_end)}")
            srt_content.append(text)
            srt_content.append("")
            
            subtitle_index += 1
            current_subtitle_words = []
            current_start = None
            current_end = None
    
    return "\n".join(srt_content)

# Usage with word timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
srt_content = format_srt_with_words(result)

Format 3: VTT (WebVTT)

WebVTT is the web standard for subtitles, used by HTML5 video players and web applications.

VTT Formatting Function

def format_as_vtt(result):
    """Format Whisper output as WebVTT subtitles."""
    vtt_content = ["WEBVTT", ""]  # VTT header
    
    for segment in result["segments"]:
        start_time = format_vtt_timestamp(segment["start"])
        end_time = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        
        vtt_content.append(f"{start_time} --> {end_time}")
        vtt_content.append(text)
        vtt_content.append("")  # Empty line between entries
    
    return "\n".join(vtt_content)

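def format_vtt_timestamp(seconds):
    """Format seconds to VTT timestamp (HH:MM:SS.mmm)."""
    # Round once to whole milliseconds. Truncating the fractional part
    # can lose a millisecond to floating-point error (1.001 -> ".000").
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"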

# Usage
vtt_content = format_as_vtt(result)
with open("transcript.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)

VTT Output example:

WEBVTT

00:00:00.000 --> 00:00:05.200
Hello everyone, welcome to today's meeting.

00:00:05.200 --> 00:00:12.500
We will discuss the project timeline and upcoming milestones.
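
Because the two subtitle formats differ mainly in the header and the millisecond separator, an existing SRT file can be converted to VTT without re-transcribing. This is a minimal sketch under that assumption; srt_to_vtt is an illustrative name, and the SRT sequence numbers are kept because WebVTT accepts them as optional cue identifiers:

```python
import re


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT: add the header, use '.' in timing lines."""
    timing = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")
    out = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; commas in cue text stay as-is
            line = timing.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"


vtt_text = srt_to_vtt("1\n00:00:00,000 --> 00:00:05,200\nHello, world.\n")
```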

Enhanced VTT with Styling

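def format_vtt_with_styling(result, title="Transcription"):
    """Create VTT with metadata header lines and the title as a NOTE block."""
    vtt_content = [
        "WEBVTT",
        "Kind: captions",
        f"Language: {result.get('language', 'en')}",
        "",
        f"NOTE {title}",  # NOTE blocks are comments; players do not display them
        ""
    ]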
    
    for segment in result["segments"]:
        start_time = format_vtt_timestamp(segment["start"])
        end_time = format_vtt_timestamp(segment["end"])
        text = segment["text"].strip()
        
        vtt_content.append(f"{start_time} --> {end_time}")
        vtt_content.append(text)
        vtt_content.append("")
    
    return "\n".join(vtt_content)
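
WebVTT also supports voice spans of the form <v Name>, which let a player style or label cues per speaker. The sketch below is one way to emit them, assuming each segment may carry a "speaker" key (Whisper itself does not add one; it would come from a separate diarization step). The helper names are illustrative, and speaker names are assumed not to contain ">" or line breaks:

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, rounding to whole milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def escape_vtt(text):
    """Escape characters that have special meaning in WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def format_vtt_with_voices(segments):
    """Build WebVTT text, tagging cues with <v Speaker> when a speaker is set."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        text = escape_vtt(seg["text"].strip())
        speaker = seg.get("speaker")
        lines.append(f"<v {speaker}>{text}" if speaker else text)
        lines.append("")  # Blank line ends the cue
    return "\n".join(lines)


# Hand-written sample segments (the "speaker" key is an assumption)
voiced = format_vtt_with_voices([
    {"start": 0.0, "end": 5.2, "text": " Q&A starts now.", "speaker": "Alice"},
    {"start": 5.2, "end": 12.5, "text": " Great."},
])
```

On a web page, a CSS rule such as ::cue(v[voice="Alice"]) can then target one speaker's cues.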

Format 4: JSON (Structured Data)

JSON format preserves all Whisper metadata and is ideal for programmatic processing.

Basic JSON Formatting

import json

def format_as_json(result, pretty=True):
    """Format Whisper output as JSON."""
    if pretty:
        return json.dumps(result, indent=2, ensure_ascii=False)
    else:
        return json.dumps(result, ensure_ascii=False)

# Usage
json_content = format_as_json(result)
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_content)

Custom JSON Structure

def format_custom_json(result, metadata=None):
    """Create custom JSON structure with additional metadata."""
    custom_result = {
        "metadata": {
            "language": result.get("language", "unknown"),
            "duration": result["segments"][-1]["end"] if result.get("segments") else 0,
            "segment_count": len(result.get("segments", [])),
            **(metadata or {})
        },
        "transcription": {
            "full_text": result["text"],
            "segments": [
                {
                    "id": seg["id"],
                    "start": seg["start"],
                    "end": seg["end"],
                    "text": seg["text"].strip(),
                    "duration": seg["end"] - seg["start"]
                }
                for seg in result.get("segments", [])
            ]
        }
    }
    
    return json.dumps(custom_result, indent=2, ensure_ascii=False)

# Usage with metadata
metadata = {
    "source_file": "meeting_audio.mp3",
    "transcribed_at": "2026-01-15T10:30:00Z",
    "model": "whisper-base"
}
json_content = format_custom_json(result, metadata)
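
One benefit of the structured output is that it can be searched later without touching the audio. The following sketch looks up a keyword in JSON produced with the custom structure above and returns the start time of each matching segment; find_keyword is an illustrative name, and the sample data is hand-written rather than real Whisper output:

```python
import json


def find_keyword(json_text, keyword):
    """Return (start_seconds, text) for each segment that mentions keyword."""
    data = json.loads(json_text)
    needle = keyword.casefold()  # Case-insensitive comparison
    return [
        (seg["start"], seg["text"])
        for seg in data["transcription"]["segments"]
        if needle in seg["text"].casefold()
    ]


sample_json = json.dumps({
    "metadata": {"language": "en", "duration": 12.5, "segment_count": 2},
    "transcription": {
        "full_text": "Hello everyone, welcome to today's meeting. "
                     "We will discuss the project timeline and upcoming milestones.",
        "segments": [
            {"id": 0, "start": 0.0, "end": 5.2, "duration": 5.2,
             "text": "Hello everyone, welcome to today's meeting."},
            {"id": 1, "start": 5.2, "end": 12.5, "duration": 7.3,
             "text": "We will discuss the project timeline and upcoming milestones."},
        ],
    },
})
hits = find_keyword(sample_json, "Timeline")
```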

Format 5: DOCX (Microsoft Word)

For professional documents and reports, DOCX format provides rich formatting options.

DOCX Formatting with python-docx

from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH

def format_as_docx(result, output_path="transcript.docx", title="Transcription"):
    """Format Whisper output as DOCX document."""
    doc = Document()
    
    # Add title
    title_para = doc.add_heading(title, 0)
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    
    # Add metadata
    doc.add_paragraph(f"Language: {result.get('language', 'Unknown')}")
    doc.add_paragraph(f"Total Segments: {len(result.get('segments', []))}")
    doc.add_paragraph("")  # Empty line
    
    # Add full transcription
    doc.add_heading("Full Transcription", level=1)
    full_text_para = doc.add_paragraph(result["text"])
    full_text_para.style = 'Normal'
    
    # Add segmented transcription with timestamps
    doc.add_heading("Segmented Transcription", level=1)
    
    for segment in result.get("segments", []):
        start_time = format_time(segment["start"])
        end_time = format_time(segment["end"])
        text = segment["text"].strip()
        
        # Timestamp paragraph
        time_para = doc.add_paragraph()
        time_run = time_para.add_run(f"[{start_time} - {end_time}]")
        time_run.bold = True
        time_run.font.color.rgb = RGBColor(0, 100, 200)
        
        # Text paragraph
        text_para = doc.add_paragraph(text)
        text_para.style = 'List Paragraph'
    
    # Save document
    doc.save(output_path)
    print(f"✓ DOCX saved: {output_path}")

# Install: pip install python-docx
# Usage
format_as_docx(result, "transcript.docx", "Meeting Transcription")

Enhanced DOCX with Speaker Labels

def format_docx_with_speakers(result, speakers=None, output_path="transcript.docx"):
    """Create DOCX with speaker identification."""
    doc = Document()
    doc.add_heading("Meeting Transcription", 0)
    
    if speakers:
        doc.add_paragraph(f"Participants: {', '.join(speakers)}")
    
    doc.add_paragraph("")  # Empty line
    
    for segment in result.get("segments", []):
        start_time = format_time(segment["start"])
        speaker = segment.get("speaker", "Unknown")
        text = segment["text"].strip()
        
        # Speaker and timestamp
        header_para = doc.add_paragraph()
        header_run = header_para.add_run(f"{speaker} [{start_time}]")
        header_run.bold = True
        header_run.font.size = Pt(11)
        
        # Text
        text_para = doc.add_paragraph(text)
        text_para.style = 'List Paragraph'
        doc.add_paragraph("")  # Blank paragraph as spacing between turns
    
    doc.save(output_path)

Format 6: CSV (Spreadsheet Format)

CSV format is useful for data analysis and spreadsheet applications.

CSV Formatting

import csv

def format_as_csv(result, output_path="transcript.csv"):
    """Format Whisper output as CSV."""
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        
        # Header
        writer.writerow(["Segment ID", "Start Time", "End Time", "Duration", "Text"])
        
        # Data rows
        for segment in result.get("segments", []):
            segment_id = segment.get("id", 0)
            start = segment["start"]
            end = segment["end"]
            duration = end - start
            text = segment["text"].strip()
            
            writer.writerow([segment_id, start, end, duration, text])
    
    print(f"✓ CSV saved: {output_path}")

# Usage
format_as_csv(result)
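
Once the transcript is in CSV form, simple statistics are easy to compute. The sketch below reads CSV text with the same column names as the header above and derives total speaking time, word count, and words per minute; speech_stats is an illustrative name and the sample rows are hand-written:

```python
import csv
import io


def speech_stats(csv_text):
    """Compute total seconds, total words, and words per minute from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total_seconds = 0.0
    total_words = 0
    for row in reader:
        total_seconds += float(row["Duration"])
        total_words += len(row["Text"].split())
    wpm = total_words / (total_seconds / 60) if total_seconds else 0.0
    return {"seconds": total_seconds, "words": total_words, "wpm": wpm}


sample_csv = """Segment ID,Start Time,End Time,Duration,Text
0,0.0,5.2,5.2,"Hello everyone, welcome to today's meeting."
1,5.2,12.5,7.3,We will discuss the project timeline and upcoming milestones.
"""
stats = speech_stats(sample_csv)
```

Note that the first row's text is quoted because it contains a comma; csv.writer adds such quoting automatically when writing.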

Complete Formatting Utility Class

Here's a comprehensive utility class that handles all formats:

import whisper
import json
import csv
from pathlib import Path
from datetime import datetime

class WhisperFormatter:
    """Utility class for formatting Whisper transcription results."""
    
    def __init__(self, result):
        self.result = result
        self.segments = result.get("segments", [])
        self.language = result.get("language", "unknown")
    
    def to_text(self, include_timestamps=False):
        """Convert to plain text."""
        if include_timestamps:
            lines = []
            for seg in self.segments:
                start = self._format_time(seg["start"])
                end = self._format_time(seg["end"])
                text = seg["text"].strip()
                lines.append(f"[{start} - {end}] {text}")
            return "\n\n".join(lines)
        return self.result["text"]
    
    def to_srt(self):
        """Convert to SRT subtitle format."""
        srt_lines = []
        for i, seg in enumerate(self.segments, start=1):
            start = self._format_srt_time(seg["start"])
            end = self._format_srt_time(seg["end"])
            text = seg["text"].strip()
            srt_lines.append(f"{i}\n{start} --> {end}\n{text}\n")
        return "\n".join(srt_lines)
    
    def to_vtt(self):
        """Convert to WebVTT format."""
        vtt_lines = ["WEBVTT", ""]
        for seg in self.segments:
            start = self._format_vtt_time(seg["start"])
            end = self._format_vtt_time(seg["end"])
            text = seg["text"].strip()
            vtt_lines.append(f"{start} --> {end}\n{text}\n")
        return "\n".join(vtt_lines)
    
    def to_json(self, pretty=True):
        """Convert to JSON format."""
        if pretty:
            return json.dumps(self.result, indent=2, ensure_ascii=False)
        return json.dumps(self.result, ensure_ascii=False)
    
    def to_csv(self):
        """Convert to CSV format."""
        import io
        output = io.StringIO()
        # Use "\n" so save() does not produce "\r\r\n" line endings on Windows
        writer = csv.writer(output, lineterminator="\n")
        writer.writerow(["ID", "Start", "End", "Duration", "Text"])
        
        for seg in self.segments:
            writer.writerow([
                seg.get("id", 0),
                seg["start"],
                seg["end"],
                seg["end"] - seg["start"],
                seg["text"].strip()
            ])
        
        return output.getvalue()
    
    def save(self, output_path, format="txt"):
        """Save transcription in specified format."""
        output_path = Path(output_path)
        format = format.lower()
        
        if format == "txt":
            content = self.to_text()
        elif format == "txt_ts":
            content = self.to_text(include_timestamps=True)
        elif format == "srt":
            content = self.to_srt()
        elif format == "vtt":
            content = self.to_vtt()
        elif format == "json":
            content = self.to_json()
        elif format == "csv":
            content = self.to_csv()
        else:
            raise ValueError(f"Unsupported format: {format}")
        
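        # Determine file extension
        ext_map = {
            "txt": ".txt",
            "srt": ".srt",
            "vtt": ".vtt",
            "json": ".json",
            "csv": ".csv"
        }
        
        if format == "txt_ts":
            # Separate file name so the timestamped text does not
            # overwrite the plain "txt" export of the same base name
            file_path = output_path.with_name(output_path.stem + "_timestamped.txt")
        else:
            file_path = output_path.with_suffix(ext_map[format])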
        
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
        
        print(f"✓ Saved: {file_path}")
        return file_path
    
    def _format_time(self, seconds):
        """Format seconds to HH:MM:SS."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    
    def _format_srt_time(self, seconds):
        """Format seconds to SRT timestamp."""
        return self._format_ms_time(seconds, ",")
    
    def _format_vtt_time(self, seconds):
        """Format seconds to VTT timestamp."""
        return self._format_ms_time(seconds, ".")
    
    def _format_ms_time(self, seconds, separator):
        """Format seconds to HH:MM:SS<separator>mmm."""
        # Round once to whole milliseconds; truncating the fractional
        # part can lose a millisecond to floating-point error
        total_ms = int(round(seconds * 1000))
        hours, rem = divmod(total_ms, 3_600_000)
        minutes, rem = divmod(rem, 60_000)
        secs, millis = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"

# Usage example
model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

formatter = WhisperFormatter(result)

# Save in multiple formats
formatter.save("transcript", format="txt")
formatter.save("transcript", format="srt")
formatter.save("transcript", format="vtt")
formatter.save("transcript", format="json")
formatter.save("transcript", format="csv")

Best Practices for Transcript Formatting

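1. Enable Word Timestamps for More Precise Timing

# Word-level timestamps give finer-grained timing for subtitles.
# Segment-level timestamps already suffice for basic SRT/VTT output.
result = model.transcribe(
    "audio.mp3",
    word_timestamps=True  # Needed for word-level subtitle splitting
)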
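
With word_timestamps=True, each segment carries a "words" list whose entries include "word", "start", and "end". The sketch below flattens those entries and finds the longest pause between consecutive words, which can serve as a natural subtitle or paragraph break. The function names are illustrative and the sample data is hand-written:

```python
def iter_words(result):
    """Yield (word, start, end) for every word entry in the result."""
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            yield w["word"].strip(), w["start"], w["end"]


def longest_pause(result):
    """Return (gap_seconds, time_gap_begins) for the largest gap between words."""
    best = (0.0, None)
    prev_end = None
    for _, start, end in iter_words(result):
        if prev_end is not None and start - prev_end > best[0]:
            best = (start - prev_end, prev_end)
        prev_end = end
    return best


# Hand-written sample in the word-timestamp shape (not real Whisper output)
sample = {
    "segments": [
        {"words": [
            {"word": " Hello", "start": 0.0, "end": 0.4},
            {"word": " everyone", "start": 0.5, "end": 1.0},
        ]},
        {"words": [
            {"word": " Welcome", "start": 2.5, "end": 3.0},
        ]},
    ]
}
pause = longest_pause(sample)
```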

2. Handle Long Segments

def split_long_segments(segments, max_duration=5.0):
    """Split segments longer than max_duration."""
    split_segments = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration > max_duration:
            # Split into smaller chunks
            words = seg.get("words", [])
            if words:
                chunk_start = seg["start"]
                chunk_words = []
                
                for word_info in words:
                    chunk_words.append(word_info["word"].strip())
                    if word_info["end"] - chunk_start > max_duration:
                        split_segments.append({
                            "start": chunk_start,
                            "end": word_info["end"],
                            "text": " ".join(chunk_words)
                        })
                        chunk_start = word_info["end"]
                        chunk_words = []
                
                # Add remaining words
                if chunk_words:
                    split_segments.append({
                        "start": chunk_start,
                        "end": seg["end"],
                        "text": " ".join(chunk_words)
                    })
            else:
                split_segments.append(seg)
        else:
            split_segments.append(seg)
    
    return split_segments

3. Clean and Normalize Text

import re

def clean_transcript_text(text):
    """Clean and normalize transcript text."""
    # Collapse runs of whitespace and trim the ends first, so the
    # capitalization step below sees a letter at position 0
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove stray spaces around apostrophes and before punctuation
    text = text.replace(" ' ", "'")
    text = text.replace(" ,", ",")
    text = text.replace(" .", ".")
    text = text.replace(" ?", "?")
    text = text.replace(" !", "!")
    
    # Capitalize the first letter of each sentence. str.capitalize() is
    # avoided because it lowercases the rest ("OpenAI" -> "Openai").
    sentences = re.split(r'([.!?]\s+)', text)
    text = ''.join([s[:1].upper() + s[1:] if i % 2 == 0 else s
                    for i, s in enumerate(sentences)])
    
    return text

# Apply cleaning
for segment in result["segments"]:
    segment["text"] = clean_transcript_text(segment["text"])

4. Add Speaker Labels

def add_speaker_labels(result, speakers=None):
    """Add speaker identification to segments."""
    if not speakers:
        speakers = ["Speaker 1", "Speaker 2"]
    
    # Simple round-robin assignment (use proper diarization in production)
    for i, segment in enumerate(result["segments"]):
        speaker_index = i % len(speakers)
        segment["speaker"] = speakers[speaker_index]
    
    return result
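
When real diarization output is available, a common alternative to round-robin labels is to give each segment the speaker whose turn overlaps it most in time. The sketch below assumes the diarization result has already been converted to a list of (start, end, speaker) tuples; that shape, the name assign_speakers, and the sample turns are all hypothetical and not tied to any particular diarization library:

```python
def assign_speakers(segments, turns, default="Unknown"):
    """Label each segment with the speaker whose turn overlaps it the most.

    `turns` is a hypothetical list of (start, end, speaker) tuples.
    """
    for seg in segments:
        best_speaker, best_overlap = default, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time range shared by the segment and the turn
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        seg["speaker"] = best_speaker
    return segments


# Hand-written sample data
segments = [
    {"start": 0.0, "end": 5.2, "text": "Hello everyone."},
    {"start": 5.2, "end": 12.5, "text": "Thanks for joining."},
    {"start": 20.0, "end": 22.0, "text": "Any questions?"},
]
turns = [(0.0, 6.0, "Alice"), (6.0, 13.0, "Bob")]
labeled = assign_speakers(segments, turns)
```

Segments with no overlapping turn keep the default label, so gaps in the diarization output remain visible in the transcript.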

5. Validate Format Output

def validate_srt(srt_content):
    """Validate SRT format."""
    lines = srt_content.strip().split('\n')
    i = 0
    while i < len(lines):
        # Check sequence number
        try:
            seq_num = int(lines[i])
            if seq_num <= 0:
                return False, f"Invalid sequence number at line {i+1}"
        except ValueError:
            return False, f"Expected sequence number at line {i+1}"
        
        i += 1
        if i >= len(lines):
            return False, "Missing timestamp line"
        
        # Check timestamp
        if '-->' not in lines[i]:
            return False, f"Invalid timestamp format at line {i+1}"
        
        i += 1
        if i >= len(lines):
            return False, "Missing text line"
        
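        # Skip the text line(s) up to the blank separator; a cue's
        # text may span more than one line
        while i < len(lines) and lines[i].strip():
            i += 1
        i += 1  # Skip the blank line itself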
    
    return True, "Valid SRT format"
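
Format validation can be complemented by a timing check on the segments themselves, since cues that end before they start or that overlap the previous cue tend to display badly even in a syntactically valid file. A minimal sketch, with check_segment_timing as an illustrative name:

```python
def check_segment_timing(segments):
    """Return a list of human-readable timing problems (empty if none)."""
    problems = []
    prev_end = 0.0
    for i, seg in enumerate(segments):
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end is not after start")
        if seg["start"] < prev_end:
            problems.append(f"segment {i}: overlaps previous segment")
        prev_end = max(prev_end, seg["end"])
    return problems


good = [{"start": 0.0, "end": 5.2}, {"start": 5.2, "end": 12.5}]
bad = [{"start": 0.0, "end": 5.2}, {"start": 4.0, "end": 4.0}]
good_problems = check_segment_timing(good)
bad_problems = check_segment_timing(bad)
```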

Use Cases for Different Formats

TXT Format

Use for: Simple documentation, reading, archiving
Best when: You need plain text without timestamps
Example: Meeting notes, interview transcripts

SRT Format

Use for: Video subtitles, YouTube, Vimeo
Best when: You need subtitle files for video content
Example: Video transcription, podcast subtitles

VTT Format

Use for: Web video players, HTML5 video
Best when: Building web applications with video
Example: Online course transcripts, webinars

JSON Format

Use for: Programmatic processing, APIs, data analysis
Best when: You need structured data with metadata
Example: Automated workflows, data pipelines

DOCX Format

Use for: Professional documents, reports, sharing
Best when: You need formatted documents for review
Example: Legal transcripts, medical notes, reports

CSV Format

Use for: Data analysis, spreadsheets, databases
Best when: You need tabular data for analysis
Example: Content analysis, keyword extraction

Complete Example: Multi-Format Export

import whisper
from pathlib import Path

def transcribe_and_export_all_formats(audio_path, output_dir="output"):
    """Transcribe audio and export in all common formats."""
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # Also creates nested directories
    
    # Transcribe
    print("Transcribing audio...")
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"
    )
    
    base_name = Path(audio_path).stem
    
    # Initialize formatter
    formatter = WhisperFormatter(result)
    
    # Export all formats
    print("Exporting formats...")
    formatter.save(output_path / base_name, format="txt")
    formatter.save(output_path / base_name, format="txt_ts")
    formatter.save(output_path / base_name, format="srt")
    formatter.save(output_path / base_name, format="vtt")
    formatter.save(output_path / base_name, format="json")
    formatter.save(output_path / base_name, format="csv")
    
    print(f"\n✓ All formats exported to: {output_path}")
    print(f"  Language: {result['language']}")
    print(f"  Duration: {result['segments'][-1]['end']:.2f}s")
    print(f"  Segments: {len(result['segments'])}")
    
    return result

# Usage
result = transcribe_and_export_all_formats("meeting.mp3", "transcripts")

Troubleshooting Common Issues

Issue 1: Timestamps Not Aligning

Problem: SRT/VTT timestamps don't match video playback.

Solution:

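# Ensure word_timestamps is enabled
result = model.transcribe("audio.mp3", word_timestamps=True)

# Use word-level timing for subtitles
def create_precise_srt(result):
    # Use word timestamps instead of segment timestamps
    # for better synchronization; this reuses format_srt_with_words()
    # from the "Advanced SRT with Word-Level Timestamps" section
    return format_srt_with_words(result)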

Issue 2: Text Formatting Issues

Problem: Extra spaces, missing punctuation.

Solution:

# Apply text cleaning
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)   # Collapse repeated whitespace
    text = text.replace(" ' ", "'")    # Rejoin split contractions
    return text.strip()

for segment in result["segments"]:
    segment["text"] = clean_text(segment["text"])

Issue 3: Long Segments in Subtitles

Problem: Subtitles are too long for display.

Solution:

# Split long segments
def split_subtitle_text(text, max_length=42):
    """Split text into subtitle-friendly chunks."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    
    for word in words:
        if current_length + len(word) + 1 > max_length and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word)
        else:
            current_chunk.append(word)
            current_length += len(word) + 1
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    
    return chunks
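
Splitting the text is only half of the job, because each chunk also needs its own time range. The sketch below wraps a segment's text with the standard library's textwrap module, groups the lines into cues of at most two lines, and divides the original duration in proportion to character count. The proportional split is an assumption for cases where word timestamps are unavailable, and split_cue is an illustrative name:

```python
import textwrap


def split_cue(start, end, text, max_chars=42, max_lines=2):
    """Split one cue into several, sharing its duration by character count."""
    lines = textwrap.wrap(text, width=max_chars)
    # Group wrapped lines into cues of at most max_lines lines each
    groups = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    total_chars = sum(len(line) for line in lines) or 1
    cues = []
    cursor = start
    for n, group in enumerate(groups):
        share = sum(len(line) for line in group) / total_chars
        # The last cue ends exactly at the original end time
        cue_end = end if n == len(groups) - 1 else cursor + (end - start) * share
        cues.append((cursor, cue_end, "\n".join(group)))
        cursor = cue_end
    return cues


short_cues = split_cue(0.0, 2.0, "Hello world")
```

When word timestamps are available, using the real start and end of the first and last word in each chunk will be more accurate than this estimate.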

Conclusion

Properly formatting Whisper transcripts makes them more useful and compatible with different applications. Whether you need subtitles for video, structured data for processing, or professional documents for sharing, the right format makes all the difference.

Key takeaways:

Use SRT/VTT for video subtitles
Use JSON for programmatic processing
Use TXT for simple documentation
Use DOCX for professional documents
Use CSV for data analysis
Enable word_timestamps when you need word-level subtitle timing
Clean and normalize text for better readability

For more information about Whisper transcription, check out our guides on Whisper Python Example, Whisper Accuracy Tips, and Whisper for Meetings.

Looking for a professional speech-to-text solution with built-in formatting options? Visit SayToWords to explore our AI transcription platform with support for multiple output formats.
