Voice Cleanup Prompt Experiment

December 3, 2025

Comparative evaluation of speech-to-text cleanup approaches: traditional STT+LLM pipelines versus native multimodal audio processing.

Objective

Now that some LLMs accept audio input natively, a voice recording can be cleaned up in a single model call instead of a two-stage pipeline (speech-to-text followed by text cleanup). This experiment compares the two approaches:

OpenAI Pipeline: Separate STT (Whisper) + LLM cleanup
Gemini Native: Integrated multimodal audio processing
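The structural difference between the two approaches can be sketched as follows. This is only an illustration: the type aliases and function names are hypothetical, and the actual STT and LLM calls are passed in as plain callables rather than tied to any specific SDK.

```python
from typing import Callable

# Hypothetical type aliases; the real scripts may be organised differently.
Transcriber = Callable[[bytes], str]        # audio -> raw transcript
TextCleaner = Callable[[str, str], str]     # (prompt, transcript) -> cleaned text
AudioCleaner = Callable[[str, bytes], str]  # (prompt, audio) -> cleaned text


def run_two_stage(audio: bytes, prompt: str,
                  transcribe: Transcriber, clean: TextCleaner) -> str:
    """STT followed by LLM cleanup: the cleanup model only ever sees text."""
    transcript = transcribe(audio)
    return clean(prompt, transcript)


def run_native(audio: bytes, prompt: str, clean_audio: AudioCleaner) -> str:
    """Single multimodal call: the model receives the prompt and the audio together."""
    return clean_audio(prompt, audio)
```

In the two-stage shape, anything the transcriber drops (tone, hesitation, a misheard word) is unavailable to the cleanup step, which is the main reason the comparison is interesting.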

Experiment Design

Prompt Progression (1-10)

Prompts 1-5: Editing Liberty Spectrum

Prompt 1: Verbatim transcription only (most restrictive)
Prompts 2-4: Progressively more editing freedom
Prompt 5: Full restructuring and enhancement allowed (most liberal)

Prompts 6-10: Format Adherence Testing

Testing structured output generation (e.g., README format)
Evaluating prompt instruction following
Comparing format compliance between approaches
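Format compliance can be checked mechanically before any manual review. The sketch below assumes the requested format is a markdown README with a top-level title and a set of named sections; the function name and the section names passed to it are illustrative, not taken from the actual prompts.

```python
import re


def check_readme_format(text: str, required_sections: list[str]) -> dict[str, bool]:
    """Return one pass/fail flag per structural check."""
    lines = text.strip().splitlines()
    # Collect the text of every markdown heading (levels 1-6), lowercased.
    headings = [
        m.group(2).strip().lower()
        for line in lines
        if (m := re.match(r"^(#{1,6})\s+(.*)$", line))
    ]
    # A compliant output starts directly with the title, with no chatty preamble.
    results = {"starts_with_h1": bool(lines) and lines[0].startswith("# ")}
    for section in required_sections:
        results[f"has_{section.lower()}"] = section.lower() in headings
    return results
```

For example, `check_readme_format(output, ["Usage", "License"])` flags an output that opens with a sentence such as "Here is your README:" instead of the title, a common way for a model to break format.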

Structure

├── prompts/           # 10 prompt variations (1.md - 10.md)
├── outputs/
│   ├── openai/        # Results from STT + LLM pipeline
│   ├── gemini/        # Results from native multimodal
│   └── summary.md     # Comparison notes
├── sample-audio/      # Test audio files
├── scripts/           # Processing scripts
└── promptfooconfig.yaml
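Given this layout, the outputs of both pipelines can be lined up per prompt for side-by-side review. The sketch below assumes that output files mirror the prompt file names (`1.md` to `10.md`), which the tree above does not confirm; `collect_pairs` is a hypothetical helper, not one of the repository's scripts.

```python
from pathlib import Path


def collect_pairs(root: Path, count: int = 10) -> list[dict]:
    """Pair each prompt with both pipelines' outputs; missing files map to None."""

    def read(path: Path):
        return path.read_text(encoding="utf-8") if path.is_file() else None

    pairs = []
    for n in range(1, count + 1):
        pairs.append({
            "prompt_id": n,
            "prompt": read(root / "prompts" / f"{n}.md"),
            # Assumption: outputs are named after the prompt they answer.
            "openai": read(root / "outputs" / "openai" / f"{n}.md"),
            "gemini": read(root / "outputs" / "gemini" / f"{n}.md"),
        })
    return pairs
```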

Key Questions

How does output quality compare at different editing liberty levels?
Does native multimodal processing preserve speaker intent better?
How well does each approach follow formatting instructions?
What are the practical tradeoffs (latency, cost, accuracy)?
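One rough way to put a number on the first question is to measure how far a cleaned output drifts from a verbatim reference. The sketch below uses the word-level similarity ratio from Python's `difflib`; this is a stand-in chosen for simplicity, not the metric used in this experiment, and it is not the same as word error rate.

```python
from difflib import SequenceMatcher


def word_similarity(reference: str, candidate: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    return SequenceMatcher(None, ref_words, cand_words, autojunk=False).ratio()
```

Comparing each pipeline's output against the prompt 1 (verbatim) result would give a crude per-prompt measure of how much editing each prompt actually triggered.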

Tools

OpenAI: Whisper (STT) + GPT-4 (cleanup)
Google Gemini: Native audio input processing
Promptfoo: Evaluation framework