Voice Cleanup Prompt Experiment
December 3, 2025
Comparative evaluation of speech-to-text cleanup approaches: traditional STT+LLM pipelines versus native multimodal audio processing.
Objective
With the emergence of multimodal audio understanding in LLMs, it's now possible to process voice recordings directly rather than using a two-stage pipeline (speech-to-text followed by text cleanup). This experiment compares these two approaches:
- OpenAI Pipeline: Separate STT (Whisper) + LLM cleanup
- Gemini Native: Integrated multimodal audio processing
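The structural difference between the two approaches can be sketched as follows. This is a minimal illustration, not the repository's actual scripts: the function names and the injected callables (`transcribe`, `clean_text`, `clean_audio`) are hypothetical stand-ins for whatever client code wraps the real STT and LLM calls.

```python
from typing import Callable

# Hypothetical callable shapes; the real scripts may wire these differently.
Transcribe = Callable[[bytes], str]        # audio -> raw transcript
CleanText = Callable[[str, str], str]      # (prompt, transcript) -> cleaned text
CleanAudio = Callable[[str, bytes], str]   # (prompt, audio) -> cleaned text


def two_stage_cleanup(
    audio: bytes,
    prompt: str,
    transcribe: Transcribe,
    clean_text: CleanText,
) -> str:
    """Pipeline approach: speech-to-text first, then LLM cleanup of the transcript."""
    transcript = transcribe(audio)
    return clean_text(prompt, transcript)


def native_cleanup(audio: bytes, prompt: str, clean_audio: CleanAudio) -> str:
    """Native approach: one call that receives the prompt and the audio together."""
    return clean_audio(prompt, audio)
```

In the two-stage version the cleanup model only ever sees text, so anything lost during transcription cannot be recovered later. The native version has no intermediate transcript at all.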
Experiment Design
Prompt Progression (1-10)
Prompts 1-5: Editing Liberty Spectrum
- Prompt 1: Verbatim transcription only (most restrictive)
- Prompts 2-4: Progressively more editing freedom
- Prompt 5: Full restructuring and enhancement allowed (most liberal)
Prompts 6-10: Format Adherence Testing
- Testing structured output generation (e.g., README format)
- Evaluating prompt instruction following
- Comparing format compliance between approaches
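One possible way to score format compliance for a README-style prompt is to check that the expected section headings are present. The sketch below assumes the output is Markdown with ATX (`#`) headings and that compliance means "all required sections exist"; both the `missing_sections` helper and that definition are illustrative assumptions, not the experiment's actual criteria.

```python
import re
from typing import List

# Matches ATX headings such as "# Title" or "## Usage ##" on their own line.
_HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def missing_sections(output: str, required: List[str]) -> List[str]:
    """Return the required heading titles that do not appear in `output`.

    Comparison is case-insensitive and ignores the heading level.
    """
    found = {m.group(1).strip().lower() for m in _HEADING.finditer(output)}
    return [title for title in required if title.lower() not in found]
```

An empty return value means every required section was found. The same check can be run against both approaches' outputs for a given prompt.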
Structure
├── prompts/ # 10 prompt variations (1.md - 10.md)
├── outputs/
│ ├── openai/ # Results from STT + LLM pipeline
│ ├── gemini/ # Results from native multimodal
│ └── summary.md # Comparison notes
├── sample-audio/ # Test audio files
├── scripts/ # Processing scripts
└── promptfooconfig.yaml
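Given this layout, outputs from the two approaches can be paired by prompt number for side-by-side review. The sketch below assumes each output file is named after its prompt (for example `outputs/openai/3.md`); the tree above does not show the output filenames, so treat that naming and the `paired_outputs` helper as hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def paired_outputs(root: Path, ext: str = ".md") -> Iterator[Tuple[int, Path, Path]]:
    """Yield (prompt_number, openai_path, gemini_path) for prompts 1-10.

    Only prompts that have an output file in both directories are yielded.
    The "<n><ext>" filename pattern is an assumption, not confirmed by the repo.
    """
    for n in range(1, 11):
        openai_path = root / "outputs" / "openai" / f"{n}{ext}"
        gemini_path = root / "outputs" / "gemini" / f"{n}{ext}"
        if openai_path.is_file() and gemini_path.is_file():
            yield n, openai_path, gemini_path
```

Run from the repository root, `list(paired_outputs(Path(".")))` lists every prompt that has results from both approaches.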
Key Questions
- How does output quality compare at different editing liberty levels?
- Does native multimodal processing preserve speaker intent better than the two-stage pipeline?
- How well does each approach follow formatting instructions?
- What are the practical tradeoffs (latency, cost, accuracy)?
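For the latency part of the last question, a small timing wrapper is enough to compare wall-clock time per approach. This is a generic sketch: the `timed` helper is hypothetical and is not taken from the repository's scripts.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Wrapping each approach's call in a zero-argument lambda and passing it to `timed` gives comparable end-to-end numbers. For the two-stage pipeline, that figure includes both the STT and the cleanup request.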
Tools
- OpenAI: Whisper (STT) + GPT-4 (cleanup)
- Google Gemini: Native audio input processing
- Promptfoo: Evaluation framework