Voice Cleanup Prompt Experiment

December 3, 2025

Comparative evaluation of speech-to-text cleanup approaches: traditional STT+LLM pipelines versus native multimodal audio processing.

Objective

Now that LLMs can accept audio input directly, a voice recording can be processed in a single step instead of through a two-stage pipeline (speech-to-text followed by text cleanup). This experiment compares the two approaches:

  1. OpenAI Pipeline: Separate STT (Whisper) + LLM cleanup
  2. Gemini Native: Integrated multimodal audio processing
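The difference between the two approaches can be sketched in a vendor-neutral way. This is only an illustration, not the repository's actual scripts: `run_pipeline`, `run_native`, and the three callable types are hypothetical names, and each callable is assumed to wrap whichever SDK call you use for transcription or cleanup.

```python
from typing import Callable

# Hypothetical type aliases: each callable wraps a vendor SDK call of your choice.
Transcriber = Callable[[bytes], str]        # audio -> raw transcript
TextCleaner = Callable[[str, str], str]     # (prompt, transcript) -> cleaned text
AudioCleaner = Callable[[str, bytes], str]  # (prompt, audio) -> cleaned text


def run_pipeline(audio: bytes, prompt: str,
                 transcribe: Transcriber, clean: TextCleaner) -> str:
    """Two-stage approach: speech-to-text first, then LLM cleanup of the transcript."""
    transcript = transcribe(audio)
    return clean(prompt, transcript)


def run_native(audio: bytes, prompt: str, clean_audio: AudioCleaner) -> str:
    """Single-stage approach: the multimodal model receives prompt and audio together."""
    return clean_audio(prompt, audio)
```

In the two-stage version the cleanup model only ever sees the transcript, so anything the STT step drops or mishears cannot be recovered later. The single-stage version has no such intermediate text.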

Experiment Design

Prompt Progression (1-10)

Prompts 1-5: Editing Liberty Spectrum

  • Prompt 1: Verbatim transcription only (most restrictive)
  • Prompts 2-4: Progressively more editing freedom
  • Prompt 5: Full restructuring and enhancement allowed (most liberal)
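One rough way to quantify where an output falls on this spectrum is to measure how much of the verbatim transcript survives in the cleaned text. The sketch below is an assumption, not part of the experiment: `edit_similarity` is a hypothetical helper that uses the standard library's `difflib.SequenceMatcher` on word sequences, and the result is only a proxy for editing liberty.

```python
from difflib import SequenceMatcher


def edit_similarity(verbatim: str, cleaned: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are identical."""
    # Compare case-insensitively on whitespace-separated words.
    a = verbatim.lower().split()
    b = cleaned.lower().split()
    return SequenceMatcher(None, a, b).ratio()
```

Under this measure, a Prompt 1 output would be expected to score close to 1.0 against the raw transcript, with scores dropping as prompts allow more restructuring.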

Prompts 6-10: Format Adherence Testing

  • Testing structured output generation (e.g., README format)
  • Evaluating how closely each approach follows prompt instructions
  • Comparing format compliance between approaches
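Format compliance can be checked mechanically for simple cases. The following is a minimal sketch that assumes the prompt asks for Markdown output with specific section headings; `missing_sections` and the section titles in the usage note are hypothetical and not taken from the actual prompts.

```python
import re


def missing_sections(output: str, required: list[str]) -> list[str]:
    """Return the required section titles that do not appear as Markdown headings."""
    # Collect ATX-style headings ("# Title" through "###### Title"), lowercased.
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", output, flags=re.MULTILINE
        )
    }
    return [title for title in required if title.lower() not in headings]
```

For example, `missing_sections(text, ["Installation", "Usage"])` returns an empty list when both headings are present, which makes it easy to compare compliance between the two approaches prompt by prompt.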

Structure

├── prompts/           # 10 prompt variations (1.md - 10.md)
├── outputs/
│   ├── openai/        # Results from STT + LLM pipeline
│   ├── gemini/        # Results from native multimodal
│   └── summary.md     # Comparison notes
├── sample-audio/      # Test audio files
├── scripts/           # Processing scripts
└── promptfooconfig.yaml
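Given this layout, outputs from the two approaches can be paired by prompt number for side-by-side review. This sketch assumes each output file is named after its prompt (for example `outputs/openai/3.md`), which the tree above does not confirm; `paired_outputs` is a hypothetical helper.

```python
from pathlib import Path


def paired_outputs(root: Path, prompt_ids=range(1, 11)) -> dict[int, tuple[str, str]]:
    """Map prompt number -> (openai_text, gemini_text) where both files exist."""
    pairs = {}
    for n in prompt_ids:
        # Assumed naming: outputs/<approach>/<prompt number>.md
        openai_file = root / "outputs" / "openai" / f"{n}.md"
        gemini_file = root / "outputs" / "gemini" / f"{n}.md"
        if openai_file.is_file() and gemini_file.is_file():
            pairs[n] = (
                openai_file.read_text(encoding="utf-8"),
                gemini_file.read_text(encoding="utf-8"),
            )
    return pairs
```

Prompts with a missing output on either side are skipped, so the result only contains comparable pairs.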

Key Questions

  • How does output quality compare at different editing liberty levels?
  • Does native multimodal processing preserve speaker intent better?
  • How well does each approach follow formatting instructions?
  • What are the practical tradeoffs (latency, cost, accuracy)?
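For the latency question, wall-clock time per request is the simplest thing to record. The helper below is a sketch under that assumption: `timed` is a hypothetical name, and it measures end-to-end elapsed time only, so for the two-stage pipeline it would need to wrap both the STT and cleanup calls to be comparable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> tuple[T, float]:
    """Call fn() and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

A single measurement is noisy because network and service load vary, so repeating each request several times and comparing medians gives a fairer picture.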

Tools

  • OpenAI: Whisper (STT) + GPT-4 (cleanup)
  • Google Gemini: Native audio input processing
  • Promptfoo: Evaluation framework