The Speech-To-Text Chain
December 6, 2025
By: Daniel Rosehill Date: Dec 09, 2025
These notes enumerate the factors that, in my experience, make the difference between partial and high levels of success in speech-to-text workflows.
After using speech-to-text intensively every day for just over a year, I've learned that the final accuracy of ASR, usually measured as word error rate (WER), is a composite of many factors rather than just a matter of using a better model.
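Since word error rate is the yardstick for everything that follows, here is a minimal sketch of how it is conventionally computed: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name and the normalisation (lowercasing and whitespace splitting only) are my assumptions for illustration; real evaluations usually normalise punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    # Assumption: lowercase + whitespace split is "good enough" normalisation.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("fine tune the small model", "find tune the small model"))  # 0.2
```

Measuring WER on the same recordings before and after changing one link in the chain is a simple way to see which changes actually help.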
I've also learned that relatively small models can reach high accuracy once some of these constituent elements are improved. This matters most when you need performant ASR on devices with limited compute, such as embedded hardware.
For example:
- Fine-tune a small model on your vocabulary and speech to get outsized accuracy.
- Add noise cleanup as a processing step after recording and before sending the audio for transcription.
- Record in a better environment and with a better microphone.
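To make the noise cleanup idea concrete, here is a deliberately crude sketch: a frame-based noise gate that silences stretches of audio whose level falls below a threshold. This is a stand-in I chose for illustration, not a recommendation of a specific technique; dedicated denoisers are far more sophisticated. The function name, the frame size (160 samples, which would be 10 ms at 16 kHz) and the threshold are all assumptions.

```python
import math
from typing import List


def noise_gate(samples: List[float], frame_size: int = 160,
               threshold: float = 0.02) -> List[float]:
    """Zero out frames whose RMS level is below `threshold`.

    `samples` are assumed to be floats in the range -1.0..1.0.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")

    cleaned: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        # Root-mean-square level of this frame.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            cleaned.extend(frame)                # loud enough: keep as-is
        else:
            cleaned.extend([0.0] * len(frame))   # quiet: treat as noise
    return cleaned


# Quiet hiss followed by louder speech-level audio (synthetic values).
audio = [0.001] * 160 + [0.5] * 160
gated = noise_gate(audio)
```

A gate like this only helps with steady, low-level background noise between words; it does nothing for noise that overlaps speech, which is part of why the recording environment and microphone sit earlier in the chain.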
This applies across the range of ASR applications.
I thought it would be useful to enumerate these factors and to try to order them in a "chain." The practical application is to inform the design and development of a couple of ASR projects that I'm working on.
The Chain
```mermaid
flowchart TD
    A[1. Speaker] --> B[2. Noise Environment]
    B --> C[3. Microphone]
    C --> D[4. Mic Positioning]
    D --> E[5. OS Audio Settings]
    E --> F[6. Audio Processing]
    F --> G[7. ASR Model]
    G --> H[8. Punctuation Restoration]
    H --> I[9. Post-Processing]
    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#fff3e0
    style F fill:#fff3e0
    style G fill:#e8f5e9
    style H fill:#e8f5e9
    style I fill:#e8f5e9
```
Chain Components
| Step | Component | Type | Description |
|---|---|---|---|
| 1 | The Speaker | Human | Clarity of speech and pronunciation |
| 2 | The Noise Environment | Environmental | Background noise, competing audio sources |
| 3 | The Microphone | Hardware | Microphone type, wired vs wireless, codec |
| 4 | Microphone Positioning | Hardware | Polar patterns, address angle, distance |
| 5 | OS Audio Settings | Software | Gain, sample rate, system audio processing |
| 6 | Audio Processing | Software | Noise reduction, audio enhancement |
| 7 | The ASR Model | AI/ML | Speech-to-text model selection and configuration |
| 8 | Punctuation Restoration | AI/ML | Adding punctuation, voice activity detection (VAD), diarisation |
| 9 | Post-Processing | AI/ML | LLM cleanup, formatting, filler removal |
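Step 5 (gain and sample rate) is one of the easier links to check programmatically. The sketch below reads a recording with Python's standard `wave` module and reports its sample rate, channel count and peak level. It assumes a 16-bit PCM WAV file; the function name and the "too quiet" and "near clipping" thresholds are hypothetical values picked for illustration, not established limits.

```python
import array
import sys
import wave


def inspect_wav(path: str) -> dict:
    """Report basic capture properties of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()      # WAV data is little-endian

    # Peak level as a fraction of full scale (0.0 to 1.0).
    peak = max((abs(s) for s in samples), default=0) / 32768.0
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "peak": peak,
        "too_quiet": peak < 0.1,       # hypothetical threshold
        "near_clipping": peak >= 0.99, # hypothetical threshold
    }
```

Calling `inspect_wav("recording.wav")` on a test recording (the file name is a placeholder) gives a quick sanity check that the operating system is not capturing at an unexpectedly low level or sample rate before the audio ever reaches the model.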
Component Categories
- Human/Environmental (blue, steps 1-2): Factors relating to the speaker and their surroundings
- Hardware/Signal (orange, steps 3-6): Physical capture and signal processing
- AI/ML Processing (green, steps 7-9): Model-based transcription and refinement
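For anyone wanting to use the chain as a checklist in their own tooling, it can be represented as plain data. The sketch below encodes the nine steps with the colour categories from the diagram and groups them by category; the `Stage` class and `stages_by_category` helper are names I made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    step: int
    name: str
    category: str


# Steps and categories as listed in the chain above.
CHAIN = [
    Stage(1, "Speaker", "Human/Environmental"),
    Stage(2, "Noise Environment", "Human/Environmental"),
    Stage(3, "Microphone", "Hardware/Signal"),
    Stage(4, "Mic Positioning", "Hardware/Signal"),
    Stage(5, "OS Audio Settings", "Hardware/Signal"),
    Stage(6, "Audio Processing", "Hardware/Signal"),
    Stage(7, "ASR Model", "AI/ML Processing"),
    Stage(8, "Punctuation Restoration", "AI/ML Processing"),
    Stage(9, "Post-Processing", "AI/ML Processing"),
]


def stages_by_category(chain: List[Stage]) -> Dict[str, List[str]]:
    """Group stage names by category, preserving chain order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for stage in sorted(chain, key=lambda s: s.step):
        groups[stage.category].append(stage.name)
    return dict(groups)


for category, names in stages_by_category(CHAIN).items():
    print(f"{category}: {', '.join(names)}")
```

Keeping the chain as data makes it straightforward to attach notes or measurements to each stage when auditing a particular setup.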