STT App
May 17, 2025
Speech to Text application using state-of-the-art open source models.
Features
- Convert speech to text using OpenAI's Whisper model
- Record audio directly from your microphone
- Open existing audio files for transcription
- Save transcriptions to text files
- Save recorded audio to WAV files
- Noise cancellation for improved audio quality
- Split long recordings into chunks for better processing
- View detailed segment information for each transcription
- Model finetuning on your voice for improved accuracy
- Support for multiple Whisper model sizes:
  - tiny (fast but less accurate)
  - base (good balance)
  - small (better accuracy)
  - medium (high accuracy)
  - large (best accuracy but slower and requires more memory)
- NEW: Overlay mode for real-time transcription that can insert text anywhere
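The chunk-splitting feature listed above can be pictured with a small sketch. It only computes where each chunk would start and end in a recording; the function name `chunk_bounds` and the fixed-length, non-overlapping strategy are assumptions for illustration, not necessarily how the application splits audio.

```python
from typing import List, Tuple


def chunk_bounds(
    total_samples: int, sample_rate: int, chunk_seconds: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive fixed-length chunks.

    Hypothetical helper: the last chunk is shorter when the recording
    length is not a multiple of the chunk length.
    """
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    return [
        (start, min(start + chunk_len, total_samples))
        for start in range(0, total_samples, chunk_len)
    ]
```

For example, 70 seconds of 16 kHz audio with a 30 second chunk duration gives two full chunks and one 10 second remainder.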
Installation
From Source
1. Clone the repository:
     git clone https://github.com/example/stt-app.git
     cd stt-app
2. Install dependencies:
     pip install -e .
3. For the overlay functionality, install additional system dependencies:
     sudo apt-get install xdotool python3-xlib
4. Run the application:
     python -m stt_app.main   # For the standard application
   Or run the overlay mode:
     python run_overlay.py    # For the overlay transcription mode
Debian Package
To build a Debian package:
1. Install build dependencies:
     sudo apt-get install debhelper dh-python python3-setuptools
2. Build the package:
     cd stt-app
     dpkg-buildpackage -us -uc
3. Install the package:
     sudo dpkg -i ../stt-app_0.1.0-1_all.deb
     sudo apt-get install -f   # Install any missing dependencies
Usage
Standard Application
1. Launch the application by running stt-app or stt-app-gui, or from your applications menu.
2. Choose a Whisper model from the dropdown menu.
3. Configure recording settings:
   - Enable/disable noise cancellation
   - Set the chunk duration for processing
4. Either:
   - Click "Record" to start recording from your microphone, then "Stop Recording" when finished.
   - Click "Open Audio File" to select an existing audio file.
5. Click "Transcribe" to convert the speech to text.
6. The transcription will appear in the text area. You can:
   - View the full transcription in the "Transcription" tab
   - Examine individual segments in the "Segments" tab
   - View and manipulate chunks in the "Chunks" tab
   - Save the transcription by clicking "Save Text"
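As a rough idea of what per-segment information looks like, the sketch below formats segments as timestamped lines. It assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, which is the shape of the `segments` list returned by openai-whisper's `transcribe()`; the helper names and the output layout are hypothetical and may differ from what the "Segments" tab shows.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: List[Dict]) -> List[str]:
    """Turn Whisper-style segment dicts into one display line per segment."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]
```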
Overlay Mode (NEW)
The overlay mode provides a Windows 11-like experience for speech-to-text input anywhere on your system.
1. Launch the overlay by running:
     python run_overlay.py
2. A floating window will appear on your screen.
3. Keyboard shortcuts:
   - Alt+Shift+S: Start/stop listening
   - Alt+Shift+I: Insert transcribed text at the current cursor position
   - Alt+Shift+C: Clear the transcription
   - Alt+Shift+O: Show/hide the overlay window
4. To use:
   - Start listening by clicking the "Start Listening" button or pressing Alt+Shift+S
   - Speak clearly into your microphone
   - The transcription appears in real time in the overlay
   - Position your cursor where you want to insert the text
   - Click "Insert Text" or press Alt+Shift+I to paste the transcribed text
5. The overlay window can be dragged to any position on your screen.
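One way text insertion at the cursor can work on X11 is to hand the text to xdotool, which is among the overlay's system dependencies. The sketch below is an assumption about the approach, not the application's actual code: it uses `xdotool type`, whereas the overlay may instead copy the text to the clipboard and send a paste keystroke. The function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_type_command(text: str) -> List[str]:
    """Build an xdotool command line that types `text` into the focused window.

    --clearmodifiers releases held modifier keys (such as Alt+Shift from the
    hotkey) while typing; "--" keeps text starting with "-" from being
    parsed as an option.
    """
    return ["xdotool", "type", "--clearmodifiers", "--", text]


def insert_text(text: str) -> bool:
    """Type `text` at the cursor; return False if there is nothing to do
    or xdotool is not installed."""
    if not text or shutil.which("xdotool") is None:
        return False
    subprocess.run(build_type_command(text), check=True)
    return True
```

Passing the command as a list rather than a shell string avoids quoting problems when the transcription contains quotes or other shell metacharacters.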
Finetuning
The application supports finetuning the Whisper model on your voice to improve transcription accuracy:
1. Go to Model → Finetune Model... to open the finetuning dialog.
2. In the Data Collection tab:
   - Click "Next Prompt" to display a random text prompt
   - Click "Record (5s)" to record yourself reading the prompt
   - Repeat this process several times to build a training dataset
   - It's recommended to record at least 10 samples for good results
3. In the Model Finetuning tab:
   - Select the base model to finetune (e.g., "base")
   - Set training parameters (epochs and batch size)
   - Click "Start Finetuning" to begin the process
   - Monitor progress in the log display
   - Once completed, your finetuned model will be available to use
4. To use a finetuned model:
   - Go to Model → Finetuned Models and select your model
   - Or click "Load Selected Model" in the finetuning dialog
The finetuning process adapts the model to your voice, accent, and speech patterns, which can significantly improve transcription accuracy.
Command Line Options
usage: stt-app [-h] [--model {tiny,base,small,medium,large}] [--device DEVICE] [--audio-file AUDIO_FILE]

Speech to Text Application

optional arguments:
  -h, --help            show this help message and exit
  --model {tiny,base,small,medium,large}, -m {tiny,base,small,medium,large}
                        Whisper model to use (default: base)
  --device DEVICE, -d DEVICE
                        Device to use for inference (cpu, cuda, default: auto-detect)
  --audio-file AUDIO_FILE, -a AUDIO_FILE
                        Audio file to transcribe on startup
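For reference, a parser that accepts the options above could look like the following `argparse` sketch. The flags, choices, defaults, and help strings come from the usage text; the function name `build_parser` and the use of `None` to stand for "auto-detect" are assumptions.

```python
import argparse

MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented stt-app options."""
    parser = argparse.ArgumentParser(
        prog="stt-app", description="Speech to Text Application"
    )
    parser.add_argument(
        "--model", "-m", choices=MODEL_SIZES, default="base",
        help="Whisper model to use (default: base)",
    )
    parser.add_argument(
        "--device", "-d", default=None,  # None here means "auto-detect" (assumption)
        help="Device to use for inference (cpu, cuda, default: auto-detect)",
    )
    parser.add_argument(
        "--audio-file", "-a", default=None,
        help="Audio file to transcribe on startup",
    )
    return parser
```

A call such as `build_parser().parse_args(["-m", "small", "-a", "talk.wav"])` yields `model="small"`, `audio_file="talk.wav"`, and `device=None`. On Python 3.10 and later, argparse prints the section header as "options:" rather than "optional arguments:".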
Requirements for Overlay Mode
For the overlay mode to work properly, you'll need:
- Python 3.7 or later
- PyQt5
- whisper
- keyboard
- pyperclip
- python-xlib
- xdotool (system package)
The overlay mode allows you to use speech-to-text functionality system-wide, similar to the Windows 11 speech input feature (Win+H).
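A quick way to see which of these requirements are missing is a small check script. The sketch below is not part of the project; the function name is hypothetical, and the mapping reflects the import names the listed packages are normally installed under (for example, python-xlib is imported as `Xlib`).

```python
import importlib.util
import shutil
from typing import List, Mapping, Sequence

# Package name as listed above -> module name used in `import`.
PYTHON_MODULES = {
    "PyQt5": "PyQt5",
    "whisper": "whisper",
    "keyboard": "keyboard",
    "pyperclip": "pyperclip",
    "python-xlib": "Xlib",
}
SYSTEM_TOOLS = ["xdotool"]


def missing_requirements(
    modules: Mapping[str, str] = PYTHON_MODULES,
    tools: Sequence[str] = SYSTEM_TOOLS,
) -> List[str]:
    """Return the names of Python packages and system tools that cannot be found."""
    missing = [
        name for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]
    missing += [tool for tool in tools if shutil.which(tool) is None]
    return missing
```

An empty list from `missing_requirements()` means every listed package is importable and xdotool is on the PATH; it does not check versions.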
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- OpenAI Whisper for the speech recognition models
- PyQt5 for the GUI framework
- noisereduce for audio noise reduction
- Transformers for model finetuning capabilities