STT App

May 17, 2025

A speech-to-text application built on OpenAI's open-source Whisper models.

Features

Convert speech to text using OpenAI's Whisper model
Record audio directly from your microphone
Open existing audio files for transcription
Save transcriptions to text files
Save recorded audio to WAV files
Noise cancellation for improved audio quality
Split long recordings into chunks for better processing
View detailed segment information for each transcription
Model finetuning on your voice for improved accuracy
Support for multiple Whisper model sizes:
- tiny (fast but less accurate)
- base (good balance)
- small (better accuracy)
- medium (high accuracy)
- large (best accuracy but slower and requires more memory)
NEW: Overlay mode for real-time transcription that can insert text at the cursor position in any application
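
To illustrate what the transcription and segment features build on, here is a minimal sketch that calls the openai-whisper package directly. It is an assumption about typical Whisper usage, not the application's own code; `transcribe_file`, `format_segments`, and `format_timestamp` are hypothetical helper names.

```python
from typing import List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: dict) -> List[str]:
    """Turn the "segments" list of a Whisper result into one line per segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


def transcribe_file(path: str, model_name: str = "base") -> dict:
    """Transcribe an audio file with openai-whisper (imported lazily)."""
    import whisper  # assumes `pip install openai-whisper` and ffmpeg on PATH

    model = whisper.load_model(model_name)  # tiny, base, small, medium, large
    return model.transcribe(path)  # dict with "text" and "segments" keys
```

Calling `format_segments(transcribe_file("recording.wav"))` would give roughly the kind of per-segment listing shown in the "Segments" tab, assuming `recording.wav` exists.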

Installation

From Source

Clone the repository:

git clone https://github.com/example/stt-app.git
cd stt-app

Install dependencies:
```
pip install -e .
```
For the overlay functionality, install additional system dependencies:
```
sudo apt-get install xdotool python3-xlib
```

Run the application:

python -m stt_app.main  # For the standard application

Or run the overlay mode:

python run_overlay.py  # For the overlay transcription mode

Debian Package

To build a Debian package:

Install build dependencies:

sudo apt-get install debhelper dh-python python3-setuptools

Build the package:
```
cd stt-app
dpkg-buildpackage -us -uc
```

Install the package:

sudo dpkg -i ../stt-app_0.1.0-1_all.deb
sudo apt-get install -f  # Install any missing dependencies

Usage

Standard Application

Launch the application by running stt-app or stt-app-gui, or start it from your applications menu.
Choose a Whisper model from the dropdown menu.
Configure recording settings:
- Enable/disable noise cancellation
- Set the chunk duration for processing
Either:
- Click "Record" to start recording from your microphone, then "Stop Recording" when finished.
- Click "Open Audio File" to select an existing audio file.
Click "Transcribe" to convert the speech to text.
The transcription will appear in the text area. You can:
- View the full transcription in the "Transcription" tab
- Examine individual segments in the "Segments" tab
- View and manipulate chunks in the "Chunks" tab
- Save the transcription by clicking "Save Text"
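
The chunk duration setting can be pictured as slicing a recording into fixed-length windows. The following is a minimal sketch of that idea, not the application's actual implementation; `chunk_bounds` is a hypothetical helper, and the 16 kHz sample rate in the usage lines is an assumption (it is the rate Whisper works with internally).

```python
from typing import List, Tuple


def chunk_bounds(
    total_samples: int, sample_rate: int, chunk_seconds: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive chunks.

    The last chunk is shorter when the recording length is not an exact
    multiple of the chunk duration.
    """
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    return [
        (start, min(start + chunk_len, total_samples))
        for start in range(0, total_samples, chunk_len)
    ]


# A 65-second recording at 16 kHz, split into 30-second chunks:
bounds = chunk_bounds(65 * 16000, 16000, 30)
# bounds == [(0, 480000), (480000, 960000), (960000, 1040000)]
```

Each `(start, end)` pair can then be used to slice the audio array before passing the chunk to the model.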

Overlay Mode (NEW)

The overlay mode provides a Windows 11-like experience for speech-to-text input anywhere on your system.

Launch the overlay by running:
```
python run_overlay.py
```
A floating window will appear on your screen.
Keyboard shortcuts:
- Alt+Shift+S: Start/stop listening
- Alt+Shift+I: Insert transcribed text at the current cursor position
- Alt+Shift+C: Clear the transcription
- Alt+Shift+O: Show/hide the overlay window
To use:
- Start listening by clicking the "Start Listening" button or pressing Alt+Shift+S
- Speak clearly into your microphone
- The transcription appears in real time in the overlay
- Position your cursor where you want to insert the text
- Click "Insert Text" or press Alt+Shift+I to paste the transcribed text
The overlay window can be dragged to any position on your screen.
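
One plausible way to implement the "Insert Text" step with the listed dependencies is to copy the text to the clipboard with pyperclip and then send a paste keystroke with xdotool. The sketch below is an assumption about how this could work, not the overlay's actual code; `paste_command` and `insert_text` are hypothetical names.

```python
import shutil
import subprocess
from typing import List


def paste_command(clear_modifiers: bool = True) -> List[str]:
    """Build the xdotool command that sends Ctrl+V to the focused window."""
    cmd = ["xdotool", "key"]
    if clear_modifiers:
        # Release held modifiers (e.g. Alt+Shift from the hotkey) first.
        cmd.append("--clearmodifiers")
    cmd.append("ctrl+v")
    return cmd


def insert_text(text: str) -> bool:
    """Copy text to the clipboard and paste it at the cursor.

    Returns False when there is nothing to insert or a dependency is missing.
    """
    if not text or shutil.which("xdotool") is None:
        return False
    try:
        import pyperclip  # assumes `pip install pyperclip`
    except ImportError:
        return False
    pyperclip.copy(text)
    subprocess.run(paste_command(), check=True)
    return True
```

Using `--clearmodifiers` matters here because the user may still be holding Alt+Shift from the Alt+Shift+I shortcut when the paste keystroke is sent.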

Finetuning

The application supports finetuning the Whisper model on your voice to improve transcription accuracy:

Go to Model → Finetune Model... to open the finetuning dialog.
In the Data Collection tab:
- Click "Next Prompt" to display a random text prompt
- Click "Record (5s)" to record yourself reading the prompt
- Repeat this process several times to build a training dataset
- It's recommended to record at least 10 samples for good results
In the Model Finetuning tab:
- Select the base model to finetune (e.g., "base")
- Set training parameters (epochs and batch size)
- Click "Start Finetuning" to begin the process
- Monitor progress in the log display
- Once completed, your finetuned model will be available to use
To use a finetuned model:
- Go to Model → Finetuned Models and select your model
- Or click "Load Selected Model" in the finetuning dialog

Finetuning adapts the model to your voice, accent, and speech patterns, which can improve transcription accuracy on your own recordings.
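
The data collection step boils down to pairing each recording with the prompt that was read. As a rough sketch (the manifest format, `manifest_lines`, and `enough_samples` are hypothetical and not taken from the application), the pairs could be serialized as JSON Lines and checked against the recommended minimum of 10 samples:

```python
import json
from typing import Iterable, List, Tuple

MIN_SAMPLES = 10  # the minimum number of recordings recommended above


def manifest_lines(samples: Iterable[Tuple[str, str]]) -> List[str]:
    """Serialize (audio_path, prompt_text) pairs as JSON Lines records.

    Pairs with an empty prompt are skipped because they cannot be used
    as training targets.
    """
    lines = []
    for audio_path, prompt in samples:
        prompt = prompt.strip()
        if prompt:
            lines.append(json.dumps({"audio": audio_path, "text": prompt}))
    return lines


def enough_samples(count: int, minimum: int = MIN_SAMPLES) -> bool:
    """Check whether the dataset meets the recommended minimum size."""
    return count >= minimum
```

A training script could read such a manifest line by line and load each audio file alongside its reference text.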

Command Line Options

usage: stt-app [-h] [--model {tiny,base,small,medium,large}] [--device DEVICE] [--audio-file AUDIO_FILE]

Speech to Text Application

optional arguments:
  -h, --help            show this help message and exit
  --model {tiny,base,small,medium,large}, -m {tiny,base,small,medium,large}
                        Whisper model to use (default: base)
  --device DEVICE, -d DEVICE
                        Device to use for inference (cpu, cuda, default: auto-detect)
  --audio-file AUDIO_FILE, -a AUDIO_FILE
                        Audio file to transcribe on startup
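
For reference, an argparse parser that produces equivalent options might look like the sketch below. The flag names, choices, and defaults are taken from the help text above; `build_parser` and `resolve_device` are hypothetical names, and the auto-detection logic (checking `torch.cuda.is_available()`) is an assumption about how "auto-detect" could be implemented.

```python
import argparse
from typing import Optional

MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options in the help text."""
    parser = argparse.ArgumentParser(
        prog="stt-app", description="Speech to Text Application"
    )
    parser.add_argument(
        "--model", "-m", choices=MODELS, default="base",
        help="Whisper model to use (default: base)",
    )
    parser.add_argument(
        "--device", "-d", default=None,
        help="Device to use for inference (cpu, cuda, default: auto-detect)",
    )
    parser.add_argument(
        "--audio-file", "-a", default=None,
        help="Audio file to transcribe on startup",
    )
    return parser


def resolve_device(requested: Optional[str]) -> str:
    """Return the requested device, or pick one when none was given."""
    if requested:
        return requested
    try:
        import torch  # assumption: PyTorch is installed alongside Whisper
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

For example, `build_parser().parse_args(["-m", "small", "-a", "talk.wav"])` yields `model="small"`, `audio_file="talk.wav"`, and `device=None`, which `resolve_device` then maps to a concrete device.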

Requirements for Overlay Mode

For the overlay mode to work properly, you'll need:

Python 3.7 or later
PyQt5
whisper (installed from PyPI as openai-whisper)
keyboard
pyperclip
python-xlib
xdotool (system package)

The overlay mode allows you to use speech-to-text functionality system-wide, similar to the Windows 11 speech input feature (Win+H).
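
To check these requirements before launching the overlay, a small script along the following lines may help. It is a sketch rather than part of the project; `check_overlay_requirements` is a hypothetical name, and the import names are assumptions based on how each package is usually imported (python-xlib is imported as `Xlib`).

```python
import importlib.util
import shutil
import sys
from typing import Dict

PYTHON_MODULES = ["PyQt5", "whisper", "keyboard", "pyperclip", "Xlib"]
SYSTEM_TOOLS = ["xdotool"]


def check_overlay_requirements() -> Dict[str, bool]:
    """Report which overlay requirements are available on this machine."""
    status = {"python>=3.7": sys.version_info >= (3, 7)}
    for name in PYTHON_MODULES:
        # find_spec locates the module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    for tool in SYSTEM_TOOLS:
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_overlay_requirements().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

Anything reported as missing can be installed with pip, or with apt-get in the case of xdotool.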

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

OpenAI Whisper for the speech recognition models
PyQt5 for the GUI framework
noisereduce for audio noise reduction
Transformers for model finetuning capabilities