AI Transcription Notepad

July 23, 2026 · View on GitHub

AI Transcription Notepad

Multimodal Cloud Transcription for Desktop

Download · Documentation · User Manual (PDF)

AI Transcription Notepad Main Interface

Why AI Transcription Notepad?

Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. AI Transcription Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.

Traditional Approach	AI Transcription Notepad
Record → ASR → Raw text → LLM → Formatted output	Record → Multimodal AI → Formatted output
Two API calls, higher latency	Single API call, faster results
AI reads text only	AI "hears" your voice

The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.

Key Benefits

Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
Fast — Single API call with local preprocessing
Smart cleanup — Removes filler words, adds punctuation, formats output
Global hotkeys — Record from anywhere, even when minimized
Flexible output — App window, clipboard, or inject directly at cursor
Translation — Translate to 30+ languages in the same API call

Documentation

	Online Documentation Full documentation site with guides, reference, and troubleshooting.
	User Manual v3 (PDF) Complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting.

Quick Start

Download from Releases (AppImage, .deb, or Windows installer)
Add your OpenRouter API key (get one here)
Press Record, speak naturally, press Transcribe
Get clean, formatted text

# Or run from source
git clone https://github.com/danielrosehill/AI-Transcription-Notepad.git
cd AI-Transcription-Notepad && ./run.sh

Dual-Pipeline Architecture

AI Transcription Notepad combines local preprocessing with cloud transcription for optimal cost and quality.

flowchart LR
    subgraph LOCAL["Local Preprocessing"]
        direction LR
        A[Record<br/>48kHz] --> B[AGC<br/>Normalize]
        B --> C[VAD<br/>Remove Silence]
        C --> D[Compress<br/>16kHz mono]
    end

    subgraph CLOUD["Cloud Transcription"]
        direction LR
        E[Prompt<br/>Concatenation] --> F[Gemini API<br/>Audio + Prompt]
        F --> G[Formatted<br/>Text]
    end

    D --> E

    style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px

Stage	Component	Purpose
Local	AGC	Normalizes audio levels (target -3 dBFS)
Local	VAD	Strips silence — typically 30-80% reduction
Local	Compress	Downsamples to 16kHz mono WAV
Cloud	Prompt Concatenation	Builds layered instructions
Cloud	Gemini API	Single-pass transcription + cleanup

Prompt Concatenation System

AI Transcription Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.

flowchart TB
    subgraph FOUNDATION["Foundation Layer (Always Applied)"]
        F1[Remove filler words]
        F2[Add punctuation]
        F3[Fix grammar & spelling]
        F4[Honor verbal commands]
        F5[Handle background audio]
    end

    subgraph FORMAT["Format Layer"]
        FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
    end

    subgraph STYLE["Style Layer"]
        S1[Formality<br/>Casual → Professional]
        S2[Verbosity<br/>None → Maximum reduction]
    end

    subgraph PERSONAL["Personalization"]
        P1[Email signatures]
        P2[User name]
    end

    FOUNDATION --> FORMAT
    FORMAT --> STYLE
    STYLE --> PERSONAL
    PERSONAL --> OUTPUT[Final Prompt]

    style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px

Prompt Stacks

Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:

Stack Example	Layers Combined
Meeting Notes + Actions	Foundation + Meeting format + Action item extraction
Technical Documentation	Foundation + Doc format + Code extraction + Markdown
Quick Email	Foundation + Email format + Professional tone + Signature

Create custom stacks in the Prompt Stacks tab, then apply them with a single click.

Supported Provider

Provider	Default Model	Notes
OpenRouter	`google/gemini-3.5-flash-lite`	Gemini 3.5 Flash Lite (default), Gemini 3.6 Flash (quality/fallback)

OpenRouter is the sole provider. It offers per-key cost tracking, low latency, and access to current Gemini models via an OpenAI-compatible API.

Screenshots

Click to expand screenshots

Main Interface

Analytics Dashboard

Analytics

Global Hotkeys

Hotkeys

Prompt Formats

Formats

Technology Stack

Component	Technology
Transcription	OpenRouter (Gemini 3.5 Flash Lite / 3.6 Flash)
Voice Activity Detection	TEN VAD
Text-to-Speech	Edge TTS
Database	Mongita
UI Framework	PyQt6

See Technology Stack for details.

Benchmark Data

Real usage from ~2,000 transcriptions shows excellent performance with OpenRouter's Gemini models:

Provider	Model	Avg Inference	Chars/sec
OpenRouter	google/gemini-2.5-flash	2.5s	204

Anonymized usage data available in data/.

AI-Human Co-Authorship

This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.

Audio-Multimodal-AI-Resources — Curated list of audio-capable multimodal models
Audio-Understanding-Test-Prompts — Test prompts for evaluating audio understanding

License

MIT