AI Transcription Notepad

February 16, 2026 · View on GitHub

AI Transcription Notepad

Multimodal Cloud Transcription for Desktop

License: MIT Platform Python


Download · Documentation · User Manual (PDF)


AI Transcription Notepad Main Interface


Why AI Transcription Notepad?

Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. AI Transcription Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.

Traditional ApproachAI Transcription Notepad
Record → ASR → Raw text → LLM → Formatted outputRecord → Multimodal AI → Formatted output
Two API calls, higher latencySingle API call, faster results
AI reads text onlyAI "hears" your voice

The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.


Key Benefits

  • Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
  • Fast — Single API call with local preprocessing
  • Smart cleanup — Removes filler words, adds punctuation, formats output
  • Global hotkeys — Record from anywhere, even when minimized
  • Flexible output — App window, clipboard, or inject directly at cursor
  • Translation — Translate to 30+ languages in the same API call

Documentation

Documentation Online Documentation
Full documentation site with guides, reference, and troubleshooting.
User Manual PDF User Manual v3 (PDF)
Complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting.

Quick Start

  1. Download from Releases (AppImage, .deb, or Windows installer)
  2. Add your OpenRouter API key (get one here)
  3. Press Record, speak naturally, press Transcribe
  4. Get clean, formatted text
# Or run from source
git clone https://github.com/danielrosehill/AI-Transcription-Notepad.git
cd AI-Transcription-Notepad && ./run.sh

Dual-Pipeline Architecture

AI Transcription Notepad combines local preprocessing with cloud transcription for optimal cost and quality.

flowchart LR
    subgraph LOCAL["Local Preprocessing"]
        direction LR
        A[Record<br/>48kHz] --> B[AGC<br/>Normalize]
        B --> C[VAD<br/>Remove Silence]
        C --> D[Compress<br/>16kHz mono]
    end

    subgraph CLOUD["Cloud Transcription"]
        direction LR
        E[Prompt<br/>Concatenation] --> F[Gemini API<br/>Audio + Prompt]
        F --> G[Formatted<br/>Text]
    end

    D --> E

    style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
StageComponentPurpose
LocalAGCNormalizes audio levels (target -3 dBFS)
LocalVADStrips silence — typically 30-80% reduction
LocalCompressDownsamples to 16kHz mono WAV
CloudPrompt ConcatenationBuilds layered instructions
CloudGemini APISingle-pass transcription + cleanup

Prompt Concatenation System

AI Transcription Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.

flowchart TB
    subgraph FOUNDATION["Foundation Layer (Always Applied)"]
        F1[Remove filler words]
        F2[Add punctuation]
        F3[Fix grammar & spelling]
        F4[Honor verbal commands]
        F5[Handle background audio]
    end

    subgraph FORMAT["Format Layer"]
        FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
    end

    subgraph STYLE["Style Layer"]
        S1[Formality<br/>Casual → Professional]
        S2[Verbosity<br/>None → Maximum reduction]
    end

    subgraph PERSONAL["Personalization"]
        P1[Email signatures]
        P2[User name]
    end

    FOUNDATION --> FORMAT
    FORMAT --> STYLE
    STYLE --> PERSONAL
    PERSONAL --> OUTPUT[Final Prompt]

    style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px

Prompt Stacks

Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:

Stack ExampleLayers Combined
Meeting Notes + ActionsFoundation + Meeting format + Action item extraction
Technical DocumentationFoundation + Doc format + Code extraction + Markdown
Quick EmailFoundation + Email format + Professional tone + Signature

Create custom stacks in the Prompt Stacks tab, then apply them with a single click.


Supported Provider

ProviderDefault ModelNotes
OpenRoutergoogle/gemini-3-flash-previewGemini 3 Flash (default), Gemini 3 Pro (fallback)

OpenRouter is the sole provider. It offers per-key cost tracking, low latency, and access to Gemini 3 models via an OpenAI-compatible API.


Screenshots

Click to expand screenshots

Main Interface

Main Interface

Analytics Dashboard

Analytics

Global Hotkeys

Hotkeys

Prompt Formats

Formats


Technology Stack

ComponentTechnology
TranscriptionOpenRouter (Gemini 3 Flash / Pro)
Voice Activity DetectionTEN VAD
Text-to-SpeechEdge TTS
DatabaseMongita
UI FrameworkPyQt6

See Technology Stack for details.


Benchmark Data

Real usage from ~2,000 transcriptions shows excellent performance with OpenRouter's Gemini models:

ProviderModelAvg InferenceChars/sec
OpenRoutergoogle/gemini-2.5-flash2.5s204

Anonymized usage data available in data/.


AI-Human Co-Authorship

This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.



License

MIT