Voice Note Ragie Pipeline

November 30, 2025 · View on GitHub

This repository contains a small collection of synthetic context data about a non-existent individual, generated by large language models. The data is designed to be internally consistent, creating a coherent fictional persona for testing purposes.

Purpose

The objective of this repository is to test and validate a voice RAG (Retrieval-Augmented Generation) pipeline using Ragie.

Pipeline Architecture

The voice RAG pipeline consists of the following stages:

  1. Voice Recording: Raw audio recordings containing personal context data (stored in voice-data/)
  2. Transcription: Voice recordings are transcribed to text using speech-to-text processing
  3. LLM Processing & Reformatting: Transcribed text passes through a large language model layer that structures the content optimally for retrieval as pieces of personal context data
  4. Embedding & Storage: Processed text is embedded and ingested into a vector database via Ragie

Pipeline Flowchart

flowchart TD
    subgraph Input["Input"]
        A[Voice Recording<br/>MP3/WAV Audio Files]
    end

    subgraph STT["Speech-to-Text"]
        B[Transcription Service<br/>Whisper / Gemini / etc.]
    end

    subgraph LLM["LLM Processing"]
        C[OpenRouter API<br/>Claude Haiku / GPT-4o-mini]
        D[Context Standardization<br/>- Remove filler words<br/>- Structure content<br/>- Extract metadata<br/>- Categorize information]
    end

    subgraph RAG["RAG Storage"]
        E[Ragie API<br/>Document Upload]
        F[(Vector Database<br/>Embeddings + Metadata)]
    end

    subgraph Output["Output"]
        G[Ready for Retrieval<br/>Semantic Search Enabled]
    end

    A -->|Audio File| B
    B -->|Raw Transcript| C
    C --> D
    D -->|Structured JSON| E
    E -->|Embed & Index| F
    F --> G

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#f3e5f5
    style E fill:#e8f5e9
    style F fill:#e8f5e9
    style G fill:#c8e6c9

Data Flow Summary

StageInputOutputTool/Service
1. RecordingVoiceMP3/WAV fileAny recorder
2. TranscriptionAudio fileRaw textWhisper, Gemini, etc.
3. StandardizationRaw transcriptStructured JSONOpenRouter (LLM)
4. EmbeddingStructured textVector embeddingsRagie API

Repository Structure

.
├── voice-data/          # MP3 audio recordings of synthetic context data
│   ├── general-context.mp3
│   ├── 1.mp3
│   ├── 2.mp3
│   └── ...
├── texts/               # Text transcripts (for reference/validation)
│   ├── general.txt
│   ├── 1.txt
│   ├── 2.txt
│   └── ...
├── processed/           # LLM-processed structured outputs (generated)
├── pipeline.py          # Full pipeline script (STT -> LLM -> Ragie)
├── .env.example         # Example environment variables template
├── .env                 # Your API keys (create from .env.example, git-ignored)
└── README.md

Running the Pipeline

Prerequisites

  1. Install dependencies:
pip install openai ragie python-dotenv
  1. Copy the example environment file and add your API keys:
cp .env.example .env
# Edit .env with your actual API keys

You'll need two API keys:

Full Pipeline Usage

The pipeline.py script processes transcripts through the complete pipeline:

# Run the full pipeline on all transcripts in texts/
python pipeline.py

This will:

  1. Read each .txt transcript from texts/
  2. Send to OpenRouter LLM for context standardization
  3. Upload structured content to Ragie with metadata
  4. Save processed outputs to processed/

Simple Direct Upload

For simple direct upload without LLM processing:

import os
from ragie import Ragie

client = Ragie(auth="YOUR_RAGIE_API_KEY")

VOICE_DATA_DIR = "./voice-data"

for filename in os.listdir(VOICE_DATA_DIR):
    if filename.endswith(".mp3"):
        file_path = os.path.join(VOICE_DATA_DIR, filename)

        with open(file_path, "rb") as f:
            client.documents.create(
                file=f,
                metadata={"source": "voice-note", "filename": filename}
            )

        print(f"Uploaded: {filename}")

print("All voice notes uploaded to Ragie.")

Required API Keys

ServiceEnvironment VariablePurpose
OpenRouterOPENROUTER_API_KEYLLM for context standardization
RagieRAGIE_API_KEYVector storage and retrieval

Important Notes

  • All personal data in this repository is entirely fictional and generated by LLMs
  • The synthetic individual does not exist
  • This data is intended solely for pipeline testing and validation purposes
  • The contextual information is designed to be internally consistent across all recordings