Voice Note Ragie Pipeline
November 30, 2025 · View on GitHub
This repository contains a small collection of synthetic context data about a non-existent individual, generated by large language models. The data is designed to be internally consistent, creating a coherent fictional persona for testing purposes.
Purpose
The objective of this repository is to test and validate a voice RAG (Retrieval-Augmented Generation) pipeline using Ragie.
Pipeline Architecture
The voice RAG pipeline consists of the following stages:
- Voice Recording: Raw audio recordings containing personal context data (stored in
voice-data/) - Transcription: Voice recordings are transcribed to text using speech-to-text processing
- LLM Processing & Reformatting: Transcribed text passes through a large language model layer that structures the content optimally for retrieval as pieces of personal context data
- Embedding & Storage: Processed text is embedded and ingested into a vector database via Ragie
Pipeline Flowchart
flowchart TD
subgraph Input["Input"]
A[Voice Recording<br/>MP3/WAV Audio Files]
end
subgraph STT["Speech-to-Text"]
B[Transcription Service<br/>Whisper / Gemini / etc.]
end
subgraph LLM["LLM Processing"]
C[OpenRouter API<br/>Claude Haiku / GPT-4o-mini]
D[Context Standardization<br/>- Remove filler words<br/>- Structure content<br/>- Extract metadata<br/>- Categorize information]
end
subgraph RAG["RAG Storage"]
E[Ragie API<br/>Document Upload]
F[(Vector Database<br/>Embeddings + Metadata)]
end
subgraph Output["Output"]
G[Ready for Retrieval<br/>Semantic Search Enabled]
end
A -->|Audio File| B
B -->|Raw Transcript| C
C --> D
D -->|Structured JSON| E
E -->|Embed & Index| F
F --> G
style A fill:#e1f5fe
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#f3e5f5
style E fill:#e8f5e9
style F fill:#e8f5e9
style G fill:#c8e6c9
Data Flow Summary
| Stage | Input | Output | Tool/Service |
|---|---|---|---|
| 1. Recording | Voice | MP3/WAV file | Any recorder |
| 2. Transcription | Audio file | Raw text | Whisper, Gemini, etc. |
| 3. Standardization | Raw transcript | Structured JSON | OpenRouter (LLM) |
| 4. Embedding | Structured text | Vector embeddings | Ragie API |
Repository Structure
.
├── voice-data/ # MP3 audio recordings of synthetic context data
│ ├── general-context.mp3
│ ├── 1.mp3
│ ├── 2.mp3
│ └── ...
├── texts/ # Text transcripts (for reference/validation)
│ ├── general.txt
│ ├── 1.txt
│ ├── 2.txt
│ └── ...
├── processed/ # LLM-processed structured outputs (generated)
├── pipeline.py # Full pipeline script (STT -> LLM -> Ragie)
├── .env.example # Example environment variables template
├── .env # Your API keys (create from .env.example, git-ignored)
└── README.md
Running the Pipeline
Prerequisites
- Install dependencies:
pip install openai ragie python-dotenv
- Copy the example environment file and add your API keys:
cp .env.example .env
# Edit .env with your actual API keys
You'll need two API keys:
- OpenRouter: Get yours at https://openrouter.ai/keys
- Ragie: Get yours at https://app.ragie.ai/settings/api-keys
Full Pipeline Usage
The pipeline.py script processes transcripts through the complete pipeline:
# Run the full pipeline on all transcripts in texts/
python pipeline.py
This will:
- Read each
.txttranscript fromtexts/ - Send to OpenRouter LLM for context standardization
- Upload structured content to Ragie with metadata
- Save processed outputs to
processed/
Simple Direct Upload
For simple direct upload without LLM processing:
import os
from ragie import Ragie
client = Ragie(auth="YOUR_RAGIE_API_KEY")
VOICE_DATA_DIR = "./voice-data"
for filename in os.listdir(VOICE_DATA_DIR):
if filename.endswith(".mp3"):
file_path = os.path.join(VOICE_DATA_DIR, filename)
with open(file_path, "rb") as f:
client.documents.create(
file=f,
metadata={"source": "voice-note", "filename": filename}
)
print(f"Uploaded: {filename}")
print("All voice notes uploaded to Ragie.")
Required API Keys
| Service | Environment Variable | Purpose |
|---|---|---|
| OpenRouter | OPENROUTER_API_KEY | LLM for context standardization |
| Ragie | RAGIE_API_KEY | Vector storage and retrieval |
Important Notes
- All personal data in this repository is entirely fictional and generated by LLMs
- The synthetic individual does not exist
- This data is intended solely for pipeline testing and validation purposes
- The contextual information is designed to be internally consistent across all recordings