Quansloth: TurboQuant Local AI Server
May 13, 2026
[ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]
Achievement Unlocked: 128+ Stars!
We just bagged our second Starstruck medal!
A massive thank you to this amazing community. Hitting 128 stars, a beautiful power of two, is a huge milestone. Your support, feedback, and contributions are what fuel the code and keep this project leveling up. Let's aim for the next high score (256)!
Breaking the VRAM Wall: Built on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings KV cache compression to local LLM inference.
Quansloth is a fully private, air-gapped AI server that runs long-context models natively on consumer hardware (like an RTX 3060). By bridging a custom Gradio Python frontend with an optimized llama.cpp CUDA backend, Quansloth compresses the KV cache by up to 75%, cutting the VRAM that long contexts require.
Why Quansloth? (No More GPU Crashes)
Standard LLM inference often hits a "Memory Wall" when processing long documents; as the context grows, the GPU runs out of memory (OOM) and the system crashes.
Quansloth prevents these crashes by:
- 75% Cache Shrink: Compressing the KV cache (the model's working "memory") from 16-bit to 4-bit values with TurboQuant.
- Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
- Hardware-Level Stability: Our interface monitors the CUDA backend to ensure the model stays within your GPU's physical limits, allowing for stable, long-form document analysis without the fear of a system hang.
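To see where the 75% figure comes from, here is a back-of-the-envelope sketch of KV cache sizing. The function name and the model dimensions (32 layers, 8 KV heads, head dimension 128, roughly matching Llama 3 8B) are illustrative assumptions. The calculation also ignores the per-block scale metadata that real quantized formats carry, so actual savings will sit somewhat below this ideal.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Ideal KV cache size: one K and one V vector per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = keys + values
    return values * bits_per_value / 8


# Assumed dimensions, roughly Llama 3 8B (example values, not read from a model file).
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

fp16 = kv_cache_bytes(32_768, bits_per_value=16, **dims)
q4 = kv_cache_bytes(32_768, bits_per_value=4, **dims)

print(f"16-bit cache: {fp16 / GIB:.2f} GiB")  # 4.00 GiB
print(f" 4-bit cache: {q4 / GIB:.2f} GiB")    # 1.00 GiB
print(f"saved: {1 - q4 / fp16:.0%}")          # 75%
```

These numbers cover the cache only. The model weights occupy VRAM on top of this, so the total footprint is weights plus cache.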

Interface Preview

OS Compatibility
- Windows 10/11: Fully supported (via WSL2 Ubuntu). Features a 1-click .bat launcher.
- Linux: Fully supported (native).
- macOS: Not officially supported out of the box (the backend is optimized for NVIDIA CUDA GPUs).
Features
- TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
- Live Hardware Analytics: The UI intercepts the C++ engine's log output to report your exact VRAM allocation and savings in real time.
- Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
- Dual-Routing: Auto-scan your local models/ folder, or input custom absolute paths to load any .gguf file.
- Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.
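The Live Hardware Analytics idea can be sketched as a regular expression run over the engine's log stream. This is only an illustration, not Quansloth's actual parser. The sample lines are modeled on the "KV buffer size" messages that llama.cpp builds commonly print; the exact wording varies between versions and forks, so treat the pattern as an assumption. The names KV_LINE and kv_buffer_mib are hypothetical.

```python
import re

# Assumed log shape: "<prefix>: <device> KV buffer size = <number> MiB".
KV_LINE = re.compile(r"(?P<device>\S+)\s+KV buffer size\s*=\s*(?P<mib>[\d.]+)\s*MiB")


def kv_buffer_mib(log_text: str) -> dict[str, float]:
    """Sum the reported KV buffer size (MiB) per device across all log lines."""
    totals: dict[str, float] = {}
    for match in KV_LINE.finditer(log_text):
        device = match.group("device")
        totals[device] = totals.get(device, 0.0) + float(match.group("mib"))
    return totals


# Illustrative log excerpts with made-up values.
baseline_log = "llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB\n"
compressed_log = "llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB\n"

before = kv_buffer_mib(baseline_log)["CUDA0"]
after = kv_buffer_mib(compressed_log)["CUDA0"]
print(f"KV cache: {before:.0f} MiB -> {after:.0f} MiB ({1 - after / before:.0%} saved)")
```

Summing per device keeps the sketch usable when a backend reports several buffers for the same GPU.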
Prerequisites
- Windows with WSL2 (Ubuntu) OR native Linux
- NVIDIA GPU with updated drivers
- Miniconda or Anaconda installed
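Before installing, it can help to confirm that the prerequisites are visible from the shell you plan to use. A minimal sketch, assuming the NVIDIA driver exposes nvidia-smi on PATH and that Conda is reachable as conda; git is included because the installation clones the repository. The function name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools: tuple[str, ...] = ("nvidia-smi", "conda", "git")) -> list[str]:
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisite tools found.")
```

Finding a tool on PATH only shows that it is installed. It does not verify the driver version or that the GPU is usable from inside WSL2.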
Installation
1. Prepare Python Environment
conda create -n quansloth python=3.10 -y
conda activate quansloth
2. Clone Repository
git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth
3. Run Installer
chmod +x install.sh
./install.sh
Usage
Adding Models
Download .gguf models (e.g., Llama 3 8B) and place them in:
models/
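To check which models the auto-scan is likely to pick up, you can list the folder yourself. A minimal sketch, assuming the scan simply looks for .gguf files directly inside models/; find_gguf_models is a hypothetical helper, not a function from the Quansloth codebase.

```python
from pathlib import Path


def find_gguf_models(models_dir: str = "models") -> list[Path]:
    """Return the .gguf files directly inside models_dir, sorted by path."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


if __name__ == "__main__":
    for model in find_gguf_models():
        size_gib = model.stat().st_size / 1024 ** 3
        print(f"{model.name}  ({size_gib:.2f} GiB)")
```

Run it from the repository root so that the relative models path resolves correctly.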
Start Server (Windows - 1 Click)
- Use Launch_Quansloth.bat
- Double-click it to auto-launch WSL, Conda, and the server
Start Server (Linux / WSL)
conda activate quansloth
python quansloth_gui.py
Connect
http://127.0.0.1:7860
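If the page does not load, a quick probe shows whether anything is listening on that address. A minimal sketch using only the standard library; the helper name is_server_up is hypothetical, and the default URL is the one given above.

```python
import urllib.error
import urllib.request

# The server is local, so bypass any system-wide HTTP proxy settings.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_server_up(url: str = "http://127.0.0.1:7860", timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP at url, False if the connection fails."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status (for example 404).
        return True
    except OSError:
        # Connection refused, timeout, and similar (URLError subclasses OSError).
        return False


if __name__ == "__main__":
    print("Quansloth UI reachable:", is_server_up())
```

A False result right after launch usually just means the server is still starting, so retrying after a few seconds is reasonable.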
Pro Tips
- Symmetric (Turbo3): Best overall compression
- Asymmetric (Q8/Turbo4): Better for Q4_K_M models (e.g., Qwen)
- Monitor Hardware Stats for real-time VRAM savings
License & Credits
- License: This project is licensed under the Apache 2.0 License.
- Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
- Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
- CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.
Author
Dr. Manuel Herrador (mherrador@ujaen.es)
University of Jaén (UJA) - Spain
Made with ❤️ for the Local AI Community by PacifAIst