LLM Model Converter and Quantizer

March 4, 2025 · View on GitHub

Large Language Models (LLMs) are typically distributed in formats optimized for training (like PyTorch) and can be extremely large (hundreds of gigabytes), making them impractical for most real-world applications. This tool addresses two critical challenges in LLM deployment:

  1. Size: Original models are too large to run on consumer hardware
  2. Format: Training formats are not optimized for inference

Try It Yourself

Explore and experiment with the LLM Quantization tool on Hugging Face Spaces: LLM Quantization Demo


Why This Tool?

I built this tool to help AI researchers and practitioners with the following:

  • Converting models from Hugging Face to GGUF format (optimized for inference)
  • Quantizing models to reduce their size while maintaining acceptable performance
  • Making deployment possible on consumer hardware (laptops, desktops) with limited resources

The Problem

  • LLMs in their original format require significant computational resources
  • Running these models typically needs:
    • High-end GPUs
    • Large amounts of RAM (32GB+)
    • Substantial storage space
    • Complex software dependencies

The Solution

This tool provides:

  1. Format Conversion

    • Converts from PyTorch/Hugging Face format to GGUF
    • GGUF is specifically designed for efficient inference
    • Enables memory mapping for faster loading
    • Reduces dependency requirements
  2. Quantization

    • Reduces model size by roughly 4–8×, depending on the quantization level
    • Converts from FP16/FP32 to more efficient formats (INT8/INT4)
    • Maintains reasonable model performance
    • Makes models runnable on consumer-grade hardware
  3. Accessibility

    • Enables running LLMs on standard laptops
    • Reduces RAM requirements
    • Speeds up model loading and inference
    • Simplifies deployment process
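To make the quantization step concrete, here is a deliberately naive sketch of symmetric INT8 quantization in pure Python. Real quantizers such as llama.cpp's K-quants work block-wise with per-block scale factors, but the core idea is the same: map floats onto a small integer range and store a scale. All names here are illustrative, not part of this tool's code.

```python
# Illustrative only: naive symmetric INT8 quantization of a weight vector.

def quantize_int8(weights):
    """Map floats to INT8 values in [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

w = [0.12, -0.98, 0.45, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Each recovered weight is within one quantization step of the original.
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```

Storing one signed byte per weight instead of two or four float bytes is where the 2–4× saving over FP16/FP32 comes from; 4-bit schemes halve it again.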

🎯 Purpose

This tool helps developers and researchers to:

  • Download LLMs from Hugging Face Hub
  • Convert models to GGUF (GPT-Generated Unified Format)
  • Quantize models for efficient deployment
  • Upload processed models back to Hugging Face
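The download–convert–quantize steps above roughly correspond to the following shell commands, sketched here as a Python helper that only builds the command lists. The llama.cpp script and binary names (convert_hf_to_gguf.py, llama-quantize) match recent llama.cpp releases but may differ in your checkout, so verify them before running anything.

```python
# Sketch of the conversion/quantization pipeline as shell commands.
# Assumes llama.cpp is checked out locally next to this script; the
# exact script/binary names are an assumption to verify.

def build_pipeline(repo_id: str, workdir: str, quant_type: str = "Q4_K_M"):
    model_dir = f"{workdir}/{repo_id.split('/')[-1]}"
    f16_gguf = f"{model_dir}/model-f16.gguf"
    quant_gguf = f"{model_dir}/model-{quant_type.lower()}.gguf"
    return [
        # 1. Download the original model from Hugging Face
        ["huggingface-cli", "download", repo_id, "--local-dir", model_dir],
        # 2. Convert PyTorch/safetensors weights to a full-precision GGUF
        ["python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
         "--outfile", f16_gguf, "--outtype", "f16"],
        # 3. Quantize the F16 GGUF down to the requested type
        ["llama.cpp/llama-quantize", f16_gguf, quant_gguf, quant_type],
    ]

for cmd in build_pipeline("mistralai/Mistral-7B-v0.1", "/tmp/models"):
    print(" ".join(cmd))
```

The Streamlit app automates these same stages behind a UI, so you never have to run them by hand.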

🚀 Features

  • Model Download: Direct integration with Hugging Face Hub
  • GGUF Conversion: Convert PyTorch models to GGUF format
  • Quantization Options: Support for various quantization levels
  • Batch Processing: Automate the entire conversion pipeline
  • HF Upload: Option to upload processed models back to Hugging Face

Quantization Types Overview

| Quantizer Name | Description | Trade-off | When to Use |
|---|---|---|---|
| Q2_K | 2-bit K-quant | Smallest files, largest quality loss | Severely memory-constrained environments |
| Q3_K_S | 3-bit K-quant, small variant | Very small files with noticeable quality loss | When size matters more than output quality |
| Q3_K_M | 3-bit K-quant, medium variant | Small files with moderate quality | A middle ground at the 3-bit level |
| Q3_K_L | 3-bit K-quant, large variant | Best 3-bit quality, slightly larger files | When you want 3-bit size with the least quality loss |
| Q4_0 | Legacy 4-bit (per-block scale only) | Good size reduction, basic accuracy | Legacy format, largely superseded by the K-quants |
| Q4_1 | Legacy 4-bit (scale plus offset) | Slightly larger and more accurate than Q4_0 | Legacy format, largely superseded by the K-quants |
| Q4_K_S | 4-bit K-quant, small variant | Smaller files than Q4_K_M at a modest quality cost | When 4-bit size is the priority |
| Q4_K_M | 4-bit K-quant, medium variant | Strong quality/size balance, a common default | General-purpose deployment on consumer hardware |
| Q5_0 | Legacy 5-bit (per-block scale only) | Better precision than 4-bit, larger files | Legacy format, largely superseded by the K-quants |
| Q5_1 | Legacy 5-bit (scale plus offset) | Slightly more accurate than Q5_0 | Legacy format, largely superseded by the K-quants |
| Q5_K_S | 5-bit K-quant, small variant | High quality at moderate size | Quality-focused use with some memory headroom |
| Q5_K_M | 5-bit K-quant, medium variant | Quality close to the original model | When quality is crucial and space allows |
| Q6_K | 6-bit K-quant | Very close to original quality, larger files | Precision-critical applications with storage to spare |
| Q8_0 | 8-bit quantization | Largest quantized size, near-lossless quality | When quality matters far more than size |
| BF16 | 16-bit brain floating point (not quantized) | FP32 dynamic range at half the size | High-fidelity inference on hardware with BF16 support |
| F16 | 16-bit floating point (not quantized) | High precision at half the FP32 size | When unquantized quality is essential |
| F32 | 32-bit floating point (original precision) | Full precision, largest files | When maximum precision is required |
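The sizes these types imply can be sanity-checked with simple arithmetic: a model's on-disk size is roughly its parameter count times the bits per weight. The effective bits-per-weight values below are approximations (K-quants store per-block scales, so their effective width is fractional), and real GGUF files carry extra metadata, so treat the results as lower bounds.

```python
# Back-of-envelope estimate of on-disk model size at a given bit width.
# The fractional bit counts for K-quants are rough assumptions.

def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size in GB = parameters * bits / 8 bits-per-byte / 1e9 bytes-per-GB."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at various precisions:
for label, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5), ("Q2_K", 2.5)]:
    print(f"{label}: ~{estimated_size_gb(7e9, bits):.1f} GB")
```

This is where the "4–8× smaller" figure comes from: a 7B model drops from ~14 GB at F16 to roughly 2–4 GB at the 2–4-bit levels.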

💡 Why GGUF?

GGUF (GPT-Generated Unified Format) is a file format designed specifically for efficient deployment and inference of large language models. It offers several key benefits:

Optimized for Inference:

  • GGUF is specifically designed for model inference (running predictions) rather than training.
  • It's the native format used by llama.cpp, a popular framework for running LLMs on consumer hardware.

Memory Efficiency:

  • Reduces memory usage compared to the original PyTorch/Hugging Face formats.
  • Allows running larger models on devices with limited RAM.
  • Supports various quantization levels (reducing model precision from FP16/FP32 to INT8/INT4).

Faster Loading:

  • Models in GGUF format can be memory-mapped (mmap), meaning they can be loaded partially as needed.
  • Reduces initial loading time and memory overhead.
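Memory mapping is easy to demonstrate with standard-library Python: mapping a file exposes its bytes without reading the whole file into RAM up front, which is how GGUF loaders pull in tensor data on demand. The file below is a zero-filled stand-in, not a real GGUF.

```python
# Minimal demonstration of memory mapping: the OS pages data in on
# demand, so mapping a large file does not read it all into memory.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # stand-in for a 1 MB weight file

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages backing this slice are actually loaded from disk.
    chunk = mm[4096:8192]
    assert len(chunk) == 4096
    mm.close()
```

Because mapped pages are backed by the file itself, the OS can also evict and re-read them freely, so several processes can share one model's weights without duplicating them in RAM.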

Cross-Platform Compatibility:

  • Works well across different operating systems and hardware.
  • Doesn't require Python or PyTorch installation.
  • Can run on CPU-only systems effectively.

Embedded Metadata:

  • Contains model configuration, tokenizer, and other necessary information in a single file.
  • Makes deployment simpler as all required information is bundled together.
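One practical consequence of the single-file layout: every GGUF file starts with a fixed header (the 4-byte magic "GGUF" followed by a little-endian uint32 version), so validating a download takes only a few lines. The sketch below fabricates a header for the demo rather than parsing a real model file.

```python
# Quick sanity check that a file is GGUF: read the magic and version.
import os
import struct
import tempfile

def check_gguf_header(path: str) -> int:
    """Return the GGUF version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("<4sI", f.read(8))
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return version

# Demo with a fabricated 8-byte header (magic + version 3):
demo_path = os.path.join(tempfile.mkdtemp(), "fake.gguf")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<4sI", b"GGUF", 3))
print(check_gguf_header(demo_path))  # → 3
```

A check like this is a cheap first step before shipping a converted model or uploading it back to Hugging Face.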

🛠️ Installation

# Clone the repository
git clone https://github.com/bhaskatripathi/LLM_Quantization.git

# Install dependencies
pip install -r requirements.txt

📖 Usage

# Run the Streamlit application
streamlit run app.py


🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License.

⚠️ Requirements

  • Python 3.8+
  • Streamlit
  • Hugging Face Hub account (for model download/upload)
  • Sufficient storage space for model processing

📚 Supported Models

The tool currently supports various model architectures including:

  • DeepSeek models
  • Mistral models
  • Llama models
  • Qwen models
  • And more...

🤔 Need Help?

If you encounter any issues or have questions:

  1. Check the existing issues
  2. Create a new issue with a detailed description
  3. Include relevant error messages and environment details

🙏 Acknowledgments

  • Hugging Face for the model hub
  • llama.cpp for GGUF format implementation
  • All contributors and maintainers

Made with ❤️ for the AI community