December 18, 2025
GGUF Converter v2.0
Convert Hugging Face models to GGUF format with advanced quantization options
Transform large language models from Hugging Face into optimized GGUF format for efficient inference on consumer hardware. Now with remote mode, multiple output types, and the latest llama.cpp features.
Table of Contents
- Features
- What's New in v2.0
- Installation
- Basic Usage
- Advanced Options
- Quantization Methods
- Size & Quality Comparison
- Hardware Requirements
- Examples
- Troubleshooting
- Technical Details
Features
Core Capabilities
- Multiple Output Types: Support for f32, f16, bf16, q8_0, tq1_0, tq2_0, and auto-detection
- 30+ Quantization Methods: From 1-bit to 32-bit precision with K-quant and IQ variants
- Importance Matrix Quantization: Enhanced precision with IQ1, IQ2, IQ3, IQ4 methods
- Remote Mode (Experimental): Convert models without downloading full weights to disk
- Model Splitting: Split large models into manageable chunks with K/M/G size units
- Hardware Acceleration Detection: Automatically detects CUDA, Metal, OpenCL, Vulkan, ROCm
- Hugging Face Integration: Direct download from and upload to Hugging Face repositories
- README Generation: Automatically creates documentation for your quantized models
- Dry Run Mode: Preview split plans before writing any files
What's New in v2.0
New Features (December 2024)
| Feature | Description |
|---|---|
| Output Types | New --outtype option: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto |
| Remote Mode | --remote flag to read tensors without full download |
| Dry Run | --dry-run to preview split plans without writing files |
| Vocab Only | --vocab-only to extract just vocabulary (no weights) |
| No Lazy | --no-lazy to disable lazy evaluation (use more RAM) |
| Model Name | --model-name to override the model name in output |
| Small First Shard | --small-first-shard for metadata-only first split |
| New IQ Types | IQ1, IQ2, IQ3, IQ4 importance-based quantization |
| K-Quant Updates | Added q2_k_s, q4_k_l, q5_k_l variants |
| Ternary Quants | tq1_0, tq2_0 experimental ternary quantization |
Installation
The GGUF Converter is included with the WebScout package:
pip install -U webscout
Basic Usage
The simplest way to convert a model is with the default settings:
python -m webscout.Extra.gguf convert -m "organization/model-name"
This will:
- Download the model from Hugging Face
- Convert it to GGUF format with f16 base and q4_k_m quantization
- Save the converted model in your current directory
Advanced Options
Command Reference
The full command syntax is:
python -m webscout.Extra.gguf convert [OPTIONS]
| Option | Description | Default |
|---|---|---|
| -m, --model-id | The HuggingFace model ID (e.g., 'OEvortex/HelpingAI-Lite-1.5T') | Required |
| -u, --username | Your HuggingFace username for uploads | None |
| -t, --token | Your HuggingFace API token for uploads | None |
| -q, --quantization | Comma-separated quantization methods | "q4_k_m" |
| -o, --outtype | Output type: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto | "f16" |
| -i, --use-imatrix | Use importance matrix for quantization | False |
| --train-data | Training data file for imatrix quantization | None |
| -s, --split-model | Split the model into smaller chunks | False |
| --split-max-tensors | Maximum number of tensors per file when splitting | 256 |
| --split-max-size | Maximum file size when splitting (e.g., '256M', '5G') | None |
| --vocab-only | Only extract vocabulary (no model weights) | False |
| --remote | (Experimental) Read tensors remotely without full download | False |
| --dry-run | Only print split plan without writing files | False |
| --no-lazy | Disable lazy evaluation (use more RAM) | False |
| --model-name | Custom model name override | None |
| --small-first-shard | Do not add tensors to the first split | False |
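As a rough illustration, the options in the table can be assembled from a script before being handed to the CLI. The helper below is hypothetical (build_convert_command is not part of WebScout); it uses only the module path and flags documented above, and it returns the argument list instead of running it.

```python
import sys

# Output types listed in the command reference above.
VALID_OUTTYPES = {"f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"}


def build_convert_command(model_id, quantization="q4_k_m", outtype="f16",
                          split_max_size=None, dry_run=False, remote=False):
    """Return an argv list for the documented CLI (hypothetical helper)."""
    if outtype not in VALID_OUTTYPES:
        raise ValueError(f"Invalid output type: {outtype}")
    cmd = [sys.executable, "-m", "webscout.Extra.gguf", "convert",
           "-m", model_id, "-q", quantization, "-o", outtype]
    if split_max_size is not None:
        # Splitting is switched on with -s; the size cap is a separate option.
        cmd += ["-s", "--split-max-size", split_max_size]
    if dry_run:
        cmd.append("--dry-run")
    if remote:
        cmd.append("--remote")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the conversion.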
Output Types
Choose the base output precision before quantization:
# Use bfloat16 (good for models trained with bf16)
python -m webscout.Extra.gguf convert -m "organization/model-name" -o "bf16"
# Use auto-detection based on model's tensor types
python -m webscout.Extra.gguf convert -m "organization/model-name" -o "auto"
# Use 8-bit quantization directly
python -m webscout.Extra.gguf convert -m "organization/model-name" -o "q8_0"
Multiple Quantization Methods
Apply multiple quantization methods at once:
python -m webscout.Extra.gguf convert -m "organization/model-name" -q "q4_k_m,q5_k_m"
This will create two versions of the model with different quantization methods.
Remote Mode (Experimental)
Convert models without downloading the full weights to disk:
python -m webscout.Extra.gguf convert -m "organization/model-name" --remote
This downloads only config and tokenizer files, streaming tensor data directly from HuggingFace. Useful for systems with limited disk space.
Uploading to Hugging Face
Convert and upload the model to your Hugging Face account:
python -m webscout.Extra.gguf convert -m "organization/model-name" -u "your-username" -t "your-token"
This will create a new repository in your account named model-name-GGUF containing the converted model.
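The naming rule described above can be sketched in a few lines. This is an assumption based on the sentence above, not the converter's actual code, and target_repo_id is a hypothetical name.

```python
def target_repo_id(model_id, username):
    """Derive '<username>/<model-name>-GGUF' from a Hugging Face model ID."""
    # Keep only the part after the organization prefix, if there is one.
    model_name = model_id.split("/")[-1]
    return f"{username}/{model_name}-GGUF"
```

For example, target_repo_id("organization/model-name", "your-username") gives "your-username/model-name-GGUF".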
Importance Matrix Quantization
Use an importance matrix for more efficient quantization:
python -m webscout.Extra.gguf convert -m "organization/model-name" -i --train-data "train_data.txt" -q "iq4_nl"
An importance matrix helps focus more bits on the weights that matter most for the model's performance.
Model Splitting
Split large models for easier distribution:
# Split by number of tensors
python -m webscout.Extra.gguf convert -m "organization/model-name" -s --split-max-tensors 256
# Split by file size (supports K, M, G units)
python -m webscout.Extra.gguf convert -m "organization/model-name" -s --split-max-size "2G"
# Preview split plan without writing files
python -m webscout.Extra.gguf convert -m "organization/model-name" -s --split-max-size "2G" --dry-run
This is useful for very large models that may be difficult to distribute as a single file.
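To get a feel for how a size limit translates into a number of files, the sketch below parses a K/M/G size string and estimates the shard count. It assumes decimal units (1K = 1000 bytes), which may not match the converter's own parsing, and both function names are hypothetical. Real splits fall on tensor boundaries, so the actual count can be higher.

```python
import math

# Assumption: decimal multipliers; the converter may interpret units differently.
_UNITS = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}


def parse_split_size(text):
    """Convert a string such as '256M' or '5G' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes


def estimate_shards(total_bytes, max_size):
    """Lower-bound estimate of how many files a split would produce."""
    return max(1, math.ceil(total_bytes / parse_split_size(max_size)))
```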
Quantization Methods
Standard K-Quant Methods
| Method | Description |
|---|---|
| f32 | 32-bit floating point - full precision, largest size |
| f16 / fp16 | 16-bit floating point - maximum accuracy, large size |
| bf16 | bfloat16 - good balance for training and some models |
| auto | Auto-detect best 16-bit type based on model tensors |
| q8_0 | 8-bit quantization - near-original quality |
| q6_k | 6-bit K-quant - near-lossless quality |
| q5_k_l | 5-bit K-quant large - highest quality 5-bit |
| q5_k_m | 5-bit K-quant medium - best balance for quality/size |
| q5_k_s | 5-bit K-quant small - optimized for speed |
| q4_k_l | 4-bit K-quant large - highest quality 4-bit |
| q4_k_m | 4-bit K-quant medium - balanced for most models |
| q4_k_s | 4-bit K-quant small - optimized for speed |
| q3_k_l | 3-bit K-quant large - balanced for size/accuracy |
| q3_k_m | 3-bit K-quant medium - good balance for most use cases |
| q3_k_s | 3-bit K-quant small - optimized for speed |
| q2_k | 2-bit K-quant - smallest size, lowest accuracy |
| q2_k_s | 2-bit K-quant small - maximum compression |
Legacy Quantization Methods
| Method | Description |
|---|---|
| q4_0 | 4-bit quantization (legacy) - auto-repacks for ARM |
| q4_1 | 4-bit quantization (legacy) - improved accuracy |
| q5_0 | 5-bit quantization (legacy) - high accuracy |
| q5_1 | 5-bit quantization (legacy) - improved accuracy |
Ternary Quantization (Experimental)
| Method | Description |
|---|---|
| tq1_0 | 1-bit ternary quantization - extreme compression |
| tq2_0 | 2-bit ternary quantization - very small size |
Importance Matrix (IQ) Methods
These methods require --use-imatrix and optionally --train-data:
| Method | Description |
|---|---|
| iq1_s | 1-bit IQ small - extreme compression |
| iq1_m | 1-bit IQ medium - extreme compression |
| iq2_xxs | 2-bit IQ extra extra small - maximum compression |
| iq2_xs | 2-bit IQ extra small - very high compression |
| iq2_s | 2-bit IQ small - high compression |
| iq2_m | 2-bit IQ medium - balanced compression |
| iq3_xxs | 3-bit IQ extra extra small - maximum compression |
| iq3_xs | 3-bit IQ extra small - high compression |
| iq3_s | 3-bit IQ small - balanced compression |
| iq3_m | 3-bit IQ medium - balanced importance-based |
| iq4_nl | 4-bit IQ non-linear - best accuracy for 4-bit |
| iq4_xs | 4-bit IQ extra small - maximum 4-bit compression |
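Because the IQ methods need --use-imatrix, it can help to check a quantization list before starting a long run. The sketch below is a hypothetical pre-flight check, not part of WebScout; it assumes every method whose name starts with "iq" belongs to the IQ family, as in the table above.

```python
def parse_quant_methods(spec):
    """Split a comma-separated -q value into normalized method names."""
    return [m.strip().lower() for m in spec.split(",") if m.strip()]


def methods_needing_imatrix(methods):
    """Return the requested methods that belong to the IQ family."""
    return [m for m in methods if m.startswith("iq")]
```

If methods_needing_imatrix(parse_quant_methods("iq4_nl,q4_k_m")) is non-empty, add -i (and optionally --train-data) to the command.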
Size & Quality Comparison
TIP: When choosing a quantization method, consider the tradeoff between model size and quality. The percentages below are approximate: f16/bf16 is measured against f32, and the quantized types are measured against f16. Here's a quick guide:
1. Maximum Quality (largest size)
- f32: 100% (baseline), best quality
- f16/bf16: 50% of f32 size, excellent quality
- q8_0: 50% of f16 size, nearly identical to f16
2. Balanced Quality/Size
- q6_k: 38% of f16 size, near-lossless
- q5_k_m with imatrix: 31% of f16 size, excellent quality
- q4_k_m with imatrix: 25% of f16 size, good quality for most use cases
3. Minimum Size (reduced quality)
- q3_k_s: 18% of f16 size, acceptable for some tasks
- q2_k: 12% of f16 size, significantly reduced quality
- iq2_xxs with imatrix: 10% of f16 size, requires careful calibration
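The percentages for the quantized types closely follow the nominal bit width divided by 16, that is, the size relative to an f16 file. The sketch below reproduces that arithmetic; percent_of_f16 is a hypothetical helper, and real files are somewhat larger because quantized formats also store per-block scale data.

```python
def percent_of_f16(nominal_bits):
    """Nominal size of a quantized file as a percentage of the f16 file."""
    return 100.0 * nominal_bits / 16.0


# Nominal bit widths for a few of the methods listed above.
for name, bits in [("q8_0", 8), ("q6_k", 6), ("q4_k_m", 4), ("q2_k", 2)]:
    print(f"{name}: {percent_of_f16(bits):.1f}% of f16")
```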
Hardware Requirements
Hardware requirements vary based on quantization method and model size:
Memory Requirements
| Quantization | RAM Required |
|---|---|
| f32 | ~4x model size |
| f16/bf16 | ~2x model size |
| q8_0 | ~1x model size |
| q4_k_m | ~0.5x model size |
| q2_k | ~0.25x model size |
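Here "model size" means the parameter count in billions, so the multiplier gives approximate gigabytes. For example, a 7B parameter model requires: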
- f16: ~14GB RAM
- q4_k_m: ~3.5GB RAM
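The same rule of thumb can be written as a small lookup. The multipliers are taken from the table above and read as gigabytes per billion parameters; approx_ram_gb is a hypothetical helper, and the result is only a rough estimate.

```python
# Approximate GB of RAM per billion parameters, from the table above.
GB_PER_BILLION_PARAMS = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 1.0,
    "q4_k_m": 0.5,
    "q2_k": 0.25,
}


def approx_ram_gb(params_billions, method):
    """Rough RAM estimate for a model of the given size and quantization."""
    return params_billions * GB_PER_BILLION_PARAMS[method]
```

For a 7B model this gives 14.0 for f16 and 3.5 for q4_k_m, matching the figures above.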
Hardware Acceleration
The converter automatically detects and utilizes:
- CUDA for NVIDIA GPUs
- Metal for Apple Silicon and AMD GPUs on macOS
- OpenCL for cross-platform GPU acceleration
- Vulkan for cross-platform GPU acceleration
- ROCm for AMD GPUs on Linux
If no acceleration is available, the converter will use CPU-only mode.
NOTE: GPU acceleration is highly recommended for converting larger models (13B+).
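One simple way to approximate this kind of detection is to look for each vendor's command-line tool on the PATH. The sketch below is an assumption about how such a check could work, not the converter's actual logic; detect_backends is a hypothetical name, and finding a tool does not guarantee that a working driver is installed.

```python
import platform
import shutil

# Assumption: these vendor tools serve as a rough proxy for each backend.
_PROBES = {
    "cuda": "nvidia-smi",
    "rocm": "rocminfo",
    "vulkan": "vulkaninfo",
    "opencl": "clinfo",
}


def detect_backends(which=shutil.which, system=platform.system):
    """Return the backends that appear to be available, or ['cpu'] if none."""
    found = [name for name, tool in _PROBES.items() if which(tool)]
    if system() == "Darwin":
        # Metal ships with macOS, so the platform check stands in for a probe.
        found.append("metal")
    return found or ["cpu"]
```

The which and system parameters exist only so the function can be exercised without the real tools installed.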
Examples
Basic Conversion with Upload
python -m webscout.Extra.gguf convert \
-m "mistralai/Mistral-7B-Instruct-v0.2" \
-q "q4_k_m" \
-o "f16" \
-u "your-username" \
-t "your-token"
This will convert Mistral-7B to q4_k_m quantization and upload it to your Hugging Face account.
bfloat16 Base with Multiple Quantizations
python -m webscout.Extra.gguf convert \
-m "meta-llama/Meta-Llama-3-8B" \
-o "bf16" \
-q "q4_k_m,q5_k_m,q6_k"
This will create three versions of the model starting from bf16 base.
IQ Quantization with Importance Matrix
python -m webscout.Extra.gguf convert \
-m "mistralai/Mistral-7B-Instruct-v0.2" \
-q "iq4_nl,iq3_m" \
-i \
--train-data "my_training_data.txt"
This will create two IQ-quantized versions using importance matrix for better quality.
Remote Mode (Experimental)
python -m webscout.Extra.gguf convert \
-m "meta-llama/Meta-Llama-3-8B" \
--remote \
-q "q4_k_m"
Convert without downloading full model weights (streams from HuggingFace).
Split Large Model with Dry Run
# Preview the split plan first
python -m webscout.Extra.gguf convert \
-m "meta-llama/Llama-2-70b-chat-hf" \
-q "q4_k_m" \
-s \
--split-max-size "4G" \
--dry-run
# Then actually perform the split
python -m webscout.Extra.gguf convert \
-m "meta-llama/Llama-2-70b-chat-hf" \
-q "q4_k_m" \
-s \
--split-max-size "4G"
This will split the large 70B model into multiple files, each no larger than 4GB.
Troubleshooting
Missing Dependencies
Error: Missing required dependencies: git, cmake
Solution: Install the required system dependencies:
- Ubuntu/Debian: sudo apt install git cmake python3-dev build-essential
- macOS: brew install git cmake
- Windows: Install Git and CMake from their respective websites
For hardware acceleration, install relevant drivers (CUDA, ROCm, etc.)
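To confirm the tools are visible before running the converter, a quick PATH check is enough. The sketch below uses only the standard library; missing_dependencies is a hypothetical helper and its output format merely imitates the error message above.

```python
import shutil


def missing_dependencies(required=("git", "cmake"), which=shutil.which):
    """Return the required tools that cannot be found on the PATH."""
    return [tool for tool in required if which(tool) is None]


missing = missing_dependencies()
if missing:
    print("Missing required dependencies: " + ", ".join(missing))
```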
Out of Memory
Error: CUDA out of memory
Solutions:
- Try a lower precision quantization method: q3_k_s or q2_k
- Enable model splitting with -s
- Use --no-lazy to disable lazy evaluation (uses more RAM but may help)
- Try remote mode with --remote to reduce disk I/O
- Increase your system's swap space/virtual memory
Invalid Output Type
Error: Invalid output type: xxx
Solution: Use a valid output type:
- f32 - Full 32-bit precision
- f16 - Half 16-bit precision
- bf16 - bfloat16 precision
- q8_0 - 8-bit quantization
- tq1_0 - 1-bit ternary
- tq2_0 - 2-bit ternary
- auto - Auto-detect
Download Failures
Error: Failed to download model
Solutions:
- Check your internet connection
- Verify you have access to the model on Hugging Face
- Try using a Hugging Face token with -t
- Check if the model repository exists and is public
- Try --remote mode for gated models
Build Failures
Error: Failed to build llama.cpp
Solutions:
- Check if you have a C++ compiler installed
- Ensure you have sufficient disk space
- Try building with CPU-only mode if GPU builds fail
- Update your GPU drivers if using acceleration
Technical Details
The converter works by following these steps:
1. Setup: Clone and build llama.cpp with appropriate hardware acceleration
2. Download: Fetch the model from Hugging Face (or use remote streaming)
3. Convert: Transform the model to base GGUF format (f16/bf16/f32/q8_0/auto)
4. Quantize: Apply the requested quantization methods using llama-quantize
5. Split: Optionally split the model into smaller chunks
6. Upload: If credentials are provided, upload to Hugging Face
Advanced Configuration
For special cases, you may want to modify llama.cpp's build parameters. The converter automatically detects and enables available hardware acceleration, but you can also build llama.cpp manually with custom options before running the converter.
The converter uses the latest llama.cpp features including:
- GGML_CUDA for NVIDIA GPU acceleration
- GGML_METAL for Apple Silicon
- GGML_VULKAN for cross-platform GPU support
- GGML_OPENCL for OpenCL acceleration
- GGML_HIPBLAS for AMD ROCm support
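For a manual build, these options are passed to CMake as -D flags. The sketch below maps backend names to the options listed above and builds a configure command; it is an illustration only, cmake_configure_args is a hypothetical helper, and the option names may differ between llama.cpp versions.

```python
# Backend name -> CMake option, as listed above.
_CMAKE_FLAGS = {
    "cuda": "GGML_CUDA",
    "metal": "GGML_METAL",
    "vulkan": "GGML_VULKAN",
    "opencl": "GGML_OPENCL",
    "rocm": "GGML_HIPBLAS",
}


def cmake_configure_args(backends, build_dir="build"):
    """Build a 'cmake -B <dir>' command enabling the given backends."""
    args = ["cmake", "-B", build_dir]
    # Unknown names (for example 'cpu') add no flag, giving a CPU-only build.
    args += [f"-D{_CMAKE_FLAGS[b]}=ON" for b in backends if b in _CMAKE_FLAGS]
    return args
```

The resulting list is meant to be run from the llama.cpp source directory, for example with subprocess.run(args, check=True).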
Part of the WebScout Project | Powered by llama.cpp
Made with ❤️ by the Webscout team
Developer Notes (auto-generated)
This section was appended by an automated documentation task for webscout/Extra/gguf.py and summarizes the module's structure and usage:
- Provides ModelConverter for HF->GGUF conversion, quantization, optional imatrix and splitting.
- Includes robust llama.cpp build orchestration, cross-platform hardware detection, and Hugging Face upload helpers.
- CLI entrypoint: gguf convert, exposing options such as --quantization, --outtype, --use-imatrix, --split-model, and upload credentials.
For more details, see the rest of this document and inspect webscout/Extra/gguf.py source directly.