LLM-Export
September 3, 2025
An efficient Large Language Model export tool that converts LLM models to ONNX and MNN formats, supporting quantization optimization and multimodal models.
✨ Key Features
- 🚀 Dynamic Shape Support: Optimizes the original model code to support dynamic input shapes
- 🚀 Model Optimization: Reduces the constant parts of the model to improve inference performance
- 🚀 Automatic Optimization: Integrated OnnxSlim for ONNX model optimization, ~5% performance improvement (Thanks @inisis)
- 🚀 LoRA Support: Support for LoRA weight merging/splitting export
- 🚀 Quantization Methods: Support for AWQ, GPTQ, HQQ, and other quantization methods
- 🚀 Multimodal Support: Support for text, image, audio, and other multimodal models
- 🚀 Inference Frameworks: Provides MNN and ONNX inference code
📖 Quick Start
Installation
# Install from PyPI (Recommended)
pip install llmexport
# Install latest version from GitHub
pip install git+https://github.com/wangzhaode/llm-export@master
# Local development installation
git clone https://github.com/wangzhaode/llm-export
cd llm-export
pip install -e .
Basic Usage
1. Download Model
# Using Hugging Face CLI
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir Qwen2.5-1.5B-Instruct
# Or using ModelScope (Recommended for users in China)
modelscope download Qwen/Qwen2.5-1.5B-Instruct --local_dir Qwen2.5-1.5B-Instruct
2. Model Testing
# Text conversation testing
llmexport --path Qwen2.5-1.5B-Instruct --test "Hello, please introduce yourself"
# Multimodal testing (Image + Text)
llmexport --path Qwen2-VL-2B-Instruct --test "<img>image_url</img>Describe this image"
3. Model Export
# Export to ONNX format
llmexport --path Qwen2.5-1.5B-Instruct --export onnx
# Export to MNN format (Default 4bit quantization)
llmexport --path Qwen2.5-1.5B-Instruct --export mnn
# Custom quantization parameters
llmexport --path Qwen2.5-1.5B-Instruct --export mnn --quant_bit 8 --quant_block 128
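When exports are scripted, the commands above can be assembled programmatically. The following is a minimal sketch, assuming the `llmexport` executable is on `PATH`. The helper name `build_export_command` is hypothetical and not part of the package; it only uses flags documented in this README and does not run the tool.

```python
import shlex


def build_export_command(model_path, export="mnn", quant_bit=None,
                         quant_block=None, dst_path=None):
    """Assemble an llmexport command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the argument
    list and never invokes the tool.
    """
    if export not in ("onnx", "mnn"):
        raise ValueError("export must be 'onnx' or 'mnn'")
    cmd = ["llmexport", "--path", model_path, "--export", export]
    # Optional flags are appended only when the caller sets them, so the
    # tool's own defaults apply otherwise.
    if quant_bit is not None:
        cmd += ["--quant_bit", str(quant_bit)]
    if quant_block is not None:
        cmd += ["--quant_block", str(quant_block)]
    if dst_path is not None:
        cmd += ["--dst_path", dst_path]
    return cmd


cmd = build_export_command("Qwen2.5-1.5B-Instruct", quant_bit=8, quant_block=128)
print(shlex.join(cmd))
# prints: llmexport --path Qwen2.5-1.5B-Instruct --export mnn --quant_bit 8 --quant_block 128
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to actually run the export.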
🔧 Advanced Features
Model Export Options
- ONNX Export: Use `--export onnx` to export to ONNX format
- MNN Export: Use `--export mnn` to export to MNN format
- Model Optimization: OnnxSlim optimization is enabled by default; use `--onnx_slim` to enable it explicitly
Quantization Configuration
- Quantization Bits: `--quant_bit 4/8` (default 4-bit)
- Quantization Block Size: `--quant_block 64/128` (default 64)
- LM Head Quantization: `--lm_quant_bit` sets the output layer quantization separately
- Symmetric Quantization: `--sym` enables symmetric quantization (no zero point)
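To make the meaning of bit width, block size, and symmetric mode concrete, here is a conceptual sketch of block-wise weight quantization in plain Python. It is not the code MNN or llmexport actually runs; the function names, the rounding scheme, and the use of the block minimum as the zero point are assumptions chosen for clarity.

```python
def quantize_block(values, bits=4, symmetric=False):
    """Quantize one block of floats; return (codes, scale, zero_point)."""
    if symmetric:
        # Symmetric: codes are centred on zero, so no zero point is stored.
        qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
        return codes, scale, 0.0
    # Asymmetric: map [min, max] of the block onto [0, 2^bits - 1].
    qmax = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [max(0, min(qmax, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, zero_point):
    """Recover approximate floats from integer codes."""
    return [c * scale + zero_point for c in codes]


def quantize_row(row, bits=4, block=64, symmetric=False):
    """Quantize one weight row (assumed to be one output channel) block by block."""
    block = block or len(row)  # block size 0: one block per row (channel-wise)
    return [quantize_block(row[i:i + block], bits, symmetric)
            for i in range(0, len(row), block)]


# Example: a synthetic 128-value row split into two blocks of 64.
row = [i / 10 - 3.2 for i in range(128)]
for sym in (False, True):
    blocks = quantize_row(row, bits=4, block=64, symmetric=sym)
    restored = [v for b in blocks for v in dequantize_block(*b)]
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print("symmetric" if sym else "asymmetric", "max error:", max_err)
```

Smaller blocks give each group of weights its own scale, which lowers the rounding error at the cost of storing more scales (and zero points in asymmetric mode).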
Quantization Algorithm Support
- AWQ Quantization: `--awq` enables AWQ quantization
- HQQ Quantization: `--hqq` enables HQQ quantization
- GPTQ Quantization: `--gptq_path` loads a GPTQ quantized model
- Smooth Quantization: `--smooth` enables Smooth quantization
LoRA Support
- LoRA Merging: `--lora_path` specifies the LoRA weight path
- LoRA Splitting: `--lora_split` exports LoRA weights separately
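The difference between merging and splitting can be illustrated with the standard LoRA formulation, where the adapted weight is W + (alpha / r) · B · A. The sketch below uses plain Python lists and hypothetical function names; whether llmexport applies exactly this scaling internally is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Merged export: fold the adapter into the base weight, W + (alpha / r) * B @ A."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)  # B has shape (out_features, r)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def apply_split(weight, lora_a, lora_b, alpha, x):
    """Split export: keep W, A, B separate and add the low-rank path at run time."""
    scaling = alpha / len(lora_a)
    col = [[v] for v in x]  # column vector
    base = matmul(weight, col)
    low_rank = matmul(lora_b, matmul(lora_a, col))
    return [b[0] + scaling * l[0] for b, l in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank 1, in_features 2
B = [[1.0], [3.0]]    # out_features 2, rank 1
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha=2.0)
merged_out = [row[0] for row in matmul(merged, [[v] for v in x])]
split_out = apply_split(W, A, B, alpha=2.0, x=x)
print(merged_out, split_out)  # both paths produce the same output
```

Merging gives a single set of weights with no run-time overhead, while splitting keeps the base model unchanged so adapters can be exported and swapped independently.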
Multimodal Support
- Visual Quantization: `--visual_quant_bit`, `--visual_quant_block` set visual module quantization
- Visual Symmetric: `--visual_sym` enables symmetric quantization for the visual module
Other Options
- Verbose Output: `--verbose` shows detailed logs
- Performance Evaluation: `--ppl` gets logits for all tokens
- Custom Output: `--dst_path` specifies the output directory (default `./model`)
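Logits for every token are what a perplexity evaluation needs. As a rough sketch of how such logits turn into a perplexity score, the code below assumes `all_logits[t]` holds the model's scores for the token at position `t + 1`. The function names and data layout are hypothetical and do not describe the format of the exported model's outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def perplexity(all_logits, token_ids):
    """Perplexity of token_ids, where all_logits[t] predicts token_ids[t + 1]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to score a prediction")
    nll = 0.0
    count = 0
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(all_logits[t])
        nll -= log_probs[token_ids[t + 1]]  # negative log-likelihood of the true next token
        count += 1
    return math.exp(nll / count)


# Uniform logits over a 4-token vocabulary: the model is maximally unsure,
# so the perplexity equals the vocabulary size.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(perplexity(uniform, [0, 1, 2]))
```

Lower perplexity on the same text after quantization indicates that less accuracy was lost.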
📎 Command Line Parameters
Basic Parameters
| Parameter | Type | Description |
|---|---|---|
| `--path` | Required | Model path, supports local directory or Hugging Face model ID |
| `--export` | Optional | Export format: `onnx` or `mnn` |
| `--test` | Optional | Test query string |
| `--dst_path` | Optional | Output directory (default `./model`) |
| `--verbose` | Flag | Show detailed logs |
Quantization Parameters
| Parameter | Default | Description |
|---|---|---|
| `--quant_bit` | 4 | Quantization bits (4 or 8) |
| `--quant_block` | 64 | Quantization block size (0 means channel-wise) |
| `--lm_quant_bit` | Same as `--quant_bit` | LM Head layer quantization bits |
| `--visual_quant_bit` | Model dependent | Visual module quantization bits |
| `--visual_quant_block` | Model dependent | Visual module quantization block size |
Quantization Algorithms
| Parameter | Description |
|---|---|
| `--awq` | Enable AWQ quantization |
| `--hqq` | Enable HQQ quantization |
| `--smooth` | Enable Smooth quantization |
| `--sym` | Enable symmetric quantization (no zero point) |
| `--visual_sym` | Visual module symmetric quantization |
LoRA Support
| Parameter | Description |
|---|---|
| `--lora_path` | LoRA weight path |
| `--lora_split` | Export LoRA weights separately |
Other Options
| Parameter | Description |
|---|---|
| `--tokenizer_path` | Tokenizer path (defaults to the `--path` value) |
| `--gptq_path` | GPTQ quantized model path |
| `--mnnconvert` | Local MNNConvert path |
| `--onnx_slim` | Enable ONNX-Slim optimization |
| `--ppl` | Get logits for all tokens |
| `--seperate_embed` | Separate embedding layer to avoid quantization |
| `--calib_data` | Calibration data path |
Command Args
usage: llmexport.py [-h] --path PATH [--type TYPE] [--tokenizer_path TOKENIZER_PATH] [--lora_path LORA_PATH] [--gptq_path GPTQ_PATH] [--dst_path DST_PATH]
[--verbose] [--test TEST] [--export EXPORT] [--onnx_slim] [--quant_bit QUANT_BIT] [--quant_block QUANT_BLOCK] [--lm_quant_bit LM_QUANT_BIT]
[--mnnconvert MNNCONVERT] [--ppl] [--awq] [--sym] [--tie_embed] [--lora_split]
llm_exporter
options:
-h, --help show this help message and exit
--path PATH path(`str` or `os.PathLike`):
Can be either:
- A string, the *model id* of a pretrained model like `THUDM/chatglm-6b`. [TODO]
- A path to a *directory* clone from repo like `../chatglm-6b`.
--type TYPE type(`str`, *optional*):
The pretrain llm model type.
--tokenizer_path TOKENIZER_PATH
tokenizer path, defaut is `None` mean using `--path` value.
--lora_path LORA_PATH
lora path, defaut is `None` mean not apply lora.
--gptq_path GPTQ_PATH
gptq path, defaut is `None` mean not apply gptq.
--dst_path DST_PATH export onnx/mnn model to path, defaut is `./model`.
--verbose Whether or not to print verbose.
--test TEST test model inference with query `TEST`.
--export EXPORT export model to an onnx/mnn model.
--onnx_slim Whether or not to use onnx-slim.
--quant_bit QUANT_BIT
mnn quant bit, 4 or 8, default is 4.
--quant_block QUANT_BLOCK
mnn quant block, default is 0 mean channle-wise.
--lm_quant_bit LM_QUANT_BIT
mnn lm_head quant bit, 4 or 8, default is `quant_bit`.
--mnnconvert MNNCONVERT
local mnnconvert path, if invalid, using pymnn.
--ppl Whether or not to get all logits of input tokens.
--awq Whether or not to use awq quant.
--sym Whether or not to using symmetric quant (without zeropoint), defualt is False.
--tie_embed Whether or not to using tie_embedding, defualt is False.
--lora_split Whether or not export lora split, defualt is False.
📋 Supported Models
Currently supports the following model types:
Text Models
- Qwen Series: Qwen2.5, Qwen2, Qwen1.5, Qwen-VL, etc.
- LLaMA Series: Llama-3.2, Llama-3, Llama-2, etc.
- ChatGLM Series: ChatGLM4, ChatGLM3, ChatGLM2, etc.
- Baichuan Series: Baichuan2-7B-Chat, etc.
- Yi Series: Yi-6B-Chat, etc.
- Others: InternLM, DeepSeek, Phi, Gemma, TinyLlama, etc.
Multimodal Models
- Vision Models: Qwen2-VL, Qwen2.5-VL, Llama-3.2-Vision, InternVL, etc.
- Audio Models: Qwen2-Audio, Qwen2.5-Omni, etc.
Embedding Models
- Text Embedding: bge-large-zh, gte-multilingual, etc.
💾 Model Downloads
We provide optimized model downloads:
- Hugging Face: taobao-mnn
- ModelScope: MNN
Popular models:
| Model | Hugging Face | ModelScope |
|---|---|---|
| DeepSeek-R1-1.5B-Qwen | Q4_1 | Q4_1 |
| Qwen2.5-0.5B-Instruct | Q4_1 | Q4_1 |
| Qwen2.5-1.5B-Instruct | Q4_1 | Q4_1 |
| GPT-OSS-20B | Q4_1 | Q4_1 |
| Qwen3-4B-Instruct-2507 | Q4_1 | Q4_1 |
See the complete list for more models.
🔗 Related Projects
- MNN Inference: mnn-llm - LLM inference library for MNN framework
- ONNX Inference: onnx-llm, OnnxLLM - ONNX format inference libraries
- Model Optimization: OnnxSlim - ONNX model optimization tool
📄 License
This project is licensed under the MIT License.