ONNX Runtime GenAI Model Builder
May 22, 2026 ยท View on GitHub
This folder contains the model builder for quickly creating optimized and quantized ONNX models within a few minutes that run with ONNX Runtime GenAI.
Contents
- Current Support
- Usage
- Full Usage
- Original PyTorch Model from Hugging Face
- Original PyTorch Model from Disk
- Customized or Finetuned PyTorch Model
- Quantized PyTorch Model
- GGUF Model
- Extra Options
- Config Only
- Hugging Face Authentication
- Hugging Face Remote Code
- Exclude Embedding Layer
- Exclude Language Modeling Head
- Prune Language Modeling Head
- Include Last Hidden States Output
- Enable Shared Embeddings
- Disable QKV Projections Fusion
- Enable CUDA Graph
- Use 8 Bits Quantization in QMoE
- Use QDQ Pattern for Quantization
- LoRA Models
- Unit Testing Models
- Design
Current Support
The tool currently supports the following model architectures.
- AMD OLMo
- ChatGLM
- DeepSeek
- ERNIE 4.5
- Gemma
- gpt-oss
- Granite
- HunYuan Dense V1
- InternLM2
- Llama
- Mistral
- Nemotron
- Phi
- Qwen
- SmolLM3
- Whisper
It is intended for supporting the latest, popular state-of-the-art models.
Usage
Full Usage
For all available options, please use the -h/--help flag.
# From wheel:
python -m onnxruntime_genai.models.builder --help
# From source:
python builder.py --help
Original PyTorch Model from Hugging Face
This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
Original PyTorch Model from Disk
This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
Customized or Finetuned PyTorch Model
This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files
Quantized PyTorch Model
This scenario is where your PyTorch model is one of the currently supported model architectures, has already been quantized to INT4 precision, and your model can be loaded in the Hugging Face style via AutoGPTQ or AutoAWQ.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p int4 -e execution_provider -c cache_dir_to_store_temp_files
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p int4 -e execution_provider -c cache_dir_to_store_temp_files
GGUF Model
This scenario is where your float16/float32 GGUF model is already on disk.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
# From source:
python builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files
Extra Options
This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to --extra_options.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx
To see all available options through --extra_options, please use the help commands in the Full Usage section above.
Config Only
This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime GenAI.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true
Afterwards, please open the genai_config.json file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well.
Hugging Face Authentication
This scenario is for when you need to disable the Hugging Face authentication or use a different authentication token than the one stored in huggingface-cli login.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options hf_token=false
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options hf_token=false
Hugging Face Remote Code
This scenario is for when you need to disable trusting remote code from a Hugging Face repo.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options hf_remote=false
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options hf_remote=false
Exclude Embedding Layer
This scenario is for when you want to exclude the embedding layer from your ONNX model.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options exclude_embeds=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options exclude_embeds=true
Exclude Language Modeling Head
This scenario is for when you want to exclude the language modeling head from your ONNX model.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options exclude_lm_head=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options exclude_lm_head=true
Prune Language Modeling Head
This scenario is for when you want to prune the language modeling head to only compute the last token's logits.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options prune_lm_head=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options prune_lm_head=true
Include Last Hidden States Output
This scenario is for when you want to include the last hidden states as an output to your ONNX model.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options include_hidden_states=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options include_hidden_states=true
Note that this is the same as outputting embeddings since the last hidden states are also known as the embeddings.
Enable Shared Embeddings
This scenario is for when you want to enable weight sharing between the embedding layer and the language modeling head. This reduces model size and can improve memory efficiency, especially useful for models with tied embeddings (where tie_word_embeddings=true in config.json). Shared embeddings are automatically enabled if tie_word_embeddings=true in the model's config.json (can be overridden with shared_embeddings=false), but cannot be used with exclude_embeds=true or exclude_lm_head=true.
Option 1: INT4 (for RTN and K-Quant)
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=k_quant
# From source:
python builder.py -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=k_quant
Option 2: INT4 + INT8 embeddings (for RTN Last and K-Quant Last)
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=k_quant_last
# From source:
python builder.py -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=k_quant_last
Option 3: INT4 embeddings + FP16 embeddings
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=rtn int4_nodes_to_exclude=/lm_head/MatMul
# From source:
python builder.py -m model_name -o path_to_output_folder -p int4 -e cuda --extra_options shared_embeddings=true int4_algo_config=rtn int4_nodes_to_exclude=/lm_head/MatMul
Option 4: FP16 embeddings
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p fp16 -e cuda --extra_options shared_embeddings=true
# From source:
python builder.py -m model_name -o path_to_output_folder -p fp16 -e cuda --extra_options shared_embeddings=true
Disable QKV Projections Fusion
This scenario is for when you want to keep Q/K/V projections in the attention layer separate instead of fusing them into a single packed MatMul operation.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
Enable CUDA Graph
This scenario is for when you want to enable CUDA graph for your ONNX model.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options enable_cuda_graph=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options enable_cuda_graph=true
Use 8 Bits Quantization in QMoE
This scenario is for when you want to use 8-bit quantization for MoE layers. Default is using 4-bit quantization.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options use_8bits_moe=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options use_8bits_moe=true
Use QDQ Pattern for Quantization
This scenario is for when you want to use the QDQ pattern when quantizing the model to 4 bits.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options use_qdq=true
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options use_qdq=true
LoRA Models
This scenario is where you have a finetuned model with LoRA adapters and your model can be loaded in the Hugging Face style via PEFT.
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p fp16 -e execution_provider -c cache_dir_to_store_temp_files --extra_options adapter_path=path_to_adapter_files
# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p fp16 -e execution_provider -c cache_dir_to_store_temp_files --extra_options adapter_path=path_to_adapter_files
Base weights should be located in path_to_local_folder_on_disk and adapter weights should be located in path_to_adapter_files.
Unit Testing Models
This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "your_model_name"
cache_dir = "cache_dir_to_save_hf_files"
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
model.save_pretrained(cache_dir)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
tokenizer.save_pretrained(cache_dir)
Option 1: Use the model builder directly
This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
Option 2: Edit the config.json file on disk and then run the model builder
- Navigate to where the PyTorch model and its associated files are saved on disk.
- Modify
num_hidden_layersinconfig.jsonto your desired target (e.g. 4 layers). - Run the below command for the model builder.
# From wheel:
python -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
# From source:
python builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
Design
Please read the design document for more details and for how to contribute.