Section 3: Microsoft Olive Optimization Suite
October 30, 2025
Table of Contents
- Introduction
- What is Microsoft Olive?
- Installation
- Quick Start Guide
- Example: Converting Qwen3 to ONNX INT4
- Advanced Usage
- Olive Recipes Repository
- Best Practices
- Troubleshooting
- Additional Resources
Introduction
Microsoft Olive is a hardware-aware model optimization toolkit that simplifies optimizing machine learning models for deployment across different hardware platforms. Whether you're targeting CPUs, GPUs, or specialized AI accelerators, Olive helps you achieve strong performance while maintaining model accuracy.
What is Microsoft Olive?
Olive is an easy-to-use, hardware-aware model optimization tool that composes industry-leading techniques across model compression, optimization, and compilation. Together with ONNX Runtime, it provides an end-to-end (E2E) inference optimization solution.
Key Features
- Hardware-Aware Optimization: Automatically selects the best optimization techniques for your target hardware
- 40+ Built-in Optimization Components: Covers model compression, quantization, graph optimization, and more
- Easy CLI Interface: Simple commands for common optimization tasks
- Multi-Framework Support: Works with PyTorch, Hugging Face models, and ONNX
- Popular Model Support: Olive can automatically optimize popular model architectures like Llama, Phi, Qwen, Gemma, etc out-of-the-box
Benefits
- Reduced Development Time: No need to manually experiment with different optimization techniques
- Performance Gains: Significant speed improvements (up to 6x in some cases)
- Cross-Platform Deployment: Optimized models work across different hardware and operating systems
- Maintained Accuracy: Optimizations preserve model quality while improving performance
Installation
Prerequisites
- Python 3.8 or higher
- pip package manager
- Virtual environment (recommended)
Basic Installation
Create and activate a virtual environment:
# Create virtual environment
python -m venv olive-env
# Activate virtual environment
# On Windows:
olive-env\Scripts\activate
# On macOS/Linux:
source olive-env/bin/activate
Install Olive with auto-optimization features:
pip install olive-ai[auto-opt]
pip install transformers onnxruntime-genai
Optional Dependencies
Olive offers various optional dependencies for additional features:
# For Azure ML integration
pip install olive-ai[azureml]
# For DirectML (Windows GPU acceleration)
pip install olive-ai[directml]
# For CPU optimization
pip install olive-ai[cpu]
# For all features
pip install olive-ai[all]
Verify Installation
olive --help
If successful, you should see the Olive CLI help message.
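You can also confirm the package is installed from Python. This is a small sketch using the standard-library importlib.metadata, which works for any pip-installed distribution; "olive-ai" is the distribution name used in the pip commands above.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist_name: str):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

v = installed_version("olive-ai")
print(f"olive-ai version: {v}" if v else "olive-ai is not installed")
```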
Quick Start Guide
Your First Optimization
Let's optimize a small language model using Olive's auto-optimization feature:
olive auto-opt \
--model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct \
--output_path models/smolm2-optimized \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
What This Command Does
The optimization process involves four stages:
- Acquire the model (from Hugging Face or the local cache)
- Capture the ONNX graph and store the weights in an ONNX data file
- Optimize the ONNX graph
- Quantize the model to INT4 using the RTN (round-to-nearest) method
Command Parameters Explained
- --model_name_or_path: Hugging Face model identifier or local path
- --output_path: Directory where the optimized model will be saved
- --device: Target device (cpu, gpu)
- --provider: Execution provider (CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider)
- --use_ort_genai: Use ONNX Runtime GenAI for inference
- --precision: Quantization precision (int4, int8, fp16)
- --log_level: Logging verbosity (0=minimal, 1=verbose)
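If you drive Olive from a script rather than a terminal, the same flags can be assembled programmatically. A minimal sketch; the helper name and its defaults are illustrative, not part of Olive's API:

```python
def build_auto_opt_cmd(model: str, output_path: str, device: str = "cpu",
                       provider: str = "CPUExecutionProvider",
                       precision: str = "int4") -> list:
    """Assemble the olive auto-opt CLI invocation as an argument list."""
    return [
        "olive", "auto-opt",
        "--model_name_or_path", model,
        "--output_path", output_path,
        "--device", device,
        "--provider", provider,
        "--use_ort_genai",
        "--precision", precision,
        "--log_level", "1",
    ]

cmd = build_auto_opt_cmd("HuggingFaceTB/SmolLM2-135M-Instruct",
                         "models/smolm2-optimized")
print(" ".join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True) to launch the optimization.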
Example: Converting Qwen3 to ONNX INT4
Based on the Hugging Face example at lokinfey/Qwen3-8B-ONNX-INT4-CPU, here's how to apply the same workflow. The commands below use the smaller Qwen2.5-0.5B-Instruct model so they run quickly on modest hardware; substitute the Qwen3 model identifier for a full-size conversion.
Step 1: Download Model (Optional)
To minimize download time, cache only essential files:
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --include "*.json" "*.safetensors" "*.txt"
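The same selective download can be done from Python with huggingface_hub's snapshot_download and its allow_patterns filter; the pattern list mirrors the CLI command above. Requires the huggingface_hub package and network access.

```python
ESSENTIAL_PATTERNS = ["*.json", "*.safetensors", "*.txt"]

def download_essentials(repo_id: str, patterns=None) -> str:
    """Download only config, weight, and tokenizer files from a model repo.

    Returns the local snapshot directory path.
    """
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id, allow_patterns=patterns or ESSENTIAL_PATTERNS)

# Usage (requires network access):
# local_dir = download_essentials("Qwen/Qwen2.5-0.5B-Instruct")
```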
Step 2: Optimize Qwen3 Model
olive auto-opt \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--output_path models/qwen3-onnx-int4 \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
Step 3: Test the Optimized Model
Create a simple Python script to test your optimized model:
import onnxruntime_genai as og
# Load the optimized model
model = og.Model('models/qwen3-onnx-int4')
tokenizer = og.Tokenizer(model)
# Create a chat template
chat_template = '<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n'
# Generate text
prompt = "What is machine learning?"
input_tokens = tokenizer.encode(chat_template.format(input=prompt))
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = input_tokens
generator = og.Generator(model, params)
print("Generated response:")
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode([new_token]), end='', flush=True)
print()
Output Structure
After optimization, your output directory will contain:
models/qwen3-onnx-int4/
├── model.onnx # Optimized ONNX model
├── model.onnx.data # Model weights
├── genai_config.json # Generation configuration
├── tokenizer.json # Tokenizer files
├── tokenizer_config.json
└── special_tokens_map.json
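A quick sanity check that the export produced these files, using only the standard library; the expected file list follows the tree shown above.

```python
from pathlib import Path

EXPECTED_FILES = [
    "model.onnx", "model.onnx.data", "genai_config.json",
    "tokenizer.json", "tokenizer_config.json", "special_tokens_map.json",
]

def missing_files(model_dir: str) -> list:
    """Return the expected output files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Usage:
# gaps = missing_files("models/qwen3-onnx-int4")
# print("OK" if not gaps else f"Missing: {gaps}")
```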
Advanced Usage
Configuration Files
For more complex optimization workflows, you can use JSON configuration files:
{
"input_model": {
"type": "PyTorchModel",
"config": {
"hf_config": {
"model_name": "Qwen/Qwen2.5-0.5B-Instruct",
"task": "text-generation"
}
}
},
"systems": {
"local_system": {
"type": "LocalSystem",
"config": {
"accelerators": [
{
"device": "cpu",
"execution_providers": ["CPUExecutionProvider"]
}
]
}
}
},
"evaluators": {
"common_evaluator": {
"metrics": [
{
"name": "latency",
"type": "latency",
"sub_types": [{"name": "avg"}]
}
]
}
},
"passes": {
"conversion": {
"type": "ModelBuilder",
"config": {
"precision": "int4"
}
},
"optimization": {
"type": "OrtTransformersOptimization",
"config": {
"model_type": "gpt2"
}
}
},
"engine": {
"search_strategy": {
"execution_order": "joint",
"search_algorithm": "tpe"
},
"evaluator": "common_evaluator",
"host": "local_system",
"target": "local_system",
"cache_dir": "cache",
"output_dir": "models/optimized"
}
}
Run with configuration:
olive run --config config.json
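Before launching a long run, it can help to sanity-check the configuration file. A minimal sketch; the required-key list reflects the example configuration in this guide, not a complete Olive schema.

```python
import json

REQUIRED_KEYS = {"input_model", "passes"}

def validate_config(path: str) -> dict:
    """Load an Olive JSON config and check the sections this guide relies on."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing sections: {sorted(missing)}")
    return cfg

# Usage:
# cfg = validate_config("config.json")
```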
GPU Optimization
For CUDA GPU optimization:
olive auto-opt \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--output_path models/qwen3-gpu-int4 \
--device gpu \
--provider CUDAExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
For DirectML (Windows):
olive auto-opt \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--output_path models/qwen3-directml-int4 \
--device gpu \
--provider DmlExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
Fine-tuning with Olive
Olive also supports fine-tuning models:
olive finetune \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --data_name databricks/databricks-dolly-15k \
--text_template "### Question: {instruction}\n### Answer: {response}" \
--max_steps 100 \
--output_path models/llama-finetuned
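The --text_template placeholders map to column names in the dataset: each training example is the template with {instruction} and {response} filled in from one row. The same substitution in plain Python:

```python
TEMPLATE = "### Question: {instruction}\n### Answer: {response}"

def render_example(row: dict) -> str:
    """Fill the fine-tuning text template from one dataset row."""
    return TEMPLATE.format(instruction=row["instruction"],
                           response=row["response"])

# Illustrative row, not from the real dataset:
sample = {"instruction": "What is ONNX?", "response": "An open model format."}
print(render_example(sample))
```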
Best Practices
1. Model Selection
- Start with smaller models for testing (e.g., 0.5B-7B parameters)
- Ensure your target model architecture is supported by Olive
2. Hardware Considerations
- Match your optimization target to your deployment hardware
- Use GPU optimization if you have CUDA-compatible hardware
- Consider DirectML for Windows machines with integrated graphics
3. Precision Selection
- INT4: Maximum compression, slight accuracy loss
- INT8: Good balance of size and accuracy
- FP16: Minimal accuracy loss, moderate size reduction
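As a rough rule of thumb, weight storage scales with bits per parameter. The estimate below ignores quantization metadata (scales, zero points), so real INT4/INT8 files are somewhat larger:

```python
def estimated_weight_size_gb(n_params: int, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Ignores quantization metadata, so this is a lower bound.
    """
    return n_params * bits / 8 / 1e9

# A 0.5B-parameter model, by precision:
for bits, label in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"{label}: ~{estimated_weight_size_gb(500_000_000, bits):.2f} GB")
```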
4. Testing and Validation
- Always test optimized models with your specific use cases
- Compare performance metrics (latency, throughput, accuracy)
- Use representative input data for evaluation
5. Iterative Optimization
- Start with auto-optimization for quick results
- Use configuration files for fine-grained control
- Experiment with different optimization passes
Troubleshooting
Common Issues
1. Installation Problems
# If you encounter dependency conflicts:
pip install --upgrade pip
pip install olive-ai[auto-opt] --force-reinstall
2. CUDA/GPU Issues
# Verify CUDA installation:
nvidia-smi
# Install correct ONNX Runtime GPU package:
pip install onnxruntime-gpu
3. Memory Issues
- Use smaller batch sizes during optimization
- Try quantization with higher precision first (int8 instead of int4)
- Ensure sufficient disk space for model caching
4. Model Loading Errors
- Verify model path and access permissions
- Check if the model requires trust_remote_code=True
- Ensure all required model files are downloaded
Getting Help
- Documentation: microsoft.github.io/Olive
- GitHub Issues: github.com/microsoft/Olive/issues
- Examples: microsoft.github.io/Olive/examples.html
Olive Recipes Repository
Introduction to Olive Recipes
The microsoft/olive-recipes repository complements the main Olive toolkit by providing a comprehensive collection of ready-to-use optimization recipes for popular AI models. This repository serves as a practical reference for both optimizing publicly available models and creating optimization workflows for proprietary models.
Key Features
- 100+ Pre-built Recipes: Ready-to-use optimization configurations for popular models
- Multi-Architecture Support: Covers transformer models, vision models, and multimodal architectures
- Hardware-Specific Optimizations: Recipes tailored for CPU, GPU, and specialized accelerators
- Popular Model Families: Includes Phi, Llama, Qwen, Gemma, Mistral, and many more
Supported Model Families
The repository includes optimization recipes for:
Language Models
- Microsoft Phi: Phi-3-mini, Phi-3.5-mini, Phi-4-mini, Phi-4-reasoning
- Meta Llama: Llama-2-7b, Llama-3.1-8B, Llama-3.2-1B/3B
- Alibaba Qwen: Qwen1.5-7B, Qwen2-7B, Qwen2.5 series (0.5B to 14B)
- Google Gemma: Various Gemma model configurations
- Mistral AI: Mistral-7B series
- DeepSeek: R1-Distill series models
Vision and Multimodal Models
- Stable Diffusion: v1.4, XL-base-1.0
- CLIP Models: Various CLIP-ViT configurations
- ResNet: ResNet-50 optimizations
- Vision Transformers: ViT-base-patch16-224
Specialized Models
- Whisper: OpenAI Whisper-large-v3
- BERT: Base and multilingual variants
- Sentence Transformers: all-MiniLM-L6-v2
Using Olive Recipes
Method 1: Clone Specific Recipe
# Clone the recipes repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes
# Navigate to a specific model recipe
cd microsoft-Phi-4-mini-instruct
# Run the optimization
olive run --config olive_config.json
Method 2: Use Recipe as Template
# Copy a recipe configuration for your model
cp olive-recipes/microsoft-Phi-3-mini-4k-instruct/olive_config.json ./my_config.json
# Modify the configuration for your needs
# Update model paths, optimization parameters, etc.
# Run with your custom configuration
olive run --config my_config.json
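Retargeting a recipe to a different model can also be scripted. This sketch assumes the recipe uses the PyTorchModel/hf_config layout shown elsewhere in this guide; adjust the key path if a particular recipe is structured differently.

```python
import json

def retarget_recipe(config_path: str, new_model: str, out_path: str) -> dict:
    """Copy a recipe config, pointing input_model at a different HF model."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Key path follows the PyTorchModel/hf_config layout used in this guide.
    cfg["input_model"]["config"]["hf_config"]["model_name"] = new_model
    with open(out_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Usage:
# retarget_recipe("olive-recipes/microsoft-Phi-3-mini-4k-instruct/olive_config.json",
#                 "microsoft/Phi-3.5-mini-instruct", "my_config.json")
```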
Recipe Structure
Each recipe directory typically contains:
model-name/
├── olive_config.json # Main optimization configuration
├── requirements.txt # Python dependencies
├── README.md # Model-specific instructions
├── user_script.py # Custom preprocessing/evaluation scripts
└── sample_data/ # Sample input data for testing
Example: Using Phi-4-mini Recipe
Let's use the Phi-4-mini recipe as an example:
# Clone the repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes/microsoft-Phi-4-mini-instruct
# Install dependencies
pip install -r requirements.txt
# Run the optimization
olive run --config olive_config.json
The configuration file typically includes:
{
"input_model": {
"type": "PyTorchModel",
"config": {
"hf_config": {
"model_name": "microsoft/Phi-4-mini-instruct",
"task": "text-generation",
"trust_remote_code": true
}
}
},
"systems": {
"local_system": {
"type": "LocalSystem",
"config": {
"accelerators": [
{
"device": "cpu",
"execution_providers": ["CPUExecutionProvider"]
}
]
}
}
},
"passes": {
"convert": {
"type": "ModelBuilder",
"config": {
"precision": "int4"
}
}
}
}
Customizing Recipes
Modifying Target Hardware
To change the target hardware, update the systems section:
{
"systems": {
"gpu_system": {
"type": "LocalSystem",
"config": {
"accelerators": [
{
"device": "gpu",
"execution_providers": ["CUDAExecutionProvider"]
}
]
}
}
}
}
Adjusting Optimization Parameters
Modify the passes section for different optimization levels:
{
"passes": {
"convert": {
"type": "ModelBuilder",
"config": {
"precision": "int8", // Change from int4 to int8
"use_ort_genai": true,
"use_dynamo_exporter": true
}
},
"optimize": {
"type": "OrtTransformersOptimization",
"config": {
"optimization_level": "all"
}
}
}
}
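The same edit can be made programmatically; the key path below matches the passes section shown above, and only ModelBuilder passes are touched.

```python
import json

def set_precision(cfg: dict, precision: str) -> dict:
    """Update the precision of every ModelBuilder pass in an Olive config dict."""
    for p in cfg.get("passes", {}).values():
        if p.get("type") == "ModelBuilder":
            p.setdefault("config", {})["precision"] = precision
    return cfg

cfg = {"passes": {"convert": {"type": "ModelBuilder",
                              "config": {"precision": "int4"}}}}
print(json.dumps(set_precision(cfg, "int8"), indent=2))
```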
Creating Your Own Recipe
- Start with a Similar Model: Find a recipe for a model with similar architecture
- Update Model Configuration: Change the model name/path in the configuration
- Adjust Parameters: Modify optimization parameters as needed
- Test and Validate: Run the optimization and validate results
- Contribute Back: Consider contributing your recipe to the repository
Benefits of Using Recipes
1. Proven Configurations
- Tested optimization settings for specific models
- Avoids trial-and-error in finding optimal parameters
2. Hardware-Specific Tuning
- Pre-optimized for different execution providers
- Ready-to-use configurations for CPU, GPU, and NPU targets
3. Comprehensive Coverage
- Supports the most popular open-source models
- Regular updates with new model releases
4. Community Contributions
- Collaborative development with the AI community
- Shared knowledge and best practices
Contributing to Olive Recipes
If you've optimized a model not covered in the repository:
- Fork the Repository: Create your own fork of olive-recipes
- Create Recipe Directory: Add a new directory for your model
- Include Configuration: Add olive_config.json and supporting files
- Document Usage: Provide clear README with instructions
- Submit Pull Request: Contribute back to the community
Performance Benchmarks
Many recipes include performance benchmarks showing:
- Latency Improvements: Typical 2-6x speedup over baseline
- Memory Reduction: 50-75% memory usage reduction with quantization
- Accuracy Retention: 95-99% accuracy preservation
Integration with AI Toolkit
The recipes work seamlessly with:
- VS Code AI Toolkit: Direct integration for model optimization
- Azure Machine Learning: Cloud-based optimization workflows
- ONNX Runtime: Optimized inference deployment
Additional Resources
Official Links
- GitHub Repository: github.com/microsoft/Olive
- Olive Recipes Repository: github.com/microsoft/olive-recipes
- ONNX Runtime Documentation: onnxruntime.ai/docs/performance/olive.html
- Hugging Face Example: huggingface.co/lokinfey/Qwen3-8B-ONNX-INT4-CPU
Community Examples
- Jupyter Notebooks: Available in the Olive GitHub repository — https://github.com/microsoft/Olive/tree/main/examples
- VS Code Extension: AI Toolkit for VS Code overview — https://learn.microsoft.com/azure/ai-toolkit/overview
- Blog Posts: Microsoft Open Source Blog — https://opensource.microsoft.com/blog/
Related Tools
- ONNX Runtime: High-performance inference engine — https://onnxruntime.ai/
- Hugging Face Transformers: Source of many compatible models — https://huggingface.co/docs/transformers/index
- Azure Machine Learning: Cloud-based optimization workflows — https://learn.microsoft.com/azure/machine-learning/