Section 3: Microsoft Olive Optimization Suite

October 30, 2025

Introduction
What is Microsoft Olive?
Installation
Quick Start Guide
Example: Converting Qwen3 to ONNX INT4
Advanced Usage
Best Practices
Troubleshooting
Olive Recipes Repository
Additional Resources

Introduction

Microsoft Olive is a hardware-aware model optimization toolkit that simplifies preparing machine learning models for deployment across different hardware platforms. Whether you're targeting CPUs, GPUs, or specialized AI accelerators, Olive helps you improve inference performance while keeping accuracy loss small.

What is Microsoft Olive?

Olive composes industry-leading techniques across model compression, optimization, and compilation, and selects among them based on the target hardware. Together with ONNX Runtime, it forms an end-to-end inference optimization solution.

Key Features

Hardware-Aware Optimization: Automatically selects the best optimization techniques for your target hardware
40+ Built-in Optimization Components: Covers model compression, quantization, graph optimization, and more
Easy CLI Interface: Simple commands for common optimization tasks
Multi-Framework Support: Works with PyTorch, Hugging Face models, and ONNX
Popular Model Support: Olive can optimize popular model architectures such as Llama, Phi, Qwen, and Gemma out of the box

Benefits

Reduced Development Time: No need to manually experiment with different optimization techniques
Performance Gains: Significant speed improvements (up to 6x in some cases)
Cross-Platform Deployment: Optimized models work across different hardware and operating systems
Maintained Accuracy: Optimizations aim to preserve model quality while improving performance; lower precisions such as INT4 can cost some accuracy

Installation

Prerequisites

Python 3.8 or higher
pip package manager
Virtual environment (recommended)

Basic Installation

Create and activate a virtual environment:

# Create virtual environment
python -m venv olive-env

# Activate virtual environment
# On Windows:
olive-env\Scripts\activate
# On macOS/Linux:
source olive-env/bin/activate

Install Olive with auto-optimization features:

pip install olive-ai[auto-opt]
pip install transformers onnxruntime-genai

Optional Dependencies

Olive offers various optional dependencies for additional features:

# For Azure ML integration
pip install olive-ai[azureml]

# For DirectML (Windows GPU acceleration)
pip install olive-ai[directml]

# For CPU optimization
pip install olive-ai[cpu]

# For all features
pip install olive-ai[all]

Verify Installation

olive --help

If successful, you should see the Olive CLI help message.
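
The same check can be scripted, for example in a setup script or a CI job. The sketch below uses only the standard library. It assumes the installed distribution is named olive-ai (as in the pip commands above) and that the CLI entry point is called olive; the helper name check_olive_install is hypothetical.

```python
import shutil
from importlib import metadata


def check_olive_install(dist_name="olive-ai", cli_name="olive"):
    """Report whether the Olive package and its CLI are visible in this environment."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in the active environment
    return {
        "package_version": version,
        "cli_path": shutil.which(cli_name),  # None if the CLI is not on PATH
    }


if __name__ == "__main__":
    status = check_olive_install()
    print(status)
    if status["package_version"] is None or status["cli_path"] is None:
        print("Olive does not appear to be fully installed; "
              "activate the virtual environment and reinstall.")
```

If either value is None, the most common cause is running the check outside the virtual environment created earlier.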

Quick Start Guide

Your First Optimization

Let's optimize a small language model using Olive's auto-optimization feature:

olive auto-opt \
  --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct \
  --output_path models/smolm2-optimized \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

What This Command Does

The command runs four steps: it acquires the model (reusing the local cache if the model is already there, otherwise downloading it from Hugging Face), captures the ONNX graph and stores the weights in an ONNX data file, optimizes the ONNX graph, and quantizes the weights to int4 using the round-to-nearest (RTN) method.
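
To make the quantization step concrete, here is a minimal sketch of round-to-nearest quantization to 4 bits in plain Python. It illustrates the idea only and is not Olive's implementation: the symmetric scheme, the [-8, 7] integer range, and the block size of 4 are assumptions chosen for readability.

```python
def rtn_quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of float weights."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # [-8, 7] for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per block
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]


def rtn_quantize(weights, block_size=4, bits=4):
    """Quantize a flat weight list block by block; returns (ints, scales)."""
    ints, scales = [], []
    for start in range(0, len(weights), block_size):
        q, s = rtn_quantize_block(weights[start:start + block_size], bits)
        ints.extend(q)
        scales.append(s)
    return ints, scales


if __name__ == "__main__":
    w = [0.12, -0.50, 0.33, 0.07, 1.40, -0.20, 0.05, -1.10]
    ints, scales = rtn_quantize(w)
    print("integers:", ints)
    print("scales:  ", scales)
```

Each weight is stored as a small integer plus a shared per-block scale, so the rounding error per weight is at most half the block's scale. Smaller blocks reduce that error at the cost of storing more scales.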

Command Parameters Explained

--model_name_or_path: Hugging Face model identifier or local path
--output_path: Directory where optimized model will be saved
--device: Target device (cpu, gpu)
--provider: Execution provider (CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider)
--use_ort_genai: Use ONNX Runtime GenAI (the generate() API) for inference
--precision: Quantization precision (int4, int8, fp16)
--log_level: Logging verbosity; lower values are more verbose (0=debug, 1=info, 2=warning, 3=error, 4=critical)
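
When the same command is run for several models or targets, it can help to assemble it in code. The sketch below builds the argument list from the parameters described above and prints it; the helper name build_auto_opt_command is hypothetical, and only flags that appear in this guide are used.

```python
import shlex


def build_auto_opt_command(model, output_path, device="cpu",
                           provider="CPUExecutionProvider",
                           precision="int4", use_ort_genai=True, log_level=1):
    """Assemble the `olive auto-opt` argument list used in this guide."""
    cmd = [
        "olive", "auto-opt",
        "--model_name_or_path", model,
        "--output_path", output_path,
        "--device", device,
        "--provider", provider,
        "--precision", precision,
        "--log_level", str(log_level),
    ]
    if use_ort_genai:
        cmd.append("--use_ort_genai")  # boolean flag, takes no value
    return cmd


if __name__ == "__main__":
    cmd = build_auto_opt_command(
        "HuggingFaceTB/SmolLM2-135M-Instruct", "models/smolm2-optimized"
    )
    print(shlex.join(cmd))  # inspect the command before running it
    # To execute it (requires Olive to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with model names and paths.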

Example: Converting Qwen3 to ONNX INT4

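This example follows the Hugging Face model lokinfey/Qwen3-8B-ONNX-INT4-CPU. The commands below use the smaller Qwen/Qwen2.5-0.5B-Instruct so that the walkthrough runs quickly; to convert a Qwen3 model, substitute its Hugging Face model ID in --model_name_or_path.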

Step 1: Download Model (Optional)

To minimize download time, cache only essential files:

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct "*.json" "*.safetensors" "*.txt"

Step 2: Optimize Qwen3 Model

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-onnx-int4 \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

Step 3: Test the Optimized Model

Create a simple Python script to test your optimized model:

import onnxruntime_genai as og

# Load the optimized model and its tokenizer
model = og.Model('models/qwen3-onnx-int4')
tokenizer = og.Tokenizer(model)
# A tokenizer stream decodes token by token without splitting multi-byte characters
tokenizer_stream = tokenizer.create_stream()

# Chat template for the model
chat_template = '<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n'

# Encode the prompt
prompt = "What is machine learning?"
input_tokens = tokenizer.encode(chat_template.format(input=prompt))

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

generator = og.Generator(model, params)
# onnxruntime-genai 0.6 and later: pass the prompt with append_tokens().
# Older releases instead set params.input_ids = input_tokens before creating
# the generator and call generator.compute_logits() inside the loop.
generator.append_tokens(input_tokens)

print("Generated response:")
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

print()

Output Structure

After optimization, your output directory will contain:

models/qwen3-onnx-int4/
├── model.onnx              # Optimized ONNX model
├── model.onnx.data         # Model weights
├── genai_config.json       # Generation configuration
├── tokenizer.json          # Tokenizer files
├── tokenizer_config.json
└── special_tokens_map.json
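
Before loading the model, a quick check that the expected files exist can save a confusing runtime error. This is a minimal sketch: the file names are copied from the listing above, the helper name missing_output_files is hypothetical, and the exact layout may differ between Olive versions.

```python
from pathlib import Path

# File names taken from the listing above; adjust if your Olive version differs.
EXPECTED_FILES = [
    "model.onnx", "model.onnx.data", "genai_config.json",
    "tokenizer.json", "tokenizer_config.json", "special_tokens_map.json",
]


def missing_output_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_output_files("models/qwen3-onnx-int4")
    print("All expected files found." if not missing else f"Missing: {missing}")
```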

Advanced Usage

Configuration Files

For more complex optimization workflows, you can use JSON configuration files:

{
  "input_model": {
    "type": "PyTorchModel",
    "config": {
      "hf_config": {
        "model_name": "Qwen/Qwen2.5-0.5B-Instruct",
        "task": "text-generation"
      }
    }
  },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "cpu",
            "execution_providers": ["CPUExecutionProvider"]
          }
        ]
      }
    }
  },
  "evaluators": {
    "common_evaluator": {
      "metrics": [
        {
          "name": "latency",
          "type": "latency",
          "sub_types": [{"name": "avg"}]
        }
      ]
    }
  },
  "passes": {
    "conversion": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int4"
      }
    },
    "optimization": {
      "type": "OrtTransformersOptimization",
      "config": {
        "model_type": "gpt2"
      }
    }
  },
  "engine": {
    "search_strategy": {
      "execution_order": "joint",
      "search_algorithm": "tpe"
    },
    "evaluator": "common_evaluator",
    "host": "local_system",
    "target": "local_system",
    "cache_dir": "cache",
    "output_dir": "models/optimized"
  }
}

Run with configuration:

olive run --config config.json
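
A common mistake in hand-edited configurations is an engine entry that names a system or evaluator that is not defined. The sketch below checks those cross-references before running Olive. The rule it enforces is inferred from the example above (engine.host, engine.target, and engine.evaluator name entries in systems and evaluators), not taken from Olive's schema, and the helper name check_engine_references is hypothetical.

```python
import json


def check_engine_references(config):
    """Return a list of problems where `engine` names something that is not defined."""
    problems = []
    engine = config.get("engine", {})
    systems = config.get("systems", {})
    evaluators = config.get("evaluators", {})
    for key in ("host", "target"):
        name = engine.get(key)
        if name is not None and name not in systems:
            problems.append(f"engine.{key} refers to unknown system '{name}'")
    name = engine.get("evaluator")
    if name is not None and name not in evaluators:
        problems.append(f"engine.evaluator refers to unknown evaluator '{name}'")
    return problems


if __name__ == "__main__":
    with open("config.json", encoding="utf-8") as f:
        config = json.load(f)  # also fails fast on JSON syntax errors
    for problem in check_engine_references(config):
        print(problem)
```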

GPU Optimization

For CUDA GPU optimization:

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-gpu-int4 \
  --device gpu \
  --provider CUDAExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

For DirectML (Windows):

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-directml-int4 \
  --device gpu \
  --provider DmlExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

Fine-tuning with Olive

Olive also supports fine-tuning models:

olive finetune \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --data_name databricks/databricks-dolly-15k \
  --text_template "### Question: {instruction}\n### Answer: {response}" \
  --max_steps 100 \
  --output_path models/llama-finetuned
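
The --text_template value determines the training text built from each dataset row. The sketch below shows the idea with Python's str.format, assuming the placeholders are filled from dataset columns of the same name; the row contents and the helper name render_example are hypothetical, and Olive's own substitution logic may differ in detail.

```python
# Template from the command above, with a real newline between the two parts.
TEXT_TEMPLATE = "### Question: {instruction}\n### Answer: {response}"


def render_example(row, template=TEXT_TEMPLATE):
    """Fill the template's placeholders from one dataset row (a dict of columns)."""
    return template.format(**row)  # columns without a placeholder are ignored


if __name__ == "__main__":
    # Hypothetical row; the column names must match the placeholders.
    row = {"instruction": "What is ONNX?", "response": "An open model format."}
    print(render_example(row))
```

If a placeholder has no matching column, str.format raises KeyError, which is a quick way to catch a template that does not fit the dataset.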

Best Practices

1. Model Selection

Start with smaller models for testing (e.g., 0.5B-7B parameters)
Ensure your target model architecture is supported by Olive

2. Hardware Considerations

Match your optimization target to your deployment hardware
Use GPU optimization if you have CUDA-compatible hardware
Consider DirectML on Windows machines, where it runs on any DirectX 12-capable GPU, including integrated graphics

3. Precision Selection

INT4: Maximum compression, slight accuracy loss
INT8: Good balance of size and accuracy
FP16: Minimal accuracy loss, moderate size reduction
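
A rough size estimate makes the trade-off concrete: weight storage is approximately the parameter count times the bits per weight, divided by 8. The sketch below applies that formula; it ignores quantization metadata such as per-block scales and any layers left unquantized, so real files will be somewhat larger. The helper name estimate_weight_size_gb is hypothetical.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def estimate_weight_size_gb(num_params, precision):
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


if __name__ == "__main__":
    params = 0.5e9  # a 0.5B-parameter model
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"{precision}: ~{estimate_weight_size_gb(params, precision):.2f} GB")
```

For a 0.5B-parameter model this gives about 2 GB at FP32, 1 GB at FP16, 0.5 GB at INT8, and 0.25 GB at INT4.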

4. Testing and Validation

Always test optimized models with your specific use cases
Compare performance metrics (latency, throughput, accuracy)
Use representative input data for evaluation
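
For latency comparisons, a small timing harness with warm-up runs gives more stable numbers than timing a single call. This is a generic sketch using the standard library; the workload shown is a stand-in, and the helper name measure_latency is hypothetical.

```python
import statistics
import time


def measure_latency(fn, warmup=2, runs=10):
    """Call fn repeatedly and return (mean, stdev) latency in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs your model on a real prompt.
    mean_s, stdev_s = measure_latency(lambda: sum(range(100_000)))
    print(f"mean {mean_s * 1e3:.2f} ms, stdev {stdev_s * 1e3:.2f} ms")
```

Run the same harness against the original and the optimized model with identical inputs, on the deployment hardware, so the two numbers are comparable.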

5. Iterative Optimization

Start with auto-optimization for quick results
Use configuration files for fine-grained control
Experiment with different optimization passes

Troubleshooting

Common Issues

1. Installation Problems

# If you encounter dependency conflicts:
pip install --upgrade pip
pip install olive-ai[auto-opt] --force-reinstall

2. CUDA/GPU Issues

# Verify CUDA installation:
nvidia-smi

# Install correct ONNX Runtime GPU package:
pip install onnxruntime-gpu

3. Memory Issues

Use smaller batch sizes during optimization
Try quantization with higher precision first (int8 instead of int4)
Ensure sufficient disk space for model caching

4. Model Loading Errors

Verify model path and access permissions
Check if model requires trust_remote_code=True
Ensure all required model files are downloaded

Getting Help

Documentation: microsoft.github.io/Olive
GitHub Issues: github.com/microsoft/Olive/issues
Examples: microsoft.github.io/Olive/examples.html

Olive Recipes Repository

Introduction to Olive Recipes

The microsoft/olive-recipes repository complements the main Olive toolkit by providing a comprehensive collection of ready-to-use optimization recipes for popular AI models. This repository serves as a practical reference for both optimizing publicly available models and creating optimization workflows for proprietary models.

Key Features

100+ Pre-built Recipes: Ready-to-use optimization configurations for popular models
Multi-Architecture Support: Covers transformer models, vision models, and multimodal architectures
Hardware-Specific Optimizations: Recipes tailored for CPU, GPU, and specialized accelerators
Popular Model Families: Includes Phi, Llama, Qwen, Gemma, Mistral, and many more

Supported Model Families

The repository includes optimization recipes for:

Language Models

Microsoft Phi: Phi-3-mini, Phi-3.5-mini, Phi-4-mini, Phi-4-reasoning
Meta Llama: Llama-2-7b, Llama-3.1-8B, Llama-3.2-1B/3B
Alibaba Qwen: Qwen1.5-7B, Qwen2-7B, Qwen2.5 series (0.5B to 14B)
Google Gemma: Various Gemma model configurations
Mistral AI: Mistral-7B series
DeepSeek: R1-Distill series models

Vision and Multimodal Models

Stable Diffusion: v1.4, XL-base-1.0
CLIP Models: Various CLIP-ViT configurations
ResNet: ResNet-50 optimizations
Vision Transformers: ViT-base-patch16-224

Specialized Models

Whisper: OpenAI Whisper-large-v3
BERT: Base and multilingual variants
Sentence Transformers: all-MiniLM-L6-v2

Using Olive Recipes

Method 1: Clone Specific Recipe

# Clone the recipes repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes

# Navigate to a specific model recipe
cd microsoft-Phi-4-mini-instruct

# Run the optimization
olive run --config olive_config.json

Method 2: Use Recipe as Template

# Copy a recipe configuration for your model
cp olive-recipes/microsoft-Phi-3-mini-4k-instruct/olive_config.json ./my_config.json

# Modify the configuration for your needs
# Update model paths, optimization parameters, etc.

# Run with your custom configuration
olive run --config my_config.json
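
The edit step can also be scripted. The sketch below swaps the model name in a copied recipe, assuming the configuration uses the input_model → config → hf_config → model_name layout shown in this guide; actual recipes may be laid out differently, so check the file first. The helper name retarget_model is hypothetical.

```python
import copy
import json


def retarget_model(config, model_name):
    """Return a copy of a recipe config that points at a different model."""
    new_config = copy.deepcopy(config)  # leave the original recipe untouched
    new_config["input_model"]["config"]["hf_config"]["model_name"] = model_name
    return new_config


if __name__ == "__main__":
    with open("my_config.json", encoding="utf-8") as f:
        recipe = json.load(f)
    updated = retarget_model(recipe, "Qwen/Qwen2.5-0.5B-Instruct")
    with open("my_config.json", "w", encoding="utf-8") as f:
        json.dump(updated, f, indent=2)
```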

Recipe Structure

Each recipe directory typically contains:

model-name/
├── olive_config.json       # Main optimization configuration
├── requirements.txt        # Python dependencies
├── README.md              # Model-specific instructions
├── user_script.py         # Custom preprocessing/evaluation scripts
└── sample_data/           # Sample input data for testing

Example: Using Phi-4-mini Recipe

Let's use the Phi-4-mini recipe as an example:

# Clone the repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes/microsoft-Phi-4-mini-instruct

# Install dependencies
pip install -r requirements.txt

# Run the optimization
olive run --config olive_config.json

The configuration file typically includes:

{
  "input_model": {
    "type": "PyTorchModel",
    "config": {
      "hf_config": {
        "model_name": "microsoft/Phi-4-mini-instruct",
        "task": "text-generation",
        "trust_remote_code": true
      }
    }
  },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "cpu",
            "execution_providers": ["CPUExecutionProvider"]
          }
        ]
      }
    }
  },
  "passes": {
    "convert": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int4"
      }
    }
  }
}

Customizing Recipes

Modifying Target Hardware

To change the target hardware, update the systems section:

{
  "systems": {
    "gpu_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "gpu",
            "execution_providers": ["CUDAExecutionProvider"]
          }
        ]
      }
    }
  }
}
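
The device and execution provider must agree, which is easy to get wrong when editing by hand. The sketch below generates a systems section from the provider name, using the device/provider pairings that appear in this guide. The mapping table and the helper name make_systems_section are illustrative assumptions, not part of Olive.

```python
import json

# Device each execution provider is paired with in this guide.
PROVIDER_DEVICE = {
    "CPUExecutionProvider": "cpu",
    "CUDAExecutionProvider": "gpu",
    "DmlExecutionProvider": "gpu",
}


def make_systems_section(provider, system_name="local_system"):
    """Build a `systems` section whose device matches the execution provider."""
    if provider not in PROVIDER_DEVICE:
        raise ValueError(f"Unknown execution provider: {provider}")
    return {
        system_name: {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": PROVIDER_DEVICE[provider],
                        "execution_providers": [provider],
                    }
                ]
            },
        }
    }


if __name__ == "__main__":
    systems = make_systems_section("CUDAExecutionProvider", system_name="gpu_system")
    print(json.dumps({"systems": systems}, indent=2))
```

If the system is renamed (for example to gpu_system, as above), any host or target entries that refer to it by name need the same change.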

Adjusting Optimization Parameters

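Modify the passes section to change how the model is converted and optimized. The example below changes precision from int4 to int8 (JSON does not allow inline comments, so the change is noted here rather than in the file):

{
  "passes": {
    "convert": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int8",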
        "use_ort_genai": true,
        "use_dynamo_exporter": true
      }
    },
    "optimize": {
      "type": "OrtTransformersOptimization",
      "config": {
        "optimization_level": "all"
      }
    }
  }
}

Creating Your Own Recipe

Start with a Similar Model: Find a recipe for a model with similar architecture
Update Model Configuration: Change the model name/path in the configuration
Adjust Parameters: Modify optimization parameters as needed
Test and Validate: Run the optimization and validate results
Contribute Back: Consider contributing your recipe to the repository

Benefits of Using Recipes

1. Proven Configurations

Tested optimization settings for specific models
Avoids trial-and-error in finding optimal parameters

2. Hardware-Specific Tuning

Pre-optimized for different execution providers
Ready-to-use configurations for CPU, GPU, and NPU targets

3. Comprehensive Coverage

Supports the most popular open-source models
Regular updates with new model releases

4. Community Contributions

Collaborative development with the AI community
Shared knowledge and best practices

Contributing to Olive Recipes

If you've optimized a model not covered in the repository:

Fork the Repository: Create your own fork of olive-recipes
Create Recipe Directory: Add a new directory for your model
Include Configuration: Add olive_config.json and supporting files
Document Usage: Provide clear README with instructions
Submit Pull Request: Contribute back to the community

Performance Benchmarks

Many recipes include performance benchmarks showing:

Latency Improvements: Typical 2-6x speedup over baseline
Memory Reduction: 50-75% memory usage reduction with quantization
Accuracy Retention: 95-99% accuracy preservation
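
These figures are simple ratios, so they are easy to reproduce from your own measurements. The sketch below shows the arithmetic; the input numbers are hypothetical placeholders, not benchmark results, and the helper names are illustrative.

```python
def speedup(baseline_latency, optimized_latency):
    """How many times faster the optimized model is (same units for both)."""
    return baseline_latency / optimized_latency


def reduction_percent(baseline, optimized):
    """Percentage saved relative to the baseline (memory, file size, ...)."""
    return (1 - optimized / baseline) * 100


if __name__ == "__main__":
    # Hypothetical measurements for illustration only.
    print(f"speedup: {speedup(120.0, 40.0):.1f}x")
    print(f"memory reduction: {reduction_percent(4.0, 1.0):.0f}%")
```

As a sanity check on the ranges above, halving the bytes per weight (FP32 to FP16) is a 50% reduction, and a fourfold cut (FP32 to INT8) is 75%.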

Integration with AI Toolkit

The recipes work seamlessly with:

VS Code AI Toolkit: Direct integration for model optimization
Azure Machine Learning: Cloud-based optimization workflows
ONNX Runtime: Optimized inference deployment

Additional Resources

Official Links

GitHub Repository: github.com/microsoft/Olive
Olive Recipes Repository: github.com/microsoft/olive-recipes
ONNX Runtime Documentation: onnxruntime.ai/docs/performance/olive.html
Hugging Face Example: huggingface.co/lokinfey/Qwen3-8B-ONNX-INT4-CPU

Community Examples

Jupyter Notebooks: Available in the Olive GitHub repository — https://github.com/microsoft/Olive/tree/main/examples
VS Code Extension: AI Toolkit for VS Code overview — https://learn.microsoft.com/azure/ai-toolkit/overview
Blog Posts: Microsoft Open Source Blog — https://opensource.microsoft.com/blog/

ONNX Runtime: High-performance inference engine — https://onnxruntime.ai/
Hugging Face Transformers: Source of many compatible models — https://huggingface.co/docs/transformers/index
Azure Machine Learning: Cloud-based optimization workflows — https://learn.microsoft.com/azure/machine-learning/

➡️ What's next

04: OpenVINO Toolkit Optimization Suite