Section 3: Microsoft Olive Optimization Suite

October 30, 2025

Table of Contents

  1. Introduction
  2. What is Microsoft Olive?
  3. Installation
  4. Quick Start Guide
  5. Example: Converting Qwen3 to ONNX INT4
  6. Advanced Usage
  7. Best Practices
  8. Troubleshooting
  9. Olive Recipes Repository
  10. Additional Resources

Introduction

Microsoft Olive is a hardware-aware model optimization toolkit that simplifies optimizing machine learning models for deployment across different hardware platforms. Whether you're targeting CPUs, GPUs, or specialized AI accelerators, Olive helps you improve inference performance while keeping accuracy loss small.

What is Microsoft Olive?

Olive composes established techniques for model compression, optimization, and compilation into a single workflow. It works with ONNX Runtime as an end-to-end (E2E) inference optimization solution.

Key Features

  • Hardware-Aware Optimization: Automatically selects the best optimization techniques for your target hardware
  • 40+ Built-in Optimization Components: Covers model compression, quantization, graph optimization, and more
  • Easy CLI Interface: Simple commands for common optimization tasks
  • Multi-Framework Support: Works with PyTorch, Hugging Face models, and ONNX
  • Popular Model Support: Olive can optimize popular model architectures such as Llama, Phi, Qwen, and Gemma out of the box

Benefits

  • Reduced Development Time: No need to manually experiment with different optimization techniques
  • Performance Gains: Significant speed improvements (up to 6x in some cases)
  • Cross-Platform Deployment: Optimized models work across different hardware and operating systems
  • Maintained Accuracy: Optimizations aim to preserve model quality while improving performance

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Virtual environment (recommended)

Basic Installation

Create and activate a virtual environment:

# Create virtual environment
python -m venv olive-env

# Activate virtual environment
# On Windows:
olive-env\Scripts\activate
# On macOS/Linux:
source olive-env/bin/activate

Install Olive with auto-optimization features:

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "olive-ai[auto-opt]"
pip install transformers onnxruntime-genai

Optional Dependencies

Olive offers various optional dependencies for additional features:

# For Azure ML integration
pip install "olive-ai[azureml]"

# For DirectML (Windows GPU acceleration)
pip install "olive-ai[directml]"

# For CPU optimization
pip install "olive-ai[cpu]"

# For all features
pip install "olive-ai[all]"

Verify Installation

olive --help

If successful, you should see the Olive CLI help message.
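
The same check can be scripted, for example in a CI job. The sketch below is a minimal example, assuming the packages are installed under the distribution names used in the pip commands above (olive-ai, transformers, onnxruntime-genai). The helper name installed_versions is illustrative and not part of Olive.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    required = ["olive-ai", "transformers", "onnxruntime-genai"]
    for name, version in installed_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A missing entry usually means the virtual environment is not activated or the install step failed.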

Quick Start Guide

Your First Optimization

Let's optimize a small language model using Olive's auto-optimization feature:

olive auto-opt \
  --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct \
  --output_path models/smolm2-optimized \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

What This Command Does

The command runs four steps:

  1. Acquire the model from Hugging Face, or from the local cache if it was downloaded earlier
  2. Capture the ONNX graph and store the weights in an ONNX data file
  3. Optimize the ONNX graph
  4. Quantize the model to int4 using the round-to-nearest (RTN) method
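
To give a feel for what RTN quantization does, here is a minimal pure-Python sketch. It assumes symmetric, block-wise quantization with one scale per block of 32 weights; the block size, the symmetric scheme, and the function names are illustrative assumptions, not a description of Olive's actual implementation.

```python
def rtn_quantize_int4(weights, block_size=32):
    """Round-to-nearest quantization of a flat list of floats to signed 4-bit values.

    Returns a list of (scale, quantized_values) pairs, one per block.
    """
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7, the top of the int4 range.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        quantized = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate float weights from (scale, values) blocks."""
    return [scale * q for scale, values in blocks for q in values]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]
    restored = dequantize(rtn_quantize_int4(original, block_size=4))
    for before, after in zip(original, restored):
        print(f"{before:+.4f} -> {after:+.4f}")
```

Each weight is stored as a 4-bit integer plus a shared per-block scale, which is where the size reduction comes from. The rounding error per weight is at most half a scale step.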

Command Parameters Explained

  • --model_name_or_path: Hugging Face model identifier or local path
  • --output_path: Directory where optimized model will be saved
  • --device: Target device (cpu, gpu)
  • --provider: Execution provider (CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider)
  • --use_ort_genai: Produce output that can be run with ONNX Runtime GenAI (the onnxruntime-genai package)
  • --precision: Quantization precision (int4, int8, fp16)
  • --log_level: Logging level (0=DEBUG, 1=INFO, 2=WARNING, 3=ERROR, 4=CRITICAL); lower values are more verbose
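
When the same command has to be run for several models or targets, it can help to assemble it programmatically. The sketch below is one way to do that, using only the flags listed above; the function name build_auto_opt_command is an illustrative helper, not part of the Olive API.

```python
import shlex


def build_auto_opt_command(model, output_path, device="cpu",
                           provider="CPUExecutionProvider",
                           precision="int4", log_level=1):
    """Return the argument list for an `olive auto-opt` run."""
    return [
        "olive", "auto-opt",
        "--model_name_or_path", model,
        "--output_path", output_path,
        "--device", device,
        "--provider", provider,
        "--use_ort_genai",
        "--precision", precision,
        "--log_level", str(log_level),
    ]


if __name__ == "__main__":
    cmd = build_auto_opt_command(
        "HuggingFaceTB/SmolLM2-135M-Instruct", "models/smolm2-optimized"
    )
    # Print the command for inspection; to execute it, pass the list to
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems when paths contain spaces.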

Example: Converting Qwen3 to ONNX INT4

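Based on the Hugging Face example at lokinfey/Qwen3-8B-ONNX-INT4-CPU, here's how to optimize a Qwen model. The commands below use the small Qwen/Qwen2.5-0.5B-Instruct checkpoint so the walkthrough runs quickly; to convert a Qwen3 model, substitute its Hugging Face model ID: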

Step 1: Download Model (Optional)

To minimize download time, cache only essential files:

# Quote the patterns so the shell does not expand them against local files
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --include "*.json" "*.safetensors" "*.txt"

Step 2: Optimize the Qwen Model

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-onnx-int4 \
  --device cpu \
  --provider CPUExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

Step 3: Test the Optimized Model

Create a simple Python script to test your optimized model:

import onnxruntime_genai as og

# Load the optimized model and its tokenizer
model = og.Model('models/qwen3-onnx-int4')
tokenizer = og.Tokenizer(model)
# A tokenizer stream decodes one token at a time during generation
tokenizer_stream = tokenizer.create_stream()

# Qwen chat template (ChatML format)
chat_template = '<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n'

# Encode the prompt
prompt = "What is machine learning?"
input_tokens = tokenizer.encode(chat_template.format(input=prompt))

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

# Recent onnxruntime-genai releases (0.6 and later) take the prompt through
# append_tokens(); older releases used params.input_ids and compute_logits().
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

print("Generated response:")
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

print()

Output Structure

After optimization, your output directory will contain:

models/qwen3-onnx-int4/
├── model.onnx              # Optimized ONNX model
├── model.onnx.data         # Model weights
├── genai_config.json       # Generation configuration
├── tokenizer.json          # Tokenizer files
├── tokenizer_config.json
└── special_tokens_map.json
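
Before loading the model, it can be useful to confirm that the output directory is complete. The following sketch checks for the files in the listing above; the helper name missing_output_files is illustrative, and the exact set of files may differ between Olive versions and model families.

```python
from pathlib import Path

# File names taken from the directory listing above; adjust if your output differs.
EXPECTED_FILES = [
    "model.onnx",
    "model.onnx.data",
    "genai_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_output_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_output_files("models/qwen3-onnx-int4")
    print("Output looks complete" if not missing else f"Missing: {missing}")
```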

Advanced Usage

Configuration Files

For more complex optimization workflows, you can use JSON configuration files:

{
  "input_model": {
    "type": "PyTorchModel",
    "config": {
      "hf_config": {
        "model_name": "Qwen/Qwen2.5-0.5B-Instruct",
        "task": "text-generation"
      }
    }
  },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "cpu",
            "execution_providers": ["CPUExecutionProvider"]
          }
        ]
      }
    }
  },
  "evaluators": {
    "common_evaluator": {
      "metrics": [
        {
          "name": "latency",
          "type": "latency",
          "sub_types": [{"name": "avg"}]
        }
      ]
    }
  },
  "passes": {
    "conversion": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int4"
      }
    },
    "optimization": {
      "type": "OrtTransformersOptimization",
      "config": {
        "model_type": "gpt2"
      }
    }
  },
  "engine": {
    "search_strategy": {
      "execution_order": "joint",
      "search_algorithm": "tpe"
    },
    "evaluator": "common_evaluator",
    "host": "local_system",
    "target": "local_system",
    "cache_dir": "cache",
    "output_dir": "models/optimized"
  }
}

Run with configuration:

olive run --config config.json
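
In the configuration above, the engine section refers to systems and evaluators by name, so a typo in one of those names is an easy mistake to make. The sketch below is a small pre-flight check for those cross-references. It assumes the layout shown in this section and is not a substitute for Olive's own validation; check_references is an illustrative name.

```python
import json


def check_references(config):
    """Return a list of problems with name references inside the config."""
    problems = []
    systems = config.get("systems", {})
    evaluators = config.get("evaluators", {})
    engine = config.get("engine", {})

    for key in ("host", "target"):
        name = engine.get(key)
        if name is not None and name not in systems:
            problems.append(f"engine.{key} refers to unknown system '{name}'")

    evaluator = engine.get("evaluator")
    if evaluator is not None and evaluator not in evaluators:
        problems.append(
            f"engine.evaluator refers to unknown evaluator '{evaluator}'"
        )

    if not config.get("passes"):
        problems.append("no passes defined")
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    {
      "systems": {"local_system": {"type": "LocalSystem"}},
      "evaluators": {"common_evaluator": {"metrics": []}},
      "passes": {"conversion": {"type": "ModelBuilder"}},
      "engine": {
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system"
      }
    }
    """)
    print(check_references(sample) or "references look consistent")
```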

GPU Optimization

For CUDA GPU optimization:

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-gpu-int4 \
  --device gpu \
  --provider CUDAExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

For DirectML (Windows):

olive auto-opt \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --output_path models/qwen3-directml-int4 \
  --device gpu \
  --provider DmlExecutionProvider \
  --use_ort_genai \
  --precision int4 \
  --log_level 1

Fine-tuning with Olive

Olive also supports fine-tuning models:

olive finetune \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --data_name databricks/databricks-dolly-15k \
  --text_template "### Question: {instruction}\n### Answer: {response}" \
  --max_steps 100 \
  --output_path models/llama-finetuned
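
The --text_template value controls how each dataset record becomes a training string. The sketch below previews that rendering, assuming the placeholders name dataset columns and are filled with Python-style str.format substitution; Olive's own rendering may differ in detail, and the sample record is made up for illustration.

```python
TEXT_TEMPLATE = "### Question: {instruction}\n### Answer: {response}"


def render_example(record, template=TEXT_TEMPLATE):
    """Fill the template with the fields of one dataset record."""
    return template.format(**record)


if __name__ == "__main__":
    # Hypothetical record; real records come from the dataset given by --data_name.
    record = {
        "instruction": "What is ONNX?",
        "response": "An open format for representing machine learning models.",
    }
    print(render_example(record))
```

Previewing a few records this way helps catch placeholder names that do not match the dataset's columns, which would otherwise surface as a KeyError.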

Best Practices

1. Model Selection

  • Start with smaller models for testing (e.g., 0.5B-7B parameters)
  • Ensure your target model architecture is supported by Olive

2. Hardware Considerations

  • Match your optimization target to your deployment hardware
  • Use GPU optimization if you have CUDA-compatible hardware
  • Consider DirectML for Windows machines with DirectX 12 capable GPUs, including integrated graphics

3. Precision Selection

  • INT4: Maximum compression, slight accuracy loss
  • INT8: Good balance of size and accuracy
  • FP16: Minimal accuracy loss, moderate size reduction
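
The size side of this trade-off follows from the number of bits stored per weight. The sketch below gives a back-of-the-envelope estimate of weight storage for a given parameter count. It ignores quantization scales, embeddings kept at higher precision, and file-format overhead, so real files will be somewhat larger.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def estimate_weight_bytes(num_parameters, precision):
    """Approximate bytes needed to store the weights at the given precision."""
    return num_parameters * BITS_PER_WEIGHT[precision] // 8


if __name__ == "__main__":
    num_parameters = 500_000_000  # roughly a 0.5B-parameter model
    for precision in ("fp32", "fp16", "int8", "int4"):
        size_gb = estimate_weight_bytes(num_parameters, precision) / 1e9
        print(f"{precision}: ~{size_gb:.2f} GB")
```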

4. Testing and Validation

  • Always test optimized models with your specific use cases
  • Compare performance metrics (latency, throughput, accuracy)
  • Use representative input data for evaluation
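
For latency comparisons, a small timing harness is often enough. The sketch below times any zero-argument callable, such as a function that runs one generation with the baseline or the optimized model. The warm-up and run counts are arbitrary defaults, and measure_latency is an illustrative helper rather than an Olive API.

```python
import statistics
import time


def measure_latency(fn, warmup=2, runs=10):
    """Time repeated calls to fn and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "avg_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "runs": runs,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call into your model.
    print(measure_latency(lambda: sum(range(100_000))))
```

Run the same harness with the same inputs against both models so the numbers are comparable.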

5. Iterative Optimization

  • Start with auto-optimization for quick results
  • Use configuration files for fine-grained control
  • Experiment with different optimization passes

Troubleshooting

Common Issues

1. Installation Problems

# If you encounter dependency conflicts:
pip install --upgrade pip
pip install "olive-ai[auto-opt]" --force-reinstall

2. CUDA/GPU Issues

# Verify CUDA installation:
nvidia-smi

# Install correct ONNX Runtime GPU package:
pip install onnxruntime-gpu

3. Memory Issues

  • Use smaller batch sizes during optimization
  • Try quantization with higher precision first (int8 instead of int4)
  • Ensure sufficient disk space for model caching

4. Model Loading Errors

  • Verify model path and access permissions
  • Check if model requires trust_remote_code=True
  • Ensure all required model files are downloaded

Getting Help

Olive Recipes Repository

Introduction to Olive Recipes

The microsoft/olive-recipes repository complements the main Olive toolkit by providing a comprehensive collection of ready-to-use optimization recipes for popular AI models. This repository serves as a practical reference for both optimizing publicly available models and creating optimization workflows for proprietary models.

Key Features

  • 100+ Pre-built Recipes: Ready-to-use optimization configurations for popular models
  • Multi-Architecture Support: Covers transformer models, vision models, and multimodal architectures
  • Hardware-Specific Optimizations: Recipes tailored for CPU, GPU, and specialized accelerators
  • Popular Model Families: Includes Phi, Llama, Qwen, Gemma, Mistral, and many more

Supported Model Families

The repository includes optimization recipes for:

Language Models

  • Microsoft Phi: Phi-3-mini, Phi-3.5-mini, Phi-4-mini, Phi-4-reasoning
  • Meta Llama: Llama-2-7b, Llama-3.1-8B, Llama-3.2-1B/3B
  • Alibaba Qwen: Qwen1.5-7B, Qwen2-7B, Qwen2.5 series (0.5B to 14B)
  • Google Gemma: Various Gemma model configurations
  • Mistral AI: Mistral-7B series
  • DeepSeek: R1-Distill series models

Vision and Multimodal Models

  • Stable Diffusion: v1.4, XL-base-1.0
  • CLIP Models: Various CLIP-ViT configurations
  • ResNet: ResNet-50 optimizations
  • Vision Transformers: ViT-base-patch16-224

Specialized Models

  • Whisper: OpenAI Whisper-large-v3
  • BERT: Base and multilingual variants
  • Sentence Transformers: all-MiniLM-L6-v2

Using Olive Recipes

Method 1: Clone Specific Recipe

# Clone the recipes repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes

# Navigate to a specific model recipe
cd microsoft-Phi-4-mini-instruct

# Run the optimization
olive run --config olive_config.json

Method 2: Use Recipe as Template

# Copy a recipe configuration for your model
cp olive-recipes/microsoft-Phi-3-mini-4k-instruct/olive_config.json ./my_config.json

# Modify the configuration for your needs
# Update model paths, optimization parameters, etc.

# Run with your custom configuration
olive run --config my_config.json
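
The "modify the configuration" step can also be scripted. The sketch below copies a recipe configuration and swaps in a different model name. It assumes the recipe follows the layout shown in this document (input_model.config.hf_config.model_name); recipes in the repository may use a different schema, so inspect the file first. The function name derive_config is illustrative.

```python
import json
from pathlib import Path


def derive_config(recipe_path, output_path, model_name):
    """Write a copy of a recipe config that points at a different model."""
    config = json.loads(Path(recipe_path).read_text(encoding="utf-8"))
    # Assumed layout: input_model -> config -> hf_config -> model_name
    config["input_model"]["config"]["hf_config"]["model_name"] = model_name
    Path(output_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, derive_config("olive_config.json", "my_config.json", "Qwen/Qwen2.5-0.5B-Instruct") leaves the original recipe untouched and writes the modified copy next to it.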

Recipe Structure

Each recipe directory typically contains:

model-name/
├── olive_config.json       # Main optimization configuration
├── requirements.txt        # Python dependencies
├── README.md              # Model-specific instructions
├── user_script.py         # Custom preprocessing/evaluation scripts
└── sample_data/           # Sample input data for testing

Example: Using Phi-4-mini Recipe

Let's use the Phi-4-mini recipe as an example:

# Clone the repository
git clone https://github.com/microsoft/olive-recipes.git
cd olive-recipes/microsoft-Phi-4-mini-instruct

# Install dependencies
pip install -r requirements.txt

# Run the optimization
olive run --config olive_config.json

The configuration file typically includes:

{
  "input_model": {
    "type": "PyTorchModel",
    "config": {
      "hf_config": {
        "model_name": "microsoft/Phi-4-mini-instruct",
        "task": "text-generation",
        "trust_remote_code": true
      }
    }
  },
  "systems": {
    "local_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "cpu",
            "execution_providers": ["CPUExecutionProvider"]
          }
        ]
      }
    }
  },
  "passes": {
    "convert": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int4"
      }
    }
  }
}

Customizing Recipes

Modifying Target Hardware

To change the target hardware, update the systems section:

{
  "systems": {
    "gpu_system": {
      "type": "LocalSystem",
      "config": {
        "accelerators": [
          {
            "device": "gpu",
            "execution_providers": ["CUDAExecutionProvider"]
          }
        ]
      }
    }
  }
}
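
Because other sections may refer to a system by name (the engine section in the earlier example uses "local_system" for host and target), it is often simpler to change the device and execution provider of an existing system than to rename it. The sketch below does that on a loaded config dictionary, assuming the systems layout shown above; retarget is an illustrative helper.

```python
import copy


def retarget(config, system_name, device, provider):
    """Return a copy of config whose named system targets device and provider."""
    updated = copy.deepcopy(config)
    accelerators = updated["systems"][system_name]["config"]["accelerators"]
    for accelerator in accelerators:
        accelerator["device"] = device
        accelerator["execution_providers"] = [provider]
    return updated


if __name__ == "__main__":
    cpu_config = {
        "systems": {
            "local_system": {
                "type": "LocalSystem",
                "config": {
                    "accelerators": [
                        {"device": "cpu",
                         "execution_providers": ["CPUExecutionProvider"]}
                    ]
                },
            }
        }
    }
    gpu_config = retarget(cpu_config, "local_system", "gpu", "CUDAExecutionProvider")
    print(gpu_config["systems"]["local_system"]["config"]["accelerators"])
```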

Adjusting Optimization Parameters

Modify the passes section for different optimization levels. For example, the configuration below changes the precision from int4 to int8:

{
  "passes": {
    "convert": {
      "type": "ModelBuilder",
      "config": {
        "precision": "int8",
        "use_ort_genai": true,
        "use_dynamo_exporter": true
      }
    },
    "optimize": {
      "type": "OrtTransformersOptimization",
      "config": {
        "optimization_level": "all"
      }
    }
  }
}

Creating Your Own Recipe

  1. Start with a Similar Model: Find a recipe for a model with similar architecture
  2. Update Model Configuration: Change the model name/path in the configuration
  3. Adjust Parameters: Modify optimization parameters as needed
  4. Test and Validate: Run the optimization and validate results
  5. Contribute Back: Consider contributing your recipe to the repository

Benefits of Using Recipes

1. Proven Configurations

  • Tested optimization settings for specific models
  • Avoids trial-and-error in finding optimal parameters

2. Hardware-Specific Tuning

  • Pre-optimized for different execution providers
  • Ready-to-use configurations for CPU, GPU, and NPU targets

3. Comprehensive Coverage

  • Supports the most popular open-source models
  • Regular updates with new model releases

4. Community Contributions

  • Collaborative development with the AI community
  • Shared knowledge and best practices

Contributing to Olive Recipes

If you've optimized a model not covered in the repository:

  1. Fork the Repository: Create your own fork of olive-recipes
  2. Create Recipe Directory: Add a new directory for your model
  3. Include Configuration: Add olive_config.json and supporting files
  4. Document Usage: Provide clear README with instructions
  5. Submit Pull Request: Contribute back to the community

Performance Benchmarks

Many recipes include performance benchmarks. Reported figures vary by model and hardware, and commonly show:

  • Latency Improvements: Typical 2-6x speedup over baseline
  • Memory Reduction: 50-75% memory usage reduction with quantization
  • Accuracy Retention: 95-99% accuracy preservation
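
When reproducing such figures on your own hardware, the two headline numbers are simple ratios. The sketch below shows the arithmetic; the latency values in the demo are hypothetical placeholders, not measurements. For weights alone and a 16-bit baseline, moving to 8 bits gives a 50% reduction and moving to 4 bits gives 75%, which matches the range quoted above.

```python
def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized model is than the baseline."""
    return baseline_ms / optimized_ms


def reduction_percent(baseline, optimized):
    """Percentage reduction of a quantity (memory, size) relative to the baseline."""
    return 100.0 * (baseline - optimized) / baseline


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, for illustration only.
    print(f"speedup: {speedup(120.0, 30.0):.1f}x")
    # Bits per weight: fp16 baseline versus int8 and int4.
    print(f"fp16 -> int8: {reduction_percent(16, 8):.0f}% smaller")
    print(f"fp16 -> int4: {reduction_percent(16, 4):.0f}% smaller")
```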

Integration with AI Toolkit

The recipes work seamlessly with:

  • VS Code AI Toolkit: Direct integration for model optimization
  • Azure Machine Learning: Cloud-based optimization workflows
  • ONNX Runtime: Optimized inference deployment

Additional Resources

Community Examples
