Running Quantized LLMs on ONNX Runtime
May 22, 2024 ยท View on GitHub
To run LLM efficiently and quickly on CPUs, we provide a method for running quantized LLMs using the optimum-intel. We recommend checking out our full notebook with all the details, including the quantization and pipeline construction.
Installation
Run the following command to install our dependencies:
pip install -e .[intel]
For more information regarding the installation process, we recommend checking out the optimum-intel repository.
LLM Quantization
To quantize a model, we first export it the model to the ONNX format, and then use a quantizer to save the quantized version of our model:
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
import os
model_name = 'my_llm_model_name'
converted_model_path = "my/local/path"
model = ORTModelForCausalLM.from_pretrained(model_name, export=True)
model.save_pretrained(converted_model_path)
model = ORTModelForCausalLM.from_pretrained(converted_model_path, session_options=session_options)
qconfig = AutoQuantizationConfig.avx2(is_static=False)
quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir=os.path.join(converted_model_path, 'quantized'), quantization_config=qconfig)
Loading the Quantized Model
Now that our model is quantized, we can load it in our framework, by using the ORTGenerator generator.
generator = ORTGenerator(
model="my/local/path/quantized",
task="text-generation",
generation_kwargs={
"max_new_tokens": 100,
}
)
We also included some additional parameters for the ort.SessionOptions for loading the model:
- graph_optimization_level: str
- intra_op_num_threads: int
- session_config_entries: dict
For example:
generator = ORTGenerator(
model="my/local/path/quantized",
task="text-generation",
generation_kwargs={
"max_new_tokens": 100,
}
huggingface_pipeline_kwargs=dict(
graph_optimization_level="ORT_ENABLE_ALL",
intra_op_num_threads=6,
session_config_entries={
"session.intra_op_thread_affinities": "3,4;5,6;7,8;9,10;11,12"
}
)
)
We also include an example YAML file of deploying the ORT model as an API here.
For more information regarding the different settings, we refer to the ort.SessionOptions documentation.