Intel® LLM Library for PyTorch*

May 12, 2025


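ipex-llm is an LLM acceleration library for running large language models efficiently on Intel XPUs: GPUs (for example, PCs with integrated graphics, Arc discrete GPUs, and Flex and Max data center GPUs), NPUs, and CPUs¹.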

Note

ipex-llm integrates seamlessly with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, and others.
70+ models have been optimized and verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V, and more), with state-of-the-art LLM optimizations, XPU acceleration, and low-bit (FP8/FP6/FP4/INT4) support; see the model table later in this document for the full list.

`ipex-llm` Demo

The demos below show local LLMs running with ipex-llm on an Intel Core Ultra iGPU, an Intel Core Ultra NPU, a single Arc GPU, or two Arc GPUs:

Intel Core Ultra iGPU	Intel Core Ultra NPU	2-Card Intel Arc dGPUs	Intel Xeon + Arc dGPU

Ollama (Mistral-7B, Q4_K)	HuggingFace (Llama3.2-3B, SYM_INT4)	llama.cpp (DeepSeek-R1-Distill-Qwen-32B, Q4_K)	FlashMoE (Qwen3MoE-235B, Q4_K)

`ipex-llm` Performance

The figures below show token generation speed on Intel Core Ultra and Intel Arc GPU¹ (see [2][3][4] for more details).

To run your own ipex-llm performance benchmarks, refer to the Benchmarking guide.

Model Accuracy

Perplexity results for selected models are shown below (measured on the Wikitext dataset with the linked script).

Perplexity	sym_int4	q4_k	fp6	fp8_e5m2	fp8_e4m3	fp16
Llama-2-7B-chat-hf	6.364	6.218	6.092	6.180	6.098	6.096
Mistral-7B-Instruct-v0.2	5.365	5.320	5.270	5.273	5.246	5.244
Baichuan2-7B-chat	6.734	6.727	6.527	6.539	6.488	6.508
Qwen1.5-7B-chat	8.865	8.816	8.557	8.846	8.530	8.607
Llama-3.1-8B-Instruct	6.705	6.566	6.338	6.383	6.325	6.267
gemma-2-9b-it	7.541	7.412	7.269	7.380	7.268	7.270
Baichuan2-13B-Chat	6.313	6.160	6.070	6.145	6.086	6.031
Llama-2-13b-chat-hf	5.449	5.422	5.341	5.384	5.332	5.329
Qwen1.5-14B-Chat	7.529	7.520	7.367	7.504	7.297	7.334
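
To help read the table above, here is a minimal sketch of how perplexity is conventionally defined (the exponential of the mean negative log-likelihood per token) and how far each low-bit format sits from the fp16 baseline. The function names are illustrative and this is not the project's evaluation script; the numbers are copied from the Llama-2-7B-chat-hf row.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_increase(ppl_low_bit, ppl_baseline):
    """Fractional perplexity increase of a low-bit format over a baseline."""
    return (ppl_low_bit - ppl_baseline) / ppl_baseline


# Values copied from the Llama-2-7B-chat-hf row of the perplexity table.
llama2_7b = {
    "sym_int4": 6.364,
    "q4_k": 6.218,
    "fp6": 6.092,
    "fp8_e5m2": 6.180,
    "fp8_e4m3": 6.098,
    "fp16": 6.096,
}

for fmt, ppl in llama2_7b.items():
    delta = relative_increase(ppl, llama2_7b["fp16"])
    print(f"{fmt:>9}: {ppl:.3f} ({delta:+.2%} vs fp16)")
```

Lower perplexity is better, so a small positive percentage means the format costs little accuracy relative to fp16 on this benchmark.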

`ipex-llm` Quickstart

Use

Ollama: run Ollama directly on Intel GPU without manual installation
llama.cpp: run llama.cpp directly on Intel GPU without manual installation
Arc B580: run ipex-llm on Intel Arc B580 GPU (including Ollama, llama.cpp, PyTorch, HuggingFace, etc.)
NPU: run ipex-llm on Intel NPU (supports the Python/C++ and llama.cpp APIs)
PyTorch/HuggingFace: run PyTorch, HuggingFace, LangChain, LlamaIndex, etc. on Intel GPU for Windows and Linux (using the Python interface of ipex-llm)
vLLM: run vLLM with ipex-llm on Intel GPU and CPU
FastChat: run FastChat serving with ipex-llm on Intel GPU and CPU
Serving on multiple Intel GPUs: run ipex-llm inference serving on multiple Intel GPUs using DeepSpeed AutoTP and FastAPI
Text-Generation-WebUI: run oobabooga WebUI with ipex-llm
Axolotl: fine-tune LLMs with Axolotl and ipex-llm
Benchmarking: run performance benchmarks (latency and throughput) on Intel GPU and CPU
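
Once an Ollama server from the Ollama entry above is running, it can be queried over Ollama's standard REST API. The sketch below uses only the Python standard library; it assumes the server listens on Ollama's default address `http://localhost:11434`, and the model name passed by the caller is a placeholder for whatever model you have pulled.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; adjust if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("mistral", "Why is the sky blue?")` would return the completion as a string, assuming a model named `mistral` is available on the server.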

Docker

GPU Inference in C++: run llama.cpp, Ollama, etc. with ipex-llm on Intel GPU
GPU Inference in Python: run HuggingFace transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU
vLLM on GPU: run vLLM inference serving with ipex-llm on Intel GPU
vLLM on CPU: run vLLM inference serving with ipex-llm on Intel CPU
FastChat on GPU: run FastChat inference serving with ipex-llm on Intel GPU
VSCode on GPU: develop and run Python-based ipex-llm applications in VSCode on Intel GPU

Applications

GraphRAG: run Microsoft's GraphRAG with a local LLM using ipex-llm
RAGFlow: run RAGFlow (an open-source RAG engine) with ipex-llm
LangChain-Chatchat: run LangChain-Chatchat (a knowledge-base QA application using a RAG pipeline) with ipex-llm
Coding copilot: run Continue (a coding copilot in VSCode) with ipex-llm
Open WebUI: run Open WebUI with ipex-llm
PrivateGPT: run PrivateGPT with ipex-llm to interact with documents
Dify platform: connect ipex-llm to Dify (an open-source LLM application development platform) to accelerate local LLMs

Install

Windows GPU: install ipex-llm on Windows with Intel GPU
Linux GPU: install ipex-llm on Linux with Intel GPU
For more details, please refer to the full installation guide

Code Examples

Low-bit inference
- INT4 inference: INT4 LLM inference on Intel GPU and CPU
- FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
- INT8 inference: INT8 LLM inference on Intel GPU and CPU
- INT2 inference: INT2 LLM inference on Intel GPU (based on the llama.cpp IQ2 mechanism)
FP16/BF16 inference
- FP16 LLM inference on Intel GPU (optionally with self-speculative decoding optimization)
- BF16 LLM inference on Intel CPU (optionally with self-speculative decoding optimization)
Distributed inference
- Pipeline-parallel inference on Intel GPU
- DeepSpeed AutoTP inference on Intel GPU
Save and load
- Low-bit models: save and load ipex-llm low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
- GGUF: load GGUF models directly into ipex-llm
- AWQ: load AWQ models directly into ipex-llm
- GPTQ: load GPTQ models directly into ipex-llm
Fine-tuning
- LLM fine-tuning on Intel GPU, including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA
- QLoRA fine-tuning on Intel CPU
Integration with community libraries
Tutorials
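
As a rough illustration of the low-bit inference entries above, the sketch below loads a HuggingFace-format checkpoint through the `ipex_llm.transformers` drop-in classes and moves it to an Intel GPU. It assumes `ipex-llm`, `transformers`, and an XPU-enabled PyTorch are installed; the helper names, the default `"sym_int4"` format string, and any model path you pass are illustrative rather than taken from this document, so check the linked examples for the exact options your version supports.

```python
def load_low_bit_model(model_path, low_bit="sym_int4", device="xpu"):
    """Load a HuggingFace-format checkpoint with low-bit weights via ipex-llm."""
    # Imported lazily so this helper can be defined without ipex-llm installed.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit=low_bit,  # assumed format name, e.g. "sym_int4" or "fp8"
        trust_remote_code=True,
    )
    model = model.to(device)  # "xpu" is the PyTorch device name for Intel GPUs
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    return model, tokenizer


def generate_text(model, tokenizer, prompt, device="xpu", max_new_tokens=32):
    """Tokenize the prompt, generate on the target device, and decode the result."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `model, tokenizer = load_low_bit_model("path/to/model")` followed by `generate_text(model, tokenizer, "What is AI?")`, where the path is a placeholder for a local or hub checkpoint.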

API Documentation

FAQ

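Frequently asked questions and answers

Verified Models

70+ models have been optimized and verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, and InternLM; see the table below for the full list.

Model	CPU Example	GPU Example	NPU Example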
LLaMA	link1, link2	link
LLaMA 2	link1, link2	link	Python link, C++ link
LLaMA 3	link	link	Python link, C++ link
LLaMA 3.1	link	link
LLaMA 3.2		link	Python link, C++ link
LLaMA 3.2-Vision		link
ChatGLM	link
ChatGLM2	link	link
ChatGLM3	link	link
GLM-4	link	link
GLM-4V	link	link
GLM-Edge		link	Python link
GLM-Edge-V		link
Mistral	link	link
Mixtral	link	link
Falcon	link	link
MPT	link	link
Dolly-v1	link	link
Dolly-v2	link	link
Replit Code	link	link
RedPajama	link1, link2
Phoenix	link1, link2
StarCoder	link1, link2	link
Baichuan	link	link
Baichuan2	link	link	Python link
InternLM	link	link
InternVL2		link
Qwen	link	link
Qwen1.5	link	link
Qwen2	link	link	Python link, C++ link
Qwen2.5		link	Python link, C++ link
Qwen-VL	link	link
Qwen2-VL		link
Qwen2-Audio		link
Aquila	link	link
Aquila2	link	link
MOSS	link
Whisper	link	link
Phi-1_5	link	link
Flan-t5	link	link
LLaVA	link	link
CodeLlama	link	link
Skywork	link
InternLM-XComposer	link
WizardCoder-Python	link
CodeShell	link
Fuyu	link
Distil-Whisper	link	link
Yi	link	link
BlueLM	link	link
Mamba	link	link
SOLAR	link	link
Phixtral	link	link
InternLM2	link	link
RWKV4		link
RWKV5		link
Bark	link	link
SpeechT5		link
DeepSeek-MoE	link
Ziya-Coding-34B-v1.0	link
Phi-2	link	link
Phi-3	link	link
Phi-3-vision	link	link
Yuan2	link	link
Gemma	link	link
Gemma2		link
DeciLM-7B	link	link
Deepseek	link	link
StableLM	link	link
CodeGemma	link	link
Command-R/cohere	link	link
CodeGeeX2	link	link
MiniCPM	link	link	Python link, C++ link
MiniCPM3		link
MiniCPM-V		link
MiniCPM-V-2	link	link
MiniCPM-Llama3-V-2_5		link	Python link
MiniCPM-V-2_6	link	link	Python link
MiniCPM-o-2_6		link
Janus-Pro		link
Moonlight		link
StableDiffusion		link
Bce-Embedding-Base-V1			Python link
Speech_Paraformer-Large			Python link

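Get Support

If you run into a problem or want to request a new feature, please open a GitHub Issue to let us know
If you find a vulnerability, please report it through GitHub Security Advisory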

¹ Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex
