ERNIEKit: ERNIE Development Toolkit Based on PaddlePaddle

December 19, 2025

ERNIEKit is an industrial-grade development toolkit for ERNIE 4.5. It provides training and compression capabilities, including Pre-Training, Supervised Fine-Tuning (SFT), Low-Rank Adaptation (LoRA), Direct Preference Optimization (DPO), Quantization-Aware Training (QAT), and Post-Training Quantization (PTQ). It also includes practical applications and tutorials for working with ERNIE models.

1. Features

  • 🚀 Industrial-grade High-Performance Pre-Training: Optimized ERNIE 4.5 pre-training implementation featuring 3D hybrid parallelism and FP8 mixed-precision acceleration. See Pre-Training for details.

  • 🪙 Low-bit Quantization-aware Fine-tuning: To lower the barriers and costs of fine-tuning and deploying ERNIE 4.5 models, we introduce a novel FP8 Quantization-Aware Training (QAT) methodology that combines low-precision training with optimizer offloading. As a result, the minimum resources for fine-tuning ERNIE-4.5-300B-A47B drop from 96 GPUs to 16 GPUs while maintaining the model's original performance. Unlike prevalent FP8 mixed-precision schemes that rely on online block-wise and tile-wise quantization, models produced by ERNIEKit's QAT solution support efficient offline tensor-wise FP8 quantization for inference, which eliminates the computational overhead of dynamic quantization at inference time. For more information, see FP8-QAT and WINT4/8-LoRA.

  • 👁️ Visual Training & Debugging Interface: Gradio-based WebUI for zero-code fine-tuning, alignment, and inference. See WebUI & CLI for details.

  • 🔌 Multiple Hardware Support: Supports training on NVIDIA GPUs, Kunlunxin XPUs, and Ascend NPUs.

2. Installation

2.1 Prerequisites

| Dependency | Recommended Version |
| --- | --- |
| CUDA | ≥ 12.3 |
| CUDA Driver | ≥ 535.171 |
| nvcc | ≥ 12.3 |
| gcc | ≥ 12.2 |
| Python | 3.10 - 3.12 |
| GPU Architecture | Ampere/Hopper (80GB+ HBM) |
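
The version minimums above can be checked before installing anything. The following is a minimal sketch, not part of ERNIEKit: the helper names `version_ge` and `report` are hypothetical, the version parsing assumes the usual output formats of `nvcc`, `gcc`, and `nvidia-smi`, and only the lower bounds are checked (the Python upper bound of 3.12 is not).

```shell
#!/usr/bin/env bash
# Sketch: compare locally installed tool versions with the recommended minimums.

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND_VERSION MIN_VERSION: prints one status line per dependency.
report() {
    local name="$1" found="$2" min="$3"
    if [ -z "$found" ]; then
        echo "$name: not found (need >= $min)"
    elif version_ge "$found" "$min"; then
        echo "$name: $found OK (>= $min)"
    else
        echo "$name: $found too old (need >= $min)"
    fi
}

# Assumption: each tool prints its version in its usual format.
# A missing tool simply leaves the variable empty.
nvcc_ver=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p') || true
gcc_ver=$(gcc -dumpfullversion 2>/dev/null) || true
driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1) || true
py_ver=$(python -c 'import platform; print(platform.python_version())' 2>/dev/null) || true

report "nvcc" "$nvcc_ver" "12.3"
report "gcc" "$gcc_ver" "12.2"
report "CUDA driver" "$driver_ver" "535.171"
report "Python" "$py_ver" "3.10"
```

Using `sort -V` keeps the comparison correct for multi-digit components, so 12.10 is treated as newer than 12.3.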

2.2 Installing PaddlePaddle

Docker-Based Installation (Recommended)

To ensure environment consistency across different hardware configurations, we recommend using our pre-configured Docker images. These images include CUDA, cuDNN, and NCCL dependencies with PaddlePaddle v3.2 pre-installed:

# Choose based on your CUDA version requirements:
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.2.0-gpu-cuda12.9-cudnn9.9
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.2.0-gpu-cuda12.6-cudnn9.5

Source Code Installation

If you are not using Docker, make sure your environment meets the prerequisites in 2.1. ERNIEKit requires PaddlePaddle v3.2+. See the official PaddlePaddle Installation Guide for details.

Verify installation with:

python -c "import paddle;paddle.utils.run_check()"

Successful installation shows:

PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

2.3 Install ERNIEKit

git clone https://github.com/PaddlePaddle/ERNIE
cd ERNIE
python -m pip install -r requirements/gpu/requirements.txt
python -m pip install -e . # We recommend installing in editable mode

You can also build a Docker image yourself that includes all the dependencies listed in requirements.txt. See build docker for details.

2.4 Install FastDeploy

Please refer to FastDeploy installation.

3. Model Training

3.1 Training Resources

ERNIEKit supports training for the following models. Before starting training, make sure that:

  1. Environment setup is complete
  2. Your hardware meets the minimum resource requirements

| Model | Multimodal Model | Post-Training Method | Seq Length | Min Resources | Recommended Config |
| --- | --- | --- | --- | --- | --- |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA | 8K | 16x80G A/H GPUs | run_sft_lora_8k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA | 32K | 16x80G A/H GPUs | run_sft_lora_32k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 8K | 8x80G A/H GPUs | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 32K | 8x80G A/H GPUs | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-VL-424B-A47B-Base/ERNIE-4.5-VL-424B-A47B | ✅ | SFT-LoRA(wint4/8) | 128K | 16x80G A/H GPUs | run_sft_wint8mix_lora_128k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT | 8K | 96x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT | 32K | 112x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT(FP8) | 8K | 16x80G H GPUs + 2TB CPU RAM | run_sft_fp8_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT(FP8) | 32K | 16x80G H GPUs + 2TB CPU RAM | run_sft_fp8_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT-LoRA(wint4/8) | 8K | 4x80G A/H GPUs | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | SFT-LoRA(wint4/8) | 32K | 8x80G A/H GPUs | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO | 8K | 112x80G A/H GPUs | run_dpo_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO | 32K | 112x80G A/H GPUs | run_dpo_32k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO-LoRA | 8K | 16x80G A/H GPUs | run_dpo_lora_8k.yaml |
| ERNIE-4.5-300B-A47B-Base/ERNIE-4.5-300B-A47B | ❌ | DPO-LoRA | 32K | 16x80G A/H GPUs | run_dpo_lora_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 8K | 8x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 32K | 8x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT | 128K | 8x80G A/H GPUs | run_sft_128k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 8K | 4x80G A/H GPUs | run_sft_lora_8k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 32K | 4x80G A/H GPUs | run_sft_lora_32k.yaml |
| ERNIE-4.5-VL-28B-A3B-Base/ERNIE-4.5-VL-28B-A3B | ✅ | SFT-LoRA | 128K | 4x80G A/H GPUs | run_sft_lora_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 8K | 8x80G A/H GPUs | run_sft_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 32K | 8x80G A/H GPUs | run_sft_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT | 128K | 8x80G A/H GPUs | run_sft_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT-LoRA(wint4/8) | 8K | 1x80G A/H GPU | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | SFT-LoRA(wint4/8) | 32K | 1x80G A/H GPU | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 8K | 8x80G A/H GPUs | run_dpo_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 32K | 8x80G A/H GPUs | run_dpo_32k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO | 128K | 8x80G A/H GPUs | run_dpo_128k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO-LoRA | 8K | 1x80G A/H GPU | run_dpo_lora_8k.yaml |
| ERNIE-4.5-21B-A3B-Base/ERNIE-4.5-21B-A3B | ❌ | DPO-LoRA | 32K | 1x80G A/H GPU | run_dpo_lora_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 8K | 1x80G A/H GPU | run_sft_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 32K | 1x80G A/H GPU | run_sft_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT | 128K | 1x80G A/H GPU | run_sft_128k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT-LoRA(wint4/8) | 8K | 1x80G A/H GPU | run_sft_wint8mix_lora_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | SFT-LoRA(wint4/8) | 32K | 1x80G A/H GPU | run_sft_wint8mix_lora_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 8K | 1x80G A/H GPU | run_dpo_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 32K | 1x80G A/H GPU | run_dpo_32k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO | 128K | 1x80G A/H GPU | run_dpo_128k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO-LoRA | 8K | 1x80G A/H GPU | run_dpo_lora_8k.yaml |
| ERNIE-4.5-0.3B-Base/ERNIE-4.5-0.3B | ❌ | DPO-LoRA | 32K | 1x80G A/H GPU | run_dpo_lora_32k.yaml |

3.2 Data Preparation

ERNIEKit supports both alpaca and erniekit dataset formats. For detailed format specifications, refer to Dataset Guide.

We provide sample datasets in erniekit format for a quick start; see Demo Datasets.

Subsequent sections will demonstrate workflows using these sample datasets.
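
For orientation, the sketch below writes a tiny file in the widely used Alpaca layout (`instruction` / `input` / `output`). It is an illustration under assumptions: the file name `alpaca_demo.json`, the JSON-array layout, and the exact keys ERNIEKit accepts are not confirmed by this document, so check the Dataset Guide before training on your own data.

```shell
# Sketch: create a two-record sample in the common Alpaca layout.
# Assumption: the keys and the JSON-array layout match what the Dataset Guide specifies.
demo_dir=$(mktemp -d)
cat > "$demo_dir/alpaca_demo.json" <<'EOF'
[
  {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning.",
    "output": "Bonjour."
  },
  {
    "instruction": "Name the capital of Japan.",
    "input": "",
    "output": "Tokyo."
  }
]
EOF

# Quick sanity check: count the records by counting "instruction" keys.
grep -c '"instruction"' "$demo_dir/alpaca_demo.json"
```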

3.3 Supervised Fine-tuning

Supervised Fine-Tuning (SFT) adapts pre-trained language models using labeled datasets to enhance task-specific performance and instruction-following capabilities. This parameter-updating method:

  • Requires high-quality annotated data
  • Adjusts all model parameters
  • Is ideal for precision-critical, specialized tasks

For configuration details, see ⚙️ General Training Settings and ⚙️ SFT Settings.

Example 1: Full-Parameter Supervised Fine-tuning

The following example requires a single 80G A/H GPU.

# Download the model from Hugging Face (only needed once)
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K sequence length, SFT
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_8k.yaml
# 32K sequence length, SFT
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_32k.yaml

Example 2: Parameter Efficient Fine-tuning

LoRA (Low-Rank Adaptation) freezes the pre-trained weights and fine-tunes the model by training a small number of added low-rank matrices. This reduces resource requirements while often delivering performance comparable to, or even better than, full-parameter fine-tuning on small datasets.
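
As a rough illustration of why so few parameters are trained, the standard LoRA formulation can be written as follows. The symbols ($W_0$, $A$, $B$, $r$, $\alpha$, $d$, $k$) come from the general LoRA literature and are assumptions here; ERNIEKit's own scaling convention may differ.

```latex
% Forward pass of one adapted linear layer: W_0 is frozen, only A and B are trained.
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters per layer: r(d + k) instead of dk.
% Example with d = k = 4096 and r = 8:
r(d + k) = 8 \times 8192 = 65{,}536,
\qquad dk = 16{,}777{,}216,
\qquad \frac{r(d + k)}{dk} = \frac{1}{256} \approx 0.39\%
```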

Compared to standard SFT, enabling LoRA training simply requires adding fine_tuning: LoRA to the training configuration. For more training parameters, refer to LoRA configurations.

The following example requires a single 80G A/H GPU.

# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, SFT-LoRA
erniekit train examples/configs/ERNIE-4.5-0.3B/sft/run_sft_lora_8k.yaml

Viewing Training Logs

If your script specifies the logging_dir argument, we save VisualDL visualization results to that directory. Otherwise, results are stored at the path specified by output_dir.

Start VisualDL with the following command to view training logs:

visualdl --logdir ${YOUR_LOG_DIR} --host ${HOST_IP} --port ${PORT}
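
The fallback rule described above (use logging_dir when set, otherwise output_dir) can be sketched as a small helper. This is a hypothetical wrapper, not an ERNIEKit command: `resolve_log_dir` and the host, port, and directory values are placeholders chosen for illustration.

```shell
# resolve_log_dir LOGGING_DIR OUTPUT_DIR
# Prints LOGGING_DIR when it is non-empty, otherwise OUTPUT_DIR.
resolve_log_dir() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Placeholder values (assumptions): replace them with the settings from your own config.
YOUR_LOG_DIR=$(resolve_log_dir "" "./output")
HOST_IP="127.0.0.1"
PORT="8040"

# Print the command instead of running it, so the sketch has no side effects.
echo "visualdl --logdir ${YOUR_LOG_DIR} --host ${HOST_IP} --port ${PORT}"
```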

3.4 DPO

Alignment Training is a crucial technique for ensuring the behavior of Large Language Models (LLMs) aligns with human intentions, values, or specific objectives. Its core goal is to address the issue of pretrained models being "powerful but uncontrollable," making model outputs safer, more reliable, and better aligned with human expectations.

Direct Preference Optimization (DPO) is a representative method for achieving human preference alignment. It directly fine-tunes model parameters on annotated preference data. Compared to RLHF, DPO offers higher training stability and lower computational overhead, establishing itself as a mainstream preference alignment approach.
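
For reference, the objective from the original DPO formulation is shown below. The notation is an assumption taken from the general DPO literature rather than from ERNIEKit's code: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Because the loss depends only on log-probabilities of the two responses, no separate reward model or sampling loop is needed during training, which is where the lower overhead relative to RLHF comes from.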

For more training configurations, refer to Training configuration and DPO configuration.

Example 1: Full-Parameter Direct Preference Optimization

The following example requires a single 80G A/H GPU.

# Download the model from Hugging Face (only needed once)
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K sequence length, DPO
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_8k.yaml
# 32K sequence length, DPO
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_32k.yaml

Example 2: Direct Preference Optimization with LoRA

The following example requires a single 80G A/H GPU.

# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
# 8K Sequence Length, DPO-LoRA
erniekit train examples/configs/ERNIE-4.5-0.3B/dpo/run_dpo_lora_8k.yaml

3.5 Weight Merging

After LoRA fine-tuning, merge the LoRA weights into the base model weights. In multi-machine training scenarios:

  • ⚠️ Each machine stores only part of the model parameters (checkpoint).
  • ⚠️ Parameter files must be synchronized across all machines before merging LoRA weights or deploying.

After synchronization, the checkpoint directory should look like this:

path_to_checkpoints/
    ├── added_tokens.json
    ├── config.json
    ├── model-00001-of-00xxx.safetensors
    ├── model-00002-of-00xxx.safetensors
    ├── ...
    ├── model-00xxx-of-00xxx.safetensors
    ├── model.safetensors.index.json
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.model
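
Before merging, it can help to confirm that every shard has actually been copied to the machine doing the merge. The sketch below is a hypothetical helper, not an ERNIEKit tool: `check_shards` assumes that model.safetensors.index.json refers to shards by file names of the form model-NNNNN-of-NNNNN.safetensors, as in the listing above.

```shell
# check_shards CKPT_DIR
# Reports shard files that are named in the index but missing on disk.
# Returns 0 when every referenced shard is present, 1 otherwise.
check_shards() {
    local ckpt_dir="$1" missing=0 shard
    local index="$ckpt_dir/model.safetensors.index.json"
    if [ ! -f "$index" ]; then
        echo "index file not found: $index"
        return 1
    fi
    # Assumption: the index references shards by file name.
    for shard in $(grep -o 'model-[0-9]*-of-[0-9]*\.safetensors' "$index" | sort -u); do
        if [ ! -f "$ckpt_dir/$shard" ]; then
            echo "missing shard: $shard"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "all shards present"
    return "$missing"
}
```

Calling `check_shards path_to_checkpoints` on each machine after synchronization prints either the missing files or a single confirmation line.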

To merge LoRA parameters into the base model after training:

erniekit export examples/configs/ERNIE-4.5-0.3B/run_export.yaml lora=True
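
Conceptually, merging folds the trained low-rank update into each adapted weight matrix. In the standard LoRA formulation this looks as follows; the symbols ($W_0$ for the frozen base weight, $B$ and $A$ for the trained low-rank factors, $r$ for the rank, $\alpha$ for the scaling factor) are assumptions from the general LoRA literature, and the exact scaling used by the export command is not stated in this document.

```latex
W_{\mathrm{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad
W_{\mathrm{merged}}\, x = W_0 x + \frac{\alpha}{r}\, B A x
```

Since $W_{\mathrm{merged}}$ has the same shape as $W_0$, the exported model can be served like an ordinary checkpoint with no extra adapter computation at inference time.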

4. Model Deployment

Trained ERNIEKit weights can be directly deployed using FastDeploy through integrated CLI tools. Below is an example for ERNIE-4.5-0.3B:

# download model from huggingface
huggingface-cli download baidu/ERNIE-4.5-0.3B-Paddle --local-dir baidu/ERNIE-4.5-0.3B-Paddle
erniekit server examples/configs/ERNIE-4.5-0.3B/run_chat.yaml
erniekit chat examples/configs/ERNIE-4.5-0.3B/run_chat.yaml