HFPrune: High-Fidelity Pruning for Large Language Models

March 17, 2026

This repository is the official implementation of the paper "HIGH-FIDELITY PRUNING FOR LARGE LANGUAGE MODELS".

[Paper] [BibTex] [HuggingFace]

Introduction

Large Language Models (LLMs) deliver strong performance but are expensive to deploy. Traditional pruning methods score neuron importance with a one-hot cross-entropy loss, which looks only at the single ground-truth token at each position. HFPrune instead uses the information entropy of the model's prediction distribution as the importance criterion, so the score reflects the whole distribution rather than one label. This preserves more of the model's knowledge and requires neither labels nor a teacher model.
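To make the contrast concrete, here is a minimal, dependency-free sketch of the two quantities. It is an illustration only, not code from this repository: the function names and the toy distributions are assumptions, and it does not show how HFPrune turns entropy into a per-neuron score.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """One-hot cross-entropy: depends only on the probability of the label token."""
    return -math.log(probs[label])

def entropy(probs):
    """Information entropy: depends on every entry of the distribution; no label needed."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Two hypothetical predictions over a 4-token vocabulary. Both give the
# ground-truth token (index 0) the same probability but spread the
# remaining mass differently.
probs_a = [0.7, 0.1, 0.1, 0.1]
probs_b = [0.7, 0.3, 0.0, 0.0]

ce_a, ce_b = cross_entropy(probs_a, 0), cross_entropy(probs_b, 0)
h_a, h_b = entropy(probs_a), entropy(probs_b)

print(f"cross-entropy: a={ce_a:.4f} b={ce_b:.4f}")
print(f"entropy:       a={h_a:.4f} b={h_b:.4f}")
```

The two cross-entropy values are identical because only the label's probability enters the loss, while the entropy values differ because they see how the rest of the probability mass is distributed.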

[Figure: criterion]

Highlights

🏆 Surpasses Dense Model: At 20% pruning on LLaMA-2-7B, achieves 59.0 average score vs 58.3 for the original model

📊 Consistent Improvements: Higher average score than LLM-Pruner, LoRAPrune, and SDMPrune at both 20% and 30% pruning

⚡ Efficient: No teacher model required, label-free importance evaluation

🔧 Practical: Supports both full fine-tuning and LoRA, with sequence packing and Flash Attention 2

Supported Models

  • ✅ LLaMA series (LLaMA-2-7B, LLaMA-3, etc.)
  • ✅ Qwen series (Qwen2.5-7B, etc.)
  • ✅ Any transformer-based LLM with similar architecture

Installation

pip install -r requirements.txt

Quick Start

Complete Workflow Example

Here's a complete workflow from pruning to evaluation:

# Step 1: Prune the model
python prune.py \
    --seed 42 \
    --mlp_ratio 2.8 \
    --origin_path "meta-llama/Llama-2-7b-hf"

# Step 2: Fine-tune with LoRA (recommended for efficiency)
accelerate launch --num_processes=4 \
    --mixed_precision bf16 \
    fintune_lora.py \
    --model "path/to/pruned/model" \
    --lr 4e-4 \
    --lora_r 32 \
    --lora_alpha 64

# Step 3: Evaluate on benchmarks
lm_eval --model hf \
    --model_args pretrained=$MODEL_PATH,trust_remote_code=True \
    --tasks hellaswag,piqa,arc_challenge,arc_easy,openbookqa,boolq,winogrande \
    --batch_size 4

1. Pruning

Prune a LLaMA model using Taylor-based importance estimation:

python prune.py \
    --seed 42 \
    --mlp_ratio 2.8 \
    --origin_path "meta-llama/Llama-2-7b-hf"
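For readers unfamiliar with Taylor-based importance, the sketch below shows the generic first-order form: a neuron's score is the magnitude of the sum of weight times gradient over its weights, and the lowest-scoring neurons are removed. This is an assumption-laden illustration, not the logic of prune.py; the function names, the toy weights, and the toy gradients are all hypothetical, and in practice the gradients would come from backpropagating the chosen loss.

```python
def taylor_neuron_scores(weights, grads):
    """First-order Taylor score per neuron: |sum_j w_ij * g_ij|."""
    return [
        abs(sum(w * g for w, g in zip(w_row, g_row)))
        for w_row, g_row in zip(weights, grads)
    ]

def neurons_to_keep(scores, keep):
    """Indices of the `keep` highest-scoring neurons, in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical 4-neuron layer: each row holds one neuron's input weights,
# and `grads` holds the matching loss gradients (made-up numbers).
weights = [[0.5, -0.2], [0.1, 0.1], [-0.9, 0.4], [0.05, -0.02]]
grads = [[0.3, 0.1], [0.0, 0.2], [0.2, 0.1], [0.10, 0.10]]

scores = taylor_neuron_scores(weights, grads)
kept = neurons_to_keep(scores, keep=2)

print("scores:", [round(s, 3) for s in scores])
print("kept neuron indices:", kept)
```

Structured pruning would then slice the corresponding rows (and the matching columns of the next layer) out of the weight matrices, which is what makes the resulting model smaller without sparse kernels.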

2. Fine-tuning

Full Fine-tuning

accelerate launch --gpu_ids '0,1,2,3' --num_processes=4 --num_machines 1 \
    --mixed_precision bf16 --dynamo_backend no \
    fintune_full.py \
    --model "path/to/pruned/model" \
    --lr 1e-4

LoRA Fine-tuning

accelerate launch --num_processes=4 --num_machines 1 \
    --mixed_precision bf16 --dynamo_backend no \
    fintune_lora.py \
    --model "path/to/pruned/model" \
    --lr 4e-4 \
    --dataset_name "out/cache/Lamini-llama3.2-clean-1024/" \
    --tokenizer_max 1024 \
    --lora_r 32 \
    --lora_alpha 64
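As a rough guide to what `--lora_r 32` and `--lora_alpha 64` imply, the sketch below assumes the standard LoRA formulation, in which the low-rank update is scaled by alpha / r and an adapter on a d_out x d_in matrix adds r * (d_in + d_out) trainable parameters. Whether fintune_lora.py uses exactly this scaling is not shown here, and the 4096 x 4096 layer shape is an illustrative assumption, not the shape of any pruned layer.

```python
def lora_scaling(lora_alpha, lora_r):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r

def lora_trainable_params(d_in, d_out, lora_r):
    """Adapter parameters: A is (r x d_in) and B is (d_out x r)."""
    return lora_r * (d_in + d_out)

# Values taken from the command above.
scaling = lora_scaling(lora_alpha=64, lora_r=32)

# Hypothetical square projection layer, used only to show the ratio.
d_in = d_out = 4096
adapter = lora_trainable_params(d_in, d_out, lora_r=32)
full = d_in * d_out
fraction = adapter / full

print(f"scaling factor: {scaling}")
print(f"adapter params: {adapter} of {full} ({fraction:.2%})")
```

Under these assumptions the adapter trains well under 2% of the parameters of the layer it wraps, which is the reason LoRA is the cheaper of the two fine-tuning options.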

3. Evaluation

Evaluate the pruned/fine-tuned model using lm-evaluation-harness:

lm_eval --model hf \
    --model_args pretrained=$MODEL_PATH,trust_remote_code=True,add_bos_token=True \
    --tasks hellaswag,piqa,arc_challenge,arc_easy,openbookqa,boolq,winogrande \
    --batch_size 4 \
    --output_path "$MODEL_PATH/results"

Project Structure

├── prune.py                 # Main pruning script
├── fintune_full.py          # Full fine-tuning script
├── fintune_lora.py          # LoRA fine-tuning script
├── cfg/
│   └── prune_llama.py       # Training configurations
├── dataset/
│   ├── LaMini_dataset.py    # Dataset loading utilities
│   ├── prompter.py          # Prompt templates
│   └── packing/             # Sequence packing implementation
│       ├── packed_dataset.py
│       └── monkey_patch_packing.py
├── module/
│   ├── trainer.py           # Custom trainer with packing support
│   └── anyprecisionAdamw.py # Mixed-precision AdamW optimizer
├── script/
│   ├── prune.sh             # SLURM script for pruning
│   ├── train_full.sh        # SLURM script for full fine-tuning
│   └── train_lora.sh        # SLURM script for LoRA fine-tuning
└── utils/
    └── __init__.py          # Utility functions

Experimental Results

Highlights

🏆 Surpasses Dense Model: At 20% pruning on LLaMA-2-7B, HFPrune achieves 59.0 average score, exceeding the original model's 58.3

📊 Consistent Improvements: Achieves the highest average score among the compared baselines at both 20% and 30% pruning ratios on LLaMA-2-7B

⚡ Efficient Compression: Achieves 20-30% reduction in both parameters and FLOPs; the average score rises to 59.0 at 20% pruning and is 56.3 at 30%, against 58.3 for the dense model

🎯 Robust Across Benchmarks: Strong performance on diverse tasks including reasoning (ARC), commonsense (PIQA, OBQA, Winogrande), and reading comprehension (BoolQ)

Performance on LLaMA-2-7B

We evaluated HFPrune on multiple zero-shot benchmarks and compared it against state-of-the-art structured pruning methods including LLM-Pruner, LoRAPrune, LoRAP, and SDMPrune.

Key Result: At a 20% pruning ratio, HFPrune's average score (59.0) surpasses that of the original dense model (58.3), suggesting that the method can preserve, and on this average slightly improve, model quality under compression.

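| Pruning Ratio | Method | ARCC | ARCE | BoolQ | Crows | OBQA | PIQA | Race | SiQA | TfQA | Wino | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | Llama-2-7B | 45.1 | 73.8 | 79.4 | 67.4 | 44.2 | 78.7 | 40.1 | 46.5 | 38.8 | 69.3 | 58.3 |
| 20% | LLM-pruner | 40.4 | 70.1 | 80.2 | 61.7 | 38.8 | 75.8 | 39.0 | 47.1 | 43.9 | 64.3 | 56.1 |
| 20% | LoRAPrune | 41.6 | 71.0 | 81.7 | 58.7 | 41.4 | 76.7 | 40.4 | 44.0 | 65.9 | 65.9 | 56.7 |
| 20% | LoRAP | 38.5 | 66.0 | 70.9 | -- | 39.6 | 78.1 | -- | -- | -- | 65.7 | -- |
| 20% | SDMPrune | 43.9 | 72.3 | 81.7 | 62.1 | 42.0 | 77.0 | 41.3 | 48.5 | 44.9 | 68.4 | 58.2 |
| 20% | HFPrune (Ours) | 47.1 | 73.8 | 85.2 | 60.2 | 43.2 | 77.3 | 43.3 | 49.5 | 44.7 | 66.2 | 59.0 |
| 30% | LLM-pruner | 38.0 | 64.8 | 75.6 | 62.3 | 36.4 | 73.4 | 35.7 | 47.3 | 42.3 | 62.9 | 53.9 |
| 30% | LoRAPrune | 38.6 | 65.1 | 74.1 | 61.4 | 37.4 | 72.9 | 39.0 | 46.3 | 44.8 | 66.5 | 54.6 |
| 30% | LoRAP | 35.5 | 60.6 | 69.6 | -- | 37.8 | 76.7 | -- | -- | -- | 63.0 | -- |
| 30% | SDMPrune | 39.6 | 67.9 | 80.4 | 58.5 | 37.2 | 75.2 | 40.0 | 47.8 | 43.7 | 65.4 | 55.6 |
| 30% | HFPrune (Ours) | 41.9 | 70.2 | 82.9 | 58.1 | 40.0 | 75.2 | 39.5 | 48.8 | 44.2 | 62.4 | 56.3 |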
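The Average column appears consistent with an unweighted mean over the ten benchmark columns. The sketch below checks this for the dense model and the 20% HFPrune row using the values from the table above; treating the average as a plain mean is an inference from the numbers, not something the table states, and the variable names are illustrative.

```python
BENCHMARKS = ["ARCC", "ARCE", "BoolQ", "Crows", "OBQA",
              "PIQA", "Race", "SiQA", "TfQA", "Wino"]

def average_score(scores):
    """Unweighted mean over the per-benchmark scores."""
    return sum(scores) / len(scores)

# Rows copied from the table above, in BENCHMARKS order.
dense = [45.1, 73.8, 79.4, 67.4, 44.2, 78.7, 40.1, 46.5, 38.8, 69.3]
hfprune_20 = [47.1, 73.8, 85.2, 60.2, 43.2, 77.3, 43.3, 49.5, 44.7, 66.2]

dense_avg = average_score(dense)
hfprune_avg = average_score(hfprune_20)
gap = hfprune_avg - dense_avg

# Per-benchmark differences show where the pruned model gains or loses.
deltas = {name: round(p - d, 1)
          for name, d, p in zip(BENCHMARKS, dense, hfprune_20)}

print(f"dense mean:   {dense_avg:.2f}")
print(f"HFPrune mean: {hfprune_avg:.2f}")
print("per-benchmark delta:", deltas)
```

The per-benchmark deltas make it easy to see that the higher average comes from gains on some tasks offsetting losses on others, rather than from a uniform improvement.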

Model Weights

We provide the model weights pruned by HFPrune for reproducibility and downstream use.

| Model | Download Link |
|---|---|
| LLaMA series | 🤗 HuggingFace |
| Qwen series | 🤗 HuggingFace |

Citation

@article{zhu2026high,
  title={High-Fidelity Pruning for Large Language Models},
  author={Zhu, Yijun and Wang, Jianxin and Shen, Chengchao},
  journal={arXiv preprint arXiv:2603.08083},
  year={2026}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.