HASS

March 14, 2025

HASS (Learning Harmonized Representations for Speculative Sampling, ICLR 2025) is a speculative sampling method that improves on EAGLE-2 by harmonizing the draft model's objective and context between its training and decoding stages. This raises both the speedup ratio and the acceptance length.

HASS Performance

DeepSeek-R1

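| Temperature | Method | MT-bench Throughput | MT-bench Speedup | HumanEval Throughput | HumanEval Speedup | GSM8K Throughput | GSM8K Speedup | Alpaca Throughput | Alpaca Speedup | QA Throughput | QA Speedup | Summarization Throughput | Summarization Speedup | Mean Throughput | Mean Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T=0 | DeepSeek Vanilla | 35.6709 | 1.00x | 35.5978 | 1.00x | 35.6736 | 1.00x | 35.6902 | 1.00x | 35.6730 | 1.00x | 34.4039 | 1.00x | 35.4516 | 1.00x |
| T=0 | DeepSeek MTP | 62.2485 | 1.75x | 67.3626 | 1.89x | 66.3277 | 1.86x | 61.0297 | 1.71x | 62.2504 | 1.75x | 55.5675 | 1.62x | 62.4644 | 1.76x |
| T=0 | HASS | 66.2789 | 1.86x (+6.47%) | 73.6944 | 2.07x (+9.40%) | 71.4170 | 2.00x (+7.67%) | 65.8383 | 1.84x (+7.88%) | 67.4284 | 1.89x (+8.32%) | 61.8689 | 1.80x (+11.34%) | 67.7543 | 1.91x (+8.47%) |
| T=1 | DeepSeek Vanilla | 35.6798 | 1.00x | 34.9864 | 1.00x | 35.0866 | 1.00x | 35.1235 | 1.00x | 35.7631 | 1.00x | 35.0862 | 1.00x | 35.2876 | 1.00x |
| T=1 | DeepSeek MTP | 54.5895 | 1.53x | 58.6345 | 1.68x | 58.9265 | 1.68x | 49.0787 | 1.40x | 53.5941 | 1.50x | 46.1118 | 1.31x | 53.4892 | 1.52x |
| T=1 | HASS | 58.8619 | 1.65x (+7.83%) | 62.0878 | 1.77x (+5.89%) | 65.5968 | 1.87x (+11.32%) | 55.0553 | 1.57x (+12.18%) | 56.0470 | 1.57x (+4.58%) | 49.5268 | 1.41x (+7.41%) | 57.8626 | 1.64x (+8.18%) |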
  • DeepSeek-R1 is evaluated under the SGLang framework with batch size 1. DeepSeek Vanilla, DeepSeek MTP, and HASS denote auto-regressive decoding, speculative sampling with the official MTP of DeepSeek-R1, and speculative sampling with the MTP continually trained with HASS, respectively. Throughput is output token throughput, measured in tokens/s. Percentages in parentheses are the throughput gain of HASS over DeepSeek MTP.
  • Here, we continually train the MTP with HASS for 2 epochs on the ShareGPT dataset. This is far less training data than the official MTP used, and its distribution is inconsistent with that of DeepSeek-R1.
  • On 2×8 H800 GPUs, HASS achieves a 1.41x-2.07x speedup over auto-regressive decoding and surpasses DeepSeek-R1's official MTP by 8.47% / 8.18% in mean throughput at temperature 0 / 1.
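The speedup and relative-gain figures above can be reproduced from the raw throughput numbers. The sketch below is illustrative only: the helper names `speedup` and `relative_gain` are hypothetical and not part of the HASS codebase, and the values are copied from the T=0 rows of the DeepSeek-R1 table.

```python
# Throughput values (tokens/s) copied from the T=0 rows of the DeepSeek-R1 table.
vanilla = {"MT-bench": 35.6709, "Mean": 35.4516}
mtp = {"MT-bench": 62.2485, "Mean": 62.4644}
hass = {"MT-bench": 66.2789, "Mean": 67.7543}


def speedup(throughput: float, baseline: float) -> float:
    """Speedup ratio of a method over auto-regressive decoding."""
    return throughput / baseline


def relative_gain(throughput: float, reference: float) -> float:
    """Throughput gain over a reference method, in percent."""
    return (throughput / reference - 1.0) * 100.0


for name in vanilla:
    s = speedup(hass[name], vanilla[name])
    g = relative_gain(hass[name], mtp[name])
    print(f"{name}: {s:.2f}x ({g:+.2f}% vs. MTP)")
# MT-bench: 1.86x (+6.47% vs. MTP)
# Mean: 1.91x (+8.47% vs. MTP)
```

Note that the percentages compare throughput against DeepSeek MTP, while the speedup ratios compare against DeepSeek Vanilla.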

DeepSeek-R1-Distill-Qwen-32B

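| Temperature | Method | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | QA Speedup | QA τ | Summarization Speedup | Summarization τ | Mean Speedup | Mean τ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T=0 | HASS | 3.4115x | 4.1052 | 3.2300x | 4.2911 | 3.4758x | 4.9809 | 3.1918x | 3.7738 | 2.7392x | 3.4779 | 2.9798x | 3.7536 | 3.1714x | 4.0638 |
| T=1 | HASS | 2.8712x | 3.7402 | 2.9311x | 3.8658 | 3.4341x | 4.7770 | 2.7634x | 3.5367 | 2.5586x | 3.2654 | 2.8616x | 3.4244 | 2.9033x | 3.7683 |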
  • τ denotes the acceptance length.
  • On 2 H800 GPUs, HASS achieves a 2.56x-3.48x speedup over auto-regressive decoding.
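This README does not spell out how τ is computed. A minimal sketch, assuming the convention used in the EAGLE line of work (the average number of tokens committed per draft-and-verify cycle of the target model), might look as follows. The function name and the trace are hypothetical.

```python
def acceptance_length(tokens_per_cycle: list[int]) -> float:
    """Average number of tokens committed per draft-and-verify cycle.

    Assumption: each entry counts every token added to the output in one
    cycle, i.e. one forward pass of the target model.
    """
    if not tokens_per_cycle:
        raise ValueError("need at least one cycle")
    return sum(tokens_per_cycle) / len(tokens_per_cycle)


# Hypothetical trace: tokens committed in each of five cycles.
trace = [5, 3, 4, 6, 2]
print(acceptance_length(trace))  # 4.0
```

Under this reading, a τ of about 4 means that each target-model forward pass yields roughly four output tokens instead of the single token of auto-regressive decoding.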

LLaMA-series models

  • On H800 GPUs, HASS achieves a 2.81x-4.05x speedup over auto-regressive decoding, surpassing EAGLE-2 by 8%-20%.
  • Please refer to Tables 1&2 in the paper.

HASS Weights

| Base Model | HASS Weights | Base Model | HASS Weights |
|---|---|---|---|
| DeepSeek-R1 | HArmonizedSS/HASS-DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B | HArmonizedSS/HASS-DeepSeek-R1-Distill-Qwen-32B |
| LLaMA3-Instruct-8B | HArmonizedSS/HASS-LLaMA3-Instruct-8B | LLaMA3-Instruct-70B | HArmonizedSS/HASS-LLaMA3-Instruct-70B |
| LLaMA2-Chat-7B | HArmonizedSS/HASS-LLaMA2-Chat-7B | LLaMA2-Chat-13B | HArmonizedSS/HASS-LLaMA2-Chat-13B |
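One possible way to fetch these weights is sketched below. It assumes the identifiers above are Hugging Face Hub repository IDs and that the `huggingface_hub` package is installed. The `HASS_WEIGHTS` mapping and the `download_hass_weights` helper are hypothetical names introduced for illustration; they are not part of this repository.

```python
from typing import Optional

# Base model -> HASS weights, copied from the table above.
# Assumption: these are Hugging Face Hub repository IDs.
HASS_WEIGHTS = {
    "DeepSeek-R1": "HArmonizedSS/HASS-DeepSeek-R1",
    "DeepSeek-R1-Distill-Qwen-32B": "HArmonizedSS/HASS-DeepSeek-R1-Distill-Qwen-32B",
    "LLaMA3-Instruct-8B": "HArmonizedSS/HASS-LLaMA3-Instruct-8B",
    "LLaMA3-Instruct-70B": "HArmonizedSS/HASS-LLaMA3-Instruct-70B",
    "LLaMA2-Chat-7B": "HArmonizedSS/HASS-LLaMA2-Chat-7B",
    "LLaMA2-Chat-13B": "HArmonizedSS/HASS-LLaMA2-Chat-13B",
}


def download_hass_weights(base_model: str, local_dir: Optional[str] = None) -> str:
    """Download the HASS draft weights for a base model; return the local path."""
    # Imported lazily so the mapping can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=HASS_WEIGHTS[base_model], local_dir=local_dir)
```

For example, `download_hass_weights("LLaMA3-Instruct-8B")` would place the files in the local Hugging Face cache and return their directory.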

Reference

@inproceedings{zhang2025learning,
  title={Learning Harmonized Representations for Speculative Sampling},
  author={Zhang, Lefan and Wang, Xiaodan and Huang, Yanhua and Xu, Ruiwen},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

Acknowledgements

This project has been influenced by many excellent projects in the LLM community, such as EAGLE.