BaldEagle
Unofficial Implementation of EAGLE Speculative Decoding.
Read our launch announcement: https://frugalgpu.substack.com/p/introducing-baldeagle
Read our guide on how to train your own EAGLE model: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative
Features
Training
- Clean model implementation on top of HuggingFace Transformers that can be replicated for all models
  - Abstracts away attention, causal mask, etc.
- Training loop is implemented using HuggingFace Trainer for more readable and modular code
  - Easily modify learning rate scheduler
  - Abstracts away gradient accumulation, autocasting, checkpointing, logging, resuming, etc.
Data Generation
- Improved data generation scripts that modularize data formatting, tokenization, and loss mask generation
  - Easy to switch to other datasets and tokenizers
  - Ultrachat and ShareGPT implementations already included
- `view_data.py` script that shows the loss mask on the original text for validation purposes (see here for more details)
Benchmarking
Models Trained with BaldEagle
| Target Model | BaldEagle Model |
|---|---|
| Llama-3.1-8B-Instruct | BaldEagle-Llama-3.1-8B-Instruct |
| Qwen-2.5-7B-Instruct | BaldEagle-Qwen-2.5-7B-Instruct |
Getting Started with Training
1. Data Generation
Note: Data generation requires a significant amount of disk space since we save a `sequence_length x hidden_dim` tensor of hidden states for each sample. ShareGPT (68k rows) requires ~650 GB and Ultrachat (200k rows) requires ~2 TB.
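As a rough sanity check on those numbers, here is a back-of-the-envelope estimate. It assumes fp16 hidden states, a hidden dimension of 4096 (as in Llama-3.1-8B), and an illustrative average conversation length; the exact average length is an assumption, not a measured value.

```python
# Rough disk-usage estimate; avg_tokens_per_sample is an illustrative assumption.
hidden_dim = 4096              # Llama-3.1-8B hidden size
bytes_per_value = 2            # fp16
avg_tokens_per_sample = 1_200  # assumed average conversation length
num_samples = 68_000           # e.g. ShareGPT

per_sample_bytes = avg_tokens_per_sample * hidden_dim * bytes_per_value
total_gb = num_samples * per_sample_bytes / 1e9
print(f"~{per_sample_bytes / 1e6:.1f} MB per sample, ~{total_gb:.0f} GB total")  # ~9.8 MB, ~670 GB
```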
- Edit `generate_data.py` for the dataset and model you are using.
  - Section 1 is focused on the dataset and reformatting it if necessary; by default we use Ultrachat, and ShareGPT is available in the commented blocks.
  - Section 2 tokenizes and generates the loss mask based on the tokenizer's chat template.
- In `allocation.py`, set the GPUs you want to use for data generation.
  - This will split the data and call `generate_data.py` on separate slices on different GPUs (see the sketch after this list).
  - Modify the `outdir` variable.
- Call `allocation.py` while specifying the output directory with `--outdir`
  - e.g. `python allocation.py --outdir {output_directory}`
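For reference, here is a minimal sketch of the split-and-dispatch pattern `allocation.py` follows. The flag names (`--start`, `--end`, `--gpu_index`), row count, and output path are illustrative assumptions, not necessarily the repo's exact CLI.

```python
# Hypothetical sketch: split the dataset into contiguous slices and launch
# generate_data.py on each slice on its own GPU. Flag names are illustrative.
import subprocess

gpus = [0, 1, 2, 3]                  # GPUs to use for data generation
total_rows = 68_000                  # e.g. ShareGPT
outdir = "/data/baldeagle/sharegpt"  # same directory passed via --outdir

rows_per_gpu = (total_rows + len(gpus) - 1) // len(gpus)
procs = []
for i, gpu in enumerate(gpus):
    start = i * rows_per_gpu
    end = min(start + rows_per_gpu, total_rows)
    cmd = [
        "python", "generate_data.py",
        "--start", str(start),
        "--end", str(end),
        "--gpu_index", str(gpu),
        "--outdir", outdir,
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()  # block until every slice has finished
```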
2. Training
- In `train.py`, modify the necessary variables (see the sketch after this list)
  - Specify `path` to a local path for the main model you're training for
  - Modify the data paths in the `Load data` section to match your data paths from the previous section
  - Modify any trainer parameters
- Launch the training script on 1 GPU with `python3 train.py`
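As a concrete but hypothetical example of the edits above, the variables might look something like the following; the exact variable names and trainer parameters in `train.py` may differ.

```python
# Illustrative values only; variable names and hyperparameters are assumptions
# and may not match train.py exactly.
from transformers import TrainingArguments

path = "/models/Meta-Llama-3.1-8B-Instruct"   # local path to the target model
train_data_path = "/data/baldeagle/sharegpt"  # output of the data-generation step

training_args = TrainingArguments(
    output_dir="./checkpoints/baldeagle-llama-3.1-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    lr_scheduler_type="cosine",  # easy to swap schedulers via the Trainer
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)
```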
Eagle 3 Status
Training Time Test
Currently, the training-time test from the Eagle 3 paper is being worked on in the `train/train_eagle_ttt.py` and `train/modules/trainer/trainer_eagle_ttt.py` files.
Eagle 2 + Training Time Test Model: https://huggingface.co/NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha
- 11.7% faster and 8.4% greater acceptance rate than the Eagle 2 baseline
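Conceptually, training-time test unrolls the draft head for several steps during training so that it learns to condition on its own predicted features, matching how it drafts at inference time. Below is a minimal sketch of that idea; the `draft_model` and `lm_head` interfaces, the KL loss, and the omitted target/position shifting are simplified assumptions and do not mirror the repo's trainer code.

```python
# Simplified sketch of a training-time-test (TTT) unroll; interfaces and loss
# choice are illustrative assumptions only.
import torch
import torch.nn.functional as F

def ttt_loss(draft_model, lm_head, target_feats, target_logits, ttt_steps=3):
    feats = target_feats                  # step 0 conditions on target-model features
    total = 0.0
    for _ in range(ttt_steps):
        pred_feats = draft_model(feats)   # draft head predicts the next-step features
        pred_logits = lm_head(pred_feats)
        # distill the draft distribution toward the target model's distribution
        total = total + F.kl_div(
            F.log_softmax(pred_logits, dim=-1),
            F.softmax(target_logits, dim=-1),
            reduction="batchmean",
        )
        feats = pred_feats                # later steps see the draft's own features
    return total / ttt_steps
```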
Fused Features
Fused features require new data generation, and EAGLE 3 trains on target model generations rather than the fixed dataset that EAGLE 1 uses. Fused features will require:
- [Experimental] New data generation to extract high, medium, and low features (see the sketch after this list)
  - This will require 3x more storage
  - Currently, `generate_data_fused_features.py` can generate low, mid, and high features
    - This is based on the EAGLE repo's layer selection here
- Faster data generation, since target model generation will be required
  - Ideally we can use a faster inference server like vLLM or SGLang rather than HuggingFace
- Modifications to model and trainer code for feature fusion
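For the experimental data generation, a minimal sketch of pulling low-, mid-, and high-layer hidden states out of a HuggingFace model is shown below. The specific layer indices are placeholders (the repo follows the EAGLE repo's layer selection), and fusing by concatenation is just one possible choice.

```python
# Sketch of extracting low/mid/high-layer features; layer indices are
# illustrative placeholders, not the EAGLE layer selection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

n = model.config.num_hidden_layers
# out.hidden_states holds n + 1 tensors: the embeddings plus each layer's output
low, mid, high = out.hidden_states[2], out.hidden_states[n // 2], out.hidden_states[n - 2]

# Concatenating three feature maps is why storage grows roughly 3x
fused = torch.cat([low, mid, high], dim=-1)
```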
Feel free to open an issue to discuss implementation and results!
Citation
If you found this project useful, please cite it with:
Liu, N. (2025). BaldEagle (Version 1.0.0) [Computer software]. https://github.com/NickL77/BaldEagle/
or
@software{Liu_BaldEagle_2025,
  title = {BaldEagle},
  author = {Liu, Nicholas},
  year = {2025},
  month = {May},
  url = {https://github.com/NickL77/BaldEagle/},
  license = {MIT},
  version = {1.0.0}
}