BaldEagle
Unofficial Implementation of EAGLE Speculative Decoding.
Read our launch announcement: https://frugalgpu.substack.com/p/introducing-baldeagle
Read our guide on how to train your own EAGLE model: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative
Features
Training
- Clean model implementation on top of HuggingFace Transformers that can be replicated for all models
  - Abstracts away attention, causal mask, etc.
- Training loop is implemented using HuggingFace Trainer for more readable and modular code
  - Easily modify learning rate scheduler
  - Abstracts away gradient accumulation, autocasting, checkpointing, logging, resuming, etc.
Data Generation
- Improved data generation scripts that modularize data formatting, tokenization, and loss mask generation
  - Easy to switch to other datasets and tokenizers
  - Ultrachat and ShareGPT implementations already included
- `view_data.py` script that shows the loss mask on the original text for validation purposes (see here for more details)
Benchmarking
Models Trained with BaldEagle
| Target Model | BaldEagle Model |
|---|---|
| Llama-3.1-8B-Instruct | BaldEagle-Llama-3.1-8B-Instruct |
| Qwen-2.5-7B-Instruct | BaldEagle-Qwen-2.5-7B-Instruct |
Getting Started with Training
1. Data Generation
Note: Data generation requires a significant amount of disk space since we save a `sequence_length x hidden_dim` tensor of hidden states for each sample. ShareGPT (68k rows) requires ~650 GB and Ultrachat (200k rows) requires ~2 TB.
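As a rough sanity check on those numbers, here is a back-of-the-envelope estimate. It assumes fp16 hidden states, a hidden dimension of 4096 (as in Llama-3.1-8B), and an illustrative average conversation length; the exact average length is an assumption, not a measured value.

```python
# Rough disk-usage estimate; avg_tokens_per_sample is an illustrative assumption.
hidden_dim = 4096              # Llama-3.1-8B hidden size
bytes_per_value = 2            # fp16
avg_tokens_per_sample = 1_200  # assumed average conversation length
num_samples = 68_000           # e.g. ShareGPT

per_sample_bytes = avg_tokens_per_sample * hidden_dim * bytes_per_value
total_gb = num_samples * per_sample_bytes / 1e9
print(f"~{per_sample_bytes / 1e6:.1f} MB per sample, ~{total_gb:.0f} GB total")  # ~9.8 MB, ~670 GB
```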
- Edit `generate_data.py` for the dataset and model you are using.
  - Section 1 is focused on the dataset and reformatting it if necessary; by default we use Ultrachat, and ShareGPT is available in the commented blocks.
  - Section 2 tokenizes and generates the loss mask based on the tokenizer's chat template.
- In `allocation.py`, set the GPUs you want to use for data generation.
  - This will split the data and call `generate_data.py` on separate slices on different GPUs (see the sketch after this list).
  - Modify the `outdir` variable.
- Call `allocation.py` while specifying the output directory with `--outdir`
  - e.g. `python allocation.py --outdir {output_directory}`
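For reference, here is a minimal sketch of the split-and-dispatch pattern `allocation.py` follows. The flag names (`--start`, `--end`, `--gpu_index`), row count, and output path are illustrative assumptions, not necessarily the repo's exact CLI.

```python
# Hypothetical sketch: split the dataset into contiguous slices and launch
# generate_data.py on each slice on its own GPU. Flag names are illustrative.
import subprocess

gpus = [0, 1, 2, 3]                  # GPUs to use for data generation
total_rows = 68_000                  # e.g. ShareGPT
outdir = "/data/baldeagle/sharegpt"  # same directory passed via --outdir

rows_per_gpu = (total_rows + len(gpus) - 1) // len(gpus)
procs = []
for i, gpu in enumerate(gpus):
    start = i * rows_per_gpu
    end = min(start + rows_per_gpu, total_rows)
    cmd = [
        "python", "generate_data.py",
        "--start", str(start),
        "--end", str(end),
        "--gpu_index", str(gpu),
        "--outdir", outdir,
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()  # block until every slice has finished
```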
2. Training
- In `train.py`, modify the necessary variables (see the sketch after this list)
  - Specify `path` to a local path for the main model you're training for
  - Modify the data paths in the `Load data` section to match your data paths from the previous section
  - Modify any trainer parameters
- Launch the training script on 1 GPU with `python3 train.py`
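As a concrete but hypothetical example of the edits above, the variables might look something like the following; the exact variable names and trainer parameters in `train.py` may differ.

```python
# Illustrative values only; variable names and hyperparameters are assumptions
# and may not match train.py exactly.
from transformers import TrainingArguments

path = "/models/Meta-Llama-3.1-8B-Instruct"   # local path to the target model
train_data_path = "/data/baldeagle/sharegpt"  # output of the data-generation step

training_args = TrainingArguments(
    output_dir="./checkpoints/baldeagle-llama-3.1-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    lr_scheduler_type="cosine",  # easy to swap schedulers via the Trainer
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)
```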
Eagle 3 Status
Training Time Test
Currently, the training-time test from the Eagle 3 paper is being worked on in the `train/train_eagle_ttt.py` and `train/modules/trainer/trainer_eagle_ttt.py` files.
Eagle 2 + Training Time Test Model: https://huggingface.co/NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha
- 11.7% faster and 8.4% greater acceptance rate than the Eagle 2 baseline
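Conceptually, training-time test unrolls the draft head for several steps during training so that it learns to condition on its own predicted features, matching how it drafts at inference time. Below is a minimal sketch of that idea; the `draft_model` and `lm_head` interfaces, the KL loss, and the omitted target/position shifting are simplified assumptions and do not mirror the repo's trainer code.

```python
# Simplified sketch of a training-time-test (TTT) unroll; interfaces and loss
# choice are illustrative assumptions only.
import torch
import torch.nn.functional as F

def ttt_loss(draft_model, lm_head, target_feats, target_logits, ttt_steps=3):
    feats = target_feats                  # step 0 conditions on target-model features
    total = 0.0
    for _ in range(ttt_steps):
        pred_feats = draft_model(feats)   # draft head predicts the next-step features
        pred_logits = lm_head(pred_feats)
        # distill the draft distribution toward the target model's distribution
        total = total + F.kl_div(
            F.log_softmax(pred_logits, dim=-1),
            F.softmax(target_logits, dim=-1),
            reduction="batchmean",
        )
        feats = pred_feats                # later steps see the draft's own features
    return total / ttt_steps
```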
Fused Features
Fused features require new data generation, and EAGLE 3 trains on target model generations rather than the fixed dataset that EAGLE 1 uses. Fused features will require:
- [Experimental] New data generation to extract high, medium, and low features (see the sketch after this list)
  - This will require 3x more storage
  - Currently, `generate_data_fused_features.py` can generate low, mid, and high features
    - This is based on the EAGLE repo's layer selection here
- Faster data generation, since target model generation will be required
  - Ideally we can use a faster inference server like vLLM or SGLang rather than HuggingFace
- Modifications to model and trainer code for feature fusion
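For the experimental data generation, a minimal sketch of pulling low-, mid-, and high-layer hidden states out of a HuggingFace model is shown below. The specific layer indices are placeholders (the repo follows the EAGLE repo's layer selection), and fusing by concatenation is just one possible choice.

```python
# Sketch of extracting low/mid/high-layer features; layer indices are
# illustrative placeholders, not the EAGLE layer selection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

n = model.config.num_hidden_layers
# out.hidden_states holds n + 1 tensors: the embeddings plus each layer's output
low, mid, high = out.hidden_states[2], out.hidden_states[n // 2], out.hidden_states[n - 2]

# Concatenating three feature maps is why storage grows roughly 3x
fused = torch.cat([low, mid, high], dim=-1)
```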
Feel free to open an issue to discuss implementation and results!
Citation
If you found this project useful, please cite it with:
Liu, N. (2025). BaldEagle (Version 1.0.0) [Computer software]. https://github.com/NickL77/BaldEagle/
or
@software{Liu_BaldEagle_2025,
  title = {BaldEagle},
  author = {Liu, Nicholas},
  year = {2025},
  month = {May},
  url = {https://github.com/NickL77/BaldEagle/},
  license = {MIT},
  version = {1.0.0}
}