DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Poster
Introduction
DEL is a plug-and-play self-speculative decoding algorithm that dynamically selects both the exit layer and speculation length during LLM inference to maximize throughput. Unlike prior methods that rely on fixed hyperparameters or offline tuning, DEL uses real-time token acceptance signals to adaptively configure the draft model for each input.
DEL builds on LayerSkip, a self-speculative framework that reuses the early layers of the target model to generate draft tokens. DEL enhances this method by introducing:
- Token-per-Layer (TPL): A metric that balances acceptance rate and computation cost to guide exit layer selection.
- Shadow Token Analysis: Efficient use of cached hidden states to estimate acceptance probabilities for all exit layers simultaneously.
- Dynamic Draft Exiting: A confidence-driven mechanism that determines when to stop drafting tokens, even mid-round.
These components allow DEL to perform on-the-fly optimization of speculative decoding parameters for each prompt and context window.
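This README does not spell out the TPL formula, so the following is only a minimal sketch of how such a metric could trade acceptance rate against compute. It assumes each draft token is accepted independently with a fixed probability, and that one round costs `spec_len * exit_layer` layer evaluations for drafting plus one full-depth pass for verification. All function names are hypothetical; the logic in `DEL.py` may differ.

```python
def expected_tokens_per_round(accept_rate: float, spec_len: int) -> float:
    """Expected tokens emitted by one draft-then-verify round.

    Assumption: draft tokens are accepted independently with probability
    `accept_rate`, and verification always contributes one extra token.
    """
    if accept_rate >= 1.0:
        return float(spec_len + 1)
    return (1.0 - accept_rate ** (spec_len + 1)) / (1.0 - accept_rate)


def layers_per_round(exit_layer: int, spec_len: int, num_layers: int) -> int:
    """Assumed cost model: `spec_len` drafts through `exit_layer` layers,
    plus a single verification pass through all `num_layers` layers."""
    return spec_len * exit_layer + num_layers


def token_per_layer(accept_rate: float, exit_layer: int,
                    spec_len: int, num_layers: int) -> float:
    """Illustrative TPL: expected tokens divided by layer evaluations."""
    tokens = expected_tokens_per_round(accept_rate, spec_len)
    return tokens / layers_per_round(exit_layer, spec_len, num_layers)


def best_config(accept_rates: dict[int, float], num_layers: int,
                max_spec_len: int) -> tuple[int, int, float]:
    """Search every (exit_layer, spec_len) pair and keep the highest TPL.

    `accept_rates` maps a candidate exit layer to its estimated
    acceptance rate.
    """
    best = None
    for exit_layer, rate in accept_rates.items():
        for spec_len in range(1, max_spec_len + 1):
            tpl = token_per_layer(rate, exit_layer, spec_len, num_layers)
            if best is None or tpl > best[2]:
                best = (exit_layer, spec_len, tpl)
    return best


if __name__ == "__main__":
    # Made-up acceptance rates for three candidate exit layers.
    rates = {4: 0.55, 8: 0.75, 16: 0.90}
    print(best_config(rates, num_layers=32, max_spec_len=8))
```

Under this cost model, plain autoregressive decoding scores `1 / num_layers`, so any configuration with a higher TPL is expected to be faster per layer evaluated.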

Installation
# Setup Conda environment
conda create --name del python=3.10
conda activate del
# Install dependencies
pip install -r requirements.txt
Reproduce Main Results
Run the full benchmark suite using:
bash run_benchmarks.sh
This script evaluates DEL and several baselines (self_speculative, FSM_speculative, DV_speculative, autoregressive) across 7 datasets and multiple LayerSkip LLaMA variants.
- Logs will be saved under ./logs/
- You can modify run_benchmarks.sh to adjust num_samples, max_steps, or target models.
Project Structure
.
├── benchmark.py                 # Main benchmarking entry point
├── arguments.py                 # Argument parser for benchmarking and generation
├── generate.py                  # Generation script for non-benchmarking use
├── eval.py                      # Evaluation and scoring utilities
├── correctness.py               # Unit-level checks for speculative correctness
├── sweep.py                     # Hyperparameter sweep support
├── utils.py                     # Miscellaneous utilities
├── run_benchmarks.sh            # Shell script to reproduce all benchmarks
├── requirements.txt
├── README.md
└── self_speculation/            # All generation strategies implemented here
    ├── DEL.py                            # Dynamic Exit Layer (DEL) core logic
    ├── DEL_speculation_generator.py      # DEL-based generation
    ├── DV_speculation_generator.py       # Draft&Verify speculative decoding baseline
    ├── DELE_speculation_generator.py     # DEL without dynamic draft exiting variant
    ├── FSM_speculation_generator.py      # FSM speculation baseline
    ├── autoregressive_generator.py       # Vanilla greedy decoding
    ├── self_speculation_generator.py     # Standard self speculative decoding
    ├── generator_base.py
    ├── llama_model_utils.py
    └── speculative_streamer.py
Datasets and Models
We benchmark DEL using:
Models
- facebook/layerskip-llama3.2-1B
- facebook/layerskip-llama3-8B
- facebook/layerskip-llama2-[7B,13B,70B]
Datasets
- gsm8k, aqua_rat (math reasoning)
- cnn_dm_lm, cnn_dm_summarization, xsum_summarization (long-form/text)
- wmt14_de_en (translation)
- human_eval (code generation)
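For scripting against the models and datasets listed above, the names can be collected in one place. The snippet below is a sketch only: `LAYERSKIP_MODELS`, `BENCHMARK_DATASETS`, and `load_model` are hypothetical names, not part of this repository, and loading through the Hugging Face `transformers` Auto classes (with access to the checkpoints on the Hub) is an assumption about how one might fetch the weights.

```python
# Model identifiers from the list above, with the llama2 size
# shorthand [7B,13B,70B] expanded into explicit names.
LAYERSKIP_MODELS = [
    "facebook/layerskip-llama3.2-1B",
    "facebook/layerskip-llama3-8B",
    *[f"facebook/layerskip-llama2-{size}" for size in ("7B", "13B", "70B")],
]

# Dataset names grouped by the task labels used in this README.
BENCHMARK_DATASETS = {
    "math reasoning": ["gsm8k", "aqua_rat"],
    "long-form/text": ["cnn_dm_lm", "cnn_dm_summarization", "xsum_summarization"],
    "translation": ["wmt14_de_en"],
    "code generation": ["human_eval"],
}


def load_model(model_id: str):
    """Load a tokenizer and causal LM for `model_id`.

    `transformers` is imported lazily so the name lists above can be
    used without the dependency installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


if __name__ == "__main__":
    total = sum(len(names) for names in BENCHMARK_DATASETS.values())
    print(f"{len(LAYERSKIP_MODELS)} models, {total} datasets")
```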
Key Features
- DEL: Dynamic Exit Layer
  A plug-and-play module for LayerSkip that dynamically selects the exit layer and speculation length per generation round based on real-time context.
- Context-Aware Adaptation
  Tracks token-level acceptance rates across layers and uses a confidence-aware thresholding mechanism to adapt speculation dynamically.
- Token-per-Layer (TPL) Optimization
  Introduces a novel efficiency metric, TPL, to guide the optimal choice of exit layer and speculation length with negligible overhead.
- Shadow Token Analysis
  Computes expected acceptance rates using cached hidden states and shadow tokens, without any additional forward passes through the model.
- Streaming & Scalability
  Efficient across diverse tasks (reasoning, summarization, code) and scales from 1B to 70B LLMs, with up to 2.84× speedup over greedy decoding.
- Fully Compatible with LayerSkip
  Seamlessly integrates with early-exit models without any retraining or architectural changes.
- Lightweight & Practical
  Adds minimal runtime and memory overhead (~1–2%), making it suitable for real-world deployment.
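The acceptance tracking and confidence-based draft exiting described above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the class and function names are hypothetical, the exponential decay and the fixed confidence threshold are assumptions, and the shadow tokens are taken as given (one predicted token per candidate exit layer) rather than computed from cached hidden states.

```python
from typing import Callable


class AcceptanceTracker:
    """Decayed running estimate of the acceptance rate per exit layer."""

    def __init__(self, exit_layers: list[int], decay: float = 0.9,
                 prior: float = 0.5):
        self.decay = decay
        # Start every layer from the same prior with unit weight.
        self.matched = {layer: prior for layer in exit_layers}
        self.total = {layer: 1.0 for layer in exit_layers}

    def update(self, shadow_tokens: dict[int, int], verified_token: int) -> None:
        """Record whether each layer's shadow token equals the verified token.

        `shadow_tokens` maps an exit layer to the token it would have drafted.
        """
        for layer, token in shadow_tokens.items():
            hit = 1.0 if token == verified_token else 0.0
            self.matched[layer] = self.decay * self.matched[layer] + hit
            self.total[layer] = self.decay * self.total[layer] + 1.0

    def rate(self, layer: int) -> float:
        return self.matched[layer] / self.total[layer]


def draft_with_early_stop(draft_step: Callable[[], tuple[int, float]],
                          max_spec_len: int, threshold: float) -> list[int]:
    """Draft up to `max_spec_len` tokens, stopping at the first low-confidence one.

    `draft_step` returns (token, confidence). The low-confidence token is
    dropped here, which is one possible design choice.
    """
    tokens = []
    for _ in range(max_spec_len):
        token, confidence = draft_step()
        if confidence < threshold:
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    tracker = AcceptanceTracker([4, 8])
    tracker.update({4: 7, 8: 9}, verified_token=9)
    print({layer: round(tracker.rate(layer), 3) for layer in (4, 8)})

    steps = iter([(11, 0.9), (12, 0.8), (13, 0.3), (14, 0.95)])
    print(draft_with_early_stop(lambda: next(steps), 4, threshold=0.5))
```

The decay keeps the estimates responsive to the current context: older matches fade, so a layer that stops agreeing with the verified tokens loses its high rate within a few rounds.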
Cite Us
If you use DEL in your work, please cite:
@inproceedings{entezari2025del,
title={DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding},
author={Entezari Zarch, Hossein and Gao, Lei and Jiang, Chaoyi and Annavaram, Murali},
booktitle={Proceedings of the Conference on Language Modeling (COLM) 2025},
year={2025}
}
Acknowledgements
- LayerSkip models provided by Meta AI.