Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
May 28, 2026
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
This repository provides end-to-end guidance and scripts for environment setup, data preparation, training, and inference with the Eagle VLM.
Highlight
Strong Results Across The Board
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
Key Innovations
- Information-First Sampling:
  - Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
  - Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context length constraints.
- Progressive Mixed Post-Training:
  - Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
  - Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
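The two information-first sampling ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper names `choose_tile_grid` and `fit_frame_budget`, the 512-pixel tile size, the 12-tile limit, the 0.6 area cap, and the 256 tokens per frame are all hypothetical values chosen for the example.

```python
from itertools import product


def choose_tile_grid(width, height, tile=512, max_tiles=12, area_cap=0.6):
    """IAP-style sketch: pick a (cols, rows) grid that retains area and aspect ratio.

    All default values are illustrative assumptions, not settings from the repo.
    """
    img_area = width * height
    img_ratio = width / height
    best, best_score = (1, 1), -1.0
    for cols, rows in product(range(1, max_tiles + 1), repeat=2):
        if cols * rows > max_tiles:
            continue  # grid would exceed the tile budget
        grid_area = cols * rows * tile * tile
        grid_ratio = cols / rows
        # Reward covering the original area, capped so upscaling is not over-rewarded.
        area_term = min(grid_area / img_area, area_cap)
        # Reward grids whose aspect ratio is close to the image's (1.0 = identical).
        ratio_term = min(grid_ratio / img_ratio, img_ratio / grid_ratio)
        score = area_term * ratio_term
        if score > best_score:  # strict '>' keeps the smaller grid on ties
            best, best_score = (cols, rows), score
    return best


def fit_frame_budget(text_tokens, context_len, tokens_per_frame, max_frames=512):
    """ADS-style sketch: keep all text, then fit as many frames as the context allows."""
    remaining = context_len - text_tokens
    if remaining < 0:
        raise ValueError("text alone exceeds the context length")
    return min(max_frames, remaining // tokens_per_frame)


print(choose_tile_grid(1024, 512))         # wide image
print(fit_frame_budget(2000, 32768, 256))  # frames that fit in a 32K context
```

The design choice worth noting is the ordering of priorities: text is never truncated, and the visual side (tile count or frame count) is what degrades when the context budget is tight.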
Efficiency & Framework Optimization
- GPU Memory Optimization:
  - Integrates Triton-based fused operators that replace PyTorch's MLP, RMSNorm, and RoPE implementations.
  - Reduces GPU memory with fused linear layers + cross-entropy loss (removes intermediate logit storage) and CPU offloading of hidden states.
  - Sufficient to fit up to 32K context length with an 8B model on a single GPU.
- Distributed Context Parallelism:
  - Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism, building on USP.
  - Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
  - Optimized sparse video frame sampling with rapid video metadata parsing, improving long-video decoding and reducing memory consumption.
- Inference Acceleration:
  - Supports vLLM deployment with reduced memory and accelerated inference.
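The ZigZag layout mentioned above addresses load imbalance under causal attention: later tokens attend to more keys, so a contiguous split gives the last rank the most work. The sketch below shows the general zigzag sharding idea in plain Python. It is an illustration only, not the repository's code; `zigzag_shard` and `causal_work` are hypothetical helper names, and real implementations operate on tensors and handle communication.

```python
def zigzag_shard(tokens, world_size):
    """Split a sequence into 2 * world_size chunks; rank r gets chunks r and (2W - 1 - r)."""
    n_chunks = 2 * world_size
    if len(tokens) % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * world_size")
    size = len(tokens) // n_chunks
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]
    # Pair an early chunk with a late chunk so every rank gets similar causal work.
    return [chunks[r] + chunks[n_chunks - 1 - r] for r in range(world_size)]


def causal_work(positions):
    """Number of (query, key) pairs a rank computes under a causal mask."""
    return sum(p + 1 for p in positions)


shards = zigzag_shard(list(range(8)), world_size=2)
print(shards)
print([causal_work(s) for s in shards])
```

With 8 tokens on 2 ranks, a contiguous split would give the ranks 10 and 26 query-key pairs, while the zigzag split gives each rank 18.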
Model Details
- Model Type: Long-context vision-language model
- Architecture:
  - Vision encoder: Siglip2-So400m-Patch16-512
  - Language model: Qwen2.5-7B-Instruct
  - Multimodal base architecture: LLaVA with tiling-based vision input
- Supported Inputs:
  - Long video sequences (up to 512 frames)
  - High-resolution images (up to 4K HD input size)
  - Multi-page documents
  - Long text
- Training Strategy:
  - Progressive mixed post-training, expanding from 32K to 128K context length
  - Information-first sampling for optimal visual and textual information retention
- Training Data:
  - Open-source video and document datasets
  - Eagle-Video-110K (110K long videos with dual-level annotation)
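Because video input is capped at 512 frames, longer videos have to be subsampled before they reach the model. The following is a minimal sketch of uniform sparse sampling, assuming frames are picked at the centers of equal-length segments. `sample_frame_indices` is a hypothetical helper for illustration and may differ from the sampling strategy the repository actually uses.

```python
def sample_frame_indices(total_frames, max_frames=512):
    """Return up to max_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0:
        return []
    n = min(total_frames, max_frames)
    # Take the center of each of n equal segments so both ends are covered evenly.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


print(sample_frame_indices(10, max_frames=4))
print(len(sample_frame_indices(100_000)))
```

Sampling indices first and decoding only those frames is what keeps memory bounded for hour-long videos.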
Getting Started
Onboarding
Recommended order: 1) Set environment variables → 2) Install → 3) Prepare data → 4) Train → 5) Demo → 6) Inference
- Onboarding overview: see `./document/0.onboarding.md`
Installation & Environment
- Detailed steps and dependencies: `./document/1.installing.md`
  - Conda environment (Python 3.10)
  - PyTorch and FlashAttention (match your CUDA version)
  - Install this repo with `pip install -e .`
  - Troubleshooting notes (specific Transformers version, OpenCV dependencies, etc.)
Data Preparation (Playground)
- Directory structure and JSONL/LMDB examples: `./document/2.preparing_playground.md`
  - `playground/sft_recipe` (data recipe)
  - `playground/sft_jsonl` and `playground/sft_data` (annotations and raw data)
  - Example parquet→LMDB conversion scripts are not included in this repo
- Use `shell/prepare.sh` to normalize and generate `.prepare.json` (internal `submit_prepare_job.sh` is not included)
- LMDB reading example and tips: `./document/how_to_use_lmdb_to_read_images.md`
Training (Stage-2 / Finetuning)
- Full training entry points and multinode/multigpu options: `./document/3.training.md`
  - Single-node example: `GPUS=8 bash shell/train_stage2.sh 1 work_dirs/eagle2.5_debug`
  - Multi-node example (srun/internal submit_job): `PARTITION=xxx GPUS=16 bash shell/train_stage2.sh 2 work_dirs/eagle2.5_multinode`
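If you prefer to drive the launcher from Python (for example, from a sweep script), the commands above can be wrapped as follows. This is a sketch, not part of the repository: `build_train_command` and `launch` are hypothetical helpers, and treating the first positional argument as the node count is an assumption inferred from the two examples. Check `./document/explain_script_arguments.md` before relying on it.

```python
import os
import subprocess


def build_train_command(gpus, nodes, work_dir, partition=None):
    """Build the argv and environment for shell/train_stage2.sh.

    Assumption: the first positional argument is the number of nodes
    (1 in the single-node example, 2 in the multi-node example).
    """
    env = dict(os.environ)
    env["GPUS"] = str(gpus)
    if partition is not None:
        env["PARTITION"] = partition  # only used by the multi-node example
    cmd = ["bash", "shell/train_stage2.sh", str(nodes), work_dir]
    return cmd, env


def launch(gpus, nodes, work_dir, partition=None):
    """Run the training script from the repository root; raises on a non-zero exit."""
    cmd, env = build_train_command(gpus, nodes, work_dir, partition)
    return subprocess.run(cmd, env=env, check=True)


cmd, _ = build_train_command(8, 1, "work_dirs/eagle2.5_debug")
print(" ".join(cmd))
```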
Launching Streamlit Demo
- Interactive testing of the VLM with a UI. Refer to the document for more details: `./document/4.streamlit_demo.md`
Inference
- End-to-end usage and multimodal examples (single/multiple images, single/multiple videos, streaming, batch): `./document/5.inference.md`
  - Load with `transformers` `AutoModel`/`AutoProcessor`: `"nvidia/Eagle-2.5-8B"`
  - Recommended `torch_dtype=torch.bfloat16`; run `model.generate(...)` on GPU
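The loading step described above might look like the sketch below. It uses only the model identifier and dtype recommendation given here. Two things are assumptions: `load_eagle` is a hypothetical helper name, and `trust_remote_code=True` presumes the checkpoint ships custom modeling code. Prompt construction and the exact `model.generate(...)` arguments are deliberately left out; follow `./document/5.inference.md` for those.

```python
MODEL_ID = "nvidia/Eagle-2.5-8B"  # identifier as given in this README


def load_eagle(model_id=MODEL_ID, device="cuda"):
    """Load the model in bfloat16 together with its processor.

    Imports are deferred so this module can be imported without a GPU stack.
    """
    import torch
    from transformers import AutoModel, AutoProcessor

    # Assumption: the checkpoint relies on custom code, hence trust_remote_code.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # recommended dtype from this README
        trust_remote_code=True,
    )
    return model.to(device).eval(), processor


# Usage (downloads weights and requires a GPU):
# model, processor = load_eagle()
```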
Benchmark Results
Video Benchmarks
| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| MVBench (test) | - | - | 72.0 | 69.6 | 74.8 |
| Perception_test (val) | - | - | - | 70.5 | 82.0 |
| EgoSchema (fullset) | - | 72.2 | - | 65.0 | 72.2 |
| MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
| MLVU (val) | - | - | 68.9 | 70.2 | 77.6 |
| LVBench (val) | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
| Video-MME (w/o subtitle) | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
| Video-MME (w/ subtitle) | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
| CG-Bench (Clue) | 58.6 | 50.9 | - | 44.5 | 55.8 |
| CG-Bench (Long) | 44.9 | 37.8 | - | 35.5 | 46.6 |
| CG-Bench (mIoU) | 5.73 | 3.85 | - | 2.48 | 13.4 |
| HourVideo (Dev) | - | 37.2 | - | - | 44.5 |
| HourVideo (Test) | - | 37.4 | - | - | 41.8 |
| Charade-STA (mIoU) | 35.7 | - | - | 43.6 | 65.9 |
| HD-EPIC | - | 37.6 | - | - | 42.9 |
| HRVideoBench | - | - | - | - | 68.5 |
| EgoPlan (val) | - | - | - | - | 45.3 |
Image Benchmarks
| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| DocVQA (test) | 92.8 | 93.1 | 93.0 | 95.7 | 94.1 |
| ChartQA (test) | 85.7 | 87.2 | 84.8 | 87.3 | 87.5 |
| InfoVQA (test) | 79.2 | 81.0 | 77.6 | 82.6 | 80.4 |
| TextVQA (val) | 77.4 | 78.8 | 79.1 | 84.9 | 83.7 |
| OCRBench (test) | 736 | 754 | 822 | 864 | 869 |
| MMstar (test) | 64.7 | 59.1 | 62.8 | 63.9 | 66.2 |
| RWQA (test) | 75.4 | 67.5 | 70.1 | 68.5 | 76.7 |
| AI2D (test) | 84.6 | 79.1 | 84.5 | 83.9 | 84.5 |
| MMMU (val) | 69.1 | 62.2 | 56.0 | 58.6 | 55.8 |
| MMBench_V11 (test) | 83.1 | 74.6 | 83.2 | 82.6 | 81.7 |
| MMVet (GPT-4-Turbo) | 69.1 | 64.0 | 62.8 | 67.1 | 62.9 |
| HallBench (avg) | 55.0 | 45.6 | 50.1 | 52.9 | 54.7 |
| MathVista (testmini) | 63.8 | 63.9 | 64.4 | 68.2 | 67.8 |
| Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | 75.6 |
Embodied Benchmarks
| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| OpenEQA | - | - | - | - | 63.5 |
| ERQA | 47.0 | 41.8 | - | - | 38.3 |
| EgoPlan (val) | - | - | - | - | 45.3 |
License
- See LICENSE for the code of this repository.
- See LICENSE_MODEL for the models of Eagle 2 and Eagle 2.5.
For detailed parameter explanations and launcher script notes, see: ./document/explain_script_arguments.md.