🦅 Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

May 28, 2026

Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.

This repository provides end-to-end guidance and scripts for environment setup, data preparation, training, and inference with the Eagle VLM.

Highlight

🚀 Strong Results Across The Board

  • SOTA on 6 out of 10 long video benchmarks
  • Outperforms GPT-4o (0806) on 3/5 video tasks
  • Outperforms Gemini 1.5 Pro on 4/6 video tasks
  • Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
  • 72.4% on Video-MME with 512 input frames
  • Strong image understanding, with consistent improvements over Eagle 2 and results on par with Qwen2.5-VL

🎯 Key Innovations

  • Information-First Sampling:
    • Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
    • Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context length constraints.
  • Progressive Mixed Post-Training:
    • Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
  • Diversity-Driven Data Recipe:
    • Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
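
The idea behind Image Area Preservation can be illustrated with a small tile-grid selector. This is a minimal sketch, not the repository's implementation: the scoring rule (retained area multiplied by aspect-ratio agreement), the 512-pixel tile size (guessed from the vision encoder name), the 12-tile limit, and the function name `choose_tile_grid` are all assumptions made for illustration.

```python
from itertools import product


def choose_tile_grid(width, height, tile=512, max_tiles=12):
    """Pick a (cols, rows) tile grid for an image.

    Hypothetical scoring: prefer grids that can hold most of the
    original pixel area and whose shape matches the image's aspect ratio.
    """
    orig_ratio = width / height
    orig_area = width * height
    best, best_score = (1, 1), -1.0
    for cols, rows in product(range(1, max_tiles + 1), repeat=2):
        if cols * rows > max_tiles:
            continue
        grid_ratio = cols / rows
        # Aspect term: 1.0 when the grid has the same aspect ratio as the image.
        aspect = min(grid_ratio / orig_ratio, orig_ratio / grid_ratio)
        # Area term: fraction of the original pixels the grid can hold, capped at 1.
        area = min((cols * rows * tile * tile) / orig_area, 1.0)
        score = area * aspect
        # Strict comparison keeps the first (smallest) grid on ties.
        if score > best_score:
            best, best_score = (cols, rows), score
    return best


if __name__ == "__main__":
    print(choose_tile_grid(2048, 1024))
```

Under these assumptions a 2048×1024 image maps to a 4×2 grid, which covers the full area without distorting the aspect ratio, while a 512×512 image stays a single tile.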

⚡ Efficiency & Framework Optimization

  • GPU Memory Optimization:
    • Integrates Triton-based fused operators that replace PyTorch’s MLP, RMSNorm, and RoPE implementations.
    • Reduces GPU memory use with a fused linear layer + cross-entropy loss (which avoids storing intermediate logits) and CPU offloading of hidden states.
    • Together, these are sufficient to fit up to 32K context length with an 8B model on a single GPU.
  • Distributed Context Parallelism:
    • Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism, building on USP.
    • Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
  • Video Decoding Acceleration:
    • Optimizes sparse video frame sampling with rapid video metadata parsing, improving long-video decoding and reducing memory consumption.
  • Inference Acceleration:
    • Supports vLLM deployment with reduced memory and accelerated inference.
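
The zigzag layout mentioned under Distributed Context Parallelism can be illustrated without any GPU code. The sketch below assumes the commonly used scheme in which the sequence is cut into 2 × world_size chunks and rank r keeps chunk r plus its mirror chunk from the end; whether the repository shards in exactly this way is an assumption, and the function names are hypothetical.

```python
def zigzag_shard(seq_len, world_size):
    """Return, for each rank, the token positions it owns under zigzag sharding."""
    n_chunks = 2 * world_size
    if seq_len % n_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * world_size")
    chunk = seq_len // n_chunks
    shards = []
    for rank in range(world_size):
        mirror = n_chunks - 1 - rank  # chunk paired with this rank from the end
        head = range(rank * chunk, (rank + 1) * chunk)
        tail = range(mirror * chunk, (mirror + 1) * chunk)
        shards.append(list(head) + list(tail))
    return shards


def causal_work(positions):
    """Count (query, key) pairs with key <= query for the given query positions."""
    return sum(p + 1 for p in positions)


if __name__ == "__main__":
    for rank, positions in enumerate(zigzag_shard(16, 4)):
        print(rank, positions, causal_work(positions))
```

Pairing an early chunk with a late one gives every rank the same amount of causal-attention work; a plain contiguous split would leave the last rank with far more key positions to attend to than the first.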

Model Details

  • Model Type: Long-context vision-language model
  • Architecture:
    • Vision encoder: Siglip2-So400m-Patch16-512
    • Language model: Qwen2.5-7B-Instruct
    • Multimodal base architecture: LLaVA with tiling-based vision input
  • Supported Inputs:
    • Long video sequences (up to 512 frames)
    • High-resolution images (up to 4K HD input size)
    • Multi-page documents
    • Long text
  • Training Strategy:
    • Progressive mixed post-training, expanding from 32K to 128K context length
    • Information-first sampling for optimal visual and textual information retention
  • Training Data:
    • Open-source video and document datasets
    • Eagle-Video-110K (110K long videos with dual-level annotation)
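
The interaction between context length, text retention, and the 512-frame limit can be sketched as a simple budget calculation in the spirit of information-first sampling: text tokens are kept in full and the remaining context is filled with frames. This is an illustrative sketch only. The 256 tokens per frame, the reading of 32K and 128K as 32,768 and 131,072 tokens, and the helper names are assumptions, not values taken from the repository.

```python
def fit_frames(context_len, text_tokens, tokens_per_frame=256, max_frames=512):
    """Number of frames that fit after reserving room for all text tokens."""
    if text_tokens >= context_len:
        raise ValueError("text alone exceeds the context length")
    budget = context_len - text_tokens  # tokens left for visual input
    frames = min(max_frames, budget // tokens_per_frame)
    if frames < 1:
        raise ValueError("no room left for even one frame")
    return frames


def uniform_frame_indices(total_frames, n):
    """Pick n frame indices spread evenly over a video (segment midpoints)."""
    n = min(n, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


if __name__ == "__main__":
    for context_len in (32_768, 131_072):  # the two context lengths named above
        n = fit_frames(context_len, text_tokens=2_048)
        print(context_len, n, uniform_frame_indices(10_000, n)[:3])
```

With these assumed numbers, a 32K context leaves room for 120 frames next to 2,048 text tokens, while a 128K context leaves room for 504, which is one way to see why longer contexts are needed to approach the 512-frame limit.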

Getting Started

📚 Onboarding

Recommended order:

  1. Set environment variables → 2. Install → 3. Prepare data → 4. Train → 5. Demo → 6. Inference
  • Onboarding overview: see ./document/0.onboarding.md

⚙️ Installation & Environment

  • Detailed steps and dependencies: ./document/1.installing.md
    • Conda environment (Python 3.10)
    • PyTorch and FlashAttention (match your CUDA)
    • Install this repo with pip install -e .
    • Troubleshooting notes (specific Transformers version, OpenCV dependencies, etc.)

📂 Data Preparation (Playground)

  • Directory structure and JSONL/LMDB examples: ./document/2.preparing_playground.md
    • playground/sft_recipe (data recipe)
    • playground/sft_jsonl and playground/sft_data (annotations and raw data)
    • Example parquet→LMDB conversion scripts are not included in this repo
    • Use shell/prepare.sh to normalize and generate .prepare.json (internal submit_prepare_job.sh is not included)
    • LMDB reading example and tips: ./document/how_to_use_lmdb_to_read_images.md
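
Before running shell/prepare.sh, it can help to confirm that each annotation file is well-formed JSONL. The sketch below is a generic checker, not part of this repository; the function name `check_jsonl` and the `conversations` key used in the example are hypothetical, and the actual record schema is described in ./document/2.preparing_playground.md.

```python
import json


def check_jsonl(path, required_keys=()):
    """Return (valid_count, problems) for a JSONL file.

    problems is a list of (line_number, message) tuples.
    """
    valid, problems = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
                continue
            valid += 1
    return valid, problems
```

A call such as `check_jsonl("playground/sft_jsonl/example.jsonl", required_keys=("conversations",))` (a made-up file name and key) returns the number of usable records together with the line numbers that need attention.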

💪 Training (Stage-2 / Finetuning)

  • Full training entry points and multinode/multigpu options: ./document/3.training.md
    • Single-node example: GPUS=8 bash shell/train_stage2.sh 1 work_dirs/eagle2.5_debug
    • Multi-node example (srun/internal submit_job): PARTITION=xxx GPUS=16 bash shell/train_stage2.sh 2 work_dirs/eagle2.5_multinode

✨ Launching Streamlit Demo

  • Interactive testing of the VLM through a web UI. See ./document/4.streamlit_demo.md for details.

🔮 Inference

  • End-to-end usage and multimodal examples (single/multiple images, single/multiple videos, streaming, batch): ./document/5.inference.md
    • Load with transformers AutoModel/AutoProcessor: "nvidia/Eagle-2.5-8B"
    • Recommended torch_dtype=torch.bfloat16; run model.generate(...) on GPU

Benchmark Results

🎥 Video Benchmarks

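| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| MVBench (test) | - | - | 72.0 | 69.6 | 74.8 |
| Perception_test (val) | - | - | - | 70.5 | 82.0 |
| EgoSchema (fullset) | - | 72.2 | - | 65.0 | 72.2 |
| MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
| MLVU (val) | - | - | 68.9 | 70.2 | 77.6 |
| LVBench (val) | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
| Video-MME (w/o subtitle) | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
| Video-MME (w subtitle) | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
| CG-Bench (Clue) | 58.6 | 50.9 | - | 44.5 | 55.8 |
| CG-Bench (Long) | 44.9 | 37.8 | - | 35.5 | 46.6 |
| CG-Bench (mIoU) | 5.73 | 3.85 | - | 2.48 | 13.4 |
| HourVideo (Dev) | - | 37.2 | - | - | 44.5 |
| HourVideo (Test) | - | 37.4 | - | - | 41.8 |
| Charade-STA (mIoU) | 35.7 | - | - | 43.6 | 65.9 |
| HD-EPIC | - | 37.6 | - | - | 42.9 |
| HRVideoBench | - | - | - | - | 68.5 |
| EgoPlan (val) | - | - | - | - | 45.3 |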

🖼️ Image Benchmarks

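| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| DocVQA (test) | 92.8 | 93.1 | 93.0 | 95.7 | 94.1 |
| ChartQA (test) | 85.7 | 87.2 | 84.8 | 87.3 | 87.5 |
| InfoVQA (test) | 79.2 | 81.0 | 77.6 | 82.6 | 80.4 |
| TextVQA (val) | 77.4 | 78.8 | 79.1 | 84.9 | 83.7 |
| OCRBench (test) | 736 | 754 | 822 | 864 | 869 |
| MMstar (test) | 64.7 | 59.1 | 62.8 | 63.9 | 66.2 |
| RWQA (test) | 75.4 | 67.5 | 70.1 | 68.5 | 76.7 |
| AI2D (test) | 84.6 | 79.1 | 84.5 | 83.9 | 84.5 |
| MMMU (val) | 69.1 | 62.2 | 56.0 | 58.6 | 55.8 |
| MMBench_V11 (test) | 83.1 | 74.6 | 83.2 | 82.6 | 81.7 |
| MMVet (GPT-4-Turbo) | 69.1 | 64.0 | 62.8 | 67.1 | 62.9 |
| HallBench (avg) | 55.0 | 45.6 | 50.1 | 52.9 | 54.7 |
| MathVista (testmini) | 63.8 | 63.9 | 64.4 | 68.2 | 67.8 |
| Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | 75.6 |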
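
The README does not say how the Avg Score row is computed. The numbers are consistent with a plain mean over the 13 image benchmarks after dividing the OCRBench score (reported out of 1000) by 10; that reading is an inference from the table, not a documented rule. The sketch below checks it against the reported values.

```python
# Per-model scores in table order: DocVQA, ChartQA, InfoVQA, TextVQA, OCRBench,
# MMstar, RWQA, AI2D, MMMU, MMBench_V11, MMVet, HallBench, MathVista.
SCORES = {
    "GPT-4o": [92.8, 85.7, 79.2, 77.4, 736, 64.7, 75.4, 84.6, 69.1, 83.1, 69.1, 55.0, 63.8],
    "Gemini-1.5 Pro": [93.1, 87.2, 81.0, 78.8, 754, 59.1, 67.5, 79.1, 62.2, 74.6, 64.0, 45.6, 63.9],
    "InternVL2.5-8B": [93.0, 84.8, 77.6, 79.1, 822, 62.8, 70.1, 84.5, 56.0, 83.2, 62.8, 50.1, 64.4],
    "Qwen2.5-VL-8B": [95.7, 87.3, 82.6, 84.9, 864, 63.9, 68.5, 83.9, 58.6, 82.6, 67.1, 52.9, 68.2],
    "Eagle2.5-8B": [94.1, 87.5, 80.4, 83.7, 869, 66.2, 76.7, 84.5, 55.8, 81.7, 62.9, 54.7, 67.8],
}
OCRBENCH = 4  # position of OCRBench in each list


def avg_score(values):
    """Mean of the scores with OCRBench rescaled from 0-1000 to 0-100 (assumed)."""
    scaled = [v / 10 if i == OCRBENCH else v for i, v in enumerate(values)]
    return round(sum(scaled) / len(scaled), 1)


if __name__ == "__main__":
    for model, values in SCORES.items():
        print(model, avg_score(values))
```

Under this assumption the script reproduces all five reported averages: 74.9, 71.7, 73.1, 75.6, and 75.6.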

🦾 Embodied Benchmarks

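| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
|---|---|---|---|---|---|
| OpenEQA | - | - | - | - | 63.5 |
| ERQA | 47.0 | 41.8 | - | - | 38.3 |
| EgoPlan (val) | - | - | - | - | 45.3 |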

License

  • See LICENSE for the code of this repository.
  • See LICENSE_MODEL for the models of Eagle 2 and Eagle 2.5.

For detailed parameter explanations and launcher script notes, see: ./document/explain_script_arguments.md.