Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
June 26, 2025 · View on GitHub
¹S-Lab, Nanyang Technological University · ²A*STAR, Singapore · ³Simon Fraser University · ⁴Shanghai AI Lab
Ego-R1 is a comprehensive research framework that combines reinforcement learning-based tool-use reasoning with egocentric video analysis capabilities.
Project Overview
This repository provides:
- Chain-of-Tool-Thought Generation (cott_gen): Multi-modal AI agents for analyzing egocentric video data with tool-calling capabilities (RAG, Video-LLM, VLM)
- Ego-R1-Agent: Reinforcement learning framework for training multi-turn, tool-use-interleaved LLMs
- Ego-R1 Dataset: 25K Chain-of-Tool-Thought examples and 4.4K QA pairs
Key Features
- Multi-modal Tool-Augmented Reasoning: Combines RAG search, Video-LLMs, and Vision-Language Models for long-video understanding; agents learn to use multiple tools to decompose and answer complex egocentric video questions
- Reinforcement Learning: GRPO training for interleaved thinking, reasoning, and acting behavior
- Comprehensive Dataset: We release the code for CoTT data generation as well as pre-processed data for both SFT and RL training
News
- [2025.6.8] Officially launched the Ego-R1 codebase.
Table of Contents
- Repository Structure
- Installation
- Quick Start
- Usage Examples
- Dataset
- Acknowledgments
- License
- Contributing
- Authors & Contact
- Citation
Repository Structure
Ego-R1/
├── cott_gen/            # Chain-of-Tool-Thought generation for egocentric video QA
│   ├── main.py          # Main agent runner with multi-turn reasoning
│   ├── tools.py         # Tool implementations (RAG, Video-LLM, VLM)
│   ├── utils.py         # Utility functions and data processing
│   ├── prompts.py       # System and reasoning prompts
│   ├── postprocess.py   # Data postprocessing and analysis
│   └── environment.yml  # Conda environment for autogen
├── LLaMA-Factory/       # LLM fine-tuning framework (submodule)
├── Ego-R1-Agent/        # RL framework for reasoning + search LLMs
│   ├── train_grpo.sh    # GRPO training script
│   ├── train_ppo.sh     # PPO training script
│   ├── eval/            # Inference and evaluation scripts
│   └── verl/            # veRL framework components
├── data/                # Ego-R1 dataset (download from Hugging Face)
│   ├── Ego-CoTT-25K/    # 25K Chain-of-Tool-Thought examples for SFT
│   ├── Ego-QA-4.4K/     # 4.4K QA pairs for RL training
│   └── Ego-CoTT-raw/    # Raw data in multiple formats
├── scripts/             # Training and generation scripts
│   ├── train/           # SFT training scripts
│   └── gen/             # Data generation scripts
└── api/                 # API components for RAG and visual tools
    ├── rag/             # RAG-related API components
    └── visual_tools/    # Multi-modal visual tool APIs
Installation
Download Ego-R1-Data
huggingface-cli download Ego-R1/Ego-R1-Data --local-dir data --repo-type dataset
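Equivalently, from Python via huggingface_hub (the library behind the CLI above):

from huggingface_hub import snapshot_download

# Mirror the CLI command: fetch the dataset repo into ./data
snapshot_download(repo_id="Ego-R1/Ego-R1-Data", repo_type="dataset", local_dir="data")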
Environment Setup
0. Toolbox API Environment
i. Set Environment
cd api/rag
pip install -e .
Make sure to install FFmpeg beforehand, as it is required for the visual tools to function properly.
ii. Prepare the Data for the EgoSchema and Video-MME Benchmarks
huggingface-cli download Ego-R1/h-rag_database --local-dir data --repo-type dataset
Unzip the Video-MME and EgoSchema videos.
iii. Set Up the API

- Set the GPT key:

  export AZURE_OPENAI_ENDPOINT=ENDPOINT
  export AZURE_OPENAI_API_KEY=KEY

- Start RAG:

  - For EgoLife/Ego-R1:
    - Set the video directory in rag/configs/egolife.yaml:

      base:
        data_dir: data/egolife # set to h-rag_database/egolife

    - Run:

      python api_for_egolife.py

  - For EgoSchema:
    - Run:

      python api_for_egoschema.py --min_log_dir=h-rag_database/egoschema --port 6001 # default

  - For Video-MME:
    - Run:

      python api_for_videomme.py --min_log_dir=h-rag_database/videomme/videomme_10min --sec_log_dir=h-rag_database/videomme/videomme_30s --port 7001 # default
iv. Start the Visual API

- Set the config:

  - Set the video directory in visual_tools/configs.yaml for the EgoLife, EgoSchema, and Video-MME videos separately:

    data_dir: "/path/to/egolife"
    data_dir: "/path/to/videomme"
    data_dir: "/path/to/egoschema"

  - Set any number of Gemini API keys:

    gemini_api_keys: ["your-gemini-api-key-1", "your-gemini-api-key-2"]

- Run the API:

  - For any visual API, run:

    python api.py

  - For the LLaVA-based Video-LLM, run the LLaVA API first:

    python xxxx_videollm_llava/llava_video.py
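Once the servers are up, a quick way to confirm they are listening (a minimal sketch; 6001 and 7001 are the EgoSchema and Video-MME defaults above, so extend SERVERS with the EgoLife and visual-API ports your configs actually use):

import socket

SERVERS = {"egoschema_rag": 6001, "videomme_rag": 7001}

for name, port in SERVERS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        # connect_ex returns 0 when the port accepts a TCP connection
        reachable = sock.connect_ex(("localhost", port)) == 0
        print(f"{name} on port {port}: {'listening' if reachable else 'not reachable'}")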
1. CoTT-Data-Generation Environment
# One-line installation
cd cott_gen
conda env create -f environment.yml
conda activate autogen
# Or install step by step:
# conda create -n autogen python=3.10
# conda activate autogen
# pip install -U autogenstudio==0.6.1
# pip install future google-genai
2. SFT (LLaMA-Factory) Environment
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
3. RL (Ego-R1-Agent) Environment
conda create -n egor1 python=3.9
conda activate egor1
# Install PyTorch (optional - vllm can handle this)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# verl
pip install -e .
# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb google-genai
You can follow Search-R1 to build the environment as well.
Quick Start
Inference
1. Test the model
bash Ego-R1-Agent/utils/serve.sh
2. Inference on the benchmark
conda activate egor1
# with a summary model
bash Ego-R1-Agent/eval/infer_bench_summ.sh
# or you can go with a basic one
# python infer.py --arg1 xxx --arg2 xxx
Training
1. Supervised Fine-Tuning (SFT)
# Prepare data
mkdir -p LLaMA-Factory/data
cp data/Ego-CoTT-25K/train-cott.json LLaMA-Factory/data/
# Train model
conda activate llamafactory
cd LLaMA-Factory
llamafactory-cli train examples/train_full/qwen.yaml
2. Reinforcement Learning Training
# Prepare data
mkdir -p Ego-R1-Agent/data
cp data/Ego-CoTT-raw/*.parquet Ego-R1-Agent/data/
# Start RL training
conda activate egor1
cd Ego-R1-Agent
bash train_grpo.sh # For GRPO training
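Before launching, you can sanity-check the copied parquet files with a minimal pandas sketch (it prints whatever columns the files actually ship rather than assuming any):

import glob

import pandas as pd

# Inspect each RL training parquet file copied into Ego-R1-Agent/data
for path in sorted(glob.glob("Ego-R1-Agent/data/*.parquet")):
    df = pd.read_parquet(path)
    print(f"{path}: {len(df)} rows, columns={list(df.columns)}")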
3. Chain-of-Tool-Thought Generation
# Generate reasoning traces with multi-modal tools
conda activate autogen
bash scripts/gen/run_data_gen.sh
Usage Examples
Multi-Modal Reasoning Process
The Ego-R1 agent uses a structured chain-of-tool-thought approach (a minimal sketch of the loop follows this list):
- Think: Analyze the question and plan the reasoning approach
- RAG Search: Retrieve relevant context from video databases across different time granularities
- Video-LLM: Analyze specific video segments for detailed understanding
- VLM: Extract visual details from specific frames when needed
- Answer: Provide reasoned response based on collected evidence
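The sketch below shows this think-act-observe cycle in Python. It is illustrative only: call_llm, parse_tool_call, and execute_tool are hypothetical callables supplied by the caller, not the repository's actual interfaces (see cott_gen/main.py and cott_gen/tools.py for those).

import json

def cott_loop(question, call_llm, parse_tool_call, execute_tool, max_turns=10):
    # history holds the multi-turn conversation the model reasons over
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = call_llm(history)       # model thinks, then emits a tool call or a final answer
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)   # e.g. {"name": "rag", "arguments": {...}}, or None
        if call is None:                # no tool call means the model has answered
            return reply
        observation = execute_tool(call["name"], call["arguments"])
        history.append({"role": "tool", "content": json.dumps(observation)})
    return "No final answer within the turn budget."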
Tool Usage Examples
RAG Search
{
"name": "rag",
"arguments": {
"level": "day", # or "week", "hour"
"keywords": ["cooking", "kitchen"],
"start_time": "DAY1_11210217",
"query_time": "DAY1_11220217"
}
}
Video Analysis
{
"name": "video_llm",
"arguments": {
"question": "What cooking action is being performed?",
"range": "DAY1_11210217-DAY1_11220217"
}
}
Image Analysis
{
"name": "vlm",
"arguments": {
"question": "What objects are visible on the table?",
"timestamp": "DAY1_11210217"
}
}
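As a sanity check on generated traces, the required arguments for each tool can be read off the examples above. The sketch below validates a call against those sets; the schemas are inferred from the examples, not taken from the repository's code:

REQUIRED_ARGS = {
    "rag": {"level", "keywords", "start_time", "query_time"},
    "video_llm": {"question", "range"},
    "vlm": {"question", "timestamp"},
}

def validate_tool_call(call: dict) -> None:
    # Reject unknown tools and calls missing any required argument.
    name, args = call["name"], call["arguments"]
    if name not in REQUIRED_ARGS:
        raise ValueError(f"unknown tool: {name}")
    missing = REQUIRED_ARGS[name] - args.keys()
    if missing:
        raise ValueError(f"{name} call missing arguments: {sorted(missing)}")

validate_tool_call({
    "name": "vlm",
    "arguments": {
        "question": "What objects are visible on the table?",
        "timestamp": "DAY1_11210217",
    },
})  # passes silently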
Dataset
Ego-CoTT-25K
- Size: 25,000 examples (415 MB)
- Format: Multi-turn conversations with tool calls
- Purpose: Supervised fine-tuning
- Tools: RAG, Video-LLM, VLM integration
Ego-QA-4.4K
- Size: 4,400 QA pairs
- Sources: 1.5K Gemini-generated + 2.9K manually annotated QA pairs
- Agents: 6 different identities (A1-A6)
- Purpose: Rule-based reinforcement learning training or generating CoTT from scratch
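To get a feel for the SFT data after downloading, here is a minimal sketch that peeks at the first record (the path matches the Quick Start above; it assumes the file is a JSON array of conversations, as LLaMA-Factory's SFT format expects, and prints the record rather than assuming its fields):

import json

with open("data/Ego-CoTT-25K/train-cott.json") as f:
    examples = json.load(f)

print(f"{len(examples)} examples loaded")
print(json.dumps(examples[0], indent=2)[:500])  # first record, truncated for readability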
Acknowledgments
This project builds upon several excellent open-source frameworks:
- autogen: Foundation for multi-agent conversations and tool calling
- veRL: Reinforcement learning framework for LLM training
- LLaMA-Factory: Comprehensive LLM fine-tuning platform
- Search-R1: RL framework for reasoning + search capabilities
- DeepSeek-R1: Inspiration for reasoning model architecture
License
This project is licensed under the Apache License 2.0. See the LICENSE files in individual components for details.
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests to help improve this research framework.
Authors & Contact
If you have any queries, feel free to contact: Shulin Tian (shulin002@ntu.edu.sg) & Ruiqi Wang (rwa135@sfu.ca)
Citation
@misc{tian2025egor1chainoftoolthoughtultralongegocentric,
title={Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning},
author={Shulin Tian and Ruiqi Wang and Hongming Guo and Penghao Wu and Yuhao Dong and Xiuying Wang and Jingkang Yang and Hao Zhang and Hongyuan Zhu and Ziwei Liu},
year={2025},
eprint={2506.13654},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.13654},
}