README.md
December 6, 2024 · View on GitHub
📌Table of Contents
🚀Introduction
The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments, and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning?
To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities of MLLMs, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
This repository describes the usage of our evaluation data (EgoPlan-Val and EgoPlan-Test) and instruction-tuning data (EgoPlan-IT), and provides the corresponding codes for evaluating and fine-tuning MLLMs on our benchmark. Welcome to evaluate your models and explore methods to enhance the models' EgoPlan capabilities on our benchmark!
📝Data
Egocentric Videos
The EgoPlan datasets are constructed based on the two existing egocentric video sources: Epic-Kitchens-100 and Ego4D.
Download the RGB frames of Epic-Kitchens-100. The folder structure of the dataset is shown below:
EPIC-KITCHENS
└── P01
└── rgb_frames
└── P01_01
├── frame_0000000001.jpg
└── ...
Download the videos of Ego4D. The folder structure of the dataset is shown below:
Ego4D
└──v1_288p
├── 000786a7-3f9d-4fe6-bfb3-045b368f7d44.mp4
└── ...
EgoPlan Evaluation Data
Questions from the human-verified evaluation data are formatted as multiple-choice problems. MLLMs need to select the most reasonable answer from four candidate choices. The primary metric is Accuracy.
We divide the evaluation data into two subsets: EgoPlan-Val (containing 3,355 samples) for validation and EgoPlan-Test (containing 1,584 samples) for test, wherein the ground-truth answers of EgoPlan-Test are kept non-public. Below shows an example from the validation set:
{
"sample_id": 115,
"video_source": "EpicKitchens",
"video_id": "P01_13",
"task_goal": "store cereal",
"question": "Considering the progress shown in the video and my current observation in the last frame, what action should I take next in order to store cereal?",
"choice_a": "put cereal box into cupboard",
"choice_b": "take cereal bag",
"choice_c": "open cupboard",
"choice_d": "put cereal bag into cereal box",
"golden_choice_idx": "A",
"answer": "put cereal box into cupboard",
"current_observation_frame": 760,
"task_progress_metadata": [
{
"narration_text": "take cereal bag",
"start_frame": 36,
"stop_frame": 105
},
{
"narration_text": "fold cereal bag",
"start_frame": 111,
"stop_frame": 253
},
{
"narration_text": "put cereal bag into cereal box",
"start_frame": 274,
"stop_frame": 456
},
{
"narration_text": "close cereal box",
"start_frame": 457,
"stop_frame": 606
},
{
"narration_text": "open cupboard",
"start_frame": 689,
"stop_frame": 760
}
],
}
EgoPlan Training Data (EgoPlan-IT)
We provide an automatically constructed instruction-tuning dataset EgoPlan_IT, which contains 50K samples, for fine-tuning the model. Below shows an example from EgoPlan-IT:
{
"sample_id": 39,
"video_source": "EpicKitchens",
"video_id": "P07_113",
"task_goal": "Cut and peel the onion",
"question": "Considering the progress shown in the video and my current observation in the last frame, what action should I take next in order to cut and peel the onion?",
"answer": "grab onion",
"current_observation_frame": 9308,
"task_progress_metadata": [
{
"narration_text": "open drawer",
"start_frame": 9162,
"stop_frame": 9203
},
{
"narration_text": "grab knife",
"start_frame": 9214,
"stop_frame": 9273
},
{
"narration_text": "close drawer",
"start_frame": 9272,
"stop_frame": 9303
}
],
"negative_answers": [
"open drawer",
"grab knife",
"close drawer",
"slice onion",
"remove peel from onion",
"peel onion"
]
}
đź’»Getting Started
1. Installation
Clone the repo and install dependent packages:
git clone https://github.com/ChenYi99/EgoPlan.git
cd EgoPlan
pip install -r requirements.txt
2. Data Preparation
Prepare Egocentric videos: Download the RGB frames of Epic-Kitchens-100 and the videos of Ego4D.
Prepare EgoPlan datasets: Download the validation data set EgoPlan_validation.json and the training dataset EgoPlan_IT.json. Put these two JSON files under the directory data/.
For details of the data structure, please refer to Data.
3. Model Weights
We use Video-LLaMA as an example for evaluation and instruction-tuning.
Prepare the pretrained model checkpoints
- The checkpoint of the vanilla Video-LLaMA can be downloaded from Video-LLaMA-2-7B-Finetuned.
- Alternatively, the checkpoint of the Video-LLaMA that has been further tuned on our EgoPlan-IT can be downloaded from EgoPlan-Video-LLaMA-2-7B.
Prepare the pretrained LLM weights
Video-LLaMA is based on Llama2 Chat 7B. The corresponding LLM weights can be downloaded from Llama-2-7b-chat-hf.
Prepare weights for initializing the Visual Encoder and Q-Former (optional)
If the server cannot access the Internet, the following weights should be downloaded in advance:
- VIT (eva_vit_g.pth)
- Q-Former (blip2_pretrained_flant5xxl.pth)
- Bert (bert-base-uncased)
4. Evaluation
Evaluating the Vanilla Video-LLaMA
Config the paths for model weights in video_llama_eval_only_vl.yaml.
Set the paths for the project root, Epic-Kitchens-100 RGB frames and Ego4D videos in eval_video_llama.sh.
Then, run the script on 1xV100 (32G) GPU:
bash scripts/eval_video_llama.sh
Evaluating the Video-LLaMA Tuned on EgoPlan-IT
Config the paths for model weights in egoplan_video_llama_eval_only_vl.yaml.
Set the paths for the project root, Epic-Kitchens-100 RGB frames and Ego4D videos in eval_egoplan_video_llama.sh.
Then, run the script on 1xV100 (32G) GPU:
bash scripts/eval_egoplan_video_llama.sh
5. Training
For increasing instruction diversity, in addition to EgoPlan-IT, we also include an additional collection of 164K instruction data following Video-LLaMA:
- 3K image-based instructions from MiniGPT-4 [link].
- 150K image-based instructions from LLaVA [link]. The images can be downloaded from here.
- 11K video-based instructions from VideoChat [link]. The videos can be downloaded following the instructions from the official Github repo of Webvid.
Config the paths for model weights and datasets in visionbranch_stage3_finetune_on_EgoPlan_IT.yaml.
Set the path for the project root in finetune_egoplan_video_llama.sh.
Then, run the script on 8xV100 (32G) GPUs:
bash scripts/finetune_egoplan_video_llama.sh
🖊️Submission
We are consistently maintaining an EgoPlan-Bench Leaderboard. To show your model's performance on our leaderboard, please contact yichennlp@gmail.com with attached prediction files for the validation and test sets.
We ONLY accept ".json" files. The submitted data format should be like:
[
{
"sample_id": "int",
"label": "str"
},
...
]
where the "sample_id" field should be an integer and the "label" field should be a string within ["A","B","C","D"]. An example submission file for the validation set can be found here.
📚Citation
If you find our project helpful, hope you can star our repository and cite our paper as follows:
@misc{chen2024egoplanbenchbenchmarkingmultimodallarge,
title={EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning},
author={Yi Chen and Yuying Ge and Yixiao Ge and Mingyu Ding and Bohao Li and Rui Wang and Ruifeng Xu and Ying Shan and Xihui Liu},
year={2024},
eprint={2312.06722},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2312.06722},
}
🙌Acknowledgement
This repo benefits from Epic-Kitchens, Ego4D, Video-LLaMA, LLaMA, MiniGPT-4, LLaVA, VideoChat. Thanks for their wonderful works!