HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
September 10, 2025
Project Page | Paper
:fire: News
- [2025.06.26] Our paper HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics has been accepted by ICCV 2025 🎉.
- [2024.08.24] :keyboard: Our short paper BREASE: Bridging Episodes and Semantics, A Novel Framework for Long-Form Video Understanding has been accepted by the EVAL-FoMo workshop at ECCV'24.
Model Overview
Results
Plug-and-Play Experiments
Requirements
You can set up the environment by running:

```shell
git clone https://github.com/joslefaure/HERMES.git
cd HERMES
pip install -e .
```
Supported Datasets
Additionally, our modules can be plugged into other VLMs for faster inference and improved memory management.
Prepare MovieCORE and/or MovieChat-1k
- Download the train data (if you want to finetune HERMES) from here and the test data from here.
- Extract the frames at 10 FPS and organize them as follows:
```
data
└── moviecore
    ├── annotation
    └── frames
        └── {video_id}
            ├── frame000001.jpg
            ├── ...
```
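The extraction step above can be sketched with a small helper that builds the ffmpeg command producing the expected `frame000001.jpg, ...` naming. This is a minimal sketch, not part of the HERMES codebase: `ffmpeg_cmd` is a hypothetical name, and it assumes ffmpeg is installed and that the video filename (without extension) is the `{video_id}`.

```python
from pathlib import Path

def ffmpeg_cmd(video_path: str, frames_root: str, fps: int = 10) -> list[str]:
    """Build an ffmpeg command that dumps JPEG frames at `fps` into
    <frames_root>/<video_id>/frame000001.jpg, frame000002.jpg, ...

    Hypothetical helper for illustration; run it with subprocess.run(cmd).
    """
    video = Path(video_path)
    out_dir = Path(frames_root) / video.stem  # {video_id} directory
    return [
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps}",        # resample to the target frame rate
        "-qscale:v", "2",           # high-quality JPEG output
        str(out_dir / "frame%06d.jpg"),  # zero-padded, 1-indexed frames
    ]
```

Create the `{video_id}` directory first (e.g. `out_dir.mkdir(parents=True, exist_ok=True)`), then pass the returned list to `subprocess.run`.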
Pretrained Checkpoints
| Dataset | Download Link |
|---|---|
| MovieCORE | GDrive / HuggingFace |
| MovieChat-1k | GDrive / HuggingFace |
| LVU | GDrive (Coming soon) |
| Breakfast | GDrive (Coming soon) |
| COIN | GDrive (Coming soon) |
Inference
We run inference on 4 V100 GPUs (32GB), but a single GPU is sufficient.
First, add your OpenAI API key to the environment with `export OPENAI_API_KEY='sk-*****'` (needed only for the MovieChat-1k dataset, as we use GPT-3.5 for scoring). For the other datasets, we report top-1 accuracy.
```shell
# Zero-shot
bash run_scripts/moviecore/test.sh

# Fully-supervised
bash run_scripts/moviecore/test.sh path/to/your/model.pth
```
The same applies to the other datasets; all scripts are included in `run_scripts/`.
Train
We train the model on 8 V100 GPUs (32GB).
```shell
bash run_scripts/{dataset}/train.sh
```
Citation
If you find our code or our paper useful for your research, please ⭐ star this repo and cite the following paper:
```bibtex
@misc{faure2024hermestemporalcoherentlongformunderstanding,
      title={HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics},
      author={Gueter Josmy Faure and Jia-Fong Yeh and Min-Hung Chen and Hung-Ting Su and Shang-Hong Lai and Winston H. Hsu},
      year={2024},
      eprint={2408.17443},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.17443},
}
```
Acknowledgement
We thank the authors of the following repositories for open-sourcing their code.
Icon made by Freepik from www.flaticon.com