Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
October 20, 2025 · View on GitHub
[Paper] [Project Page] [Video]
In this work, we propose hierarchical event memory for online video temporal grounding (OnVTG), which requires the model to locate events related to a given natural language query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. We achieve efficient, accurate, and real-time online video temporal localization. We validate the effectiveness of our method on the ActivityNet Captions, TACoS, and MAD datasets.
Our paper is accepted by ICCV-2025.

Requiments
Please install the necessary dependencies listed in requirements.txt.
Data Preparation
Download the data for ActivityNet Captions and TACoS datasets from Google Drive or Baidu Netdisk. Please download the data for MAD dataset from the official repository.
Please decompress the data and place it in the data directory.
Training and Inference
ActivityNet Captions
To train the model on the ActivityNet Captions dataset:
CUDA_VISIBLE_DEVICES=0 bash scripts/train.sh configs/anet.yaml
You can download the trained model checkpoint from Google Drive or Baidu Netdisk.
To evaluate the trained model:
# Inference using single GPU
CUDA_VISIBLE_DEVICES=0 bash scripts/eval.sh configs/anet.yaml results/anet/models/best.pth
# Inference using multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval.sh configs/anet.yaml results/anet/models/best.pth
You can replace results/anet/models/best.pth with the downloaded checkpoint.
The results should be:
| Method | SD | ED | ||||
|---|---|---|---|---|---|---|
| With Future Prediction | 42.89 | 24.49 | 67.82 | 51.74 | -1.58s | -10.96s |
| Without Future Prediction | 45.29 | 26.25 | 76.24 | 62.14 | 41.10s | -10.89s |
MAD
To train the model on the MAD dataset:
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/train.sh configs/mad.yaml
If you wish to train using a different number of GPUs, please modify the batch_size in configs/mad.yaml to ensure that the batch_size multiplied by the number of GPUs remains unchanged.
You can download the trained model checkpoint from Google Drive or Baidu Netdisk.
To evaluate the trained model:
# Inference using single GPU
CUDA_VISIBLE_DEVICES=0 bash scripts/eval.sh configs/mad.yaml results/mad/models/best.pth
# Inference using multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval.sh configs/mad.yaml results/mad/models/best.pth
Please replace results/mad/models/best.pth with the downloaded checkpoint.
The results should be:
| Method | SD | ED | ||||
|---|---|---|---|---|---|---|
| With Future Prediction | 9.84 | 6.43 | 16.67 | 12.18 | 0.64s | -1.10s |
| Without Future Prediction | 15.76 | 11.07 | 37.84 | 29.21 | 3.60s | -1.45s |
TACoS
To train the model on the TACoS dataset:
CUDA_VISIBLE_DEVICES=0 bash scripts/train.sh configs/tacos.yaml
You can download the trained model checkpoint from Google Drive or Baidu Netdisk.
To evaluate the trained model:
# Inference using single GPU
CUDA_VISIBLE_DEVICES=0 bash scripts/eval.sh configs/tacos.yaml results/tacos/models/best.pth
# Inference using multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval.sh configs/tacos.yaml results/tacos/models/best.pth
Please replace results/tacos/models/best.pth with the downloaded checkpoint.
The results should be:
| Method | SD | ED | ||||
|---|---|---|---|---|---|---|
| With Future Prediction | 37.44 | 27.32 | 57.49 | 44.44 | -1.28s | -3.78s |
| Without Future Prediction | 44.19 | 30.87 | 68.96 | 52.69 | 22.38s | -4.26s |
Citation
@inproceedings{zheng-etal-2025-hierarchical,
title = "Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding",
author = "Zheng, Minghang and
Peng, Yuxin and
Sun, Benyuan and
Yang, Yi and
Liu, Yang",
booktitle = "Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)",
year = "2025"
}