StreamFormer Online Action Detection
September 10, 2025
Introduction
This is a PyTorch implementation of the downstream task Online Action Detection, based on MAT ("Memory-and-Anticipation Transformer for Online Action Understanding").
Environment
The code was developed with CUDA 10.2, Python >= 3.7.7, and PyTorch >= 1.7.1. Install the dependencies with:
pip install -r requirements.txt
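After installation, a quick check against the minimum versions listed above can save a failed run later. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `meets_minimum` are hypothetical, and the comparison only looks at the first three numeric components of a version string.

```python
import re
import sys


def parse_version(text):
    """Turn '1.7.1+cu102' into (1, 7, 1); local build suffixes are ignored."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    python_version = ".".join(str(p) for p in sys.version_info[:3])
    print("Python", python_version, "ok:", meets_minimum(python_version, "3.7.7"))
    try:
        import torch  # only importable once the requirements are installed
        print("PyTorch", torch.__version__, "ok:",
              meets_minimum(torch.__version__, "1.7.1"))
        print("CUDA build:", torch.version.cuda)
    except ImportError:
        print("PyTorch is not installed yet")
```

Newer CUDA builds than 10.2 may well work; the script only reports the build and does not enforce it.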
Data Preparation
1. Prepare the dataset
Prepare the THUMOS and TVSeries video datasets, including raw videos, metafiles, and ground-truth annotations. Please refer to LSTR.
2. Extract video features using the following script
Change the following paths in Streamformer/scripts/downstream_extract_oad_feature.sh.
MODEL_PATH='/PATH/TO/PRETRAINED/timesformer-siglip-16'
PRETRAINED_CKPT='/PATH/TO/CHECKPOINT/checkpoint-epoch.pth'
THUMOS_DATA_PATH='/PATH/TO/thumos-video'
THUMOS_ANNO_PATH='/PATH/TO/metadata/thumos'
THUMOS_SAVE_PATH='/PATH/TO/THUMOS14/streamformer_multitask_feature'
TVSERIES_DATA_PATH='/PATH/TO/tv_series/mkv_videos'
TVSERIES_ANNO_PATH='/PATH/TO/metadata/tvseries'
TVSERIES_SAVE_PATH='/PATH/TO/TVSeries/streamformer_multitask_feature'
Then, run feature extraction in parallel on 8 GPUs.
cd Streamformer
./scripts/downstream_extract_oad_feature.sh
Optionally, you can download our pre-extracted features from HERE.
Put all the files under data/THUMOS or data/TVSeries so that the file structure is as follows:
- THUMOS dataset:
data/THUMOS/
├── streamformer_multitask_feature/
| ├── video_validation_0000051.npy (of size L x 768)
│ ├── ...
├── flow_kinetics_bninception/
| ├── video_validation_0000051.npy (of size L x 1024)
| ├── ...
├── target_perframe/
| ├── video_validation_0000051.npy (of size L x 22)
| ├── ...
- TVSeries dataset (optionally, you can generate optical flow for TVSeries.):
data/TVSeries/
├── streamformer_multitask_feature/
| ├── Breaking_Bad_ep1.npy (of size L x 768)
│ ├── ...
├── target_perframe/
| ├── Breaking_Bad_ep1.npy (of size L x 31)
| ├── ...
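Before training, it can help to confirm that the folders match the layout above. The sketch below is an illustration rather than a repository script: `check_layout` is a hypothetical helper, and it assumes each video has one `.npy` file under the same name in every feature folder and in `target_perframe`, as the listing suggests. It checks file names only, not array shapes.

```python
from pathlib import Path


def check_layout(root, feature_dirs, target_dir="target_perframe"):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    targets = root / target_dir
    if not targets.is_dir():
        problems.append(f"missing directory: {targets}")
        return problems
    target_names = {p.name for p in targets.glob("*.npy")}
    for name in feature_dirs:
        folder = root / name
        if not folder.is_dir():
            problems.append(f"missing directory: {folder}")
            continue
        feature_names = {p.name for p in folder.glob("*.npy")}
        # Videos with labels but no features in this folder.
        for missing in sorted(target_names - feature_names):
            problems.append(f"{name} has no features for {missing}")
        # Feature files that have no per-frame labels.
        for extra in sorted(feature_names - target_names):
            problems.append(f"{name}/{extra} has no matching target")
    return problems


if __name__ == "__main__":
    thumos_dirs = ["streamformer_multitask_feature", "flow_kinetics_bninception"]
    for issue in check_layout("data/THUMOS", thumos_dirs):
        print(issue)
```

For TVSeries without optical flow, pass only `["streamformer_multitask_feature"]` as `feature_dirs`.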
Training and Inference
Train and evaluate MAT on pre-extracted THUMOS video features with flow on a single GPU:
./scripts/train_thumos_with_flow.sh
Or use visual features only (without flow):
./scripts/train_thumos_without_flow.sh
Note that we use the first online evaluation mode in MAT for simplicity, where each test video is split into non-overlapping samples, and the model makes predictions on all the frames in the short-term memory as if each were the latest frame.
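The splitting step in this evaluation mode can be illustrated with a short sketch. It is a simplified illustration, not the code used by MAT: the names `split_non_overlapping`, `predict_video`, `scores_fn`, and `short_len` are hypothetical, long-term memory is left out, and truncating the final window is an assumption about how a shorter last sample might be handled.

```python
def split_non_overlapping(num_frames, short_len):
    """Return (start, end) index pairs that tile [0, num_frames) without overlap."""
    if short_len <= 0:
        raise ValueError("short_len must be positive")
    return [(start, min(start + short_len, num_frames))
            for start in range(0, num_frames, short_len)]


def predict_video(scores_fn, num_frames, short_len):
    """Collect one prediction per frame, one window at a time.

    `scores_fn(start, end)` is a hypothetical stand-in for the model: it must
    return one score per frame in [start, end), treating every frame in the
    window as if it were the latest one.
    """
    per_frame = []
    for start, end in split_non_overlapping(num_frames, short_len):
        window_scores = scores_fn(start, end)
        assert len(window_scores) == end - start
        per_frame.extend(window_scores)
    return per_frame
```

Because the windows do not overlap, each frame receives exactly one prediction and the model runs once per window rather than once per frame.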
Main Results and Checkpoints
THUMOS14
| method | visual feature | mAP (%) | config | checkpoint |
|---|---|---|---|---|
| MAT (with flow) | Streamformer | 73.8 | yaml | Download |
| MAT (without flow) | Streamformer | 68.3 | yaml | Download |
TVSeries
| method | visual feature | mAP (%) | config | checkpoint |
|---|---|---|---|---|
| MAT (without flow) | Streamformer | 87.8 | yaml | Download |
Acknowledgements
This codebase is built upon MAT.