What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
September 26, 2025
Official code for "What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning", ICCV 2025.

Installation
To create a conda environment with the required dependencies, run the following command:
conda env create -f environment.yml
source activate cf
Dataset Preparation
Pre-training
Ego4D/EgoClip
Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked videos (the output of utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here. The egosummary_full.csv file we use is available here.
Downstream tasks
GTEA: Please follow Bridge-Prompt to download the raw video and then extract frames from videos.
EgoPER: .
EpicKitchen & Charades-Ego: Please refer to the EgoVLP codebase for data preparation.
AE2: Please pre-extract frame features for this task, following Align-Ego-Exo for the data split.
Generate State Changes and Their Counterfactuals with Llama
Please refer to Llama 3 for model weights and installation instructions. We use the following scripts to generate state change and counterfactual descriptions for the entire Ego4D dataset. Please note that you will need to modify the paths to Ego4D's annotation files in the scripts.
# clip-level state changes and their counterfactuals
cd llama_script
python clip_level_sc_cf.py
# video-level counterfactuals (run from inside llama_script, as above)
python video_level_cf.py
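Because the annotation paths have to be edited by hand inside the scripts, a quick pre-flight check can catch a wrong path before a long Llama run starts. The following is a minimal sketch, not part of the repository: the helper `missing_paths` and the example locations (including the file name `egoclip.csv`) are hypothetical placeholders for whatever paths you set in the scripts.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


if __name__ == "__main__":
    # Hypothetical locations; replace them with the paths you set in
    # clip_level_sc_cf.py and video_level_cf.py.
    annotation_paths = [
        "/path/to/ego4d/annotations/egoclip.csv",
        "/path/to/ego4d/annotations/egosummary_full.csv",
    ]
    for path in missing_paths(annotation_paths):
        print(f"missing: {path}")
```

An empty result means every listed file or directory was found.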
Generate Text Features with FLAVA
To extract clip-level narration features, please run
python language_extraction/feature_extractor.py
To extract video-level summary features, please run
python language_extraction/summary_feature_extractor.py
Pretraining
Running on SLURM cluster
To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts directly into this directory and run
bash mover_trainer.sh job_name
The SLURM job parameters can be changed in the trainer.sh script. We use 2 nodes, each with four 32 GB GPUs. The submission script first copies the required scripts to a separate folder and runs them from there, so the code can be safely edited while a job is waiting in the SLURM queue.
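The copy-then-run idea can be illustrated with a short Python sketch. This is not the logic of mover_trainer.sh, which is a shell script in the repository; the function `snapshot_code` and the folder names are hypothetical and only show why a frozen copy protects a queued job from later edits.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_code(src_dir, dst_root, job_name):
    """Copy `src_dir` into `dst_root/job_name` and return the new path."""
    dst = Path(dst_root) / job_name
    # Skip caches and version-control metadata that the job does not need.
    shutil.copytree(
        src_dir, dst, ignore=shutil.ignore_patterns("__pycache__", ".git")
    )
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "repo"
        src.mkdir()
        (src / "trainer.sh").write_text("echo training\n")

        frozen = snapshot_code(src, Path(tmp) / "runs", "job_name")

        # Editing the working copy no longer affects the snapshot.
        (src / "trainer.sh").write_text("echo edited\n")
        print((frozen / "trainer.sh").read_text().strip())
```

The job would then be submitted from the snapshot folder, so whatever is in the queue keeps running the code as it was at submission time.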
Running on a single machine
Please run
torchrun --nnodes 1 --nproc_per_node 8 --master_port 8081 run/train_egoaggregate.py --config configs/pt/egoaggregation.json
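With `--nnodes 1 --nproc_per_node 8`, torchrun starts 8 worker processes, the same total as the 2 nodes x 4 GPUs used in the SLURM setup. torchrun passes each worker its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way a training script might read them; the helper `distributed_context` is hypothetical, and whether run/train_egoaggregate.py reads them in this way is an assumption.

```python
import os


def distributed_context(env=None):
    """Read the process layout that torchrun exports to each worker.

    Falls back to a single-process layout when the variables are absent,
    for example when the script is started with plain `python`.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    ctx = distributed_context()
    print(f"worker {ctx['rank']} of {ctx['world_size']} "
          f"(local GPU index {ctx['local_rank']})")
```

If you have fewer than 8 GPUs, lowering `--nproc_per_node` changes the world size, and with it the effective global batch size, so results may differ from the 8-GPU setting.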
Pretraining Checkpoint
The pretraining checkpoint is available here.
Downstream Task Training/Testing
Temporal Action Segmentation (GTEA)
Step 1: Generate features with the pre-trained video model.
Please note that you will need to specify the dataset, model name, and config path in the script, and the "save_dir" in ./as_configs/gtea/gtea_exfm.yaml.
python extract_frame_features.py
Please refer to Bridge-Prompt for more details.
Step 2: Train/test ASFormer based on the features.
cd ASFormer
python main.py --feature cf --dataset gtea --split 1/2/3/4
python main.py --action eval --feature cf --dataset gtea --split 1/2/3/4
python eval.py --result_dir path_to_results --split 1/2/3/4/0
Please refer to ASFormer for more details.
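The `1/2/3/4` notation in the commands above presumably means one split per invocation. As a convenience, the sketch below builds the train and eval commands for every GTEA split using only the flags shown above; the helper `asformer_commands` is hypothetical and not part of the repository.

```python
import shlex


def asformer_commands(split, feature="cf", dataset="gtea"):
    """Return the train and eval argument lists for one split."""
    common = ["--feature", feature, "--dataset", dataset, "--split", str(split)]
    return [
        ["python", "main.py", *common],
        ["python", "main.py", "--action", "eval", *common],
    ]


if __name__ == "__main__":
    for split in (1, 2, 3, 4):
        for cmd in asformer_commands(split):
            # Print the command; to execute it instead, call
            # subprocess.run(cmd, check=True) from inside the ASFormer directory.
            print(shlex.join(cmd))
```

Building the commands as argument lists avoids shell quoting issues if you later pass them to `subprocess.run`.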
EgoPER (Action Segmentation and Error Detection)
Detailed instructions are in EgoPER, which contains EgoPER scripts modified from the original.
To use these modified scripts, the submodules must be loaded:
- If you cloned this repo without submodules, run:
git submodule update --init --recursive
- To clone the repo together with its submodules, run:
git clone --recurse-submodules git@github.com:HCIS-Lab/counterfactual-video-pretrain.git
AE2 Action Phase Recognition
python AE2/AE2_phase_cls.py
Zero-Shot Downstream Task Testing
EpicKitchen-100 Zero-Shot Multi-Instance Retrieval
python downstream_script/test_epic.py
Charades-Ego Zero-Shot Action Classification
python downstream_script/test_charades.py
AE2 Zero-Shot Action Phase Frame Retrieval
python AE2/AE2_frame_retrieval.py
Citation
If you use our code or method, please cite the following paper:
@InProceedings{counterfactual_ICCV_2025,
author = {Kung, Chi-Hsi and Ramirez, Frangil and Ha, Juhyung and Chen, Yi-Ting and Crandall, David and Tsai, Yi-Hsuan},
title = {What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
}
Acknowledgement
The pretraining code and the Charades-Ego and EPIC-KITCHENS test code are based on EgoVLP and HierVL.
The feature extraction code is adapted from Bridge-Prompt.
The temporal action segmentation code is adapted from ASFormer.
The action phase recognition and frame retrieval code is adapted from AE2.