What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

September 26, 2025

Official code for What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning, ICCV 2025.

Installation

To create a conda environment with the required dependencies, run the following command:

conda env create -f environment.yml
source activate cf

Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked videos (the output of utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here. The egosummary_full.csv file we use is available here.

Downstream tasks

GTEA: Please follow Bridge-Prompt to download the raw video and then extract frames from videos.

EgoPER: .

EpicKitchen & Charades-Ego: Please refer to the EgoVLP codebase for data preparation.

AE2: Please pre-extract frame features for this task, following Align-Ego-Exo for the data split.

Generate State Changes and Their Counterfactuals with Llama

Please refer to Llama 3 for model weights and installation instructions. We use the following scripts to generate state change and counterfactual descriptions for the entire Ego4D dataset. Please note that you will need to modify the paths to Ego4D's annotation files in the scripts.

# clip-level state changes and their counterfactuals
cd llama_script
python clip_level_sc_cf.py

# video-level counterfactuals
cd llama_script
python video_level_cf.py

Generate Text Features with FLAVA

To extract clip-level narration features, please run

python language_extraction/feature_extractor.py

To extract video-level summary features, please run

python language_extraction/summary_feature_extractor.py

Pretraining

Running on SLURM cluster

To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts directly into this directory level and run

bash mover_trainer.sh job_name

The parameters of the SLURM job can be changed in the trainer.sh script. We use 2 nodes, each with four 32 GB GPUs. The submission script first copies the required scripts to a different folder and then runs them from there. This copying ensures the code can be safely edited while a job is in the SLURM queue.
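
The copy-then-run idea described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the repository's mover_trainer.sh: the function name snapshot_code, the ignore list, and the directory layout are assumptions made for this example, and no SLURM command is invoked.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_code(src_dir, dest_root, job_name, ignore=(".git", "__pycache__")):
    """Copy src_dir into dest_root/job_name and return the snapshot path.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    src = Path(src_dir)
    dest = Path(dest_root) / job_name
    # Refuse to overwrite an existing snapshot so a queued job's code never changes.
    if dest.exists():
        raise FileExistsError(f"snapshot already exists: {dest}")
    shutil.copytree(src, dest, ignore=shutil.ignore_patterns(*ignore))
    return dest


if __name__ == "__main__":
    # Demo on throwaway directories.
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        work.mkdir()
        (work / "train.py").write_text("print('v1')\n")
        snap = snapshot_code(work, Path(tmp) / "snapshots", "job_name")
        # Editing the working copy afterwards leaves the snapshot untouched.
        (work / "train.py").write_text("print('v2')\n")
        print((snap / "train.py").read_text().strip())
```

A job launched from the snapshot directory keeps running the code as it was at submission time, which is the property the paragraph above relies on.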

Running on a single machine

Please run

torchrun --nnodes 1 --nproc_per_node 8 --master_port 8081 run/train_egoaggregate.py --config configs/pt/egoaggregation.json
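
The command above starts 8 worker processes, which normally means one per GPU. On a machine with a different GPU count, --nproc_per_node has to change accordingly. The following sketch only assembles the command string; build_torchrun_command is a hypothetical helper written for this example, and whether the config's batch size should be adjusted for fewer GPUs is not stated in this README.

```python
import shlex


def build_torchrun_command(num_gpus=8, master_port=8081,
                           script="run/train_egoaggregate.py",
                           config="configs/pt/egoaggregation.json"):
    """Return the single-node torchrun command as an argument list.

    Hypothetical helper; the defaults mirror the command shown in this README.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nnodes", "1",
        "--nproc_per_node", str(num_gpus),  # one worker process per GPU
        "--master_port", str(master_port),
        script,
        "--config", config,
    ]


if __name__ == "__main__":
    # Print the command for an assumed 4-GPU machine instead of running it.
    print(shlex.join(build_torchrun_command(num_gpus=4)))
```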

GTEA (Temporal Action Segmentation)

Step 1: Extract frame features.

python extract_frame_features.py

Please refer to Bridge-Prompt for more details.

Step 2: Train/test ASFormer based on the features.

cd ASFormer
python main.py --feature cf --dataset gtea --split 1/2/3/4
python main.py --action eval --feature cf --dataset gtea --split 1/2/3/4
python eval.py --result_dir path_to_results --split 1/2/3/4/0

Please refer to ASFormer for more details.
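
The notation --split 1/2/3/4 in the commands above is read here as "choose one of the four splits per run"; that reading is an assumption. Under it, the training and prediction commands for every split can be generated as follows. asformer_commands is a hypothetical helper for this example and only builds the command strings.

```python
import shlex

SPLITS = (1, 2, 3, 4)  # split numbers listed in the commands above


def asformer_commands(split, feature="cf", dataset="gtea"):
    """Return (train, predict) commands for one split as argument lists."""
    base = ["python", "main.py", "--feature", feature,
            "--dataset", dataset, "--split", str(split)]
    train = list(base)
    # The prediction command is the training command plus "--action eval".
    predict = base[:2] + ["--action", "eval"] + base[2:]
    return train, predict


if __name__ == "__main__":
    # Print the commands to run from inside the ASFormer directory.
    for split in SPLITS:
        train, predict = asformer_commands(split)
        print(shlex.join(train))
        print(shlex.join(predict))
```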

EgoPER (Action Segmentation and Error Detection)

Detailed instructions are in EgoPER, which contains EgoPER scripts modified from the original.

To use these modified scripts, the submodules must be initialized.

If you cloned this repo without submodules, run:

git submodule update --init --recursive

To clone with the submodules, run:

git clone --recurse-submodules git@github.com:HCIS-Lab/counterfactual-video-pretrain.git
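
To check whether the submodules were actually fetched, one option is to read the path entries in .gitmodules and look for directories that are missing or empty. This is a minimal sketch: missing_submodules is a hypothetical helper, and the submodule names in the demo are made up rather than taken from this repository.

```python
import re
import tempfile
from pathlib import Path

# Matches lines such as "    path = some/dir" in a .gitmodules file.
_PATH_LINE = re.compile(r"^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def missing_submodules(repo_root):
    """Return submodule paths from .gitmodules that are absent or empty."""
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return []
    missing = []
    for rel_path in _PATH_LINE.findall(gitmodules.read_text()):
        target = root / rel_path
        # An uninitialized submodule is typically left as an empty directory.
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(rel_path)
    return missing


if __name__ == "__main__":
    # Demo on a throwaway directory with made-up submodule names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / ".gitmodules").write_text(
            '[submodule "sub_a"]\n\tpath = sub_a\n\turl = https://example.com/a.git\n'
            '[submodule "sub_b"]\n\tpath = sub_b\n\turl = https://example.com/b.git\n'
        )
        (root / "sub_a").mkdir()
        (root / "sub_a" / "README").write_text("populated\n")
        (root / "sub_b").mkdir()  # left empty, as after a plain clone
        print(missing_submodules(root))
```

If the returned list is not empty, running git submodule update --init --recursive as shown above should populate the listed directories.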

AE2 Action Phase Recognition

python AE2/AE2_phase_cls.py

Zero-Shot Downstream Task Testing

EpicKitchen-100 Zero-Shot Multi-Instance Retrieval

python downstream_script/test_epic.py

Charades-Ego Zero-Shot Action Classification

python downstream_script/test_charades.py

AE2 Zero-Shot Action Phase Frame Retrieval

python AE2/AE2_frame_retrieval.py

Citation

If you use our code or method, please cite the following paper:

@InProceedings{counterfactual_ICCV_2025,
    author    = {Kung, Chi-Hsi and Ramirez, Frangil and Ha, Juhyung and Chen, Yi-Ting and Crandall, David and Tsai, Yi-Hsuan},
    title     = {What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
}

Acknowledgement

The pretraining code and the Charades-Ego and EPIC-KITCHENS test code are based on EgoVLP and HierVL.

The feature extraction code is adapted from Bridge-Prompt.

The temporal action segmentation code is adapted from ASFormer.

The action phase recognition and frame retrieval code is adapted from AE2.