Punching Bag vs. Punching Person: Motion Transferability in Videos [ICCV 25]

August 14, 2025


🎉 (June 25, 2025) The paper was accepted at ICCV 2025.

Benchmark datasets

The detailed video lists and class labels for Syn-TA, Kinetics400-TA (K400-TA), and Something-something-v2-TA (SSv2-TA) are provided in dataset_splits and labels. For K400-TA and SSv2-TA, please download the listed subset of videos from the original providers: Kinetics400 and Something-something-v2.

Syn-TA

The Syn-TA dataset videos are available on Hugging Face: https://huggingface.co/datasets/raiyaanabdullah/Syn-TA. If you wish to generate them in Blender, please follow the instructions in GENERATE_SYNTA.md.
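
For a scripted download, a minimal sketch along the following lines may work. It assumes the `huggingface_hub` package is installed and that the dataset repository is public. The target directory `data/Syn-TA` and the helper name `download_syn_ta` are hypothetical choices, not part of this repository.

```python
from pathlib import Path

# Repository id taken from the Hugging Face URL above.
REPO_ID = "raiyaanabdullah/Syn-TA"


def download_syn_ta(local_dir: str = "data/Syn-TA") -> Path:
    """Download a snapshot of the Syn-TA dataset repo (hypothetical helper).

    The import is done lazily so that this module can be loaded even
    when huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",   # Syn-TA is hosted as a dataset repo
        local_dir=local_dir,   # assumed output location
    )
    return Path(path)
```

Calling `download_syn_ta()` should return the local folder containing the downloaded files; the folder layout inside it is whatever the dataset repository provides.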

Training

The training configurations for all models are available in configs. Please see INSTRUCTIONS.md for more details.

Results

Absolute drop and harmonic mean of known (CoarseMotion-KC) and unknown (CoarseMotion-UC) accuracies (average of two sets) for coarse activities

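| Model | Syn-TA Known ↑ | Syn-TA Unknown ↑ | Syn-TA D_abs ↓ | Syn-TA HM ↑ | K400-TA Known ↑ | K400-TA Unknown ↑ | K400-TA D_abs ↓ | K400-TA HM ↑ | SSv2-TA Known ↑ | SSv2-TA Unknown ↑ | SSv2-TA D_abs ↓ | SSv2-TA HM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unimodal Models | | | | | | | | | | | | |
| ResNet50 | 66.66 | 29.93 | 36.73 | 41.30 | 76.49 | 46.21 | 30.28 | 57.59 | 45.07 | 26.08 | 18.99 | 33.01 |
| I3D | 80.50 | 37.51 | 42.99 | 51.17 | 76.89 | 47.25 | 29.63 | 58.49 | 59.60 | 34.40 | 25.20 | 43.53 |
| X3D | 93.71 | 58.45 | 35.25 | 71.79 | 81.23 | 49.88 | 31.35 | 61.78 | 72.73 | 41.81 | 30.92 | 53.05 |
| SlowFast | 89.27 | 46.86 | 42.41 | 61.45 | 81.70 | 50.33 | 31.37 | 62.26 | 57.67 | 35.15 | 22.51 | 43.60 |
| MViTv2 | 63.69 | 43.23 | 20.46* | 51.50 | 68.88 | 45.06 | 23.81 | 54.47 | 54.31 | 32.37 | 21.93 | 40.49 |
| Rev-MViT | 65.53 | 38.02 | 27.51 | 47.98 | 59.40 | 40.54 | 18.86* | 48.16 | 34.64 | 21.72 | 12.92* | 26.68 |
| AIM | 99.13* | 70.16* | 28.97 | 82.17* | 95.04 | 63.73 | 31.31 | 76.29 | 79.94* | 45.82* | 34.12 | 58.18* |
| UniformerV2 | 97.96 | 51.20 | 46.76 | 67.25 | 93.56 | 62.29 | 31.27 | 74.77 | 58.16 | 33.20 | 24.96 | 42.25 |
| Multimodal Models | | | | | | | | | | | | |
| ActionCLIP | 96.29 | 55.33 | 40.95 | 70.27 | 93.24 | 62.24 | 31.00 | 74.60 | 64.10 | 36.66 | 27.44 | 46.56 |
| X-CLIP | 85.04 | 47.83 | 37.21 | 61.22 | 92.69 | 61.47 | 31.22 | 73.90 | 69.49 | 40.10 | 29.39 | 50.74 |
| ViFi-CLIP | 79.67 | 35.46 | 44.21 | 49.01 | 93.24 | 60.44 | 32.80 | 73.31 | 58.69 | 30.69 | 27.99 | 40.22 |
| EZ-CLIP | 98.30 | 52.43 | 45.87 | 68.38 | 86.88 | 66.70 | 20.18 | 75.43 | 62.55 | 34.84 | 27.70 | 44.72 |
| FROSTER | 89.42 | 31.80 | 57.61 | 46.91 | 95.99* | 69.23* | 26.76 | 80.42* | 57.65 | 30.68 | 26.97 | 39.98 |
| Domain Generalization Methods | | | | | | | | | | | | |
| VideoDG | 98.07 | 43.43 | 54.64 | 60.17 | 86.11 | 53.95 | 32.15 | 66.27 | 57.25 | 31.54 | 25.71 | 40.63 |
| STDN | 70.66 | 23.97 | 46.69 | 35.72 | 68.11 | 46.10 | 22.01 | 54.89 | 35.93 | 22.31 | 13.62 | 27.51 |
| CIR | 60.13 | 9.59 | 50.54 | 16.41 | 68.53 | 12.66 | 55.87 | 21.34 | 48.01 | 31.97 | 16.04 | 38.37 |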
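
To make the D_abs and HM columns easier to interpret, here is a minimal sketch of how they could be computed from a pair of known and unknown accuracies. The formulas are an assumption inferred from the reported numbers (D_abs as known minus unknown, HM as the standard harmonic mean of the two); the function names are hypothetical and not taken from this repository.

```python
def absolute_drop(known: float, unknown: float) -> float:
    """Absolute accuracy drop from known to unknown contexts (assumed: known - unknown)."""
    return known - unknown


def harmonic_mean(known: float, unknown: float) -> float:
    """Harmonic mean of the known and unknown accuracies."""
    total = known + unknown
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * known * unknown / total


# ResNet50 on Syn-TA, coarse activities (values from the results table).
known_acc, unknown_acc = 66.66, 29.93
print(f"D_abs = {absolute_drop(known_acc, unknown_acc):.2f}")
print(f"HM    = {harmonic_mean(known_acc, unknown_acc):.2f}")
```

Because the reported values are averages over two sets, recomputing a metric from the averaged accuracies can differ from the table in the last decimal place (for example, this HM comes out near 41.31 against the reported 41.30).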

Absolute drop and harmonic mean of known (CoarseMotion-KC) and unknown (CoarseMotion-UC) accuracies (average of two sets) for fine activities

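| Model | Syn-TA Known ↑ | Syn-TA Unknown ↑ | Syn-TA D_abs ↓ | Syn-TA HM ↑ | K400-TA Known ↑ | K400-TA Unknown ↑ | K400-TA D_abs ↓ | K400-TA HM ↑ | SSv2-TA Known ↑ | SSv2-TA Unknown ↑ | SSv2-TA D_abs ↓ | SSv2-TA HM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActionCLIP | 88.01 | 38.81* | 49.19* | 53.85* | 87.75 | 41.52 | 46.23 | 56.20 | 59.72 | 25.84 | 33.88 | 36.03 |
| X-CLIP | 75.20 | 22.90 | 52.29 | 34.98 | 89.06* | 48.11 | 40.95 | 62.37 | 65.31* | 26.53 | 38.78 | 37.69 |
| ViFi-CLIP | 69.27 | 19.91 | 49.36 | 30.79 | 88.91 | 26.70 | 62.21 | 40.97 | 52.13 | 26.28 | 25.85 | 34.93 |
| EZ-CLIP | 89.54* | 24.89 | 64.64 | 38.71 | 83.76 | 73.95 | 9.81* | 78.47 | 59.83 | 29.73* | 30.09 | 39.70* |
| FROSTER | 85.44 | 20.68 | 64.76 | 33.26 | 88.93 | 74.11* | 14.82 | 80.81* | 50.34 | 24.99 | 25.35* | 33.34 |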