HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness
October 20, 2024 · View on GitHub
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness
Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
NeurIPS, 2024
project page | arxiv | bibtex
News
-
10/20/2024 For those interested in training and inference of HOI-Swap, the corresponding code can be accessed upon request. Please fill out the form here for more details.
-
10/16/2024 Unfortunately, we are unable to release the HOI-Swap pre-trained checkpoints due to legal constraints. However, the HOI-Swap edit benchmark and evaluation code are now available here. Stay tuned for the training and inference code.
HOI-Swap edit benchmark
The benchmark includes both image and video editing tasks. You can download the data here. The source images/videos and reference object images used as model input for editing are provided. Alongside, we provide HOI-Swap's generated results together with baseline approaches, thanks to their open-source availability! See the sections below for more details.
Image editing
The evaluation set for image editing includes 1,250 source images, each paired with four reference object images, resulting in a total of 5,000 edited images. images_hoi4d contains 1000 images from HOI4D, and images_egoexo4d contains 250 EgoExo4D images. We provide the results from three baseline methods alongside HOI-Swap. Additionally, our evaluation requires using the hand object detector. To simplify the process, we've already included preprocessed detection results (found in the hand_det folder).
Evaluation: Run evaluation/eval_image.py for quantitative evaluations (Table 1 of the paper).
Baselines:
- Paint-by-example: https://github.com/Fantasy-Studio/Paint-by-Example
- AnyDoor: https://github.com/ali-vilab/AnyDoor
- Affordance Diffusion: https://github.com/NVlabs/affordance_diffusion
Video editing
The video editing evaluation set consists of 25 source videos, each combined with four reference object images, yielding 100 unique edited videos. videos_hoi4d contains 17 videos from HOI4D, and videos_ood contains 8 videos from TCN Pouring and EPIC-Kitchens, demonstrating zero-shot generalization capabilities.
We also provide preprocessed detection results using the hand object detector, available in the hand_det_video folder.
Evaluation:
- Run VBench with
--dimension subject_consistency motion_smoothness --mode custom_inputfor the first 2 metrics in Table 1 - Run
evaluation/eval_video.pyfor the last 3 metrics in Table 1.
Baselines:
- AnyDoor for every frame: https://github.com/ali-vilab/AnyDoor
- AnyDoor + AnyV2V: https://github.com/TIGER-AI-Lab/AnyV2V
- VideoSwap: https://github.com/showlab/VideoSwap
Disclaimer
This repository provides a personal reproduction of HOISwap, completed independently at the University of Texas at Austin. The codebase is released as a personal project and is not affiliated with any external organizations.
Citation
If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.
@article{xue2024hoi,
title={HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness},
author={Xue, Zihui and Luo, Mi and Chen, Changan and Grauman, Kristen},
journal={arXiv preprint arXiv:2406.07754},
year={2024}
}