HOI-Synth
April 1, 2026

Overview
We investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction detection. Through extensive experiments on VISOR, EgoHOS, and ENIGMA-51, we show how to exploit synthetic data when real labeled data are scarce, achieving gains of +5.67% (VISOR), +8.24% (EgoHOS), and +11.69% (ENIGMA-51) with only 10% real labels. We conduct a systematic study on data alignment, demonstrating that aligning objects, grasps, and environments to the target domain is essential for bridging the synthetic-to-real gap. Our analysis is supported by a novel generation pipeline and the HOI-Synth benchmark, which augments existing datasets with synthetic images automatically labeled with contact states, bounding boxes, and pixel-wise masks.
Project Page - Conference version - Journal version
Citation
If you use our HOI-Synth benchmark, data generation pipeline or this code for your research, please cite our paper:
@inproceedings{leonardi2025synthetic,
  title={Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?},
  author={Leonardi, Rosario and Furnari, Antonino and Ragusa, Francesco and Farinella, Giovanni Maria},
  booktitle={European Conference on Computer Vision},
  pages={36--54},
  year={2025},
  organization={Springer}
}
HOI-Synth benchmark
The HOI-Synth benchmark extends three egocentric datasets designed to study hand-object interaction detection, EPIC-KITCHENS VISOR [1], EgoHOS [2], and ENIGMA-51 [3], with automatically labeled synthetic data obtained through the proposed HOI generation pipeline.
Download
Synthetic-Data
You can download the synthetic data at the following links:
The annotations follow the HOS format introduced in the VISOR-HOS GitHub repository. Please refer to that repository for more information.
After downloading, place the images and annotations in their respective folders.
You will find several annotation files available:
- train.json: Contains the complete train annotations.
- val.json: Contains the complete val annotations.
- train_x.json: Contains annotations for specific percentages of data. For example, train_10.json contains annotations for 10% of the data.
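To get a quick overview of one of these annotation files, a small script along the following lines may help. This is a minimal sketch, not part of the released code: it assumes the HOS files are COCO-style JSON with top-level "images", "annotations", and "categories" keys, and the helper name summarize_annotations is hypothetical. Check the VISOR-HOS repository for the authoritative schema.

```python
import json
from collections import Counter


def summarize_annotations(path):
    """Return basic counts for an annotation file.

    Assumption: the file is COCO-style JSON with "images",
    "annotations", and "categories" keys. Missing keys are
    treated as empty lists rather than raising an error.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    images = data.get("images", [])
    annotations = data.get("annotations", [])

    # Map category ids to readable names when the file provides them.
    names = {c["id"]: c["name"] for c in data.get("categories", [])}

    per_category = Counter(
        names.get(a.get("category_id"), str(a.get("category_id")))
        for a in annotations
    )

    return {
        "num_images": len(images),
        "num_annotations": len(annotations),
        "per_category": dict(per_category),
    }
```

Calling summarize_annotations on train.json and on train_10.json and comparing the "num_images" values is one way to confirm that the reduced split was downloaded as expected.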
Additionally, you will find combined annotations (e.g., Synthetic + VISOR). In such cases, move the images from the corresponding real dataset into the appropriate "images" folder.
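After moving the real images, it can be useful to verify that every image referenced by a combined annotation file is actually present. The following is a minimal sketch under stated assumptions: the annotation file is assumed to be COCO-style JSON whose "images" entries carry a "file_name" relative to the images folder, and the function name find_missing_images is hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(annotation_path, images_dir):
    """List image files referenced in the annotations but absent on disk.

    Assumption: each entry in the top-level "images" list has a
    "file_name" field that is a path relative to images_dir.
    """
    with open(annotation_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    images_dir = Path(images_dir)
    missing = []
    for image in data.get("images", []):
        name = image["file_name"]
        if not (images_dir / name).is_file():
            missing.append(name)

    # Sorted output makes the report stable across runs.
    return sorted(missing)
```

An empty result suggests that the synthetic and real images were merged correctly; any names returned point to files that still need to be copied into the "images" folder.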
For the ENIGMA-51 synthetic images (enigma-51_synth), there are three folders containing the different synthetic data used in the experiments (see the paper for more information):
- In-domain
- Out-domain
- Out-domain with FOV of the target dataset
Aligned Data
For experiments focused on data alignment, please download the aligned set here:
EPIC-KITCHENS VISOR
To download the data and the corresponding annotations for EPIC-KITCHENS VISOR, follow this link: EPIC-KITCHENS VISOR Data Preparation.
EgoHOS
To download the images of EgoHOS, follow this link: EgoHOS.
We have converted the annotations into the HOS format, which can be downloaded at the following link: EgoHOS Annotations.
ENIGMA-51
You can download the ENIGMA-51 data at the following links:
For more information, visit the official ENIGMA-51 website.
Data Generation Pipeline
Baselines
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
Acknowledgements
This research has been supported by the project Future Artificial Intelligence Research (FAIR) – PNRR MUR Cod. PE0000013 - CUP: E63C22001940006.
This research has been partially supported by the project EXTRA-EYE - PRIN 2022 - CUP E53D23008280006 - Funded by the European Union - Next Generation EU.
References
- [1] Darkhalil, A., Shan, D., Zhu, B., Ma, J., Kar, A., Higgins, R., Fidler, S., Fouhey, D., Damen, D.: Epic-kitchens visor benchmark: Video segmentations and object relations. In: NeurIPS. pp. 13745–13758 (2022)
- [2] Zhang, L., Zhou, S., Stent, S., Shi, J.: Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In: ECCV. pp. 127–145 (2022)
- [3] Ragusa, F., Leonardi, R., Mazzamuto, M., Bonanno, C., Scavo, R., Furnari, A., Farinella, G.M.: ENIGMA-51: Towards a fine-grained understanding of human behavior in industrial scenarios. In: WACV. pp. 4549–4559 (2024)
- [4] Wang, R., Zhang, J., Chen, J., Xu, Y., Li, P., Liu, T., Wang, H.: Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. In: CVPR. pp. 11359–11366 (2023)