MMITF
November 24, 2025
This project uses a Transformer model with Intermodality Attention to predict which object a person is pointing at in 2D images. It models the relationship between hand configurations—captured through hand landmarks—and candidate object locations to infer the intended target in tabletop scenarios.
Figure: Experimental setup used for data collection with the NICOL robot (see paper) on the left, and corresponding annotated results on the right. Images captured in a lab of the Knowledge Technology research group, Department of Informatics, University of Hamburg.
Features
- Transformer-based pointing gesture interpretation
- Intermodality Attention to fuse hand and object representations
- Labeling GUI for efficient dataset annotation
- Evaluation pipeline with metric tracking
- Visualization tools for model outputs and errors
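Conceptually, Intermodality Attention lets hand-landmark tokens attend over candidate-object tokens so that each hand representation is fused with the objects it is most compatible with. The following is a minimal, dependency-free sketch of single-head cross-attention; the token names and dimensions are illustrative only, and the actual model uses learned projections and a full Transformer, not this simplified form.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(hand_tokens, object_tokens):
    """Single-head intermodality attention sketch: hand-landmark tokens
    (queries) attend over candidate-object tokens (keys and values),
    returning one fused vector per hand token."""
    d = len(object_tokens[0])
    fused = []
    for q in hand_tokens:
        # Scaled dot-product scores between one hand token and all objects
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in object_tokens]
        weights = softmax(scores)
        # Weighted sum of object tokens = fused hand-object representation
        fused.append([sum(w * v[j] for w, v in zip(weights, object_tokens))
                      for j in range(d)])
    return fused
```

In the full model, queries, keys, and values would pass through learned linear projections and multiple heads; this sketch only shows the fusion mechanism itself.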
Project Structure
```
MMITF/
├── assets/                  # Experimental setup image
├── data/                    # Central data directory for all scripts
│   ├── images/              # Raw images of participants pointing
│   │   ├── scene1/          # Example: participant 1's image sequence
│   │   │   ├── img1.jpg
│   │   │   ├── img2.jpg
│   │   │   └── ...
│   ├── data_set/            # Processed dataset pickles (keypoints, labels, etc.)
│   ├── training_results/    # Model checkpoints and training logs
│   ├── evaluation_results/  # Evaluation results and metrics
│   ├── runs/                # TensorBoard or additional training logs
│   ├── labeling/            # Output of manual labeling
│   │   ├── labeled_data/    # Pickle files of labeled feature data per participant
│   │   └── labels/          # CSV files with gesture annotations (start, end, label)
├── data_loader/             # Dataset loading and collation utilities
├── evaluation/              # Model evaluation logic and metrics
├── features/                # Feature extraction and preprocessing
├── label/                   # Labeling GUI tools
├── models/                  # Core models for 2MITF and 3MITF
├── scripts/                 # Entry-point scripts
├── train/                   # Model training logic
├── visualization/           # Run evaluation on selected samples and save annotated images
├── config.yaml              # Central point to control paths and hyperparameters
├── README.md                # Project overview and usage instructions
└── requirements.txt         # Python package dependencies
```
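The label files under `data/labeling/labels/` store gesture annotations as (start, end, label) rows, per the tree above. A minimal sketch of reading such a CSV with the standard library; the exact column names are an assumption based on that description, not confirmed from the project code:

```python
import csv
import io

def read_gesture_labels(fileobj):
    """Parse gesture annotation rows into (start, end, label) tuples.
    Assumes hypothetical columns 'start', 'end', 'label'."""
    return [(int(row["start"]), int(row["end"]), row["label"])
            for row in csv.DictReader(fileobj)]

# Hypothetical example content of one label CSV
sample = "start,end,label\n12,48,point_left\n60,95,point_right\n"
labels = read_gesture_labels(io.StringIO(sample))
```

In practice you would open the per-participant CSV file instead of the inline sample string.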
Setup
- Clone the repository:

  ```shell
  git clone git@github.com:lmuellercode/MMITF.git
  cd MMITF
  ```

- Create and activate a virtual environment:

  ```shell
  python -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
Usage
Generate Annotation Labels (Action Labels)

```shell
python run_label_actions.py
```

Create Training Targets for Model Training

```shell
python -m scripts.run_labeling
```
Train the Model

```shell
python -m scripts.run_training
```

```
usage: run_training.py [-h] --model {2mitf,3mitf} [--config CONFIG]

Train a MMITF model variant (2mod or 3mod) across all splits using the provided configuration.

options:
  -h, --help            show this help message and exit
  --model {2mitf,3mitf}
                        Choose the model architecture
  --config CONFIG       Path to config YAML
```
Evaluate the Model

```shell
python -m scripts.run_evaluation
```

```
usage: run_evaluation.py [-h] --model {2mitf,3mitf} [--config CONFIG]

Evaluate a model trained with MMITF.

options:
  -h, --help            show this help message and exit
  --model {2mitf,3mitf}
                        Model variant to evaluate
  --config CONFIG       YAML config path (default: config.yaml)
```
Visualize Results

```shell
python -m scripts.run_visualize_results
```

```
usage: run_visualize_results [-h] [-c CONFIG] (-s SELECT | --test-folders) [-o OUTPUT] [-v]

Filter dataset or list test folders.

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        YAML config path (default: config.yaml)
  -s SELECT, --select SELECT
                        KEY:RANGE specs, e.g. 3:0-4; allows multi-folder selection, e.g. -s 3:0-4 -s 16:1,5,7,9-10
  --test-folders        List available test-folder keys and exit.
  -o OUTPUT, --output OUTPUT
                        Pickle output path for filtered data.
  -v, --verbose         Enable verbose output.
```
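The `-s KEY:RANGE` spec above maps a test-folder key to a set of sample indices, supporting both single indices and inclusive ranges (e.g. `3:0-4` or `16:1,5,7,9-10`). A rough sketch of how such a spec could be parsed; the function name is illustrative and not the project's actual parser:

```python
def parse_select(spec):
    """Parse a 'KEY:RANGE' spec like '16:1,5,7,9-10' into
    (folder_key, sorted list of sample indices)."""
    key, _, ranges = spec.partition(":")
    indices = set()
    for part in ranges.split(","):
        if "-" in part:
            # Inclusive range, matching the '3:0-4' example in the help text
            lo, hi = map(int, part.split("-"))
            indices.update(range(lo, hi + 1))
        else:
            indices.add(int(part))
    return int(key), sorted(indices)
```

Repeated `-s` flags would simply produce one such (key, indices) pair per spec.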
Make Confusion Matrices

```shell
python3 -m scripts.run_training
```
Related Work
This project builds upon insights from the following works:
- Ji et al. Detecting Human-Object Relationships in Videos
- Zhang et al. MediaPipe Hands: On-device Real-time Hand Tracking
- Minderer et al. Scaling Open-Vocabulary Object Detection
- Kerzel et al. NICOL: A Neuro-inspired Collaborative Semi-humanoid Robot that Bridges Social Interaction and Reliable Manipulation
Dataset
Once the image dataset receives clearance, the link will be published here.
Find the curated datasets and example model weights here: https://drive.google.com/drive/folders/1eQ6xgL9h2PK6dhLS_3K_AiBbNbNCFh3d?usp=sharing
Images will follow soon.
Unpack the files, place the folders in `data/`, then run the scripts!
📥 Access the Paper
📘 Publisher Notice
Use of this Accepted Version is subject to Springer Nature’s terms:
https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms
Please cite the Version of Record.
📚 How to Cite
If you use this code, data, or results in your research, please cite:
```bibtex
@InProceedings{10.1007/978-3-032-04552-2_10,
  author    = {Müller, Luca and Ali, Hassan and Allgeuer, Philipp and Gajdošech, Lukáš and Wermter, Stefan},
  editor    = {Senn, Walter and Sanguineti, Marcello and Saudargienė, Aušra and Tetko, Igor V. and Villa, Alessandro E. P. and Jirsa, Viktor and Bengio, Yoshua},
  title     = {Pointing-Guided Target Estimation via Transformer-Based Attention},
  booktitle = {Artificial Neural Networks and Machine Learning. ICANN 2025 International Workshops and Special Sessions},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {85--97},
  isbn      = {978-3-032-04552-2}
}
```
Notes
Roadmap:
- Tidy up redundant code
- Evaluation logic
- Some utility methods
- Overall polish
- Add Baseline
- Add dataset analysis functionality (heatmaps)
License
This project is licensed under the MIT License, except for the following file:
Accepted Manuscript (PDF)
That file is not covered by the MIT License. It is shared under the publisher’s terms for Accepted Manuscripts:
https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms
Maintained by Luca Müller