MMITF

This project uses a Transformer model with Intermodality Attention to predict which object a person is pointing at in 2D images. It models the relationship between hand configurations—captured through hand landmarks—and candidate object locations to infer the intended target in tabletop scenarios.

Experimental Setup

Figure: Experimental setup used for data collection with the NICOL robot (see paper) on the left, and corresponding annotated results on the right. Images captured in a lab of the Knowledge Technology research group, Department of Informatics, University of Hamburg.

Features

  • Transformer-based pointing gesture interpretation
  • Intermodality Attention to fuse hand and object representations (see the sketch below)
  • Labeling GUI for efficient dataset annotation
  • Evaluation pipeline with metric tracking
  • Visualization tools for model outputs and errors
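
The fusion step can be pictured as cross-attention between the two modalities: candidate-object tokens query the hand-landmark tokens, and the attended representation is scored per object. The following is a minimal, illustrative PyTorch sketch; the module name, input formats, and dimensions are assumptions for exposition, not the actual MMITF implementation.

# Minimal illustrative sketch of intermodality (cross-) attention.
# Input formats and dimensions are assumptions, not the actual MMITF code.
import torch
import torch.nn as nn

class IntermodalityAttentionSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.hand_proj = nn.Linear(3, d_model)   # per-landmark (x, y, z) -- assumed input format
        self.obj_proj = nn.Linear(2, d_model)    # per-object (x, y) image location -- assumed
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)       # one logit per candidate object

    def forward(self, hand_landmarks, object_positions):
        # hand_landmarks: (batch, n_landmarks, 3); object_positions: (batch, n_objects, 2)
        hand_tokens = self.hand_proj(hand_landmarks)
        obj_tokens = self.obj_proj(object_positions)
        # Object tokens attend to the hand representation (intermodality attention).
        fused, _ = self.cross_attn(query=obj_tokens, key=hand_tokens, value=hand_tokens)
        return self.score(fused).squeeze(-1)     # (batch, n_objects) target logits

# Example: a batch of 2 samples with 21 hand landmarks and 5 candidate objects.
logits = IntermodalityAttentionSketch()(torch.randn(2, 21, 3), torch.rand(2, 5, 2))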

Project Structure

MMITF/
├── assets/                        # Experimental Setup image
├── data/                          # Central data directory for all scripts
│   ├── images/                    # Raw images of participants pointing
│   │   ├── scene1/                # Example: participant 1's image sequence
│   │   │   ├── img1.jpg
│   │   │   ├── img2.jpg
│   │   │   └── ...
│   ├── data_set/                  # Processed dataset pickles (keypoints, labels, etc.)
│   ├── training_results/          # Model checkpoints and training logs
│   ├── evaluation_results/        # Evaluation results and metrics
│   ├── runs/                      # TensorBoard or additional training logs
│   ├── labeling/                  # Output of manual labeling
│   │   ├── labeled_data/          # Pickle files of labeled feature data per participant
│   │   └── labels/                # CSV files with gesture annotations (start, end, label)
├── data_loader/                   # Dataset loading and collation utilities
├── evaluation/                    # Model evaluation logic and metrics
├── features/                      # Feature extraction and preprocessing
├── label/                         # Labeling GUI tools
├── models/                        # Core models for 2MITF and 3MITF
├── scripts/                       # Entry point scripts
├── train/                         # Model training logic
├── visualization/                 # Run evaluation on specified samples and save annotated images
├── config.yaml                    # Central point to control paths and hyperparameters
├── README.md                      # Project overview and usage instructions
└── requirements.txt               # Python package dependencies
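
All scripts read their paths and hyperparameters from config.yaml. A minimal sketch of how such a config is typically consumed is shown below; the keys used are illustrative placeholders, not the actual schema.

# Illustrative sketch of reading the central config.yaml.
# The keys below (data_dir, training.learning_rate) are placeholders, not the real schema.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir = cfg.get("data_dir", "data")                               # hypothetical key
learning_rate = cfg.get("training", {}).get("learning_rate", 1e-4)   # hypothetical key
print(data_dir, learning_rate)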

Setup

  1. Clone the repository:
git clone git@github.com:lmuellercode/MMITF.git
cd MMITF
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  
  3. Install dependencies:
pip install -r requirements.txt

Usage

Generate Annotation Labels (Action Labels)

python run_label_actions.py

Create Training Targets for Model Training

python -m scripts.run_labeling

Train the Model

python -m scripts.run_training

usage: run_training.py [-h] --model {2mitf,3mitf} [--config CONFIG]

Train a MMITF model variant (2mod or 3mod) across all splits using the provided configuration.

options:
  -h, --help            show this help message and exit
  --model {2mitf,3mitf}
                        Choose the model architecture
  --config CONFIG
                        Path to config YAML
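
For example, to train the two-modality variant with the default configuration:

python -m scripts.run_training --model 2mitf --config config.yaml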


Evaluate the Model

python -m scripts.run_evaluation


usage: run_evaluation.py [-h] --model {2mitf,3mitf} [--config CONFIG]

Evaluate a model trained with MMITF.

options:
  -h, --help            show this help message and exit
  --model {2mitf,3mitf}
                        Model variant to evaluate
  --config CONFIG
                        YAML config path (default: config.yaml)
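
For example, to evaluate the two-modality variant:

python -m scripts.run_evaluation --model 2mitf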


Visualize Results

python -m scripts.run_visualize_results

usage: run_visualize_results [-h] [-c CONFIG] (-s SELECT | --test-folders) [-o OUTPUT] [-v]

Filter dataset or list test folders.

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        YAML config path (default: config.yaml)
  -s SELECT, --select SELECT
                        KEY:RANGE specs, e.g. 3:0-4; repeat the flag to select multiple folders, e.g. -s 3:0-4 -s 16:1,5,7,9-10
  --test-folders        
                        List available test-folder keys and exit.
  -o OUTPUT, --output OUTPUT
                        Pickle output path for filtered data.
  -v, --verbose         
                        Enable verbose output.
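
For example, to list the available test-folder keys and then visualize samples 0-4 of folder 3 with verbose output:

python -m scripts.run_visualize_results --test-folders
python -m scripts.run_visualize_results -s 3:0-4 -v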

Make Confusion Matrices

python3 -m scripts.run_training

usage: run_training.py [-h] --model {2mitf,3mitf} [--config CONFIG]

Train a MMITF model variant (2mod or 3mod) across all splits using the provided configuration.

options:
  -h, --help            show this help message and exit
  --model {2mitf,3mitf}
                        Choose the model architecture
  --config CONFIG       
                        Path to config YAML

Dataset

Once the image dataset receives clearance, the link will be published here.

The curated datasets and example model weights are available here: https://drive.google.com/drive/folders/1eQ6xgL9h2PK6dhLS_3K_AiBbNbNCFh3d?usp=sharing

Images will follow soon.

Unpack the files, place the resulting folders in data/, and then run the scripts.
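
For example, assuming the download is an archive named mmitf_data.zip (hypothetical name; use the actual files from the link above):

unzip mmitf_data.zip -d data/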

📥 Access the Paper

Accepted Manuscript:
arXiv

Version of Record:
Springer

📘 Publisher Notice

Use of this Accepted Version is subject to Springer Nature’s terms:
https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms

Please cite the Version of Record.

📚 How to Cite

If you use this code, data, or results in your research, please cite:

@InProceedings{10.1007/978-3-032-04552-2_10,
  author    = {Müller, Luca and Ali, Hassan and Allgeuer, Philipp and Gajdošech, Lukáš and Wermter, Stefan},
  editor    = {Senn, Walter and Sanguineti, Marcello and Saudargienė, Aušra and Tetko, Igor V. and Villa, Alessandro E. P. and Jirsa, Viktor and Bengio, Yoshua},
  title     = {Pointing-Guided Target Estimation via Transformer-Based Attention},
  booktitle = {Artificial Neural Networks and Machine Learning. ICANN 2025 International Workshops and Special Sessions},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {85--97},
  isbn      = {978-3-032-04552-2}
}

Notes

Roadmap:

  • Tidy up redundant code
    • Evaluation logic
    • Some utility methods
  • Overall polish
  • Add Baseline
  • Add Dataset Analysis functionalities (Heatmaps)

License

This project is licensed under the MIT License, except for the following file:

Accepted Manuscript (PDF)

That file is not covered by the MIT License. It is shared under the publisher’s terms for Accepted Manuscripts:
https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms


Maintained by Luca Müller