@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology (WACV 2025)

October 14, 2024

by Xin Jiang*, Junwei Zheng*, Ruiping Liu, Jiahang Li, Jiaming Zhang†, Sven Matthiesen, Rainer Stiefelhagen

* denotes equal contribution and † denotes corresponding author

News

  • [2024.09.17] ATBench (Assistive Technology Benchmark) has been accepted to WACV 2025.
  • [2024.10.13] We are excited to release the ATModel (Assistive Technology Model) training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md).

[Figure: pipeline]

Introduction

[Figure: multi_task_result]

ATBench is designed on the basis of a pre-design user study with people with visual impairments (PVIs). It covers the five vision-language tasks the study identified as most crucial: Panoptic Segmentation, Image Captioning, Visual Question Answering (VQA), Depth Estimation, and Optical Character Recognition (OCR). We also propose a novel ATModel that can address all five tasks simultaneously.

More details can be found in our arXiv paper.

Getting Started

Checkpoints and Numbers:

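| Model | PS (ADE-150) PQ | DE (NYU-V2) RMSE | OCR (6 datasets avg) Acc(%) | IC (VizWiz_Cap) CIDEr | VQA (VizWiz_VQA) Acc(%) | #Params |
|---|---|---|---|---|---|---|
| Unified-IO (S) | - | 0.649 | - | - | 42.4 | 71M |
| Unified-IO (B) | - | 0.469 | - | - | 45.8 | 241M |
| Unified-IO (L) | - | 0.402 | - | - | 47.7 | 776M |
| X-Decoder (T) | 41.6 | - | - | - | - | 164M |
| GIT (T) | - | - | - | 113.1 | 68.0 | 0.7B |
| PaLI (T) | - | - | - | 117.2 | 67.5 | 3.0B |
| ATModel | 38.5 | 0.425 | 80.1 | 52.5 | 53.7 | 62M |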
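To make two of the reported numbers easier to read, here is a minimal sketch of how an RMSE score (depth estimation, lower is better) and an averaged accuracy (OCR, "6 datasets avg") are commonly computed. This is not the repository's evaluation code. The function names `rmse` and `mean_accuracy`, the toy values, and the choice of an unweighted mean across datasets are all assumptions made for illustration; see EVALUATION.md for the actual procedure.

```python
import math


def rmse(pred, target):
    """Root mean squared error over paired values (lower is better)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_accuracy(per_dataset_acc):
    """Unweighted mean of per-dataset accuracies in percent.

    Assumption: every dataset counts equally, regardless of its size.
    """
    if not per_dataset_acc:
        raise ValueError("per_dataset_acc must not be empty")
    return sum(per_dataset_acc.values()) / len(per_dataset_acc)


if __name__ == "__main__":
    # Hypothetical toy depth values, not taken from NYU-V2.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    # Hypothetical per-dataset accuracies; the dataset names are placeholders.
    print(mean_accuracy({"dataset_a": 80.0, "dataset_b": 90.0}))
```

Because RMSE is an error measure while PQ, accuracy, and CIDEr are scores, the DE column is the only one in the table where a smaller value is better.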

Installation, Dataset, Training and Evaluation Guide: see INSTALL.md, DATASET.md, TRAIN.md, and EVALUATION.md.

Acknowledgement

  • Our work builds on X-Decoder and uses its code. We thank the X-Decoder authors for open-sourcing their repository.

Citation

If you find our work useful in your research, please cite:

@inproceedings{jiang2025atbench,
title={@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology},
author={Jiang, Xin and Zheng, Junwei and Liu, Ruiping and Li, Jiahang and Zhang, Jiaming and Matthiesen, Sven and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2025}
}