TutorialVQAData
Data for our LREC 2020 paper, TutorialVQA: Question Answering Dataset for Tutorial Videos. For the videos, here is a mirror in case the original links don't work.
License
TutorialVQAData is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0); see LICENSE.md.
Citation
If you use this dataset in your research, please cite our LREC 2020 paper:
@inproceedings{colas-etal-2020-tutorialvqa,
title = "{T}utorial{VQA}: Question Answering Dataset for Tutorial Videos",
author = "Colas, Anthony and
Kim, Seokhwan and
Dernoncourt, Franck and
Gupte, Siddhesh and
Wang, Zhe and
Kim, Doo Soon",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.670/",
pages = "5450--5455",
language = "eng",
ISBN = "979-10-95546-34-4",
abstract = "Despite the number of currently available datasets on video-question answering, there still remains a need for a dataset involving multi-step and non-factoid answers. Moreover, relying on video transcripts remains an under-explored topic. To adequately address this, we propose a new question answering task on instructional videos, because of their verbose and narrative nature. While previous studies on video question answering have focused on generating a short text as an answer, given a question and video clip, our task aims to identify a span of a video segment as an answer which contains instructional details with various granularities. This work focuses on screencast tutorial videos pertaining to an image editing program. We introduce a dataset, TutorialVQA, consisting of about 6,000 manually collected triples of (video, question, answer span). We also provide experimental results with several baseline algorithms using the video transcripts. The results indicate that the task is challenging and call for the investigation of new algorithms."
}