TutorialVQAData


Data for our LREC 2020 paper, TutorialVQA: Question Answering Dataset for Tutorial Videos. For the videos, a mirror is available in case the original links don't work.

License

TutorialVQAData is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0); see LICENSE.md.

Citation

If you use this dataset in your research, please cite our LREC 2020 paper:


@inproceedings{colas-etal-2020-tutorialvqa,
    title = "{T}utorial{VQA}: Question Answering Dataset for Tutorial Videos",
    author = "Colas, Anthony  and
      Kim, Seokhwan  and
      Dernoncourt, Franck  and
      Gupte, Siddhesh  and
      Wang, Zhe  and
      Kim, Doo Soon",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.670/",
    pages = "5450--5455",
    language = "eng",
    ISBN = "979-10-95546-34-4",
    abstract = "Despite the number of currently available datasets on video-question answering, there still remains a need for a dataset involving multi-step and non-factoid answers. Moreover, relying on video transcripts remains an under-explored topic. To adequately address this, we propose a new question answering task on instructional videos, because of their verbose and narrative nature. While previous studies on video question answering have focused on generating a short text as an answer, given a question and video clip, our task aims to identify a span of a video segment as an answer which contains instructional details with various granularities. This work focuses on screencast tutorial videos pertaining to an image editing program. We introduce a dataset, TutorialVQA, consisting of about 6,000 manually collected triples of (video, question, answer span). We also provide experimental results with several baseline algorithms using the video transcripts. The results indicate that the task is challenging and call for the investigation of new algorithms."
}