KGQA-datasets-generalization

June 29, 2022

License: Apache-2.0

Existing approaches to Question Answering over Knowledge Graphs (KGQA) generalize poorly, often because of the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined: i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets for 5 different Knowledge Graphs (KGs). We show that, according to this definition, many existing KGQA datasets that are available online are either not suited to training a generalizable KGQA system or are based on discontinued and outdated KGs. Generating new datasets is a costly process and thus not an option for smaller research groups and companies. In this work, we propose a mitigation method for re-splitting available KGQA datasets so that they can be used to evaluate generalization, without any cost or manual effort. We test our hypothesis on three KGQA datasets: LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9.

Table of contents

  1. Datasets
    1. Overview
    2. Statistics
    3. Use of the datasets
  2. Reproduction
    1. Requirements
    2. Run Scripts
  3. License

Datasets

Overview

By analyzing 25 existing KGQA datasets, we identify a large gap in how the Semantic Web community evaluates the generalization of KGQA systems. The main goal of this work is to reuse existing datasets from nearly a decade of research to generate new datasets suitable for generalization evaluation. We propose a simple and novel method to achieve this goal, and evaluate both the effectiveness of the method and the quality of the new datasets it generates for generalizable KGQA systems.

Test Existing KGQA Datasets

The table below shows the evaluation results with respect to the three levels of generalization defined in (Gu et al., 2021).

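| Dataset | KG | Year | I.I.D. | Compositional | Zero-Shot |
|---|---|---|---|---|---|
| WebQuestions | Freebase | 2013 | | | |
| SimpleQuestions | Freebase | 2015 | | | |
| ComplexQuestions | Freebase | 2016 | - | - | - |
| GraphQuestions | Freebase | 2016 | | | |
| WebQuestionsSP | Freebase | 2016 | | | |
| The 30M Factoid QA | Freebase | 2016 | | | |
| SimpleQuestionsWikidata | Wikidata | 2017 | | | |
| LC-QuAD 1.0 | DBpedia | 2017 | | | |
| ComplexWebQuestions | Freebase | 2018 | | | |
| QALD-9 | DBpedia | 2018 | | | |
| PathQuestion | Freebase | 2018 | - | - | - |
| MetaQA | WikiMovies | 2018 | - | - | - |
| SimpleDBpediaQA | DBpedia | 2018 | | | |
| TempQuestions | Freebase | 2018 | - | - | - |
| LC-QuAD 2.0 | Wikidata | 2019 | | | |
| FreebaseQA | Freebase | 2019 | - | - | - |
| Compositional Freebase Questions | Freebase | 2020 | | | |
| RuBQ 1.0 | Wikidata | 2020 | - | - | - |
| GrailQA | Freebase | 2020 | | | |
| Event-QA | EventKG | 2020 | - | - | - |
| RuBQ 2.0 | Wikidata | 2021 | - | - | - |
| MLPQ | DBpedia | 2021 | - | - | - |
| Compositional Wikidata Questions | Wikidata | 2021 | | | |
| TimeQuestions | Wikidata | 2021 | - | - | - |
| CronQuestions | Wikidata | 2021 | - | - | - |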
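
The three levels can be made concrete with a small sketch. It assumes the usual reading of Gu et al. (2021): a test question is zero-shot if it uses at least one schema item (relation or class) never seen in training, compositional if all of its schema items were seen but never in this combination, and i.i.d. otherwise. The function name, the representation of a question as a set of schema items, and the example relations are illustrative assumptions, not code from this repository.

```python
from typing import FrozenSet, Iterable, Set


def generalization_level(
    test_items: Iterable[str],
    train_items: Set[str],
    train_combinations: Set[FrozenSet[str]],
) -> str:
    """Classify one test question by the schema items its query uses."""
    items = frozenset(test_items)
    # At least one schema item never appears in training.
    if not items <= train_items:
        return "zero-shot"
    # Every item was seen, but never combined in this way.
    if items not in train_combinations:
        return "compositional"
    # Both the items and their combination were seen in training.
    return "i.i.d."


# Hypothetical training questions, each reduced to its set of schema items.
train_questions = [
    {"dbo:author"},
    {"dbo:birthPlace"},
    {"dbo:author", "dbo:publisher"},
]
train_combinations = {frozenset(q) for q in train_questions}
train_items = set().union(*train_questions)

print(generalization_level({"dbo:author"}, train_items, train_combinations))
print(generalization_level({"dbo:author", "dbo:birthPlace"}, train_items, train_combinations))
print(generalization_level({"dbo:spouse"}, train_items, train_combinations))
```

A real analysis would compare full query structures rather than unordered sets of items; the set-based comparison here is a simplification.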

Statistics

The statistics of the original datasets and their counterparts (*) generated by our approach are shown below.

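| Dataset | Total | Train | Validation | Test | I.I.D. | Compositional | Zero-Shot |
|---|---|---|---|---|---|---|---|
| QALD-9 | 558 | 408 | - | 150 | 46 | 53 | 51 |
| LC-QuAD 1.0 | 5000 | 4000 | - | 1000 | 434 | 559 | 7 |
| LC-QuAD 2.0 | 30221 | 24177 | - | 6044 | 4624 | 948 | 472 |
| QALD-9* | 558 | 385 | - | 173 | 14 | 41 | 118 |
| LC-QuAD 1.0* | 5000 | 3420 | 521 | 1059 | 331 | 1021 | 228 |
| LC-QuAD 2.0* | 30221 | 20321 | 3267 | 6633 | 4014 | 3235 | 2651 |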
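
As a quick sanity check on the split sizes above, the following sketch verifies that train, validation, and test add up to the total for every dataset. The numbers are copied from the table; a missing validation split ("-") is treated as 0, and the function name is illustrative.

```python
# (total, train, validation, test) per dataset, copied from the table above.
# A "-" in the Validation column is represented as 0.
SPLITS = {
    "QALD-9": (558, 408, 0, 150),
    "LC-QuAD 1.0": (5000, 4000, 0, 1000),
    "LC-QuAD 2.0": (30221, 24177, 0, 6044),
    "QALD-9*": (558, 385, 0, 173),
    "LC-QuAD 1.0*": (5000, 3420, 521, 1059),
    "LC-QuAD 2.0*": (30221, 20321, 3267, 6633),
}


def inconsistent_splits(splits):
    """Return the names of datasets whose splits do not sum to the total."""
    return [
        name
        for name, (total, train, validation, test) in splits.items()
        if train + validation + test != total
    ]


print(inconsistent_splits(SPLITS))
```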

Use of the datasets

  • The datasets are available in JSON format.
  • All the datasets are stored in the output_dir directory, where three sub-directories exist for LC-QuAD 1.0, LC-QuAD 2.0 and QALD-9 respectively. In each dataset directory, there are two sub-directories for its original and new versions respectively.
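
A minimal sketch of reading the datasets with the Python standard library is shown here. Only the output_dir root and the JSON format come from the description above; the helper names are illustrative, and because the exact file names and the JSON schema are not stated here, the sketch simply discovers every .json file under the root.

```python
import json
from pathlib import Path
from typing import Any, List


def find_json_files(root) -> List[Path]:
    """List all JSON files below root; empty if root is not a directory."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.json"))


def load_json_dataset(path) -> Any:
    """Parse a single JSON dataset file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


# Print every dataset file found under output_dir with its top-level JSON type.
for json_path in find_json_files("output_dir"):
    data = load_json_dataset(json_path)
    print(json_path, type(data).__name__)
```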

Reproduction

Requirements

  • rdflib==6.0.2
  • datasets==1.16.1
  • scikit-learn==1.0.1
  • numpy==1.20.3
  • pandas==1.3.5
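
Before running the scripts, it can help to confirm that the installed packages match these pins. The following is a small sketch using the standard-library importlib.metadata module (Python 3.8+); the helper name is illustrative and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "rdflib": "6.0.2",
    "datasets": "1.16.1",
    "scikit-learn": "1.0.1",
    "numpy": "1.20.3",
    "pandas": "1.3.5",
}


def find_mismatches(pins):
    """Map each package whose installed version differs from its pin
    to the installed version (None if the package is not installed)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# An empty dict means the environment matches the pins.
print(find_mismatches(PINNED))
```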

This project relies on the kgqa_datasets repository (see link), which must be cloned into the root directory of this project.

Parameters

To ensure reproducibility, we set random_seed to 42 for all three KGQA datasets (LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9).
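
To illustrate why a fixed seed matters, the sketch below shuffles a list with a dedicated, seeded random generator so that repeated runs produce the same order. Whether the scripts in this repository seed Python's random module, NumPy, or scikit-learn is not stated here, so the use of random.Random is an assumption made for illustration only.

```python
import random


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items that is identical for a given seed."""
    shuffled = list(items)
    # A private generator avoids touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled


questions = [f"question-{i}" for i in range(5)]
first_run = seeded_shuffle(questions)
second_run = seeded_shuffle(questions)
print(first_run == second_run)
```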

QALD

  • dataset_id: dataset-qald
  • input_path: data_dir/qald/data_sets.json
  • output_dir: output_dir/qald
  • sampling_ratio_zero: .4
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .1
  • n_splits_compo: 1
  • n_splits_zero: 1
  • validation_size: 0.0

LC-QuAD 1.0

  • dataset_id: dataset-lcquad
  • input_path: data_dir/lcquad/data_sets.json
  • output_dir: output_dir/lcquad
  • sampling_ratio_zero: .6
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .2
  • n_splits_compo: 1
  • n_splits_zero: 1

LC-QuAD 2.0

  • dataset_id: dataset-lcquad2
  • input_path: data_dir/lcquad2/data_sets.json
  • output_dir: output_dir/lcquad2
  • sampling_ratio_zero: .6
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .2
  • n_splits_compo: 1
  • n_splits_zero: 1
  • validation_size: 0.0

Run Scripts

  1. Before re-splitting a given KGQA dataset, preprocess the raw dataset by running the following command:
python preprocess.py --tasks <dataset_name> --data_dir <data_dir> --shuffle True --random_seed 42
  2. Re-split the given dataset by running the following command:
python resplit.py --dataset_id <dataset_id> --input_path <data_dir> --output_dir <output_dir> --sampling_ratio_zero .4 --sampling_ratio_compo .1 --sampling_ratio_iid .1 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0
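
The per-dataset parameters listed in the Parameters section can be turned into resplit.py invocations programmatically. The sketch below only builds and prints the commands; the flag names and values are taken from this README, while the CONFIGS dictionary and the helper name are illustrative. No validation_size is listed for LC-QuAD 1.0, so the flag is omitted for that dataset rather than guessed.

```python
import shlex
import sys

# Parameter values copied from the Parameters section.
CONFIGS = {
    "dataset-qald": {
        "dataset_id": "dataset-qald",
        "input_path": "data_dir/qald/data_sets.json",
        "output_dir": "output_dir/qald",
        "sampling_ratio_zero": 0.4,
        "sampling_ratio_compo": 0.1,
        "sampling_ratio_iid": 0.1,
        "n_splits_compo": 1,
        "n_splits_zero": 1,
        "validation_size": 0.0,
    },
    "dataset-lcquad": {
        "dataset_id": "dataset-lcquad",
        "input_path": "data_dir/lcquad/data_sets.json",
        "output_dir": "output_dir/lcquad",
        "sampling_ratio_zero": 0.6,
        "sampling_ratio_compo": 0.1,
        "sampling_ratio_iid": 0.2,
        "n_splits_compo": 1,
        "n_splits_zero": 1,
    },
    "dataset-lcquad2": {
        "dataset_id": "dataset-lcquad2",
        "input_path": "data_dir/lcquad2/data_sets.json",
        "output_dir": "output_dir/lcquad2",
        "sampling_ratio_zero": 0.6,
        "sampling_ratio_compo": 0.1,
        "sampling_ratio_iid": 0.2,
        "n_splits_compo": 1,
        "n_splits_zero": 1,
        "validation_size": 0.0,
    },
}


def build_resplit_command(params, random_seed=42):
    """Build the argument list for one resplit.py run."""
    command = [sys.executable, "resplit.py"]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    command += ["--random_seed", str(random_seed)]
    return command


for dataset_id, params in CONFIGS.items():
    print(shlex.join(build_resplit_command(params)))
```

Each printed command can be run from the project root after the preprocessing step, for example by passing the argument list to subprocess.run with check=True.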

Citation

Please cite our paper if you use any tools or datasets provided in this repository:

@article{jiang2022knowledge,
  title={Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?},
  author={Jiang, Longquan and Usbeck, Ricardo},
  journal={arXiv preprint arXiv:2205.06573},
  year={2022}
}

License

This work is licensed under the Apache 2.0 License - see the LICENSE file for details.