KGQA-datasets-generalization

June 29, 2022

Existing approaches to Question Answering over Knowledge Graphs (KGQA) generalize poorly, often because of the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined: i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets for 5 different Knowledge Graphs (KGs). We show that, according to this definition, many existing and publicly available KGQA datasets are either not suited to training a generalizable KGQA system or are based on discontinued and outdated KGs. Generating new datasets is costly and thus not an option for smaller research groups and companies. In this work, we propose a mitigation method that re-splits available KGQA datasets so that they can be used to evaluate generalization, without any cost or manual effort. We test our hypothesis on three KGQA datasets: LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9.

Datasets
Reproduction
1. Requirements
2. Run Scripts
License

By analyzing 25 existing KGQA datasets, we identify a large gap in how the Semantic Web community evaluates the generalization of KGQA systems. The main goal of this work is to reuse existing datasets from nearly a decade of research to generate new datasets suitable for generalization evaluation. We propose a simple and novel method to achieve this goal, and we evaluate both the effectiveness of the method and the quality of the datasets it generates for building generalizable KGQA systems.

Test Existing KGQA Datasets

The table below shows the evaluation results with respect to the three levels of generalization defined in (Gu et al., 2021).

Dataset	KG	Year	I.I.D.	Compositional	Zero-Shot
WebQuestions	Freebase	2013	☑	☒	☒
SimpleQuestions	Freebase	2015	☑	☒	☒
ComplexQuestions	Freebase	2016	-	-	-
GraphQuestions	Freebase	2016	☑	☑	☒
WebQuestionsSP	Freebase	2016	☑	☒	☒
The 30M Factoid QA	Freebase	2016	☑	☒	☒
SimpleQuestionsWikidata	Wikidata	2017	☑	☒	☒
LC-QuAD 1.0	DBpedia	2017	☑	☑	☑
ComplexWebQuestions	Freebase	2018	☑	☒	☒
QALD-9	DBpedia	2018	☑	☑	☑
PathQuestion	Freebase	2018	-	-	-
MetaQA	WikiMovies	2018	-	-	-
SimpleDBpediaQA	DBpedia	2018	☑	☒	☒
TempQuestions	Freebase	2018	-	-	-
LC-QuAD 2.0	Wikidata	2019	☑	☑	☑
FreebaseQA	Freebase	2019	-	-	-
Compositional Freebase Questions	Freebase	2020	☑	☑	☒
RuBQ 1.0	Wikidata	2020	-	-	-
GrailQA	Freebase	2020	☑	☑	☑
Event-QA	EventKG	2020	-	-	-
RuBQ 2.0	Wikidata	2021	-	-	-
MLPQ	DBpedia	2021	-	-	-
Compositional Wikidata Questions	Wikidata	2021	☑	☑	☒
TimeQuestions	Wikidata	2021	-	-	-
CronQuestions	Wikidata	2021	-	-	-
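
As an illustration of the three levels used in the table, the following minimal sketch labels a single test question against a training set. It assumes each question has already been reduced to a query template (entities masked) and the set of schema items (relations, classes) it uses. That representation, the `Question` alias, and the function name are assumptions made for this example and are not taken from the repository code.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A question is reduced to (template, schema_items). Both fields are
# illustrative assumptions, not the repository's data model.
Question = Tuple[str, FrozenSet[str]]


def generalization_level(test_q: Question, train: Iterable[Question]) -> str:
    """Return 'zero-shot', 'compositional', or 'i.i.d.' for one test question."""
    seen_templates: Set[str] = set()
    seen_items: Set[str] = set()
    for template, items in train:
        seen_templates.add(template)
        seen_items.update(items)

    template, items = test_q
    if not items <= seen_items:
        return "zero-shot"       # at least one schema item never seen in training
    if template not in seen_templates:
        return "compositional"   # known schema items, unseen combination
    return "i.i.d."              # same query structure as a training question


# Toy training set with two single-relation templates (E marks a masked entity).
train = [
    ("SELECT ?x WHERE { ?x dbo:author E }", frozenset({"dbo:author"})),
    ("SELECT ?x WHERE { E dbo:birthPlace ?x }", frozenset({"dbo:birthPlace"})),
]
```

Checking the zero-shot condition first matters: a question with an unseen schema item necessarily has an unseen template as well, so the order of the two tests decides which label it receives.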

Statistics

The statistics of the original datasets and their counterparts (*) generated by our approach are shown below.

Dataset	Total	Train	Validation	Test	I.I.D.	Compositional	Zero-Shot
QALD-9	558	408	-	150	46	53	51
LC-QuAD 1.0	5000	4000	-	1000	434	559	7
LC-QuAD 2.0	30221	24177	-	6044	4624	948	472
QALD-9*	558	385	-	173	14	41	118
LC-QuAD 1.0*	5000	3420	521	1059	331	1021	228
LC-QuAD 2.0*	30221	20321	3267	6633	4014	3235	2651
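
The table can be sanity-checked with a few lines of arithmetic. The sketch below copies the rows as given and checks two things: that the splits add up to the total, and that the three level counts add up to the held-out questions (test only where there is no validation split, validation plus test otherwise). The second reading is inferred from the numbers, not stated in this README, so treat it as an assumption. `ROWS` and `summarize` are names introduced only for this example.

```python
# Rows copied from the statistics table; None marks a missing validation split.
# Columns: total, train, validation, test, i.i.d., compositional, zero-shot.
ROWS = {
    "QALD-9":       (558, 408, None, 150, 46, 53, 51),
    "LC-QuAD 1.0":  (5000, 4000, None, 1000, 434, 559, 7),
    "LC-QuAD 2.0":  (30221, 24177, None, 6044, 4624, 948, 472),
    "QALD-9*":      (558, 385, None, 173, 14, 41, 118),
    "LC-QuAD 1.0*": (5000, 3420, 521, 1059, 331, 1021, 228),
    "LC-QuAD 2.0*": (30221, 20321, 3267, 6633, 4014, 3235, 2651),
}


def summarize(row):
    """Check row consistency and compute the zero-shot share of held-out data."""
    total, train, val, test, iid, compo, zero = row
    held_out = (val or 0) + test          # assumption: levels cover val + test
    levels = iid + compo + zero
    return {
        "splits_sum_to_total": train + held_out == total,
        "levels_sum_to_held_out": levels == held_out,
        "zero_shot_share": round(zero / levels, 3),
    }


for name, row in ROWS.items():
    print(name, summarize(row))
```

Comparing `zero_shot_share` between a dataset and its starred counterpart gives a quick view of how much the re-split shifts the held-out data toward zero-shot questions.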

Use of the datasets

The datasets are available in JSON format.
All datasets are stored in the output_dir directory, which contains one sub-directory each for LC-QuAD 1.0, LC-QuAD 2.0 and QALD-9. Each dataset directory in turn contains two sub-directories, one for the original version and one for the new version.
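
A minimal sketch for loading the files is shown below. Since the file names and the exact names of the sub-directories are not listed here, the helper simply collects every `*.json` file under a given root. The function name `load_json_files` and the example path in the note are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_json_files(root: str) -> Dict[str, Any]:
    """Load every *.json file under root, keyed by its path relative to root."""
    base = Path(root)
    loaded: Dict[str, Any] = {}
    for path in sorted(base.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # Keys use forward slashes regardless of the operating system.
            loaded[path.relative_to(base).as_posix()] = json.load(handle)
    return loaded
```

Calling `load_json_files("output_dir/lcquad")`, for instance, would return one entry per JSON file found below that directory, which makes it easy to inspect the original and new versions side by side.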

Reproduction

Requirements

rdflib==6.0.2
datasets==1.16.1
scikit-learn==1.0.1
numpy==1.20.3
pandas==1.3.5

This project depends on the kgqa_datasets repository (see link), so you need to clone it into the root directory of this project.

Parameters

To ensure reproducibility, we set random_seed to 42 for all three KGQA datasets (LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9).

QALD

dataset_id: dataset-qald
input_path: data_dir/qald/data_sets.json
output_dir: output_dir/qald
sampling_ratio_zero: .4
sampling_ratio_compo: .1
sampling_ratio_iid: .1
n_splits_compo: 1
n_splits_zero: 1
validation_size: 0.0

LC-QuAD 1.0

dataset_id: dataset-lcquad
input_path: data_dir/lcquad/data_sets.json
output_dir: output_dir/lcquad
sampling_ratio_zero: .6
sampling_ratio_compo: .1
sampling_ratio_iid: .2
n_splits_compo: 1
n_splits_zero: 1

LC-QuAD 2.0

dataset_id: dataset-lcquad2
input_path: data_dir/lcquad2/data_sets.json
output_dir: output_dir/lcquad2
sampling_ratio_zero: .6
sampling_ratio_compo: .1
sampling_ratio_iid: .2
n_splits_compo: 1
n_splits_zero: 1
validation_size: 0.0

Run Scripts

Before re-splitting a KGQA dataset, preprocess the raw dataset by running the following command:

python preprocess.py --tasks <dataset_name> --data_dir <data_dir> --shuffle True --random_seed 42

Then re-split the dataset by running the following command:

python resplit.py --dataset_id <dataset_id> --input_path <data_dir> --output_dir <output_dir> --sampling_ratio_zero .4 --sampling_ratio_compo .1 --sampling_ratio_iid .1 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0
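
To avoid retyping the per-dataset values, the command line can be assembled from the settings listed under Parameters. The sketch below only builds and prints the argument list; it does not run `resplit.py`. The `CONFIGS` dictionary and the `resplit_command` helper are hypothetical names for this example, and no `validation_size` is passed for LC-QuAD 1.0 because none is listed for it above.

```python
import shlex
from typing import Dict, List

# Values copied from the Parameters section of this README.
CONFIGS: Dict[str, Dict[str, str]] = {
    "dataset-qald": {
        "input_path": "data_dir/qald/data_sets.json",
        "output_dir": "output_dir/qald",
        "sampling_ratio_zero": ".4",
        "sampling_ratio_compo": ".1",
        "sampling_ratio_iid": ".1",
        "n_splits_compo": "1",
        "n_splits_zero": "1",
        "validation_size": "0.0",
    },
    "dataset-lcquad": {
        "input_path": "data_dir/lcquad/data_sets.json",
        "output_dir": "output_dir/lcquad",
        "sampling_ratio_zero": ".6",
        "sampling_ratio_compo": ".1",
        "sampling_ratio_iid": ".2",
        "n_splits_compo": "1",
        "n_splits_zero": "1",
    },
    "dataset-lcquad2": {
        "input_path": "data_dir/lcquad2/data_sets.json",
        "output_dir": "output_dir/lcquad2",
        "sampling_ratio_zero": ".6",
        "sampling_ratio_compo": ".1",
        "sampling_ratio_iid": ".2",
        "n_splits_compo": "1",
        "n_splits_zero": "1",
        "validation_size": "0.0",
    },
}


def resplit_command(dataset_id: str, random_seed: int = 42) -> List[str]:
    """Build the argument list for resplit.py from the stored settings."""
    argv = ["python", "resplit.py", "--dataset_id", dataset_id]
    for name, value in CONFIGS[dataset_id].items():
        argv += [f"--{name}", value]
    argv += ["--random_seed", str(random_seed)]
    return argv


for dataset_id in CONFIGS:
    print(shlex.join(resplit_command(dataset_id)))
```

The returned list can be passed directly to `subprocess.run` if the commands should be executed from Python instead of being copied into a shell.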

Citation

Please cite our paper if you use any tools or datasets provided in this repository:

@article{jiang2022knowledge,
  title={Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?},
  author={Jiang, Longquan and Usbeck, Ricardo},
  journal={arXiv preprint arXiv:2205.06573},
  year={2022}
}

License

This work is licensed under the Apache 2.0 License - see the LICENSE file for details.