README.md
November 14, 2019
Repository for the CoNLL 2019 paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking
Link to the paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking
Please use the following citation:
@inproceedings{hanselowski2019snopes,
title={A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking},
author={Hanselowski, Andreas and Stab, Christian and Schulz, Claudia and Li, Zile and Gurevych, Iryna},
booktitle={Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL2019)},
year={2019}
}
Disclaimer:
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
Corpus for this repository
- You can either recreate the corpus using the web crawler
- Or request the original splits of the data, needed to reproduce the results, on this data archive website. (Please take note of the licence agreement, as the corpus is not publicly available.)
Installation
- Download and install Anaconda (https://www.anaconda.com/)
- Create a Python Environment and activate it:
conda create -n snopes python=3.6
source activate snopes
- Install the required dependencies
pip install -r requirements.txt
pip install git+https://github.com/j6mes/drqa@parallel
- Download NLTK Punkt Tokenizer
python -c "import nltk; nltk.download('punkt')"
- For BERT models, create a separate Python environment:
conda create -n bert python=3.6
source activate bert
pip install -r requirements_bert.txt
pip install git+https://github.com/j6mes/drqa@parallel
- Proceed with downloading the data set, the embeddings, the models and the evidence data
Download the word embeddings
Download pretrained GloVe Vectors
wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
mkdir -p data/glove
unzip glove.6B.zip -d data/glove
gzip data/glove/*.txt
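The GloVe files can be used either compressed or uncompressed. As a minimal sketch of how such a file could be read in both forms, the helper below opens a `.txt` or `.txt.gz` file and returns a word-to-vector dictionary. The function name `load_glove` is hypothetical and is not part of this repository; the repository's own loading code may differ.

```python
import gzip
from typing import Dict, List


def load_glove(path: str) -> Dict[str, List[float]]:
    """Read GloVe vectors from a plain-text or gzip-compressed file.

    Hypothetical helper for illustration; not taken from this repository.
    """
    # Choose the opener from the file extension; both return text lines.
    opener = gzip.open if path.endswith(".gz") else open
    vectors: Dict[str, List[float]] = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            # First token is the word, the rest are the vector components.
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors
```

For example, `load_glove("data/glove/glove.6B.300d.txt.gz")` would read the compressed 300-dimensional vectors produced by the commands above.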
Download pretrained Wiki FastText Vectors
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
mkdir -p data/fasttext
unzip wiki.en.zip -d data/fasttext
Alternatively, download the wiki-news vectors. In this case, the entry "fasttext_path": "data/fasttext/wiki.en.bin" in the config files in the folders conf_snopes_paper and conf_snopes_stance_paper needs to be changed to "data/fasttext/wiki-news-300d-1M.vec".
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
mkdir -p data/fasttext
unzip wiki-news-300d-1M.vec.zip -d data/fasttext
Pre-processed datasets
Ask for permission to download the dataset: Snopes Corpus
unzip ukp_snopes_corpus.zip
cp -r snopes_corpus/datasets data/
Evidence extraction
PYTHONPATH=src python src/scripts/evidence_extraction.py --model [esim or bilstm_ranking or decompos_att or tf-idf or random_baseline]
e.g.: PYTHONPATH=src python src/scripts/evidence_extraction.py --model bilstm_ranking
Available models:
| Model | Description |
|---|---|
| esim | ESIM ranking model |
| bilstm_ranking | BiLSTM ranking model |
| decompos_att | Decomposable attention ranking model |
| tf-idf | TF-IDF ranking |
| random_baseline | Random baseline |
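To illustrate what the TF-IDF ranking option does conceptually, here is a minimal sketch that scores candidate sentences against a claim by the cosine similarity of their TF-IDF vectors. It is an illustration under simplifying assumptions (whitespace tokenization, smoothed IDF), not the implementation in src/scripts/evidence_extraction.py; the names `tokenize` and `rank_by_tfidf` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: simple lowercased whitespace tokenization.
    return text.lower().split()


def rank_by_tfidf(claim: str, sentences: List[str]) -> List[Tuple[float, str]]:
    """Return (score, sentence) pairs sorted by descending similarity to the claim."""
    docs = [tokenize(sentence) for sentence in sentences]
    n_docs = len(docs)
    # Document frequency: in how many candidate sentences each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    def vectorize(tokens: List[str]) -> Dict[str, float]:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        return {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    claim_vector = vectorize(tokenize(claim))
    scored = [(cosine(claim_vector, vectorize(doc)), sentence)
              for doc, sentence in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Sentences that share more (and rarer) terms with the claim are ranked first; a sentence with no term overlap gets a score of 0.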
Stance detection
To run the stance detection models, launch the script as follows (the configuration files are in the folder conf_snopes_stance_paper):
PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config <path/to/config/file/json>]
e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_stance_paper/config_snopes_use_bilstm.json
Available configuration files:
| Config file | Description |
|---|---|
| config_snopes_bert_sent_bilstm.json | BERT sentence embeddings with BiLSTM encoding |
| config_snopes_bert_word_bilstm.json | BERT word embeddings with BiLSTM encoding |
| config_snopes_use_bilstm.json | Google Universal Sentence Encoder (USE+Attent) |
Claim validation
To run the claim validation models, launch the script as follows (the configuration files are in the folder conf_snopes_paper):
PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config <path/to/config/file/json>]
e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_paper/config_snopes_bilstm.json
Available configuration files:
| Config file | Description |
|---|---|
| config_snopes_bilstm.json | BiLSTM |
| config_snopes_bert_sent_attention.json | BERT sentence embeddings with attention |
| config_snopes_esim_glove.json | Extended ESIM with GloVe embeddings |
| config_snopes_use_attention.json | Google Universal Sentence Encoder with attention |
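Stance detection and claim validation use the same entry point, src/scripts/rte.py, and differ only in the config file. When running several configurations in a row, the command can be assembled programmatically. The sketch below is a hypothetical wrapper (the names `build_rte_command` and `run_rte` are not part of this repository) that only uses the `--mode` and `--config` options shown above.

```python
import os
import subprocess
from typing import List, Optional


def build_rte_command(mode: str, config: Optional[str] = None) -> List[str]:
    """Assemble the argument list for src/scripts/rte.py."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    command = ["python", "src/scripts/rte.py", "--mode", mode]
    if config is not None:
        command += ["--config", config]
    return command


def run_rte(mode: str, config: Optional[str] = None) -> int:
    """Run rte.py with PYTHONPATH=src and return its exit code."""
    # Copy the current environment and set PYTHONPATH as in the commands above.
    env = dict(os.environ, PYTHONPATH="src")
    return subprocess.call(build_rte_command(mode, config), env=env)
```

For example, `run_rte("train", "conf_snopes_paper/config_snopes_bilstm.json")` corresponds to the claim validation example above and has to be called from the repository root.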
Run Decomposable Attention model (different configurations are required)
To run the decomposable attention model on the Snopes corpus, use the scripts below:
Training:
PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json [config/snopes.json or config/stance.json] <folder/to/keep/the/model> --cuda-device <GPU Number> --is-snopes True
e.g.: PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json config/stance.json model/decomp_attention_stance/ --cuda-device 0 --is-snopes True
Testing:
PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json <folder/to/keep/the/model>/model.tar.gz [data/datasets/snopes.test.jsonl or data/datasets/snopes.stance.test.jsonl] --cuda-device <GPU Number> --is-snopes True
e.g.: PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json model/decomp_attention_stance/model.tar.gz data/datasets/snopes.stance.test.jsonl --cuda-device 0 --is-snopes True
BERT models (different configurations are required)
For BERT models, switch to the bert Python environment:
source activate bert
The pretrained model needs to be downloaded and the bert-as-service server needs to be started first. To download the pretrained model:
mkdir bert
cd bert
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
unzip cased_L-24_H-1024_A-16.zip
For BERT word embedding:
bert-serving-start -pooling_strategy NONE -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]
For BERT sentence embedding:
bert-serving-start -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]
bert_port defaults to 5555 and bert_port_out defaults to 5556. These two ports must match the fields of the same name in the config.
For more details, please refer to bert-as-service.
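The two server commands differ only in the `-pooling_strategy NONE` flag, and the ports have to agree with the config. The following sketch assembles the argument list from those settings and checks the ports against a config dictionary. The helper names `bert_server_args` and `ports_match_config` are hypothetical; only the flags shown in the commands above are used.

```python
from typing import Dict, List, Optional


def bert_server_args(model_dir: str,
                     word_level: bool,
                     max_seq_len: int = 50,
                     port: int = 5555,
                     port_out: int = 5556,
                     graph_tmp_dir: Optional[str] = None) -> List[str]:
    """Build the bert-serving-start argument list for word or sentence embeddings."""
    args = ["bert-serving-start"]
    if word_level:
        # Word embeddings: disable pooling so one vector per token is returned.
        args += ["-pooling_strategy", "NONE"]
    args += ["-model_dir", model_dir,
             "-max_seq_len", str(max_seq_len),
             "-port", str(port),
             "-port_out", str(port_out)]
    if graph_tmp_dir is not None:
        args += ["-graph_tmp_dir", graph_tmp_dir]
    return args


def ports_match_config(config: Dict[str, int], port: int, port_out: int) -> bool:
    """Check that the server ports equal bert_port and bert_port_out in the config."""
    return config.get("bert_port") == port and config.get("bert_port_out") == port_out
```

Joining the result with spaces, e.g. `" ".join(bert_server_args("bert/cased_L-24_H-1024_A-16", word_level=True))`, reproduces the word embedding command above with the default ports.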
Run AtheneMLP model for stance detection (not yet fixed)
Please note: the command below does not yet work properly and will be fixed later. If you want to fix it yourself, the required files are in the directory src/fnc.
PYTHONPATH=src python src/fnc/pipeline.py
Description of the Config File
The config file for the file paths and the hyper parameters is src/athene/utils/config.py. The fields are described below:
| Field | Description |
|---|---|
| model_name | Name of the RTE model. Used as part of the path to save the trained RTE model. |
| glove_path | Path to the pre-trained GloVe word embedding. Either point to the glove.6B.300d.txt.gz or the glove.6B.300d.txt file. |
| fasttext_path | Path to the pre-trained FastText word embedding. Should point to the wiki.en.bin file. |
| ckpt_folder | Path to the checkpoint folder for the trained RTE model. Defaults to model/<model_name>/rte_checkpoints. |
| db_path | Path to the FEVER database file. |
| dataset_folder | Path to the dataset folder. |
| raw_training_set | Path to the original training set file. |
| raw_dev_set | Path to the original development set file. |
| raw_test_set | Path to the original test set file. |
| training_doc_file | Path to the training set with predicted pages, i.e. the output of the training set through document retrieval sub-task. |
| dev_doc_file | Path to the development set with predicted pages, i.e. the output of the development set through document retrieval sub-task. |
| test_doc_file | Path to the test set with predicted pages, i.e. the output of the test set through document retrieval sub-task. |
| training_set_file | Path to the training set with predicted evidences, i.e. the output of the training set through sentence retrieval sub-task. |
| dev_set_file | Path to the development set with predicted evidences, i.e. the output of the development set through sentence retrieval sub-task. |
| test_set_file | Path to the test set with predicted evidences, i.e. the output of the test set through sentence retrieval sub-task. |
| document_k_wiki | The maximal number of candidate pages for each claim in the document retrieval sub-task. |
| document_parallel | Whether to perform the document retrieval sub-task in parallel. True or False. |
| document_add_claim | Whether to append the original claim to the query to the MediaWiki API in the document retrieval sub-task. True or False. |
| submission_file | Path to the final submission file. |
| estimator_name | The name of the RTE estimator referring to src/athene/rte/utils/estimator_definitions.py. |
| max_sentences | The maximal number of predicted evidences for each claim. |
| max_sentence_size | The maximal length of each predicted evidence. The words that exceed the maximal length are truncated. |
| max_claim_size | The maximal length of each claim. The words that exceed the maximal length are truncated. |
| seed | Random seed of the RTE sub-task. |
| name | The prefix of the checkpoint files for the RTE sub-task. The checkpoint files will be saved in the <ckpt_folder>. |
| bert_port | port for the BERT server. Only for BERT models. |
| bert_port_out | port_out for the BERT server. Only for BERT models. |
| max_gpu_memory | Maximal GPU memory usage as a fraction in the range (0.0, 1.0]. Defaults to 0.5. |
| tensorboard_folder | TensorBoard folder. Defaults to None, meaning no TensorBoard data will be collected. |
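As an illustration of how the defaults in the table interact, the sketch below merges user-supplied values over a few defaults, derives ckpt_folder from model_name, and checks the range of max_gpu_memory. It assumes a flat JSON file with the field names from the table; the actual layout of the JSON config files and the logic in src/athene/utils/config.py may differ, and `resolve_config` and `load_config` are hypothetical names.

```python
import json
from typing import Any, Dict

# Defaults taken from the field descriptions above.
DEFAULTS: Dict[str, Any] = {
    "bert_port": 5555,
    "bert_port_out": 5556,
    "max_gpu_memory": 0.5,
    "tensorboard_folder": None,
}


def resolve_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overrides into the defaults and fill in derived fields."""
    config = dict(DEFAULTS)
    config.update(overrides)
    if "model_name" not in config:
        raise KeyError("model_name is required")
    # ckpt_folder defaults to model/<model_name>/rte_checkpoints.
    config.setdefault(
        "ckpt_folder", "model/{}/rte_checkpoints".format(config["model_name"]))
    # max_gpu_memory is a fraction in the half-open range (0.0, 1.0].
    if not 0.0 < config["max_gpu_memory"] <= 1.0:
        raise ValueError("max_gpu_memory must be in (0.0, 1.0]")
    return config


def load_config(path: str) -> Dict[str, Any]:
    """Read a flat JSON config file (assumed layout) and resolve it."""
    with open(path, encoding="utf-8") as handle:
        return resolve_config(json.load(handle))
```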
The 'esim_hyper_param' field contains the hyper parameters of the ESIM based model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| num_neurons | The number of neurons for each layer in the model. The first 2 numbers refer to the numbers of neurons of the two bidirectional RNNs in the ESIM model. |
| pos_weight | The positive weights of the 3 classes for the weighted loss. The order is Supported, Refuted, Not Enough Info. |
| max_checks_no_progress | Early stopping policy. Stop training if no improvement in the last x epochs. |
| trainable | Whether to fine tune the word embeddings. True or False. |
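The early stopping policy behind max_checks_no_progress can be sketched as follows. This is a generic illustration of the rule "stop if there was no improvement in the last x epochs", assuming a higher development score is better; it is not the repository's training loop, and `epochs_until_stop` is a hypothetical name.

```python
from typing import Iterable, Optional


def epochs_until_stop(dev_scores: Iterable[float], max_checks_no_progress: int) -> int:
    """Return how many epochs run before early stopping ends training."""
    best: Optional[float] = None
    checks_without_progress = 0
    epochs_run = 0
    for score in dev_scores:
        epochs_run += 1
        if best is None or score > best:
            # New best score: remember it and reset the counter.
            best = score
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress >= max_checks_no_progress:
                break  # no improvement in the last x epochs
    return epochs_run
```

With development scores 0.50, 0.55, 0.54, 0.55, 0.53, 0.60 and max_checks_no_progress set to 3, training stops after the fifth epoch, so the later improvement to 0.60 is never reached. A larger value trades training time for a lower risk of stopping too early.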
The 'esim_mtl_hyper_param' field contains the hyper parameters of the ESIM Multi-Task Learning model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| num_neurons_esim | The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers. |
| num_neurons_claim_validation | The number of neurons of the MLP for the claim validation task. |
| num_neurons_evidence_evaluation | The number of neurons of the MLP for the evidence evaluation task. |
The 'esim_end_2_end_hyper_param' field contains the hyper parameters of the ESIM End-2-End model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| sentence_selection_criterion | The criterion of selecting candidate evidences. 'n_best_sents': Select top N sentences. 'sent_threshold': Select sentences with scores higher than a threshold. |
| n_best_sents | The number of sentences to select. Only for the criterion 'n_best_sents'. |
| sent_threshold | The threshold for selecting sentences. Only for the criterion 'sent_threshold'. |
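The two selection criteria can be sketched as a single function over scored sentences. This is a minimal illustration of the behaviour described in the table, not the repository's code; the function name `select_sentences` and the default values are assumptions.

```python
from typing import List, Tuple


def select_sentences(scored: List[Tuple[str, float]],
                     criterion: str,
                     n_best_sents: int = 5,
                     sent_threshold: float = 0.5) -> List[str]:
    """Select candidate evidence sentences from (sentence, score) pairs."""
    # Rank all candidates by descending score first.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if criterion == "n_best_sents":
        # Keep the top N sentences regardless of their absolute scores.
        return [sentence for sentence, _ in ranked[:n_best_sents]]
    if criterion == "sent_threshold":
        # Keep every sentence whose score is higher than the threshold.
        return [sentence for sentence, score in ranked if score > sent_threshold]
    raise ValueError("unknown sentence_selection_criterion: " + criterion)
```

'n_best_sents' always returns a fixed number of sentences (if enough candidates exist), while 'sent_threshold' can return anything from no sentence to all of them.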
The 'esim_credibility_hyper_param' field contains the hyper parameters of the model using source information for the Snopes corpus in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| word_trainable | True or False. Whether the word embeddings are trainable. |
| domain_trainable | True or False. Whether the domain name embeddings are trainable. |
| suffix_trainable | True or False. Whether the embeddings for the suffix of domain names are trainable. |
| stance_trainable | True or False. Whether the stance embeddings are trainable. |
| protocol_trainable | True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable. |
| domain_embedding_size | The dimensionality of the domain name embeddings. |
| suffix_embedding_size | The dimensionality of the embeddings for the suffix of the domain name. |
| protocol_embedding_size | The dimensionality of the embeddings for the protocols of the domain name. |
| stance_embedding_size | The dimensionality of the stance embeddings. |
| features_embeddings_path | The path where the embeddings are stored after training. |
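The source features named in the table (protocol, domain name, domain suffix) can be derived from the URL of an evidence source. The sketch below shows one way to do this with the standard library. How the repository actually extracts these features is not documented here, so the treatment of "www." and the definition of the suffix as the last dot-separated label are assumptions, and `source_features` is a hypothetical name.

```python
from typing import Dict
from urllib.parse import urlparse


def source_features(url: str) -> Dict[str, str]:
    """Split a source URL into protocol, domain name and domain suffix."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Assumption: a leading "www." carries no information about the source.
    if host.startswith("www."):
        host = host[4:]
    # Assumption: the suffix is the last dot-separated label, e.g. "com".
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"protocol": parsed.scheme, "domain": host, "suffix": suffix}
```

Each of the three string values would then be mapped to an index into its own embedding table, whose sizes are set by domain_embedding_size, suffix_embedding_size and protocol_embedding_size.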
The 'esim_credibility_mtl_hyper_param' field contains the hyper parameters of the Multi-Task Learning model using source information for the Snopes corpus in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| num_neurons_esim | The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers. |
| num_neurons_claim_validation | The number of neurons of the MLP for the claim validation task. |
| num_neurons_stance_detection | The number of neurons of the MLP for the stance detection task. |
| word_trainable | True or False. Whether the word embeddings are trainable. |
| domain_trainable | True or False. Whether the domain name embeddings are trainable. |
| suffix_trainable | True or False. Whether the embeddings for the suffix of domain names are trainable. |
| stance_trainable | True or False. Whether the stance embeddings are trainable. |
| protocol_trainable | True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable. |
| domain_embedding_size | The dimensionality of the domain name embeddings. |
| suffix_embedding_size | The dimensionality of the embeddings for the suffix of the domain name. |
| protocol_embedding_size | The dimensionality of the embeddings for the protocols of the domain name. |
| stance_embedding_size | The dimensionality of the stance embeddings. |
| claim_training_set | The path to the training set of the claim validation task. |
| claim_dev_set | The path to the development set of the claim validation task. |
| claim_test_set | The path to the test set of the claim validation task. |
| stance_training_set | The path to the training set of the stance detection task. |
| stance_dev_set | The path to the development set of the stance detection task. |
| stance_test_set | The path to the test set of the stance detection task. |
| features_embeddings_path | The path where the embeddings are stored after training. |
The 'han_hyper_param' field contains the hyper parameters of the Hierarchical Attention Network models in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| word_output_size | The dimensionality of the word-level output after alignment by the attention. |
| sent_output_size | The dimensionality of the sentence-level output after alignment by the attention. |
The 'use_hyper_param' field contains the hyper parameters of the Universal Sentence Encoder based models in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| attention_function | Only for use_attention models. The attention implementation to use, with c as the claim and e as the evidence. Choices: 'cosine_similarity'; 'weight_matrix': randomly initialize a 512*512 trainable matrix m and calculate the attention weights as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and use v·c as the attention weight; 'weight_matrix_3': pass both c and e through a feed-forward layer and use their product as the attention weight. |
| do_finetune | Only for use_attention models. True or False. Whether to fine tune USE. |
The 'bert_sent_hyper_param' field contains the hyper parameters of the models using the BERT sentence encoding in the RTE sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| attention_function | The attention implementation to use, with c as the claim and e as the evidence. Choices: 'cosine_similarity'; 'weight_matrix': randomly initialize a 512*512 trainable matrix m and calculate the attention weights as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and use v·c as the attention weight; 'weight_matrix_3': pass both c and e through a feed-forward layer and use their product as the attention weight. |
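To make the attention choices more concrete, the sketch below implements the 'cosine_similarity' score and the 'weight_matrix' score c·m·e in plain Python, and uses either of them to weight a set of evidence vectors. The softmax normalization and the weighted sum in `attend` are assumptions about how the scores are used, the tiny 2-dimensional vectors stand in for the real 512-dimensional encodings, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(c: Vector, e: Vector) -> float:
    """'cosine_similarity': cosine of the angle between claim and evidence."""
    norm = math.sqrt(dot(c, c)) * math.sqrt(dot(e, e))
    return dot(c, e) / norm if norm else 0.0


def bilinear_score(c: Vector, m: Sequence[Vector], e: Vector) -> float:
    """'weight_matrix': c·m·e with a (trainable) square matrix m."""
    m_times_e = [dot(row, e) for row in m]
    return dot(c, m_times_e)


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(claim: Vector, evidences: Sequence[Vector],
           score_fn: Callable[[Vector, Vector], float]) -> List[float]:
    """Weight the evidence vectors by their normalized attention scores."""
    weights = softmax([score_fn(claim, evidence) for evidence in evidences])
    dim = len(evidences[0])
    return [sum(w * evidence[i] for w, evidence in zip(weights, evidences))
            for i in range(dim)]
```

With the identity matrix as m, `bilinear_score` reduces to the plain dot product; training m lets the model learn which dimensions of claim and evidence should interact.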
The 'sentence_retrieval_ensemble_param' field contains the hyper parameters of the ESIM based model in the sentence retrieval sub-task. Several special parameters are described below:
| Field | Description |
|---|---|
| num_model | The number of models to ensemble. |
| tf_random_state | The random seeds for the models to ensemble. |
| num_negatives | The number of negative samples, i.e. false evidence sentences, for each claim in the training phase. |
| c_max_length | The maximal length of each claim. The words that exceed the maximal length are truncated. |
| s_max_length | The maximal length of each candidate evidence sentence. The words that exceed the maximal length are truncated. |
| reserve_embed | Whether to reserve slots in the word embeddings for unseen words. True or False. |
| model_path | Path to the folder for the checkpoint files of the ensemble models. |
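The role of num_negatives and of the random seeds can be illustrated with a small negative sampling sketch: for each claim, a fixed number of non-evidence sentences is drawn from the candidate sentences. This is a generic illustration, not the repository's sampling code; `sample_negatives` is a hypothetical name, and using Python's `random` module in place of the TensorFlow random state is an assumption made to keep the example self-contained.

```python
import random
from typing import List, Sequence


def sample_negatives(candidates: Sequence[str],
                     gold_evidence: Sequence[str],
                     num_negatives: int,
                     seed: int) -> List[str]:
    """Draw up to num_negatives candidate sentences that are not gold evidence."""
    gold = set(gold_evidence)
    # Only sentences that are not annotated as evidence can serve as negatives.
    pool = [sentence for sentence in candidates if sentence not in gold]
    rng = random.Random(seed)  # seeded for reproducible samples
    return rng.sample(pool, min(num_negatives, len(pool)))
```

Training each ensemble member with a different seed gives every model a different set of negatives, which is one reason an ensemble can be more robust than a single model.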
Contacts:
- <lastname>@ukp.informatik.tu-darmstadt.de
- https://www.informatik.tu-darmstadt.de/ukp/ukp_home/
- https://www.tu-darmstadt.de
Credit:
This repository was built by Zile Li, Hao Zhang, and Andreas Hanselowski.
License:
- Apache License Version 2.0