README.md

November 14, 2019

Repository for the CoNLL 2019 paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Link to the paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Please use the following citation:

    @inproceedings{hanselowski2019snopes,
      title={A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking},
      author={Hanselowski, Andreas and Stab, Christian and Schulz, Claudia and Li, Zile and Gurevych, Iryna},
      booktitle={Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL2019)},
      year={2019}
    }

Disclaimer:

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Corpus for this repository

You can either recreate the corpus using the web crawler,
or request the original splits of the data on this data archive website in order to reproduce the results. Please take note of the licence agreement, as the corpus is not publicly available.

Installation

Download and install Anaconda (https://www.anaconda.com/)
Create a Python Environment and activate it:

    conda create -n snopes python=3.6
    source activate snopes

Install the required dependencies

    pip install -r requirements.txt
    pip install git+https://github.com/j6mes/drqa@parallel

Download NLTK Punkt Tokenizer

    python -c "import nltk; nltk.download('punkt')"

For BERT models, create a separate Python environment:

    conda create -n bert python=3.6
    source activate bert
    pip install -r requirements_bert.txt
    pip install git+https://github.com/j6mes/drqa@parallel

Proceed with downloading the data set, the embeddings, the models and the evidence data

Download the word embeddings

Download pretrained GloVe Vectors

    wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
    mkdir -p data/glove
    unzip glove.6B.zip -d data/glove
    gzip data/glove/*.txt
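
To check that the downloaded vectors are readable, a minimal sketch like the following may help. It is not part of the repository: `load_glove` is a hypothetical helper that assumes the standard GloVe text layout (one token per line followed by space-separated floats) and accepts either a plain or a gzipped file.

```python
import gzip
from typing import Dict, List, Optional


def load_glove(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read GloVe vectors from a plain or gzipped text file into a dict.

    Hypothetical helper for illustration; the repository has its own loader.
    """
    # Pick the opener from the file extension (.txt.gz or .txt).
    opener = gzip.open if path.endswith(".gz") else open
    vectors: Dict[str, List[float]] = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            # Each line holds a token followed by its vector components.
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

For example, `load_glove("data/glove/glove.6B.300d.txt.gz", limit=1000)` would read the first 1000 entries, assuming the file was created by the commands above.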

Download pretrained Wiki FastText Vectors

    wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
    mkdir -p data/fasttext
    unzip wiki.en.zip -d data/fasttext
    
    or (in this case, the path "fasttext_path": "data/fasttext/wiki.en.bin" in the config files in the folders conf_snopes_paper and conf_snopes_stance_paper needs to be changed to "data/fasttext/wiki-news-300d-1M.vec")
    
    wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
    mkdir -p data/fasttext
    unzip wiki-news-300d-1M.vec.zip -d data/fasttext

Pre-processed datasets

Ask for permission to download the dataset: Snopes Corpus

    unzip ukp_snopes_corpus.zip
    cp -r snopes_corpus/datasets data/

Evidence extraction

    PYTHONPATH=src python src/scripts/evidence_extraction.py --model [esim or bilstm_ranking or decompos_att or tf-idf or random_baseline]
    e.g.: PYTHONPATH=src python src/scripts/evidence_extraction.py --model bilstm_ranking

Available models:

Field	Description
esim	ESIM ranking model
bilstm_ranking	BiLSTM ranking model
decompos_att	Decomposable attention ranking model
tf-idf	TF-IDF ranking
random_baseline	Random baseline
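
To illustrate what a TF-IDF ranking baseline does, here is a minimal pure-Python sketch. It is not the repository's implementation: `tokenize` and `rank_by_tfidf` are hypothetical names, and the whitespace tokenizer and smoothed IDF are assumptions made for brevity.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer (assumption, for illustration only).
    return text.lower().split()


def rank_by_tfidf(claim: str, sentences: List[str]) -> List[int]:
    """Return sentence indices sorted by TF-IDF cosine similarity to the claim."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    # Document frequency, computed over the candidate sentences only.
    df = Counter(tok for doc in docs for tok in set(doc))

    def vectorize(tokens: List[str]) -> Dict[str, float]:
        tf = Counter(tokens)
        # Smoothed IDF so that tokens unseen in the candidates stay finite.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    claim_vec = vectorize(tokenize(claim))
    scores = [cosine(claim_vec, vectorize(doc)) for doc in docs]
    # Highest similarity first; ties keep their original order.
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

The learned rankers (esim, bilstm_ranking, decompos_att) replace the cosine score with a model score, while the ranking step stays conceptually the same.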

Stance detection

To run the stance detection models, launch the script as follows:

    PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config conf_snopes_stance_paper/<config_file>.json]
    e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_stance_paper/config_snopes_use_bilstm.json

Available configuration files:

Field	Description
config_snopes_bert_sent_bilstm.json	BERT sentence embeddings with BiLSTM encoding
config_snopes_bert_word_bilstm.json	BERT word embeddings with BiLSTM encoding
config_snopes_use_bilstm.json	Google Universal Sentence Encoder (USE+Attent)

Claim validation

To run the claim validation models, launch the script as follows:

    PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config conf_snopes_paper/<config_file>.json]
    e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_paper/config_snopes_bilstm.json

Available configuration files:

Field	Description
config_snopes_bilstm.json	BiLSTM
config_snopes_bert_sent_attention.json	BERT sentence embeddings with attention
config_snopes_esim_glove.json	Extended ESIM with GloVe embeddings
config_snopes_use_attention.json	Google Universal Sentence Encoder with attention
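
When several configurations are to be trained in a row, the command line can be assembled programmatically. The sketch below only builds the argument list shown above; `rte_command` and `CLAIM_VALIDATION_CONFIGS` are hypothetical names, and the config file names are copied from the table.

```python
from typing import List

# Config file names as listed in the table above.
CLAIM_VALIDATION_CONFIGS = [
    "config_snopes_bilstm.json",
    "config_snopes_bert_sent_attention.json",
    "config_snopes_esim_glove.json",
    "config_snopes_use_attention.json",
]


def rte_command(mode: str, config_name: str,
                config_dir: str = "conf_snopes_paper") -> List[str]:
    """Build the argument list for src/scripts/rte.py (hypothetical helper)."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    return ["python", "src/scripts/rte.py",
            "--mode", mode,
            "--config", f"{config_dir}/{config_name}"]


# To actually launch a run, one could pass the list to subprocess.run with
# PYTHONPATH=src in the environment, for example:
#   subprocess.run(cmd, env={**os.environ, "PYTHONPATH": "src"}, check=True)
```

Note that the BERT configurations additionally need the bert environment and a running bert-as-service server, so they cannot simply be looped over together with the other models.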

Run Decomposable Attention model (different configurations are required)

To run the decomposable attention model with Snopes corpus, run the script below:

Training:

    PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json [config/snopes.json or config/stance.json] <folder/to/keep/the/model> --cuda-device <GPU Number> --is-snopes True
    e.g.: PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json config/stance.json model/decomp_attention_stance/ --cuda-device 0 --is-snopes True

Testing:

    PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json <folder/to/keep/the/model>/model.tar.gz [data/datasets/snopes.test.jsonl or data/datasets/snopes.stance.test.jsonl] --cuda-device <GPU Number> --is-snopes True
    e.g.: PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json model/decomp_attention_stance/model.tar.gz data/datasets/snopes.stance.test.jsonl --cuda-device 0 --is-snopes True

BERT models (different configurations are required)

For BERT models, switch to the bert Python environment:

    source activate bert

The pretrained model needs to be downloaded and the bert-as-service server needs to be started first. To download the pretrained model:

    mkdir bert
    cd bert
    wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
    unzip cased_L-24_H-1024_A-16.zip

For BERT word embedding:

    bert-serving-start -pooling_strategy NONE -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]

For BERT sentence embedding:

    bert-serving-start -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]

bert_port defaults to 5555 and bert_port_out defaults to 5556. These two ports should match the bert_port and bert_port_out fields in the config.

For more detail, please refer to bert-as-service.
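
One way to keep the server ports and the config in sync is to derive the server command from the config itself. The following is a sketch under that assumption: `bert_server_command` is a hypothetical helper, it uses only the flags shown above, and it falls back to the default ports 5555 and 5556 when the config omits them.

```python
from typing import Any, Dict, List


def bert_server_command(config: Dict[str, Any], model_dir: str,
                        word_level: bool, max_seq_len: int = 50) -> List[str]:
    """Build a bert-serving-start argument list from a config dict (sketch)."""
    cmd = ["bert-serving-start"]
    if word_level:
        # Word embeddings need per-token outputs, so pooling is disabled.
        cmd += ["-pooling_strategy", "NONE"]
    cmd += [
        "-model_dir", model_dir,
        "-max_seq_len", str(max_seq_len),
        # Reuse the ports from the config so client and server agree.
        "-port", str(config.get("bert_port", 5555)),
        "-port_out", str(config.get("bert_port_out", 5556)),
    ]
    return cmd
```

Joining the returned list with spaces gives a command equivalent to the ones listed above, with the ports taken from the config.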

Run AtheneMLP model for stance detection (not yet fixed)

Please note that the command below is not yet working properly and will be fixed later. If you want to fix it yourself, the required files are in the directory src/fnc.

    PYTHONPATH=src python src/fnc/pipeline.py

Description of the Config File

The config file for the file paths and the hyperparameters is src/athene/utils/config.py. The fields are described below:

Field	Description
model_name	Name of the RTE model. Used as part of the path to save the trained RTE model.
glove_path	Path to the pre-trained GloVe word embedding. Either point to the glove.6B.300d.txt.gz or the glove.6B.300d.txt file.
fasttext_path	Path to the pre-trained FastText word embedding. Should point to the wiki.en.bin file.
ckpt_folder	Path to the checkpoint folder for the trained RTE model. Defaults to model/<model_name>/rte_checkpoints.
db_path	Path to the FEVER database file.
dataset_folder	Path to the dataset folder.
raw_training_set	Path to the original training set file.
raw_dev_set	Path to the original development set file.
raw_test_set	Path to the original test set file.
training_doc_file	Path to the training set with predicted pages, i.e. the output of the training set through document retrieval sub-task.
dev_doc_file	Path to the development set with predicted pages, i.e. the output of the development set through document retrieval sub-task.
test_doc_file	Path to the test set with predicted pages, i.e. the output of the test set through document retrieval sub-task.
training_set_file	Path to the training set with predicted evidences, i.e. the output of the training set through sentence retrieval sub-task.
dev_set_file	Path to the development set with predicted evidences, i.e. the output of the development set through sentence retrieval sub-task.
test_set_file	Path to the test set with predicted evidences, i.e. the output of the test set through sentence retrieval sub-task.
document_k_wiki	The maximal number of candidate pages for each claim in the document retrieval sub-task.
document_parallel	Whether to perform the document retrieval sub-task in parallel. True or False.
document_add_claim	Whether to append the original claim to the query to the MediaWiki API in the document retrieval sub-task. True or False.
submission_file	Path to the final submission file.
estimator_name	The name of the RTE estimator referring to src/athene/rte/utils/estimator_definitions.py.
max_sentences	The maximal number of predicted evidences for each claim.
max_sentence_size	The maximal length of each predicted evidence. The words that exceed the maximal length are truncated.
max_claim_size	The maximal length of each claim. The words that exceed the maximal length are truncated.
seed	Random seed of the RTE sub-task.
name	The prefix of the checkpoint files for the RTE sub-task. The checkpoint files will be saved in the <ckpt_folder>.
bert_port	port for the BERT server. Only for BERT models.
bert_port_out	port_out for the BERT server. Only for BERT models.
max_gpu_memory	Maximal GPU memory usage, as a fraction in (0.0, 1.0]. Defaults to 0.5.
tensorboard_folder	TensorBoard folder. Defaults to None, meaning no TensorBoard data will be collected.
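
The interplay between defaults and a JSON config file can be pictured as follows. This is a sketch, not the logic of src/athene/utils/config.py: `DEFAULTS` and `load_config` are hypothetical, the placeholder model name is invented, and only the default values stated in this README (0.5, None, the two BERT ports, and the checkpoint folder pattern) are taken from the text.

```python
import json
from typing import Any, Dict

# Hypothetical defaults; "my_model" is a placeholder, the other values
# are the defaults stated in this README.
DEFAULTS: Dict[str, Any] = {
    "model_name": "my_model",
    "max_gpu_memory": 0.5,
    "tensorboard_folder": None,
    "bert_port": 5555,
    "bert_port_out": 5556,
}


def load_config(json_text: str) -> Dict[str, Any]:
    """Overlay the fields of a JSON config on top of the defaults (sketch)."""
    config = dict(DEFAULTS)
    config.update(json.loads(json_text))
    # max_gpu_memory is a fraction in the half-open interval (0.0, 1.0].
    if not 0.0 < config["max_gpu_memory"] <= 1.0:
        raise ValueError("max_gpu_memory must be in (0.0, 1.0]")
    # Derive the checkpoint folder from the model name unless it is set.
    if "ckpt_folder" not in config:
        config["ckpt_folder"] = f"model/{config['model_name']}/rte_checkpoints"
    return config
```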

The 'esim_hyper_param' field contains the hyperparameters for the ESIM-based model and the model using BERT word encodings in the RTE sub-task. Selected parameters are described below:

Field	Description
num_neurons	The number of neurons for each layer in the model. The first 2 numbers refer to the numbers of neurons of the two bidirectional RNNs in the ESIM model.
pos_weight	The positive weights of the 3 classes for the weighted loss. The order is Supported, Refuted, Not Enough Info.
max_checks_no_progress	Early stopping policy. Stop training if no improvement in the last x epochs.
trainable	Whether to fine tune the word embeddings. True or False.
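
The early stopping policy behind max_checks_no_progress can be sketched as a small counter. The class below is an illustration, not the repository's code: `EarlyStopping` is a hypothetical name, and using the validation loss as the progress signal is an assumption.

```python
class EarlyStopping:
    """Stop once the monitored loss has not improved for x checks (sketch)."""

    def __init__(self, max_checks_no_progress: int):
        self.max_checks = max_checks_no_progress
        self.best = float("inf")
        self.checks_without_progress = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            # Improvement: remember the new best value and reset the counter.
            self.best = val_loss
            self.checks_without_progress = 0
        else:
            self.checks_without_progress += 1
        return self.checks_without_progress >= self.max_checks
```

In a training loop, `should_stop` would be called once per epoch with the current validation loss, and training would end as soon as it returns True.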

The 'esim_mtl_hyper_param' field contains the hyperparameters for the ESIM Multi-Task Learning model and the model using BERT word encodings in the RTE sub-task. Selected parameters are described below:

Field	Description
num_neurons_esim	The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers.
num_neurons_claim_validation	The number of neurons of the MLP for the claim validation task.
num_neurons_evidence_evaluation	The number of neurons of the MLP for the evidence evaluation task.

The 'esim_end_2_end_hyper_param' field contains the hyperparameters for the ESIM End-2-End model and the model using BERT word encodings in the RTE sub-task. Selected parameters are described below:

Field	Description
sentence_selection_criterion	The criterion of selecting candidate evidences. 'n_best_sents': Select top N sentences. 'sent_threshold': Select sentences with scores higher than a threshold.
n_best_sents	The number of sentences to select. Only for the criterion 'n_best_sents'.
sent_threshold	The threshold for selecting sentences. Only for the criterion 'sent_threshold'.
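
The two selection criteria can be illustrated with a short function. This is a sketch under the assumption that each candidate sentence already has a relevance score; `select_sentences` and its default values are hypothetical, while the criterion names are the ones listed above.

```python
from typing import List


def select_sentences(scores: List[float], criterion: str,
                     n_best_sents: int = 5,
                     sent_threshold: float = 0.5) -> List[int]:
    """Return indices of selected sentences, best score first (sketch)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if criterion == "n_best_sents":
        # Keep the top N sentences regardless of their absolute scores.
        return ranked[:n_best_sents]
    if criterion == "sent_threshold":
        # Keep every sentence whose score is higher than the threshold.
        return [i for i in ranked if scores[i] > sent_threshold]
    raise ValueError(f"unknown criterion: {criterion}")
```

With 'n_best_sents' the number of selected sentences is fixed, whereas with 'sent_threshold' it varies per claim and can be zero.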

The 'esim_credibility_hyper_param' field contains the hyperparameters for the model using source information for the Snopes corpus in the RTE sub-task. Selected parameters are described below:

Field	Description
word_trainable	True or False. Whether the word embeddings are trainable.
domain_trainable	True or False. Whether the domain name embeddings are trainable.
suffix_trainable	True or False. Whether the embeddings for the suffix of domain names are trainable.
stance_trainable	True or False. Whether the stance embeddings are trainable.
protocol_trainable	True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable.
domain_embedding_size	The dimensionality of the domain name embeddings.
suffix_embedding_size	The dimensionality of the embeddings for the suffix of the domain name.
protocol_embedding_size	The dimensionality of the embeddings for the protocols of the domain name.
stance_embedding_size	The dimensionality of the stance embeddings.
features_embeddings_path	The path where the embeddings are stored after training.
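
To make the source features concrete, the sketch below splits a URL into the protocol, domain name, and suffix that the embeddings above refer to. It is an illustration only: `source_features` is a hypothetical helper, and taking the last label of the host name as the suffix is a simplifying assumption that may differ from how the repository extracts it.

```python
from typing import Dict
from urllib.parse import urlparse


def source_features(url: str) -> Dict[str, str]:
    """Split a URL into protocol, domain name, and domain suffix (sketch)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Drop a leading "www." so that both spellings map to the same domain.
    if host.startswith("www."):
        host = host[4:]
    labels = host.split(".") if host else []
    return {
        "protocol": parsed.scheme.lower(),         # e.g. "http" or "https"
        "domain": host,                            # e.g. "snopes.com"
        "suffix": labels[-1] if labels else "",    # e.g. "com" (naive)
    }
```

Each of the three strings would then be mapped to an index in its own vocabulary and looked up in the corresponding embedding matrix.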

The 'esim_credibility_mtl_hyper_param' field contains the hyperparameters for the Multi-Task Learning model using source information for the Snopes corpus in the RTE sub-task. Selected parameters are described below:

Field	Description
num_neurons_esim	The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers.
num_neurons_claim_validation	The number of neurons of the MLP for the claim validation task.
num_neurons_stance_detection	The number of neurons of the MLP for the stance detection task.
word_trainable	True or False. Whether the word embeddings are trainable.
domain_trainable	True or False. Whether the domain name embeddings are trainable.
suffix_trainable	True or False. Whether the embeddings for the suffix of domain names are trainable.
stance_trainable	True or False. Whether the stance embeddings are trainable.
protocol_trainable	True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable.
domain_embedding_size	The dimensionality of the domain name embeddings.
suffix_embedding_size	The dimensionality of the embeddings for the suffix of the domain name.
protocol_embedding_size	The dimensionality of the embeddings for the protocols of the domain name.
stance_embedding_size	The dimensionality of the stance embeddings.
claim_training_set	The path to the training set of the claim validation task.
claim_dev_set	The path to the development set of the claim validation task.
claim_test_set	The path to the test set of the claim validation task.
stance_training_set	The path to the training set of the stance detection task.
stance_dev_set	The path to the development set of the stance detection task.
stance_test_set	The path to the test set of the stance detection task.
features_embeddings_path	The path where the embeddings are stored after training.

The 'han_hyper_param' field contains the hyperparameters for the Hierarchical Attention Network models in the RTE sub-task. Selected parameters are described below:

Field	Description
word_output_size	The dimensionality of the word-level output after alignment by the attention.
sent_output_size	The dimensionality of the sentence-level output after alignment by the attention.

The 'use_hyper_param' field contains the hyperparameters for the Universal Sentence Encoder based models in the RTE sub-task. Selected parameters are described below:

Field	Description
attention_function	Only for use_attention models. The attention implementation to use. Choices are: 'cosine_similarity'; 'weight_matrix': with c as claim and e as evidence, randomly initialize a 512*512 trainable matrix m and calculate the attention weights as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and the attention weight as v·c; 'weight_matrix_3': pass both c and e through a feed-forward layer, then multiply the results to obtain the attention weight.
do_finetune	Only for use_attention models. True or False. Whether to fine tune USE.
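
The first three attention choices can be written out in a few lines. The following pure-Python sketch uses tiny vectors instead of 512-dimensional sentence encodings and plain lists instead of trainable tensors; all function names are hypothetical and the code only mirrors the formulas given in the table.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def cosine_similarity(c: Vector, e: Vector) -> float:
    # 'cosine_similarity': angle between claim c and evidence e.
    norm = math.sqrt(dot(c, c)) * math.sqrt(dot(e, e))
    return dot(c, e) / norm if norm else 0.0


def weight_matrix(c: Vector, m: Matrix, e: Vector) -> float:
    # 'weight_matrix': bilinear score c·m·e with a trainable matrix m.
    return dot(c, matvec(m, e))


def weight_matrix_2(c: Vector, w: Matrix, b: Vector, e: Vector) -> float:
    # 'weight_matrix_2': v = W·e + b, then the score is v·c.
    v = [x + y for x, y in zip(matvec(w, e), b)]
    return dot(v, c)
```

In the models, m, W, and b would be trainable parameters, and one score would be computed per evidence sentence for a given claim.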

The 'bert_sent_hyper_param' field contains the hyperparameters for the models using the BERT sentence encoding in the RTE sub-task. Selected parameters are described below:

Field	Description
attention_function	The attention implementation to use. Choices are: 'cosine_similarity'; 'weight_matrix': with c as claim and e as evidence, randomly initialize a 512*512 trainable matrix m and calculate the attention weights as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and the attention weight as v·c; 'weight_matrix_3': pass both c and e through a feed-forward layer, then multiply the results to obtain the attention weight.

The 'sentence_retrieval_ensemble_param' field contains the hyperparameters for the ESIM-based model in the sentence retrieval sub-task. Selected parameters are described below:

Field	Description
num_model	The number of models to ensemble.
tf_random_state	The random seeds for the models to ensemble.
num_negatives	The number of negative samples, i.e. false evidence sentences, for each claim in the training phase.
c_max_length	The maximal length of each claim. The words that exceed the maximal length are truncated.
s_max_length	The maximal length of each candidate evidence sentence. The words that exceed the maximal length are truncated.
reserve_embed	Whether to reserve slots in the word embeddings for unseen words. True or False.
model_path	Path to the folder for the checkpoint files of the ensemble models.
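
The role of num_negatives and of a per-model random seed can be illustrated with a small sampling routine. This is a sketch, not the repository's sampling code: `sample_negatives` is a hypothetical helper that assumes the candidate sentences and the gold evidence for a claim are given as lists of strings.

```python
import random
from typing import List


def sample_negatives(candidates: List[str], gold: List[str],
                     num_negatives: int, seed: int) -> List[str]:
    """Draw false evidence sentences for one claim (sketch)."""
    rng = random.Random(seed)  # local generator, so the draw is reproducible
    gold_set = set(gold)
    # Only sentences that are not gold evidence qualify as negatives.
    pool = [s for s in candidates if s not in gold_set]
    # Never ask for more samples than the pool contains.
    return rng.sample(pool, min(num_negatives, len(pool)))
```

Using a different seed for each ensemble member, in the spirit of tf_random_state, would give every model its own set of negatives while keeping each run repeatable.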

Contacts:

<lastname>@ukp.informatik.tu-darmstadt.de
https://www.informatik.tu-darmstadt.de/ukp/ukp_home/
https://www.tu-darmstadt.de

Credit:

This repository was built by Zile Li, Hao Zhang, and Andreas Hanselowski.

License:

Apache License Version 2.0