README.md

November 14, 2019

Repository for the CoNLL 2019 paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Link to the paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Please use the following citation:

@inproceedings{hanselowski2019snopes,
          title={A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking},
          author={Hanselowski, Andreas and Stab, Christian and Schulz, Claudia and Li, Zile and Gurevych, Iryna},
          booktitle={Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL2019)},
          year={2019}
        }

Disclaimer:

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Corpus for this repository

  • You can either recreate the corpus using the web crawler
  • Or request the original splits of the data, needed to reproduce the results, on this data archive website (please take note of the licence agreement, as the corpus is not publicly available)

Installation

  • Create a Python environment with conda
    conda create -n snopes python=3.6
    source activate snopes
  • Install the required dependencies
    pip install -r requirements.txt
    pip install git+https://github.com/j6mes/drqa@parallel
  • Download the NLTK Punkt tokenizer
    python -c "import nltk; nltk.download('punkt')"
  • For BERT models, create a separate Python environment:
    conda create -n bert python=3.6
    source activate bert
    pip install -r requirements_bert.txt
    pip install git+https://github.com/j6mes/drqa@parallel
  • Proceed with downloading the data set, the embeddings, the models and the evidence data

Download the word embeddings

Download pretrained GloVe Vectors

    wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
    mkdir -p data/glove
    unzip glove.6B.zip -d data/glove
    gzip data/glove/*.txt
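
After the last step the GloVe files are gzip-compressed text files. As a quick sanity check, they can be read with a few lines of Python. This is a minimal sketch and not the loader used in this repository; the function name `load_glove` and the `limit` parameter are made up for illustration, and it assumes the usual GloVe layout of one token followed by its space-separated vector values per line.

```python
import gzip
from typing import Dict, List, Optional


def load_glove(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read GloVe vectors from a plain-text or gzip-compressed file."""
    # Pick the opener from the file extension; both return text lines.
    opener = gzip.open if path.endswith(".gz") else open
    vectors: Dict[str, List[float]] = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break
            # Assumed layout: "<token> <v1> <v2> ... <vn>", separated by single spaces.
            token, *values = line.rstrip("\n").split(" ")
            vectors[token] = [float(value) for value in values]
    return vectors
```

For example, `load_glove("data/glove/glove.6B.300d.txt.gz", limit=10000)` would read only the first 10,000 vectors, which keeps memory usage low while checking the download.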

Download pretrained Wiki FastText Vectors

    wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
    mkdir -p data/fasttext
    unzip wiki.en.zip -d data/fasttext
    
Alternatively, download the wiki-news vectors. In this case, the entry "fasttext_path": "data/fasttext/wiki.en.bin" in the config files in the folders conf_snopes_paper and conf_snopes_stance_paper needs to be changed to "fasttext_path": "data/fasttext/wiki-news-300d-1M.vec".

    wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
    mkdir -p data/fasttext
    unzip wiki-news-300d-1M.vec.zip -d data/fasttext
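
If the wiki-news vectors are used, the "fasttext_path" entry has to be changed in several config files. The following sketch shows one possible way to do this in bulk. It assumes that the config files are plain JSON files and that "fasttext_path" may sit at any nesting level; the helper names `replace_key` and `update_fasttext_path` are hypothetical and not part of this repository. Note that the files are rewritten with a fresh indentation, so it is safer to run it on a clean working copy.

```python
import json
from pathlib import Path
from typing import Any, List


def replace_key(node: Any, key: str, value: str) -> int:
    """Set every occurrence of `key` in nested dicts/lists; return the number of changes."""
    changed = 0
    if isinstance(node, dict):
        for name in node:
            if name == key:
                node[name] = value
                changed += 1
            else:
                changed += replace_key(node[name], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += replace_key(item, key, value)
    return changed


def update_fasttext_path(config_dirs: List[str], new_path: str) -> List[str]:
    """Rewrite "fasttext_path" in every *.json file directly inside the given folders."""
    updated = []
    for config_dir in config_dirs:
        for config_file in sorted(Path(config_dir).glob("*.json")):
            config = json.loads(config_file.read_text(encoding="utf-8"))
            # Only write the file back if something actually changed.
            if replace_key(config, "fasttext_path", new_path):
                config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")
                updated.append(str(config_file))
    return updated
```

A call such as `update_fasttext_path(["conf_snopes_paper", "conf_snopes_stance_paper"], "data/fasttext/wiki-news-300d-1M.vec")` returns the list of files that were modified.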

Pre-processed datasets

Ask for permission to download the dataset: Snopes Corpus

    unzip ukp_snopes_corpus.zip
    cp -r snopes_corpus/datasets data/

Evidence extraction

    PYTHONPATH=src python src/scripts/evidence_extraction.py --model  [esim or bilstm_ranking or decompos_att or tf-idf or random_baseline]
    e.g.: PYTHONPATH=src python src/scripts/evidence_extraction.py --model bilstm_ranking

Available models:

| Model | Description |
|---|---|
| esim | ESIM ranking model |
| bilstm_ranking | BiLSTM ranking model |
| decompos_att | Decomposable attention ranking model |
| tf-idf | TF-IDF ranking |
| random_baseline | Random baseline |
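
To compare all ranking models, the evidence extraction command can be launched once per model. The sketch below builds the same command line as shown above and runs it with `subprocess`; the helper names `build_command` and `run_all` are illustrative only, and it assumes the script is started from the repository root.

```python
import os
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple

# Model names accepted by --model, as listed above.
MODELS = ["esim", "bilstm_ranking", "decompos_att", "tf-idf", "random_baseline"]


def build_command(model: str) -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for one evidence extraction run."""
    if model not in MODELS:
        raise ValueError("unknown model: {}".format(model))
    # Equivalent to prefixing the shell command with PYTHONPATH=src.
    env = dict(os.environ, PYTHONPATH="src")
    command = [sys.executable, "src/scripts/evidence_extraction.py", "--model", model]
    return command, env


def run_all(models: Sequence[str] = MODELS) -> None:
    """Run evidence extraction for each model in turn; stop on the first failure."""
    for model in models:
        command, env = build_command(model)
        subprocess.run(command, env=env, check=True)
```

Calling `run_all(["tf-idf", "random_baseline"])` would run only the two baselines, which is a cheap way to check the setup before training the neural rankers.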

Stance detection

To run the stance detection models, launch the script as follows:

    PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config conf_snopes_stance_paper/<config_file.json>]
    e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_stance_paper/config_snopes_use_bilstm.json

Available configuration files:

| Configuration file | Description |
|---|---|
| config_snopes_bert_sent_bilstm.json | BERT sentence embeddings with BiLSTM encoding |
| config_snopes_bert_word_bilstm.json | BERT word embeddings with BiLSTM encoding |
| config_snopes_use_bilstm.json | Google Universal Sentence Encoder (USE+Attent) |

Claim validation

To run the claim validation models, launch the script as follows:

    PYTHONPATH=src python src/scripts/rte.py --mode <train/test> [--config conf_snopes_paper/<config_file.json>]
    e.g.: PYTHONPATH=src python src/scripts/rte.py --mode train --config conf_snopes_paper/config_snopes_bilstm.json

Available configuration files:

| Configuration file | Description |
|---|---|
| config_snopes_bilstm.json | BiLSTM |
| config_snopes_bert_sent_attention.json | BERT sentence embeddings with attention |
| config_snopes_esim_glove.json | Extended ESIM with GloVe embeddings |
| config_snopes_use_attention.json | Google Universal Sentence Encoder with attention |

Run Decomposable Attention model (different configurations are required)

To run the decomposable attention model with Snopes corpus, run the script below:

Training:
    PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json [config/snopes.json or config/stance.json] <folder/to/keep/the/model> --cuda-device <GPU Number> --is-snopes True
    e.g.: PYTHONPATH=src python src/scripts/train_da.py data/datasets/snopes.page.json config/stance.json model/decomp_attention_stance/ --cuda-device 0 --is-snopes True
Testing:
    PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json <folder/to/keep/the/model>/model.tar.gz [data/datasets/snopes.test.jsonl or data/datasets/snopes.stance.test.jsonl] --cuda-device <GPU Number> --is-snopes True
    e.g.: PYTHONPATH=src python src/scripts/eval_da.py data/datasets/snopes.page.json model/decomp_attention_stance/model.tar.gz data/datasets/snopes.stance.test.jsonl --cuda-device 0 --is-snopes True

BERT models (different configurations are required)

For BERT models, switch to the bert Python environment:

    source activate bert

The pretrained model needs to be downloaded and the bert-as-service server needs to be started first. To download the pretrained model:

    mkdir bert
    cd bert
    wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
    unzip cased_L-24_H-1024_A-16.zip

For BERT word embedding:

    bert-serving-start -pooling_strategy NONE -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]

For BERT sentence embedding:

    bert-serving-start -model_dir bert/cased_L-24_H-1024_A-16 -max_seq_len 50 [-port <bert_port> -port_out <bert_port_out> -graph_tmp_dir <path/to/tmp/folder, default as /tmp>]

bert_port defaults to 5555 and bert_port_out defaults to 5556. These two ports must match the fields of the same name in the config.

For more details, please refer to bert-as-service.
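
Since the server ports have to agree with the config, it can help to derive the server command from the config values instead of typing the ports twice. The following is a rough sketch under the assumption that bert_port and bert_port_out can be read from a dictionary loaded from the config; the function name `bert_server_command` is hypothetical. It only uses the bert-serving-start flags shown above.

```python
from typing import Any, Dict, List, Optional


def bert_server_command(
    config: Dict[str, Any],
    word_level: bool,
    model_dir: str = "bert/cased_L-24_H-1024_A-16",
    max_seq_len: int = 50,
    graph_tmp_dir: Optional[str] = None,
) -> List[str]:
    """Build the bert-serving-start command for word or sentence embeddings."""
    command = ["bert-serving-start"]
    if word_level:
        # Word embeddings need the un-pooled token outputs.
        command += ["-pooling_strategy", "NONE"]
    command += [
        "-model_dir", model_dir,
        "-max_seq_len", str(max_seq_len),
        # Fall back to the documented defaults if the config has no ports.
        "-port", str(config.get("bert_port", 5555)),
        "-port_out", str(config.get("bert_port_out", 5556)),
    ]
    if graph_tmp_dir is not None:
        command += ["-graph_tmp_dir", graph_tmp_dir]
    return command
```

The returned list can be joined with spaces for display or passed to `subprocess.Popen` to start the server in the background.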

Run AtheneMLP model for stance detection (not yet fixed)

Please note that the command below does not yet work properly and will be fixed later. If you want to fix it yourself, the required files are in the directory src/fnc.

    PYTHONPATH=src python src/fnc/pipeline.py

Description of the Config File

The config file for the file paths and the hyperparameters is src/athene/utils/config.py. The fields are described below:

| Field | Description |
|---|---|
| model_name | Name of the RTE model. Used as part of the path to save the trained RTE model. |
| glove_path | Path to the pre-trained GloVe word embeddings. Point to either the glove.6B.300d.txt.gz or the glove.6B.300d.txt file. |
| fasttext_path | Path to the pre-trained FastText word embeddings. Should point to the wiki.en.bin file. |
| ckpt_folder | Path to the checkpoint folder for the trained RTE model. Defaults to model/<model_name>/rte_checkpoints. |
| db_path | Path to the FEVER database file. |
| dataset_folder | Path to the dataset folder. |
| raw_training_set | Path to the original training set file. |
| raw_dev_set | Path to the original development set file. |
| raw_test_set | Path to the original test set file. |
| training_doc_file | Path to the training set with predicted pages, i.e. the output of the document retrieval sub-task for the training set. |
| dev_doc_file | Path to the development set with predicted pages, i.e. the output of the document retrieval sub-task for the development set. |
| test_doc_file | Path to the test set with predicted pages, i.e. the output of the document retrieval sub-task for the test set. |
| training_set_file | Path to the training set with predicted evidences, i.e. the output of the sentence retrieval sub-task for the training set. |
| dev_set_file | Path to the development set with predicted evidences, i.e. the output of the sentence retrieval sub-task for the development set. |
| test_set_file | Path to the test set with predicted evidences, i.e. the output of the sentence retrieval sub-task for the test set. |
| document_k_wiki | The maximal number of candidate pages for each claim in the document retrieval sub-task. |
| document_parallel | Whether to run the document retrieval sub-task in parallel. True or False. |
| document_add_claim | Whether to append the original claim to the query sent to the MediaWiki API in the document retrieval sub-task. True or False. |
| submission_file | Path to the final submission file. |
| estimator_name | The name of the RTE estimator, referring to src/athene/rte/utils/estimator_definitions.py. |
| max_sentences | The maximal number of predicted evidences for each claim. |
| max_sentence_size | The maximal length of each predicted evidence. Words beyond the maximal length are truncated. |
| max_claim_size | The maximal length of each claim. Words beyond the maximal length are truncated. |
| seed | Random seed of the RTE sub-task. |
| name | The prefix of the checkpoint files for the RTE sub-task. The checkpoint files will be saved in <ckpt_folder>. |
| bert_port | The port of the BERT server. Only for BERT models. |
| bert_port_out | The port_out of the BERT server. Only for BERT models. |
| max_gpu_memory | Maximal GPU memory usage as a fraction in (0.0, 1.0]. Defaults to 0.5. |
| tensorboard_folder | TensorBoard folder. Defaults to None, meaning no TensorBoard data will be collected. |
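
Several of the fields above have documented defaults. The following sketch illustrates how these defaults could be filled in and checked for a config given as a dictionary. It is not the logic of src/athene/utils/config.py; the function name `resolve_defaults` and the dictionary representation are assumptions made for this example.

```python
from typing import Any, Dict


def resolve_defaults(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the documented defaults filled in."""
    resolved = dict(config)
    model_name = resolved["model_name"]
    # Documented default: model/<model_name>/rte_checkpoints
    resolved.setdefault("ckpt_folder", "model/{}/rte_checkpoints".format(model_name))
    resolved.setdefault("max_gpu_memory", 0.5)
    resolved.setdefault("tensorboard_folder", None)
    resolved.setdefault("bert_port", 5555)
    resolved.setdefault("bert_port_out", 5556)
    # max_gpu_memory is a fraction in the half-open interval (0.0, 1.0].
    fraction = resolved["max_gpu_memory"]
    if not 0.0 < fraction <= 1.0:
        raise ValueError("max_gpu_memory must be in (0.0, 1.0], got {}".format(fraction))
    return resolved
```

For instance, `resolve_defaults({"model_name": "esim"})["ckpt_folder"]` yields `model/esim/rte_checkpoints`.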

The 'esim_hyper_param' field contains the hyperparameters of the ESIM-based model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| num_neurons | The number of neurons for each layer in the model. The first 2 numbers refer to the numbers of neurons of the two bidirectional RNNs in the ESIM model. |
| pos_weight | The positive weights of the 3 classes for the weighted loss. The order is Supported, Refuted, Not Enough Info. |
| max_checks_no_progress | Early stopping policy. Stop training if there is no improvement in the last x epochs. |
| trainable | Whether to fine-tune the word embeddings. True or False. |
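
The early stopping policy behind max_checks_no_progress can be illustrated with a small, framework-independent sketch. It assumes that a higher validation score is better and that one check is made per epoch; the function name `epochs_until_stop` is made up, and the actual training loop in this repository may differ in its details.

```python
from typing import Iterable


def epochs_until_stop(epoch_scores: Iterable[float], max_checks_no_progress: int) -> int:
    """Return how many epochs run before early stopping ends the training."""
    best_score = float("-inf")
    checks_without_progress = 0
    epochs_run = 0
    for score in epoch_scores:
        epochs_run += 1
        if score > best_score:
            # Improvement: remember the score and reset the counter.
            best_score = score
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress >= max_checks_no_progress:
                break
    return epochs_run
```

With the validation scores `[0.5, 0.6, 0.59, 0.58, 0.7]` and `max_checks_no_progress=2`, training stops after the fourth epoch and never sees the later improvement, which is the usual trade-off of a small patience value.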

The 'esim_mtl_hyper_param' field contains the hyperparameters of the ESIM Multi-Task Learning model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| num_neurons_esim | The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers. |
| num_neurons_claim_validation | The number of neurons of the MLP for the claim validation task. |
| num_neurons_evidence_evaluation | The number of neurons of the MLP for the evidence evaluation task. |

The 'esim_end_2_end_hyper_param' field contains the hyperparameters of the ESIM End-2-End model and of the model using BERT word encodings in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| sentence_selection_criterion | The criterion for selecting candidate evidences. 'n_best_sents': select the top N sentences. 'sent_threshold': select sentences with scores higher than a threshold. |
| n_best_sents | The number of sentences to select. Only for the criterion 'n_best_sents'. |
| sent_threshold | The threshold for selecting sentences. Only for the criterion 'sent_threshold'. |
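
The two selection criteria can be sketched as follows. The function name `select_sentences`, the default values and the representation of the input as (sentence, score) pairs are assumptions for illustration; only the criterion names and their meaning are taken from the description above.

```python
from typing import List, Tuple


def select_sentences(
    scored_sentences: List[Tuple[str, float]],
    criterion: str,
    n_best_sents: int = 5,
    sent_threshold: float = 0.5,
) -> List[str]:
    """Select candidate evidence sentences by rank or by score threshold."""
    # Sort by score, best first, so both criteria return ranked output.
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    if criterion == "n_best_sents":
        chosen = ranked[:n_best_sents]
    elif criterion == "sent_threshold":
        # "Higher than a threshold" is read as a strict comparison here.
        chosen = [pair for pair in ranked if pair[1] > sent_threshold]
    else:
        raise ValueError("unknown criterion: {}".format(criterion))
    return [sentence for sentence, _ in chosen]
```

The 'n_best_sents' criterion always returns a fixed number of sentences (if enough candidates exist), whereas 'sent_threshold' may return none at all for claims without any high-scoring candidate.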

The 'esim_credibility_hyper_param' field contains the hyperparameters of the model using source information for the Snopes corpus in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| word_trainable | True or False. Whether the word embeddings are trainable. |
| domain_trainable | True or False. Whether the domain name embeddings are trainable. |
| suffix_trainable | True or False. Whether the embeddings for the suffix of domain names are trainable. |
| stance_trainable | True or False. Whether the stance embeddings are trainable. |
| protocol_trainable | True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable. |
| domain_embedding_size | The dimensionality of the domain name embeddings. |
| suffix_embedding_size | The dimensionality of the embeddings for the suffix of the domain name. |
| protocol_embedding_size | The dimensionality of the embeddings for the protocols of the domain name. |
| stance_embedding_size | The dimensionality of the stance embeddings. |
| features_embeddings_path | The path where the embeddings are stored after training. |
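
The source features mentioned above (protocol, domain name and domain suffix) can all be derived from the URL of an evidence source. The sketch below shows one plausible way to extract them with the standard library. It is an assumption that the suffix is the last dot-separated label of the host name (for example "com"); the repository may define it differently, and the function name `source_features` is hypothetical.

```python
from typing import Dict
from urllib.parse import urlparse


def source_features(url: str) -> Dict[str, str]:
    """Split a source URL into protocol, domain name and domain suffix."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Drop a leading "www." so that both spellings map to the same domain.
    if host.startswith("www."):
        host = host[4:]
    # Assumption: the suffix is the last label of the host name.
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"protocol": parsed.scheme.lower(), "domain": host, "suffix": suffix}
```

Each of the three values would then be mapped to an index into its own embedding table, whose sizes are set by domain_embedding_size, suffix_embedding_size and protocol_embedding_size.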

The 'esim_credibility_mtl_hyper_param' field contains the hyperparameters of the Multi-Task Learning model using source information for the Snopes corpus in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| num_neurons_esim | The number of neurons for the shared ESIM component, i.e. 2 BiLSTM layers. |
| num_neurons_claim_validation | The number of neurons of the MLP for the claim validation task. |
| num_neurons_stance_detection | The number of neurons of the MLP for the stance detection task. |
| word_trainable | True or False. Whether the word embeddings are trainable. |
| domain_trainable | True or False. Whether the domain name embeddings are trainable. |
| suffix_trainable | True or False. Whether the embeddings for the suffix of domain names are trainable. |
| stance_trainable | True or False. Whether the stance embeddings are trainable. |
| protocol_trainable | True or False. Whether the embeddings for the protocols (HTTP or HTTPS) are trainable. |
| domain_embedding_size | The dimensionality of the domain name embeddings. |
| suffix_embedding_size | The dimensionality of the embeddings for the suffix of the domain name. |
| protocol_embedding_size | The dimensionality of the embeddings for the protocols of the domain name. |
| stance_embedding_size | The dimensionality of the stance embeddings. |
| claim_training_set | The path to the training set of the claim validation task. |
| claim_dev_set | The path to the development set of the claim validation task. |
| claim_test_set | The path to the test set of the claim validation task. |
| stance_training_set | The path to the training set of the stance detection task. |
| stance_dev_set | The path to the development set of the stance detection task. |
| stance_test_set | The path to the test set of the stance detection task. |
| features_embeddings_path | The path where the embeddings are stored after training. |

The 'han_hyper_param' field contains the hyperparameters of the Hierarchical Attention Network models in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| word_output_size | The dimensionality of the word-level output after alignment by the attention. |
| sent_output_size | The dimensionality of the sentence-level output after alignment by the attention. |

The 'use_hyper_param' field contains the hyperparameters of the Universal Sentence Encoder based models in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| attention_function | Only for use_attention models. The attention implementation to use. Choices: 'cosine_similarity'; 'weight_matrix': with c as claim and e as evidence, randomly initialize a 512*512 trainable matrix m and calculate the attention weight as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and the attention weight as v·c; 'weight_matrix_3': pass both c and e through a feed-forward layer, then multiply the results to obtain the attention weight. |
| do_finetune | Only for use_attention models. True or False. Whether to fine-tune USE. |

The 'bert_sent_hyper_param' field contains the hyperparameters of the models using the BERT sentence encoding in the RTE sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| attention_function | The attention implementation to use. Choices: 'cosine_similarity'; 'weight_matrix': with c as claim and e as evidence, randomly initialize a 512*512 trainable matrix m and calculate the attention weight as c·m·e; 'weight_matrix_2': compute the vector v=W·e + b and the attention weight as v·c; 'weight_matrix_3': pass both c and e through a feed-forward layer, then multiply the results to obtain the attention weight. |
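
The first three attention_function choices can be written down in a few lines. The sketch below uses plain Python lists instead of tensors and tiny vectors instead of the 512-dimensional encodings, purely to make the arithmetic visible; in the models, m, W and b would be trainable parameters. The helper names are illustrative, and 'weight_matrix_3' is left out because the description does not specify the feed-forward layers.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product, one dot product per matrix row."""
    return [dot(row, v) for row in m]


def cosine_similarity(c: Vector, e: Vector) -> float:
    """'cosine_similarity': cosine of the angle between claim c and evidence e."""
    norm = math.sqrt(dot(c, c)) * math.sqrt(dot(e, e))
    return dot(c, e) / norm if norm else 0.0


def weight_matrix(c: Vector, m: Matrix, e: Vector) -> float:
    """'weight_matrix': bilinear score c·m·e."""
    return dot(c, matvec(m, e))


def weight_matrix_2(c: Vector, w: Matrix, b: Vector, e: Vector) -> float:
    """'weight_matrix_2': v = W·e + b, then the score v·c."""
    v = [x + y for x, y in zip(matvec(w, e), b)]
    return dot(v, c)
```

With an identity matrix for m, 'weight_matrix' reduces to the plain dot product of c and e, which is a convenient check when experimenting with these variants.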

The 'sentence_retrieval_ensemble_param' field contains the hyperparameters of the ESIM-based model in the sentence retrieval sub-task. Several special parameters are described below:

| Field | Description |
|---|---|
| num_model | The number of models to ensemble. |
| tf_random_state | The random seeds for the models to ensemble. |
| num_negatives | The number of negative samples, i.e. false evidences, for each claim in the training phase. |
| c_max_length | The maximal length of each claim. Words beyond the maximal length are truncated. |
| s_max_length | The maximal length of each candidate evidence sentence. Words beyond the maximal length are truncated. |
| reserve_embed | Whether to reserve slots in the word embeddings for unseen words. True or False. |
| model_path | Path to the folder for the checkpoint files of the ensemble models. |
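
To make num_negatives, c_max_length and s_max_length more concrete, the following sketch draws negative examples for one claim and truncates token lists. It assumes that negatives are sampled uniformly from the candidate sentences that are not gold evidence, which may differ from the sampling used in this repository; the function names and the `seed` parameter are made up for this example.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    candidates: Sequence[str],
    gold_evidence: Set[str],
    num_negatives: int,
    seed: int = 0,
) -> List[str]:
    """Sample up to `num_negatives` candidate sentences that are not gold evidence."""
    pool = [sentence for sentence in candidates if sentence not in gold_evidence]
    rng = random.Random(seed)
    # Never ask for more samples than the pool contains.
    return rng.sample(pool, min(num_negatives, len(pool)))


def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first `max_length` tokens, as with c_max_length and s_max_length."""
    return tokens[:max_length]
```

Each claim is then paired with its true evidence and with the sampled negatives, so a larger num_negatives gives the ranking model more contrastive examples per claim at the cost of longer training.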

Contacts:

Credit:

This repository was built by Zile Li, Hao Zhang, and Andreas Hanselowski.

License:

  • Apache License Version 2.0