HarveyNER

November 8, 2022

We introduce HarveyNER, a new dataset with fine-grained locations annotated in tweets. The dataset presents unique challenges and features many complex and long location mentions in informal descriptions. We built strong baseline models using Curriculum Learning and experimented with different heuristic curricula to better recognize difficult location mentions.

HarveyNER focuses on coordinate-oriented locations, so we mainly annotate Point, which can be precisely pinned to a map, and Area, which occupies a small polygon on a map. Because some disasters affect line-like objects (e.g., a flood can affect the neighborhoods along a whole river), we also include the Road and River types.

  • Points: denote an exact location to which a geocoordinate can be assigned, e.g., a uniquely named building or an intersection of roads or rivers.
  • Areas: denote geographical entities such as city subdivisions, neighborhoods, etc.
  • Roads: denote a road or a section of a road.
  • Rivers: denote a river or a section of a river.
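The four types above could be read back from token-level tags roughly as follows. This is a minimal sketch, assuming a BIO tagging scheme; the tag strings (`B-ROAD`, `I-RIVER`, and so on), the `extract_spans` helper, and the example sentence are all hypothetical and may not match the files in the data directory.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, mention_text) pairs from BIO-tagged tokens.

    Assumes tags of the form "B-TYPE", "I-TYPE", or "O"; the real tag
    strings used by the dataset may differ.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close any open one first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open mention.
            current_tokens.append(token)
        else:
            # An "O" tag or an inconsistent "I-" tag ends the open mention.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Invented example sentence, not taken from the dataset.
tokens = ["Water", "rising", "near", "Main", "St", "and", "Buffalo", "Bayou"]
tags = ["O", "O", "O", "B-ROAD", "I-ROAD", "O", "B-RIVER", "I-RIVER"]
print(extract_spans(tokens, tags))
```

Multi-token mentions are joined with single spaces, which is enough for inspection but does not restore the original tweet spacing.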

Statistics

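| Data Split       | Train | Valid | Test  | Total |
|------------------|-------|-------|-------|-------|
| All Tweets       | 3,967 | 1,301 | 1,303 | 6,571 |
| Tweet w/ Entity  | 1,087 | 366   | 353   | 1,806 |
| Tweet w/o Entity | 2,880 | 935   | 950   | 4,765 |
| All Entity Type  | 1,581 | 523   | 500   | 2,604 |
| Point            | 591   | 206   | 202   | 999   |
| Area             | 715   | 236   | 212   | 1,163 |
| Road             | 158   | 51    | 57    | 266   |
| River            | 117   | 30    | 29    | 176   |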
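As a quick sanity check on the statistics, the short sketch below copies the counts into a dictionary and verifies that the splits add up to the totals and that the four entity types add up to the "All Entity Type" row. The variable names are illustrative only and are not part of the repository.

```python
# Counts copied from the statistics table: (train, valid, test, total).
stats = {
    "All Tweets": (3967, 1301, 1303, 6571),
    "Tweet w/ Entity": (1087, 366, 353, 1806),
    "Tweet w/o Entity": (2880, 935, 950, 4765),
    "All Entity Type": (1581, 523, 500, 2604),
    "Point": (591, 206, 202, 999),
    "Area": (715, 236, 212, 1163),
    "Road": (158, 51, 57, 266),
    "River": (117, 30, 29, 176),
}

# Each row: train + valid + test should equal the total.
for name, (train, valid, test, total) in stats.items():
    assert train + valid + test == total, name

# Each column: the four entity types should sum to "All Entity Type",
# and tweets with and without entities should sum to "All Tweets".
entity_types = ["Point", "Area", "Road", "River"]
for col in range(4):
    assert sum(stats[t][col] for t in entity_types) == stats["All Entity Type"][col]
    assert (
        stats["Tweet w/ Entity"][col] + stats["Tweet w/o Entity"][col]
        == stats["All Tweets"][col]
    )

# Share of tweets that contain at least one annotated location.
share_with_entity = stats["Tweet w/ Entity"][3] / stats["All Tweets"][3]
print(f"Tweets with at least one entity: {share_with_entity:.1%}")
```

Most tweets contain no annotated location at all, which is worth keeping in mind when interpreting entity-level scores.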

Dataset

Please use the latest version in the data directory.

Requirement

Please see requirement. You can create a conda environment using the bert_ner.yml file:

$ conda env create -f bert_ner.yml

Run

$ python run_ner_loc.py --data_dir=data/tweets --bert_model=bert-base-uncased --task_name=ner --max_seq_length=48 --num_train_epochs=50 --learning_rate=5e-5 --bert_lr=5e-5 --train_batch_size=32 --eval_batch_size=32 --do_train --do_eval --do_predict --seed=42  --do_lower_case --warmup_proportion=0.1 --curriculum=commonness --netural --complexity_lambda=0.6 --maximum_lambda=1 --anti
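When running several configurations, it can be convenient to assemble the command above programmatically. The following is a minimal sketch that only builds and prints the argument list. The `build_command` helper is hypothetical, the option values are copied from the command above, and treating `--do_train`, `--netural`, `--anti`, and the other bare options as value-less switches is an assumption based on how they appear there.

```python
import shlex

# Valued options copied from the command above (kept as strings so that
# values such as "5e-5" are passed through unchanged).
options = {
    "data_dir": "data/tweets",
    "bert_model": "bert-base-uncased",
    "task_name": "ner",
    "max_seq_length": "48",
    "num_train_epochs": "50",
    "learning_rate": "5e-5",
    "bert_lr": "5e-5",
    "train_batch_size": "32",
    "eval_batch_size": "32",
    "seed": "42",
    "warmup_proportion": "0.1",
    "curriculum": "commonness",
    "complexity_lambda": "0.6",
    "maximum_lambda": "1",
}

# Bare options, spelled exactly as in the original command.
switches = ["do_train", "do_eval", "do_predict", "do_lower_case", "netural", "anti"]


def build_command(options, switches, script="run_ner_loc.py"):
    """Return the argument list for one run (hypothetical helper)."""
    argv = ["python", script]
    argv += [f"--{key}={value}" for key, value in options.items()]
    argv += [f"--{name}" for name in switches]
    return argv


argv = build_command(options, switches)
print(shlex.join(argv))  # pass argv to subprocess.run(argv) to launch a run
```

The options come out in a different order than in the original command, which should not matter if the script parses them by name.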

Citation

If you extend or use this dataset, please cite the paper where it was introduced.

@inproceedings{chen-etal-2022-crossroads,
    title = "Crossroads, Buildings and Neighborhoods: A Dataset for Fine-grained Location Recognition",
    author = "Chen, Pei  and Xu, Haotian  and Zhang, Cheng  and Huang, Ruihong",
    booktitle = "NAACL",
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.243",
}