Experiments

August 16, 2019

This document describes further experiments that were conducted after the official paper version.

Subword embeddings

We ran extensive experiments in which all Wikipedia and Common Crawl embeddings (as well as character embeddings) were replaced with subword embeddings.

For this purpose we use Byte-Pair Encoding (BPE) embeddings, as proposed by Heinzerling and Strube in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages".

For the first batch of experiments, we use a fixed embedding dimension of 300 and vary only the number of merge operations: 1,000, 3,000, 5,000, 10,000, 25,000, 50,000, 100,000 and 200,000.
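As a rough sketch of how this grid of settings could be enumerated, the snippet below lists the eight configurations and shows how one of them might be loaded with the `bpemb` Python package. The language code "de", the helper names `bpemb_configs` and `load_bpemb`, and the mapping of "merge operations" to BPEmb's `vs` (vocabulary size) parameter are assumptions made for illustration; this is not necessarily how the experiments were run.

```python
from typing import Iterator, Tuple

# Merge-operation settings listed in the text.
# Assumption: they correspond to BPEmb's `vs` (vocabulary size) parameter.
MERGE_OPS = [1_000, 3_000, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000]
DIM = 300  # fixed embedding dimension from the text


def bpemb_configs(lang: str = "de") -> Iterator[Tuple[str, int, int]]:
    """Yield (lang, vocab_size, dim) for every setting.

    The default language code "de" is a placeholder assumption.
    """
    for vs in MERGE_OPS:
        yield (lang, vs, DIM)


def load_bpemb(lang: str, vs: int, dim: int):
    """Load one pre-trained BPEmb model (downloads files on first use)."""
    # Imported lazily so the grid can be inspected without the package installed.
    from bpemb import BPEmb
    return BPEmb(lang=lang, vs=vs, dim=dim)


configs = list(bpemb_configs())
for lang, vs, dim in configs:
    print(f"lang={lang} vs={vs} dim={dim}")
```

Calling `load_bpemb(*configs[0])` would fetch the smallest model; the loop above only prints the settings and does not download anything.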

ONB dataset

The current SOTA on the ONB dataset is an F-Score of 85.31 (reported in our paper). The following table shows the F-Scores obtained with subword embeddings for different numbers of merge operations:

Merge operations	Run 1	Run 2	Run 3	Avg. runs
1,000	84.42	83.73	83.23	83.79
3,000	81.72	83.75	85.08	83.52
5,000	83.81	83.83	84.40	84.01
10,000	84.58	84.58	84.55	84.57
25,000	86.01	84.49	84.64	85.05
50,000	85.34	85.27	84.82	85.14
100,000	84.50	86.16	84.97	85.21
200,000	84.61	85.19	85.23	85.01

Subword embeddings with 100,000 merge operations achieve an averaged F-Score of 85.21, which is very close to the result reported in our paper (85.31).
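The averaged column can be recomputed from the three runs. The following is a minimal sketch that copies the per-run F-Scores from the ONB table above and picks the setting with the highest average; the variable names are illustrative only.

```python
from statistics import mean

# F-Scores per run, copied from the ONB table above.
onb_runs = {
    1_000: (84.42, 83.73, 83.23),
    3_000: (81.72, 83.75, 85.08),
    5_000: (83.81, 83.83, 84.40),
    10_000: (84.58, 84.58, 84.55),
    25_000: (86.01, 84.49, 84.64),
    50_000: (85.34, 85.27, 84.82),
    100_000: (84.50, 86.16, 84.97),
    200_000: (84.61, 85.19, 85.23),
}

# Average over the three runs, rounded to two decimals as in the table.
onb_avg = {ops: round(mean(scores), 2) for ops, scores in onb_runs.items()}

# Setting with the highest averaged F-Score.
best_ops = max(onb_avg, key=onb_avg.get)
print(best_ops, onb_avg[best_ops])
```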

Using Wikipedia, Common Crawl and character embeddings results in a total dimension size of 650 (300 + 300 + 50). Subword embeddings have a total dimension size of only 300. Thus, the network is smaller while the F-Score remains almost unchanged. The file size of the trained NER model decreases from 2.7 GB to only 424 MB, and the training time decreases from 48 minutes to 26 minutes.
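To put these savings in relative terms, the short sketch below computes the percentage reduction for the embedding dimension, model size and training time. It assumes 1 GB = 1000 MB for the size conversion; the helper `reduction` is a name introduced here for illustration.

```python
# Figures taken from the paragraph above.
baseline_dim = 300 + 300 + 50  # Wikipedia + Common Crawl + character embeddings
subword_dim = 300

baseline_size_mb = 2.7 * 1000  # assumption: 1 GB = 1000 MB
subword_size_mb = 424

baseline_minutes = 48
subword_minutes = 26


def reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after` in percent."""
    return 100.0 * (before - after) / before


for name, before, after in [
    ("embedding dimension", baseline_dim, subword_dim),
    ("model size (MB)", baseline_size_mb, subword_size_mb),
    ("training time (min)", baseline_minutes, subword_minutes),
]:
    print(f"{name}: {before} -> {after} ({reduction(before, after):.1f}% less)")
```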

LFT dataset

The current SOTA on the LFT dataset is an F-Score of 77.51 (reported in our paper). The following table shows the F-Scores obtained with subword embeddings for different numbers of merge operations:

Merge operations	Run 1	Run 2	Run 3	Avg. runs
1,000	75.84	74.66	75.23	75.24
3,000	75.02	75.72	76.10	75.61
5,000	75.46	75.25	76.58	75.76
10,000	75.73	75.46	76.20	75.80
25,000	74.48	75.31	76.42	75.40
50,000	76.34	75.87	76.66	76.29
100,000	75.96	77.72	76.79	76.82
200,000	75.43	76.09	75.70	75.74

Subword embeddings with 100,000 merge operations achieve an averaged F-Score of 76.82, which is 0.69 points below the result reported in our paper (77.51).
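The gap to the paper result can be checked for every setting, not only the best one. The sketch below copies the per-run F-Scores from the LFT table above, computes the difference between each average and 77.51, and lists individual runs that reach the paper result; the names `PAPER_F1`, `lft_gap` and `runs_at_or_above` are illustrative.

```python
PAPER_F1 = 77.51  # LFT result reported in the paper

# F-Scores per run, copied from the LFT table above.
lft_runs = {
    1_000: (75.84, 74.66, 75.23),
    3_000: (75.02, 75.72, 76.10),
    5_000: (75.46, 75.25, 76.58),
    10_000: (75.73, 75.46, 76.20),
    25_000: (74.48, 75.31, 76.42),
    50_000: (76.34, 75.87, 76.66),
    100_000: (75.96, 77.72, 76.79),
    200_000: (75.43, 76.09, 75.70),
}

# Difference between the averaged F-Score and the paper result (negative = worse).
lft_gap = {
    ops: round(sum(scores) / len(scores) - PAPER_F1, 2)
    for ops, scores in lft_runs.items()
}

# Individual runs that match or exceed the paper result.
runs_at_or_above = [
    (ops, score)
    for ops, scores in lft_runs.items()
    for score in scores
    if score >= PAPER_F1
]

for ops, gap in lft_gap.items():
    print(f"{ops:>7}: {gap:+.2f}")
print(runs_at_or_above)
```

With the values from the table, the only individual run at or above 77.51 is the second run with 100,000 merge operations (77.72), so the averaged gap partly reflects run-to-run variance.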

The model size shrinks from 2.7 GB to 424 MB when using subword embeddings, and the training time decreases from 2 hours to only 1 hour.