CogComp Tokenizer
July 2, 2017
This project is based on work by Yiming Jiang, who wrote the initial version and evaluated CCG and Stanford tokenizers against a corpus drawn from OntoNotes.
It has been modified in the following way: additional classes were
written to use cogcomp-core-utilities data structures. The underlying
tokenizer is still the LBJava SentenceSplitter. The evaluation code
has not been updated to use the new IllinoisTokenizer class (TODO).
How to Use It
The edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder
interface from cogcomp-core-utilities is used to create a TextAnnotation
object with SENTENCE and TOKEN views. A builder can also be provided
as a constructor argument to a CachingAnnotatorService that runs other
annotators in pipeline fashion.
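In outline, the flow looks like this (a pseudocode sketch; the exact cogcomp-core-utilities signatures are not verified here):

```
// pseudocode: build SENTENCE and TOKEN views over raw text
TextAnnotationBuilder builder = ...; // an implementation backed by the Illinois tokenizer
TextAnnotation ta = builder.createTextAnnotation(corpusId, docId, rawText);
// ta now exposes SENTENCE and TOKEN views; the same builder can be passed
// to a CachingAnnotatorService constructor to run further annotators in pipeline fashion.
```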
Setup
The StanfordTokenizer requires using a Java 8 runtime.
Original Project Structure
OntoNotes
OntoNotesParser:
input: a single ontonotes file
output: JSON file
OntoNotesJsonReader:
input: JSON file generated by OntoNotesParser
output: Curator Record Data Structure
Tokenizers
IllinoisTokenizer:
input: an array of sentence strings that together make up an article
output: Curator Record Data Structure
StanfordTokenizer:
input: an array of sentence strings that together make up an article
output: Curator Record Data Structure
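Both tokenizers map raw sentence strings to tokens with character offsets. To illustrate the kind of output involved (this is a minimal stand-in, not the actual LBJava or Stanford tokenization logic), a whitespace/punctuation tokenizer that records start and end offsets might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetTokenizer {
    // A token is its text plus [start, end) character offsets into the source string.
    public static class Token {
        public final String text;
        public final int start, end;
        Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
        @Override public String toString() { return text + "[" + start + "," + end + ")"; }
    }

    // Match runs of word characters, or single punctuation marks, keeping offsets.
    public static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(sentence);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note this naive splitter breaks "Mr." into two tokens, exactly the kind
        // of disagreement the Evaluator below is designed to measure.
        for (Token t : tokenize("Mr. Smith went to Washington.")) {
            System.out.println(t);
        }
    }
}
```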
Evaluator
Evaluator:
takes gold standard Record and a sample Record
Evaluation Criteria:
ON_GOLD_STANDARD_AGAINST_SAMPLE: iterate over each gold standard token and see if it's in sample tokens
ON_SAMPLE_AGAINST_GOLD_STANDARD: iterate over each sample token and see if it's in gold standard tokens
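Concretely, the two criteria are the same containment check run in opposite directions. A self-contained sketch of that comparison (the class and method names here are illustrative, not the actual Evaluator API, and real token matching may also compare offsets rather than strings alone):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenOverlap {
    // Fraction of tokens in `candidates` that also appear in `reference`.
    public static double containment(List<String> candidates, Set<String> reference) {
        if (candidates.isEmpty()) return 0.0;
        long hits = candidates.stream().filter(reference::contains).count();
        return (double) hits / candidates.size();
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("Mr", ".", "Smith", "went", "home");
        List<String> sample = Arrays.asList("Mr.", "Smith", "went", "home");

        // ON_GOLD_STANDARD_AGAINST_SAMPLE: each gold token checked against the sample tokens.
        double recallLike = containment(gold, new HashSet<>(sample));
        // ON_SAMPLE_AGAINST_GOLD_STANDARD: each sample token checked against the gold tokens.
        double precisionLike = containment(sample, new HashSet<>(gold));

        System.out.println(recallLike);     // 3 of 5 gold tokens appear in the sample
        System.out.println(precisionLike);  // 3 of 4 sample tokens appear in the gold set
    }
}
```

The two directions behave like recall and precision: a tokenizer that over-splits scores well in one direction and poorly in the other.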
JSON File Format
Each JSON file is generated by OntoNotesParser and corresponds to a single OntoNotes file.
This is the format of a JSON file:
{
    "sentences": [
        {
            "sentence_text": ...,
            "sentence_start_offset": ...,
            "sentence_end_offset": ...,
            "tokens": [
                {
                    "token_text": ...,
                    "token_start_offset": ...,
                    "token_end_offset": ...
                },
                ...
            ]
        },
        ...
    ]
}
Format:
"sentences" has an array of sentences, with sentence text and offsets.
"tokens" has an array of tokens with token text and offsets.
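For example, a file whose article consists of the single sentence "Hello world." would look like this (the values shown are illustrative):

```json
{
    "sentences": [
        {
            "sentence_text": "Hello world.",
            "sentence_start_offset": 0,
            "sentence_end_offset": 12,
            "tokens": [
                { "token_text": "Hello", "token_start_offset": 0, "token_end_offset": 5 },
                { "token_text": "world", "token_start_offset": 6, "token_end_offset": 11 },
                { "token_text": ".", "token_start_offset": 11, "token_end_offset": 12 }
            ]
        }
    ]
}
```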
Example Usage
OntoNotesParser
OntoNotesParser parser = new OntoNotesParser("wsj_0089.onf");
parser.writeToFileInJson("json_output.txt");
OntoNotesJsonReader
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
reader.parseIntoCuratorRecord();
IllinoisTokenizer
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
IllinoisTokenizer illinoisTokenizer = new IllinoisTokenizer(rawTexts);
StanfordTokenizer
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
StanfordTokenizer stanfordTokenizer = new StanfordTokenizer(rawTexts);
Evaluator
Evaluator evaluator = new Evaluator();
evaluator.evaluateIllinoisTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
evaluator.evaluateStanfordTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
About CogComp Tokenizer
- Developed by: Yiming Jiang
- Advised by: Professor Dan Roth
- Mentored by: Mark Sammons
Citation
If you use this code in your research, please cite it by providing the URL of this GitHub repository in the relevant publications. Thank you!