CogComp Tokenizer
July 2, 2017
This project is based on work by Yiming Jiang, who wrote the initial version and evaluated CCG and Stanford tokenizers against a corpus drawn from OntoNotes.
It has been modified in the following way: additional classes were
written to use cogcomp-core-utilities data structures. The underlying
tokenizer is still the LBJava SentenceSplitter. The evaluation code
has not been updated to use the new IllinoisTokenizer class (TODO).
How to Use It
The edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder
interface from cogcomp-core-utilities is used to create a TextAnnotation
object with SENTENCE and TOKEN views. A builder can also be provided
as a constructor argument to a CachingAnnotatorService that runs other
annotators in pipeline fashion.
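In outline, the flow looks like this (a pseudocode sketch; the exact cogcomp-core-utilities signatures are not verified here):

```
// pseudocode: build SENTENCE and TOKEN views over raw text
TextAnnotationBuilder builder = ...; // an implementation backed by the Illinois tokenizer
TextAnnotation ta = builder.createTextAnnotation(corpusId, docId, rawText);
// ta now exposes SENTENCE and TOKEN views; the same builder can be passed
// to a CachingAnnotatorService constructor to run further annotators in pipeline fashion.
```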
Setup
The StanfordTokenizer requires using a Java 8 runtime.
Original Project Structure
OntoNotes
OntoNotesParser:
input: a single ontonotes file
output: JSON file
OntoNotesJsonReader:
input: JSON file generated by OntoNotesParser
output: Curator Record Data Structure
Tokenizers
IllinoisTokenizer:
input: an array of sentence strings that together make up an article
output: Curator Record Data Structure
StanfordTokenizer:
input: an array of sentence strings that together make up an article
output: Curator Record Data Structure
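Both tokenizers map raw sentence strings to tokens with character offsets. To illustrate the kind of output involved (this is a minimal stand-in, not the actual LBJava or Stanford tokenization logic), a whitespace/punctuation tokenizer that records start and end offsets might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetTokenizer {
    // A token is its text plus [start, end) character offsets into the source string.
    public static class Token {
        public final String text;
        public final int start, end;
        Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
        @Override public String toString() { return text + "[" + start + "," + end + ")"; }
    }

    // Match runs of word characters, or single punctuation marks, keeping offsets.
    public static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(sentence);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note this naive splitter breaks "Mr." into two tokens, exactly the kind
        // of disagreement the Evaluator below is designed to measure.
        for (Token t : tokenize("Mr. Smith went to Washington.")) {
            System.out.println(t);
        }
    }
}
```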
Evaluator
Evaluator:
takes gold standard Record and a sample Record
Evaluation Criteria:
ON_GOLD_STANDARD_AGAINST_SAMPLE: iterate over each gold standard token and see if it's in sample tokens
ON_SAMPLE_AGAINST_GOLD_STANDARD: iterate over each sample token and see if it's in gold standard tokens
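Concretely, the two criteria are the same containment check run in opposite directions. A self-contained sketch of that comparison (the class and method names here are illustrative, not the actual Evaluator API, and real token matching may also compare offsets rather than strings alone):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenOverlap {
    // Fraction of tokens in `candidates` that also appear in `reference`.
    public static double containment(List<String> candidates, Set<String> reference) {
        if (candidates.isEmpty()) return 0.0;
        long hits = candidates.stream().filter(reference::contains).count();
        return (double) hits / candidates.size();
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("Mr", ".", "Smith", "went", "home");
        List<String> sample = Arrays.asList("Mr.", "Smith", "went", "home");

        // ON_GOLD_STANDARD_AGAINST_SAMPLE: each gold token checked against the sample tokens.
        double recallLike = containment(gold, new HashSet<>(sample));
        // ON_SAMPLE_AGAINST_GOLD_STANDARD: each sample token checked against the gold tokens.
        double precisionLike = containment(sample, new HashSet<>(gold));

        System.out.println(recallLike);     // 3 of 5 gold tokens appear in the sample
        System.out.println(precisionLike);  // 3 of 4 sample tokens appear in the gold set
    }
}
```

The two directions behave like recall and precision: a tokenizer that over-splits scores well in one direction and poorly in the other.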
JSON File Format
Each JSON file is generated by OntoNotesParser and corresponds to a single OntoNotes file.
This is the format of a JSON file:
{
    "sentences": [
        {
            "sentence_text": ...,
            "sentence_start_offset": ...,
            "sentence_end_offset": ...,
            "tokens": [
                {
                    "token_text": ...,
                    "token_start_offset": ...,
                    "token_end_offset": ...
                },
                ...
            ]
        },
        ...
    ]
}
Format:
"sentences" has an array of sentences, with sentence text and offsets.
"tokens" has an array of tokens with token text and offsets.
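For example, a file whose article consists of the single sentence "Hello world." would look like this (the values shown are illustrative):

```json
{
    "sentences": [
        {
            "sentence_text": "Hello world.",
            "sentence_start_offset": 0,
            "sentence_end_offset": 12,
            "tokens": [
                { "token_text": "Hello", "token_start_offset": 0, "token_end_offset": 5 },
                { "token_text": "world", "token_start_offset": 6, "token_end_offset": 11 },
                { "token_text": ".", "token_start_offset": 11, "token_end_offset": 12 }
            ]
        }
    ]
}
```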
Example Usage
OntoNotesParser
OntoNotesParser parser = new OntoNotesParser("wsj_0089.onf");
parser.writeToFileInJson("json_output.txt");
OntoNotesJsonReader
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
reader.parseIntoCuratorRecord();
IllinoisTokenizer
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
IllinoisTokenizer illinoisTokenizer = new IllinoisTokenizer(rawTexts);
StanfordTokenizer
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
StanfordTokenizer stanfordTokenizer = new StanfordTokenizer(rawTexts);
Evaluator
Evaluator evaluator = new Evaluator();
evaluator.evaluateIllinoisTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
evaluator.evaluateStanfordTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
About CogComp Tokenizer
- Developed by: Yiming Jiang
- Advised by: Professor Dan Roth
- Mentored by: Mark Sammons
Citation
If you use this code in your research, please cite it by providing the URL of this GitHub repository in the relevant publications. Thank you!