Annotation Guidelines
September 22, 2022
Guidelines for annotating the ground truth for the data.
These guidelines define how to annotate the ground truth for the Reddit data in order to evaluate extraction quality. Two kinds of annotations are used in evaluating the extractions: comment annotations and post annotations.
Comment Annotations
Comment annotations should be written in a JSON Lines file where each object has the following keys:
id- the ID attribute of the corresponding comment
label- a gold annotation for the label (one of "AUTHOR", "OTHER", "EVERYBODY", "NOBODY", or "INFO") expressed by the comment, or null if no label is expressed
implied- true if the label is implied by the view of the author and false if the label is somehow explicitly stated
spam- true if the comment is spam, false otherwise
The possible labels are:
AUTHOR- The author of the anecdote is in the wrong.
OTHER- The other person in the anecdote is in the wrong.
EVERYBODY- Everyone in the anecdote is in the wrong.
NOBODY- No one in the anecdote is in the wrong.
INFO- More information is required to make a judgment.
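As a rough illustration of the file format described above, the following sketch parses JSON Lines text and checks that each object has the four keys and a known label. The function name `parse_comment_annotations` and the example IDs are hypothetical and not part of the original project.

```python
import json

# Keys and label values taken from the guidelines above.
COMMENT_KEYS = {"id", "label", "implied", "spam"}
LABELS = {"AUTHOR", "OTHER", "EVERYBODY", "NOBODY", "INFO"}


def parse_comment_annotations(text):
    """Parse JSON Lines text into a list of comment annotation dicts.

    Raises ValueError if an object has unexpected keys or an unknown label.
    """
    annotations = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        annotation = json.loads(line)
        if not isinstance(annotation, dict) or set(annotation) != COMMENT_KEYS:
            raise ValueError(
                f"line {line_number}: expected keys {sorted(COMMENT_KEYS)}"
            )
        label = annotation["label"]
        if label is not None and label not in LABELS:
            raise ValueError(f"line {line_number}: unknown label {label!r}")
        annotations.append(annotation)
    return annotations


# Hypothetical records; the IDs are made up for illustration.
example = "\n".join([
    '{"id": "c1", "label": "AUTHOR", "implied": false, "spam": false}',
    '{"id": "c2", "label": null, "implied": null, "spam": true}',
])
parsed = parse_comment_annotations(example)
```

Note that JSON `null`, `true`, and `false` become Python `None`, `True`, and `False` after parsing.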
If the comment explicitly expresses a label either by its initialism or
some phrase corresponding to the initialism, then use that label for the
comment. In that case, mark implied as false and
spam as false.
If the comment expresses multiple labels with no clear winner or is
otherwise ambiguous, mark label as null, implied as null, and
spam as true.
If the comment expresses no labels explicitly but still has a viewpoint
that clearly expresses one of the labels, then use that label for the
comment. Mark implied as true and spam as false.
Finally, if the comment expresses no label (i.e., none of AUTHOR,
OTHER, NOBODY, EVERYBODY, or INFO), then mark label as null,
implied as null, and spam as true.
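The four rules above imply a simple consistency check between the fields. The sketch below is one possible way to express it, assuming the four rules are exhaustive (the guidelines do not say what to do with a spam comment that also states a label). The function name `check_comment_annotation` is hypothetical.

```python
def check_comment_annotation(annotation):
    """Return a list of rule violations for one comment annotation dict."""
    problems = []
    label = annotation["label"]
    implied = annotation["implied"]
    spam = annotation["spam"]
    if label is None:
        # Ambiguous or label-free comments: implied is null and spam is true.
        if implied is not None:
            problems.append("implied should be null when label is null")
        if spam is not True:
            problems.append("spam should be true when label is null")
    else:
        # Labeled comments: implied is a boolean and spam is false.
        if not isinstance(implied, bool):
            problems.append("implied should be true or false when a label is set")
        if spam is not False:
            problems.append("spam should be false when a label is set")
    return problems
```

An empty list means the annotation is consistent with the rules as read here; each string in a non-empty list names one field that disagrees with them.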
Post Annotations
Post annotations should be written in a JSON Lines file where each object has the following keys:
id- the ID attribute of the corresponding post
post_type- a gold annotation for the post's type
implied- true if the post type is not explicitly stated in the post title
spam- true if the post is spam, false otherwise
Possible post types are:
HISTORICAL- The author is asking "am I the a**hole?"
HYPOTHETICAL- The author is asking "would I be the a**hole?"
META- The post is about the subreddit itself.
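To make the file layout concrete, here is a minimal sketch that writes post annotations as JSON Lines, one object per line, and rejects post types outside the three listed above. The function name `to_json_lines` and the example IDs are hypothetical.

```python
import json

# Post types taken from the guidelines above.
POST_TYPES = {"HISTORICAL", "HYPOTHETICAL", "META"}


def to_json_lines(annotations):
    """Serialize post annotation dicts to JSON Lines text, one object per line."""
    lines = []
    for annotation in annotations:
        post_type = annotation["post_type"]
        if post_type is not None and post_type not in POST_TYPES:
            raise ValueError(f"unknown post type {post_type!r}")
        lines.append(json.dumps(annotation))
    return "".join(line + "\n" for line in lines)


# Hypothetical post annotations; the IDs are made up for illustration.
posts = [
    {"id": "p1", "post_type": "HISTORICAL", "implied": False, "spam": False},
    {"id": "p2", "post_type": None, "implied": None, "spam": True},
]
output = to_json_lines(posts)
```

Python `None`, `True`, and `False` are written as JSON `null`, `true`, and `false`.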
If the post type is explicitly stated in the post title, then mark
post_type as the stated post type, mark implied as false, and
spam as false, unless the post type is META in which case mark
spam as true. Additionally, if the post type is explicitly stated but
clearly wrong (such as using HISTORICAL for a HYPOTHETICAL post), then
use the true post type rather than the stated one.
If the post type is not explicitly stated in the post title, but
otherwise clear from the post, mark the appropriate post type, mark
implied as true and spam as false.
If the post cannot be categorized into one of the types above, mark the
post_type as null, implied as null, and spam as true.
If the post is something that should not be present in the dataset (for
example a deleted post), then mark spam as true.
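The post rules above can be read as a small decision procedure. The sketch below encodes one literal reading of them; the function `annotate_post` and its parameters are hypothetical. Two points are assumptions rather than something the guidelines state: the META exception is applied only when the type is stated in the title, and a post that should not be in the dataset keeps whatever type the annotator assigned while spam is set to true.

```python
def annotate_post(post_id, post_type, stated_in_title, exclude=False):
    """Build a post annotation dict from an annotator's judgment.

    post_type: the true type ("HISTORICAL", "HYPOTHETICAL", "META"), or None
        if the post fits none of them. If the stated type is clearly wrong,
        pass the true type.
    stated_in_title: whether the post title explicitly states the type.
    exclude: whether the post should not be in the dataset (e.g. deleted).
    """
    if post_type is None:
        # Uncategorizable posts: everything null except spam.
        return {"id": post_id, "post_type": None, "implied": None, "spam": True}
    implied = not stated_in_title
    # Assumption: explicitly stated META posts and excluded posts are spam.
    spam = exclude or (post_type == "META" and stated_in_title)
    return {"id": post_id, "post_type": post_type, "implied": implied, "spam": spam}
```

For example, `annotate_post("p1", "META", True)` yields an annotation with `implied` false and `spam` true, following the META exception in the first rule.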