Annotation Guidelines

September 22, 2022

Guidelines for annotating the ground truth for the data.

These guidelines define how to annotate the ground truth for the Reddit data so that the quality of the extractions can be evaluated. Two kinds of annotations are used in evaluating the extractions: comment annotations and post annotations.

Comment Annotations

Comment annotations should be written in a JSON Lines file where each object has the following keys:

id: the ID attribute of the corresponding comment
label: a gold annotation for the label (one of "AUTHOR", "OTHER", "EVERYBODY", "NOBODY", or "INFO") expressed by the comment, or null if no label is expressed
implied: true if the label is only implied by the viewpoint of the comment's author, false if the label is explicitly stated, or null if no label is expressed
spam: true if the comment is spam, false otherwise
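As an illustration of the file format, the sketch below serializes a few comment annotations to JSON Lines (one JSON object per line) and reads them back. The IDs `c1` to `c3` and the helper names `to_jsonl` and `from_jsonl` are hypothetical and chosen only for this example.

```python
import json

# Hypothetical annotations; the IDs are placeholders, not real comment IDs.
records = [
    {"id": "c1", "label": "AUTHOR", "implied": False, "spam": False},
    {"id": "c2", "label": "OTHER", "implied": True, "spam": False},
    {"id": "c3", "label": None, "implied": None, "spam": True},
]


def to_jsonl(items):
    """Serialize a list of dicts as JSON Lines: one object per line."""
    return "\n".join(json.dumps(item) for item in items)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_tripped = from_jsonl(jsonl_text)
```

Python's `None`, `True`, and `False` are written as JSON `null`, `true`, and `false`, so the third record is stored as `{"id": "c3", "label": null, "implied": null, "spam": true}`.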

The possible labels are:

AUTHOR: The author of the anecdote is in the wrong.
OTHER: The other person in the anecdote is in the wrong.
EVERYBODY: Everyone in the anecdote is in the wrong.
NOBODY: No one in the anecdote is in the wrong.
INFO: More information is required to make a judgment.

If the comment explicitly expresses a label, either by its initialism or by a phrase corresponding to the initialism, then use that label for the comment. Also mark implied as false and spam as false.

If the comment expresses multiple labels with no clear winner or is otherwise ambiguous, mark label as null, implied as null, and spam as true.

If the comment expresses no labels explicitly but still has a viewpoint that clearly expresses one of the labels, then use that label for the comment. Mark implied as true and spam as false.

Finally, if the comment expresses no label (i.e., none of AUTHOR, OTHER, NOBODY, EVERYBODY, or INFO), then mark label as null, implied as null, and spam as true.
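The four rules above allow only a few combinations of label, implied, and spam, so a finished annotation file can be checked mechanically. The following is a minimal sketch of such a check under one reading of the rules; the function name `check_comment_annotation` and the wording of the messages are assumptions made for this example, not part of the guidelines.

```python
LABELS = {"AUTHOR", "OTHER", "EVERYBODY", "NOBODY", "INFO"}


def check_comment_annotation(record):
    """Return a list of problems found in one comment annotation (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("id", "label", "implied", "spam")
        if key not in record
    ]
    if problems:
        return problems

    label, implied, spam = record["label"], record["implied"], record["spam"]
    if label is None:
        # No label, or an ambiguous one: implied is null and spam is true.
        if implied is not None:
            problems.append("implied should be null when label is null")
        if spam is not True:
            problems.append("spam should be true when label is null")
    elif label in LABELS:
        # Explicit or implied label: implied is a boolean and spam is false.
        if not isinstance(implied, bool):
            problems.append("implied should be true or false when a label is set")
        if spam is not False:
            problems.append("spam should be false when a label is set")
    else:
        problems.append(f"unknown label: {label!r}")
    return problems
```

For example, `check_comment_annotation({"id": "c1", "label": "INFO", "implied": None, "spam": False})` reports that implied should be true or false, because a labeled comment must be marked as either explicit or implied.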

Post Annotations

Post annotations should be written in a JSON Lines file where each object has the following keys:

id: the ID attribute of the corresponding post
post_type: a gold annotation for the post's type
implied: true if the post type is not explicitly stated in the post title, false if it is, or null if the post cannot be categorized
spam: true if the post is spam, false otherwise

Possible post types are:

HISTORICAL: The author is asking "am I the a**hole?"
HYPOTHETICAL: The author is asking "would I be the a**hole?"
META: The post is about the subreddit itself.

If the post type is explicitly stated in the post title, then mark post_type as the stated post type, implied as false, and spam as false. The one exception is META: for a META post, mark spam as true. Additionally, if the post type is explicitly stated but clearly wrong (such as a HYPOTHETICAL post that is labeled HISTORICAL), then use the true post type rather than the stated one.

If the post type is not explicitly stated in the post title but is otherwise clear from the post, mark post_type as the appropriate type, implied as true, and spam as false.

If the post cannot be categorized into one of the types above, mark the post_type as null, implied as null, and spam as true.

If the post is something that should not be present in the dataset (for example a deleted post), then mark spam as true.
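The post rules can be checked in a similar way. The sketch below is one possible consistency check, not an official validator; the function name `check_post_annotation` and its messages are hypothetical. Because the guidelines do not say what post_type and implied should be for a post that should not be in the dataset, the check deliberately leaves implied unconstrained whenever spam is true.

```python
POST_TYPES = {"HISTORICAL", "HYPOTHETICAL", "META"}


def check_post_annotation(record):
    """Return a list of problems found in one post annotation (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("id", "post_type", "implied", "spam")
        if key not in record
    ]
    if problems:
        return problems

    post_type, implied, spam = record["post_type"], record["implied"], record["spam"]
    if not isinstance(spam, bool):
        return ["spam should be true or false"]

    if post_type is None:
        # Uncategorizable posts are always marked as spam.
        if not spam:
            problems.append("spam should be true when post_type is null")
    elif post_type not in POST_TYPES:
        problems.append(f"unknown post_type: {post_type!r}")
    else:
        if post_type == "META" and not spam:
            problems.append("META posts should have spam set to true")
        # Assumption: implied is only required to be a boolean for non-spam posts.
        if not spam and not isinstance(implied, bool):
            problems.append("implied should be true or false for a non-spam post")
    return problems
```

A stricter project could also require implied to be null whenever post_type is null; that choice is left open here because the deleted-post rule only specifies spam.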