RuSentRel 1.1

October 1, 2023

📓 Update 01 October 2023: this collection is now available in arekit-ss for quick sampling of contexts with most subject-object relation mentions, using a single script that exports to JSONL/CSV/SQLite with optional language transfer 🔥 [Learn more ...]

Release Notes:

  • The list of synonyms has been expanded; it now covers all named entities extracted in the *.ann files;
  • A collection reader is now provided.

Version 1.1 of the RuSentRel corpus [paper] consists of analytical articles from the Internet portal inosmi.ru. These are texts in the domain of international politics, obtained from authoritative foreign sources and translated into Russian. The collected articles contain both the author's opinion on the subject matter of the article and a large number of references between the participants of the described situations. In total, 73 large analytical texts were labeled with about 2000 relations.

The texts were processed by an automatic named entity (NE) recognizer based on the CRF method [paper]. NEs were categorized into four classes: Persons, Organizations, Places, and Geopolitical Entities (states, and capitals used as states). The automatic labeling contains a few errors that have not yet been corrected. Preliminary analysis showed that the F-measure for determining correct entity boundaries exceeds 95%. Recognized NEs are stored in *.ann files.

For a detailed description, please see the References section.

For model application, please refer to the following repositories:

Collection Reader

The reader folder contains a collection reader (source file parsers) written in Python 3.6.

Please refer to read.py, which provides an example of how this collection can be parsed and read.
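For orientation only, here is a minimal sketch of reading entity annotations. It assumes the *.ann files follow the BRAT standoff layout (`ID<TAB>TYPE START END<TAB>TEXT`), which this README does not state. The `Entity` class, the `parse_ann_lines` function, and the sample lines are hypothetical and are not part of the provided reader; read.py remains the reference.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entity:
    """One text-bound annotation (hypothetical container, not the reader's class)."""
    id: str
    type: str
    start: int
    end: int
    text: str


def parse_ann_lines(lines: Iterable[str]) -> List[Entity]:
    """Parse BRAT-style 'T' lines; the layout is an assumption about *.ann files."""
    entities = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("T"):
            continue  # skip anything that is not a text-bound annotation
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        ent_id, span, text = parts
        span_parts = span.split(" ")
        if len(span_parts) != 3:
            continue  # skip discontinuous or malformed spans
        ent_type, start, end = span_parts
        entities.append(Entity(ent_id, ent_type, int(start), int(end), text))
    return entities


# Made-up sample lines, not taken from the corpus.
sample_lines = ["T1\tPER 0 5\tAlice", "T2\tORG 10 14\tAcme", "# a comment"]
sample_entities = parse_ann_lines(sample_lines)
```

With a real file, the same function could be called on an open file handle, for example `parse_ann_lines(open(path, encoding="utf-8"))`, where `path` and the encoding are again assumptions.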

Parameters

| Parameter | Training collection | Test collection |
|---|---|---|
| Number of documents | 44 | 29 |
| Sentences (avg./doc.) | 74.5 | 137 |
| NE (avg./doc.) | 194 | 300 |
| Unique NE (avg./doc.) | 33.3 | 59.9 |
| Positive pairs of NE (avg./doc.) | 6.23 | 14.7 |
| Negative pairs of NE (avg./doc.) | 9.33 | 15.6 |
| Share of attitudes expressed in a single sentence | 76.5% | 73% |

Statistics for the whole Collection:

| Parameter | Collection |
|---|---|
| Avg. dist. between NE within a sentence, in words | 10.2 |
| Human labeling agreement (F1(P, N)) | 0.55 |
| Contradiction (Acc.) | 0.01 |
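The README does not define F1(P, N). A common reading, assumed here, is the average of the F1 scores of the positive and negative classes, with one annotator's labels treated as the reference. The sketch below illustrates that reading; the label names (`"pos"`, `"neg"`, `"neu"`), the function names, and the toy data are hypothetical.

```python
from typing import Sequence


def f1_for_class(reference: Sequence[str], other: Sequence[str], label: str) -> float:
    """F1 of a single class, comparing two aligned label sequences."""
    pairs = list(zip(reference, other))
    tp = sum(1 for r, o in pairs if r == label and o == label)
    fp = sum(1 for r, o in pairs if r != label and o == label)
    fn = sum(1 for r, o in pairs if r == label and o != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_pn(reference: Sequence[str], other: Sequence[str]) -> float:
    """Average F1 over the positive and negative classes (assumed meaning of F1(P, N))."""
    return (f1_for_class(reference, other, "pos") + f1_for_class(reference, other, "neg")) / 2


# Toy labels from two hypothetical annotators, not corpus data.
annotator_a = ["pos", "pos", "neg", "neu", "neg"]
annotator_b = ["pos", "neu", "neg", "neg", "neg"]
agreement = f1_pn(annotator_a, annotator_b)
```

Under this reading the neutral class only enters the score through false positives and false negatives of the two sentiment classes.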

We treat a context as expressing sentiment for a pair of entities when a related sentiment attitude can be found.

Separately for the train and test collections, we group attitudes by the number of such contexts. The resulting statistics for the first eight group sizes are presented in the table below.

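| Collection | Total | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| train-sent | 467 | 47% | 15% | 4.4% | 4.3% | 2.2% | 0.9% | 0.8% | 1.0% |
| test-sent | 669 | 47% | 13% | 5.0% | 4.2% | 2.4% | 1.0% | 1.1% | 1.3% |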
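The grouping behind this table can be sketched as follows, assuming each context is recorded as a `(document, subject, object)` triple and that an attitude is identified by that triple. The function name `context_count_shares` and the toy data are hypothetical; this is an illustration, not the script used to produce the numbers above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Attitude = Tuple[str, str, str]  # (document id, subject, object), an assumed key


def context_count_shares(contexts: Iterable[Attitude], max_group: int = 8) -> Dict[int, float]:
    """Share of attitudes that occur in exactly k contexts, for k = 1..max_group."""
    contexts_per_attitude = Counter(contexts)  # attitude -> number of contexts
    total = len(contexts_per_attitude)
    if total == 0:
        return {k: 0.0 for k in range(1, max_group + 1)}
    group_sizes = Counter(contexts_per_attitude.values())  # k -> number of attitudes
    return {k: group_sizes.get(k, 0) / total for k in range(1, max_group + 1)}


# Toy contexts, one entry per sentence-level mention of an attitude.
sample_contexts = [
    ("d1", "A", "B"), ("d1", "A", "B"),
    ("d1", "A", "C"),
    ("d2", "B", "C"), ("d2", "B", "C"), ("d2", "B", "C"),
    ("d2", "C", "A"),
]
shares = context_count_shares(sample_contexts)
```

In the toy data, two of the four attitudes have a single context, so `shares[1]` is 0.5; groups beyond `max_group` are simply not reported, which is why a row of the table need not sum to 100%.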

Single-context attitudes are the largest group in both the train and test collections. However, they account for only 47%, about half of all attitudes that occur. Given this, it is important to take into account the labels of several contexts when labeling attitudes.
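One simple way to combine the labels of several contexts is a majority vote over the non-neutral context labels. This is a swapped-in illustration, not a method described in this README; the label names and the `aggregate_labels` function are hypothetical.

```python
from collections import Counter
from typing import Sequence


def aggregate_labels(context_labels: Sequence[str], default: str = "neu") -> str:
    """Majority vote over non-neutral context labels; ties and empty input give `default`."""
    votes = Counter(label for label in context_labels if label != default)
    if not votes:
        return default
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default  # no clear majority between the top two labels
    return ranked[0][0]


# Toy context labels for a single attitude.
attitude_label = aggregate_labels(["pos", "neu", "pos", "neg"])
```

Falling back to the neutral label on ties is a design choice made for this sketch; other tie-breaking rules are equally plausible.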

References

@article{loukachevitch2018extracting,
    Author = {Loukachevitch, N. and Rusnachenko, N.},
    Title = {Extracting Sentiment Attitudes from Analytical Texts},
    Journal = {In Proceedings of International conference Dialog-2018},
    Year = {2018}
}