input_data.md
January 25, 2023
Input samples format
Input data may be provided in `.jsonl` format ([example]) or in `.csv` format ([example]).
The input represents a list (or rows) of samples.
Every sample contains a text and an opinion mentioned in it, i.e. a source->target relation.
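As a rough illustration of the two containers, the sketch below reads samples from `.jsonl` and `.csv` streams using only the standard library. The helper names `read_jsonl` and `read_csv` are hypothetical, and the comma delimiter for the CSV variant is an assumption; the actual files may use a different separator.

```python
import csv
import io
import json


def read_jsonl(stream):
    """Read one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def read_csv(stream, delimiter=","):
    """Read rows as dictionaries keyed by the header line.

    The delimiter is an assumption; adjust it to match the real files.
    """
    return list(csv.DictReader(stream, delimiter=delimiter))


# In-memory stand-ins for real files, with illustrative content.
jsonl_data = io.StringIO('{"id": 0, "label": 1, "text": ["Alice", "criticized", "Bob"]}\n')
csv_data = io.StringIO("id,label,text\n0,1,Alice criticized Bob\n")

jsonl_rows = read_jsonl(jsonl_data)
csv_rows = read_csv(csv_data)
```

Note that `csv.DictReader` returns every value as a string, so numeric fields read from CSV need an explicit conversion, while JSON keeps integers and lists as they are.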
Each sample contains the following mandatory parameters:
- `id` (type: `uint`) -- sample identifier;
- `label` (type: `int`) -- for training only, denotes a class;
  - value in range `[0, c]`, where `c` corresponds to the number of classes.
- `text` (type: `str` or `list`) -- string of terms, separated by ` ` (whitespace), or list of terms in case of the `jsonl` format;
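The mandatory parameters above could be checked with something like the following sketch. The function name `normalize_sample` and the `classes_count` argument are hypothetical and not part of the project; the range check simply mirrors the `[0, c]` range stated above.

```python
def normalize_sample(sample, classes_count=None):
    """Check the mandatory fields of one sample and return them in a uniform shape."""
    # `id` is an unsigned integer; int() also accepts the string form read from CSV.
    sample_id = int(sample["id"])
    if sample_id < 0:
        raise ValueError("id must be a non-negative integer")

    # `text` is either a whitespace-separated string or a list of terms.
    text = sample["text"]
    terms = text.split(" ") if isinstance(text, str) else list(text)

    # `label` is only present for training data.
    label = sample.get("label")
    if label in (None, ""):
        label = None
    else:
        label = int(label)
        if classes_count is not None and not 0 <= label <= classes_count:
            raise ValueError("label is outside of the range [0, c]")

    return {"id": sample_id, "label": label, "terms": terms}


result = normalize_sample({"id": "3", "label": "1", "text": "Alice criticized Bob"},
                          classes_count=2)
```

Splitting a string `text` on a single space gives the same term list that a `jsonl` sample would carry directly, so later index-based fields can be interpreted uniformly.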
Optional parameters:
- `s_ind` (type: `int`) -- index of the source term in the `text` string/list;
- `t_ind` (type: `int`) -- index of the target term in the `text` string/list;
- `opinion_id` (type: `uint`) -- used for grouping opinions, denoting the index in the group;
- `entities` (type: `str`) -- comma-separated term indices which correspond to entities, in order of their appearance in the text;
- `frames` (type: `str`) -- comma-separated term indices which correspond to connotation frames, in order of their appearance in the text; important for sentiment-classification related tasks;
- `frame_connots_uint` (type: `str`) -- comma-separated connotation scores, each taken from the three-value `int` scale `{-1, 0, 1}`;
- `syn_subjs` (type: `str`) -- comma-separated indices, synonymous to the source `s_ind`;
- `syn_objs` (type: `str`) -- comma-separated indices, synonymous to the target `t_ind`;
- `pos_tags` (type: `str`) -- comma-separated part-of-speech tags, with length exactly the same as the number of terms in `text`;
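To make the comma-separated optional fields more concrete, here is a minimal sketch that parses and sanity-checks them. The helpers `parse_int_list` and `check_optional_fields` are hypothetical, the indices are assumed to be zero-based, and the part-of-speech tag names in the example are purely illustrative.

```python
def parse_int_list(value):
    """Turn a comma-separated string such as "0,2" into a list of ints."""
    if value is None or value == "":
        return []
    return [int(item) for item in value.split(",")]


def check_optional_fields(sample, terms_count):
    """Raise ValueError if an optional field contradicts the text length."""
    # Single term indices (assumed zero-based).
    for key in ("s_ind", "t_ind"):
        if key in sample and not 0 <= int(sample[key]) < terms_count:
            raise ValueError(key + " is out of range")

    # Comma-separated term indices.
    for key in ("entities", "frames", "syn_subjs", "syn_objs"):
        for index in parse_int_list(sample.get(key)):
            if not 0 <= index < terms_count:
                raise ValueError(key + " contains an index out of range")

    # Connotation scores are limited to the set {-1, 0, 1}.
    for score in parse_int_list(sample.get("frame_connots_uint")):
        if score not in (-1, 0, 1):
            raise ValueError("frame_connots_uint contains an unexpected score")

    # One part-of-speech tag per term.
    pos_tags = sample.get("pos_tags")
    if pos_tags and len(pos_tags.split(",")) != terms_count:
        raise ValueError("pos_tags length differs from the number of terms")

    return True


sample = {
    "text": ["Alice", "criticized", "Bob", "yesterday"],
    "s_ind": 0,
    "t_ind": 2,
    "entities": "0,2",
    "frames": "1",
    "frame_connots_uint": "-1",
    "pos_tags": "NOUN,VERB,NOUN,ADV",
}
ok = check_optional_fields(sample, len(sample["text"]))
```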
Embedding details
We support the `model.txt` format, which contains:
- first row -- shape of the embedding matrix;
- every following row -- a word and its vector.
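The layout described above can be read with a few lines of plain Python. This is only a sketch: it assumes the first row holds two integers (number of rows and vector size) and that values are separated by single spaces, as in the common word2vec text format. The function name `read_model_txt` is hypothetical.

```python
import io


def read_model_txt(stream):
    """Read a text embedding: a shape header followed by `word v1 ... vN` rows."""
    # Assumed header layout: "<rows> <dim>".
    rows, dim = (int(value) for value in stream.readline().split())

    words, vectors = [], []
    for line in stream:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so that the last `dim` items are the vector.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError("row does not match the declared vector size")
        words.append(parts[0])
        vectors.append([float(value) for value in parts[1:]])

    if len(words) != rows:
        raise ValueError("number of rows does not match the header")
    return words, vectors


# In-memory stand-in for a real model.txt file.
model_txt = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 1.0 2.0 3.0\n")
words, vectors = read_model_txt(model_txt)
```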
Embeddings can be obtained from the NLPL repository.
- See how we do this in the following downloading script
Text-based embeddings are then converted into `vocab.txt` and an `embedding.npz` matrix.
[see code implementation]
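For orientation, a conversion of this kind might look like the sketch below. It is not the project's implementation: the one-word-per-line vocabulary layout, the `embedding` key inside the `.npz` archive, and the function name `save_vocab_and_matrix` are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def save_vocab_and_matrix(words, vectors, target_dir):
    """Write words to vocab.txt and vectors to embedding.npz in target_dir."""
    vocab_path = os.path.join(target_dir, "vocab.txt")
    matrix_path = os.path.join(target_dir, "embedding.npz")

    # Assumed vocabulary layout: one word per line, row order preserved.
    with open(vocab_path, "w", encoding="utf-8") as vocab_file:
        for word in words:
            vocab_file.write(word + "\n")

    # The archive key "embedding" is a hypothetical choice.
    np.savez(matrix_path, embedding=np.asarray(vectors, dtype=np.float32))
    return vocab_path, matrix_path


with tempfile.TemporaryDirectory() as target_dir:
    vocab_path, matrix_path = save_vocab_and_matrix(
        ["cat", "dog"], [[0.1, 0.2], [0.3, 0.4]], target_dir)

    # Load both files back to confirm that rows and words stay aligned.
    with np.load(matrix_path) as data:
        restored = data["embedding"]
    with open(vocab_path, encoding="utf-8") as vocab_file:
        vocab = vocab_file.read().splitlines()
```

Keeping the vocabulary order identical to the matrix row order means that the line number of a word in `vocab.txt` can serve as its row index in the matrix.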