How to add language information of you dataset?

February 24, 2022 ยท View on GitHub

Take the ag_news (English), for example,

we just need to assign value (List[str]) to the variable languages in the _info(self) function.

  • where str represents the ISO code of a language, which you can obtain from this look-up table. If the dataset involves multiple languages, add all of them. For example, for the Chinese-English machine translation dataset, we have languages = ['en','zh']
    def _info(self):
        return datalabs.DatasetInfo(
            description=_DESCRIPTION,
            features=datalabs.Features(
                {
                    "text": datalabs.Value("string"),
                    "label": datalabs.features.ClassLabel(names=["World", "Sports", "Business", "Science and Technology"]),
                }
            ),
            homepage="http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html",
            citation=_CITATION,
            languages=["en"],
            task_templates=[TextClassification(text_column="text", label_column="label")],
        )

You can verify if the language information has been successfully added by:

from datalabs import load_dataset

dataset = load_dataset("ag_news")
print(dataset['test']._info.languages)

where dataset['test']._info represents all metadata information of the test split.