Add New Datasets into the DataLab Web Platform

April 11, 2022

DataLab provides an API which users can use to upload their own datasets into the DataLab web platform (privately or publicly) so that:

  • deeper analysis (e.g., of bias and artifacts) can be performed on the dataset interactively
  • you can compare your dataset with similar existing ones to understand its characteristics
  • more researchers will learn about your dataset

This involves the following steps:

(1) Write a data loader script for your new dataset. See the guide on how to add new datasets to the SDK.

(2) Create an account on the DataLab web platform.

(3) Call the client function to add your dataset

# Assumes this Python script is located at DataLab/client/
from client import Client

# The feature function depends on your task; you can also customize it by
# modifying the script DataLab/client/example_funcs.py
from example_funcs import text_classification_func

client = Client(
    user_name="xxx",  # the user name of your account: https://datalab.nlpedia.ai/user
    password="yyy",  # the password of your account: https://datalab.nlpedia.ai/user
    dataset_name_db="zzz",  # any name you like
    dataset_name_sdk="../datasets/ttt",  # path to your data loader script
    sub_dataset_name_sdk="default",
    feature_func=text_classification_func,  # example functions are provided in example_funcs.py; you can customize them
)
# If this runs successfully, you can find the dataset here:
# https://datalab.nlpedia.ai/datasets_explore/user_dataset
client.add_dataset_from_sdk()

where

  • dataset_name_db: denotes the name of the dataset to be stored in the database. You can specify any name based on your preference
  • dataset_name_sdk: denotes your dataset's data loader script path.
  • feature_func: is used to calculate sample-level (e.g., text length) and dataset-level features (e.g., the average of text length), which are usually task-dependent.
    • DataLab provides templates for four types of tasks. You can either use them directly or develop new ones based on them.
    • Feature calculation usually takes some time.
    • We also provide a template script for adding your dataset to the web platform; feel free to adapt it to your needs.
    • If your dataset is too large (>200M), please contact us.
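
To make the distinction between sample-level and dataset-level features concrete, here is a minimal sketch in plain Python. It is not DataLab's actual feature-function interface: the function names, the `text` field, and the whitespace tokenization are all assumptions made for illustration. Check example_funcs.py for the signature that Client really expects.

```python
from statistics import mean


def sample_level_features(sample: dict) -> dict:
    """Hypothetical helper: compute features for a single sample.

    Assumes each sample is a dict with a "text" field and measures
    text length as the number of whitespace-separated tokens.
    """
    return {"text_length": len(sample["text"].split())}


def dataset_level_features(samples: list) -> dict:
    """Hypothetical helper: aggregate sample-level features over a dataset."""
    per_sample = [sample_level_features(s) for s in samples]
    return {"avg_text_length": mean(f["text_length"] for f in per_sample)}


# Toy data, invented for this example only.
samples = [
    {"text": "a great movie", "label": "positive"},
    {"text": "too long and dull", "label": "negative"},
]

print(sample_level_features(samples[0]))
print(dataset_level_features(samples))
```

Sample-level features are computed independently for each example, while dataset-level features are aggregates over them, which is why feature calculation can take a while on larger datasets.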

(4) View your dataset on the DataLab web platform

Once you have successfully completed the steps above, you can find the dataset in your private space on the DataLab web platform.

(5) Make your dataset public (optional)