Learning about Configs

January 11, 2023

AdaSeq uses a configuration file to control model assembly, training, and evaluation. The configuration file can be written in yaml, json, or jsonline format.

1. Configuration File Organization

Let's take resume.yaml as an example. A configuration file usually consists of the following fields:

experiment: ...
task: ...
dataset: ...
preprocessor: ...
data_collator: ...
model: ...
train: ...
evaluation: ...

2. Introduction to Global Parameters

Note: a default of / means the parameter is required.

2.1 experiment

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| exp_dir | experiment directory | str | experiments |
| exp_name | experiment name; all outputs will be saved to ./${exp_dir}/${exp_name}/${datetime}/ | str | unknown |
| seed | random seed | int | 42 |
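
As a rough sketch of how these fields fit together, the fragment below spells out the two defaults and sets an illustrative exp_name (the value resume is an assumption, chosen to match the resume.yaml example). With these values, outputs would land under ./experiments/resume/${datetime}/, following the path pattern given for exp_name.

```yaml
experiment:
  exp_dir: experiments   # default value, written out explicitly
  exp_name: resume       # illustrative name; the default is "unknown"
  seed: 42               # default random seed
```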

2.2 task

task supports the following values (see metainfo):

  • word-segmentation
  • part-of-speech
  • named-entity-recognition
  • relation-extraction
  • entity-typing
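
For illustration, task is assumed here to be a plain top-level string field holding one of the values listed above:

```yaml
# One of the supported task names listed above
task: named-entity-recognition
```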

2.3 dataset

The dataset parameters can be combined in several ways; please refer to Customizing Dataset for details.

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| task | task of the dataset | str | None |
| name | modelscope dataset name, for example damo/resume_ner | str | None |
| path | huggingface dataset name, for example conll2003 | str | None |
| data_file | data files; can be a URL, a local directory, or an archive, or a dict containing train, valid, and test | str/dict | None |
| data_type | specifies the data loading method | str | None |
| transform | dataset post-processing, usually containing the keys name, key, and scheme | dict | None |
| labels | label set; can be a list (labels: ['O', 'B-ORG', ...]), a file or URL (labels: PATH_OR_URL), or a function counting labels from the dataset | str/list/dict | None |
| access_token | used to access private repos on modelscope or huggingface | str | None |
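
The fragment below is a minimal sketch of one possible combination: local files given as a dict, plus an explicit label list. The file paths and the label list are hypothetical, and the data_type value conll is an assumption; Customizing Dataset remains the authoritative guide to which combinations are valid.

```yaml
dataset:
  data_file:                  # dict form with train / valid / test entries
    train: ./data/train.txt   # hypothetical local paths
    valid: ./data/dev.txt
    test: ./data/test.txt
  data_type: conll            # assumed loader name, not confirmed by the table above
  labels: ['O', 'B-ORG', 'I-ORG']   # illustrative explicit label list

# Alternative sketch: load a hosted dataset by name instead of data_file.
# dataset:
#   name: damo/resume_ner     # modelscope dataset name from the table above
```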

2.4 preprocessor

| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| type | preprocessor type | str | / |
| model_dir | tokenizer name or directory | str | / |
| is_word2vec | whether to use Lookup Table | bool | False |
| tokenizer_kwargs | other parameters for tokenizer | dict | None |
| max_length | maximum sentence length (subtoken-level) | int | 512 |
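
A minimal sketch of a preprocessor block follows. The type name sequence-labeling-preprocessor and the tokenizer bert-base-cased are assumptions used for illustration; the registered preprocessor types should be checked against metainfo.

```yaml
preprocessor:
  type: sequence-labeling-preprocessor   # assumed type name; verify in metainfo
  model_dir: bert-base-cased             # illustrative tokenizer name
  max_length: 150                        # overrides the default of 512 subtokens
```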

2.5 data_collator

data_collator supports the following values (see metainfo):

  • DataCollatorWithPadding
  • SequenceLabelingDataCollatorWithPadding
  • SpanExtractionDataCollatorWithPadding
  • MultiLabelSpanTypingDataCollatorWithPadding
  • MultiLabelConcatTypingDataCollatorWithPadding
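
For illustration, data_collator is assumed here to be a plain top-level string field. Pairing SequenceLabelingDataCollatorWithPadding with a sequence labeling model is also an assumption made for this sketch.

```yaml
# One of the supported collator names listed above
data_collator: SequenceLabelingDataCollatorWithPadding
```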

2.6 model

| Parameter | Child-Parameter | Description | Type | Default |
| --- | --- | --- | --- | --- |
| type | | model type | str | / |
| embedder | | used to embed input ids to vectors, usually a pretrained model | dict | None |
| | type | embedder type, optional when using a modelscope or huggingface model | str | None |
| | model_name_or_path | pretrained model name or path; both modelscope and huggingface models are supported | str | / |
| encoder | | encodes the sentence vectors, for example with an LSTM | dict | None |
| | type | encoder type | str | / |
| decoder | | not available, under construction | dict | None |
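
The nesting of parameters and child-parameters can be sketched as below. The model type sequence-labeling-model and the pretrained model bert-base-cased are assumptions for illustration. The optional encoder block is left out because its type-specific options are not listed here.

```yaml
model:
  type: sequence-labeling-model          # assumed type name; verify in metainfo
  embedder:
    # embedder.type is omitted: it is optional for modelscope or huggingface models
    model_name_or_path: bert-base-cased  # illustrative pretrained model
  # encoder: optional (default None); when present it needs at least a type
```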

2.7 train

| Parameter | Child-Parameter | Description | Type | Default |
| --- | --- | --- | --- | --- |
| trainer | | trainer type | str | None |
| max_epochs | | maximum number of training epochs | int | / |
| dataloader | | used to load data | dict | / |
| | batch_size_per_gpu | batch size per gpu | int | / |
| | workers_per_gpu | data loading workers per gpu | int | 0 |
| optimizer | | optimizer | dict | None |
| | type | optimizer type | str | / |
| | lr | learning rate for all parameters except those in specific param_groups | float | / |
| | options | options used in the optimizer, for example grad_clip: max_norm: 2.0 | dict | None |
| | param_groups | parameter groups that can have different learning rates | list | None |
| | └ regex | regular expression that selects the parameter group | str | / |
| | └ lr | learning rate for the selected parameter group | float | / |
| lr_scheduler | | used to adjust the learning rate during training | dict | None |
| | type | any lr_scheduler from pytorch is supported (check that your pytorch version includes it) | str | / |
| | options | options used in the lr_scheduler | dict | None |
| hooks | | hooks (also called callbacks); see the ModelScope documentation | list | None |
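
The following is a sketch of a train block that exercises most of the fields above. All numeric values are illustrative. AdamW and LinearLR are real pytorch class names, but whether they suit a given model is an assumption. The regex crf is a hypothetical pattern meant to pick out parameters whose names contain "crf".

```yaml
train:
  max_epochs: 20
  dataloader:
    batch_size_per_gpu: 16
    workers_per_gpu: 0          # default, written out explicitly
  optimizer:
    type: AdamW
    lr: 5.0e-5                  # applies to parameters not matched by any group below
    options:
      grad_clip:
        max_norm: 2.0           # gradient clipping, as in the options example above
    param_groups:
      - regex: crf              # hypothetical pattern selecting a parameter group
        lr: 5.0e-1              # separate learning rate for that group
  lr_scheduler:
    type: LinearLR              # scheduler arguments are not listed above, so none are set
```

Giving a group such as a CRF layer its own, larger learning rate is one common reason to use param_groups; the specific ratio shown here is only an example.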

2.8 evaluation

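| Parameter | Child-Parameter | Description | Type | Default |
| --- | --- | --- | --- | --- |
| dataloader | | used to load data | dict | / |
| | batch_size_per_gpu | batch size per gpu | int | / |
| | workers_per_gpu | data loading workers per gpu | int | 0 |
| metrics | | evaluation metrics | list | None |
| | type | metric type | str | / |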
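
A minimal evaluation block might look like the sketch below. The batch size is illustrative and the metric name ner-metric is an assumption; the registered metric types should be checked against metainfo.

```yaml
evaluation:
  dataloader:
    batch_size_per_gpu: 128   # illustrative value
  metrics:
    - type: ner-metric        # assumed metric name; verify in metainfo
```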