Modelforge Model

December 4, 2018

Model is the core concept in Modelforge. A model consists of:

| field | description | type | required? |
|---|---|---|---|
| uuid | Unique identifier | UUID string | yes |
| name | Type identifier | string | yes |
| series | Subtype identifier | string | yes |
| version | Version | semver-like list of 3 numbers or single number | yes |
| created_at | Date and time when model was generated | datetime string | yes |
| parent | Unique identifier of the previous version | UUID string | yes |
| description | Information about the model | string | yes |
| source | Download link or file path | string | yes |
| size | Size of the file | int | yes |
| license | License of the model | SPDX identifier or "Proprietary" string | yes |
| environment | Description of the computing environment used to create the model | | yes |
| dependencies | Other models on which our model depends | UUID strings mapped to Models | no |
| code | Example of model usage in Python | string | no |
| datasets | List of datasets used to generate the model | list of pairs [name, URL] | no |
| references | List of relevant resources | list of URLs | no |
| tags | List of categories for classification | list of strings | no |
| metrics | Achieved quality metrics | mapping from names to numbers | no |
| extra | Additional information which is not covered by any other fields | custom | no |

The "required?" column indicates whether the field always has a non-empty value. The table above defines the "metadata" in Modelforge. The data scheme of the actual model payload is referred to as the "internal format", and it is opaque. It can be any tree-like data structure containing strings, numbers, lists, subtrees and tensors.

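As a rough illustration of the metadata described above, the sketch below stores it in a plain Python dictionary and reports which required fields are missing or empty. The helper `missing_required`, the dictionary layout, and every value marked as hypothetical are assumptions made for this example; they are not taken from the Modelforge code.

```python
# Names of the required fields, copied from the table of metadata fields.
REQUIRED_FIELDS = (
    "uuid", "name", "series", "version", "created_at", "parent",
    "description", "source", "size", "license", "environment",
)


def missing_required(meta):
    """Return the required fields that are absent or empty in `meta`."""
    empty_values = (None, "", [], {})
    return [field for field in REQUIRED_FIELDS if meta.get(field) in empty_values]


example_meta = {
    "uuid": "dd6a841c-94e1-47f4-8029-b9aabb32505e",    # example UUID from this page
    "name": "docfreq",
    "series": "wiki-en-2018",
    "version": [1, 0, 0],
    "created_at": "2018-12-04 00:00:00",               # hypothetical value and format
    "parent": "00000000-0000-4000-8000-000000000000",  # hypothetical
    "description": "Document frequencies.",            # hypothetical
    "source": "/path/to/docfreq.model",                # hypothetical
    "size": 1024,                                      # hypothetical
    "license": "Proprietary",
    "environment": {"platform": "...", "python": "...", "packages": []},
}

print(missing_required(example_meta))        # no required field is missing
print(missing_required({"uuid": "abc"}))     # everything except "uuid"
```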
uuid

Each model has a globally unique identifier, which makes it possible to reference any model in the registry. Example: dd6a841c-94e1-47f4-8029-b9aabb32505e.
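
A minimal sketch of how such an identifier could be produced and checked with Python's standard `uuid` module. Using a random (version 4) UUID is an assumption of this example, and both helper names are hypothetical.

```python
import uuid


def new_model_uuid():
    """Generate a random (version 4) UUID in its canonical string form."""
    return str(uuid.uuid4())


def is_canonical_uuid(text):
    """Check that `text` is a UUID written as 8-4-4-4-12 hexadecimal digits."""
    try:
        return str(uuid.UUID(text)) == text.lower()
    except ValueError:
        return False


print(is_canonical_uuid("dd6a841c-94e1-47f4-8029-b9aabb32505e"))  # True
print(is_canonical_uuid("not-a-uuid"))                            # False
```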

name

Short name of the model family; the convention is dashed-lowercase. The name defines the type of the model, that is, its internal format. For example, the models which correspond to document frequencies (as in bag-of-words) are named "docfreq".

series

Short name of the model series; it follows the same convention as name. For example, the document frequencies calculated from an English Wikipedia dump in 2018 have the "wiki-en-2018" series.
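
The dashed-lowercase convention for name and series can be checked with a regular expression. The pattern below is only one possible reading of the convention: it assumes that digits are allowed (as in "wiki-en-2018") and that dashes may not appear at the ends or twice in a row. The function name is hypothetical.

```python
import re

# Lowercase letters and digits, grouped by single dashes (an assumed reading).
DASHED_LOWERCASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def follows_convention(text):
    """Return True if `text` looks like a dashed-lowercase identifier."""
    return DASHED_LOWERCASE.fullmatch(text) is not None


print(follows_convention("docfreq"))        # True
print(follows_convention("wiki-en-2018"))   # True
print(follows_convention("Wiki_EN"))        # False
```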

version

It is always a good idea to follow semver for versioning data:

  1. In case of a breaking change in the internal format, we increment the major part.
  2. In case of a serious quality improvement without breaking the internal format, we increment the minor part.
  3. In other cases we increment the patch part.

There is an alternative, "no-brainer" versioning scheme, which is followed by Chrome and Firefox: increment a single number.
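
Both schemes can be sketched in a few lines. The function names and the change labels ("breaking", "quality", "other") are hypothetical, and resetting the lower parts to zero on a major or minor bump follows the usual semver practice rather than anything stated on this page.

```python
def bump_semver(version, change):
    """Return the next [major, minor, patch] version for the given change kind."""
    major, minor, patch = version
    if change == "breaking":   # the internal format changed incompatibly
        return [major + 1, 0, 0]
    if change == "quality":    # serious quality improvement, same format
        return [major, minor + 1, 0]
    return [major, minor, patch + 1]  # any other change


def bump_single(version):
    """The single-number scheme: just increment the number."""
    return version + 1


print(bump_semver([1, 2, 3], "breaking"))  # [2, 0, 0]
print(bump_semver([1, 2, 3], "quality"))   # [1, 3, 0]
print(bump_semver([1, 2, 3], "other"))     # [1, 2, 4]
print(bump_single(7))                      # 8
```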

created_at

Date and time when the model was last saved on disk.

parent

Unique identifier (uuid) of the previous model. When a new version is issued, it points to the old one.
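
Because every version points to its predecessor, the history of a model can be recovered by following the parent links. The sketch below assumes a hypothetical in-memory registry, a dictionary that maps each uuid to its metadata, and stops as soon as a parent is not found in it. The short keys in the example stand in for real UUID strings.

```python
def lineage(registry, model_uuid):
    """List the uuids from `model_uuid` back to the oldest known ancestor."""
    chain = []
    seen = set()  # guards against accidental cycles in the parent links
    current = model_uuid
    while current in registry and current not in seen:
        seen.add(current)
        chain.append(current)
        current = registry[current].get("parent")
    return chain


registry = {
    "c": {"parent": "b"},
    "b": {"parent": "a"},
    "a": {"parent": ""},
}
print(lineage(registry, "c"))  # ['c', 'b', 'a']
```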

description

Markdown text which describes the model. It is a good idea to include the achieved quality metric values here. However, machine-readable structured values should be put in metrics.

source

Models are loaded either from disk or from a URL. This attribute contains the corresponding FS path or the download link.
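
One simple way to tell the two kinds of source apart is to look at the URL scheme. This is a sketch under the assumption that download links use http or https; the function name is hypothetical, and other schemes would need to be added if they are supported.

```python
from urllib.parse import urlparse


def is_download_link(source):
    """Return True for http(s) links and False for file system paths."""
    return urlparse(source).scheme in ("http", "https")


print(is_download_link("https://example.com/models/docfreq.model"))  # True
print(is_download_link("/path/to/docfreq.model"))                    # False
```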

size

Size of the model file.

license

Models should always have an explicit usage license. Modelforge supports the "Proprietary" value and the identifiers from the SPDX database.
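
A license check along these lines could look as follows. The set below holds only a handful of SPDX identifiers chosen for illustration; a real check would consult the full SPDX license list. The names `KNOWN_SPDX_IDS` and `is_supported_license` are hypothetical.

```python
# A small, deliberately incomplete sample of SPDX license identifiers.
KNOWN_SPDX_IDS = frozenset({
    "MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only", "CC-BY-4.0", "ODbL-1.0",
})


def is_supported_license(value):
    """Accept "Proprietary" or one of the known SPDX identifiers."""
    return value == "Proprietary" or value in KNOWN_SPDX_IDS


print(is_supported_license("Apache-2.0"))   # True
print(is_supported_license("Proprietary"))  # True
print(is_supported_license("my-license"))   # False
```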

environment

It is important to save as much information as possible about the programming environment used to generate the model. Modelforge records:

  • Running OS description, e.g. Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python interpreter version, e.g. 3.7.1 (default, Oct 22 2018, 11:21:55) [GCC 8.2.0]
  • Installed packages which were loaded while the model was being saved, and their versions.

The format is {"platform": "...", "python": "...", "packages": [["name", "version"],...]}
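
A dictionary in this format can be assembled with the standard library alone. The sketch below is not Modelforge's implementation: `collect_environment` is a hypothetical name, and detecting package versions through the `__version__` attribute of the loaded top-level modules is a heuristic chosen for this example, so packages that do not expose that attribute are skipped.

```python
import platform
import sys


def collect_environment():
    """Describe the current platform, interpreter and loaded packages."""
    packages = []
    for name, module in list(sys.modules.items()):
        if "." in name:
            continue  # keep top-level packages only
        version = getattr(module, "__version__", None)
        if isinstance(version, str):
            packages.append([name, version])
    return {
        "platform": platform.platform(),
        "python": sys.version,
        "packages": sorted(packages),
    }


env = collect_environment()
print(env["platform"])
print(env["python"])
print(len(env["packages"]), "loaded packages expose a version")
```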

dependencies

Nested list of metadata belonging to the upstream models. Listing a model in dependencies means that the dependent model cannot be used without it. This should not be confused with the data used to generate the model, which is listed in datasets.

code

Code example of how to load the model and use it.

datasets

List of the entities used to create the model. They can be real datasets or other models.

references

List of relevant links for the model. It augments the description.

tags

List of tags - categories for model classification.

metrics

Achieved quality metric values, in dictionary format.

extra

Any other information.