Modelforge Model

December 4, 2018

Model is the core concept in Modelforge. A model consists of:

field	description	type	required?
uuid	Unique identifier	UUID string	yes
name	Type identifier	string	yes
series	Subtype identifier	string	yes
version	Version	semver-like list of 3 numbers or single number	yes
created_at	Date and time when model was generated	datetime string	yes
parent	Unique identifier of the previous version	UUID string	yes
description	Information about the model	string	yes
source	Download link or file path	string	yes
size	Size of the file	int	yes
license	License of the model	SPDX identifier or "Proprietary" string	yes
environment	Description of the computing environment used to create the model	mapping	yes
dependencies	Other models on which our model depends	UUID strings mapped to Model-s	no
code	Example of model usage in Python	string	no
datasets	List of datasets used to generate the model	list of pairs [name, URL]	no
references	List of relevant resources	list of URLs	no
tags	List of categories for classification	list of strings	no
metrics	Achieved quality metrics	mapping from names to numbers	no
extra	Additional information which is not covered by any other fields	custom	no
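
As an illustration of the table above, here is a minimal sketch of a metadata record as a plain Python dict, together with a check of the required fields. The helper `missing_required` and all example values other than the uuid, name, series and environment strings quoted in this document are hypothetical; this is not Modelforge's own API.

```python
# Required metadata fields, as listed in the table above.
REQUIRED_FIELDS = (
    "uuid", "name", "series", "version", "created_at", "parent",
    "description", "source", "size", "license", "environment",
)


def missing_required(meta: dict) -> list:
    """Return the required fields that are absent or empty in `meta`."""
    return [f for f in REQUIRED_FIELDS if meta.get(f) in (None, "", [], {})]


# Hypothetical example record; only uuid, name, series and the
# environment strings come from this document.
example_meta = {
    "uuid": "dd6a841c-94e1-47f4-8029-b9aabb32505e",
    "name": "docfreq",
    "series": "wiki-en-2018",
    "version": [1, 0, 0],
    "created_at": "2018-12-04 12:00:00",
    "parent": "00000000-0000-4000-8000-000000000000",
    "description": "Document frequencies of English Wikipedia.",
    "source": "/tmp/docfreq-model",
    "size": 1024,
    "license": "ODbL-1.0",
    "environment": {
        "platform": "Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic",
        "python": "3.7.1 (default, Oct 22 2018, 11:21:55) [GCC 8.2.0]",
        "packages": [["numpy", "1.15.4"]],
    },
}

print(missing_required(example_meta))
```

An empty list means that every required field carries a value; optional fields such as tags or metrics are simply left out of the check.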

"Required" flag means whether the field always has a non-empty value. The table from above defines "metadata" in Modelforge. The data scheme of the actual payload of the model is referred to as the "internal format", and it is opaque. It can be any tree-like data structure with string, numbers, lists, subtrees and tensors inside.

uuid

Each model has a globally unique identifier, which makes it possible to reference any model in the registry. Example: dd6a841c-94e1-47f4-8029-b9aabb32505e.
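
A short sketch of how such identifiers could be generated and validated with Python's standard `uuid` module. Using random (version 4) UUIDs is an assumption here, and `is_valid_uuid` is a hypothetical helper rather than part of Modelforge.

```python
import uuid


def is_valid_uuid(value) -> bool:
    """Check that `value` is a UUID string in the canonical dashed form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


# Assumption: a fresh identifier is a random (version 4) UUID.
new_id = str(uuid.uuid4())
print(new_id, is_valid_uuid(new_id))
```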

name

Short name of the model family; the convention is dashed-lowercase. The name defines the type of the model, that is, its internal format. For example, the models which correspond to document frequencies (as in bag-of-words) are named "docfreq".

series

Short name of the model series; it follows the same convention as name. For example, the document frequencies calculated from an English Wikipedia dump in 2018 have the "wiki-en-2018" series.
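
The dashed-lowercase convention for name and series could be checked with a regular expression along these lines. The exact character set is an assumption (lowercase ASCII letters and digits, separated by single dashes), chosen so that both examples from this document pass; `follows_convention` is a hypothetical helper.

```python
import re

# Assumption: lowercase letters and digits, groups separated by single dashes.
DASHED_LOWERCASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def follows_convention(identifier: str) -> bool:
    """Return True if `identifier` is written in dashed-lowercase."""
    return DASHED_LOWERCASE.fullmatch(identifier) is not None


for candidate in ("docfreq", "wiki-en-2018", "Wiki_EN"):
    print(candidate, follows_convention(candidate))
```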

version

It is always a good idea to follow semver for versioning data:

In case of a breaking change in the internal format, we increment the major part.
In case of a serious quality improvement without breaking the internal format, we increment the minor part.
In other cases we increment the patch part.

There is an alternative, "no-brainer" versioning scheme, similar to the one followed by Chrome and Firefox: the version is a single number which is incremented with every release.
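
Both schemes can be sketched in one small function. The name `bump` and the `change` argument are hypothetical, and resetting the lower parts to zero after a major or minor increment is an assumption borrowed from regular semver; the text above does not state it.

```python
def bump(version, change="patch"):
    """Return the next version for a 3-number list or a single number."""
    if isinstance(version, int):
        # Single-number scheme: just increment it.
        return version + 1
    major, minor, patch = version
    if change == "major":   # breaking change in the internal format
        return [major + 1, 0, 0]
    if change == "minor":   # serious quality improvement, same format
        return [major, minor + 1, 0]
    if change == "patch":   # everything else
        return [major, minor, patch + 1]
    raise ValueError("unknown change type: %r" % (change,))


print(bump([1, 2, 3], "minor"))
print(bump(7))
```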

created_at

Date and time when the model was last saved on disk.

parent

Unique identifier (uuid) of the previous model. When a new version is issued, it points to the old one.
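
Following the parent links yields the full history of a model. The sketch below assumes a hypothetical `registry` dict which maps uuid strings to metadata dicts, and assumes that the chain ends at a model whose parent is empty or not present in the registry; the identifiers are shortened placeholders, not real UUIDs.

```python
def lineage(model_uuid, registry):
    """Return the uuids from the given model back to its oldest known ancestor."""
    chain = []
    seen = set()  # guards against accidental cycles
    current = model_uuid
    while current and current in registry and current not in seen:
        seen.add(current)
        chain.append(current)
        current = registry[current].get("parent")
    return chain


# Hypothetical registry with three versions of the same model.
registry = {
    "id-v1": {"version": [1, 0, 0], "parent": None},
    "id-v2": {"version": [1, 1, 0], "parent": "id-v1"},
    "id-v3": {"version": [2, 0, 0], "parent": "id-v2"},
}
print(lineage("id-v3", registry))
```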

description

Markdown text which describes the model. It is a good idea to include the achieved quality metric values here. However, machine-readable structured values should be put in metrics.

source

Models are loaded either from disk or from a URL. This attribute contains the corresponding FS path or the download link.
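
One way to tell the two kinds of source apart is to look at the URL scheme. This is only a sketch: `is_remote` is a hypothetical helper, and treating just http and https as download links is an assumption.

```python
from urllib.parse import urlparse


def is_remote(source: str) -> bool:
    """Return True if `source` looks like a download link, False for a file path."""
    # Assumption: only http(s) links count as remote sources.
    return urlparse(source).scheme in ("http", "https")


print(is_remote("https://example.com/models/docfreq"))
print(is_remote("/home/user/models/docfreq"))
```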

size

Size of the model file.

license

Models should always have an explicit usage license. Modelforge supports the "Proprietary" value and the identifiers from the SPDX license list.
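
The rule can be sketched as a simple membership test. The set below is a small illustrative subset of real SPDX identifiers, not the full list, and `is_supported_license` is a hypothetical helper; a real check would consult the complete SPDX data.

```python
# Illustrative subset of SPDX identifiers; the real list is much longer.
KNOWN_SPDX = frozenset({
    "MIT", "Apache-2.0", "GPL-3.0-only", "ODbL-1.0", "CC-BY-4.0",
})


def is_supported_license(value: str) -> bool:
    """Accept the "Proprietary" marker or a known SPDX identifier."""
    return value == "Proprietary" or value in KNOWN_SPDX


print(is_supported_license("Apache-2.0"))
print(is_supported_license("do whatever you want"))
```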

environment

It is important to save as much information as possible about the programming environment used to generate the model. Modelforge records:

Running OS description, e.g. Linux-4.15.0-39-generic-x86_64-with-Ubuntu-18.04-bionic
Python interpreter version, e.g. 3.7.1 (default, Oct 22 2018, 11:21:55) [GCC 8.2.0]
Installed packages which were loaded while the model was being saved, and their versions.

The format is {"platform": "...", "python": "...", "packages": [["name", "version"],...]}
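
A record of this shape could be collected roughly as follows. This is a sketch, not Modelforge's actual code: `collect_environment` is a hypothetical name, `importlib.metadata` requires Python 3.8 or newer, and looking up the version of each loaded top-level module by its import name is an approximation, since import names and distribution names do not always match.

```python
import platform
import sys
from importlib import metadata


def collect_environment() -> dict:
    """Describe the OS, the interpreter and the loaded packages."""
    packages = []
    # Top-level names of the modules which are loaded right now.
    loaded = sorted({name.split(".")[0] for name in list(sys.modules)})
    for name in loaded:
        try:
            packages.append([name, metadata.version(name)])
        except (metadata.PackageNotFoundError, ValueError):
            # Standard library or a module without distribution metadata.
            continue
    return {
        "platform": platform.platform(),
        "python": sys.version,
        "packages": packages,
    }


env = collect_environment()
print(env["platform"])
print(env["python"])
```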

dependencies

Nested list of metadata belonging to the upstream models. Listing a model in dependencies means that it is impossible to use the dependent model without it. This should not be confused with the data used to generate the model, which is listed in datasets.
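
Because the metadata is nested, the complete set of upstream models can be gathered recursively. The sketch assumes that each entry in dependencies is itself a metadata dict with a "uuid" and, optionally, its own "dependencies" list; `all_dependency_uuids` and the placeholder identifiers are hypothetical.

```python
def all_dependency_uuids(meta: dict) -> list:
    """Return the uuids of all direct and transitive dependencies, depth-first."""
    found = []
    for dep in meta.get("dependencies", []):
        found.append(dep["uuid"])
        found.extend(all_dependency_uuids(dep))
    return found


# Hypothetical model which needs "id-b", which in turn needs "id-c".
model_meta = {
    "uuid": "id-a",
    "dependencies": [
        {"uuid": "id-b", "dependencies": [{"uuid": "id-c"}]},
    ],
}
print(all_dependency_uuids(model_meta))
```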