ConTSG-Bench Submission Guide

March 6, 2026

This guide explains how to submit your model and evaluation results to the ConTSG-Bench leaderboard.

For canonical benchmark scope (dataset/model/metric counts), see docs/benchmark_spec.md.


Table of Contents

  1. Overview
  2. Benchmark Datasets
  3. Evaluation Metrics
  4. Submission File Format
  5. Step-by-Step Submission Process
  6. Running Evaluation with ConTSG
  7. Submission Validation
  8. How Ranking Works

1. Overview

ConTSG-Bench is an open benchmark for Conditional Time Series Generation. To ensure reproducibility and comparability, all submissions must be evaluated using the ConTSG framework.

Submission workflow:

  1. Evaluate your model on one or more benchmark datasets.
  2. Create a YAML file with your results.
  3. Validate your YAML locally with validate_submission.
  4. Open a GitHub Pull Request adding the file to the submissions/ directory.
  5. Maintainers review and merge valid submissions.
  6. Maintainers regenerate and validate leaderboard snapshots before publishing updates.

2. Benchmark Datasets

The benchmark consists of 10 datasets spanning 6 domains and 2 semantic levels:

| Dataset ID | Domain | Semantic Level | Variables | Seq Length |
|---|---|---|---|---|
| synth-m | Synthetic | Morphological | 2 | 128 |
| synth-u | Synthetic | Morphological | 1 | 128 |
| ettm1 | Energy | Morphological | 1 | 120 |
| weather_concept | Weather | Conceptual | 10 | 36 |
| weather_morphology | Weather | Morphological | 10 | 36 |
| telecomts_segment | Telecom | Morphological | 2 | 128 |
| istanbul_traffic | Traffic | Morphological | 1 | 144 |
| airquality_beijing | Environment | Morphological | 6 | 24 |
| ptbxl_concept | Health | Conceptual | 12 | 1000 |
| ptbxl_morphology | Health | Morphological | 12 | 1000 |

Notes:

  • All datasets provide aligned conditions across all three modalities (text, attribute, label). Models using any condition type can be evaluated and compared on every dataset.
  • You do not need to evaluate on all 10 datasets. Submit results for whichever datasets you have.
  • The condition_modality field in your submission should match the condition type used during training/generation (text, attribute, or label).

3. Evaluation Metrics

The leaderboard tracks 15 metrics across 3 groups:

3.1 Fidelity (7 metrics)

Measures how realistic the generated time series are.

| Metric ID | Display Name | Direction |
|---|---|---|
| acd | ACD | Lower is better |
| sd | SD | Lower is better |
| kd | KD | Lower is better |
| mdd | MDD | Lower is better |
| fid | FID | Lower is better |
| prdc_f1.precision | kNN-PRF Precision | Higher is better |
| prdc_f1.recall | kNN-PRF Recall | Higher is better |

3.2 Adherence (4 metrics)

Measures how well the generated time series follow the given condition.

| Metric ID | Display Name | Direction |
|---|---|---|
| jftsd | JFTSD | Lower is better |
| joint_prdc_f1.precision | Joint kNN-PRF Precision | Higher is better |
| joint_prdc_f1.recall | Joint kNN-PRF Recall | Higher is better |
| cttp | CTTP | Higher is better |

3.3 Utility (4 metrics)

Measures practical quality via sample-level distances.

| Metric ID | Display Name | Direction |
|---|---|---|
| dtw | DTW | Lower is better |
| crps | CRPS | Lower is better |
| ed | ED | Lower is better |
| wape | WAPE | Lower is better |

Notes:

  • You do not need to report all 15 metrics. The leaderboard displays a "coverage" score showing what fraction of metrics you reported.
  • Metric IDs use dot notation for nested metrics (e.g., prdc_f1.precision, not prdc_f1_precision).
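
If your own evaluation code stores metrics as nested dictionaries, the dot-notation IDs can be produced by flattening them. The sketch below is illustrative only: `flatten_metrics` is a hypothetical helper, and the nested layout of `example` is an assumption, not the documented output format of ConTSG.

```python
from typing import Any, Dict


def flatten_metrics(nested: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested metric dicts into dot-notation metric IDs."""
    flat: Dict[str, float] = {}
    for key, value in nested.items():
        metric_id = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested groups such as prdc_f1 -> precision / recall.
            flat.update(flatten_metrics(value, metric_id))
        else:
            flat[metric_id] = float(value)
    return flat


# Hypothetical nested result, for illustration only.
example = {"fid": 12.34, "prdc_f1": {"precision": 0.81, "recall": 0.77}}
print(flatten_metrics(example))
```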

4. Submission File Format

Each submission is a single YAML file placed in the submissions/ directory, named <your_model_name>.yaml.

4.1 Full Schema

model:
  name: my_model                    # Required. Unique model identifier.
  model_type: diffusion             # Required. One of: diffusion, flow, vae, gan, retrieval, other
  org: "My Lab"                     # Optional. Organization or author name.
  paper_link: "https://arxiv.org/abs/2401.xxxxx"  # Optional.
  code_link: "https://github.com/user/repo"       # Recommended. Source code repository.
  model_link: "https://huggingface.co/user/model" # Recommended. HF checkpoint repository.
  ckpt_scope: ["synth-u", "synth-m"]              # Optional. Public checkpoint coverage.
  reproducibility:                                # Recommended for reproducible leaderboard entries.
    script_link: "https://github.com/user/repo/blob/main/scripts/reproduce_contsg.sh"
    command: "bash scripts/reproduce_contsg.sh --dataset synth-u --seed 0"
    commit: "0123456789abcdef0123456789abcdef01234567"
    notes: "One-click script to reproduce reported numbers from checkpoints."
  params: "45M"                     # Optional. Number of trainable parameters.
  notes: "Brief description"        # Optional. One-line model description.

results:
  - dataset: synth-m                # Required. Must be one of the 10 benchmark dataset IDs.
    condition_modality: text         # Required. One of: text, attribute, label.
    n_runs: 3                        # Required. Number of random seeds used (>=1).
    metrics:
      dtw:  { mean: 1.234, std: 0.056 }
      fid:  { mean: 12.34, std: 1.20 }
      cttp: { mean: 0.87,  std: 0.02 }
      # ... add any subset of the 15 benchmark metrics

  - dataset: ettm1
    condition_modality: text
    n_runs: 3
    metrics:
      dtw:  { mean: 2.345, std: 0.089 }
      fid:  { mean: 8.76,  std: 0.95 }
      # ...

4.2 Field Details

model section

| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier for your model. Use lowercase with no spaces (e.g., my_model). |
| model_type | Yes | Architecture family: diffusion, flow, vae, gan, retrieval, or other. |
| org | No | Your organization or team name. |
| paper_link | No | URL to the paper (arXiv, conference, etc.). |
| code_link | Recommended | URL to the source code repository. Needed for Replication Code and Open-source only filter on the leaderboard. |
| model_link | Recommended | URL to model weights/checkpoints (prefer Hugging Face model repo). Used for HF Weights display. |
| ckpt_scope | No | List of dataset IDs for which checkpoints are publicly available (e.g., ["synth-u","synth-m"]). |
| reproducibility.script_link | Recommended | URL to one-click reproduction script in your code repository. |
| reproducibility.command | Recommended | Exact command to reproduce reported metrics. |
| reproducibility.commit | Recommended | Git commit hash corresponding to reported numbers. |
| reproducibility.notes | No | Extra details for maintainers/reviewers. |
| params | No | Number of parameters as a human-readable string (e.g., "45M", "1.2B"). |
| notes | No | A brief one-line description of the model. |

results section (list)

Each entry describes results on one dataset:

| Field | Required | Description |
|---|---|---|
| dataset | Yes | One of the 10 benchmark dataset IDs (see Section 2). |
| condition_modality | Yes | The condition type used: text, attribute, or label. |
| n_runs | Yes | Number of independent runs (random seeds) that were averaged. Must be >= 1. |
| metrics | Yes | A mapping of metric IDs to {mean, std} values. |
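
As a rough pre-check before running the official validator, the field rules above can be expressed in a few lines of Python. This is a minimal sketch, not the logic of validate_submission: `check_result_entry` is a hypothetical helper, and the error strings are invented for illustration.

```python
from typing import Any, Dict, List

# Dataset IDs and modalities as listed in this guide.
DATASETS = {
    "synth-m", "synth-u", "ettm1", "weather_concept", "weather_morphology",
    "telecomts_segment", "istanbul_traffic", "airquality_beijing",
    "ptbxl_concept", "ptbxl_morphology",
}
MODALITIES = {"text", "attribute", "label"}


def check_result_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems for one results entry (empty if none found)."""
    errors = []
    if entry.get("dataset") not in DATASETS:
        errors.append("dataset must be one of the 10 benchmark dataset IDs")
    if entry.get("condition_modality") not in MODALITIES:
        errors.append("condition_modality must be text, attribute, or label")
    n_runs = entry.get("n_runs")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n_runs, bool) or not isinstance(n_runs, int) or n_runs < 1:
        errors.append("n_runs must be an integer >= 1")
    if not isinstance(entry.get("metrics"), dict):
        errors.append("metrics must be a mapping of metric IDs to {mean, std}")
    return errors
```

An entry such as the one in the Full Schema example yields an empty list; a misspelled dataset ID or a missing n_runs yields one message per problem.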

metrics values

Each metric is specified as:

metric_name:
  mean: 1.234      # Required. The mean value across runs.
  std: 0.056       # Optional (default: 0.0). Standard deviation across runs.

  • mean must be a finite number (no NaN, no Infinity).
  • std must be >= 0.
  • If you only ran one seed, set std: 0.0 (or omit it).
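
The value rules above can be checked with the standard library alone. The following is a sketch under the assumption that each metric has already been parsed into a Python dict; `check_metric` is a hypothetical name and does not reflect how the official validator is implemented.

```python
import math
from typing import Any, Dict, List


def _is_number(x: Any) -> bool:
    # bool is a subclass of int, so reject it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def check_metric(name: str, value: Dict[str, Any]) -> List[str]:
    """Return a list of problems for one metric entry (empty if valid)."""
    problems = []
    mean = value.get("mean")
    std = value.get("std", 0.0)  # std is optional and defaults to 0.0
    if not _is_number(mean) or not math.isfinite(mean):
        problems.append(f"{name}: mean must be a finite number")
    # "not std >= 0" also rejects NaN, which compares False with everything.
    if not _is_number(std) or not std >= 0:
        problems.append(f"{name}: std must be >= 0")
    return problems
```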

4.3 Minimal Example

A minimal valid submission with only 1 dataset and 2 metrics:

model:
  name: my_baseline
  model_type: other

results:
  - dataset: synth-m
    condition_modality: text
    n_runs: 1
    metrics:
      dtw: { mean: 5.67 }
      fid: { mean: 45.2 }

5. Step-by-Step Submission Process

Step 1: Fork the repository

Fork ConTSG-Bench on GitHub, then clone your fork and create a branch for the submission:

git clone https://github.com/<your-user-or-org>/ConTSG-Bench.git
cd ConTSG-Bench
git checkout -b add-my-model-submission

Step 2: Create your submission file

# Create the file
vim submissions/my_model.yaml
# (paste your YAML content)

Step 3: Validate locally

pip install -e .
python -m contsg.leaderboard.validate_submission submissions/my_model.yaml

Expected output for a valid file:

INFO: Submission 'my_model': 3 datasets, 48 recognized metrics
Validation PASSED (0 warnings)

Step 4: Commit and push

git add submissions/my_model.yaml
git commit -m "Add my_model results to leaderboard"
git push origin add-my-model-submission

Step 5: Open a Pull Request

Open a PR on GitHub. Maintainers will review your submission and run validation/aggregation steps before publishing leaderboard updates.

Open-source display behavior

The leaderboard uses model.code_link to:

  • show the Model (code) hyperlink in the ranking table
  • populate the Replication Code column (Yes when code_link is provided)
  • support the Open-source only ranking filter

The leaderboard also uses model.model_link to:

  • show the HF Weights hyperlink for public checkpoints (when provided)

Reproducibility requirements (new submissions)

For new submissions, we strongly recommend including:

  • checkpoint link(s) (model_link)
  • one-click reproduction script (reproducibility.script_link)
  • exact reproduction command (reproducibility.command)
  • pinned code revision (reproducibility.commit)

Template files are provided in:

  • templates/repro_submission/submission.example.yaml
  • templates/repro_submission/reproduce_contsg.sh
  • templates/repro_submission/README.md

Validation note:

  • validate_submission currently focuses on metric and schema integrity
  • the reproducibility fields above are checked by maintainers during PR review

The official ConTSG checkpoint release currently covers:

  • synth-u
  • synth-m

Official checkpoint repository:


6. Running Evaluation with ConTSG

If you are using the ConTSG framework, you can generate results directly:

Training

pip install -e .
contsg train -d synth-m -m my_model --seed 0
contsg train -d synth-m -m my_model --seed 1
contsg train -d synth-m -m my_model --seed 2

Evaluation

contsg evaluate experiments/<run_dir_seed0>
contsg evaluate experiments/<run_dir_seed1>
contsg evaluate experiments/<run_dir_seed2>

Each evaluation produces results/eval_results.json containing all 15 metrics. You can then aggregate the 3 seeds into a submission YAML manually or programmatically.
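
One way to aggregate programmatically is sketched below. It assumes you have already loaded each seed's results into a flat dict of metric ID to value; the layout of eval_results.json is not described in this guide, so the loading step is left out. `aggregate_runs` is a hypothetical helper, and the use of the population standard deviation is an assumption (use `statistics.stdev` if you prefer the sample estimate).

```python
import statistics
from typing import Dict, List


def aggregate_runs(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Combine per-seed metric values into {metric_id: {"mean", "std"}}."""
    # Keep only the metrics that every run reported.
    shared = set(runs[0]).intersection(*runs[1:])
    summary = {}
    for metric_id in sorted(shared):
        values = [run[metric_id] for run in runs]
        summary[metric_id] = {
            "mean": statistics.fmean(values),
            # Population std (assumption); 0.0 when there is a single seed.
            "std": statistics.pstdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Made-up numbers for three seeds, for illustration only.
runs = [
    {"dtw": 1.0, "fid": 10.0},
    {"dtw": 2.0, "fid": 12.0},
    {"dtw": 3.0, "fid": 14.0},
]
for metric_id, stats in aggregate_runs(runs).items():
    # Emit lines in the flow-mapping style used by the submission schema.
    print(f"      {metric_id}: {{ mean: {stats['mean']:.4g}, std: {stats['std']:.4g} }}")
```

The printed lines can be pasted under the metrics key of the matching results entry, with n_runs set to the number of seeds.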


7. Submission Validation

Local validation

python -m contsg.leaderboard.validate_submission submissions/my_model.yaml

The validator checks:

  • YAML syntax is correct.
  • All required fields are present (model.name, model.model_type, results, etc.).
  • Dataset names are valid benchmark datasets.
  • Metric values are finite numbers.
  • Standard deviations are non-negative.

Errors will block the submission. Warnings (e.g., unknown metric names) are informational and will not block.

Maintainer post-merge workflow (manual)

After approved submissions are merged, maintainers should regenerate snapshot files and validate them before releasing leaderboard updates:

# 1) Aggregate submissions/*.yaml into snapshot files
python -m contsg.leaderboard.aggregate submissions ./snapshots --version vYYYY.MM.DD

# 2) Validate generated snapshot files
python -m contsg.leaderboard.validate ./snapshots

If snapshot validation fails, do not publish the update until all errors are resolved.

Common warnings

| Warning | Meaning |
|---|---|
| Unknown metric 'xxx' | This metric is not in the benchmark catalog. It will be silently ignored. |
| Duplicate dataset entry | You have two entries for the same dataset. Only the first will be used. |
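
To spot both conditions before opening a PR, a small check along these lines may help. It is a sketch, not the validator's code: `collect_warnings` is a hypothetical function, the message wording is approximate, and the metric catalog is copied from Section 3 of this guide.

```python
from typing import Any, Dict, List

# The 15 metric IDs listed in Section 3.
BENCHMARK_METRICS = {
    "acd", "sd", "kd", "mdd", "fid", "prdc_f1.precision", "prdc_f1.recall",
    "jftsd", "joint_prdc_f1.precision", "joint_prdc_f1.recall", "cttp",
    "dtw", "crps", "ed", "wape",
}


def collect_warnings(results: List[Dict[str, Any]]) -> List[str]:
    """Report unknown metric IDs and repeated dataset entries."""
    warnings = []
    seen = set()
    for entry in results:
        dataset = entry["dataset"]
        if dataset in seen:
            warnings.append(f"Duplicate dataset entry: '{dataset}'")
            continue  # only the first entry for a dataset is used
        seen.add(dataset)
        for metric_id in entry.get("metrics", {}):
            if metric_id not in BENCHMARK_METRICS:
                warnings.append(f"Unknown metric '{metric_id}'")
    return warnings
```

A common trigger for the first warning is writing prdc_f1_precision instead of prdc_f1.precision.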

8. How Ranking Works

The leaderboard uses a percentile normalization + weighted aggregation ranking system. Only Fidelity and Adherence groups participate in the overall ranking. Utility metrics are tracked but do not affect rankings.

Step 1: Percentile Normalization

For each metric within each dataset, compute the percentile rank across all models regardless of condition modality (text, attribute, label). Because ConTSG-Bench provides aligned conditions across all three modalities for each dataset, models using different condition types are directly comparable on the same data.

  • For "lower is better" metrics: norm_score = 1 - percentile_rank
  • For "higher is better" metrics: norm_score = percentile_rank
  • Result: norm_score in [0, 1], where higher is always better.
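
The normalization can be illustrated as follows. This guide does not specify the exact percentile-rank formula or how ties are handled, so the sketch assumes a simple definition: the share of other models with a strictly smaller value, with ties counted as half. Function names are hypothetical.

```python
from typing import Dict


def percentile_ranks(values: Dict[str, float]) -> Dict[str, float]:
    """Percentile rank in [0, 1] for each model's raw metric value."""
    n = len(values)
    ranks = {}
    for model, v in values.items():
        below = sum(1 for other in values.values() if other < v)
        ties = sum(1 for other in values.values() if other == v) - 1  # exclude self
        # Assumption: a lone model sits at the midpoint.
        ranks[model] = (below + 0.5 * ties) / (n - 1) if n > 1 else 0.5
    return ranks


def norm_scores(values: Dict[str, float], lower_is_better: bool) -> Dict[str, float]:
    """Map raw values to norm_score, where higher is always better."""
    ranks = percentile_ranks(values)
    return {m: (1.0 - r) if lower_is_better else r for m, r in ranks.items()}


# Made-up FID values for three models on one dataset (lower is better).
fid = {"model_a": 8.0, "model_b": 12.0, "model_c": 10.0}
print(norm_scores(fid, lower_is_better=True))
```

Under this definition the model with the lowest FID receives a norm_score of 1.0 and the model with the highest receives 0.0.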

Step 2: Group Aggregation

For each model, compute the average norm_score within each ranking group:

  • Fidelity score (average of 7 fidelity norm_scores)
  • Adherence score (average of 4 adherence norm_scores)

Utility metrics (DTW, CRPS, ED, WAPE) are excluded from ranking.
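
A sketch of the group averaging is shown below. It assumes that metrics a model did not report are simply left out of the average; how the leaderboard actually treats missing metrics, and how scores are combined across datasets, is not stated in this guide. `group_score` is a hypothetical helper.

```python
from typing import Dict, List, Optional

FIDELITY = ["acd", "sd", "kd", "mdd", "fid", "prdc_f1.precision", "prdc_f1.recall"]
ADHERENCE = ["jftsd", "joint_prdc_f1.precision", "joint_prdc_f1.recall", "cttp"]


def group_score(norm: Dict[str, float], metric_ids: List[str]) -> Optional[float]:
    """Average the norm_scores of the group's metrics that were reported."""
    reported = [norm[m] for m in metric_ids if m in norm]
    # Assumption: a group with no reported metrics has no score.
    return sum(reported) / len(reported) if reported else None


# Made-up norm_scores for one model; dtw is a Utility metric and is ignored.
norm = {"fid": 0.5, "acd": 1.0, "cttp": 0.75, "dtw": 0.9}
print(group_score(norm, FIDELITY), group_score(norm, ADHERENCE))
```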

Step 3: Overall Score

Three ranking policies are available:

| Policy | Description |
|---|---|
| Balanced | Average of fidelity and adherence scores |
| Fidelity-only | Rank solely by fidelity score |
| Adherence-only | Rank solely by adherence score |

Under the default "Balanced" policy, the overall score is the arithmetic mean: (fidelity + adherence) / 2.
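
The three policies reduce to a small selection function. In this sketch the policy strings ("balanced", "fidelity_only", "adherence_only") are hypothetical labels chosen for illustration, not identifiers taken from the leaderboard code.

```python
def overall_score(fidelity: float, adherence: float, policy: str = "balanced") -> float:
    """Combine group scores into an overall score under a ranking policy."""
    if policy == "balanced":
        # Arithmetic mean of the two ranking groups.
        return (fidelity + adherence) / 2
    if policy == "fidelity_only":
        return fidelity
    if policy == "adherence_only":
        return adherence
    raise ValueError(f"unknown policy: {policy!r}")


# Made-up group scores for one model.
print(overall_score(0.75, 0.25))
```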

Coverage

The leaderboard shows coverage: the fraction of the 15 benchmark metrics that you reported. Models with higher coverage have more robust rankings.
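
As a quick illustration, coverage for one set of reported metric IDs could be computed as below. This is a sketch: `coverage` is a hypothetical helper, unknown IDs are assumed not to count, and how the leaderboard combines coverage across datasets is not specified here.

```python
from typing import Iterable

# The 15 metric IDs listed in Section 3.
BENCHMARK_METRICS = {
    "acd", "sd", "kd", "mdd", "fid", "prdc_f1.precision", "prdc_f1.recall",
    "jftsd", "joint_prdc_f1.precision", "joint_prdc_f1.recall", "cttp",
    "dtw", "crps", "ed", "wape",
}


def coverage(reported: Iterable[str]) -> float:
    """Fraction of the benchmark metrics present in `reported`."""
    recognized = set(reported) & BENCHMARK_METRICS
    return len(recognized) / len(BENCHMARK_METRICS)


# The three metrics from the first entry of the Full Schema example.
print(coverage(["dtw", "fid", "cttp"]))
```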