ConTSG-Bench: A Unified Benchmark for Conditional Time Series Generation

May 2, 2026


ConTSG-Bench is a benchmark for conditional time series generation across conditioning modalities (text, attribute, class label) and semantic abstraction levels (morphological vs. conceptual).

News

  • 🎉 May 2026: ConTSG-Bench has been accepted at ICML 2026!

Call for Models & Datasets -- ConTSG-Bench is an open and evolving benchmark. We welcome the community to submit new models and datasets to the leaderboard. See the Submission Guide for details. Reproducibility template files are available in templates/repro_submission/.

Benchmark scope (datasets/models/metrics) is maintained in docs/benchmark_spec.md.


Overview

Conditional time series generation (ConTSG) has emerged as a transformative capability for scientific and industrial advancement. Its applications range from realistic data simulation for healthcare and climate analysis to causal inference and privacy-preserving data synthesis. While unconditional generation has seen significant progress with established benchmarks, the research frontier has shifted toward controllable synthesis: generating high-fidelity time series that strictly adhere to user-defined, multimodal conditions.

However, the landscape of ConTSG remains highly fragmented. Current methods are isolated by their specific conditioning modalities: some rely on discrete class labels, others on structured attributes, and recent works explore natural language descriptions. These models are typically evaluated on incompatible datasets with different condition modalities, making it infeasible to systematically compare conditional generation effectiveness.

ConTSG-Bench addresses this critical gap by providing the first unified benchmark for conditional time series generation. Our benchmark systematically disentangles condition types along two axes:

  • Modality: class label, attribute, and text
  • Semantic abstraction: morphological (describing observable temporal structures) vs. conceptual (describing high-level domain semantics)

Figure: Condition modality and semantic abstraction overview. ConTSG-Bench unifies condition modalities (label, attribute, text) and semantic abstraction levels (morphological, conceptual) in a single benchmark setting.

Key features:

  • Single CLI Entry Point: All operations via contsg command
  • Registry Pattern: Easy model/dataset extension via decorators
  • PyTorch Lightning: Standardized training with built-in best practices
  • Experiment Tracking: Git commit tracking, config snapshots, checkpoint management

Datasets

ConTSG-Bench comprises 10 benchmark datasets spanning diverse domains, with aligned conditions across text / attribute / label modalities:

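| Dataset ID         | Domain            | Variates | Seq Length | Semantic Level |
|--------------------|-------------------|----------|------------|----------------|
| synth-u            | Synthetic         | 1        | 128        | Morphological  |
| synth-m            | Synthetic         | 2        | 128        | Morphological  |
| ettm1              | Energy            | 1        | 120        | Morphological  |
| istanbul_traffic   | Transportation    | 1        | 144        | Morphological  |
| airquality_beijing | Environment       | 6        | 24         | Morphological  |
| telecomts_segment  | Network Telemetry | 2        | 128        | Morphological  |
| ptbxl_morphology   | Healthcare (ECG)  | 12       | 1000       | Morphological  |
| ptbxl_concept      | Healthcare (ECG)  | 12       | 1000       | Conceptual     |
| weather_morphology | Meteorology       | 10       | 36         | Morphological  |
| weather_concept    | Meteorology       | 10       | 36         | Conceptual     |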

Each dataset provides aligned multimodal conditions: text descriptions, structured attributes, and class labels are derived from the same underlying semantics, enabling controlled cross-modality comparison.
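To make the idea of aligned conditions concrete, here is a minimal sketch of one sample carrying all three condition views. The class name, field names, and example values are hypothetical and chosen only for illustration; they are not the benchmark's actual on-disk schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class AlignedSample:
    """Hypothetical container: one series with three views of the same semantics."""

    series: List[List[float]]      # (L, C) time series values
    text: str                      # natural language description
    attributes: Dict[str, str]     # structured attribute view
    label: int                     # class label view


def condition_for(sample: AlignedSample, modality: str) -> Union[str, Dict[str, str], int]:
    """Pick the condition view that a model of the given modality would consume."""
    views = {"text": sample.text, "attribute": sample.attributes, "label": sample.label}
    if modality not in views:
        raise ValueError(f"unknown modality: {modality!r}")
    return views[modality]


# Illustrative values only, not taken from any benchmark dataset.
sample = AlignedSample(
    series=[[0.0], [0.5], [1.0]],
    text="a steadily rising trend",
    attributes={"trend": "up"},
    label=1,
)
print(condition_for(sample, "attribute"))
```

Because every view describes the same series, swapping the modality argument changes only how the condition is expressed, which is what makes a cross-modality comparison controlled.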

Dataset release: ConTSG-Bench Dataset on Hugging Face

Checkpoint release (current public scope: synth-u, synth-m): ConTSG-Bench Checkpoints on Hugging Face

CTTP resources in the checkpoint release do not mirror the upstream LongCLIP text encoder. To reproduce the public setup, download the exact upstream LongCLIP model zer0int/LongCLIP-GmP-ViT-L-14 separately and point pretrain_model_path (or LONGCLIP_ROOT if a released CTTP model_configs.yaml uses that placeholder) to that local model directory. Do not substitute the original BeichenZhang/LongCLIP-L; the released CTTP resources were checked against the GmP fine-tuned LongCLIP variant. See configs/README.md for direct download commands.

Supported Models

ConTSG-Bench benchmarks 11 representative generation models spanning all three conditioning modalities:

Text-Conditioned Models

| Model       | Registry ID | Backbone                    | Reference              |
|-------------|-------------|-----------------------------|------------------------|
| VerbalTS    | verbalts    | Patch Transformer + DDPM    | Gu et al., ICML 2025   |
| T2S         | t2s         | Transformer + Flow Matching | Ge et al., IJCAI 2025  |
| BRIDGE      | bridge      | UNet + DDPM                 | Li et al., ICML 2025   |
| DiffuSETS   | diffusets   | VAE + Latent DDPM           | Lai et al., Patterns 2025 |
| Text2Motion | text2motion | Conv AE + GRU-VAE           | Guo et al., CVPR 2022  |
| Retrieval   | retrieval   | Nearest Neighbor (Baseline) | -                      |
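As a rough illustration of what a nearest-neighbor baseline can look like, the sketch below returns the training series whose text embedding is most similar to the query embedding. The use of cosine similarity, the function names, and the toy data are all assumptions made for this example; the registered retrieval model may work differently.

```python
from math import sqrt
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(
    query_emb: Sequence[float],
    train_embs: List[Sequence[float]],
    train_series: List[List[float]],
) -> List[float]:
    """Return the training series paired with the most similar embedding."""
    if not train_embs or len(train_embs) != len(train_series):
        raise ValueError("need equally sized, non-empty embeddings and series")
    best = max(
        range(len(train_embs)),
        key=lambda i: cosine_similarity(query_emb, train_embs[i]),
    )
    return train_series[best]


# Toy data for illustration only.
train_embs = [[1.0, 0.0], [0.0, 1.0]]
train_series = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0]]
print(retrieve([0.9, 0.1], train_embs, train_series))
```

A baseline of this kind generates nothing new, so it is useful mainly as a reference point for how far learned generators improve over copying a training example.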

Attribute-Conditioned Models

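| Model      | Registry ID | Backbone                 | Reference                    |
|------------|-------------|--------------------------|------------------------------|
| TimeWeaver | timeweaver  | Transformer + DDPM       | Narasimhan et al., ICML 2024 |
| WaveStitch | wavestitch  | S4 + DDPM                | Shankar et al., PACM 2025    |
| TEdit      | tedit       | Patch Transformer + DDPM | Jing et al., NeurIPS 2024    |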

Label-Conditioned Models

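| Model     | Registry ID | Backbone                    | Reference                |
|-----------|-------------|-----------------------------|--------------------------|
| TimeVQVAE | timevqvae   | VQ-VAE + Masked Transformer | Lee et al., AISTATS 2023 |
| TTS-CGAN  | ttscgan     | Transformer + GAN           | Li et al., 2022          |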

Key Results

Figure: Model ranking under two metric groups: generation fidelity (left) and condition adherence (right). Text-conditioned models offer the highest performance ceiling but also the largest variance.

Key findings from our benchmark:

  • Good generation fidelity does not guarantee condition adherence. Some models perform consistently well on both dimensions, while others show significant rank differences, confirming the need to evaluate these two aspects separately.
  • Text conditioning offers the highest performance ceiling but also the largest variance. Text-conditioned models span the full range from top to bottom, whereas attribute-conditioned methods cluster in the upper-middle tier.
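The notion of a rank difference between the two metric groups can be sketched as follows. The model names and scores are invented placeholders, not benchmark results, and the assumption that higher scores are better is made only to keep the example short.

```python
from typing import Dict


def ranks(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, int]:
    """Map each model to its 1-based rank (ties broken by insertion order)."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Placeholder scores for three hypothetical models (NOT real results).
fidelity = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
adherence = {"model_a": 0.8, "model_b": 0.3, "model_c": 0.6}

fidelity_rank = ranks(fidelity)
adherence_rank = ranks(adherence)

# Negative gap: the model ranks better on fidelity than on adherence.
rank_gap = {name: fidelity_rank[name] - adherence_rank[name] for name in fidelity}
print(rank_gap)
```

A model with a gap of zero behaves consistently on both axes, while a large gap in either direction is the kind of disagreement that motivates reporting the two metric groups separately.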

Explore the full results on our interactive leaderboard: ConTSG-Bench Leaderboard

Installation

# Clone the repository
git clone https://github.com/seqml/ConTSG-Bench.git
cd ConTSG-Bench

# Install full runtime + development tools (recommended)
uv sync --extra full --extra dev

# Alternative: install with pip
pip install -e ".[full,dev]"

If you only need the core package APIs (without the full benchmark dependencies), run pip install -e . instead.

Quick Start

Training a Model

# Fast smoke test (no dataset files required)
contsg train -d debug -m verbalts --smoke

# Basic training with dataset and model specification (requires dataset + checkpoints)
contsg train --dataset synth-m --model verbalts

# With custom parameters
contsg train -d synth-m -m verbalts --epochs 1000 --lr 5e-4

# From configuration file
contsg train --config configs/generators/verbalts_ettm1.yaml

# Resume training from a checkpoint
contsg train -d synth-m -m verbalts --resume experiments/exp1/

For benchmark config placeholders (<CTTP_CONFIG>, <CTTP_CHECKPOINT>), see configs/README.md.

Evaluating a Model

# Evaluate best checkpoint
contsg evaluate experiments/20250101_synth-m_verbalts/

# Specify checkpoint and metrics
contsg evaluate experiments/exp1/ --checkpoint last --metrics dtw,fid
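
Assuming the dtw metric name refers to dynamic time warping, the following is a minimal pure-Python sketch of the classic DTW distance between two univariate sequences. It is only meant to show what the quantity measures; the benchmark's own implementation may differ in normalization, windowing, or multivariate handling.

```python
from math import inf
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A repeated value in the second sequence is absorbed by warping.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

Unlike a pointwise distance, DTW tolerates local shifts in time, which is why it is a common choice for comparing generated and reference series of similar shape.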

Listing Available Resources

contsg list-models        # List all registered models
contsg list-datasets      # List all registered datasets
contsg list-experiments   # List experiments
contsg info experiments/exp1/  # Show experiment details

Extending the Benchmark

ConTSG-Bench uses a decorator-based registry: models and datasets are auto-discovered at runtime. See the full extending guide for detailed documentation including configuration reference, multi-stage training, and complete examples.
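
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of how a decorator-based registry can work. ToyRegistry and its methods are hypothetical stand-ins written for this example; they are not the contsg.registry implementation.

```python
from typing import Callable, Dict, List, Type


class ToyRegistry:
    """Hypothetical registry mapping string IDs to model classes."""

    _models: Dict[str, Type] = {}

    @classmethod
    def register_model(cls, name: str) -> Callable[[Type], Type]:
        """Return a class decorator that stores the decorated class under `name`."""

        def decorator(model_cls: Type) -> Type:
            if name in cls._models:
                raise KeyError(f"model '{name}' is already registered")
            cls._models[name] = model_cls
            return model_cls  # the class itself is returned unchanged

        return decorator

    @classmethod
    def get_model(cls, name: str) -> Type:
        """Look up a registered class by its ID."""
        if name not in cls._models:
            known = ", ".join(sorted(cls._models)) or "<none>"
            raise KeyError(f"unknown model '{name}' (registered: {known})")
        return cls._models[name]

    @classmethod
    def list_models(cls) -> List[str]:
        """Return all registered IDs in sorted order."""
        return sorted(cls._models)


@ToyRegistry.register_model("my_model")
class MyModel:
    """Placeholder model class."""


print(ToyRegistry.list_models())
```

Registration happens as a side effect of importing the module that defines the class, which is why a CLI built on this pattern can resolve a model ID to a class without a hand-maintained lookup table.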

Adding a New Model

Subclass BaseGeneratorModule and implement three methods: _build_model(), forward(), and generate().

# contsg/models/my_model.py
from torch import nn

from contsg.models.base import BaseGeneratorModule
from contsg.registry import Registry


@Registry.register_model("my_model")
class MyModelModule(BaseGeneratorModule):
    """My custom generation model."""

    def _build_model(self):
        # Build all submodules from the model and data config sections.
        cfg = self.config.model
        data_cfg = self.config.data
        self.encoder = nn.Linear(cfg.channels, data_cfg.n_var * data_cfg.seq_length)
        self.decoder = nn.Linear(cfg.channels, data_cfg.n_var * data_cfg.seq_length)

    def forward(self, batch):
        ts = batch["ts"]           # (B, L, C): time series
        cap_emb = batch["cap_emb"] # (B, D): text embedding
        # ... compute loss (placeholder: define `loss` here) ...
        return {"loss": loss}      # must return dict with "loss" key

    def generate(self, condition, n_samples=1, **kwargs):
        # condition: (B, D) conditioning tensor
        # ... draw samples (placeholder: define `samples` here) ...
        # return: (B, n_samples, L, C)
        return samples

Then use it immediately:

contsg train -d synth-m -m my_model
contsg train -d debug -m my_model --smoke  # quick validation, no data files needed

Model-specific schema is optional. If you want strict validation for your custom model, register with @Registry.register_model("my_model", config_class=MyModelConfig), then run with --strict-schema. Without config_class, your model still works in default relaxed mode.
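
The difference between relaxed and strict validation can be sketched with plain dataclasses. Everything here is an assumption made for illustration: the MyModelConfig fields, the validate_model_config helper, and the rule that strict mode rejects unknown keys while relaxed mode passes them through. The project itself uses a Pydantic configuration system whose behavior may differ.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type


@dataclass
class MyModelConfig:
    """Hypothetical schema for a custom model."""

    channels: int = 64
    dropout: float = 0.1


def validate_model_config(
    raw: Dict[str, Any],
    config_class: Optional[Type] = None,
    strict: bool = False,
) -> Dict[str, Any]:
    """Merge `raw` with schema defaults; in strict mode reject unknown keys."""
    if config_class is None:
        return dict(raw)  # no schema registered: pass everything through
    known = {f.name for f in fields(config_class)}
    unknown = sorted(set(raw) - known)
    if unknown and strict:
        raise ValueError(f"unknown config keys: {unknown}")
    typed = config_class(**{k: v for k, v in raw.items() if k in known})
    merged = {f.name: getattr(typed, f.name) for f in fields(config_class)}
    merged.update({k: raw[k] for k in unknown})  # relaxed mode keeps extras
    return merged


raw = {"channels": 32, "extra_flag": True}
print(validate_model_config(raw, MyModelConfig))
```

Under these assumptions, a typo in a config key is silently carried along in relaxed mode but surfaces as an error in strict mode, which is the usual motivation for opting into a schema.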

Adding a New Dataset

For standard file format ({split}_ts.npy, {split}_cap_emb.npy, etc.), registration requires no custom code:

# contsg/data/datasets/my_dataset.py
from contsg.data.datamodule import BaseDataModule
from contsg.registry import Registry

@Registry.register_dataset("my_dataset")
class MyDataModule(BaseDataModule):
    """My custom dataset."""
    pass  # uses default TimeSeriesDataset loader

Place data files in datasets/my_dataset/ following this structure:

datasets/my_dataset/
├── meta.json              # Dataset metadata
├── train_ts.npy           # (N, L, C) time series
├── train_cap_emb.npy      # (N, D) text embeddings
├── valid_ts.npy, valid_cap_emb.npy
└── test_ts.npy,  test_cap_emb.npy

For non-standard formats, override _create_dataset(); see the extending guide for details.
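
Before training on a dataset in the standard layout, it can help to confirm that the expected files are present. The helper below is a hypothetical convenience written for this example, not part of the contsg package. It checks only the seven files named in the structure above; any additional files your setup needs are not covered.

```python
import tempfile
from pathlib import Path
from typing import List

SPLITS = ("train", "valid", "test")
ARRAYS = ("ts.npy", "cap_emb.npy")


def missing_files(root: Path) -> List[str]:
    """Return the expected file names that are absent under `root`."""
    expected = ["meta.json"] + [
        f"{split}_{array}" for split in SPLITS for array in ARRAYS
    ]
    return [name for name in expected if not (root / name).is_file()]


# Demo on a throwaway directory that contains only two of the seven files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "meta.json").write_text("{}")
    (root / "train_ts.npy").touch()
    demo_missing = missing_files(root)

print(demo_missing)
```

To check a real dataset, call missing_files(Path("datasets/my_dataset")) and expect an empty list.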

Project Structure

contsg/
├── contsg/                     # Main package
│   ├── cli.py                  # CLI entry point (train / evaluate / generate)
│   ├── registry.py             # Model / dataset / metric registration
│   ├── tracker.py              # Experiment tracking
│   ├── config/                 # Pydantic configuration system
│   ├── models/                 # 11 generation models
│   ├── data/                   # Data handling & text embedding precomputation
│   ├── train/                  # Training utilities & multi-stage support
│   └── eval/                   # Evaluation system (15 leaderboard metrics + auxiliary metrics)
├── configs/                    # YAML configuration files
│   ├── cttp/                   # CTTP contrastive model configs
│   └── generators/             # Generator model configs
├── datasets/                   # Dataset files (gitignored)
└── experiments/                # Experiment outputs (gitignored)

Development & Quality Checks

# Run tests
uv run --extra dev pytest -v

# Lint and type checks
uv run --extra dev ruff check .
uv run --extra dev black --check .
uv run --extra dev mypy contsg

CI runs tests and markdown link checks on pull requests.

Citation

If you find ConTSG-Bench useful in your research, please cite our paper:

Paper: https://arxiv.org/abs/2603.04767

@article{contsgbench2026,
  title={ConTSG-Bench: A Unified Benchmark for Conditional Time Series Generation},
  author={Shaocheng Lan and Shuqi Gu and Zhangzhi Xiong and Kan Ren},
  journal={arXiv preprint arXiv:2603.04767},
  year={2026}
}

License

This project is licensed under the MIT License. See LICENSE for details.