Model evaluation and benchmarking

May 8, 2026 · View on GitHub

This workflow is an opinionated, config-driven benchmark entry point for PhysicsNeMo-CFD: Hydra and OmegaConf wire model inference, built-in metrics, tabular artifacts (JSON/CSV/HTML), optional PNG visuals, and optional VTK. It follows the same pattern as workflows/domino_design_sensitivities/. Defaults assume DrivAerML-style run layouts and evaluation components shipped under physicsnemo.cfd.evaluation. Nothing here is unique to that stack — you can extend wrappers, adapters, metrics, reports, and Hydra compositions as needed (see Custom models, datasets, and metrics).

How to use this README (goal-oriented)

Run the benchmark as configured — install, set benchmark.datasets[].root and model paths (or overrides), run python main.py (default config_matrix_surface_hf) or --config-name=config_matrix_surface_custom / config_matrix_volume_hf, confirm outputs under run.output_dir.
Stay in YAML / Hydra first — change YAML under conf/ (see for example config_matrix_volume_hf.yaml) or pass Hydra overrides for models, case_id, metrics, outputs, metrics cache, and exit policy; avoid Python changes until defaults are insufficient.
Extend when needed — add CFDModel wrappers, DatasetAdapter implementations, register_metric / register_visual, or custom configs for integrations the stock workflow does not cover.

Contributing: see the repository CONTRIBUTING.md for pull requests, tests, and sign-off requirements.

Installation

Install PhysicsNeMo-CFD from the repository root (setuptools reads pyproject.toml; there is no separate setup.py):

cd physicsnemo-cfd   # repository root
pip install -e ".[dev]"   # editable install + pytest for contributors
# optional GPU extras:
# pip install -e ".[gpu]" --extra-index-url=https://pypi.nvidia.com

Then use the commands below from this directory (workflows/benchmarking/).

Quick start

Prerequisites: Python 3.10+, NVIDIA GPU for full inference runs, checkpoints and dataset on disk or mounted volume. (For a concise goal-oriented path through docs and config, see the introduction above.)
Assets: Download benchmark checkpoints and the DrivAerML evaluation tree (see DrivAerML dataset: download and directory layout). Built-in matrix models resolve their weights from Hugging Face by default (the _hf configs); point model.checkpoint, stats_path, and dataset.root in conf/*.yaml at local paths to use on-disk checkpoints, or override them on the CLI (see Path overrides (Hydra)).
Configure: Edit YAML under conf/ (see Configuration files: _custom = local checkpoint / stats_path, _hf = Hugging Face builtin_packages when those keys are omitted).
Run:

# Matrix surface (default Hydra config: Hugging Face checkpoints)
python main.py

# Local checkpoint paths (edit paths in YAML first)
python main.py --config-name=config_matrix_surface_custom

# Matrix volume — HF vs local
python main.py --config-name=config_matrix_volume_hf
python main.py --config-name=config_matrix_volume_custom

# Overrides (examples)
python main.py case_id=run_1 run.device=cuda:0
python main.py --config-name=config_matrix_surface_hf run.output_dir=my_surface_run
python main.py run.fail_on_all_skipped=true

Outputs: Under run.output_dir — benchmark_results.json / .csv / .html, optional benchmark_artifacts.json, metrics_cache/ when enabled, Hydra metadata under hydra/ if configured.
Multi-GPU (optional): install physicsnemo so DistributedManager is available, then:

torchrun --standalone --nproc_per_node=4 main.py
# or: torchrun --standalone --nproc_per_node=8 main.py --config-name=config_matrix_surface_custom

Cases are split across GPUs (cases[rank::world_size]). Rank 0 writes reports and optional artifacts. Set run.distributed: false only for debugging (each rank would run the full case list).

Configuration files in `conf/`

Hydra loads --config-name=<stem> from this directory (no .yaml suffix). This workflow ships four configs: matrix only — _custom for local or NFS checkpoints you set in YAML, _hf for Hugging Face defaults via builtin_packages when checkpoint / stats_path are omitted.

Config file	Role
`config_matrix_surface_custom.yaml`	Matrix surface: `benchmark.models` × `benchmark.datasets`. Fill `checkpoint`, `stats_path`, `dataset.root`, and DoMINO `kwargs.domino_config` / `stats_path` → `scaling_factors.pkl` as in the template comments. Use `--config-name=config_matrix_surface_custom`.
`config_matrix_volume_custom.yaml`	Matrix volume with the same explicit path story for VTU inference.
`config_matrix_surface_hf.yaml`	Same matrix layout as the surface custom file, but models typically omit `checkpoint` / `stats_path` so weights resolve from Hugging Face. This is the default for `python main.py` (`main.py` `config_name`). Requires `huggingface-cli login` or `HF_TOKEN` for private repos.
`config_matrix_volume_hf.yaml`	HF-hosted volume matrix (GeoTransolver / Transolver / DoMINO); XMGN / FiGNet volume may have no Hub checkpoint—omit those blocks.

_custom vs _hf: use _custom when you want full control over on-disk checkpoints and stats; use _hf when you rely on Hub resolution and optional cache under PHYSICSNEMO_CFD_MODEL_CACHE.

VTK field names on written meshes: Under output, the maps mesh_field_names / ground_truth_mesh_field_names (surface) and volume_mesh_field_names / ground_truth_volume_mesh_field_names (volume) set the VTK array names used when the workflow writes optional .vtp / .vtu files (for example saved inference meshes from run.save_inference_mesh or comparison meshes when reports enable them). Choose strings that match your model’s written fields and the ground-truth adapter’s field names.

Path overrides (Hydra)

Override checkpoint, stats, and dataset paths from the CLI without editing YAML files:

python main.py \
  benchmark.models.0.checkpoint=/path/to/checkpoint.pt \
  benchmark.models.0.stats_path=/path/to/global_stats.json \
  benchmark.datasets.0.root=/path/to/drivaer_data

You can also export environment variables and reference them in a local, gitignored config using OmegaConf oc.env patterns if you add them to your YAML.

DrivAerML dataset: download and directory layout

The drivaerml adapter expects a local root directory (benchmark.datasets[].root) whose children are run_<id> folders. Each run holds the VTK ground-truth meshes the metrics read from.

Canonical source: Hugging Face neashton/drivaerml — dataset card, file layout, and download examples (also mirrored in spirit on CAE-ML Datasets). License: CC BY-SA 4.0; cite the DrivAerML paper per the card.

Scale: the full repository is on the order of ~31 TB. Do not blindly run a whole-repo sync on a laptop; use Git LFS only when you intend to host the full tree, or use selective HTTP / CLI downloads for testing.

What each run_/ contains (per the dataset card)

File / folder	Role
`drivaer_<i>.stl`	Geometry
`geo_ref_<i>.csv`, `geo_parameters_<i>.csv`	Reference / parameter metadata
`boundary_<i>.vtp`	Surface mesh + fields — required for surface benchmarking
`volume_<i>.vtu`	Volume mesh — required for volume benchmarking (see note below)
`force_mom_<i>.csv`, `force_mom_constref_<i>.csv`	Forces / moments
`slices/`, `Images/`	Slices and images (optional for this workflow)

Repo-level extras include openfoam_meshes/, force_mom_all.csv, etc.; the adapter does not need those for standard metrics.

Volume files on Hugging Face: for many runs, volume_.vtu is split on the Hub into two parts. Download both parts for a given i, then concatenate them into a single volume_.vtu in run_/ before running volume benchmarks (see the dataset card for naming). Surface-only tests only need boundary_.vtp.

On-disk layout (what the adapter checks)

Mode	Per-run directory	Default file
Surface (default)	`root/run_<n>/`	`boundary_<n>.vtp` (e.g. `run_1/boundary_1.vtp`)
Volume (`dataset.kwargs.inference_domain: volume`)	`root/run_<n>/`	`volume_<n>.vtu`

Only directories that contain the required mesh for the selected mode are listed as cases. Overrides for non-standard filenames are documented on DrivAerMLAdapter in physicsnemo.cfd.evaluation.datasets.adapters.drivaerml.

1) Full mirror — Git + Git LFS (dataset maintainers’ example)

Install Git LFS, then clone the dataset repo. The clone root (the directory that contains the run_* folders) is benchmark.datasets[].root.

git lfs install
git clone https://huggingface.co/datasets/neashton/drivaerml
# SSH alternative (if configured on Hugging Face): git clone git@hf.co:datasets/neashton/drivaerml

This pulls the entire ~31 TB collection when LFS objects are fetched — use only with appropriate storage and bandwidth.

2) Selective download for testing — HTTP (`wget` / `curl`)

The dataset card documents looping over run_$i and pulling individual files. For surface workflow smoke tests, fetch at least boundary_.vtp per run. Example (adjust the ID list or seq range):

HF_OWNER="neashton"
HF_PREFIX="drivaerml"
LOCAL_DIR="./drivaer_data"
mkdir -p "$LOCAL_DIR"

for i in 1 11 202; do
  RUN_DIR="run_${i}"
  DEST="$LOCAL_DIR/$RUN_DIR"
  mkdir -p "$DEST"
  wget -c "https://huggingface.co/datasets/${HF_OWNER}/${HF_PREFIX}/resolve/main/${RUN_DIR}/boundary_${i}.vtp" \
    -O "$DEST/boundary_${i}.vtp"
done

Then point the benchmark at the parent directory:

python main.py benchmark.datasets.0.root="$(pwd)/drivaer_data"

Adapt the URL pattern to add volume_* parts, STL, or force CSVs as needed; the card’s Example 2 shows the same idea for STL + force files.

3) Hugging Face Hub CLI — partial tree

Install the CLI and prefer include patterns so you do not sync 31 TB by accident:

pip install "huggingface_hub[cli]"
huggingface-cli download neashton/drivaerml --repo-type dataset \
  --local-dir /path/to/drivaerml_subset \
  --include "run_1/*" --include "run_11/*"

Expand --include for every run_ you need. For a full local mirror without Git, understand the size implications before omitting filters.

Train / validation split for benchmarking

After the tree is on disk, use the fixed 90/10 split files under drivaer_ml_files/ if you want to align case lists with that split — see drivaer_ml_files/README.md.

Notebooks (exploration, not batch production)

The notebooks/ directory (surface_benchmarking.ipynb, volume_benchmarking.ipynb, adding_a_new_model.ipynb, etc.) supports interactive exploration and onboarding; main.py + conf/ remains the recommendation for logged, repeatable benchmark runs aligned with CI and matrix configs.

Exit codes (automation)

main.py and python -m physicsnemo.cfd.evaluation.benchmarks.run both call run_benchmark_cli, which runs run_benchmark and maps BenchmarkPolicyError to process exit 1 (message on stderr). Exit 0 when the benchmark completes without that policy exit; other exceptions propagate to Hydra / Python as usual.
Set run.fail_on_all_skipped: true so BenchmarkPolicyError (exit 1) applies if every model×dataset run was skipped (e.g. domain mismatch in matrix mode).
Set run.fail_on_any_metric_nan: true so BenchmarkPolicyError (exit 1) applies if any non-skipped aggregate metric in metrics is NaN.
Override on the CLI: run.fail_on_all_skipped=true.

Config files (short reference)

File	Role
`conf/config_matrix_surface_custom.yaml`	Matrix surface, local `checkpoint` / `stats_path`; `run.output_dir`: `benchmark_results_matrix_surface`.
`conf/config_matrix_volume_custom.yaml`	Matrix volume, same for VTU runs; `run.output_dir`: `benchmark_results_matrix_volume`.
`conf/config_matrix_surface_hf.yaml`	Matrix surface, HF assets; `run.output_dir`: `benchmark_results_matrix_surface_hf`. Default `python main.py`.
`conf/config_matrix_volume_hf.yaml`	Matrix volume, HF assets; `run.output_dir` as in file.

Matrix mode: every benchmark.models entry runs against every benchmark.datasets entry; incompatible surface/volume pairs are skipped. Edit checkpoint paths in _custom; comment out any - name: block you do not need.

Config reference (YAML keys)

Section	Contents
run	`device`, `output_dir`, `seed`, `batch_size`, `save_inference_mesh`, `distributed`, `fail_on_all_skipped`, `fail_on_any_metric_nan`, optional `metrics_cache`
model / benchmark.models	`name`, `checkpoint`, `stats_path`, optional `package` (`hf://org/repo@rev`, `s3://…`, or local dir), `checkpoint_relpath` / `stats_relpath` (paths inside the package), `inference_domain`, `kwargs`
dataset / benchmark.datasets	`name`, `root`, optional `case_ids` (`null` = all cases from `adapter.list_cases()`), `kwargs`
output	VTK field name maps; `streamlines_vector_canonical` (volume); optional `surface_interpolate_point_to_cell_for_metrics` (kNN-IDW point → cell for XmGN/FiGNet-style surfaces so drag/lift/L2 use `metric_dtype: cell`), plus `surface_metrics_idw_k`, `surface_metrics_idw_device`
metrics	Metric names or `{ name: ..., ...kwargs }`
reports	Optional PNG pipeline
benchmark	`mode`, `models` / `datasets`, `reproducibility` (`log_env`, `save_artifacts`)

case_id: optional top-level Hydra key: null uses each dataset's case_ids (or all adapter cases); a string runs one case on every dataset; a list runs those cases in matrix mode.

benchmark.reproducibility.log_env: when true, writes full os.environ to env.json under run.output_dir — avoid in shared CI or when HF/AWS/other secrets may be present (default false in code and in all example YAMLs under conf/). Enable only deliberately for short-lived local debugging.

Metrics cache: run.metrics_cache.enabled stores per-case scalars; delete the cache directory for a full recompute. Plots and meshes are not cached.

Remote model assets: Install optional pip install 'nvidia-physicsnemo-cfd[evaluation-hf]' for hf:// and s3:// package roots. Cache directory defaults to ~/.cache/physicsnemo-cfd/models or override with PHYSICSNEMO_CFD_MODEL_CACHE. Built-in matrix models (including domino, which also resolves domino_config from the package) use per-model roots in physicsnemo.cfd.evaluation.assets.builtin_packages when checkpoint / stats_path are omitted; override with explicit paths or model.package as needed. register_default_asset remains available for custom names. See CONTRIBUTING.md for custom-wrapper tiers.

Custom models, datasets, and metrics

Custom models

Subclass CFDModel (physicsnemo.cfd.evaluation.models.model_registry) under physicsnemo/cfd/evaluation/models/wrappers/.
Implement INFERENCE_DOMAIN, OUTPUT_LOCATION, load, prepare_inputs, predict, decode_outputs(raw_output, case, model_input=None) (the engine passes the same model_input returned by prepare_inputs).
register_model("my_model", MyWrapper) in wrappers/__init__.py.

Custom datasets

Subclass DatasetAdapter under physicsnemo/cfd/evaluation/datasets/adapters/.
Implement list_cases, load_case → CanonicalCase.
register_adapter("my_dataset", MyAdapter) in adapters/__init__.py.

Custom metrics

Register functions with physicsnemo.cfd.evaluation.metrics.register_metric, then list the name under metrics: in YAML (see physicsnemo.cfd.evaluation.metrics).

Canonical types

Schema: physicsnemo.cfd.evaluation.datasets.schema — CanonicalCase, prediction keys for surface/volume.

Metrics

Registered names (or dicts with name + kwargs). Surface and volume matrix configs use different metric subsets — drag and lift are registered for domain="surface" only (evaluation/metrics/builtin/forces.py). Do not list drag / lift under volume benchmarks: get_metric has no volume entry, and the benchmark would record NaNs without a loud error.

Surface example:

metrics:
  - l2_pressure
  - l2_shear_stress
  - drag
  - lift

Volume — HF matrix (config_matrix_volume_hf.yaml): L2 field trio only (subset of the custom matrix).

metrics:
  - l2_pressure
  - l2_turbulent_viscosity
  - l2_velocity

Volume — custom matrix (config_matrix_volume_custom.yaml): same trio plus residual metrics in the shipped config.

metrics:
  - l2_pressure
  - l2_turbulent_viscosity
  - l2_velocity
  - continuity_residual_l2
  - momentum_residual_l2

Residual metrics: continuity_residual_l2 and momentum_residual_l2 are registered built-ins; use them when your config and fields support those residuals. The shipped HF volume matrix omits them; the shipped custom volume matrix includes them. You can append them to any volume metrics: list when appropriate.

Name	Meaning
`l2_pressure`, `l2_shear_stress`	Surface L2
`l2_pressure_area_weighted`	Area-weighted L2 pressure
`drag`, `lift`	Coefficient errors (surface inference only); expands to `drag_error`, etc.
`l2_pressure`, `l2_velocity`, `l2_turbulent_viscosity` (volume)	Volume-field L2 (see `l2.py`, `domain="volume"`). Shipped in both volume matrix YAMLs above.
`continuity_residual_l2`, `momentum_residual_l2`	Volume residual L2; shipped in `config_matrix_volume_custom.yaml` only among the two matrix examples.

Reports and plots

When reports.enabled and reports.visuals are set, PNGs are written under {run.output_dir}/visuals/. Comparison VTK exists if reports.save_comparison_meshes: true.

When reports.visuals is set, each registered visual receives the full results list (one entry per model × dataset cell); built-ins iterate runs and write {run.output_dir}/visuals/ PNGs with model / dataset in the filename where applicable.

aggregate_volume_errors writes per-run files such as {model}_{dataset}_aggregate_resampled_volume.vtk and {model}_{dataset}_aggregate_volume_*_slice.png (sanitized names). Headless servers may need xvfb (see setup.sh for an example apt line — optional).

Name	Role
`field_comparison_surface`	Surface GT vs pred
`line_plot`	GT vs pred along `plot_coord`
`design_scatter` / `design_trend`	Design-of-experiments style plots
`streamlines_comparison`	Volume streamlines

Legacy `bench_example` mapping

`workflows/deprecated/bench_example` / `postprocessing_tools.visualization.utils`	Evaluation visual name
`plot_field_comparisons`	`field_comparison_surface`
`plot_line`	`line_plot`
`plot_design_scatter`	`design_scatter`
`plot_design_trend`	`design_trend`

line_plot: centerline-style strip plots — see workflows/deprecated/bench_example or a custom register_visual.

Troubleshooting

Missing checkpoints / bad paths: inference fails or skips; verify overrides and inference_domain.
DrivAerML root wrong: dataset.root must be the directory that directly contains run_* folders with boundary_*.vtp (surface) or volume_*.vtu (volume), not a parent of an extra nesting level; see DrivAerML dataset.
CUDA OOM: reduce batch resolution in model.kwargs or use a smaller case_id set.
Matrix skips: incompatible model vs dataset domain — check logs for SKIP lines.
Editable install: use pip install -e ".[dev]" so local physicsnemo.cfd changes apply.

Advanced (Python API)

physicsnemo.cfd.evaluation.benchmarks.engine.run_benchmark and Config.from_dict are for scripts and tests. python -m physicsnemo.cfd.evaluation.benchmarks.run and python -m physicsnemo.cfd.evaluation.inference accept flat YAML/JSON without Hydra ${...} interpolation unless you materialize values first. physicsnemo.cfd.evaluation.models holds CFDModel and wrappers; evaluation.inference adds log_inference and the compatibility CLI.

Baseline model stubs (`surface_baseline`, `volume_baseline`)

Smoke-test wrappers: checkpoint: "", stats_path: "". Use under benchmark.models entries.

DrivAerML train/validation split

The drivaer_ml_files/ directory lists which run_* IDs fall in the proposed 90/10 train/validation partition. Download the dataset first (section above), then use those lists to constrain case_id or dataset.case_ids. Details: drivaer_ml_files/README.md.

Package layout (repository)

Path	Role
`physicsnemo/cfd/evaluation/config.py`	`Config`, `load_config`
`physicsnemo/cfd/evaluation/benchmarks/`	Engine, reports, Hydra helpers
`physicsnemo/cfd/evaluation/reports/`	`register_visual`, built-in plots
`physicsnemo/cfd/postprocessing_tools/`	Shared metrics / visualization