Model evaluation and benchmarking

May 8, 2026 · View on GitHub

This workflow is an opinionated, config-driven benchmark entry point for PhysicsNeMo-CFD: Hydra and OmegaConf wire model inference, built-in metrics, tabular artifacts (JSON/CSV/HTML), optional PNG visuals, and optional VTK. It follows the same pattern as workflows/domino_design_sensitivities/. Defaults assume DrivAerML-style run layouts and evaluation components shipped under physicsnemo.cfd.evaluation. Nothing here is unique to that stack — you can extend wrappers, adapters, metrics, reports, and Hydra compositions as needed (see Custom models, datasets, and metrics).

How to use this README (goal-oriented)

  1. Run the benchmark as configured — install, set benchmark.datasets[].root and model paths (or overrides), run python main.py (default config_matrix_surface_hf) or --config-name=config_matrix_surface_custom / config_matrix_volume_hf, confirm outputs under run.output_dir.
  2. Stay in YAML / Hydra first — change YAML under conf/ (see for example config_matrix_volume_hf.yaml) or pass Hydra overrides for models, case_id, metrics, outputs, metrics cache, and exit policy; avoid Python changes until defaults are insufficient.
  3. Extend when needed — add CFDModel wrappers, DatasetAdapter implementations, register_metric / register_visual, or custom configs for integrations the stock workflow does not cover.

Contributing: see the repository CONTRIBUTING.md for pull requests, tests, and sign-off requirements.


Installation

Install PhysicsNeMo-CFD from the repository root (setuptools reads pyproject.toml; there is no separate setup.py):

cd physicsnemo-cfd   # repository root
pip install -e ".[dev]"   # editable install + pytest for contributors
# optional GPU extras:
# pip install -e ".[gpu]" --extra-index-url=https://pypi.nvidia.com

Then use the commands below from this directory (workflows/benchmarking/).


Quick start

  1. Prerequisites: Python 3.10+, NVIDIA GPU for full inference runs, checkpoints and dataset on disk or mounted volume. (For a concise goal-oriented path through docs and config, see the introduction above.)
  2. Assets: Download benchmark checkpoints and the DrivAerML evaluation tree (see DrivAerML dataset: download and directory layout). Built-in matrix models resolve their weights from Hugging Face by default (the _hf configs); point model.checkpoint, stats_path, and dataset.root in conf/*.yaml at local paths to use on-disk checkpoints, or override them on the CLI (see Path overrides (Hydra)).
  3. Configure: Edit YAML under conf/ (see Configuration files: _custom = local checkpoint / stats_path, _hf = Hugging Face builtin_packages when those keys are omitted).
  4. Run:
# Matrix surface (default Hydra config: Hugging Face checkpoints)
python main.py

# Local checkpoint paths (edit paths in YAML first)
python main.py --config-name=config_matrix_surface_custom

# Matrix volume — HF vs local
python main.py --config-name=config_matrix_volume_hf
python main.py --config-name=config_matrix_volume_custom

# Overrides (examples)
python main.py case_id=run_1 run.device=cuda:0
python main.py --config-name=config_matrix_surface_hf run.output_dir=my_surface_run
python main.py run.fail_on_all_skipped=true
  1. Outputs: Under run.output_dirbenchmark_results.json / .csv / .html, optional benchmark_artifacts.json, metrics_cache/ when enabled, Hydra metadata under hydra/ if configured.
  2. Multi-GPU (optional): install physicsnemo so DistributedManager is available, then:
torchrun --standalone --nproc_per_node=4 main.py
# or: torchrun --standalone --nproc_per_node=8 main.py --config-name=config_matrix_surface_custom

Cases are split across GPUs (cases[rank::world_size]). Rank 0 writes reports and optional artifacts. Set run.distributed: false only for debugging (each rank would run the full case list).


Configuration files in conf/

Hydra loads --config-name=<stem> from this directory (no .yaml suffix). This workflow ships four configs: matrix only_custom for local or NFS checkpoints you set in YAML, _hf for Hugging Face defaults via builtin_packages when checkpoint / stats_path are omitted.

Config fileRole
config_matrix_surface_custom.yamlMatrix surface: benchmark.models × benchmark.datasets. Fill checkpoint, stats_path, dataset.root, and DoMINO kwargs.domino_config / stats_pathscaling_factors.pkl as in the template comments. Use --config-name=config_matrix_surface_custom.
config_matrix_volume_custom.yamlMatrix volume with the same explicit path story for VTU inference.
config_matrix_surface_hf.yamlSame matrix layout as the surface custom file, but models typically omit checkpoint / stats_path so weights resolve from Hugging Face. This is the default for python main.py (main.py config_name). Requires huggingface-cli login or HF_TOKEN for private repos.
config_matrix_volume_hf.yamlHF-hosted volume matrix (GeoTransolver / Transolver / DoMINO); XMGN / FiGNet volume may have no Hub checkpoint—omit those blocks.

_custom vs _hf: use _custom when you want full control over on-disk checkpoints and stats; use _hf when you rely on Hub resolution and optional cache under PHYSICSNEMO_CFD_MODEL_CACHE.

VTK field names on written meshes: Under output, the maps mesh_field_names / ground_truth_mesh_field_names (surface) and volume_mesh_field_names / ground_truth_volume_mesh_field_names (volume) set the VTK array names used when the workflow writes optional .vtp / .vtu files (for example saved inference meshes from run.save_inference_mesh or comparison meshes when reports enable them). Choose strings that match your model’s written fields and the ground-truth adapter’s field names.


Path overrides (Hydra)

Override checkpoint, stats, and dataset paths from the CLI without editing YAML files:

python main.py \
  benchmark.models.0.checkpoint=/path/to/checkpoint.pt \
  benchmark.models.0.stats_path=/path/to/global_stats.json \
  benchmark.datasets.0.root=/path/to/drivaer_data

You can also export environment variables and reference them in a local, gitignored config using OmegaConf oc.env patterns if you add them to your YAML.


DrivAerML dataset: download and directory layout

The drivaerml adapter expects a local root directory (benchmark.datasets[].root) whose children are run_<id> folders. Each run holds the VTK ground-truth meshes the metrics read from.

Canonical source: Hugging Face neashton/drivaerml — dataset card, file layout, and download examples (also mirrored in spirit on CAE-ML Datasets). License: CC BY-SA 4.0; cite the DrivAerML paper per the card.

Scale: the full repository is on the order of ~31 TB. Do not blindly run a whole-repo sync on a laptop; use Git LFS only when you intend to host the full tree, or use selective HTTP / CLI downloads for testing.

What each run_<i>/ contains (per the dataset card)

File / folderRole
drivaer_<i>.stlGeometry
geo_ref_<i>.csv, geo_parameters_<i>.csvReference / parameter metadata
boundary_<i>.vtpSurface mesh + fields — required for surface benchmarking
volume_<i>.vtuVolume mesh — required for volume benchmarking (see note below)
force_mom_<i>.csv, force_mom_constref_<i>.csvForces / moments
slices/, Images/Slices and images (optional for this workflow)

Repo-level extras include openfoam_meshes/, force_mom_all.csv, etc.; the adapter does not need those for standard metrics.

Volume files on Hugging Face: for many runs, volume_<i>.vtu is split on the Hub into two parts. Download both parts for a given i, then concatenate them into a single volume_<i>.vtu in run_<i>/ before running volume benchmarks (see the dataset card for naming). Surface-only tests only need boundary_<i>.vtp.

On-disk layout (what the adapter checks)

ModePer-run directoryDefault file
Surface (default)root/run_<n>/boundary_<n>.vtp (e.g. run_1/boundary_1.vtp)
Volume (dataset.kwargs.inference_domain: volume)root/run_<n>/volume_<n>.vtu

Only directories that contain the required mesh for the selected mode are listed as cases. Overrides for non-standard filenames are documented on DrivAerMLAdapter in physicsnemo.cfd.evaluation.datasets.adapters.drivaerml.

1) Full mirror — Git + Git LFS (dataset maintainers’ example)

Install Git LFS, then clone the dataset repo. The clone root (the directory that contains the run_* folders) is benchmark.datasets[].root.

git lfs install
git clone https://huggingface.co/datasets/neashton/drivaerml
# SSH alternative (if configured on Hugging Face): git clone git@hf.co:datasets/neashton/drivaerml

This pulls the entire ~31 TB collection when LFS objects are fetched — use only with appropriate storage and bandwidth.

2) Selective download for testing — HTTP (wget / curl)

The dataset card documents looping over run_$i and pulling individual files. For surface workflow smoke tests, fetch at least boundary_<i>.vtp per run. Example (adjust the ID list or seq range):

HF_OWNER="neashton"
HF_PREFIX="drivaerml"
LOCAL_DIR="./drivaer_data"
mkdir -p "$LOCAL_DIR"

for i in 1 11 202; do
  RUN_DIR="run_${i}"
  DEST="$LOCAL_DIR/$RUN_DIR"
  mkdir -p "$DEST"
  wget -c "https://huggingface.co/datasets/${HF_OWNER}/${HF_PREFIX}/resolve/main/${RUN_DIR}/boundary_${i}.vtp" \
    -O "$DEST/boundary_${i}.vtp"
done

Then point the benchmark at the parent directory:

python main.py benchmark.datasets.0.root="$(pwd)/drivaer_data"

Adapt the URL pattern to add volume_* parts, STL, or force CSVs as needed; the card’s Example 2 shows the same idea for STL + force files.

3) Hugging Face Hub CLI — partial tree

Install the CLI and prefer include patterns so you do not sync 31 TB by accident:

pip install "huggingface_hub[cli]"
huggingface-cli download neashton/drivaerml --repo-type dataset \
  --local-dir /path/to/drivaerml_subset \
  --include "run_1/*" --include "run_11/*"

Expand --include for every run_<i> you need. For a full local mirror without Git, understand the size implications before omitting filters.

Train / validation split for benchmarking

After the tree is on disk, use the fixed 90/10 split files under drivaer_ml_files/ if you want to align case lists with that split — see drivaer_ml_files/README.md.


Notebooks (exploration, not batch production)

The notebooks/ directory (surface_benchmarking.ipynb, volume_benchmarking.ipynb, adding_a_new_model.ipynb, etc.) supports interactive exploration and onboarding; main.py + conf/ remains the recommendation for logged, repeatable benchmark runs aligned with CI and matrix configs.


Exit codes (automation)

  • main.py and python -m physicsnemo.cfd.evaluation.benchmarks.run both call run_benchmark_cli, which runs run_benchmark and maps BenchmarkPolicyError to process exit 1 (message on stderr). Exit 0 when the benchmark completes without that policy exit; other exceptions propagate to Hydra / Python as usual.
  • Set run.fail_on_all_skipped: true so BenchmarkPolicyError (exit 1) applies if every model×dataset run was skipped (e.g. domain mismatch in matrix mode).
  • Set run.fail_on_any_metric_nan: true so BenchmarkPolicyError (exit 1) applies if any non-skipped aggregate metric in metrics is NaN.
  • Override on the CLI: run.fail_on_all_skipped=true.

Config files (short reference)

FileRole
conf/config_matrix_surface_custom.yamlMatrix surface, local checkpoint / stats_path; run.output_dir: benchmark_results_matrix_surface.
conf/config_matrix_volume_custom.yamlMatrix volume, same for VTU runs; run.output_dir: benchmark_results_matrix_volume.
conf/config_matrix_surface_hf.yamlMatrix surface, HF assets; run.output_dir: benchmark_results_matrix_surface_hf. Default python main.py.
conf/config_matrix_volume_hf.yamlMatrix volume, HF assets; run.output_dir as in file.

Matrix mode: every benchmark.models entry runs against every benchmark.datasets entry; incompatible surface/volume pairs are skipped. Edit checkpoint paths in _custom; comment out any - name: block you do not need.


Config reference (YAML keys)

SectionContents
rundevice, output_dir, seed, batch_size, save_inference_mesh, distributed, fail_on_all_skipped, fail_on_any_metric_nan, optional metrics_cache
model / benchmark.modelsname, checkpoint, stats_path, optional package (hf://org/repo@rev, s3://…, or local dir), checkpoint_relpath / stats_relpath (paths inside the package), inference_domain, kwargs
dataset / benchmark.datasetsname, root, optional case_ids (null = all cases from adapter.list_cases()), kwargs
outputVTK field name maps; streamlines_vector_canonical (volume); optional surface_interpolate_point_to_cell_for_metrics (kNN-IDW point → cell for XmGN/FiGNet-style surfaces so drag/lift/L2 use metric_dtype: cell), plus surface_metrics_idw_k, surface_metrics_idw_device
metricsMetric names or { name: ..., ...kwargs }
reportsOptional PNG pipeline
benchmarkmode, models / datasets, reproducibility (log_env, save_artifacts)

case_id: optional top-level Hydra key: null uses each dataset's case_ids (or all adapter cases); a string runs one case on every dataset; a list runs those cases in matrix mode.

benchmark.reproducibility.log_env: when true, writes full os.environ to env.json under run.output_dir — avoid in shared CI or when HF/AWS/other secrets may be present (default false in code and in all example YAMLs under conf/). Enable only deliberately for short-lived local debugging.

Metrics cache: run.metrics_cache.enabled stores per-case scalars; delete the cache directory for a full recompute. Plots and meshes are not cached.

Remote model assets: Install optional pip install 'nvidia-physicsnemo-cfd[evaluation-hf]' for hf:// and s3:// package roots. Cache directory defaults to ~/.cache/physicsnemo-cfd/models or override with PHYSICSNEMO_CFD_MODEL_CACHE. Built-in matrix models (including domino, which also resolves domino_config from the package) use per-model roots in physicsnemo.cfd.evaluation.assets.builtin_packages when checkpoint / stats_path are omitted; override with explicit paths or model.package as needed. register_default_asset remains available for custom names. See CONTRIBUTING.md for custom-wrapper tiers.


Custom models, datasets, and metrics

Custom models

  1. Subclass CFDModel (physicsnemo.cfd.evaluation.models.model_registry) under physicsnemo/cfd/evaluation/models/wrappers/.
  2. Implement INFERENCE_DOMAIN, OUTPUT_LOCATION, load, prepare_inputs, predict, decode_outputs(raw_output, case, model_input=None) (the engine passes the same model_input returned by prepare_inputs).
  3. register_model("my_model", MyWrapper) in wrappers/__init__.py.

Custom datasets

  1. Subclass DatasetAdapter under physicsnemo/cfd/evaluation/datasets/adapters/.
  2. Implement list_cases, load_caseCanonicalCase.
  3. register_adapter("my_dataset", MyAdapter) in adapters/__init__.py.

Custom metrics

Register functions with physicsnemo.cfd.evaluation.metrics.register_metric, then list the name under metrics: in YAML (see physicsnemo.cfd.evaluation.metrics).

Canonical types

Schema: physicsnemo.cfd.evaluation.datasets.schemaCanonicalCase, prediction keys for surface/volume.


Metrics

Registered names (or dicts with name + kwargs). Surface and volume matrix configs use different metric subsets — drag and lift are registered for domain="surface" only (evaluation/metrics/builtin/forces.py). Do not list drag / lift under volume benchmarks: get_metric has no volume entry, and the benchmark would record NaNs without a loud error.

Surface example:

metrics:
  - l2_pressure
  - l2_shear_stress
  - drag
  - lift

Volume — HF matrix (config_matrix_volume_hf.yaml): L2 field trio only (subset of the custom matrix).

metrics:
  - l2_pressure
  - l2_turbulent_viscosity
  - l2_velocity

Volume — custom matrix (config_matrix_volume_custom.yaml): same trio plus residual metrics in the shipped config.

metrics:
  - l2_pressure
  - l2_turbulent_viscosity
  - l2_velocity
  - continuity_residual_l2
  - momentum_residual_l2

Residual metrics: continuity_residual_l2 and momentum_residual_l2 are registered built-ins; use them when your config and fields support those residuals. The shipped HF volume matrix omits them; the shipped custom volume matrix includes them. You can append them to any volume metrics: list when appropriate.

NameMeaning
l2_pressure, l2_shear_stressSurface L2
l2_pressure_area_weightedArea-weighted L2 pressure
drag, liftCoefficient errors (surface inference only); expands to drag_error, etc.
l2_pressure, l2_velocity, l2_turbulent_viscosity (volume)Volume-field L2 (see l2.py, domain="volume"). Shipped in both volume matrix YAMLs above.
continuity_residual_l2, momentum_residual_l2Volume residual L2; shipped in config_matrix_volume_custom.yaml only among the two matrix examples.

Reports and plots

When reports.enabled and reports.visuals are set, PNGs are written under {run.output_dir}/visuals/. Comparison VTK exists if reports.save_comparison_meshes: true.

When reports.visuals is set, each registered visual receives the full results list (one entry per model × dataset cell); built-ins iterate runs and write {run.output_dir}/visuals/ PNGs with model / dataset in the filename where applicable.

aggregate_volume_errors writes per-run files such as {model}_{dataset}_aggregate_resampled_volume.vtk and {model}_{dataset}_aggregate_volume_*_slice.png (sanitized names). Headless servers may need xvfb (see setup.sh for an example apt line — optional).

NameRole
field_comparison_surfaceSurface GT vs pred
line_plotGT vs pred along plot_coord
design_scatter / design_trendDesign-of-experiments style plots
streamlines_comparisonVolume streamlines

Register more: physicsnemo.cfd.evaluation.reports.register_visual.

Legacy bench_example mapping

workflows/deprecated/bench_example / postprocessing_tools.visualization.utilsEvaluation visual name
plot_field_comparisonsfield_comparison_surface
plot_lineline_plot
plot_design_scatterdesign_scatter
plot_design_trenddesign_trend

line_plot: centerline-style strip plots — see workflows/deprecated/bench_example or a custom register_visual.


Troubleshooting

  • Missing checkpoints / bad paths: inference fails or skips; verify overrides and inference_domain.
  • DrivAerML root wrong: dataset.root must be the directory that directly contains run_* folders with boundary_*.vtp (surface) or volume_*.vtu (volume), not a parent of an extra nesting level; see DrivAerML dataset.
  • CUDA OOM: reduce batch resolution in model.kwargs or use a smaller case_id set.
  • Matrix skips: incompatible model vs dataset domain — check logs for SKIP lines.
  • Editable install: use pip install -e ".[dev]" so local physicsnemo.cfd changes apply.

Advanced (Python API)

physicsnemo.cfd.evaluation.benchmarks.engine.run_benchmark and Config.from_dict are for scripts and tests. python -m physicsnemo.cfd.evaluation.benchmarks.run and python -m physicsnemo.cfd.evaluation.inference accept flat YAML/JSON without Hydra ${...} interpolation unless you materialize values first. physicsnemo.cfd.evaluation.models holds CFDModel and wrappers; evaluation.inference adds log_inference and the compatibility CLI.


Baseline model stubs (surface_baseline, volume_baseline)

Smoke-test wrappers: checkpoint: "", stats_path: "". Use under benchmark.models entries.


DrivAerML train/validation split

The drivaer_ml_files/ directory lists which run_* IDs fall in the proposed 90/10 train/validation partition. Download the dataset first (section above), then use those lists to constrain case_id or dataset.case_ids. Details: drivaer_ml_files/README.md.


Package layout (repository)

PathRole
physicsnemo/cfd/evaluation/config.pyConfig, load_config
physicsnemo/cfd/evaluation/benchmarks/Engine, reports, Hydra helpers
physicsnemo/cfd/evaluation/reports/register_visual, built-in plots
physicsnemo/cfd/postprocessing_tools/Shared metrics / visualization