ExploreGS-GS-DataGen

May 17, 2026

Bulk 3D Gaussian Splatting dataset generation. Given a multi-view image collection (DL3DV-10K or MipNeRF-360), this repo trains a Gaussian Splatting model per scene, renders train/test images, visibility masks and depth maps, and emits per-scene metadata (partition.json, cam_extrinsics.npy, cam_intrinsics.npy) for downstream training.

It is a fork of the Inria 3D Gaussian Splatting codebase (train.py, render.py, scene/, gaussian_renderer/, submodules/), with the dataset-generation orchestration layered on top. See LICENSE.md — the original code is for non-commercial research and evaluation use.

Setup

The environment is almost identical to that of the official Inria 3D Gaussian Splatting repo, so if you already have that set up, you can reuse it directly. Then install GPUtil on top of it:

pip install GPUtil

The CUDA rasterizer and KNN submodules under submodules/ are installed by the conda environment. SIBR_viewers is the optional interactive viewer (build separately, see the upstream 3DGS instructions). For hardware requirements, CUDA/compiler setup, and troubleshooting, refer to the upstream 3DGS README.

Expected input layout

The code is written primarily for the DL3DV-10K dataset (we used the 960P version). Each scene directory needs decimated images (images_4/) plus camera data:

<dataset-root>/
  <subset>/                # ex: "1K" for DL3DV
    <scene>/
      images_4/            # required: 4x-downscaled images
      transforms.json      # required for DL3DV
      sparse/              # required for DL3DV
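
As a rough illustration of the layout above, the sketch below checks whether a scene directory contains the three required entries. It is not the repository's readiness check; the names `REQUIRED_ENTRIES`, `missing_entries` and `find_ready_scenes` are hypothetical, and the required list covers only the DL3DV case shown here.

```python
from pathlib import Path

# Required entries per scene, copied from the layout above (DL3DV case).
REQUIRED_ENTRIES = ("images_4", "transforms.json", "sparse")


def missing_entries(scene_dir, required=REQUIRED_ENTRIES):
    """Return the names of required entries that are absent from scene_dir."""
    scene_dir = Path(scene_dir)
    return [name for name in required if not (scene_dir / name).exists()]


def find_ready_scenes(dataset_root, subset):
    """List scene directories under <dataset_root>/<subset> with every required entry."""
    subset_dir = Path(dataset_root) / subset
    if not subset_dir.is_dir():
        return []
    return sorted(
        scene for scene in subset_dir.iterdir()
        if scene.is_dir() and not missing_entries(scene)
    )
```

Calling `find_ready_scenes("/path/to/DL3DV-10K-960P", "1K")` would return the scenes that could be processed, and `missing_entries(scene)` reports what an incomplete scene still lacks.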

Notes

Some values are hardcoded, including the images_4 image folder name and other directory paths. Change them to match your setup.

Usage

dataset_generation.py scans the dataset root, builds one job per (scene, split_ratio), and dispatches them across idle GPUs (via GPUtil + a thread pool). Each job runs the requested --stages in order.

python dataset_generation.py --dataset dl3dv \
    --dataset-root /path/to/DL3DV-10K-960P \
    --output-dir   /path/to/DL3DV_GS \
    --subset 1K --split-ratio 0.1 0.3 0.5 0.7 0.9 \
    --stages train render visibility

Alternatively, refer to run.sh.
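
The job construction and dispatch described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in dataset_generation.py: the function names are hypothetical, the load and memory thresholds passed to `GPUtil.getAvailable` are guesses, and `run_job` stands in for whatever executes the stages of one job.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_jobs(scenes, split_ratios):
    """One job per (scene, split_ratio) pair, in scene-major order."""
    return list(product(scenes, split_ratios))


def pick_idle_gpu(excluded=()):
    """Return the id of one idle GPU, or None if none is free (requires GPUtil)."""
    import GPUtil  # imported lazily so the rest of the sketch runs without it

    # Thresholds are assumptions: a GPU counts as idle below 10% load and memory use.
    free = GPUtil.getAvailable(order="memory", limit=1, maxLoad=0.1,
                               maxMemory=0.1, excludeID=list(excluded))
    return free[0] if free else None


def dispatch(jobs, run_job, max_workers=2):
    """Run run_job(scene, ratio) for every job on a thread pool; results keep job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: run_job(*job), jobs))
```

With two scenes and five split ratios, `build_jobs` yields ten jobs. A real `run_job` would typically call `pick_idle_gpu`, wait if it returns `None`, and set `CUDA_VISIBLE_DEVICES` for the subprocess it launches.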

Stages

Stage	Action
`train`	optimize Gaussians (`train.py`)
`render`	render train/test RGB + GT images (`render.py`)
`visibility`	render visibility masks (`render.py --override_visibility_mask`)
`depth`	render depth maps only (`render.py --depth_only`)
`postprocess`	write `partition.json` / `cam_extrinsics.npy` without training (`train.py --save_mode`) — backfills metadata for already-trained scenes
`partition_fix`	validate `partition.json` against `cam_extrinsics.npy`; rewrite it, or delete the scene if unrecoverable
`clean`	`rm -rf` the scene's output directory
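
To make the partition_fix idea more concrete, here is a hedged sketch of what validating partition.json against cam_extrinsics.npy could look like. The actual checks in the repository may differ; this version assumes that the first axis of cam_extrinsics.npy indexes cameras and that partition.json holds integer camera indices. `validate_partition` and `validate_scene` are hypothetical names.

```python
import json


def validate_partition(partition, num_cameras):
    """Return a list of problems found in a partition dict; empty means consistent."""
    problems = []
    for key in ("train", "test"):
        if key not in partition:
            problems.append(f"missing key: {key}")
    train = list(partition.get("train", []))
    test = list(partition.get("test", []))
    # Every index must refer to an existing camera.
    out_of_range = [i for i in train + test if not 0 <= i < num_cameras]
    if out_of_range:
        problems.append(f"indices out of range: {sorted(set(out_of_range))}")
    # A camera should not appear in both splits.
    overlap = sorted(set(train) & set(test))
    if overlap:
        problems.append(f"indices in both splits: {overlap}")
    return problems


def validate_scene(partition_path, extrinsics_path):
    """Load both files from disk and validate them (requires NumPy)."""
    import numpy as np  # lazy import; only needed when reading the .npy file

    with open(partition_path) as f:
        partition = json.load(f)
    # Assumption: one row (or matrix) per camera along the first axis.
    num_cameras = np.load(extrinsics_path).shape[0]
    return validate_partition(partition, num_cameras)
```

An empty result means the two files agree under these assumptions; a non-empty one lists what a rewrite would need to repair.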

Key options

Flag	Description
`--dataset {dl3dv,mipnerf360}`	dataset preset (readiness checks, OMP threads, default subset)
`--subset`	sub-directories under the dataset root to process
`--split-ratio`	train/test split ratios; one output model per ratio
`--cut-start` / `--cut-end`	scene-list slice — give each server a disjoint range to split work across machines
`--skip-existing`	skip scenes that already have `point_cloud.ply` + `cam_extrinsics.npy` + `partition.json`
`--max-workers`	concurrent GPU jobs
`--excluded-gpus`	GPU ids to leave alone
`--dry-run`	print commands without running them

Run python dataset_generation.py --help for the full list.
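
The sketch below illustrates how `--cut-start` / `--cut-end` and `--skip-existing` might behave. It rests on assumptions: that the scene list is sorted before slicing, that the end index is exclusive, and that point_cloud.ply sits under point_cloud/iteration_30000/ as in the output layout. The function names are hypothetical.

```python
from pathlib import Path


def slice_scenes(scenes, cut_start=None, cut_end=None):
    """Return the [cut_start, cut_end) slice of the sorted scene list."""
    return sorted(scenes)[cut_start:cut_end]


def is_complete(model_dir, iteration=30000):
    """True if the three files checked by --skip-existing are all present."""
    model_dir = Path(model_dir)
    required = (
        model_dir / "point_cloud" / f"iteration_{iteration}" / "point_cloud.ply",
        model_dir / "cam_extrinsics.npy",
        model_dir / "partition.json",
    )
    return all(path.is_file() for path in required)
```

For example, with 1000 scenes, one server could take `slice_scenes(scenes, 0, 500)` and another `slice_scenes(scenes, 500, 1000)`; sorting first keeps the ranges disjoint regardless of directory listing order.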

Output layout

<output-dir>/<subset>/<scene>_<split>/
  point_cloud/iteration_30000/point_cloud.ply
  cameras.json
  cam_extrinsics.npy / cam_intrinsics.npy
  partition.json                 # {"train": [...], "test": [...]} camera indices
  train/ours_30000/{renders,gt}/
  test/ours_30000/{renders,gt}/
  train|test/ours_30000/visibility_mask_global_2ndpick_th0.5/
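
For downstream code that consumes this layout, a small reader could look like the following. It is a sketch based only on the paths listed above; `split_dirs` and `load_partition` are hypothetical helpers, and the default iteration of 30000 is taken from the folder names shown.

```python
import json
from pathlib import Path


def split_dirs(model_dir, split, iteration=30000):
    """Paths of the render and ground-truth folders for one split ('train' or 'test')."""
    base = Path(model_dir) / split / f"ours_{iteration}"
    return {"renders": base / "renders", "gt": base / "gt"}


def load_partition(model_dir):
    """Read partition.json and return (train_indices, test_indices)."""
    with open(Path(model_dir) / "partition.json") as f:
        partition = json.load(f)
    return partition["train"], partition["test"]
```

If cam_extrinsics.npy stores one entry per camera along its first axis (an assumption, not confirmed here), the train cameras could then be selected with `np.load(model_dir / "cam_extrinsics.npy")[train_indices]`.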

Single-scene scripts

The orchestrator wraps the standard 3DGS entry points, which can also be run directly on one scene:

python train.py  -s <scene> -m <model_dir> -r 4 -i images_4 --split 0.8 --seed 0
python render.py -m <model_dir> [--override_visibility_mask | --depth_only]
python metrics.py -m <model_dir>
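
One way to picture how an orchestrator could wrap these entry points is a small stage-to-command mapping with a dry-run mode. The sketch below uses only the flags that appear in this README; the function names and the default `split` and `seed` values are illustrative assumptions, and the real dataset_generation.py may build its commands differently.

```python
import shlex
import subprocess


def stage_command(stage, scene_dir, model_dir, split=0.8, seed=0):
    """Build the argv list for one stage, using only flags shown in this README."""
    train = ["python", "train.py", "-s", str(scene_dir), "-m", str(model_dir),
             "-r", "4", "-i", "images_4", "--split", str(split), "--seed", str(seed)]
    render = ["python", "render.py", "-m", str(model_dir)]
    commands = {
        "train": train,
        "render": render,
        "visibility": render + ["--override_visibility_mask"],
        "depth": render + ["--depth_only"],
    }
    if stage not in commands:
        raise ValueError(f"unsupported stage in this sketch: {stage}")
    return commands[stage]


def run_stages(stages, scene_dir, model_dir, dry_run=True, **kwargs):
    """Run the stages in order; with dry_run=True only return the command strings."""
    lines = []
    for stage in stages:
        argv = stage_command(stage, scene_dir, model_dir, **kwargs)
        lines.append(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing stage
    return lines
```

Calling `run_stages(["train", "render"], scene, model_dir)` returns the command lines without executing anything, which mirrors the intent of `--dry-run`.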