🎯 Gaussian Splatting from VGGT on Large Number of Frames

November 5, 2025

This repository implements a VGGT-Long → Gaussian Splatting pipeline, combining and modifying code from the VGGT-Long and VGGT repositories.

It is based on the papers listed under References.

For Gaussian Splatting, this project uses code from gsplat.


📊 Comparison: COLMAP vs VGGT-Long based Gaussian Splatting

Datasets: Mip-NeRF 360 and Tanks and Temples

Full Frame Example

    COLMAP                  VGGT-Long
    [image: COLMAP full]    [image: VGGT full]

Cropped Region

    COLMAP                  VGGT-Long
    [image: COLMAP crop]    [image: VGGT crop]

More examples: Open examples page


📉 Training Loss

[Figure: Training loss]


Metrics and computation time

    Pipeline   Scene     Time (s)  GS Start  GS Result  PSNR    SSIM    LPIPS
    COLMAP     bonsai    1184      261,255   2,197,821  30.891  0.9387  0.142
    VGGT-LONG  bonsai    2069      108,500   2,017,883  29.419  0.9009  0.179
    COLMAP     kitchen   1741      290,444   1,860,121  31.463  0.9323  0.097
    VGGT-LONG  kitchen   1437      46,788    1,890,648  24.967  0.7852  0.212
    COLMAP     ignatius  731       209,541   4,366,405  20.477  0.7050  0.241
    VGGT-LONG  ignatius  898       51,193    3,193,499  18.263  0.4959  0.448
    COLMAP     garden    500       207,923   6,388,310  27.537  0.8606  0.098
    VGGT-LONG  garden    695       43,006    5,436,518  24.103  0.7092  0.208

The table compares the COLMAP and VGGT-LONG reconstruction pipelines on three scenes from the Mip-NeRF 360 benchmark (bonsai, kitchen, garden) and the Ignatius scene from Tanks and Temples (https://www.tanksandtemples.org/). Each row reports the total computation time to produce the sparse model, the initial and resulting number of Gaussian splats (GS), and three common image quality metrics: PSNR, SSIM, and LPIPS.
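
As a reminder of what the PSNR column measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), for pixel values in [0, 1]. It illustrates the metric only: the function name `psnr` and the flat-list image representation are assumptions made for this example, not the evaluation code that produced the table.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixel values
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: an error of 0.1 on every pixel gives an MSE of 0.01 and therefore about 20 dB.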

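Overall, COLMAP produces higher reconstruction quality on all four scenes, with higher PSNR and SSIM and lower LPIPS in every row. The gap is largest for kitchen (31.463 vs. 24.967 dB PSNR) and smallest for bonsai (30.891 vs. 29.419 dB). VGGT-LONG starts from a much smaller initial Gaussian set in every scene (for example 46,788 vs. 290,444 for kitchen) and ends with fewer splats in three of the four scenes; only kitchen ends slightly denser (1,890,648 vs. 1,860,121). This suggests that the initialization influences how densification allocates splats during optimization.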

Execution times vary across scenes. VGGT-LONG is slower on bonsai, ignatius, and garden despite its lower image quality, and faster only on kitchen, so neither pipeline is consistently ahead on both computation time and reconstruction accuracy.
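
The per-scene gaps discussed above can be recomputed directly from the table. The sketch below hard-codes the Time and PSNR columns (values copied from the table; the dictionary layout and the names `RESULTS` and `compare` are illustrative choices for this example) and prints the PSNR difference and the time ratio for each scene.

```python
# (time in seconds, PSNR in dB) per pipeline, copied from the table above
RESULTS = {
    "bonsai":   {"COLMAP": (1184, 30.891), "VGGT-LONG": (2069, 29.419)},
    "kitchen":  {"COLMAP": (1741, 31.463), "VGGT-LONG": (1437, 24.967)},
    "ignatius": {"COLMAP": (731, 20.477),  "VGGT-LONG": (898, 18.263)},
    "garden":   {"COLMAP": (500, 27.537),  "VGGT-LONG": (695, 24.103)},
}


def compare(results):
    """Return {scene: (PSNR gap in dB, VGGT-LONG time / COLMAP time)}."""
    summary = {}
    for scene, rows in results.items():
        colmap_time, colmap_psnr = rows["COLMAP"]
        vggt_time, vggt_psnr = rows["VGGT-LONG"]
        summary[scene] = (colmap_psnr - vggt_psnr, vggt_time / colmap_time)
    return summary


if __name__ == "__main__":
    for scene, (psnr_gap, time_ratio) in compare(RESULTS).items():
        print(f"{scene:9s} PSNR gap {psnr_gap:5.2f} dB, time ratio {time_ratio:4.2f}x")
```

A positive gap means COLMAP scored higher; a ratio above 1 means VGGT-LONG took longer.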

⚙️ Setup, Installation & Running

🖥️ 1. Hardware and System Environment

This project was developed, tested, and run in the following hardware/system environment:

Hardware Environment (similar to VGGT-Long paper):

    CPU(s)        13th Gen Intel® Core™ i7-13700K x 24
    GPU           NVIDIA GeForce RTX 4090, 24 GB VRAM
    RAM           125 GB total
    Disk(s)       3.6 TB HDD, 1.8 TB NVMe SSD

System Environment:

    OS            Ubuntu 22.04.5 LTS
    CUDA Toolkit  ❌ not installed
    NVIDIA Driver 550.144.03
    Conda         25.3.1

📦 2. Environment Setup

Step 1: Dependency Installation

We use three virtual environments:

  • vggsfm_tmp: Requirements from VGGT-Long repository
  • colmap_env: Requirements from VGGT repository
  • py11: gsplat dependencies

Create these conda environments and install dependencies using the corresponding requirements.txt files.

TODO: Check how to merge virtual environments into one

Step 2: COLMAP Installation

Install COLMAP for bundle adjustment and image undistortion using conda (recommended):

# Activate the colmap_env environment
conda activate colmap_env

# Install COLMAP from conda-forge
conda install -c conda-forge colmap

Alternative system-wide installation:

Step 3: Weights Download

Download all the pre-trained weights needed:

bash ./scripts/download_weights.sh

🔧 Pipeline Structure

  1. Predict camera poses, intrinsics, and depth maps using the VGGT-Long model

  2. Export to COLMAP using functionality provided by VGGT:

# With bundle adjustment
python demo_colmap.py --scene_dir=/YOUR/SCENE_DIR/ --use_ba --max_query_pts=1024 --query_frame_num=6

Images should be in /YOUR/SCENE_DIR/images/. This folder should contain only the images. The reconstruction result (camera parameters and 3D points) will be automatically saved under /YOUR/SCENE_DIR/sparse/ in the COLMAP format, such as:

SCENE_DIR/
├── images/
└── sparse/
    ├── cameras.bin
    ├── images.bin
    └── points3D.bin
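
Before moving on to undistortion, it can help to verify that a scene directory matches the layout above. The following is a minimal sketch; the function name `check_scene_dir` and the set of accepted image extensions are assumptions for illustration and are not part of VGGT.

```python
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed extensions; adjust as needed
SPARSE_FILES = ("cameras.bin", "images.bin", "points3D.bin")


def check_scene_dir(scene_dir: str) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(scene_dir)
    problems = []

    images = root / "images"
    if not images.is_dir():
        problems.append("missing images/ directory")
    else:
        entries = list(images.iterdir())
        if not entries:
            problems.append("images/ is empty")
        # images/ should contain only image files
        for entry in entries:
            if not entry.is_file() or entry.suffix.lower() not in IMAGE_SUFFIXES:
                problems.append(f"unexpected entry in images/: {entry.name}")

    # The COLMAP export is expected under sparse/
    for name in SPARSE_FILES:
        if not (root / "sparse" / name).is_file():
            problems.append(f"missing sparse/{name}")
    return problems
```

Calling `check_scene_dir("/YOUR/SCENE_DIR/")` after the export step returns an empty list when the images folder holds only images and the three COLMAP files are present.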

The main modification is in vggt/dependency/track_predict.py, where query frames are no longer selected by DINO ranking but sampled at regular intervals:

# Original: Find query frames
# query_frame_indexes = generate_rank_by_dino(images, query_frame_num=query_frame_num, device=device)

# New: Sample every 3rd frame
query_frame_indexes = list(range(0, len(images), 3))
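
To make the effect of this change concrete, here is a small sketch that wraps the regular-interval sampling in a function. The name `sample_query_frames` and the `stride` parameter are hypothetical; the modified code above hard-codes a stride of 3.

```python
from typing import List


def sample_query_frames(num_frames: int, stride: int = 3) -> List[int]:
    """Return the indices 0, stride, 2*stride, ... that are below num_frames."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, num_frames, stride))


# For a 292-image sequence, every 3rd frame gives 98 query frames.
indexes = sample_query_frames(292)
print(len(indexes), indexes[:4], indexes[-1])  # 98 [0, 3, 6, 9] 291
```

As far as the snippet above shows, the number of query frames now grows with the sequence length instead of being fixed by `--query_frame_num`.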

Additionally, batched processing has been added to save memory for long sequences (200+ images):

  • Feature extraction batching in vggt/dependency/track_predict.py: Processes images in batches of 50 to avoid CUDA out-of-memory errors
  • Correlation block chunking in vggt/dependency/track_modules/base_track_predictor.py: Processes correlation computation in chunks of 50 frames
  • Memory management: Automatic GPU cache clearing between batches to optimize memory usage

These modifications enable processing of long image sequences (e.g., 292 images) that would otherwise exceed GPU memory limits.
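
The batching pattern described above can be sketched generically as follows. This is not the repository's implementation: `process_in_batches`, its parameters, and the `after_batch` hook are hypothetical names used for illustration. In a PyTorch setting, the per-batch function would run the feature extractor on a slice of frames, and `after_batch` could be `torch.cuda.empty_cache` to release cached GPU memory between batches.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_batches(
    items: Sequence[T],
    fn: Callable[[Sequence[T]], List[R]],
    batch_size: int = 50,
    after_batch: Optional[Callable[[], None]] = None,
) -> List[R]:
    """Apply fn to consecutive slices of items and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(fn(batch))  # only one batch is processed at a time
        if after_batch is not None:
            after_batch()  # e.g. torch.cuda.empty_cache in a PyTorch pipeline
    return results


# Toy usage: 292 "frames" are processed as five batches of 50 and one of 42.
batch_sizes = []


def fake_extract(batch):
    """Stand-in for a feature extractor; records the batch size."""
    batch_sizes.append(len(batch))
    return [frame * 2 for frame in batch]


features = process_in_batches(list(range(292)), fake_extract)
print(batch_sizes)  # [50, 50, 50, 50, 50, 42]
```

Peak memory then scales with the batch size rather than with the full sequence length, at the cost of a few extra passes through the extractor.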

  3. Undistort images using the pinhole camera model:

colmap image_undistorter \
    --image_path "$IMAGE_DIR" \
    --input_path "$IMAGE_DIR/sparse" \
    --output_path "$UNDISTORTED_COLMAP_OUTPUT" \
    --output_type COLMAP \
    --max_image_size 2000

  4. Gaussian Splatting

The exported COLMAP files can be directly used with gsplat for Gaussian Splatting training. Install gsplat following their official instructions (we recommend gsplat==1.3.0).

An example command to train the model is:

cd "$WORKDIR/../gsplat/examples"
mkdir -p "$WORKDIR/../gsplat/examples/results/$GSPLAT_OUTPUT_DIR"


CUDA_VISIBLE_DEVICES=0 python simple_trainer.py default \
    --data_dir "${REMOTE_IMAGE_DIR}_undistorted" \
    --data_factor 1 \
    --result_dir "$WORKDIR/../gsplat/examples/results/$GSPLAT_OUTPUT_DIR" \
    --save_ply \
    --ply_steps 30000 \
    --disable_viewer \
    --render_traj_path "ellipse"

🚀 Running Pipelines

The simplest way to run the pipeline is on a remote machine over an SSH connection, using the scripts below.

Run the pipeline to get a sparse model from VGGT-Long:

./src_vggt_colmap/run_pipeline_vggt_long_colmap.sh

Run the pipeline to get a sparse model from COLMAP:

./src_vggt_colmap/run_pipeline_vggt_long_colmap.sh

Download results:

./src_vggt_colmap/download_results_vggt_colmap.sh
./src_vggt_colmap/download_results_colmap.sh

Parse the log files to collect metrics:

./src_metrics/collect_metrics.sh

📄 License

This codebase follows VGGT's license. Please refer to ./LICENSE.txt for applicable terms.

For commercial use, please follow the link to VGGT and utilize the commercial version of the pre-trained weights: VGGT-1B-Commercial.

📖 References

If you use this code, please cite the original papers:

@article{wang2024vggt-long,
  title={VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences},
  author={Deng, Kai and Ti, Zexin and Xu, Jiawei and Yang, Jian and Xie, Jin},
  journal={arXiv preprint arXiv:2507.16443},
  year={2025}
}

@article{wang2023vggsfm,
  title={VGGSfM: Visual Geometry Grounded Deep Structure from Motion},
  author={Wang, Jianyuan and Karaev, Nikita and Rupprecht, Christian and Novotny, David},
  journal={arXiv preprint arXiv:2312.04563},
  year={2023}
}
