UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
October 27, 2025
This repository contains the official PyTorch implementation for the paper: "UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling", accepted at ICCV 2025.

Introduction
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) perform well in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when the video is directly unfolded into a 1D sequence by temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends recent advances in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, so that points that are spatially and temporally distant yet similar can be used effectively within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies by using non-anchor frames and expanding receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method.
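To make the scanning idea concrete, here is a toy sketch that contrasts temporally sequential scanning with a cluster-then-time ordering. It is not the STSS implementation from the paper or this repository: the helper names (`temporal_scan`, `prompt_guided_scan`), the dot-product similarity, and the hand-written prompt vectors are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# A point is (frame index, feature vector). Hypothetical toy representation.
Point = Tuple[int, Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def temporal_scan(points: List[Point]) -> List[int]:
    """Baseline: unfold the video frame by frame, keeping input order in a frame."""
    return sorted(range(len(points)), key=lambda i: (points[i][0], i))


def prompt_guided_scan(points: List[Point],
                       prompts: List[Sequence[float]]) -> List[int]:
    """Group points by their best-matching prompt, then order by time.

    Points that are far apart in time but similar in feature space end up
    adjacent in the resulting 1D sequence.
    """
    def best_prompt(feature: Sequence[float]) -> int:
        scores = [dot(feature, p) for p in prompts]
        return scores.index(max(scores))

    return sorted(
        range(len(points)),
        key=lambda i: (best_prompt(points[i][1]), points[i][0], i),
    )


if __name__ == "__main__":
    points = [(0, [1.0, 0.0]), (0, [0.0, 1.0]),
              (1, [0.9, 0.1]), (1, [0.1, 0.9])]
    prompts = [[1.0, 0.0], [0.0, 1.0]]
    print(temporal_scan(points))                # [0, 1, 2, 3]
    print(prompt_guided_scan(points, prompts))  # [0, 2, 1, 3]
```

In the toy example, points 0 and 2 sit in different frames but match the same prompt, so the second ordering places them next to each other, whereas the frame-by-frame scan separates them.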
Getting Started
Prerequisites
- Python 3.8+
- PyTorch 1.12.0+
- CUDA 11.3+
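A quick way to compare an existing environment against the minimum versions listed above is a small check script. The sketch below is not part of the repository; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check only inspects the CUDA version PyTorch was built with, not the installed driver or toolkit.

```python
import re
import sys
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.12.0+cu113' into (1, 12, 0); suffixes after the numbers are ignored."""
    numeric = re.match(r"\d+(?:\.\d+)*", text.strip())
    if numeric is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in numeric.group(0).split("."))


def meets_minimum(found: str, required: str) -> bool:
    """Compare two version strings numerically, padding the shorter with zeros."""
    found_v, required_v = parse_version(found), parse_version(required)
    width = max(len(found_v), len(required_v))
    found_v += (0,) * (width - len(found_v))
    required_v += (0,) * (width - len(required_v))
    return found_v >= required_v


def check_environment() -> Dict[str, Optional[bool]]:
    """Return True/False per requirement, or None when it cannot be checked."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report: Dict[str, Optional[bool]] = {
        "python": meets_minimum(python_version, "3.8"),
        "torch": None,
        "cuda": None,
    }
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return report
    report["torch"] = meets_minimum(torch.__version__, "1.12.0")
    cuda = torch.version.cuda  # None for CPU-only builds of PyTorch
    report["cuda"] = meets_minimum(cuda, "11.3") if cuda else None
    return report


if __name__ == "__main__":
    print(check_environment())
```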
Installation
- Clone the repository:
- Install Python dependencies: We recommend using a virtual environment (e.g., conda or venv).
  pip install -r requirements.txt
- Install Mamba and Causal Conv1d:
  pip install causal-conv1d mamba-ssm
- Compile custom CUDA layers: Our model relies on custom CUDA operators for PointNet++ and k-Nearest Neighbors (kNN).
  - PointNet++ Layers:
    cd modules/
    python setup.py install
    cd ..
  - kNN for PyTorch:
    pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
For a detailed environment setup, please refer to the requirements.yml file.
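After installation, it can help to confirm that the compiled packages are visible to the interpreter before starting a long training run. The following is a minimal sketch, not a script shipped with this repository. It assumes the usual import names for the packages installed above (`mamba_ssm`, `causal_conv1d`, `knn_cuda`); the module name of the compiled PointNet++ layers is not stated here, so it is left out. `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed top-level import names for the packages installed above.
REQUIRED_MODULES = ("torch", "mamba_ssm", "causal_conv1d", "knn_cuda")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be located.

    find_spec only locates a module; it does not import it, so this does not
    prove that the CUDA extensions load correctly.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

A module that is found can still fail at import time (for example on a CUDA version mismatch), so a real import in a Python shell remains the stronger check.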
Datasets
You will need to download and preprocess the datasets before training and evaluation.
MSR-Action3D
- Download the dataset from Google Drive.
- Extract the .zip file to get Depth.rar, and then extract the depth maps.
- Preprocess the depth maps into point clouds by running the script:
python scripts/preprocess_file.py --input_dir /path/to/your/Depth --output_dir /path/to/processed_data --num_cpu 11
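For readers unfamiliar with the depth-to-point-cloud step, the sketch below shows the general idea using standard pinhole back-projection. It does not reproduce scripts/preprocess_file.py: the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and the values used by the actual script may differ.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def depth_to_points(depth: Sequence[Sequence[float]],
                    fx: float, fy: float,
                    cx: float, cy: float) -> List[Point3D]:
    """Back-project a depth map into 3D points with a pinhole camera model.

    depth[v][u] is the depth at pixel column u, row v. Pixels with a
    non-positive depth are treated as missing and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid measurement at this pixel
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, float(z)))
    return points


if __name__ == "__main__":
    # A 2x2 toy depth map with two valid pixels (placeholder intrinsics).
    toy_depth = [[0.0, 2.0],
                 [4.0, 0.0]]
    print(depth_to_points(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5))
    # [(0.5, -0.5, 2.0), (-1.0, 1.0, 4.0)]
```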
NTU RGB+D
- Download the dataset from the official website. You will need to request access.
- After downloading, convert the depth maps to point cloud data using our script:
python scripts/depth2point4ntu120.py --data_path /path/to/your/ntu_dataset
Synthia 4D
- Download the dataset from the official project page.
- Extract the .tar file. The data should be ready for use without further preprocessing.
Usage
Training
To train the UST-SSM model on a dataset, use the following command structure. Make sure to specify the dataset path and the configuration file.
python train.py --config cfgs/msr-action3d_config.yaml --data_path /path/to/processed_data
Evaluation
To evaluate a trained model, provide the path to your model checkpoint (.pth file).
python test.py --config cfgs/msr-action3d_config.yaml --data_path /path/to/processed_data --checkpoint /path/to/your/model.pth
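When running several configurations in a row, the training and evaluation commands above can be assembled from Python. This is a small convenience sketch, not part of the repository; `build_command` is a hypothetical helper, and it uses only the flags shown above (`--config`, `--data_path`, `--checkpoint`).

```python
import subprocess
import sys
from typing import List, Optional


def build_command(script: str, config: str, data_path: str,
                  checkpoint: Optional[str] = None) -> List[str]:
    """Build the argument list for train.py or test.py.

    The checkpoint flag is appended only when a path is given, which matches
    the evaluation command; training omits it.
    """
    cmd = [sys.executable, script, "--config", config, "--data_path", data_path]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    return cmd


if __name__ == "__main__":
    train_cmd = build_command(
        "train.py", "cfgs/msr-action3d_config.yaml", "/path/to/processed_data"
    )
    print(" ".join(train_cmd))
    # Uncomment to launch; check=True raises if the script exits with an error.
    # subprocess.run(train_cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when dataset paths contain spaces.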
Acknowledgement
This work builds upon the excellent codebase of PSTNet. We thank the authors for making their code publicly available. We are also grateful for the advancements in State Space Models, particularly Mamba.
Citation
If you find our work useful for your research, please consider citing our paper:
@inproceedings{li2025ust,
title={UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling},
author={Li, Peiming and Wang, Ziyi and Yuan, Yulin and Liu, Hong and Meng, Xiangming and Yuan, Junsong and Liu, Mengyuan},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={6738--6747},
year={2025}
}