Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Efstathios Karypidis (1,3), Ioannis Kakogeorgiou (1), Spyros Gidaris (2), Nikos Komodakis (1,4,5)
(1) Archimedes/Athena RC, (2) valeo.ai,
(3) National Technical University of Athens, (4) University of Crete, (5) IACM-Forth
This repository contains the official implementation of the paper: Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Contents
- News-ToDos
- Installation
- Dataset Preparation
- Futurist Training
- Evaluation
- Demo
- Citation
- Acknowledgements
News-ToDos
2025-1-14: The arXiv preprint and GitHub repository are released!
- Add new branches with code for training with VQ-VAE and with separate tokens for each modality
Installation
The code is tested with Python 3.11 and PyTorch 2.2.0+cu121 on Ubuntu 22.04.5 LTS. Create a new conda environment:
conda create -n futurist python=3.11
conda activate futurist
Install PyTorch, then clone the repository and install the required packages:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/Sta8is/FUTURIST
cd FUTURIST
pip install -r requirements.txt
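After installation, it can be useful to confirm that the environment actually contains the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the check assumes that wheels from the cu121 index report a local suffix such as `+cu121`, which is ignored in the comparison.

```python
from importlib import metadata

# Versions pinned by the pip command above.
EXPECTED = {"torch": "2.2.0", "torchvision": "0.17.0", "torchaudio": "2.2.0"}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Compare only the part before a local suffix such as "+cu121".
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


problems = find_mismatches(EXPECTED)
for package, (wanted, found) in problems.items():
    print(f"{package}: expected {wanted}, found {found}")
if not problems:
    print("All pinned packages match.")
```

Run it inside the activated `futurist` environment; a package reported as `found None` is not installed at all.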
Dataset Preparation
We use the Cityscapes dataset for our experiments. Specifically, we use the leftImg8bit_sequence_trainvaltest sequences. We extract segmentation maps with Segmenter and depth maps with DepthAnythingV2. You can skip downloading leftImg8bit_sequence_trainvaltest and the preprocessing step, and simply download the precomputed segmentation maps from here and depth maps from here. In addition, evaluating Futurist requires gtFine to be processed using cityscapesScripts. Alternatively, you can download the processed dataset from here. The final structure of the dataset should be as follows.
cityscapes
│
├───leftImg8bit_sequence_depthv2
│ ├───train
│ ├───val
├───leftImg8bit_sequence_segmaps_ids
│ ├───train
│ ├───val
├───gtFine
│ ├───train
│ ├───val
│ ├───test
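Before launching training, the layout above can be checked programmatically. This is a small sketch rather than a script shipped with the repository: the function name `missing_dirs` is hypothetical, and it only checks that the listed directories exist, not that their contents are complete.

```python
from pathlib import Path

# Sub-directories listed in the tree above, relative to the dataset root.
EXPECTED_DIRS = [
    "leftImg8bit_sequence_depthv2/train",
    "leftImg8bit_sequence_depthv2/val",
    "leftImg8bit_sequence_segmaps_ids/train",
    "leftImg8bit_sequence_segmaps_ids/val",
    "gtFine/train",
    "gtFine/val",
    "gtFine/test",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


# Example (replace the placeholder path with your dataset root):
#   for rel in missing_dirs("/path/to/cityscapes"):
#       print("missing:", rel)
```

An empty list means every directory from the tree is present.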
Futurist Training
To train Futurist with the default parameters, use the following command:
python train_futurist.py --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 2 --max_epochs 3200 --lr_base 4e-5 --patch_size 16 \
--eval_mode_during_training --evaluate --single_step_sample_train --masking "simple_replace" --seperable_attention --random_horizontal_flip \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence_segmaps_ids" --modality segmaps_depth \
--sequence_length 5 --num_classes 19 --emb_dim 10,10 --accum_iter 4 --w_s 0.85 \
--dst_path "/logdir/futurist" --masking_strategy "par_shared_excl" --modal_fusion "concat"
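When fewer than 8 GPUs are available, one common approach is to keep the number of samples per optimizer step constant by raising the accumulation count. The sketch below works through that arithmetic for the command above. It assumes that `--batch_size` is the per-GPU batch size and that `--accum_iter` is the number of gradient accumulation steps; the README does not confirm either, nor whether the learning rate is rescaled with batch size, so check `train_futurist.py` before relying on it. Both function names are hypothetical.

```python
def effective_batch_size(num_gpus, batch_size, accum_iter):
    """Samples contributing to one optimizer step.

    Assumes batch_size is per GPU and accum_iter counts gradient
    accumulation steps (an assumption, not confirmed by the README).
    """
    return num_gpus * batch_size * accum_iter


def accum_iter_for(target, num_gpus, batch_size):
    """Accumulation steps needed to reach `target` samples per step."""
    per_pass = num_gpus * batch_size
    if target % per_pass != 0:
        raise ValueError(f"{target} is not a multiple of {per_pass}")
    return target // per_pass


# Default command: 8 GPUs, batch size 2, 4 accumulation steps.
default = effective_batch_size(num_gpus=8, batch_size=2, accum_iter=4)
print(default)  # 64

# Same effective batch size on 4 GPUs.
print(accum_iter_for(default, num_gpus=4, batch_size=2))  # 8
```

Under these assumptions, training on 4 GPUs with `--accum_iter 8` would process the same number of samples per optimizer step as the default configuration.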
Evaluation
You can download the pre-trained model from here or via the CLI:
wget https://huggingface.co/Sta8is/FUTURIST/resolve/main/futurist.ckpt
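On systems without wget, the same file can be fetched with the Python standard library. This is a minimal sketch: the `download` helper and the `checkpoints/` destination are illustrative choices, not part of the repository, and only the URL comes from the command above.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
CKPT_URL = "https://huggingface.co/Sta8is/FUTURIST/resolve/main/futurist.ckpt"


def download(url, dst):
    """Stream `url` to `dst`, skipping the download if `dst` already exists."""
    dst = Path(dst)
    if dst.exists():
        return dst
    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".part")
    # Stream in chunks so the checkpoint is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dst)  # rename only after the download has completed
    return dst


# Example (hypothetical destination path):
#   ckpt_path = download(CKPT_URL, "checkpoints/futurist.ckpt")
```

Writing to a temporary `.part` file first means an interrupted download is never mistaken for a complete checkpoint.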
To evaluate a trained Futurist model, use the following command:
python train_futurist.py --num_gpus=4 --precision 16-mixed --eval_freq 10 --batch_size 2 --max_epochs 3200 --lr_base 4e-5 --patch_size 16 \
--eval_mode_during_training --evaluate --single_step_sample_train --masking "simple_replace" --seperable_attention --random_horizontal_flip \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence_segmaps_ids" --modality segmaps_depth \
--sequence_length 5 --num_classes 19 --emb_dim 10,10 --accum_iter 4 --w_s 0.85 \
--dst_path "/logdir/futurist" --masking_strategy "par_shared_excl" --modal_fusion "concat" \
--eval_ckpt_only --ckpt "/path/to/futurist.ckpt"
Demo
We provide 2 quick demos.
- Demo.
Citation
If you find Futurist useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!
@InProceedings{Karypidis_2025_CVPR,
author = {Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
title = {Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3793-3803}
}
Acknowledgements
Our code is partially based on Maskgit-pytorch, DepthAnythingV2, and Segmenter. We thank the authors for their work and open-source code.