DFormer for RGBD Semantic Segmentation (Jittor Implementation)

August 10, 2025 · View on GitHub

This is the Jittor implementation of DFormer and DFormerv2 for RGBD semantic segmentation. Developed based on the Jittor deep learning framework, it provides efficient solutions for training and inference.

This repository contains the official Jittor implementation of the following papers:

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation
Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou*
ICLR 2024. Paper Link | Homepage | PyTorch Version

DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
Bo-Wen Yin, Jiao-Long Cao, Ming-Ming Cheng, Qibin Hou*
CVPR 2025. Paper Link | Chinese Version | PyTorch Version

✨ About the Jittor Framework: An Architectural Deep Dive ✨

This project is built upon Jittor, a cutting-edge deep learning framework that pioneers a design centered around Just-In-Time (JIT) compilation and meta-operators. This architecture provides a unique combination of high performance and exceptional flexibility. Instead of relying on static, pre-compiled libraries, Jittor operates as a dynamic, programmable system that compiles itself and the user's code on the fly.

The Core Philosophy: From Static Library to Dynamic Compiler

Jittor's design philosophy is to treat the deep learning framework not as a fixed set of tools, but as a domain-specific compiler. The high-level Python code written by the user serves as a directive to this compiler, which then generates highly optimized, hardware-specific machine code at runtime. This approach unlocks a level of performance and flexibility that is difficult to achieve with traditional frameworks.

Key Innovations of Jittor

A Truly Just-in-Time (JIT) Compiled Framework:

Jittor's most significant innovation is that the entire framework is JIT compiled. This goes beyond merely compiling a static computation graph. When a Jittor program runs, the Python code, including the core framework logic and the user's model, is first parsed into an intermediate representation. The Jittor compiler then performs a series of advanced optimizations—such as operator fusion, memory layout optimization, and dead code elimination—before generating and executing native C++ or CUDA code. This "whole-program" compilation approach means that the framework can adapt to the specific logic of your model, enabling optimizations that are impossible when linking against a static, pre-compiled library.
Meta-Operators and Dynamic Kernel Fusion:

At the heart of Jittor lies the concept of meta-operators. These are not monolithic, pre-written kernels (like in other frameworks), but rather elementary building blocks defined in Python. For instance, a complex operation like Conv2d followed by ReLU is not two separate kernel calls. Instead, Jittor composes them from meta-operators, and its JIT compiler fuses them into a single, efficient CUDA kernel at runtime. This kernel fusion is critical for performance on modern accelerators like GPUs, as it drastically reduces the time spent on high-latency memory I/O and kernel launch overhead, which are often the primary bottlenecks.
The Unified Computation Graph: Flexibility Meets Performance:

Jittor elegantly resolves the classic trade-off between the flexibility of dynamic graphs (like PyTorch) and the performance of static graphs (like TensorFlow 1.x). You can write your model using all the native features of Python, including complex control flow like if/else statements and data-dependent for loops. Jittor's compiler traces these dynamic execution paths and still constructs a graph representation that it can optimize globally. It achieves this by JIT-compiling different graph versions for different execution paths, thus preserving Python's expressiveness without sacrificing optimization potential.
Decoupling of Frontend Logic and Backend Optimization:

Jittor champions a clean separation that empowers researchers. You focus on the "what"—the mathematical logic of your model—using a clean, high-level Python API. Jittor's backend automatically handles the "how"—the complex task of writing high-performance, hardware-specific code. This frees researchers who are experts in their domain (e.g., computer vision) from needing to become experts in low-level GPU programming, thus accelerating the pace of innovation.

🚩 Performance

Chart 1: Comparison of mIoU changes between Jittor implementation and Pytorch implementation of Dformer-Large.

Chart 2: Comparisons of lantency between Jittor implementation and Pytorch implementation

Chart 3: Comparisons of evaluation time between Jittor implementation and Pytorch implementation

Chart 4: Comparisons of Model Size and Evaluation Time between Jittor implementation and Pytorch implementation

🚀 Getting Started

Environment Setup

# Create a conda environment
conda create -n dformer_jittor python=3.8 -y
conda activate dformer_jittor

# Install Jittor
pip install jittor

# Install other dependencies
pip install opencv-python pillow numpy scipy tqdm tensorboardX tabulate easydict

Dataset Preparation

Supported datasets:

NYUDepthv2: An indoor RGBD semantic segmentation dataset.
SUNRGBD: A large-scale dataset for indoor scene understanding.

Download links:

Dataset	GoogleDrive	OneDrive	BaiduNetdisk

Pre-trained Models

Model	Dataset	mIoU	Download Link
DFormer-Small	NYUDepthv2	52.3	BaiduNetdisk
DFormer-Base	NYUDepthv2	54.1	BaiduNetdisk
DFormer-Large	NYUDepthv2	55.8	BaiduNetdisk
DFormerv2-Small	NYUDepthv2	53.7	BaiduNetdisk
DFormerv2-Base	NYUDepthv2	55.3	BaiduNetdisk
DFormerv2-Large	NYUDepthv2	57.1	BaiduNetdisk

Directory Structure

DFormer-Jittor/
├── checkpoints/              # Directory for pre-trained models
│   ├── pretrained/          # ImageNet pre-trained models
│   └── trained/             # Trained models
├── datasets/                # Directory for datasets
│   ├── NYUDepthv2/         # NYU dataset
│   └── SUNRGBD/            # SUNRGBD dataset
├── local_configs/          # Configuration files
├── models/                 # Model definitions
├── utils/                  # Utility functions
├── train.sh               # Training script
├── eval.sh                # Evaluation script
└── infer.sh               # Inference script

📖 Usage

Training

Use the provided training script:

bash train.sh

Or use the Python command directly:

python utils/train.py --config local_configs.NYUDepthv2.DFormer_Base

Evaluation

bash eval.sh

Alternatively:

python utils/eval.py --config local_configs.NYUDepthv2.DFormer_Base --checkpoint checkpoints/trained/NYUDepthv2/DFormer_Base/best.pkl

Inference/Visualization

bash infer.sh

🚩 Performance

Table 1: Comparisons between the existing methods and our DFormer.

Table 2: Comparisons between the existing methods and our DFormerv2.

🔧 Configuration

The project uses Python configuration files located in the local_configs/ directory:

# local_configs/NYUDepthv2/DFormer_Base.py
class C:
    # Dataset configuration
    dataset_name = "NYUDepthv2"
    dataset_dir = "datasets/NYUDepthv2"
    num_classes = 40
    
    # Model configuration
    backbone = "DFormer_Base"
    pretrained_model = "checkpoints/pretrained/DFormer_Base.pth"
    
    # Training configuration
    batch_size = 8
    nepochs = 500
    lr = 0.01
    momentum = 0.9
    weight_decay = 0.0001
    
    # Other configurations
    log_dir = "logs"
    checkpoint_dir = "checkpoints"

📊 Benchmarking

FLOPs and Parameters

python benchmark.py --config local_configs.NYUDepthv2.DFormer_Base

Inference Speed

python utils/latency.py --config local_configs.NYUDepthv2.DFormer_Base

⚠️ Note

Root Cause of the Issue

What is CUTLASS?
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is a high-performance CUDA matrix operation template library launched by NVIDIA, primarily used for efficiently implementing core operators like GEMM/Conv on Tensor Cores. It is utilized by many frameworks (Jittor, PyTorch XLA, TVM, etc.) for custom operators or as a low-level acceleration for Auto-Tuning.

Why does Jittor pull CUTLASS in cuDNN unit tests?
When Jittor loads/compiles external CUDA libraries, it automatically compiles several custom operators from CUTLASS (setup_cutlass()). If the local cache is missing, it will call install_cutlass() to download and extract a cutlass.zip.

Direct Cause of the Crash

The install_cutlass() function in version 1.3.9.14 uses a download link that has become invalid (confirmed by community Issue #642).
After the download fails, a partial ~/.cache/jittor/cutlass directory is left behind; when running the function again, it attempts to execute shutil.rmtree('.../cutlass/cutlass'), but this subdirectory does not exist, triggering a FileNotFoundError and ultimately causing the main process to core dump.

解决方案 (按推荐顺序选择其一)

方案	操作步骤	适用场景
1️⃣ 临时跳过 CUTLASS	`bash<br># 仅对当前 shell 生效<br>export use_cutlass=0<br>python3.8 -m jittor.test.test_cudnn_op<br>`	只想先跑通 cuDNN 单测 / 不需要 CUTLASS 算子
2️⃣ 手动安装 CUTLASS	`bash<br># 清理残留<br>rm -rf ~/.cache/jittor/cutlass<br><br># 手动克隆最新版<br>mkdir -p ~/.cache/jittor/cutlass && \<br>cd ~/.cache/jittor/cutlass && \<br>git clone --depth 1 https://github.com/NVIDIA/cutlass.git cutlass<br><br># 再次运行<br>python3.8 -m jittor.test.test_cudnn_op<br>`	仍想保留 CUTLASS 相关算子功能
3️⃣ 升级 Jittor 至修复版本	`bash<br>pip install -U jittor jittor-utils<br>` 社区 1.3.9.15+ 已把失效链接改到镜像源，升级后即可自动重新下载。	允许升级环境并希望后续自动管理

🤝 Contributing

We welcome all forms of contributions:

Bug Reports: Report issues in GitHub Issues.
Feature Requests: Suggest new features.
Code Contributions: Submit Pull Requests.
Documentation Improvements: Improve README and code comments.

📞 Contact

If you have any questions about our work, feel free to contact us:

Email: bowenyin@mail.nankai.edu.cn, caojiaolong@mail.nankai.edu.cn
GitHub Issues: Submit an issue

📚 Citation

If you use our work in your research, please cite the following papers:

@inproceedings{yin2024dformer,
  title={DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation},
  author={Yin, Bowen and Zhang, Xuying and Li, Zhong-Yu and Liu, Li and Cheng, Ming-Ming and Hou, Qibin},
  booktitle={ICLR},
  year={2024}
}

@inproceedings{yin2025dformerv2,
  title={DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation},
  author={Yin, Bo-Wen and Cao, Jiao-Long and Cheng, Ming-Ming and Hou, Qibin},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={19345--19355},
  year={2025}
}

🙏 Acknowledgements

Our implementation is mainly based on the following open-source projects:

Jittor: A deep learning framework.
DFormer: The original PyTorch implementat ion.
mmsegmentation: A semantic segmentation toolbox.

Thanks to all the contributors for their efforts!

📄 License

This project is for non-commercial use only. See the LICENSE file for details.