DiTFuse
March 7, 2026
Official implementation of "Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach" (TPAMI 2026).
For any questions, please contact us by email: lijiayang.cs@gmail.com
Looking forward to your ⭐!
📌 TODOs
- release code
- release ckpt
- release arxiv
- IEEE version paper
Core Concept:
The core objective of our work is to demonstrate the superiority of a parallel architecture for information control. In experiments beyond the main paper, we also tried AdaIN-based information injection and T2I-Adapter-style feature-map addition. However, these approaches inevitably entangle the information from the two modalities: the features are numerically mixed together, which makes it impossible to truly separate the content of the two input images. This is why explicit information disentanglement is necessary, and why a parallel design is the appropriate choice.
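The entanglement argument can be illustrated with a toy sketch. This is not the DiTFuse implementation: the function names and the list-based "features" are hypothetical and only show why element-wise addition loses track of which modality contributed what, while keeping two parallel token streams does not.

```python
# Illustrative sketch only: toy "features" as plain Python lists, not DiTFuse code.

def additive_injection(feat_a, feat_b):
    """Feature-map addition: element-wise sum of two feature vectors."""
    return [a + b for a, b in zip(feat_a, feat_b)]

def parallel_tokens(feat_a, feat_b):
    """Parallel design (assumed here as simple concatenation): keep both streams side by side."""
    return feat_a + feat_b  # list concatenation, order preserved

def split_parallel(tokens, len_a):
    """Recover the two streams from the concatenated sequence."""
    return tokens[:len_a], tokens[len_a:]

# Two different (modality A, modality B) input pairs, e.g. infrared and visible features.
ir_1, vis_1 = [1.0, 2.0, 3.0], [4.0, 0.0, 1.0]
ir_2, vis_2 = [3.0, 1.0, 0.0], [2.0, 1.0, 4.0]

# Addition maps both pairs to the same vector, so the sources cannot be told apart.
sum_1 = additive_injection(ir_1, vis_1)
sum_2 = additive_injection(ir_2, vis_2)
print("sums identical:", sum_1 == sum_2)

# The parallel layout keeps each stream exactly recoverable.
tokens = parallel_tokens(ir_1, vis_1)
rec_ir, rec_vis = split_parallel(tokens, len(ir_1))
print("streams recovered:", rec_ir == ir_1 and rec_vis == vis_1)
```

In this toy case two different input pairs produce the same summed feature, so no downstream layer can decide which modality a value came from. The concatenated layout preserves that distinction by construction.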
In addition, the M3-style synthetic fusion data construction pipeline can significantly improve the performance of the fusion task itself. Finally, with the rapid progress of unified models for visual understanding and generation, we believe fusion tasks should also actively embrace this trend, incorporating strong visual priors into fusion frameworks. We look forward to future advances enabled by such unified architectures.
🚀 Overview
Setup
For detailed installation and usage instructions, please refer to ➡️ setup.md.
Test & Train
Testing
For testing, please refer to the provided script: ➡️ test.md
This script demonstrates how to run DiTFuse in different modes (single, batch, and multi-prompt).
Inference requires approximately 12 GB of GPU memory, so it runs on widely available GPUs such as the NVIDIA RTX 3090, V100, and RTX 4090.
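As a convenience, the memory requirement can be checked before running inference. The sketch below is not part of the DiTFuse repository. It assumes `nvidia-smi` is on the `PATH` and reads "approximately 12 GB" as 12 GiB, so treat the threshold as a rough guide; the helper names are hypothetical.

```python
import subprocess

REQUIRED_MIB = 12 * 1024  # assumption: "approximately 12 GB" read as 12 GiB

def parse_memory_totals(csv_text):
    """Parse per-GPU total memory (MiB, one value per line) from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip().isdigit()]

def gpus_with_enough_memory(totals_mib, required_mib=REQUIRED_MIB):
    """Return the indices of GPUs whose total memory meets the requirement."""
    return [i for i, total in enumerate(totals_mib) if total >= required_mib]

def query_memory_totals():
    """Ask nvidia-smi for total memory per GPU; return [] if it is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_memory_totals(out)

if __name__ == "__main__":
    totals = query_memory_totals()
    print("GPUs meeting the requirement:", gpus_with_enough_memory(totals))
```

An empty list means either that no GPU reports enough total memory or that `nvidia-smi` could not be run. Total memory is only an upper bound, since other processes may already occupy part of it.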
Training
Training follows the same procedure as OmniGen.
📄 Citation
If you use DiTFuse in your research, please cite:
@ARTICLE{11297852,
  author={Li, Jiayang and Jiang, Chengjie and Jiang, Junjun and Liang, Pengwei and Ma, Jiayi and Nie, Liqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach},
  year={2026},
  volume={48},
  number={4},
  pages={3970-3987},
  keywords={Semantics;Image fusion;Training;Image segmentation;Transformers;Optimization;Data models;Visual effects;Feature extraction;Electronic mail;Image fusion;DiT;text control},
  doi={10.1109/TPAMI.2025.3642842}
}
❤️ Acknowledgements
This project is built on OmniGen, a powerful Diffusion Transformer framework developed by VectorSpace Lab.