July 23, 2025
TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
1 The University of Hong Kong
2 Nanjing University
3 University of Chinese Academy of Sciences
4 Nanyang Technological University
5 Harbin Institute of Technology
(*Equal Contribution. †Corresponding Author.)
Paper | Project Page | LoRA Weights
About
We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
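To make "rebalancing cross-modal attention" concrete, here is a minimal sketch of one way such rebalancing could work for a single image query in a joint text-image attention layer. It assumes the rebalancing multiplies the image-to-text attention logits by a factor `gamma` before the softmax, and that the factor is applied only during part of the denoising trajectory. The function names, the threshold schedule, and all numeric values are hypothetical and chosen for illustration; the repository's inference scripts and the paper define the actual mechanism.

```python
import math


def rebalanced_weights(text_logits, image_logits, gamma=1.0):
    """Softmax over one image query's logits against [text keys; image keys].

    Assumption: rebalancing scales only the text-key logits by `gamma`.
    gamma == 1.0 reproduces ordinary joint attention.
    """
    logits = [gamma * s for s in text_logits] + list(image_logits)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gamma_at(t, t_threshold, gamma_0):
    """Hypothetical schedule: apply the boost only while t >= t_threshold."""
    return gamma_0 if t >= t_threshold else 1.0


# Made-up logits: 2 text keys, 4 image keys.
text_logits = [2.0, 1.0]
image_logits = [2.0, 1.5, 1.0, 0.5]

plain = rebalanced_weights(text_logits, image_logits, gamma=1.0)
boosted = rebalanced_weights(
    text_logits, image_logits, gamma=gamma_at(0.9, 0.5, 1.5)
)

# Share of attention mass that lands on the text tokens.
text_mass_plain = sum(plain[: len(text_logits)])
text_mass_boosted = sum(boosted[: len(text_logits)])
print(f"text mass: {text_mass_plain:.3f} -> {text_mass_boosted:.3f}")
```

With positive text logits, as in this toy example, a factor above 1 raises the share of attention the image query gives to text tokens, while image-to-image logits are left unchanged.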
Usage
For Stable Diffusion 3.5, simply run:
python infer/infer_sd3.py
For FLUX.1, run:
python infer/infer_flux.py
Benchmark
Text-image alignment scores on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models.
| Model | Attribute Binding: Color | Attribute Binding: Shape | Attribute Binding: Texture | Object Relationship: Spatial | Object Relationship: Non-Spatial | Complex |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA () | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
| FLUX.1-Dev + TACA () | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA () | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA () | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |
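As a quick way to read the table, the snippet below computes the relative change in the Spatial score between each base model and the first TACA row listed under it. The scores are copied from the table above; the helper name `relative_gain` is only illustrative and is not part of the repository.

```python
def relative_gain(base, new):
    """Relative change from `base` to `new`, in percent."""
    return (new - base) / base * 100.0


# Spatial scores from the table: (base model, first TACA row under it).
spatial = {
    "FLUX.1-Dev": (0.2066, 0.2405),
    "SD3.5-Medium": (0.2087, 0.2678),
}

gains = {name: relative_gain(base, new) for name, (base, new) in spatial.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}% on Spatial")
```

The same helper can be applied to any other column by swapping in the corresponding pair of scores.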
Showcases
