July 23, 2025

TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv*1, Tianlin Pan*2,3, Chenyang Si*2, Zhaoxi Chen4, Wangmeng Zuo5, Ziwei Liu4†, Kwan-Yee K. Wong1†
1The University of Hong Kong       2Nanjing University
3University of Chinese Academy of Sciences       4Nanyang Technological University
5Harbin Institute of Technology
(*Equal Contribution. †Corresponding Author.)

Paper | Project Page | LoRA Weights

About

We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.

https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
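
The description above does not spell out the mechanism, so the following is only an illustrative sketch of what "rebalancing cross-modal attention" could look like in a joint text-image attention layer. It assumes a single head, a token sequence laid out as [text; image], and a scaling factor `gamma` applied to the logits of image queries attending to text keys. The function names, the token layout, and the value of `gamma` are assumptions made for this example and are not taken from the repository.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def joint_attention(q, k, v, n_text, gamma=1.0):
    """Single-head attention over a [text; image] token sequence.

    gamma is a hypothetical knob for this sketch: it rescales only the
    logits of image queries (rows n_text:) attending to text keys
    (columns :n_text). gamma = 1.0 gives standard attention.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    logits[n_text:, :n_text] *= gamma  # image-to-text block only
    weights = softmax(logits)
    return weights @ v, weights


# Toy inputs: 4 text tokens, 6 image tokens, 8-dimensional head.
rng = np.random.default_rng(0)
n_text, n_image, dim = 4, 6, 8
q, k, v = (rng.standard_normal((n_text + n_image, dim)) for _ in range(3))

out_base, base = joint_attention(q, k, v, n_text, gamma=1.0)
out_boost, boosted = joint_attention(q, k, v, n_text, gamma=1.5)

# Share of each image query's attention that lands on text tokens.
print("gamma=1.0:", base[n_text:, :n_text].sum(axis=-1).mean())
print("gamma=1.5:", boosted[n_text:, :n_text].sum(axis=-1).mean())
```

In this sketch only the image-query rows change; the text-query rows are identical for both values of `gamma`. For the actual method, refer to the paper and the inference scripts.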

Usage

For Stable Diffusion 3.5, simply run:

python infer/infer_sd3.py

For FLUX.1, run:

python infer/infer_flux.py

Benchmark

Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models.

| Model | Attribute Binding: Color ↑ | Attribute Binding: Shape ↑ | Attribute Binding: Texture ↑ | Object Relationship: Spatial ↑ | Object Relationship: Non-Spatial ↑ | Complex ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA (r = 64) | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
| FLUX.1-Dev + TACA (r = 16) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA (r = 64) | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA (r = 16) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |
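
For the r = 64 rows, the largest gain over each baseline is in the Spatial column. As a quick way to read the table, the snippet below computes the relative improvement from the Spatial scores listed above. The helper `relative_gain` is a name introduced here for illustration and is not part of the repository.

```python
# Spatial scores copied from the benchmark table: (baseline, with TACA r = 64).
spatial = {
    "FLUX.1-Dev": (0.2066, 0.2405),
    "SD3.5-Medium": (0.2087, 0.2678),
}


def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


gains = {model: relative_gain(b, t) for model, (b, t) in spatial.items()}
for model, gain in gains.items():
    print(f"{model}: {gain:+.1f}% Spatial with TACA (r = 64)")
# FLUX.1-Dev: +16.4% Spatial with TACA (r = 64)
# SD3.5-Medium: +28.3% Spatial with TACA (r = 64)
```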

Showcases