July 23, 2025

TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv*¹, Tianlin Pan*²,³, Chenyang Si*², Zhaoxi Chen⁴, Wangmeng Zuo⁵, Ziwei Liu⁴†, Kwan-Yee K. Wong¹†

¹The University of Hong Kong ²Nanjing University
³University of Chinese Academy of Sciences ⁴Nanyang Technological University
⁵Harbin Institute of Technology

(*Equal contribution. †Corresponding author.)

Paper | Project Page | LoRA Weights

About

We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.

https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
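
To make "rebalancing cross-modal attention" concrete, here is a toy sketch of one possible reading: for an image-token query, the logits that point at text keys are multiplied by a factor before the softmax, so more attention mass goes to the prompt. This is an illustration only, not the repository's implementation. The function names, the `gamma` parameter, and the toy values are all hypothetical and are not taken from the TACA code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention_weights(logits, num_text_tokens, gamma=1.0):
    """Attention weights for one image query over [text keys | image keys].

    Assumption: the first `num_text_tokens` logits belong to text keys.
    Multiplying them by `gamma` (hypothetical knob) before the softmax
    changes how much probability mass the text keys receive.
    """
    scaled = [
        gamma * z if i < num_text_tokens else z
        for i, z in enumerate(logits)
    ]
    return softmax(scaled)


def text_mass(weights, num_text_tokens):
    """Total attention weight assigned to the text keys."""
    return sum(weights[:num_text_tokens])


# Toy case: 2 text keys, 6 image keys, identical positive logits.
logits = [1.0] * 8
base = rebalanced_attention_weights(logits, 2, gamma=1.0)
boosted = rebalanced_attention_weights(logits, 2, gamma=1.5)
print(text_mass(base, 2), text_mass(boosted, 2))
```

With equal logits, the unscaled text share is simply the fraction of text keys (2 of 8), and a factor above 1 raises it. A real model would apply such a factor inside the joint attention of each transformer block, and could make it depend on the denoising timestep.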

Usage

For Stable Diffusion 3.5, simply run:

python infer/infer_sd3.py

For FLUX.1, run:

python infer/infer_flux.py

Benchmark

Alignment scores on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Higher is better; $r$ is the LoRA rank.

Model	Color $\uparrow$ (Attribute Binding)	Shape $\uparrow$ (Attribute Binding)	Texture $\uparrow$ (Attribute Binding)	Spatial $\uparrow$ (Object Relationship)	Non-Spatial $\uparrow$ (Object Relationship)	Complex $\uparrow$
FLUX.1-Dev	0.7678	0.5064	0.6756	0.2066	0.3035	0.4359
FLUX.1-Dev + TACA ( $r = 64$ )	0.7843	0.5362	0.6872	0.2405	0.3041	0.4494
FLUX.1-Dev + TACA ( $r = 16$ )	0.7842	0.5347	0.6814	0.2321	0.3046	0.4479
SD3.5-Medium	0.7890	0.5770	0.7328	0.2087	0.3104	0.4441
SD3.5-Medium + TACA ( $r = 64$ )	0.8074	0.5938	0.7522	0.2678	0.3106	0.4470
SD3.5-Medium + TACA ( $r = 16$ )	0.7984	0.5834	0.7467	0.2374	0.3111	0.4505
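
The absolute differences in the table are small, so it can help to read them as relative gains over each baseline. The short script below does this for the Spatial column with $r = 64$. The scores are copied from the table; the helper name `relative_gain` is just for illustration.

```python
# Spatial scores copied from the table: (baseline, + TACA with r = 64).
spatial = {
    "FLUX.1-Dev": (0.2066, 0.2405),
    "SD3.5-Medium": (0.2087, 0.2678),
}


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


for model, (base_score, taca_score) in spatial.items():
    print(f"{model}: {relative_gain(base_score, taca_score):+.1%}")
# FLUX.1-Dev: +16.4%
# SD3.5-Medium: +28.3%
```

The same helper can be applied to any other column by swapping in the corresponding pair of scores.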
