diffusion.md
October 6, 2025
Diffusion
- (arXiv 2022.12) Scalable Diffusion Models with Transformers, [Paper], [Code]
- (arXiv 2023.03) Masked Diffusion Transformer is a Strong Image Synthesizer, [Paper], [Code]
- (arXiv 2023.04) ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis, [Paper]
- (arXiv 2023.06) DFormer: Diffusion-guided Transformer for Universal Image Segmentation, [Paper], [Code]
- (arXiv 2023.08) Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers, [Paper]
- (arXiv 2023.09) Large-Vocabulary 3D Diffusion Model with Transformer, [Paper], [Project]
- (arXiv 2023.09) Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models, [Paper], [Project]
- (arXiv 2023.12) DiffiT: Diffusion Vision Transformers for Image Generation, [Paper]
- (arXiv 2023.12) DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers, [Paper]
- (arXiv 2024.01) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers, [Paper], [Code]
- (arXiv 2024.01) Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.02) Cross-view Masked Diffusion Transformers for Person Image Synthesis, [Paper]
- (arXiv 2024.02) FiT: Flexible Vision Transformer for Diffusion Model, [Paper], [Code]
- (arXiv 2024.03) Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts, [Paper], [Code]
- (arXiv 2024.03) SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer, [Paper]
- (arXiv 2024.04) WcDT: World-centric Diffusion Transformer for Traffic Scene Generation, [Paper]
- (arXiv 2024.04) Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers, [Paper]
- (arXiv 2024.04) Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers, [Paper]
- (arXiv 2024.04) Lazy Diffusion Transformer for Interactive Image Editing, [Paper], [Project]
- (arXiv 2024.05) U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.05) Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.05) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.05) DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation, [Paper]
- (arXiv 2024.05) Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, [Paper],[Code]
- (arXiv 2024.05) Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer, [Paper],[Project]
- (arXiv 2024.05) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models, [Paper],[Code]
- (arXiv 2024.05) Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer, [Paper],[Code]
- (arXiv 2024.05) PTQ4DiT: Post-training Quantization for Diffusion Transformers, [Paper]
- (arXiv 2024.05) VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.05) DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention, [Paper],[Code]
- (arXiv 2024.06) Δ-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers, [Paper]
- (arXiv 2024.06) Dimba: Transformer-Mamba Diffusion Models, [Paper],[Code]
- (arXiv 2024.06) AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation, [Paper]
- (arXiv 2024.06) DiTFastAttn: Attention Compression for Diffusion Transformer Models, [Paper],[Code]
- (arXiv 2024.06) Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT, [Paper],[Code]
- (arXiv 2024.07) FORA: Fast-Forward Caching in Diffusion Transformer Acceleration, [Paper],[Code]
- (arXiv 2024.07) VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control, [Paper]
- (arXiv 2024.07) Scaling Diffusion Transformers to 16 Billion Parameters, [Paper],[Code]
- (arXiv 2024.07) DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving, [Paper],[Code]
- (arXiv 2024.07) Diffusion Feedback Helps CLIP See Better, [Paper],[Code]
- (arXiv 2024.08) Tora: Trajectory-oriented Diffusion Transformer for Video Generation, [Paper],[Code]
- (arXiv 2024.08) Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing, [Paper]
- (arXiv 2024.08) DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose, [Paper]
- (arXiv 2024.08) MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer, [Paper]
- (arXiv 2024.09) Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task, [Paper],[Code]
- (arXiv 2024.09) DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing, [Paper]
- (arXiv 2024.09) Token Caching for Diffusion Transformer Acceleration, [Paper]
- (arXiv 2024.10) ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer, [Paper],[Code]
- (arXiv 2024.10) HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration, [Paper]
- (arXiv 2024.10) EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing, [Paper]
- (arXiv 2024.10) MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation, [Paper]
- (arXiv 2024.10) Dynamic Diffusion Transformer, [Paper],[Code]
- (arXiv 2024.10) SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, [Paper]
- (arXiv 2024.10) Boosting Camera Motion Control for Video Diffusion Transformers, [Paper]
- (arXiv 2024.10) The Ingredients for Robotic Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.10) FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification, [Paper]
- (arXiv 2024.10) Precipitation Nowcasting Using Diffusion Transformer with Causal Attention, [Paper]
- (arXiv 2024.10) Group Diffusion Transformers are Unsupervised Multitask Learners, [Paper]
- (arXiv 2024.10) Diffusion Transformer Policy, [Paper]
- (arXiv 2024.10) On Inductive Biases That Enable Generalization of Diffusion Transformers, [Paper]
- (arXiv 2024.10) GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation, [Paper],[Code]
- (arXiv 2024.10) EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching, [Paper],[Code]
- (arXiv 2024.10) In-Context LoRA for Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.11) Learning Where to Edit Vision Transformers, [Paper],[Code]
- (arXiv 2024.11) Adaptive Caching for Faster Video Generation with Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.11) DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction, [Paper],[Code]
- (arXiv 2024.11) DiT4Edit: Diffusion Transformer for Image Editing, [Paper],[Code]
- (arXiv 2024.11) Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing, [Paper]
- (arXiv 2024.11) LaVin-DiT: Large Vision Diffusion Transformer, [Paper]
- (arXiv 2024.11) FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on, [Paper],[Code]
- (arXiv 2024.11) Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study, [Paper],[Code]
- (arXiv 2024.11) Accelerating Vision Diffusion Transformers with Skip Branches, [Paper],[Code]
- (arXiv 2024.11) Towards Precise Scaling Laws for Video Diffusion Transformers, [Paper]
- (arXiv 2024.11) On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality, [Paper]
- (arXiv 2024.11) LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis, [Paper],[Code]
- (arXiv 2024.12) TinyFusion: Diffusion Transformers Learned Shallow, [Paper],[Code]
- (arXiv 2024.12) CPA: Camera-pose-awareness Diffusion Transformer for Video Generation, [Paper]
- (arXiv 2024.12) Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks, [Paper],[Code]
- (arXiv 2024.12) OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows, [Paper],[Code]
- (arXiv 2024.12) ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer, [Paper]
- (arXiv 2024.12) UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics, [Paper],[Code]
- (arXiv 2024.12) Video Motion Transfer with Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.12) MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation, [Paper]
- (arXiv 2024.12) FlexDiT: Dynamic Token Density Control for Diffusion Transformer, [Paper],[Code]
- (arXiv 2024.12) Causal Diffusion Transformers for Generative Modeling, [Paper],[Code]
- (arXiv 2024.12) AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration, [Paper]
- (arXiv 2024.12) ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.12) StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer, [Paper]
- (arXiv 2024.12) Video Diffusion Transformers are In-Context Learners, [Paper],[Code]
- (arXiv 2024.12) Efficient Scaling of Diffusion Transformers for Text-to-Image Generation, [Paper]
- (arXiv 2024.12) CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up, [Paper]
- (arXiv 2024.12) Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers, [Paper],[Code]
- (arXiv 2024.12) Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer, [Paper]
- (arXiv 2024.12) DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation, [Paper],[Code]
- (arXiv 2024.12) Accelerating Diffusion Transformers with Dual Feature Caching, [Paper],[Code]
- (arXiv 2025.01) SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration, [Paper],[Code]
- (arXiv 2025.01) Ingredients: Blending Custom Photos with Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.01) GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking, [Paper],[Code]
- (arXiv 2025.01) Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.01) MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer, [Paper]
- (arXiv 2025.01) ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning, [Paper]
- (arXiv 2025.01) 3DIS-FLUX: Simple and Efficient Multi-Instance Generation with DiT Rendering, [Paper],[Code]
- (arXiv 2025.01) LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation, [Paper],[Code]
- (arXiv 2025.01) CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation, [Paper],[Code]
- (arXiv 2025.01) PackDiT: Joint Human Motion and Text Generation via Mutual Prompting, [Paper]
- (arXiv 2025.01) ITVTON: Virtual Try-On Diffusion Transformer Model Based on Integrated Image and Text, [Paper]
- (arXiv 2025.01) SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.02) Accelerating Diffusion Transformer via Error-Optimized Cache, [Paper]
- (arXiv 2025.02) LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.02) ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features, [Paper]
- (arXiv 2025.02) UniForm: A Unified Diffusion Transformer for Audio-Video Generation, [Paper]
- (arXiv 2025.02) HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation, [Paper]
- (arXiv 2025.02) VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.02) Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile, [Paper]
- (arXiv 2025.02) CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers, [Paper]
- (arXiv 2025.02) DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training, [Paper]
- (arXiv 2025.02) E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization, [Paper]
- (arXiv 2025.02) SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers, [Paper]
- (arXiv 2025.02) Designing Parameter and Compute Efficient Diffusion Transformers using Distillation, [Paper]
- (arXiv 2025.02) RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.02) RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.02) VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.02) FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute, [Paper]
- (arXiv 2025.03) LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding, [Paper],[Code]
- (arXiv 2025.03) Accelerating Diffusion Transformer via Gradient-Optimized Cache, [Paper]
- (arXiv 2025.03) Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching, [Paper],[Code]
- (arXiv 2025.03) EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer, [Paper]
- (arXiv 2025.03) TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation, [Paper]
- (arXiv 2025.03) Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping, [Paper]
- (arXiv 2025.03) X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation, [Paper],[Code]
- (arXiv 2025.03) U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers, [Paper]
- (arXiv 2025.03) OminiControl2: Efficient Conditioning for Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.03) Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework, [Paper],[Code]
- (arXiv 2025.03) UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.03) AudioX: Diffusion Transformer for Anything-to-Audio Generation, [Paper],[Code]
- (arXiv 2025.03) Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers, [Paper]
- (arXiv 2025.03) NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers, [Paper]
- (arXiv 2025.03) DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.03) Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection, [Paper],[Code]
- (arXiv 2025.03) Personalize Anything for Free with Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.03) FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing, [Paper],[Code]
- (arXiv 2025.03) Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts, [Paper]
- (arXiv 2025.03) BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers, [Paper]
- (arXiv 2025.03) U-REPA: Aligning Diffusion U-Nets to ViTs, [Paper],[Code]
- (arXiv 2025.03) Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings, [Paper]
- (arXiv 2025.03) Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer, [Paper],[Code]
- (arXiv 2025.03) EDiT: Efficient Diffusion Transformers with Linear Compressed Attention, [Paper]
- (arXiv 2025.03) Towards Transformer-Based Aligned Generation with Self-Coherence Guidance, [Paper],[Code]
- (arXiv 2025.03) AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers, [Paper]
- (arXiv 2025.03) Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation, [Paper],[Code]
- (arXiv 2025.03) FullDiT: Multi-Task Video Generative Foundation Model with Full Attention, [Paper]
- (arXiv 2025.03) ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On, [Paper],[Code]
- (arXiv 2025.03) Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy, [Paper],[Code]
- (arXiv 2025.03) JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization, [Paper],[Code]
- (arXiv 2025.03) DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers, [Paper]
- (arXiv 2025.03) DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model, [Paper]
- (arXiv 2025.04) SkyReels-A2: Compose Anything in Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.04) DDT: Decoupled Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.04) Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing, [Paper]
- (arXiv 2025.04) DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation, [Paper],[Code]
- (arXiv 2025.04) DreamFuse: Adaptive Image Fusion with Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.04) Insert Anything: Image Insertion via In-Context Editing in DiT, [Paper],[Code]
- (arXiv 2025.04) RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild, [Paper],[Code]
- (arXiv 2025.04) DiTPainter: Efficient Video Inpainting with Diffusion Transformers, [Paper]
- (arXiv 2025.04) DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer, [Paper]
- (arXiv 2025.04) AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation, [Paper],[Code]
- (arXiv 2025.04) GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.04) In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.05) JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.05) FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing, [Paper]
- (arXiv 2025.05) Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition, [Paper],[Code]
- (arXiv 2025.05) Generative Pre-trained Autoregressive Diffusion Transformer, [Paper]
- (arXiv 2025.05) TopoDiT-3D: Topology-Aware Diffusion Transformer with Bottleneck Structure for 3D Point Cloud Generation, [Paper],[Code]
- (arXiv 2025.05) Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis, [Paper],[Code]
- (arXiv 2025.05) No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves, [Paper],[Code]
- (arXiv 2025.05) FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation, [Paper],[Code]
- (arXiv 2025.06) HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.06) DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas, [Paper]
- (arXiv 2025.06) Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) Playing with Transformer at 30+ FPS via Next-Frame Diffusion, [Paper]
- (arXiv 2025.06) SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers, [Paper]
- (arXiv 2025.06) EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering, [Paper],[Code]
- (arXiv 2025.06) MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation, [Paper]
- (arXiv 2025.06) Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces, [Paper]
- (arXiv 2025.06) Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression, [Paper],[Code]
- (arXiv 2025.06) EraserDiT: Fast Video Inpainting with Diffusion Transformer Model, [Paper],[Code]
- (arXiv 2025.06) Emergent Temporal Correspondences from Video Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.06) Video Virtual Try-on with Conditional Diffusion Transformer Inpainter, [Paper]
- (arXiv 2025.06) XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation, [Paper],[Code]
- (arXiv 2025.06) TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation, [Paper]
- (arXiv 2025.06) OutDreamer: Video Outpainting with a Diffusion Transformer, [Paper]
- (arXiv 2025.07) Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers, [Paper]
- (arXiv 2025.07) FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.07) Taming Diffusion Transformer for Real-Time Mobile Video Generation, [Paper],[Code]
- (arXiv 2025.07) SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging, [Paper],[Code]
- (arXiv 2025.07) AnimeColor: Reference-based Animation Colorization with Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.08) LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.08) DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework, [Paper],[Code]
- (arXiv 2025.08) Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation, [Paper],[Code]
- (arXiv 2025.08) FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer, [Paper]
- (arXiv 2025.08) Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off, [Paper],[Code]
- (arXiv 2025.08) RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer, [Paper]
- (arXiv 2025.08) UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation, [Paper],[Code]
- (arXiv 2025.08) KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training, [Paper],[Code]
- (arXiv 2025.08) MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization, [Paper]
- (arXiv 2025.08) Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.08) Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing, [Paper],[Code]
- (arXiv 2025.08) LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation, [Paper],[Code]
- (arXiv 2025.08) Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.08) HiMat: DiT-based Ultra-High Resolution SVBRDF Generation, [Paper]
- (arXiv 2025.08) DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation, [Paper],[Code]
- (arXiv 2025.08) MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers, [Paper]
- (arXiv 2025.08) MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration, [Paper]
- (arXiv 2025.09) Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders, [Paper]
- (arXiv 2025.09) SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching, [Paper],[Code]
- (arXiv 2025.09) BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching, [Paper]
- (arXiv 2025.09) LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence, [Paper]
- (arXiv 2025.09) FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.09) Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.09) OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models, [Paper],[Code]
- (arXiv 2025.09) DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation, [Paper]
- (arXiv 2025.09) QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification, [Paper],[Code]
- (arXiv 2025.09) Stitch: Training-Free Position Control in Multimodal Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.09) LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing, [Paper]
- (arXiv 2025.09) DiTraj: Training-Free Trajectory Control for Video Diffusion Transformer, [Paper],[Code]
- (arXiv 2025.09) Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers, [Paper],[Code]
- (arXiv 2025.09) RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer, [Paper]
- (arXiv 2025.09) SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention, [Paper],[Code]
- (arXiv 2025.10) DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing, [Paper]