(Source: Make-A-Video, SimDA, PYoCo, SVD, Video LDM and Tune-A-Video)
- [News] The updated version is available on arXiv.
- [News] Our survey is accepted by ACM Computing Surveys (CSUR).
- [News] The Chinese translation is available on Zhihu. Special thanks to Dai-Wenxun for the translation.
If you have any suggestions or find our work helpful, feel free to contact us:
Homepage: Zhen Xing
Email: zhenxingfd@gmail.com
If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

    @article{xing2023survey,
      title={A survey on video diffusion models},
      author={Xing, Zhen and Feng, Qijun and Chen, Haoran and Dai, Qi and Hu, Han and Xu, Hang and Wu, Zuxuan and Jiang, Yu-Gang},
      journal={ACM Computing Surveys},
      year={2023},
      publisher={ACM New York, NY}
    }

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation |  |  |  | May, 2025 |
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition |  |  |  | CVPR, 2025 |
| ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation |  |  |  | NeurIPS, 2024 |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers |  |  |  | CVPR, 2024 |
| CelebV-Text: A Large-Scale Facial Text-Video Dataset |  |  | - | CVPR, 2023 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation |  |  | - | May, 2023 |
| VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation |  | - | - | May, 2023 |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |  | - | - | Nov, 2021 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |  | - | - | ICCV, 2021 |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language |  | - | - | CVPR, 2016 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation |  |  |  | May, 2025 |
| Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos |  |  | - | Jul., 2024 |
| ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation |  |  |  | NeurIPS, 2024 |
| STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models |  |  | - | ICLR, 2024 |
| Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment |  | - | - | Mar., 2024 |
| Towards A Better Metric for Text-to-Video Generation |  | - |  | Jan, 2024 |
| AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI |  | - | - | Jan, 2024 |
| VBench: Comprehensive Benchmark Suite for Video Generative Models |  |  |  | Nov, 2023 |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation |  | - | - | NeurIPS, 2023 |
| CVPR 2023 Text Guided Video Editing Competition |  | - | - | Oct., 2023 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models |  |  |  | Oct., 2023 |
| Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset |  | - | - | Sep., 2023 |
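Many of the benchmarks above (e.g., EvalCrafter and VBench) include a CLIP-based text-video alignment term among their metrics. As a rough illustration of that idea only, here is a minimal sketch that averages per-frame CLIP image-text similarity, assuming the public `openai/clip-vit-base-patch32` checkpoint and Hugging Face `transformers`; the papers above differ in backbone, frame sampling, and score aggregation.

```python
# Minimal sketch of CLIP-based text-video alignment scoring:
# average per-frame image-text cosine similarity over sampled frames.
# Illustrative only; not the exact protocol of any benchmark listed above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_video_score(frames: list[Image.Image], prompt: str) -> float:
    """Mean cosine similarity between one prompt and each video frame."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```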

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| Helios: Real-Time Long Video Generation Model |  |  |  | arXiv, 2026 |
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition |  |  |  | CVPR, 2025 |
| Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning |  |  |  | NeurIPS 2024 |
| Movie Gen |  | - |  | Oct, 2024 |
| CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer |  |  | - | Oct, 2024 |
| Grid Diffusion Models for Text-to-Video Generation |  |  |  | CVPR, 2024 |
| MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators |  |  |  | Apr., 2024 |
| Mora: Enabling Generalist Video Generation via A Multi-Agent Framework |  | - | - | Mar., 2024 |
| VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis |  | - | - | Mar., 2024 |
| Genie: Generative Interactive Environments |  | - |  | Feb., 2024 |
| Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis |  | - |  | Feb., 2024 |
| Lumiere: A Space-Time Diffusion Model for Video Generation |  | - |  | Jan, 2024 |
| UniVG: Towards Unified-modal Video Generation |  | - |  | Jan, 2024 |
| VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models |  |  |  | Jan, 2024 |
| 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model |  | - |  | Jan, 2024 |
| MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation |  | - |  | Jan, 2024 |
| VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM |  | - |  | Jan, 2024 |
| A Recipe for Scaling up Text-to-Video Generation with Text-free Videos |  |  |  | Dec, 2023 |
| InstructVideo: Instructing Video Diffusion Models with Human Feedback |  |  |  | Dec, 2023 |
| VideoLCM: Video Latent Consistency Model |  | - | - | Dec, 2023 |
| Photorealistic Video Generation with Diffusion Models |  | - |  | Dec, 2023 |
| Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation |  |  |  | Dec, 2023 |
| Delving Deep into Diffusion Transformers for Image and Video Generation |  | - |  | Dec, 2023 |
| StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter |  |  |  | Nov, 2023 |
| MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation |  | - |  | Nov, 2023 |
| ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models |  |  |  | Nov, 2023 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets |  |  |  | Nov, 2023 |
| FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline |  |  |  | Nov, 2023 |
| MoVideo: Motion-Aware Video Generation with Diffusion Models |  | - |  | Nov, 2023 |
| Make Pixels Dance: High-Dynamic Video Generation |  | - |  | Nov, 2023 |
| Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning |  | - |  | Nov, 2023 |
| Optimal Noise pursuit for Augmenting Text-to-Video Generation |  | - | - | Nov, 2023 |
| VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning |  | - |  | Nov, 2023 |
| VideoCrafter1: Open Diffusion Models for High-Quality Video Generation |  |  |  | Oct, 2023 |
| SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction |  |  |  | Oct, 2023 |
| DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors |  |  |  | Oct., 2023 |
| LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation |  |  |  | Oct., 2023 |
| DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model |  |  |  | Oct, 2023 |
| MotionDirector: Motion Customization of Text-to-Video Diffusion Models |  |  |  | Oct, 2023 |
| VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning |  |  |  | Sep., 2023 |
| Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation |  |  |  | Sep., 2023 |
| LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models |  |  |  | Sep., 2023 |
| Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation |  |  |  | Sep., 2023 |
| VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation |  | - |  | Sep., 2023 |
| MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text |  | - | - | Jul., 2023 |
| Text2Performer: Text-Driven Human Video Generation |  |  |  | Apr., 2023 |
| AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning |  |  |  | Jul., 2023 |
| Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models |  | - |  | Aug., 2023 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation |  |  |  | CVPR, 2024 |
| Dual-Stream Diffusion Net for Text-to-Video Generation |  | - | - | Aug., 2023 |
| ModelScope Text-to-Video Technical Report |  |  |  | Aug., 2023 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation |  |  | - | Jul., 2023 |
| VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation |  | - | - | May, 2023 |
| Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models |  | - |  | May, 2023 |
| Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models |  | - |  | CVPR, 2023 |
| Latent-Shift: Latent Diffusion with Temporal Shift |  | - |  | Apr., 2023 |
| Probabilistic Adaptation of Text-to-Video Models |  | - |  | Jun., 2023 |
| NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation |  | - |  | Mar., 2023 |
| ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation | - | - | - | IJCNN, 2023 |
| MagicVideo: Efficient Video Generation With Latent Diffusion Models |  | - |  | Nov., 2022 |
| Phenaki: Variable Length Video Generation From Open Domain Textual Description |  | - |  | ICLR, 2023 |
| Imagen Video: High Definition Video Generation With Diffusion Models |  | - |  | Oct., 2022 |
| VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |  |  |  | CVPR, 2023 |
| MAGVIT: Masked Generative Video Transformer |  | - |  | Dec., 2022 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data |  | - |  | ICLR, 2023 |
| Latent Video Diffusion Models for High-Fidelity Video Generation With Arbitrary Lengths |  |  |  | Nov., 2022 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers |  |  | - | May, 2022 |
| Video Diffusion Models |  | - |  | NeurIPS, 2022 |
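Several entries above ship open checkpoints. As one example, the ModelScope text-to-video model (technical report listed above) can be sampled through Hugging Face `diffusers`; a minimal sketch, assuming the `damo-vilab/text-to-video-ms-1.7b` weights, diffusers >= 0.25, and a CUDA GPU:

```python
# Minimal sketch: sampling from the open ModelScope T2V checkpoint via
# Hugging Face diffusers (assumes diffusers >= 0.25, a CUDA GPU, and the
# damo-vilab/text-to-video-ms-1.7b weights on the Hub).
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # fits the 1.7B model on smaller GPUs

frames = pipe("an astronaut riding a horse", num_frames=16).frames[0]
print(export_to_video(frames))  # writes an .mp4 and returns its path
```

Other open models in this table (e.g., AnimateDiff or CogVideoX) expose similar `diffusers` pipelines, each with its own pipeline class and checkpoint name.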

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO |  |  | - | May, 2025 |
| VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models |  |  |  | Mar, 2024 |
| TrailBlazer: Trajectory Control for Diffusion-Based Video Generation |  |  |  | Jan, 2024 |
| FreeInit: Bridging Initialization Gap in Video Diffusion Models |  |  |  | Dec, 2023 |
| MTVG: Multi-text Video Generation with Text-to-Video Models |  | - |  | Dec, 2023 |
| F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis |  | - | - | Nov, 2023 |
| AdaDiff: Adaptive Step Selection for Fast Diffusion |  | - | - | Nov, 2023 |
| FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax |  |  |  | Nov, 2023 |
| 🏀GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning |  |  |  | Nov, 2023 |
| FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling |  |  |  | Oct, 2023 |
| ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation |  |  |  | Oct, 2023 |
| LLM-grounded Video Diffusion Models |  |  |  | Oct, 2023 |
| Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator |  |  | - | NeurIPS, 2023 |
| DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis |  |  |  | Aug, 2023 |
| Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation |  |  | - | May, 2023 |
| Text2video-Zero: Text-to-Image Diffusion Models Are Zero-Shot Video Generators |  |  |  | Mar., 2023 |
| PEEKABOO: Interactive Video Generation via Masked-Diffusion 🫣 |  |  |  | CVPR, 2024 |
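Text2Video-Zero (listed above) is representative of this training-free line: it animates a frozen text-to-image model with cross-frame attention rather than training on video data. A minimal sketch via its `diffusers` pipeline, assuming a Stable Diffusion 1.5 checkpoint and a CUDA GPU:

```python
# Minimal sketch: Text2Video-Zero in diffusers, animating a frozen
# Stable Diffusion checkpoint with cross-frame attention (no video training).
# Assumes a CUDA GPU; writing .mp4 needs imageio + imageio-ffmpeg.
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = pipe(prompt="a panda dancing in the snow", video_length=8).images
frames = [(f * 255).astype("uint8") for f in frames]  # float [0,1] -> uint8
imageio.mimsave("video.mp4", frames, fps=4)
```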

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures |  | - |  | Feb., 2026 |
| EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses |  |  |  | CVPR 2026 |
| 🔥🔥StableAnimator: High-Quality Identity-Preserving Human Image Animation🔥🔥 |  |  |  | Nov., 2024 |
| MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model |  |  |  | ECCV 2024 |
| MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance |  |  |  | Jul., 2024 |
| Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance |  |  |  | Mar., 2024 |
| Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions |  | - | - | Mar., 2024 |
| Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons |  | - | - | Jan., 2024 |
| DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models |  | - |  | Dec., 2023 |
| MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |  |  |  | Nov., 2023 |
| Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation |  |  |  | Nov., 2023 |
| MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer |  |  |  | Nov., 2023 |
| DisCo: Disentangled Control for Referring Human Dance Generation in Real World |  |  |  | Jul., 2023 |
| Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model |  | - | - | Aug., 2023 |
| DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion |  |  |  | Apr., 2023 |
| Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos |  |  |  | Apr., 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance |  |  |  | Mar. 2025 |
| MotionClone: Training-Free Motion Cloning for Controllable Video Generation |  |  |  | Jun., 2024 |
| Tora: Trajectory-oriented Diffusion Transformer for Video Generation |  |  |  | CVPR 2025 |
| MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model |  |  |  | ECCV 2024 |
| Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance |  |  |  | Mar., 2024 |
| Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling |  | - | - | Jan., 2024 |
| Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation |  | - | - | Jan., 2024 |
| Customizing Motion in Text-to-Video Diffusion Models |  | - |  | Dec., 2023 |
| VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models |  |  |  | CVPR 2024 |
| AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance |  |  |  | Nov., 2023 |
| Motion-Conditioned Diffusion Model for Controllable Video Synthesis |  | - |  | Apr., 2023 |
| DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory |  | - | - | Aug., 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition |  |  |  | Nov., 2024 |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation |  |  |  | ECCV 2024 |
| TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models |  |  |  | CVPR 2024 |
| Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model |  |  |  | NeurIPS 2024 |
| Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation |  | - |  | Mar., 2024 |
| AtomoVideo: High Fidelity Image-to-Video Generation |  | - |  | Mar., 2024 |
| Animated Stickers: Bringing Stickers to Life with Video Diffusion |  | - | - | Feb., 2024 |
| CONSISTI2V: Enhancing Visual Consistency for Image-to-Video Generation |  | - |  | Feb., 2024 |
| I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models |  | - | - | Dec., 2023 |
| PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models |  | - |  | Dec., 2023 |
| DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance |  | - |  | Nov., 2023 |
| LivePhoto: Real Image Animation with Text-guided Motion Control |  |  |  | Nov., 2023 |
| VideoBooth: Diffusion-based Video Generation with Image Prompts |  |  |  | Nov., 2023 |
| Decouple Content and Motion for Conditional Image-to-Video Generation |  | - | - | Nov, 2023 |
| I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models |  | - | - | Nov, 2023 |
| Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image |  | - | - | MM, 2023 |
| Generative Image Dynamics |  | - |  | Sep., 2023 |
| LaMD: Latent Motion Diffusion for Video Generation |  | - | - | Apr., 2023 |
| Conditional Image-to-Video Generation with Latent Flow Diffusion Models |  |  | - | CVPR 2023 |
| NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis |  |  |  | CVPR 2022 |
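As a hands-on baseline for this image-to-video setting, Stable Video Diffusion's open image-to-video checkpoint can be driven through `diffusers`; a minimal sketch, assuming the `stabilityai/stable-video-diffusion-img2vid-xt` weights and a CUDA GPU:

```python
# Minimal sketch: image-to-video with Stable Video Diffusion via diffusers.
# Assumes the stabilityai/stable-video-diffusion-img2vid-xt weights and a CUDA GPU.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.enable_model_cpu_offload()

image = load_image("input.jpg").resize((1024, 576))  # SVD's native resolution
frames = pipe(image, decode_chunk_size=8).frames[0]  # smaller chunks save VRAM
export_to_video(frames, "generated.mp4", fps=7)
```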

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control |  | - | - | Mar., 2024 |
| Magic-Me: Identity-Specific Video Customized Diffusion |  | - |  | Feb., 2024 |
| InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions |  | - |  | Feb., 2024 |
| Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion |  | - |  | Feb., 2024 |
| Boximator: Generating Rich and Controllable Motions for Video Synthesis |  | - |  | Feb., 2024 |
| AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning |  | - | - | Jan., 2024 |
| ActAnywhere: Subject-Aware Video Background Generation |  | - |  | Jan., 2024 |
| CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects |  | - | - | Jan., 2024 |
| MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions |  |  |  | Jan., 2024 |
| PEEKABOO: Interactive Video Generation via Masked-Diffusion |  | - |  | Dec., 2023 |
| CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling |  | - | - | Dec., 2023 |
| Fine-grained Controllable Video Generation via Object Appearance and Context |  | - |  | Nov., 2023 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation |  | - |  | Nov., 2023 |
| Panacea: Panoramic and Controllable Video Generation for Autonomous Driving |  | - |  | Nov., 2023 |
| SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models |  | - |  | Nov., 2023 |
| VideoComposer: Compositional Video Synthesis with Motion Controllability |  |  |  | Jun., 2023 |
| NExT-GPT: Any-to-Any Multimodal LLM |  | - | - | Sep, 2023 |
| MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images |  | - |  | Jun, 2023 |
| Any-to-Any Generation via Composable Diffusion |  |  |  | May, 2023 |
| Mm-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation |  |  | - | CVPR 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction |  |  |  | CVPR, 2025 |
| AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction |  |  |  | Jun, 2024 |
| STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction |  |  | - | Dec, 2023 |
| Video Diffusion Models with Local-Global Context Guidance |  |  | - | IJCAI, 2023 |
| Seer: Language Instructed Video Prediction with Latent Diffusion Models |  | - |  | Mar., 2023 |
| MaskViT: Masked Visual Pre-Training for Video Prediction |  |  |  | Jun, 2022 |
| Diffusion Models for Video Prediction and Infilling |  |  |  | TMLR 2022 |
| McVd: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation |  |  |  | NeurIPS 2022 |
| Diffusion Probabilistic Modeling for Video Generation |  |  | - | Mar., 2022 |
| Flexible Diffusion Modeling of Long Videos |  |  |  | May, 2022 |
| Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models |  |  |  | May, 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing |  |  |  | Jun, 2024 |
| FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation |  | - | - | Mar., 2024 |
| FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing |  | - | - | Mar., 2024 |
| DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing |  | - |  | Mar, 2024 |
| Video Editing via Factorized Diffusion Distillation |  | - | - | Mar, 2024 |
| FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis |  |  |  | Dec, 2023 |
| MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers |  | - |  | Dec, 2023 |
| Neutral Editing Framework for Diffusion-based Video Editing |  | - |  | Dec, 2023 |
| VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence |  | - |  | Nov, 2023 |
| VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models |  |  |  | Nov, 2023 |
| Motion-Conditioned Image Animation for Video Editing |  | - |  | Nov, 2023 |
| MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation |  | - | - | Sep, 2023 |
| MagicEdit: High-Fidelity and Temporally Coherent Video Editing |  | - | - | Aug, 2023 |
| Edit Temporal-Consistent Videos with Image Diffusion Model |  | - | - | Aug, 2023 |
| Structure and Content-Guided Video Synthesis With Diffusion Models |  | - |  | ICCV, 2023 |
| Dreamix: Video Diffusion Models Are General Video Editors |  | - |  | Feb, 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| MVOC: a training-free multiple video object composition method with diffusion models |  |  |  | Jun, 2024 |
| VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing |  |  |  | Jun, 2024 |
| EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing |  |  |  | Mar., 2024 |
| UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing |  | - |  | Feb, 2024 |
| Object-Centric Diffusion for Efficient Video Editing |  | - | - | Jan, 2024 |
| RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing |  | - | - | Dec, 2023 |
| VidToMe: Video Token Merging for Zero-Shot Video Editing |  |  |  | Dec, 2023 |
| A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing |  |  |  | Dec, 2023 |
| AnimateZero: Video Diffusion Models are Zero-Shot Image Animators |  |  | - | Dec, 2023 |
| RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models |  |  |  | Dec, 2023 |
| BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models |  | - |  | Nov., 2023 |
| Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion |  | - | - | Nov., 2023 |
| FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier |  |  | - | Oct., 2023 |
| LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation |  | - | - | Nov., 2023 |
| Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models |  | - | - | Oct., 2023 |
| LOVECon: Text-driven Training-Free Long Video Editing with ControlNet |  |  | - | Oct., 2023 |
| FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing |  | - |  | Oct., 2023 |
| Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models |  |  |  | ICLR, 2024 |
| MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance |  | - | - | Aug., 2023 |
| EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints |  | - | - | Aug., 2023 |
| ControlVideo: Training-free Controllable Text-to-Video Generation |  |  | - | May, 2023 |
| TokenFlow: Consistent Diffusion Features for Consistent Video Editing |  |  |  | Jul., 2023 |
| VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing |  | - |  | Jun., 2023 |
| Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation |  | - |  | Jun., 2023 |
| Zero-Shot Video Editing Using Off-the-Shelf Image Diffusion Models |  |  |  | Mar., 2023 |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing |  |  |  | Mar., 2023 |
| Pix2video: Video Editing Using Image Diffusion |  | - |  | Mar., 2023 |
| InFusion: Inject and Attention Fusion for Multi Concept Zero Shot Text based Video Editing |  | - |  | Aug., 2023 |
| Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising |  |  |  | May, 2023 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models |  | - |  | Feb., 2024 |
| MotionCrafter: One-Shot Motion Customization of Diffusion Models |  |  | - | Dec., 2023 |
| DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing |  | - |  | Dec., 2023 |
| MotionEditor: Editing Video Motion via Content-Aware Diffusion |  |  |  | CVPR, 2024 |
| Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning |  | - |  | Nov., 2023 |
| Cut-and-Paste: Subject-Driven Video Editing with Attention Control |  | - | - | Nov, 2023 |
| StableVideo: Text-driven Consistency-aware Diffusion Video Editing |  |  |  | ICCV, 2023 |
| Shape-aware Text-driven Layered Video Editing |  | - | - | CVPR, 2023 |
| SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing |  |  | - | May, 2023 |
| Towards Consistent Video Editing with Text-to-Image Diffusion Models |  | - | - | Mar., 2023 |
| Edit-A-Video: Single Video Editing with Object-Aware Consistency |  | - |  | Mar., 2023 |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation |  |  |  | ICCV, 2023 |
| ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing |  |  |  | May, 2023 |
| Video-P2P: Video Editing with Cross-attention Control |  |  |  | Mar., 2023 |
| SinFusion: Training Diffusion Models on a Single Image or Video |  |  |  | Nov., 2022 |

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| EchoReel: Enhancing Action Generation of Existing Video Diffusion Models |  | - | - | Mar., 2024 |
| VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model |  | - | - | Mar., 2024 |
| SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion |  | - | - | Mar., 2024 |
| VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models |  | - | - | Mar., 2024 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation |  | - | - | Mar., 2024 |
| DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction |  | - | - | Mar., 2024 |
| Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval |  | - | - | Jan., 2024 |
| Diffusion Reward: Learning Rewards via Conditional Video Diffusion |  |  |  | Dec., 2023 |
| ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models |  | - |  | Nov., 2023 |
| Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models |  |  | - | Nov., 2023 |
| Flow-Guided Diffusion for Video Inpainting |  |  | - | Nov., 2023 |
| Breathing Life Into Sketches Using Text-to-Video Priors |  | - | - | Nov., 2023 |
| Infusion: Internal Diffusion for Video Inpainting |  | - | - | Nov., 2023 |
| DiffusionVMR: Diffusion Model for Video Moment Retrieval |  | - | - | Aug., 2023 |
| DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation |  | - | - | Aug., 2023 |
| CoTracker: It is Better to Track Together |  |  |  | Aug., 2023 |
| Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations |  | - | - | ICIAP, 2023 |
| Exploring Diffusion Models for Unsupervised Video Anomaly Detection |  | - | - | Apr., 2023 |
| Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection |  | - | - | ICCV, 2023 |
| Diffusion Action Segmentation |  | - | - | Mar., 2023 |
| DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion |  |  |  | Mar., 2023 |
| DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |  |  | - | ICCV, 2023 |
| MomentDiff: Generative Video Moment Retrieval from Random to Real |  |  |  | Jul., 2023 |
| Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition |  |  |  | Feb., 2023 |
| Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning |  | - | - | Nov., 2022 |
| A Generalist Framework for Panoptic Segmentation of Images and Videos |  |  |  | Oct., 2022 |
| DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models |  | - | - | Jul., 2023 |
| CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming |  | - | - | Mar., 2023 |
| Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition |  | - | - | Jul., 2023 |
| PDPP: Projected Diffusion for Procedure Planning in Instructional Videos |  |  | - | CVPR 2023 |