Survey of Video Diffusion Models: Foundations, Implementations, and Applications

February 2, 2026

Paper

Yimu Wang1,*, Xuye Liu1,*, Wei Pang1,*, Li Ma3,*, Shuai Yuan2,*, Paul Debevec3, Ning Yu3,†

1University of Waterloo, 2Duke University, 3Netflix Eyeline Studios
*Contributed Equally, †Corresponding Author

Abstract

In this survey and GitHub repository, we provide a comprehensive overview of recent advances in video diffusion models. We cover the foundations of video generative models, including GANs, auto-regressive models, and diffusion models. We also discuss the learning foundations, including classic denoising diffusion models, flow matching, and training-free methods. Additionally, we explore various architectures, including UNet and diffusion transformers. We discuss the applications of video diffusion models, including video generation, enhancement, personalization, and 3D-aware video generation. Finally, we highlight the benefits of video diffusion models to other domains, such as video representation learning and video retrieval.

Moreover, to facilitate the understanding of video diffusion models, we provide a cheatsheet including commonly used training datasets, training engineering techniques, and evaluation metrics. We also provide a list of video diffusion models in academia and industry.


Table of Contents

Foundations

Video generative paradigms

GAN video models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation | arXiv | Star | Website | ICCV 2023 |
| StyleLipSync: Style-based Personalized Lip-sync Video Generation | arXiv | - | Website | ICCV 2023 |
| SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning | arXiv | - | - | ICCV 2023 |
| AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars | arXiv | - | - | NeurIPS 2022 |
| StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 | arXiv | Star | Website | CVPR 2022 |
| Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | arXiv | Star | Website | ICLR 2022 |
| A Good Image Generator Is What You Need for High-Resolution Video Synthesis | arXiv | Star | Website | ICLR 2021 |
| Analyzing and Improving the Image Quality of StyleGAN | arXiv | Star | - | CVPR 2020 |
| Train Sparsely, Generate Densely: Memory-efficient Unsupervised Training of High-resolution Temporal GAN | arXiv | Star | - | IJCV 2020 |
| Adversarial Video Generation on Complex Datasets | arXiv | - | - | arXiv 2019 |
| MoCoGAN: Decomposing Motion and Content for Video Generation | arXiv | Star | - | CVPR 2018 |
| Temporal Generative Adversarial Nets with Singular Value Clipping | arXiv | Star | Website | ICCV 2017 |

Auto-regressive video models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | arXiv | Star | Website | ICLR 2023 |
| Single Image Video Prediction with Auto-Regressive GANs | arXiv | - | - | Sensors 2022 |
| HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator | arXiv | - | Website | ICIP 2022 |
| Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | arXiv | Star | Website | ECCV 2022 |
| VideoGPT: Video Generation using VQ-VAE and Transformers | arXiv | Star | Website | arXiv 2021 |
| Latent Video Transformer | arXiv | Star | - | arXiv 2020 |
| Parallel Multiscale Autoregressive Density Estimation | arXiv | - | - | ICML 2017 |
| Video Pixel Networks | arXiv | - | - | ICML 2017 |

Video diffusion models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| CCEdit: Creative and Controllable Video Editing via Diffusion Models | arXiv | - | - | arXiv 2024 |
| Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | arXiv | Star | Website | CVPR 2023 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | arXiv | Star | Website | arXiv 2022 |
| MagicVideo: Efficient Video Generation with Latent Diffusion Models | arXiv | - | Website | arXiv 2022 |
| Imagen Video: High Definition Video Generation with Diffusion Models | arXiv | - | - | arXiv 2022 |
| Video Diffusion Models | arXiv | Star | Website | arXiv 2022 |
| Cascaded Diffusion Models for High Fidelity Image Generation | arXiv | - | Website | JMLR 2022 |
| High-Resolution Image Synthesis with Latent Diffusion Models | arXiv | Star | - | CVPR 2022 |

Auto-regressive video diffusion models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| From Slow Bidirectional to Fast Causal Video Generators | arXiv | Star | Website | arXiv 2024 |
| Progressive Autoregressive Video Diffusion Models | arXiv | Star | Website | arXiv 2024 |
| Pyramidal Flow Matching for Efficient Video Generative Modeling | arXiv | Star | Website | arXiv 2024 |
| ART·V: Auto-Regressive Text-to-Video Generation with Diffusion Models | arXiv | Star | Website | CVPR 2024 |

Learning foundations

Classic denoising diffusion models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Elucidating the Design Space of Diffusion-Based Generative Models | arXiv | Star | - | NeurIPS 2022 |
| Denoising Diffusion Implicit Models | arXiv | Star | - | ICLR 2021 |
| Improved Denoising Diffusion Probabilistic Models | arXiv | Star | - | ICML 2021 |
| Denoising Diffusion Probabilistic Models | arXiv | Star | Website | NeurIPS 2020 |
| Deep Unsupervised Learning using Nonequilibrium Thermodynamics | arXiv | Star | - | ICML 2015 |

Flow matching and rectified flow

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | arXiv | Star | Website | NeurIPS 2023 |
| Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion | arXiv | - | - | ICASSP 2023 |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | arXiv | - | - | ICLR 2024 |
| Stochastic Interpolants: A Unifying Framework for Flows and Diffusions | arXiv | Star | - | arXiv 2023 |
| Flow Matching for Generative Modeling | arXiv | - | - | arXiv 2022 |

Learning from feedback and reward models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback | arXiv | - | - | arXiv 2024 |
| T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | arXiv | Star | Website | arXiv 2024 |
| T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback | arXiv | Star | Website | NeurIPS 2024 |
| InstructVideo: Instructing Video Diffusion Models with Human Feedback | arXiv | Star | Website | CVPR 2024 |
| Click to Move: Controlling Video Generation with Sparse Motion | arXiv | Star | - | ICCV 2021 |

One-shot and few-shot learning

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Make an Image Move: Few-Shot Based Video Generation Guided by CLIP | - | - | - | ICPR 2025 |
| LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | arXiv | Star | Website | arXiv 2023 |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | ICCV 2023 |

Training-free methods

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control | arXiv | - | Website | arXiv 2024 |
| VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models | arXiv | Star | Website | arXiv 2024 |
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | arXiv | Star | Website | CVPR 2024 |
| Peekaboo: Interactive Video Generation via Masked-Diffusion | arXiv | Star | Website | CVPR 2024 |
| ControlVideo: Training-free Controllable Text-to-Video Generation | arXiv | Star | Website | ICLR 2024 |
| Magic-Me: Identity-Specific Video Customized Diffusion | arXiv | Star | Website | arXiv 2024 |
| AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks | arXiv | Star | Website | arXiv 2024 |
| Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | arXiv | Star | Website | NeurIPS 2023 |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | arXiv | Star | - | ICCV 2023 |
| Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | arXiv | Star | Website | ICCV 2023 |

Token learning

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control | arXiv | - | Website | arXiv 2024 |
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | arXiv | - | - | arXiv 2023 |
| An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | arXiv | Star | Website | ICLR 2023 |

Guidances

Classifier guidance

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| CFG++: Manifold-Constrained Classifier Free Guidance for Diffusion Models | arXiv | Star | Website | arXiv 2024 |
| LLM-grounded Video Diffusion Models | arXiv | Star | Website | ICLR 2024 |
| Exploring Compositional Visual Generation with Latent Classifier Guidance | arXiv | - | - | CVPRW 2023 |
| Diffusion Models Beat GANs on Image Synthesis | arXiv | - | - | NeurIPS 2021 |

Classifier-free guidance

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Classifier-Free Diffusion Guidance | arXiv | Star | Website | 2022 |

Diffusion model frameworks

Pixel diffusion and latent diffusion

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Latte: Latent Diffusion Transformer for Video Generation | arXiv | Star | Website | TMLR 2025 |
| Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | arXiv 2023 |
| Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | arXiv | - | - | ICCV 2023 |
| ModelScope Text-to-Video Technical Report | arXiv | - | Website | arXiv 2023 |
| Structure and Content-Guided Video Synthesis with Diffusion Models | arXiv | - | - | ICCV 2023 |
| Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | arXiv | Star | Website | CVPR 2023 |
| Text-To-4D Dynamic Scene Generation | arXiv | - | Website | arXiv 2023 |

Optical-flow-based diffusion models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise | arXiv | Star | Website | CVPR 2025 |
| Infinite-Resolution Integral Noise Warping for Diffusion Models | arXiv | Star | Website | ICLR 2024 |
| Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling | arXiv | Star | Website | SIGGRAPH 2024 |
| A Dynamic Multi-Scale Voxel Flow Network for Video Prediction | arXiv | Star | Website | CVPR 2023 |
| FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation | arXiv | Star | - | arXiv 2023 |
| FlowFormer: A Transformer Architecture for Optical Flow | arXiv | Star | Website | arXiv 2022 |
| RAFT: Recurrent All-Pairs Field Transforms for Optical Flow | arXiv | Star | - | ECCV 2020 |
| LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation | arXiv | Star | - | CVPR 2018 |
| PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume | arXiv | Star | - | CVPR 2018 |
| FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks | arXiv | Star | - | arXiv 2016 |
| FlowNet: Learning Optical Flow with Convolutional Networks | arXiv | Star | - | ICCV 2015 |
| A Quantitative Analysis of Current Practices in Optical Flow Estimation and the Principles Behind Them | - | - | - | IJCV 2014 |
| Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods | - | - | - | IJCV 2005 |
| A Framework for the Robust Estimation of Optical Flow | - | - | - | ICCV 1993 |
| Determining Optical Flow | - | - | - | AI Journal 1981 |

Noise scheduling

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Rethinking the Noise Schedule of Diffusion-Based Generative Models | arXiv | - | - | ICLR 2024 |
| On the Importance of Noise Scheduling for Diffusion Models | arXiv | - | - | arXiv 2023 |
| simple diffusion: End-to-end diffusion for high resolution images | arXiv | Star | - | ICML 2023 |
| Elucidating the Design Space of Diffusion-Based Generative Models | arXiv | Star | - | NeurIPS 2022 |
| Improved Denoising Diffusion Probabilistic Models | arXiv | Star | - | ICML 2021 |
| Denoising Diffusion Probabilistic Models | arXiv | Star | Website | NeurIPS 2020 |
| Score-Based Generative Modeling through Stochastic Differential Equations | arXiv | Star | - | ICLR 2021 |

Agent-based diffusion models

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| VideoAgent: Self-Improving Video Generation | arXiv | Star | Website | ICLR 2025 |
| Mora: Enabling Generalist Video Generation via A Multi-Agent Framework | arXiv | Star | - | arXiv 2024 |
| MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | arXiv | Star | - | arXiv 2023 |
| AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework | arXiv | - | Website | arXiv 2023 |
| DriveGAN: Towards a Controllable High-Quality Neural Simulation | arXiv | Star | Website | CVPR 2021 |

Architectures

UNet

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| MoVideo: Motion-Aware Video Generation with Diffusion Models | arXiv | - | Website | ECCV 2024 |
| ModelScope Text-to-Video Technical Report | arXiv | - | - | arXiv 2023 |
| Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation | arXiv | - | Website | arXiv 2023 |
| MagicVideo: Efficient Video Generation with Latent Diffusion Models | arXiv | - | Website | arXiv 2022 |
| Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths | arXiv | - | - | arXiv 2022 |
| Imagen Video: High Definition Video Generation with Diffusion Models | arXiv | - | - | arXiv 2022 |
| Video Diffusion Models | arXiv | Star | Website | arXiv 2022 |
| Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | arXiv | - | - | NeurIPS 2022 |
| High-Resolution Image Synthesis with Latent Diffusion Models | arXiv | Star | - | CVPR 2022 |
| An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | arXiv | - | - | ICLR 2021 |
| U-Net: Convolutional Networks for Biomedical Image Segmentation | arXiv | Star | - | MICCAI 2015 |

Diffusion transformers

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Open-Sora: Democratizing Efficient Video Production for All | arXiv | Star | - | arXiv 2024 |
| From Slow Bidirectional to Fast Causal Video Generators | arXiv | Star | Website | arXiv 2024 |
| SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | arXiv | - | Website | arXiv 2024 |
| GenTron: Diffusion Transformers for Image and Video Generation | arXiv | - | Website | CVPR 2024 |
| VDT: General-purpose Video Diffusion Transformers via Mask Modeling | arXiv | Star | - | ICLR 2024 |
| Text2Performer: Text-Driven Human Video Generation | arXiv | Star | - | ICCV 2023 |
| Scalable Diffusion Models with Transformers | arXiv | - | Website | ICCV 2023 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | arXiv | - | - | ICLR 2023 |
| VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation | - | - | - | ICLR 2023 |
| ViViT: A Video Vision Transformer | arXiv | - | - | ICCV 2021 |

VAE for latent space compression

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| HunyuanVideo: A Systematic Framework for Large Video Generative Models | arXiv | - | - | arXiv 2025 |
| SkyReels V1: Human-Centric Video Foundation Model | - | Star | Website | arXiv 2025 |
| Magic 1-for-1: Generating One Minute Video Clips Within One Minute | arXiv | Star | Website | arXiv 2025 |
| Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | arXiv | Star | Website | arXiv 2025 |
| Latte: Latent Diffusion Transformer for Video Generation | arXiv | Star | Website | TMLR 2025 |
| AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models Without Specific Tuning | arXiv | Star | Website | ICLR 2024 |
| VideoPoet: A Large Language Model for Zero-Shot Video Generation | arXiv | - | Website | PMLR 2024 |
| LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | arXiv | Star | Website | IJCV 2024 |
| Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | arXiv 2023 |
| VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | arXiv | Star | Website | arXiv 2023 |
| Make-A-Video: Text-to-Video Generation Without Text-Video Data | arXiv | Star | Website | ICLR 2023 |
| Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions | arXiv | Star | Website | ICLR 2023 |
| Denoising Diffusion Probabilistic Models | arXiv | Star | - | NeurIPS 2020 |

Text encoders

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Magic 1-for-1: Generating One Minute Video Clips Within One Minute | arXiv | Star | Website | arXiv 2025 |
| SkyReels V1: Human-Centric Video Foundation Model | arXiv | Star | - | arXiv 2025 |
| HunyuanVideo: A Systematic Framework for Large Video Generative Models | arXiv | - | Website | arXiv 2025 |
| Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | arXiv | - | Website | arXiv 2025 |
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition | arXiv | Star | Website | arXiv 2024 |
| An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation | arXiv | - | - | arXiv 2024 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | arXiv | - | Website | arXiv 2024 |
| FIT: Flexible Vision Transformer for Diffusion Model | arXiv | Star | Website | arXiv 2024 |
| CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer | arXiv | Star | Website | arXiv 2024 |
| Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | arXiv | Star | Website | ICML 2024 |
| SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | arXiv | Star | Website | arXiv 2024 |
| Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis | arXiv | - | Website | arXiv 2024 |
| Open-Sora: Democratizing Efficient Video Production for All | arXiv | Star | Website | arXiv 2024 |
| Open-Sora-Plan | arXiv | Star | Website | arXiv 2024 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation | arXiv | Star | Website | CVPR 2024 |
| Latte: Latent Diffusion Transformer for Video Generation | arXiv | Star | Website | arXiv 2024 |
| Flux | arXiv | Star | Website | arXiv 2023 |
| Scalable Diffusion Models with Transformers | arXiv | Star | Website | ICCV 2023 |
| All are Worth Words: A ViT Backbone for Diffusion Models | arXiv | Star | Website | CVPR 2023 |
| Baichuan 2: Open Large-Scale Language Models | arXiv | Star | Website | arXiv 2023 |
| LLaMA 2: Open Foundation and Fine-Tuned Chat Models | arXiv | Star | Website | arXiv 2023 |
| LLaMA: Open and Efficient Foundation Language Models | arXiv | Star | Website | arXiv 2023 |
| ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models | arXiv | Star | - | TACL 2022 |
| Imagen Video: High Definition Video Generation with Diffusion Models | arXiv | - | Website | arXiv 2022 |
| Hierarchical Text-Conditional Image Generation with CLIP Latents | arXiv | - | Website | arXiv 2022 |
| High-Resolution Image Synthesis with Latent Diffusion Models | arXiv | Star | Website | CVPR 2022 |
| GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models | arXiv | Star | - | arXiv 2021 |
| GLM: General Language Model Pretraining with Autoregressive Blank Infilling | arXiv | Star | Website | arXiv 2021 |
| Learning Transferable Visual Models From Natural Language Supervision | arXiv | Star | Website | ICML 2021 |
| Zero-Shot Text-to-Image Generation | arXiv | - | - | ICML 2021 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | arXiv | Star | - | JMLR 2020 |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | arXiv | Star | Website | NAACL 2019 |

Implementation

Datasets

More datasets can be found on Pixabay, Mixkit, Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. Additional datasets include Midjourney V5.1 Cleaned Data, Unsplash-lite, AnimateBench, Pexels-400k, and LAION-AESTHETICS.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | arXiv | Star | Website | CVPR 2024 |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | arXiv | Star | Website | CVPR 2024 |
| InternVid: Learning Text-to-Video Generation from Web-scale Video-Text Data | arXiv | Star | Website | ICLR 2024 |
| MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions | arXiv | Star | Website | NeurIPS 2024 |
| VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models | arXiv | Star | Website | NeurIPS 2024 |
| Vript: A Video Is Worth Thousands of Words | arXiv | Star | - | NeurIPS 2024 |
| VideoCrafter2 | arXiv | Star | Website | arXiv 2024 |
| Open-Sora: Democratizing Efficient Video Production for All | arXiv | Star | Website | arXiv 2024 |
| Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | arXiv | Star | Website | ICCV 2023 |
| Temporally Consistent Transformers for Video Generation | arXiv | Star | Website | ICML 2023 |
| Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method | arXiv | Star | - | NeurIPS 2023 |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | arXiv | Star | - | NeurIPS 2023 |
| AIGCBench: Comprehensive evaluation of image-to-video content generated by AI | arXiv | Star | Website | TBench 2023 |
| AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling | arXiv | Star | - | TIP 2023 |
| Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation | arXiv | Star | - | arXiv 2023 |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | arXiv | Star | - | CVPR 2022 |
| The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting | arXiv | Star | - | CVPR 2022 |
| VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution | arXiv | - | Website | CVPR 2022 |
| Learning Audio-Video Modalities from Image Captions | arXiv | - | Website | ECCV 2022 |
| The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing | arXiv | Star | - | ECCV 2022 |
| Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | arXiv | Star | Website | NeurIPS 2022 |
| Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | arXiv | - | Website | TMLR 2022 |
| ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning | arXiv | Star | Website | ICCV 2021 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | arXiv | Star | Website | ICCV 2021 |
| MERLOT: Multimodal Neural Script Knowledge Models | arXiv | Star | Website | NeurIPS 2021 |
| Learning Video Representations from Textual Web Supervision | arXiv | - | - | arXiv 2020 |
| HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | arXiv | - | Website | ICCV 2019 |
| VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | arXiv | - | Website | ICCV 2019 |
| Towards Automatic Learning of Procedures from Web Instructional Videos | arXiv | - | Website | AAAI 2018 |
| How2: A Large-scale Dataset for Multimodal Language Understanding | arXiv | Star | Website | arXiv 2018 |
| Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | arXiv | Star | - | CVPR 2017 |
| Localizing Moments in Video with Natural Language | arXiv | Star | - | ICCV 2017 |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | - | - | Website | CVPR 2016 |
| ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | - | - | Website | CVPR 2015 |
| A Dataset for Movie Description | arXiv | - | - | CVPR 2015 |
| UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | arXiv | - | Website | arXiv 2012 |

Training engineering

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | arXiv | Star | Website | ICLR 2025 |
| SAM 2: Segment Anything in Images and Videos | arXiv | - | Website | ICLR 2025 |
| Motion Prompting: Controlling Video Generation with Motion Trajectories | arXiv | - | Website | CVPR 2025 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation | arXiv | Star | Website | CVPR 2024 |
| DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors | arXiv | Star | Website | ECCV 2024 |
| CogVLM2: Visual Language Models for Image and Video Understanding | arXiv | Star | - | arXiv 2024 |
| CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer | arXiv | Star | Website | arXiv 2024 |
| Open-Sora: Democratizing Efficient Video Production for All | arXiv | Star | Website | arXiv 2024 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | arXiv | - | Website | arXiv 2024 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | arXiv | Star | Website | arXiv 2024 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | arXiv | Star | - | ICLR 2023 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | arXiv | Star | - | arXiv 2023 |
| LLaMA: Open and Efficient Foundation Language Models | arXiv | Star | Website | arXiv 2023 |
| LLaMA 2: Open Foundation and Fine-Tuned Chat Models | arXiv | Star | Website | arXiv 2023 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | arXiv | Star | - | NeurIPS 2022 |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | arXiv | Star | - | NeurIPS 2022 |
| Visual Prompt Tuning | arXiv | Star | - | ECCV 2022 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv | Star | - | TMLR 2022 |
| ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | arXiv | - | - | Supercomputing 2020 |

Evaluation metrics and benchmarking findings

Papers are listed generally in reverse order of their publication timestamps.

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | arXiv | Star | Website | CVPR 2024 |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | arXiv | Star | Website | CVPR 2024 |
| STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models | arXiv | Star | - | ICLR 2024 |
| TC-Bench: Benchmarking Temporal Compositionality in Conditional Video Generation | arXiv | Star | - | arXiv 2024 |
| Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives | arXiv | Star | - | ICCV 2023 |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | arXiv | Star | - | NeurIPS 2023 |
| AIGCBench: Comprehensive evaluation of image-to-video content generated by AI | arXiv | Star | Website | TBench 2023 |
| Learning Transferable Visual Models From Natural Language Supervision | arXiv | Star | Website | ICML 2021 |
| RAFT: Recurrent All-Pairs Field Transforms for Optical Flow | arXiv | Star | - | ECCV 2020 |
| FVD: A new Metric for Video Generation | - | - | - | ICLR Workshop 2019 |
| GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium | arXiv | Star | - | NeurIPS 2017 |
| Blind video temporal consistency | - | - | - | TOG 2015 |

Industry models

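| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| Magic 1-for-1: Generating One Minute Video Clips Within One Minute | arXiv | Star | Website | arXiv 2025 |
| SkyReels V1: Human-Centric Video Foundation Model | arXiv | Star | - | arXiv 2025 |
| Step-Video-T2V | arXiv | - | Website | arXiv 2025 |
| HunyuanVideo | arXiv | - | Website | arXiv 2024 |
| Sora | - | - | Website | 2024 |
| STIV | arXiv | Star | Website | arXiv 2024 |
| LTX-Video | arXiv | Star | Website | arXiv 2024 |
| Allegro | arXiv | Star | Website | arXiv 2024 |
| Jimeng | arXiv | - | Website | arXiv 2024 |
| Mochi 1 | arXiv | - | Website | arXiv 2024 |
| EasyAnimate | arXiv | Star | Website | arXiv 2024 |
| Vidu | - | - | Website | 2024 |
| VideoCrafter2 | arXiv | Star | Website | arXiv 2024 |
| VideoCrafter1 | arXiv | Star | Website | arXiv 2023 |
| Mira | arXiv | - | Website | arXiv 2024 |
| Hailuo AI | - | - | Website | 2024 |
| Lumiere | arXiv | - | Website | arXiv 2024 |
| VideoPoet | arXiv | - | Website | arXiv 2023 |
| LumaAI Ray 2 | - | - | Website | 2024 |
| LumaAI Dream Machine | - | - | Website | 2023 |
| Veo-2 | - | - | Website | 2024 |
| Veo-1 | - | - | Website | 2023 |
| Nova Real | - | - | Website | 2024 |
| Wanx 2.1 | - | - | Website | 2024 |
| Kling | - | - | Website | 2024 |
| Show-1 | arXiv | Star | Website | NeurIPS 2023 |
| MovieGen | arXiv | - | Website | arXiv 2024 |
| Pika | - | - | Website | 2023 |
| Vchitect-2.0 | - | - | Website | 2024 |
| Optis | arXiv | Star | Website | NeurIPS 2023 |
| VLogger | arXiv | Star | Website | ICCV 2023 |
| Seine | arXiv | Star | Website | CVPR 2023 |
| LaVie | arXiv | Star | Website | ICCV 2023 |
| MiracleVision | - | - | Website | 2023 |
| Phenaki | arXiv | Star | Website | ICLR 2023 |
| W.A.L.T | arXiv | - | Website | arXiv 2024 |
| Imagen Video | arXiv | - | Website | 2022 |
| GEN-3 Alpha | - | - | Website | 2024 |
| GEN-2 | - | - | Website | 2023 |
| GEN-1 | - | - | Website | 2022 |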

Academia models

| Title | arXiv | GitHub | Website | Conference & Year |
| --- | --- | --- | --- | --- |
| RepVideo: Rethinking Cross-Layer Representation for Video Generation | arXiv | Star | Website | arXiv 2025 |
| CausVid: From Slow Bidirectional to Fast Causal Video Generators | arXiv | Star | Website | CVPR 2025 |
| Open-Sora Plan: Open-Source Large Video Generation Model | arXiv | Star | Website | arXiv 2024 |
| Open-Sora: Democratizing Efficient Video Production for All | arXiv | Star | Website | arXiv 2024 |
| Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | arXiv | - | Website | arXiv 2024 |
| SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers | arXiv | Star | Website | arXiv 2024 |
| VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | arXiv | Star | Website | COLM 2024 |
| AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning | arXiv | Star | Website | arXiv 2024 |
| I4VGen: Image as Free Stepping Stone for Text-to-Video Generation | arXiv | Star | Website | arXiv 2024 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation | arXiv | Star | Website | arXiv 2023 |
| AnimateDiff-v2 | arXiv | Star | Website | ICLR 2024 |
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | arXiv | Star | Website | arXiv 2023 |
| VideoGen: A Reference-Guided Latent Diffusion Approach for High-Definition Text-to-Video Generation | arXiv | - | Website | arXiv 2023 |
| Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs | arXiv | Star | - | arXiv 2023 |
| HiGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | arXiv | Star | - | arXiv 2023 |
| ModelScope Text-to-Video Technical Report | arXiv | - | Website | arXiv 2023 |
| InstructVideo: Instructing Video Diffusion Models with Human Feedback | arXiv | - | Website | CVPR 2024 |
| VideoComposer: Compositional Video Synthesis with Motion Controllability | arXiv | Star | Website | NeurIPS 2023 |
| VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | arXiv | - | - | CVPR 2023 |
| MAGVIT-v2: Language Model Beats Diffusion - Tokenizer is Key to Visual Generation | arXiv | Star | - | arXiv 2023 |
| MAGVIT: Masked Generative Video Transformer | arXiv | Star | - | arXiv 2022 |
| Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation | arXiv | - | Website | arXiv 2023 |
| Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | arXiv | Star | Website | CVPR 2023 |
| Video Diffusion Models | arXiv | Star | Website | arXiv 2022 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | arXiv | Star | Website | ICLR 2023 |
| MagicVideo: Efficient Video Generation With Latent Diffusion Models | arXiv | - | Website | arXiv 2022 |
| CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer | arXiv | Star | Website | arXiv 2024 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | arXiv | Star | Website | ICLR 2023 |
| VideoGPT: Video Generation using VQ-VAE and Transformers | arXiv | Star | Website | arXiv 2021 |

Applications

Conditions

Image condition

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
CogVideoX: Text-to-Video Diffusion Models with An Expert TransformerarXivStarWebsiteICLR 2025
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion ControlarXiv-WebsiteICLR 2025
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text GuidancearXivStarWebsiteICASSP 2025
EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak ConditionsarXiv-WebsiteECCV 2024
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion ModelsarXivStarWebsiteCVPR 2025
MagDiff: Multi-alignment Diffusion for High-Fidelity Video Generation and EditingarXivStarWebsiteECCV 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video GenerationarXivStarWebsiteTMLR
I2V-Adapter: A General Image-to-Video Adapter for Diffusion ModelsarXivStarWebsiteSIGGRAPH 2024
ID-Animator: Zero-Shot Identity-Preserving Human Video GenerationarXivStarWebsitearXiv 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video GenerationarXiv-WebsitearXiv 2024
Generative Image DynamicsarXiv-WebsiteCVPR 2024
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image ModelsarXivStarWebsiteCVPR 2024
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion ModelsarXiv-WebsiteCVPR 2024
AtomoVideo: High Fidelity Image-to-Video GenerationarXiv-WebsitearXiv 2024
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modelingarXivStarWebsiteSIGGRAPH 2024
Seer: Language Instructed Video Prediction with Latent Diffusion ModelsarXivStarWebsiteICLR 2024
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion GuidancearXivStarWebsitearXiv 2023
VideoBooth: Diffusion-based Video Generation with Image PromptsarXivStarWebsiteCVPR 2024
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion ModelsarXivStarWebsiteECCV 2024
DynamiCrafter: Animating Open-domain Images with Video Diffusion PriorsarXivStarWebsiteECCV 2024
Adding Conditional Control to Text-to-Image Diffusion ModelsarXivStarWebsiteICCV 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large DatasetsarXivStarWebsitearXiv 2023
Make Pixels Dance: High-Dynamic Video GenerationarXiv-WebsiteCVPR 2024
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion ModelsarXivStarWebsitearXiv 2023
VideoCrafter1: Open Diffusion Models for High-Quality Video GenerationarXivStarWebsitearXiv 2023
VDT: General-purpose Video Diffusion Transformers via Mask ModelingarXivStarWebsiteICLR 2024
VideoComposer: Compositional Video Synthesis with Motion ControllabilityarXivStarWebsiteNeurIPS 2023
Conditional Image-to-Video Generation with Latent Flow Diffusion ModelsarXivStarWebsiteCVPR 2023

Spatial condition

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video GenerationarXiv-WebsiteSIGGRAPH 2025
ObjCtrl-2.5D: Training-free Object Control with Camera PosesarXivStarWebsitearXiv 2024
Motion Prompting: Controlling Video Generation with Motion TrajectoriesarXiv-WebsiteCVPR 2025
SG-I2V: Self-Guided Trajectory Control in Image-to-Video GenerationarXivStarWebsiteICLR 2025
MVideo: Motion Control for Enhanced Complex Action Video GenerationarXiv-WebsitearXiv 2024
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion ControlarXiv-WebsiteICLR 2025
Tora: Trajectory-oriented Diffusion Transformer for Video GenerationarXivStarWebsiteCVPR 2025
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video DiffusionarXiv-WebsiteSIGGRAPH 2024
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion ModelsarXivStarWebsitearXiv 2024
MotionBooth: Motion-Aware Customized Text-to-Video GenerationarXivStarWebsiteNeurIPS 2024
DragAnything: Motion Control for Anything using Entity RepresentationarXivStarWebsiteECCV 2024
Boximator: Generating Rich and Controllable Motions for Video SynthesisarXiv-WebsitearXiv 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion ModelingarXivStarWebsiteSIGGRAPH 2024
PEEKABOO: Interactive Video Generation via Masked-DiffusionarXivStarWebsiteCVPR 2024
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion ModelsarXiv-WebsiteECCV 2024
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and TrajectoryarXivStarWebsitearXiv 2023

Camera parameter condition

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video GenerationarXiv-WebsiteSIGGRAPH 2025
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video GenerationarXivStarWebsiteICLR 2025
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion TransformersarXivStarWebsiteCVPR 2025
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated AttentionarXiv-WebsitearXiv 2024
VD3D: Taming Large Video Diffusion Transformers for 3D Camera ControlarXiv-WebsiteICLR 2025
CamCo: Camera-Controllable 3D-Consistent Image-to-Video GenerationarXiv-WebsitearXiv 2024
MotionBooth: Motion-Aware Customized Text-to-Video GenerationarXivStarWebsiteNeurIPS 2024
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera ControlarXivStarWebsiteNeurIPS 2024
CameraCtrl: Enabling Camera Control for Text-to-Video GenerationarXivStarWebsitearXiv 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object MotionarXivStarWebsitearXiv 2024
MotionCtrl: A Unified and Flexible Motion Controller for Video GenerationarXivStarWebsiteSIGGRAPH 2024

Audio condition

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation ModelsarXiv-WebsitearXiv 2025
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion EncodingarXivStarWebsiteACM MM 2024
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real TimearXiv-WebsiteNeurIPS 2024
EMOPortraits: Emotion-enhanced Multimodal One-shot Head AvatarsarXivStarWebsiteCVPR 2024
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and QuantizationarXiv--CVPR 2024
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait AnimationarXivStar-arXiv 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak ConditionsarXivStarWebsiteECCV 2024
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short VideoarXivStarWebsiteICCV 2023
Audio-Driven Co-Speech Gesture Video GenerationarXivStarWebsiteNeurIPS 2022
CCVS: Context-aware Controllable Video SynthesisarXivStarWebsiteNeurIPS 2021

High-level video condition

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
TokenFlow: Consistent Diffusion Features for Consistent Video EditingarXivStarWebsiteICLR 2024
MotionClone: Training-Free Motion Cloning for Controllable Video GenerationarXivStarWebsiteICLR 2025
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion ModelsarXivStarWebsiteSIGGRAPH Asia 2024
ReVideo: Remake a Video with Motion and Content ControlarXivStarWebsiteNeurIPS 2024
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion EncodingarXivStarWebsiteACM MM 2024
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing TasksarXivStarWebsiteTMLR 2024
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance EditingarXivStarWebsitearXiv 2024
VidToMe: Video Token Merging for Zero-Shot Video EditingarXivStarWebsiteCVPR 2024
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video SynthesisarXiv-WebsiteCVPR 2024
SAVE: Protagonist Diversification with Structure Agnostic Video EditingarXivStarWebsiteECCV 2024
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion ModelsarXivStarWebsiteCVPR 2024
DiffusionAtlas: High-Fidelity Consistent Diffusion Video EditingarXiv-WebsitearXiv 2023
DragVideo: Interactive Drag-style Video EditingarXivStarWebsiteECCV 2024
Drag-A-Video: Non-rigid Video Editing with Point-based InteractionarXivStarWebsitearXiv 2023
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point CorrespondencearXivStarWebsiteCVPR 2024
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video EditingarXivStarWebsiteCVPR 2024
Motion-Conditioned Image Animation for Video EditingarXivStarWebsitearXiv 2023
MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware DiffusionarXivStarWebsiteICML 2024
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character AnimationarXivStarWebsiteCVPR 2024
Consistent Video-to-Video Transfer Using Synthetic DatasetarXivStarWebsiteICLR 2024
MotionDirector: Motion Customization of Text-to-Video Diffusion ModelsarXivStarWebsiteECCV 2024
SimDA: Simple Diffusion Adapter for Efficient Video GenerationarXivStarWebsiteCVPR 2024
MagicEdit: High-Fidelity and Temporally Coherent Video EditingarXivStarWebsitearXiv 2023
CoDeF: Content Deformation Fields for Temporally Consistent Video ProcessingarXivStarWebsiteCVPR 2024
StableVideo: Text-driven Consistency-aware Diffusion Video EditingarXivStarWebsiteICCV 2023
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNetarXivStarWebsitearXiv 2023
VideoComposer: Compositional Video Synthesis with Motion ControllabilityarXivStarWebsiteNeurIPS 2023
Rerender A Video: Zero-Shot Text-Guided Video-to-Video TranslationarXivStarWebsiteSIGGRAPH Asia 2023
Video Colorization with Pre-trained Text-to-Image Diffusion ModelsarXivStarWebsitearXiv 2023
VidEdit: Zero-Shot and Spatially Aware Text-Driven Video EditingarXiv-WebsiteTMLR 2024
DisCo: Disentangled Control for Realistic Human Dance GenerationarXivStarWebsiteCVPR 2024
Towards Consistent Video Editing with Text-to-Image Diffusion ModelsarXiv--NeurIPS 2023
Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion ModelsarXiv---
ControlVideo: Training-free Controllable Text-to-Video GenerationarXivStarWebsiteICLR 2024
InstructVid2Vid: Controllable Video Editing with Natural Language InstructionsarXiv--arXiv 2023
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free VideosarXivStarWebsiteAAAI 2024
DreamPose: Fashion Image-to-Video Synthesis via Stable DiffusionarXivStarWebsiteICCV 2023
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion ModelsarXivStar-IEEE Trans On Multimedia, 2023
Pix2Video: Video Editing using Image DiffusionarXivStarWebsiteICCV 2023
Structure and Content-Guided Video Synthesis with Diffusion ModelsarXiv-WebsiteICCV 2023
Shape-aware Text-driven Layered Video EditingarXivStarWebsiteCVPR 2023
DPE: Disentanglement of Pose and Expression for General Video Portrait EditingarXivStarWebsiteCVPR 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video GenerationarXivStarWebsiteICCV 2023
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video EncodingarXivStarWebsiteCVPR 2023
Layered Neural Atlases for Consistent Video EditingarXivStarWebsiteSIGGRAPH Asia 2021

Other conditions

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Mind the Time: Temporally-Controlled Multi-Event Video GenerationarXiv-WebsiteCVPR 2025
PhysGen: Rigid-Body Physics-Grounded Image-to-Video GenerationarXivStarWebsiteECCV 2024
MotionCraft: Physics-based Zero-Shot Video GenerationarXivStarWebsiteAAAI 2025
VideoAgent: Long-form Video Understanding with Large Language Model as AgentarXivStarWebsiteECCV 2024
Synthetic Generation of Face Videos with Plethysmograph PhysiologyarXiv--CVPR 2022

Enhancement

Video denoising and deblurring

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Video Restoration Based on Deep Learning: A Comprehensive Survey---2022

Video inpainting

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
CoCoCo: Improving Text-Guided Video Inpainting for Better ConsistencyarXivStarWebsiteAAAI 2025
UniPaint: Unified Space-time Video Inpainting via Mixture-of-ExpertsarXiv--arXiv 2024
Semantically Consistent Video Inpainting with Conditional Diffusion ModelsarXiv--arXiv 2024
AVID: Any-Length Video Inpainting with Diffusion ModelarXivStarWebsiteCVPR 2024
Towards Language-Driven Video Inpainting via Multimodal Large Language ModelsarXivStarWebsiteCVPR 2024
Deep Learning-Based Image and Video Inpainting: A SurveyarXiv--arXiv 2024
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion ModelarXiv--CVPR 2023
RePaint: Inpainting using Denoising Diffusion Probabilistic ModelsarXivStar-CVPR 2022
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGANarXivStar-ICCV 2019
Generative Image Inpainting with Contextual AttentionarXivStar-CVPR 2018
Context Encoders: Feature Learning by InpaintingarXivStarWebsiteCVPR 2016

Video interpolation and extrapolation/prediction

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Video Interpolation with Diffusion ModelsarXiv-WebsiteCVPR 2024
ToonCrafter: Generative Cartoon InterpolationarXivStarWebsiteTOG 2024
LDMVFI: Video Frame Interpolation with Latent Diffusion ModelsarXivStarWebsiteAAAI 2024
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video GenerationarXivStarWebsiteCVPR 2023
MCVD - Masked Conditional Video Diffusion for PredictionarXivStarWebsiteNeurIPS 2022
Diffusion Models for Video Prediction and InfillingarXivStarWebsiteTMLR 2022

Video super-resolution

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
FreeScale: High-Resolution Video Generation with Cascaded Diffusion ModelsarXivStarWebsitearXiv 2024
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution AdaptationarXivStarWebsiteECCV 2024
VEnhancer: Generative Space-Time Enhancement for Video GenerationarXivStarWebsitearXiv 2024
Exploiting Diffusion Prior for Real-World Image Super-ResolutionarXivStarWebsiteIJCV 2024
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-ResolutionarXiv--CVPR 2024
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-ResolutionarXivStarWebsiteCVPR 2024
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-ResolutionarXiv--WACV Workshop 2024

Combining multiple video enhancement tasks

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
UniPaint: Unified Space-time Video Inpainting via Mixture-of-ExpertsarXiv--arXiv 2024
VEnhancer: Generative Space-Time Enhancement for Video GenerationarXivStarWebsitearXiv 2024
MCVD - Masked Conditional Video Diffusion for PredictionarXivStarWebsiteNeurIPS 2022

Personalization

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Dynamic Concepts Personalization from Single VideosarXiv-WebsiteSIGGRAPH 2025
VideoAlchemy: Open-set Personalization in Video GenerationarXivStarWebsiteCVPR 2025
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic DegradationarXivStarWebsitearXiv 2024
Identity-Preserving Text-to-Video Generation by Frequency DecompositionarXivStarWebsiteCVPR 2025
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion ControlarXiv-WebsiteICLR 2025
Still-Moving: Customized Video Generation without Customized Video DataarXivStarWebsiteACM TOG 2024
MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography EstimationarXivStar-CVPR 2024
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion EncodingarXivStarWebsiteACM MM 2024
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real TimearXiv-WebsiteNeurIPS 2024
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and QuantizationarXiv--CVPR 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak ConditionsarXivStarWebsiteECCV 2024
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text GuidancearXivStarWebsiteICASSP 2025
Audio-Driven Co-Speech Gesture Video GenerationarXivStarWebsiteNeurIPS 2022

Consistency

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion ModelsarXivStarWebsiteCVPR 2025
How I Warped Your Noise: A Temporally-Correlated Noise Prior for Diffusion Models--WebsiteICLR 2024
TokenFlow: Consistent Diffusion Features for Consistent Video EditingarXivStarWebsiteICLR 2024
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and PredictionarXivStarWebsiteICLR 2024
VideoBooth: Diffusion-based Video Generation with Image PromptsarXivStarWebsiteCVPR 2024
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion ModelsarXiv-WebsiteCVPR 2024
EMOPortraits: Emotion-enhanced Multimodal One-shot Head AvatarsarXivStarWebsiteCVPR 2024
CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video EditingarXivStar-CVPR 2024
VidToMe: Video Token Merging for Zero-Shot Video EditingarXivStarWebsiteCVPR 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak ConditionsarXivStarWebsiteECCV 2024
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video GenerationarXivStarWebsiteNeurIPS 2024
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video DiffusionarXiv-WebsiteSIGGRAPH 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion ModelingarXivStarWebsiteSIGGRAPH 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video GenerationarXivStarWebsiteTMLR 2024
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from TextarXivStarWebsitearXiv 2024
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait AnimationarXivStar-arXiv 2024
FlexiFilm: Long Video Generation with Flexible ConditionsarXivStarWebsitearXiv 2024
Towards Smooth Video CompositionarXivStarWebsiteICLR 2023
Conditional Image-to-Video Generation with Latent Flow Diffusion ModelsarXivStar-CVPR 2023
MoStGAN-V: Video Generation with Temporal Motion StylesarXivStar-CVPR 2023
MOSO: Decomposing MOtion, Scene and Object for Video PredictionarXivStar-CVPR 2023
StableVideo: Text-driven Consistency-aware Diffusion Video EditingarXivStarHuggingFace DemoICCV 2023
Preserve Your Own Correlation: A Noise Prior for Video Diffusion ModelsarXiv-WebsiteICCV 2023
SceneScape: Text-Driven Consistent Scene GenerationarXivStarWebsiteNeurIPS 2023
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodERarXivStar-NeurIPS 2023
VideoComposer: Compositional Video Synthesis with Motion ControllabilityarXivStarWebsiteNeurIPS 2023
DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic ModelsarXivStarWebsitearXiv 2023
Generating Videos with Dynamics-aware Implicit Generative Adversarial NetworksarXivStarWebsiteICLR 2022
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2arXivStarWebsiteCVPR 2022

Long video

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
MovieDreamer: Hierarchical Generation for Coherent Long Visual SequencesarXivStarWebsiteICLR 2025
FreeNoise: Tuning-Free Longer Video Diffusion via Noise ReschedulingarXivStarWebsiteICLR 2024
ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion ModelsarXivStar-arXiv 2024
Video-Infinity: Distributed Long Video GenerationarXivStar-arXiv 2024
Progressive Autoregressive Video Diffusion ModelsarXivStarWebsitearXiv 2024
MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video InfillingarXivStar-arXiv 2024
CogVideo: Large-scale Pretraining for Text-to-Video Generation via TransformersarXivStar-ICLR 2023
Towards Smooth Video CompositionarXivStarWebsiteICLR 2023
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video GenerationarXiv-WebsiteACL 2023
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-DenoisingarXivStarWebsitearXiv 2023
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2arXiv-WebsiteCVPR 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive TransformerarXivStarWebsiteECCV 2022
Generating Long Videos of Dynamic ScenesarXivStarWebsiteNeurIPS 2022
Flexible Diffusion Modeling of Long VideosarXiv-WebsiteNeurIPS 2022

3D-aware video diffusion

Training on 3D dataset

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video GenerationarXivStarWebsiteICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video GenerationarXiv-WebsiteSIGGRAPH 2025
HumanVid: Demystifying Training Data for Camera-controllable Human Image AnimationarXivStarWebsiteNeurIPS D&B 2024
Generating 3D-Consistent Videos from Unposed Internet PhotosarXiv-WebsiteCVPR 2025
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated AttentionarXiv-WebsitearXiv 2024
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D VideosarXivStarWebsiteCVPR 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion ModelsarXivStarWebsiteECCV 2024
Generative Camera Dolly: Extreme Monocular Dynamic Novel View SynthesisarXivStarWebsiteECCV 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video GenerationarXiv-WebsitearXiv 2024
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera ControlarXivStarWebsiteNeurIPS 2024
Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion ModelsarXivStarWebsiteNeurIPS 2024
V3D: Video Diffusion Models are Effective 3D GeneratorsarXivStarWebsitearXiv 2024
Sora Generates Videos with Stunning Geometrical ConsistencyarXivStarWebsitearXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D GenerationarXiv-WebsiteICML 2024
InternVid: Learning Text-to-Video Generation from Web-scale Video-Text DataarXivStarWebsiteICLR 2024
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D VisionarXivStarWebsiteCVPR 2024
Stable video diffusion: Scaling latent video diffusion models to large datasetsarXivStarWebsitearXiv 2023
Objaverse-XL: A Universe of 10M+ 3D ObjectsarXivStarWebsiteNeurIPS D&B 2023
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and GenerationarXivStarWebsiteCVPR 2023
MVImgNet: A Large-scale Dataset of Multi-view ImagesarXivStarWebsiteCVPR 2023
Objaverse: A Universe of Annotated 3D ObjectsarXivStarWebsiteCVPR 2023
Frozen in Time: A Joint Video and Image Encoder for End-to-End RetrievalarXivStarWebsiteICCV 2021
Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household ItemsarXiv-WebsiteICRA 2022
Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category ReconstructionarXivStarWebsiteICCV 2021
Stereo magnification: Learning view synthesis using multiplane imagesarXivStarWebsiteSIGGRAPH 2018

Architecture for 3D diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View ConsistencyarXivStarWebsiteICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video GenerationarXiv-WebsiteSIGGRAPH 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation ControlarXivStarWebsitearXiv 2025
Magic-Boost: Boost 3D Generation with Multi-View Conditioned DiffusionarXivStarWebsitearXiv 2025
Wonderland: Navigating 3D Scenes from a Single ImagearXivStarWebsitearXiv 2024
CamI2V: Camera-Controlled Image-to-Video Diffusion ModelarXivStarWebsitearXiv 2024
Generating 3D-Consistent Videos from Unposed Internet PhotosarXiv-WebsiteCVPR 2025
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated AttentionarXiv-WebsitearXiv 2024
ControlDreamer: Blending Geometry and Style in Text-to-3DarXivStarWebsiteBMVC 2024
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D GenerationarXiv-WebsiteECCV 2024
Vivid-ZOO: Multi-View Video Generation with Diffusion ModelarXivStarWebsiteNeurIPS 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video GenerationarXiv-WebsitearXiv 2024
CAT3D: Create Anything in 3D with Multi-View Diffusion ModelsarXiv-WebsiteNeurIPS 2024
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object ReconstructionarXiv-WebsiteECCV 2024
MVDream: Multi-view Diffusion for 3D GenerationarXivStarWebsiteICLR 2024
MVD-Fusion: Single-view 3D via Depth-consistent Multi-view GenerationarXivStarWebsiteCVPR 2024
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained DiffusionarXivStarWebsiteCVPR 2024
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content GenerationarXivStarWebsiteECCV 2024
SPAD: Spatially Aware Multi-View DiffusersarXivStarWebsiteCVPR 2024
MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware DiffusionarXivStarWebsiteNeurIPS 2023
ConsistNet: Enforcing 3D Consistency for Multi-view Images DiffusionarXiv-WebsiteCVPR 2024

Camera conditioning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Motion Prompting: Controlling Video Generation with Motion TrajectoriesarXiv-WebsiteCVPR 2025
VD3D: Taming Large Video Diffusion Transformers for 3D Camera ControlarXivStarWebsiteICLR 2025
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion TransformersarXivStarWebsiteCVPR 2025
CameraCtrl: Enabling Camera Control for Text-to-Video GenerationarXivStarWebsiteICLR 2025
I2VControl-Camera: Precise Video Camera Control with Adjustable Motion StrengtharXivStarWebsiteICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video GenerationarXiv-WebsiteSIGGRAPH 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation ControlarXivStarWebsitearXiv 2025
CAT4D: Create Anything in 4D with Multi-View Video Diffusion ModelsarXiv-WebsitearXiv 2024
Wonderland: Navigating 3D Scenes from a Single ImagearXivStarWebsitearXiv 2024
CamI2V: Camera-Controlled Image-to-Video Diffusion ModelarXivStarWebsitearXiv 2024
HumanVid: Demystifying Training Data for Camera-controllable Human Image AnimationarXivStarWebsiteNeurIPS D&B 2024
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated AttentionarXiv-WebsitearXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion ModelsarXivStarWebsiteMultimedia 2024
MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera ControlarXiv-WebsitearXiv 2024
MotionCtrl: A Unified and Flexible Motion Controller for Video GenerationarXivStarWebsiteSIGGRAPH 2024
Controlling Space and Time with Diffusion ModelsarXiv-WebsiteICLR 2025
Generative Camera Dolly: Extreme Monocular Dynamic Novel View SynthesisarXivStarWebsiteECCV 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video GenerationarXiv-WebsitearXiv 2024
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera ControlarXivStarWebsiteNeurIPS 2024
CAT3D: Create Anything in 3D with Multi-View Diffusion ModelsarXiv-WebsiteNeurIPS 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object MotionarXivStarWebsiteSIGGRAPH 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video DiffusionarXiv-WebsiteECCV 2024

Inference-time tricks

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View SynthesizerarXivStarWebsiteICLR 2025
Training-free Camera Control for Video GenerationarXiv-WebsiteICLR 2025
ViVid-1-to-3: Novel View Synthesis with Video Diffusion ModelsarXivStarWebsiteCVPR 2024

Benefits to other domains

Video representation learning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Deep Video Representation Learning: A SurveyarXiv--Multimedia Tools and Applications 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video RetrievalarXivStar-CVPR 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness RecognitionarXivStar-NeurIPS 2024
M²Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image GenerationarXivStarWebsitearXiv 2024
Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment-Star-CVPR Workshops 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-trainingarXiv--ICCV 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language ModelsarXiv--ICCV Workshop 2023
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only TrainingarXivStar-ICLR 2023
Visual Consensus Modeling for Video-Text Retrieval---AAAI 2022
End-to-End Referring Video Object Segmentation with Multimodal TransformersarXivStar-CVPR 2022
Language as Queries for Referring Video Object SegmentationarXivStar-CVPR 2022
Align and Prompt: Video-and-Language Pre-training with Entity PromptsarXivStar-CVPR 2022
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation---CVPR 2022
CenterCLIP: Token Clustering for Efficient Text-Video RetrievalarXivStar-SIGIR 2022
GL-RG: Global-Local Representation Granularity for Video CaptioningarXivStar-IJCAI 2022
Less is More: ClipBERT for Video-and-Language Learning via Sparse SamplingarXivStar-CVPR 2021
Learning Transferable Visual Models From Natural Language SupervisionarXivStar-ICML 2021
Multi-modal Transformer for Video RetrievalarXivStarWebsiteECCV 2020
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark-Star-ECCV 2020
Asymmetric 3D Convolutional Neural Networks for Action Recognition---Pattern Recognition 2019
Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM NetworksarXiv--TIP 2018
Deep Sequential Context Networks for Action Prediction---CVPR 2017
Attention Is All You NeedarXiv--NeurIPS 2017
Deep Residual Learning for Image RecognitionarXivStar-CVPR 2016
Spatio-Temporal LSTM with Trust Gates for 3D Human Action RecognitionarXivStar-ECCV 2016
Learning Spatiotemporal Features with 3D Convolutional NetworksarXiv--ICCV 2015
ImageNet Classification with Deep Convolutional Neural Networks---NeurIPS 2012
Long Short-Term Memory---Neural Comput., 1997
Neural Networks and Physical Systems with Emergent Collective Computational Abilities---PNAS, 1982

Video retrieval

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Towards Retrieval Augmented Generation over Large Video LibrariesarXiv--IEEE HSI 2024
Video Enriched Retrieval Augmented Generation Using Aligned Video CaptionsarXivStar-SIGIR Workshop 2024
GenTron: Diffusion Transformers for Image and Video GenerationarXiv-WebsiteCVPR 2024
iRAG: Advancing RAG for Videos with an Incremental ApproacharXiv--CIKM 2024
Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented GenerationarXivStar-SIGIR 2024
Multimodal Federated Learning via Contrastive Representation EnsemblearXivStar-ICLR 2023
SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant LearningarXiv--ICCV 2023
Animate-A-Story: Storytelling with Retrieval-Augmented Video GenerationarXiv--arXiv 2023
Visual Consensus Modeling for Video-Text Retrieval---AAAI 2022
FLAVA: A Foundational Language And Vision Alignment ModelarXivStarWebsiteCVPR 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video RetrievalarXivStarWebsiteCVPR 2022
Exposing the Limits of Video-Text Models through Contrast Sets-Star-NAACL 2022
CenterCLIP: Token Clustering for Efficient Text-Video RetrievalarXivStar-SIGIR 2022
Boosting Video-Text Retrieval with Explicit High-Level SemanticsarXiv--ACM MM 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text RetrievalarXivStar-ACM MM 2022
Cross-Modal Discrete Representation LearningarXiv--ACL 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video RetrievalarXivStar-ECCV 2022
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval and CaptioningarXivStar-Neurocomputing 2022
Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision---TMM 2022
Deep Unified Cross-Modality Hashing by Pairwise Data Alignment---IJCAI 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse SamplingarXivStar-CVPR 2021
MDMMT: Multidomain Multimodal Transformer for Video RetrievalarXivStar-CVPR Workshop 2021
TEACHTEXT: CrossModal Generalized Distillation for Text-Video RetrievalarXivStarWebsiteICCV 2021
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax LossarXiv--arXiv 2021
CLIP2TV: Align, Match and Distill for Video-Text RetrievalarXiv--arXiv 2021
ActBERT: Learning Global-Local Video-Text RepresentationsarXiv--CVPR 2020
End-to-End Learning of Visual Representations from Uncurated Instructional VideosarXivStarWebsiteCVPR 2020
Multi-modal Transformer for Video RetrievalarXivStarWebsiteECCV 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language TasksarXivStar-ECCV 2020
Searching Privately by Imperceptible Lying: A Novel Private Hashing Method with Differential Privacy---ACM MM 2020
Language Models are Few-Shot LearnersarXiv--NeurIPS 2020
Large-Scale Adversarial Training for Vision-and-Language Representation LearningarXiv--NeurIPS 2020
StyleGuide: Zero-Shot Sketch-Based Image Retrieval Using Style-Guided Image Generation---TMM 2020
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingarXiv--NAACL 2019
Dual Encoding for Zero-Example Video RetrievalarXivStar-CVPR 2019
Language Models are Unsupervised Multitask Learners-Star-OpenAI 2019
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question AnsweringarXiv--CVPR 2017

Video QA and captioning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language UnderstandingarXivStarWebsiteNeurIPS 2024
Video Question Answering: Datasets, Algorithms and ChallengesarXivStar-EMNLP 2022
Video Question Answering via Gradually Refined Attention over Appearance and Motion-Star-ACM MM 2017
Video Question Answering via Hierarchical Dual-Level Attention Network Learning---ACM MM 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question AnsweringarXivStar-CVPR 2017
Leveraging Video Descriptions to Learn Video Question AnsweringarXiv--AAAI 2017

3D and 4D generation

Video diffusion for 3D generation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Wonderland: Navigating 3D Scenes from a Single ImagearXivStarWebsitearXiv 2024
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion ModelarXivStarWebsitearXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion ModelsarXivStarWebsiteACM MM 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion ModelsarXivStarWebsiteECCV 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video DiffusionarXiv-WebsiteECCV 2024
V3D: Video Diffusion Models are Effective 3D GeneratorsarXivStarWebsitearXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D GenerationarXiv-WebsiteICML 2024

Video diffusion for 4D generation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View ConsistencyarXivStarWebsiteICLR 2025
CAT4D: Create Anything in 4D with Multi-View Video Diffusion ModelsarXiv-WebsitearXiv 2024
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion ModelsarXiv-WebsiteNeurIPS 2024
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian SplattingarXiv-WebsitearXiv 2024
4Diffusion: Multi-view Video Diffusion Model for 4D GenerationarXivStarWebsiteNeurIPS 2024
TC4D: Trajectory-Conditioned Text-to-4D GenerationarXivStarWebsiteECCV 2024
Animate3D: Animating Any 3D Model with Multi-view Video DiffusionarXivStarWebsiteNeurIPS 2024
Vivid-ZOO: Multi-View Video Generation with Diffusion ModelarXivStarWebsiteNeurIPS 2024
DreamGaussian4D: Generative 4D Gaussian SplattingarXivStarWebsitearXiv 2024
EG4D: Explicit Generation of 4D Object without Score DistillationarXivStarWebsiteICLR 2025
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian SurfelsarXivStarWebsiteNeurIPS 2024
Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion ModelsarXivStarWebsiteNeurIPS 2024
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation SamplingarXivStarWebsiteCVPR 2024
Dream-in-4D: A Unified Approach for Text- and Image-Guided 4D Scene GenerationarXivStarWebsiteCVPR 2024
STAG4D: Spatial-Temporal Anchored Generative 4D GaussiansarXivStarWebsiteECCV 2024
VideoMV: Consistent Multi-View Generation Based on Large Video Generative ModelarXivStarWebsitearXiv 2024
Animate124: Animating One Image to 4D Dynamic ScenearXivStarWebsitearXiv 2024
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion ModelsarXiv-WebsiteCVPR 2024
Text-To-4D Dynamic Scene GenerationarXiv-WebsiteICML 2023

Ethical considerations

Deepfake and misinformation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated ContentarXiv--CVPR 2025
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak AttacksarXiv--arXiv 2025
AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video AvatarsarXiv--NeurIPS 2022
Multi-attentional Deepfake DetectionarXivStar-CVPR 2021
On the Detection of Digital Face ManipulationarXiv--CVPR 2020
FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human FacesarXiv--CVPR 2018

Content and privacy

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright ProtectionarXiv--ACM MM 2025
Beyond Public Access in LLM Pre-Training DataarXivStar-arXiv 2025
Investigating Memorization in Video Diffusion ModelarXiv--arXiv 2024

Bias and representation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Gender Bias in Text-to-Video Generation Models: A case study of SoraarXiv--TIS 2025
Bias and Fairness in Large Language Models: A SurveyarXiv--CL 2024
Investigating Memorization in Video Diffusion ModelarXiv--arXiv 2024

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Regulation and NLP (RegNLP): Taming Large Language ModelsarXiv--ACM MM 2025
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright ProtectionarXiv--ACM MM 2025
Investigating Memorization in Video Diffusion ModelarXiv--arXiv 2024

Transparency and disclosure

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
VideoShield: Regulating Diffusion-based Video Generation Models via WatermarkingarXivStar-ICLR 2025
PostMark: A Robust Blackbox Watermark for Large Language ModelsarXivStar-EMNLP 2024
Advancing Beyond Identification: Multi-bit Watermark for Large Language ModelsarXivStar-NAACL 2024
Investigating Memorization in Video Diffusion ModelarXiv--arXiv 2024

Quality control and safety

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video GenerationarXivStar-ICLR 2025
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak AttacksarXiv--arXiv 2025
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video ModelsarXiv--arXiv 2025
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language ModelsarXiv--arXiv 2023

Computational resources and environmental impact

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMsarXiv--Findings of EMNLP 2024

Benchmark datasets

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
BadVideo: Stealthy Backdoor Attack against Text-to-Video GenerationarXivStarWebsiteICCV 2025
Towards Understanding Unsafe Video GenerationarXivStar-NDSS 2025
SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent ExplanationsarXivStarWebsiteICLR 2025
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness RecognitionarXivStar-NeurIPS 2024
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative ModelsarXivStar-NeurIPS 2024

Citation

If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@misc{wang2025surveyvideodiffusionmodels,
      title={Survey of Video Diffusion Models: Foundations, Implementations, and Applications}, 
      author={Yimu Wang and Xuye Liu and Wei Pang and Li Ma and Shuai Yuan and Paul Debevec and Ning Yu},
      year={2025},
      eprint={2504.16081},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16081}, 
}

Acknowledgement

The format of this repo is based on Awesome-Video-Diffusion-Models.