
Video Generation Survey

A curated reading list for video generation.

Open-Sora repositories

[2024.03] HPC-AI Open-Sora

[2024.03] PKU Open-Sora Plan

Related surveys

Awesome-Video-Diffusion-Models

Awesome-Text-to-Image

👉 Models to play with

Open source

Non-open source

Translation

  • Goenhance.ai [Page]

👉 Databases

  • HowTo100M

    [ICCV 2019] Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [PDF, Project]

  • HD-VILA-100M

    [CVPR 2022] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [PDF, Page]

  • Web10M

    [ICCV 2021] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [PDF, Project]

  • UCF-101

    [arxiv 2012] UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild [PDF, Project]

  • Sky Time-lapse

    [CVPR 2018] Learning to Generate Time-lapse Videos Using Multi-stage Dynamic Generative Adversarial Networks [PDF, Project]

  • TaiChi

    [NIPS 2019] First Order Motion Model for Image Animation [PDF, Project]

  • CelebV-Text

    [arxiv] CelebV-Text: A Large-Scale Facial Text-Video Dataset [PDF, Page]

  • Youku-mPLUG

    [arxiv 2023.06] Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [PDF]

  • InternVid

    [arxiv 2023.07] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [PDF]

  • DNA-Rendering

    [arxiv 2023.07] DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering [PDF]

  • Vimeo25M (not open-source)

    [arxiv 2023.09] LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models [PDF]

  • HD-VG-130M

    [arxiv 2023.06] VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [PDF, Page]

  • Panda-70M

    [arxiv 2024.03] A Large-Scale Dataset with 70M High-Quality Video-Caption Pairs [PDF, Page]

👉 GAN/VAE-based methods

[NIPS 2016] ---VGAN--- Generating Videos with Scene Dynamics [PDF, code ]

[ICCV 2017] ---TGAN--- Temporal Generative Adversarial Nets with Singular Value Clipping [PDF, code ]

[CVPR 2018] ---MoCoGAN--- MoCoGAN: Decomposing Motion and Content for Video Generation [PDF, code ]

[NIPS 2018] ---SVG--- Stochastic Video Generation with a Learned Prior [PDF, code ]

[ECCV 2018] Probabilistic Video Generation using Holistic Attribute Control [PDF, code]

[CVPR 2019; CVL ETH] ---SWGAN--- Sliced Wasserstein Generative Models [PDF, code ]

[NIPS 2019; NVLabs] ---vid2vid--- Few-shot Video-to-Video Synthesis [PDF, code ]

[arxiv 2020; DeepMind] ---DVD-GAN--- Adversarial Video Generation on Complex Datasets [PDF, code]

[IJCV 2020] ---TGANv2--- Train Sparsely, Generate Densely: Memory-efficient Unsupervised Training of High-resolution Temporal GAN [PDF, code ]

[PMLR 2021] ---TGANv2-ODE--- Latent Neural Differential Equations for Video Generation [PDF, code ]

[ICLR 2021 ] ---DVG--- Diverse Video Generation using a Gaussian Process Trigger [PDF, code ]

[arxiv 2021; MSRA] ---GODIVA--- GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions [PDF, code]

*[CVPR 2022] ---StyleGAN-V--- StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 [PDF, code]

*[NeurIPS 2022] ---MCVD--- MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation [PDF, code]

👉 Implicit Neural Representations

[ICLR 2022] Generating videos with dynamics-aware implicit generative adversarial networks [PDF, code ]

👉 Transformer-based

[arxiv 2021] ---VideoGPT--- VideoGPT: Video Generation using VQ-VAE and Transformers [PDF, code]

[ECCV 2022; Microsoft] ---NÜWA--- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [PDF, code]

[NIPS 2022; Microsoft] ---NÜWA-Infinity--- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis [PDF, code]

[arxiv 2022; Tsinghua] ---CogVideo--- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [PDF, code]

*[ECCV 2022] ---TATS--- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [PDF, code]

*[arxiv 2022; Google] ---PHENAKI--- Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions [PDF, code]

[arxiv 2022.12] MAGVIT: Masked Generative Video Transformer [PDF]

[arxiv 2023.11] Optimal Noise Pursuit for Augmenting Text-to-Video Generation [PDF]

[arxiv 2024.01] WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens [PDF, Page]

👉 Diffusion-based methods

*[NIPS 2022; Google] ---VDM--- Video Diffusion Models [PDF, code]

*[arxiv 2022; Meta] ---Make-A-Video--- Make-A-Video: Text-to-Video Generation without Text-Video Data [PDF, code]

*[arxiv 2022; Google] ---Imagen Video--- Imagen Video: High Definition Video Generation with Diffusion Models [PDF, code]

*[arxiv 2022; ByteDance] ---MagicVideo--- MagicVideo: Efficient Video Generation With Latent Diffusion Models [PDF, code]

*[arxiv 2022; Tencent] ---LVDM--- Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths [PDF, code]

[AAAI 2022; JHU] VIDM: Video Implicit Diffusion Model [PDF]

[arxiv 2023.01; Meta] Text-To-4D Dynamic Scene Generation [PDF, Page]

[arxiv 2023.03]Video Probabilistic Diffusion Models in Projected Latent Space [PDF, Page]

[arxiv 2023.03]Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE [PDF]

[arxiv 2023.03]Decomposed Diffusion Models for High-Quality Video Generation [PDF]

[arxiv 2023.03]NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [PDF]

*[arxiv 2023.04]Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [PDF]

*[arxiv 2023.04]Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [PDF, Page]

[arxiv 2023.04]LaMD: Latent Motion Diffusion for Video Generation [PDF]

*[arxiv 2023.05]Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models[PDF, Page]

[arxiv 2023.05]VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [PDF]

[arxiv 2023.08]ModelScope Text-to-Video Technical Report [PDF]

[arxiv 2023.08]Dual-Stream Diffusion Net for Text-to-Video Generation [PDF]

[arxiv 2023.08]SimDA: Simple Diffusion Adapter for Efficient Video Generation [PDF, Page]

[arxiv 2023.08]Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models [PDF, Page]

[arxiv 2023.09]Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation[PDF,Page]

[arxiv 2023.09] LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models [PDF, Page]

[arxiv 2023.09]VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [PDF, Page]

[arxiv 2023.10]Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation [PDF, Page]

[arxiv 2023.10]LLM-grounded Video Diffusion Models [PDF,Page]

[arxiv 2023.10]VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [PDF,Page]

[arxiv 2023.11]Make Pixels Dance: High-Dynamic Video Generation [PDF, Page]

[arxiv 2023.11]Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets[PDF, Page]

[arxiv 2023.11]Kandinsky Video [PDF,Page]

[arxiv 2023.12]GenDeF: Learning Generative Deformation Field for Video Generation [PDF,Page]

[arxiv 2023.12]GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [PDF,Page]

[arxiv 2023.12]Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [PDF, Page]

[arxiv 2023.12]AnimateZero:Video Diffusion Models are Zero-Shot Image Animators [PDF,Page]

[arxiv 2023.12]Photorealistic Video Generation with Diffusion Models [PDF,Page]

[arxiv 2023.12] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [PDF, Page]

[arxiv 2023.12]MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation [PDF, Page]

[arxiv 2024.01] Latte: Latent Diffusion Transformer for Video Generation [PDF, Page]

[arxiv 2024.01] VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models [PDF, Page]

[arxiv 2024.01] Lumiere: A Space-Time Diffusion Model for Video Generation [PDF, Page]

[arxiv 2024.02]Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [PDF]

[arxiv 2024.02]Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis[PDF,Page]

[arxiv 2024.03]Mora: Enabling Generalist Video Generation via A Multi-Agent Framework[PDF]

[arxiv 2024.03]Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [PDF,Page]

[arxiv 2024.04]Grid Diffusion Models for Text-to-Video Generation [PDF]

[arxiv 2024.04]MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators [PDF]

[arxiv 2024.05]Matten: Video Generation with Mamba-Attention [PDF]

[arxiv 2024.05]Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models [PDF,Page]

LLM-based

[arxiv 2023.12]VideoPoet: A Large Language Model for Zero-Shot Video Generation [PDF,Page]

State Space-based

[arxiv 2024.03]SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces [PDF,Page]

Improving Video Diffusion Models

[arxiv 2023.10]ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models [PDF, Page]

[arxiv 2023.10]FreeU: Free Lunch in Diffusion U-Net [PDF, Page]

[arxiv 2023.12]FreeInit: Bridging Initialization Gap in Video Diffusion Models [PDF,Page]

Multi-prompt

[arxiv 2023.12]MTVG : Multi-text Video Generation with Text-to-Video Models [PDF]

[arxiv 2024.05]TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation [PDF,Page]

Long Video Generation

[arxiv 2023.05] Gen-L-Video: Long Video Generation via Temporal Co-Denoising [PDF, Page]

[arxiv 2023.10]FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling [PDF,Page]

[arxiv 2023.12]VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models[PDF,Page]

[arxiv 2023.12]AVID: Any-Length Video Inpainting with Diffusion Model [PDF,Page]

[arxiv 2023.12]RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing [PDF]

[arxiv 2024.03]VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis [PDF,Page]

[arxiv 2024.03]StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [PDF]

[arxiv 2024.04]FlexiFilm: Long Video Generation with Flexible Conditions [PDF]

Higher Resolution

[arxiv 2023.10] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models [PDF, Page]

Infinite Scene / 360°

[arxiv 2023.12]Going from Anywhere to Everywhere[PDF,Page]

[arxiv 2024.01] 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model [PDF]

Video Story

[arxiv 2023.05]TaleCrafter: Interactive Story Visualization with Multiple Characters [PDF, Page]

[arxiv 2023.07]Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [PDF, Page]

[arxiv 2024.01]VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM [PDF, Page]

[arxiv 2024.01]Vlogger: Make Your Dream A Vlog [PDF,Page]

[arxiv 2024.03]AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production [PDF,Page]

[arxiv 2024.04]StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [PDF,Page]

👉 Controllable Video Generation

*[arxiv 2023.04]Motion-Conditioned Diffusion Model for Controllable Video Synthesis [PDF, Page]

[arxiv 2023.06]Video Diffusion Models with Local-Global Context Guidance [PDF]

[arxiv 2023.06]VideoComposer: Compositional Video Synthesis with Motion Controllability [PDF]

[arxiv 2023.07]Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [PDF, Page]

[arxiv 2023.10]MotionDirector: Motion Customization of Text-to-Video Diffusion Models [PDF,Page]

[arxiv 2023.11]MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer [PDF, Page]

[arxiv 2023.11]Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer[PDF,Page]

[arxiv 2023.11]SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models[PDF, Page]

[arxiv 2023.11]MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model [PDF, Page]

[arxiv 2023.12]DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models [PDF, Page]

[arxiv 2023.12]Fine-grained Controllable Video Generation via Object Appearance and Context [PDF,Page]

[arxiv 2023.12]Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [PDF,Page]

[arxiv 2023.12]Peekaboo: Interactive Video Generation via Masked-Diffusion [PDF,Page]

[arxiv 2023.12]InstructVideo: Instructing Video Diffusion Models with Human Feedback [PDF,Page]

[arxiv 2024.01]Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation[PDF]

[arxiv 2024.01] Synthesizing Moving People with 3D Control [PDF, Page]

[arxiv 2024.02]Boximator: Generating Rich and Controllable Motions for Video Synthesis [PDF,Page]

[arxiv 2024.02]InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions [PDF,Page]

[arxiv 2024.02]EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions[PDF,Page]

[arxiv 2024.03]Animate Your Motion: Turning Still Images into Dynamic Videos [PDF,Page]

[arxiv 2024.03]Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance [PDF,Page]

[arxiv 2024.03]Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [PDF, Page]

[arxiv 2024.04]Motion Inversion for Video Customization [PDF,Page]

[arxiv 2023.12]Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [PDF,Page]

Camera

[arxiv 2023.12]MotionCtrl: A Unified and Flexible Motion Controller for Video Generation [PDF,Page]

[arxiv 2024.02]Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [PDF,Page]

[arxiv 2024.04]CameraCtrl: Enabling Camera Control for Text-to-Video Generation [PDF,Page]

[arxiv 2024.04]Customizing Text-to-Image Diffusion with Camera Viewpoint Control [PDF,Page]

[arxiv 2024.04]MotionMaster: Training-free Camera Motion Transfer For Video Generation[PDF]

Video in/outpainting

[MM 2023.09]Hierarchical Masked 3D Diffusion Model for Video Outpainting [PDF]

[arxiv 2023.11]Flow-Guided Diffusion for Video Inpainting [PDF]

[arxiv 2024.01]ActAnywhere: Subject-Aware Video Background Generation [PDF, Page]

[arxiv 2024.03]CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility [PDF,Page]

[arxiv 2024.03]Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation [PDF,Page]

[arxiv 2024.04]AudioScenic: Audio-Driven Video Scene Editing [PDF]

[arxiv 2024.05] Semantically Consistent Video Inpainting with Conditional Diffusion Models [[PDF](https://arxiv.org/abs/2405.00251)]

Video Quality

[arxiv 2024.03] VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models [PDF, Page]

Video SR

[arxiv 2023.11]Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models [PDF]

[arxiv 2023.12]Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [PDF,Page]

[arxiv 2023.12]Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements [PDF,Page]

[arxiv 2024.03]Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution [PDF]

[arxiv 2024.04]VideoGigaGAN: Towards Detail-rich Video Super-Resolution [PDF, Page]

Downstream Applications

[arxiv 2023.11]Breathing Life Into Sketches Using Text-to-Video Priors [PDF,Page]

[arxiv 2023.11]Flow-Guided Diffusion for Video Inpainting [PDF]

[arxiv 2024.02]Animated Stickers: Bringing Stickers to Life with Video Diffusion [PDF]

[arxiv 2024.03]DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation [PDF,Page]

[arxiv 2024.03]Intention-driven Ego-to-Exo Video Generation [PDF]

[arxiv 2024.04]PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation [PDF,Page]

[arxiv 2024.04]Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos [PDF,Page]

Video Concept

[arxiv 2023.07]Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [PDF, Page]

[arxiv 2023.11]VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning[PDF,Page]

[arxiv 2023.12]VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model [PDF,Page]

[arxiv 2023.12]VideoBooth: Diffusion-based Video Generation with Image Prompts [PDF,Page]

[arxiv 2023.12]DreamVideo: Composing Your Dream Videos with Customized Subject and Motion [PDF,Page]

[arxiv 2023.12]PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models [PDF]

[arxiv 2024.01]CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [PDF]

[arxiv 2024.02] Magic-Me: Identity-Specific Video Customized Diffusion [PDF, Page]

[arxiv 2024.03]EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing [PDF,Page]

[arxiv 2024.04]AniClipart: Clipart Animation with Text-to-Video Priors [PDF,Page]

[arxiv 2024.04]ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [PDF,Page]

Image-to-video Generation

[arxiv 2023.09]VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [PDF]

[arxiv 2023.09]Generative Image Dynamics [PDF,Page]

[arxiv 2023.10]DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [PDF, Page]

[arxiv 2023.11]SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [PDF,Page]

[arxiv 2023.11]I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [PDF,Page]

[arxiv 2023.11]Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [PDF,Page]

[arxiv 2023.11]MoVideo: Motion-Aware Video Generation with Diffusion Models[PDF,Page]

[arxiv 2023.11]Make Pixels Dance: High-Dynamic Video Generation[PDF,Page]

[arxiv 2023.11]Decouple Content and Motion for Conditional Image-to-Video Generation [PDF]

[arxiv 2023.12]ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models [PDF, Page]

[arxiv 2023.12]MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation [PDF, Page]

[arxiv 2023.12]DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [PDF,Page]

[arxiv 2023.12]LivePhoto: Real Image Animation with Text-guided Motion Control [PDF, Page]

[arxiv 2023.12]I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models [PDF]

[arxiv 2024.01]UniVG: Towards UNIfied-modal Video Generation [PDF,Page]

[arxiv 2024.03]Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation [PDF,Page]

[arxiv 2024.03]AtomoVideo: High Fidelity Image-to-Video Generation [PDF,Page]

[arxiv 2024.03]Pix2Gif: Motion-Guided Diffusion for GIF Generation[PDF,Page]

[arxiv 2024.03]Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts [PDF,Page]

[arxiv 2024.03]TimeRewind: Rewinding Time with Image-and-Events Video Diffusion [PDF,Page]

[arxiv 2024.03]TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models [PDF,Page]

[arxiv 2024.04]LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [PDF]

[arxiv 2024.04]TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models [PDF,Page]

4D Generation

[arxiv 2023.11]Animate124: Animating One Image to 4D Dynamic Scene [PDF,Page]

[arxiv 2023.12]4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling[PDF, Page]

[arxiv 2023.12]4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [PDF,Page]

[arxiv 2023.12]DreamGaussian4D: Generative 4D Gaussian Splatting [PDF, Page]

Audio-to-video Generation

[arxiv 2023.09]Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [PDF]

[arxiv 2024.02]Seeing and Hearing Open-domain Visual-Audio Generation with Diffusion Latent Aligners [PDF,Page]

[arxiv 2024.04]TAVGBench: Benchmarking Text to Audible-Video Generation [PDF,Page]

Video editing with video models

[arxiv 2023.12]VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models[PDF,Page]

[arxiv 2023.12]Neutral Editing Framework for Diffusion-based Video Editing [PDF,Page]

[arxiv 2024.01]FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis[PDF,Page]

[arxiv 2024.02]UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [PDF,Page]

[arxiv 2024.02]Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [PDF,Page]

[arxiv 2024.03]FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing[PDF]

[arxiv 2024.03]DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing [PDF,Page]

[arxiv 2024.03]EffiVED:Efficient Video Editing via Text-instruction Diffusion Models [PDF]

[arxiv 2024.03]Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion [PDF,Page]

[arxiv 2024.03]AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks [PDF,Page]

[arxiv 2024.04]Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models [PDF]

Image Model for video generation and editing

*[arxiv 2022.12]Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [PDF, Page]

[arxiv 2023.03]Video-P2P: Video Editing with Cross-attention Control [PDF, Page]

[arxiv 2023.03]Edit-A-Video: Single Video Editing with Object-Aware Consistency [PDF, Page]

[arxiv 2023.03]FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [PDF, Page]

[arxiv 2023.03]Pix2Video: Video Editing using Image Diffusion [PDF]

*[arxiv 2023.03] Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [PDF, code]

[arxiv 2023.03]Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models[PDF,code]

[arxiv 2023.04]Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos[PDF]

[arxiv 2023.05]ControlVideo: Training-free Controllable Text-to-Video Generation [PDF, Page]

[arxiv 2023.05]Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models[PDF, Page]

[arxiv-2023.05]Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation [PDF, Page]

[arxiv 2023.05]Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models [PDF]

[arxiv 2023.05]SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing [PDF]

[arxiv 2023.05]InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [PDF]

[arxiv 2023.05] ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing [PDF, Page]

[arxiv 2023.05]Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising [PDF,Page]

[arxiv 2023.06]Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [PDF, Page]

[arxiv 2023.06]VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [PDF,Page]

*[arxiv 2023.06]Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [PDF, Page]

*[arxiv 2023.07]AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [PDF, Page]

*[arxiv 2023.07]TokenFlow: Consistent Diffusion Features for Consistent Video Editing [PDF,Page]

[arxiv 2023.07]VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet [PDF, Page]

[arxiv 2023.08]CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [PDF, Page]

[arxiv 2023.08]DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory [PDF, Page]

[arxiv 2023.08]StableVideo: Text-driven Consistency-aware Diffusion Video Editing [PDF, Page]

[arxiv 2023.08]Edit Temporal-Consistent Videos with Image Diffusion Model [PDF]

[arxiv 2023.08]EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints [PDF]

[arxiv 2023.08]MagicEdit: High-Fidelity and Temporally Coherent Video Editing [PDF, Page]

[arxiv 2023.09] MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [PDF]

[arxiv 2023.09]Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator[PDF, Page]

[arxiv 2023.09]CCEdit: Creative and Controllable Video Editing via Diffusion Models [PDF]

[arxiv 2023.10]Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [PDF,Page]

[arxiv 2023.10]FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [PDF,Page]

[arxiv 2023.10]ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [PDF,Page]

[arxiv 2023.10, nerf] DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing [PDF, Page]

[arxiv 2023.10]LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [PDF,Page]

[arxiv 2023.11] LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [PDF]

[arxiv 2023.11]Cut-and-Paste: Subject-Driven Video Editing with Attention Control[PDF]

[arxiv 2023.11]MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation [PDF]

[arxiv 2023.12]Motion-Conditioned Image Animation for Video Editing [PDF, Page]

[arxiv 2023.12]RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models [PDF,Page]

[arxiv 2023.12]DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing [PDF]

[arxiv 2023.12]MagicStick: Controllable Video Editing via Control Handle Transformations [PDF,Page]

[arxiv 2023.12]SAVE: Protagonist Diversification with Structure Agnostic Video Editing [PDF,Page]

[arxiv 2023.12]VidToMe: Video Token Merging for Zero-Shot Video Editing [PDF,Page]

[arxiv 2023.12]Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis [PDF,Page]

[arxiv 2024.01] Object-Centric Diffusion for Efficient Video Editing [PDF]

[arxiv 2024.01] VASE: Object-Centric Shape and Appearance Manipulation of Real Videos [PDF, Page]

[arxiv 2024.03]FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [PDF,Page]

[arxiv 2024.04]GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [PDF]

[arxiv 2024.05]Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing [PDF,Page]

👉 Video Completion (animation, interpolation, prediction)

[arxiv 2022; Meta] Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation [PDF, code]

[arxiv 2023.03]LDMVFI: Video Frame Interpolation with Latent Diffusion Models[PDF]

*[arxiv 2023.03]Seer: Language Instructed Video Prediction with Latent Diffusion Models [PDF]

[arxiv 2023.10]DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [PDF, Page]

[arxiv 2024.03]Explorative Inbetweening of Time and Space [PDF,Page]

[arxiv 2024.04]Video Interpolation With Diffusion Models [PDF,Page]

[arxiv 2024.04]Sparse Global Matching for Video Frame Interpolation with Large Motion [PDF,Page]

[arxiv 2024.04]LADDER: An Efficient Framework for Video Frame Interpolation [PDF]

[arxiv 2024.04]Motion-aware Latent Diffusion Models for Video Frame Interpolation [PDF]

[arxiv 2024.04]Event-based Video Frame Interpolation with Edge Guided Motion Refinement [PDF]

[arxiv 2024.04]StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [PDF,Page]

Motion Transfer

[arxiv 2023.05]LEO: Generative Latent Image Animator for Human Video Synthesis [PDF,Page]

*[arxiv 2023.03]Conditional Image-to-Video Generation with Latent Flow Diffusion Models [PDF]

[arxiv 2023.07]DisCo: Disentangled Control for Referring Human Dance Generation in Real World [PDF, Page]

[arxiv 2023.12]MotionEditor: Editing Video Motion via Content-Aware Diffusion [PDF,Page]

[arxiv 2023.12]Customizing Motion in Text-to-Video Diffusion Models [PDF,Page]

[arxiv 2023.12]MotionCrafter: One-Shot Motion Customization of Diffusion Models [PDF]

[arxiv 2024.01]Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation[PDF]

[arxiv 2024.03]Spectral Motion Alignment for Video Motion Transfer using Diffusion Models[PDF,Page]

Style Transfer

[arxiv 2023.06]Probabilistic Adaptation of Text-to-Video Models [PDF]

[arxiv 2023.11]Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion[PDF]

[arxiv 2023.12]StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter[PDF,Page]

[arxiv 2023.12]DragVideo: Interactive Drag-style Video Editing [PDF]

[arxiv 2024.03]FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [PDF,Page]

Evaluation

[arxiv 2023.10]EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [PDF,Page]

[arxiv 2023.11]FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation [PDF]

[arxiv 2023.11] Online Video Quality Enhancement with Spatial-Temporal Look-up Tables [PDF]

[arxiv 2024.03]STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [PDF]

[arxiv 2024.03] Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [PDF]

Survey

[arxiv 2023.10] A Survey on Video Diffusion Models [PDF]

[arxiv 2024.05]Video Diffusion Models: A Survey [PDF]

Video Quality Assessment

[ICCV 2023]Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives [PDF,Page]

[arxiv 2023.11]HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment [PDF]

[arxiv 2023.12]VBench: Comprehensive Benchmark Suite for Video Generative Models [PDF, Page]

[arxiv 2024.02]Perceptual Video Quality Assessment: A Survey [PDF]

[arxiv 2024.02] KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos [PDF]

[arxiv 2024.03]Modular Blind Video Quality Assessment [PDF]

Speed

[arxiv 2023.12]F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [PDF]

[arxiv 2023.12]VideoLCM: Video Latent Consistency Model [PDF]

[arxiv 2024.01]FlashVideo: A Framework for Swift Inference in Text-to-Video Generation [PDF]

[arxiv 2024.01]AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning [PDF,Page]

[arxiv 2024.03]AnimateDiff-Lightning: Cross-Model Diffusion Distillation [PDF]

Others

[arxiv 2023.05]AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion [PDF]

[arxiv 2023.05]Multi-object Video Generation from Single Frame Layouts [PDF]

[arxiv 2023.06]Learn the Force We Can: Multi-Object Video Generation from Pixel-Level Interactions [PDF]

[arxiv 2023.08]DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis [PDF]

CV Related

[arxiv 2022.12; ByteDance] PV3D: A 3D Generative Model for Portrait Video Generation [PDF]

[arxiv 2022.12]MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation[PDF]

[arxiv 2022.12]Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [PDF, Page]

[arxiv 2023.01]Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [PDF, Page]

[arxiv 2023.01]DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis [PDF, Page]

[arxiv 2023.02 Google]Scaling Vision Transformers to 22 Billion Parameters [PDF]

[arxiv 2023.05]VDT: An Empirical Study on Video Diffusion with Transformers [PDF, code]

[arxiv 2024] MAGVIT-v2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [PDF]

NLP related

[arxiv 2022.10] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models [PDF]

[arxiv 2023.02]The Flan Collection: Designing Data and Methods for Effective Instruction Tuning [PDF]

Speech

[arxiv 2023.01]Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers[PDF, Page]