ComfyUI-MOVA

February 24, 2026

ComfyUI custom nodes for MOVA — a foundation model that generates synchronized video and audio in a single inference pass.

Features

Joint video + audio generation: MOVA produces video frames and matching audio in the same pass, so no separate TTS or sound model is needed
Image-to-Video (I2V): provide a reference image as the first frame for high-quality, controlled generation
Text-to-Video-Audio (T2VA): generate from a text prompt alone (for best results, generate a first frame with any T2I node and use it as the reference image)
Audio injection: feed pre-generated audio (speech, music) to drive lip sync and audio-conditioned video
Auto duration: automatically match video length to input audio
Two nodes: MOVA Model Loader and MOVA Generate, designed to connect directly to VideoHelperSuite (VHS) for video output

Installation

1. Clone into ComfyUI custom nodes

cd ComfyUI/custom_nodes
git clone --recursive https://github.com/richservo/comfyui-mova.git

The --recursive flag pulls the MOVA source code as a git submodule into mova_src/. This is the recommended setup.

2. Install dependencies

cd comfyui-mova
pip install -r requirements.txt

3. Models

Models download automatically from HuggingFace on first use into ComfyUI/models/mova/. Available variants:

Variant	Resolution	HuggingFace Repo
MOVA-360p	640x352	OpenMOSS-Team/MOVA-360p
MOVA-720p	1280x720	OpenMOSS-Team/MOVA-720p

To use a local model path instead, enter the path in the local_model_path field on the loader node.
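
If the first run should not trigger a large download, the weights can be fetched ahead of time. The sketch below is one way to do that with `huggingface_hub.snapshot_download`. The per-variant subfolder under ComfyUI/models/mova/ and the helper names are assumptions, not a layout the loader is documented to require, so point local_model_path at whichever directory you actually download into.

```python
from pathlib import Path

# Repo IDs taken from the variants table above.
REPOS = {
    "MOVA-360p": "OpenMOSS-Team/MOVA-360p",
    "MOVA-720p": "OpenMOSS-Team/MOVA-720p",
}

def target_dir(comfy_root, variant):
    """Hypothetical layout: one subfolder per variant under models/mova/."""
    if variant not in REPOS:
        raise ValueError(f"unknown variant: {variant}")
    return Path(comfy_root) / "models" / "mova" / variant

def prefetch(comfy_root, variant):
    """Download a variant ahead of time and return its local path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    dest = target_dir(comfy_root, variant)
    snapshot_download(repo_id=REPOS[variant], local_dir=dest)
    return dest
```

Calling `prefetch("ComfyUI", "MOVA-360p")` and pasting the returned path into local_model_path should then skip the automatic download.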

Nodes

MOVA Model Loader

Loads the MOVA pipeline and applies memory offloading.

Parameter	Type	Default	Description
`model_variant`	Dropdown	MOVA-360p	Which model to load
`local_model_path`	STRING	""	Override with a local path (leave empty for auto-download)
`offload_mode`	Dropdown	group	`group` (recommended for 16GB), `cpu`, or `none`
`dtype`	Dropdown	bf16	`bf16` or `fp16`

Output: MOVA_PIPE — connect to MOVA Generate.

MOVA Generate

Runs inference and outputs video frames + audio.

Parameter	Type	Default	Description
`mova_pipe`	MOVA_PIPE	—	Pipeline from the loader
`prompt`	STRING	""	Text prompt
`negative_prompt`	STRING	(Chinese default)	Negative prompt
`seed`	INT	42	Random seed
`width`	INT	640	Video width (step 16)
`height`	INT	352	Video height (step 16)
`num_frames`	INT	193	Number of frames. Ignored when `auto_duration` is on.
`auto_duration`	BOOLEAN	False	Calculate frame count from input audio length
`fps`	FLOAT	24.0	Video frame rate
`num_inference_steps`	INT	25	Denoising steps (MOVA default is 50; 25 is a good speed/quality trade-off)
`cfg_scale`	FLOAT	5.0	Classifier-free guidance scale
`sigma_shift`	FLOAT	5.0	Scheduler sigma shift
`attention_mode`	Dropdown	auto	`auto`, `sage` (SageAttention, ~50% less VRAM), or `sdpa` (PyTorch SDPA)
`vae_tiling`	BOOLEAN	False	Tile VAE decode to save VRAM. Essential for 720p on 16GB.
`tile_size`	INT	256	VAE tile size in pixels (only used when tiling is enabled)
`remove_video_dit`	BOOLEAN	True	Free stage-1 DiT after expert switch to save ~28GB RAM

Optional inputs:

Parameter	Type	Default	Description
`image`	IMAGE	—	Reference image (first frame). Without this, MOVA runs in T2VA mode.
`audio`	AUDIO	—	Pre-generated audio to drive lip sync
`audio_strength`	FLOAT	0.8	How strongly input audio conditions generation (1.0 = full, 0.0 = ignore)

Outputs: IMAGE (video frames) and AUDIO (waveform).
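
Because width and height move in steps of 16, arbitrary target sizes need rounding before they are entered on the node. A minimal sketch, assuming that rounding down is acceptable (the helper name is hypothetical and not part of this node pack):

```python
def snap_to_step(value, step=16):
    """Round a pixel dimension down to the nearest multiple of `step`."""
    if value < step:
        raise ValueError(f"value must be at least {step}")
    return (value // step) * step

print(snap_to_step(360))   # 352
print(snap_to_step(1280))  # 1280
```

Rounding 360 down gives 352, which matches the default height.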

Usage

Basic I2V workflow

[Load Image] → image
                        → [MOVA Generate] → images → [VHS_VideoCombine]
[MOVA Model Loader] → mova_pipe ↗            audio ↗

Load a reference image with any ComfyUI image node
Connect it to the image input on MOVA Generate
Connect the MOVA_PIPE output of MOVA Model Loader to the mova_pipe input on MOVA Generate
Connect both the images and audio outputs to VHS_VideoCombine

T2VA (text only)

Leave the image input disconnected. MOVA will generate from a black first frame. For better results, generate a first frame with a T2I model (e.g. Flux, SD3) and connect it as the reference image.

Audio injection (lip sync)

Generate speech with ComfyUI-MOSS-TTS or any audio source
Connect the AUDIO output to MOVA Generate's audio input
Enable auto_duration to match video length to the audio
Adjust audio_strength — 0.8 preserves most of the original audio while allowing the model to harmonize it with the video
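
The arithmetic behind auto_duration can be sketched as follows. This is an illustration, not the node's actual code. It assumes the frame count is the audio duration times fps, rounded up to the next value of the form 4k + 1. That constraint is an assumption inferred from the default of 193 frames (4 x 48 + 1), and the real rounding rule may differ.

```python
import math

def frames_for_audio(num_samples, sample_rate, fps=24.0, stride=4):
    """Estimate a frame count that covers the whole audio clip.

    Assumption: valid frame counts have the form stride * k + 1.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("audio must be non-empty")
    raw_frames = num_samples * fps / sample_rate
    # Round up so the video is never shorter than the audio.
    k = max(1, math.ceil((raw_frames - 1) / stride))
    return stride * k + 1
```

Under these assumptions, 8 seconds of audio at 24 fps maps to 193 frames, which is the default num_frames.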

720p generation

720p requires significantly more VRAM than 360p (roughly 14-15 GB peak versus 9-12 GB; see the VRAM Guide). On a 16GB GPU:

Set attention_mode to sage (requires SageAttention installed)
Enable vae_tiling
Keep remove_video_dit enabled

VRAM Guide

Tested with group offload on a 16GB GPU:

Resolution	Settings	Peak VRAM	Notes
360p (640x352)	defaults	~9-12 GB	Works on 16GB out of the box
360p + sage	`attention_mode=sage`	~8-9 GB	Comfortable headroom
720p (1280x720)	sage + tiling	~14-15 GB	Tight on 16GB, works with sage + vae_tiling

System RAM (not VRAM) usage is roughly 50-80GB, depending on the model variant and the remove_video_dit setting.
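
The table can also be kept as data when scripting settings. The sketch below only restates the rows above (measured with group offload on a 16GB GPU). The dictionary layout and function name are hypothetical, and the numbers should not be read as guarantees for other hardware.

```python
# Rows from the VRAM guide: node settings plus the upper bound of peak VRAM in GB.
PROFILES = {
    "360p": ({"attention_mode": "auto", "vae_tiling": False}, 12),
    "360p+sage": ({"attention_mode": "sage", "vae_tiling": False}, 9),
    "720p": ({"attention_mode": "sage", "vae_tiling": True}, 15),
}

def headroom_gb(profile, vram_gb):
    """Return VRAM left at the reported peak (negative means it will not fit)."""
    _settings, peak_gb = PROFILES[profile]
    return vram_gb - peak_gb
```

For example, `headroom_gb("720p", 16)` leaves 1 GB, which is why the guide describes 720p as tight on a 16GB card.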

Optional Dependencies

SageAttention

Reduces attention VRAM usage by ~50% using INT8 quantized Q/K matrices. Required for 720p on 16GB GPUs.

pip install sageattention

Select sage in the attention_mode dropdown to use it.
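
Before switching the dropdown, it can help to confirm that the package is importable from the Python environment ComfyUI runs in. A small sketch using only the standard library follows. The fallback to sdpa is an illustrative choice, and how the node's auto mode actually resolves is not documented here.

```python
import importlib.util

def sage_available():
    """True if the sageattention package can be imported in this environment."""
    return importlib.util.find_spec("sageattention") is not None

def suggested_attention_mode():
    """Suggest a dropdown value: sage when installed, otherwise PyTorch SDPA."""
    return "sage" if sage_available() else "sdpa"
```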

VideoHelperSuite (VHS)

Not a dependency, but the recommended way to save output. VHS_VideoCombine accepts IMAGE + AUDIO inputs and muxes them into MP4 with FFmpeg.

Install from ComfyUI Manager or: https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite

MOVA Source Code

The MOVA source is included as a git submodule in mova_src/. If you didn't clone with --recursive, initialize it manually:

cd ComfyUI/custom_nodes/comfyui-mova
git submodule update --init --recursive

Alternatively, you can point to an existing MOVA checkout by setting the MOVA_REPO_PATH environment variable.
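
The lookup order can be pictured like this. It is a sketch, not the node pack's actual code: it assumes MOVA_REPO_PATH takes precedence over the bundled submodule when it is set, and the function name is hypothetical.

```python
import os
from pathlib import Path

def resolve_mova_src(node_dir, env=None):
    """Pick the MOVA source directory: env override first, then mova_src/."""
    env = os.environ if env is None else env
    override = env.get("MOVA_REPO_PATH", "").strip()
    if override:
        return Path(override)
    # Fall back to the git submodule inside the node pack.
    return Path(node_dir) / "mova_src"
```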

Known Limitations

Multi-speaker audio binding — MOVA does not currently support binding specific audio tracks to specific speakers in multi-person scenes
Last-image conditioning — first+last frame conditioning is not yet exposed in the upstream MOVA pipeline (inputs are prepared in code but commented out pending upstream support)
Windows distributed — torch.distributed is monkey-patched for single-GPU use since NCCL/gloo can be unreliable on Windows; this is transparent but worth noting

Credits

MOVA — OpenMOSS/MOVA by the OpenMOSS team
Paper — MOVA: Adapting Mixture of Video Diffusion Experts for Video Generation
Models — OpenMOSS-Team on HuggingFace