ComfyUI-MOVA
February 24, 2026
ComfyUI custom nodes for MOVA — a foundation model that generates synchronized video and audio in a single inference pass.
Features
- Joint video + audio generation — MOVA produces video frames and matching audio simultaneously, no separate TTS/sound model needed
- Image-to-Video (I2V) — provide a reference image as the first frame for high-quality, controlled generation
- Text-to-Video-Audio (T2VA) — generate from a text prompt alone (best results come from generating a first frame with any T2I node first)
- Audio injection — feed pre-generated audio (speech, music) to drive lip sync and audio-conditioned video
- Auto duration — automatically match video length to input audio
- Two nodes, MOVA Model Loader and MOVA Generate, designed to connect directly to VHS for video output
Installation
1. Clone into ComfyUI custom nodes
cd ComfyUI/custom_nodes
git clone --recursive https://github.com/richservo/comfyui-mova.git
The --recursive flag pulls the MOVA source code as a git submodule into mova_src/. This is the recommended setup.
2. Install dependencies
cd comfyui-mova
pip install -r requirements.txt
3. Models
Models download automatically from HuggingFace on first use into ComfyUI/models/mova/. Available variants:
| Variant | Resolution | HuggingFace Repo |
|---|---|---|
| MOVA-360p | 640x352 | OpenMOSS-Team/MOVA-360p |
| MOVA-720p | 1280x720 | OpenMOSS-Team/MOVA-720p |
To use a local model path instead, enter the path in the local_model_path field on the loader node.
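If you would rather fetch the weights ahead of time than wait for the first run, something like the sketch below may help. It assumes the huggingface_hub package is installed. The per-variant subfolder under ComfyUI/models/mova/ and the helper names (target_dir, prefetch) are guesses for illustration, not something the loader is documented to use, so pass the resulting folder through local_model_path.

```python
from pathlib import Path

# Repo IDs copied from the variants table above.
MOVA_REPOS = {
    "MOVA-360p": "OpenMOSS-Team/MOVA-360p",
    "MOVA-720p": "OpenMOSS-Team/MOVA-720p",
}


def target_dir(comfy_root: str, variant: str) -> Path:
    """Folder to download into. The per-variant subfolder is an assumption."""
    if variant not in MOVA_REPOS:
        raise ValueError(f"unknown variant: {variant!r}")
    return Path(comfy_root) / "models" / "mova" / variant


def prefetch(comfy_root: str, variant: str) -> Path:
    """Download one variant with huggingface_hub and return its folder."""
    # Imported lazily so target_dir() works without the package installed.
    from huggingface_hub import snapshot_download

    dest = target_dir(comfy_root, variant)
    snapshot_download(repo_id=MOVA_REPOS[variant], local_dir=str(dest))
    return dest


# Example (needs network access and a lot of disk space):
#   folder = prefetch("ComfyUI", "MOVA-360p")
#   print(folder)  # paste this into local_model_path
```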
Nodes
MOVA Model Loader
Loads the MOVA pipeline and applies memory offloading.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_variant | Dropdown | MOVA-360p | Which model to load |
| local_model_path | STRING | "" | Override with a local path (leave empty for auto-download) |
| offload_mode | Dropdown | group | group (recommended for 16GB), cpu, or none |
| dtype | Dropdown | bf16 | bf16 or fp16 |
Output: MOVA_PIPE — connect to MOVA Generate.
MOVA Generate
Runs inference and outputs video frames + audio.
| Parameter | Type | Default | Description |
|---|---|---|---|
| mova_pipe | MOVA_PIPE | (required) | Pipeline from the loader |
| prompt | STRING | "" | Text prompt |
| negative_prompt | STRING | (Chinese default) | Negative prompt |
| seed | INT | 42 | Random seed |
| width | INT | 640 | Video width (step 16) |
| height | INT | 352 | Video height (step 16) |
| num_frames | INT | 193 | Number of frames. Ignored when auto_duration is on. |
| auto_duration | BOOLEAN | False | Calculate frame count from input audio length |
| fps | FLOAT | 24.0 | Video frame rate |
| num_inference_steps | INT | 25 | Denoising steps (MOVA default is 50; 25 is a good speed/quality trade-off) |
| cfg_scale | FLOAT | 5.0 | Classifier-free guidance scale |
| sigma_shift | FLOAT | 5.0 | Scheduler sigma shift |
| attention_mode | Dropdown | auto | auto, sage (SageAttention, ~50% less VRAM), or sdpa (PyTorch SDPA) |
| vae_tiling | BOOLEAN | False | Tile VAE decode to save VRAM. Essential for 720p on 16GB. |
| tile_size | INT | 256 | VAE tile size in pixels (only used when tiling is enabled) |
| remove_video_dit | BOOLEAN | True | Free stage-1 DiT after expert switch to save ~28GB RAM |
Optional inputs:
| Parameter | Type | Default | Description |
|---|---|---|---|
| image | IMAGE | (none) | Reference image (first frame). Without this, MOVA runs in T2VA mode. |
| audio | AUDIO | (none) | Pre-generated audio to drive lip sync |
| audio_strength | FLOAT | 0.8 | How strongly input audio conditions generation (1.0 = full, 0.0 = ignore) |
Outputs: IMAGE (video frames) and AUDIO (waveform).
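When driving MOVA Generate from a script (for example through ComfyUI's API-format workflows), it can be handy to keep the defaults in one place and check them before queuing a job. The sketch below is not the node's own code: MovaGenerateSettings and snap_to_16 are hypothetical names, the values are copied from the tables above, and negative_prompt is left out because its default text is not reproduced in this README.

```python
from dataclasses import dataclass, asdict


@dataclass
class MovaGenerateSettings:
    """Defaults copied from the MOVA Generate parameter table (hypothetical helper)."""
    prompt: str = ""
    seed: int = 42
    width: int = 640
    height: int = 352
    num_frames: int = 193
    auto_duration: bool = False
    fps: float = 24.0
    num_inference_steps: int = 25
    cfg_scale: float = 5.0
    sigma_shift: float = 5.0
    attention_mode: str = "auto"
    vae_tiling: bool = False
    tile_size: int = 256
    remove_video_dit: bool = True
    audio_strength: float = 0.8

    def validate(self) -> None:
        # width and height are exposed with a step of 16 in the node UI
        for name in ("width", "height"):
            value = getattr(self, name)
            if value <= 0 or value % 16 != 0:
                raise ValueError(f"{name}={value} is not a positive multiple of 16")
        if self.attention_mode not in ("auto", "sage", "sdpa"):
            raise ValueError(f"unknown attention_mode: {self.attention_mode!r}")
        if not 0.0 <= self.audio_strength <= 1.0:
            raise ValueError("audio_strength should be between 0.0 and 1.0")

    def to_inputs(self) -> dict:
        """Plain dict of widget values, after validation."""
        self.validate()
        return asdict(self)


def snap_to_16(value: int) -> int:
    """Round a pixel dimension down to a multiple of 16 (minimum 16)."""
    return max(16, (value // 16) * 16)
```

For instance, snap_to_16(360) gives 352, which is why the 360p variant uses a height of 352 rather than 360.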
Usage
Basic I2V workflow
[Load Image] → image → [MOVA Generate]
[MOVA Model Loader] → mova_pipe → [MOVA Generate]
[MOVA Generate] → images → [VHS_VideoCombine]
[MOVA Generate] → audio → [VHS_VideoCombine]
- Load a reference image with any ComfyUI image node
- Connect it to the image input on MOVA Generate
- Connect both images and audio outputs to VHS_VideoCombine
T2VA (text only)
Leave the image input disconnected. MOVA will generate from a black first frame. For better results, generate a first frame with a T2I model (e.g. Flux, SD3) and connect it as the reference image.
Audio injection (lip sync)
- Generate speech with ComfyUI-MOSS-TTS or any audio source
- Connect the AUDIO output to MOVA Generate's audio input
- Enable auto_duration to match video length to the audio
- Adjust audio_strength: 0.8 preserves most of the original audio while allowing the model to harmonize it with the video
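To get a feel for what auto_duration has to compute, here is a rough estimate of the frame count for a given audio clip. The exact rounding rule used by the node is not documented here. This sketch assumes the count is kept at 4k + 1 frames, which is consistent with the default of 193 (8 seconds at 24 fps, plus one frame), and rounds up so the video is never shorter than the audio. Both function names are hypothetical.

```python
import math


def audio_duration_s(num_samples: int, sample_rate: int) -> float:
    """Clip length in seconds from a sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def frames_for_audio(duration_s: float, fps: float = 24.0) -> int:
    """Estimate a frame count covering duration_s seconds of audio.

    Assumption: the result has the form 4k + 1. The node's real rule may differ.
    """
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration_s and fps must be positive")
    raw_frames = duration_s * fps
    # Round up to the next multiple of 4, then add the leading frame.
    k = max(1, math.ceil(raw_frames / 4))
    return 4 * k + 1


# 8 seconds of audio at the default 24 fps maps back to the default num_frames.
print(frames_for_audio(8.0))  # 193
```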
720p generation
720p requires significantly more VRAM. On 16GB:
- Set attention_mode to sage (requires SageAttention installed)
- Enable vae_tiling
- Keep remove_video_dit enabled
VRAM Guide
Tested with group offload on a 16GB GPU:
| Resolution | Settings | Peak VRAM | Notes |
|---|---|---|---|
| 360p (640x352) | defaults | ~9-12 GB | Works on 16GB out of the box |
| 360p + sage | attention_mode=sage | ~8-9 GB | Comfortable headroom |
| 720p (1280x720) | sage + tiling | ~14-15 GB | Tight on 16GB, works with sage + vae_tiling |
RAM usage is ~50-80GB depending on model variant and remove_video_dit setting.
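The table can also be kept as data if you want a script to pick a preset for a given GPU. The sketch below uses the upper end of each measured range as the peak. These numbers were taken with group offload on one 16GB GPU, so treat them as rough guidance rather than guarantees. VRAM_GUIDE and presets_that_fit are hypothetical names.

```python
# (label, settings, upper end of the measured peak VRAM in GB)
VRAM_GUIDE = [
    ("360p defaults",
     {"width": 640, "height": 352, "attention_mode": "auto", "vae_tiling": False},
     12.0),
    ("360p + sage",
     {"width": 640, "height": 352, "attention_mode": "sage", "vae_tiling": False},
     9.0),
    ("720p sage + tiling",
     {"width": 1280, "height": 720, "attention_mode": "sage", "vae_tiling": True},
     15.0),
]


def presets_that_fit(vram_gb: float, headroom_gb: float = 0.0) -> list:
    """Labels of presets whose measured peak, plus headroom, fits in vram_gb."""
    return [
        label
        for label, _settings, peak_gb in VRAM_GUIDE
        if peak_gb + headroom_gb <= vram_gb
    ]


print(presets_that_fit(16.0))
# ['360p defaults', '360p + sage', '720p sage + tiling']
print(presets_that_fit(16.0, headroom_gb=2.0))
# ['360p defaults', '360p + sage']
```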
Optional Dependencies
SageAttention
Reduces attention VRAM usage by ~50% using INT8 quantized Q/K matrices. Required for 720p on 16GB GPUs.
pip install sageattention
Select sage in the attention_mode dropdown to use it.
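To check from Python whether SageAttention is importable before selecting sage, a small probe like the following works. How the node's auto mode actually chooses a backend is not described here. The fallback order in resolve_attention_mode (sage if available, otherwise sdpa) is an assumption for illustration, and both function names are hypothetical.

```python
import importlib.util
from typing import Optional


def sage_available() -> bool:
    """True if the sageattention package can be imported in this environment."""
    return importlib.util.find_spec("sageattention") is not None


def resolve_attention_mode(requested: str, has_sage: Optional[bool] = None) -> str:
    """Map a dropdown value to a concrete backend name.

    Assumption: 'auto' prefers sage when it is installed and falls back to sdpa.
    """
    if has_sage is None:
        has_sage = sage_available()
    if requested == "auto":
        return "sage" if has_sage else "sdpa"
    if requested == "sage":
        if not has_sage:
            raise RuntimeError("sage requested but sageattention is not installed")
        return "sage"
    if requested == "sdpa":
        return "sdpa"
    raise ValueError(f"unknown attention_mode: {requested!r}")
```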
VideoHelperSuite (VHS)
Not a dependency, but the recommended way to save output. VHS_VideoCombine accepts IMAGE + AUDIO inputs and muxes them into MP4 with FFmpeg.
Install it from ComfyUI Manager or from https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
MOVA Source Code
The MOVA source is included as a git submodule in mova_src/. If you didn't clone with --recursive, initialize it manually:
cd ComfyUI/custom_nodes/comfyui-mova
git submodule update --init --recursive
Alternatively, you can point to an existing MOVA checkout by setting the MOVA_REPO_PATH environment variable.
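If you are unsure which MOVA checkout will be picked up, the lookup can be approximated as below. This is a sketch, not the node's actual logic: it assumes MOVA_REPO_PATH takes precedence over the bundled mova_src/ folder, and the function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mova_source(node_dir: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the folder expected to hold the MOVA source.

    Assumption: a non-empty MOVA_REPO_PATH wins over the mova_src/ submodule.
    """
    env = os.environ if env is None else env
    override = env.get("MOVA_REPO_PATH", "").strip()
    if override:
        return Path(override).expanduser()
    return Path(node_dir) / "mova_src"


def submodule_initialized(path: Path) -> bool:
    """An uninitialized git submodule shows up as a missing or empty directory."""
    return path.is_dir() and any(path.iterdir())


# Example:
#   src = resolve_mova_source("ComfyUI/custom_nodes/comfyui-mova")
#   if not submodule_initialized(src):
#       print("Run: git submodule update --init --recursive")
```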
Known Limitations
- Multi-speaker audio binding — MOVA does not currently support binding specific audio tracks to specific speakers in multi-person scenes
- Last-image conditioning — first+last frame conditioning is not yet exposed in the upstream MOVA pipeline (inputs are prepared in code but commented out pending upstream support)
- Windows distributed — torch.distributed is monkey-patched for single-GPU use since NCCL/gloo can be unreliable on Windows; this is transparent but worth noting
Credits
- MOVA — OpenMOSS/MOVA by the OpenMOSS team
- Paper — MOVA: Adapting Mixture of Video Diffusion Experts for Video Generation
- Models — OpenMOSS-Team on HuggingFace