ComfyUI-MOVA

February 24, 2026

ComfyUI custom nodes for MOVA — a foundation model that generates synchronized video and audio in a single inference pass.

Features

  • Joint video + audio generation — MOVA produces video frames and matching audio simultaneously, no separate TTS/sound model needed
  • Image-to-Video (I2V) — provide a reference image as the first frame for high-quality, controlled generation
  • Text-to-Video-Audio (T2VA) — generate from a text prompt alone (best results come from generating a first frame with any T2I node first)
  • Audio injection — feed pre-generated audio (speech, music) to drive lip sync and audio-conditioned video
  • Auto duration — automatically match video length to input audio
  • Two nodes: MOVA Model Loader and MOVA Generate, designed to connect directly to VideoHelperSuite (VHS) for video output

Installation

1. Clone into ComfyUI custom nodes

cd ComfyUI/custom_nodes
git clone --recursive https://github.com/richservo/comfyui-mova.git

The --recursive flag pulls the MOVA source code as a git submodule into mova_src/. This is the recommended setup.

2. Install dependencies

cd comfyui-mova
pip install -r requirements.txt

3. Models

Models download automatically from HuggingFace on first use into ComfyUI/models/mova/. Available variants:

| Variant | Resolution | HuggingFace Repo |
| --- | --- | --- |
| MOVA-360p | 352x640 | OpenMOSS-Team/MOVA-360p |
| MOVA-720p | 1280x720 | OpenMOSS-Team/MOVA-720p |

To use a local model path instead, enter the path in the local_model_path field on the loader node.
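If you prefer to fetch a variant before the first run, the following is a minimal sketch using `huggingface_hub.snapshot_download`. The repo IDs come from the table above; the helper names and the per-variant subfolder under ComfyUI/models/mova/ are assumptions for illustration, not the loader's actual layout.

```python
from pathlib import Path

# Repo IDs are taken from the variants table; everything else is illustrative.
MOVA_REPOS = {
    "MOVA-360p": "OpenMOSS-Team/MOVA-360p",
    "MOVA-720p": "OpenMOSS-Team/MOVA-720p",
}


def resolve_model_dir(variant: str, local_model_path: str = "",
                      models_root: str = "ComfyUI/models/mova") -> Path:
    """Return the directory a loader could read from (hypothetical helper)."""
    if local_model_path.strip():
        # A non-empty local path overrides the auto-download location.
        return Path(local_model_path).expanduser()
    if variant not in MOVA_REPOS:
        raise ValueError(f"unknown variant: {variant!r}")
    # Assumption: one subfolder per variant under the models root.
    return Path(models_root) / variant


def prefetch(variant: str, models_root: str = "ComfyUI/models/mova") -> Path:
    """Download one variant ahead of time (requires huggingface_hub)."""
    from huggingface_hub import snapshot_download  # imported lazily

    target = resolve_model_dir(variant, models_root=models_root)
    snapshot_download(repo_id=MOVA_REPOS[variant], local_dir=str(target))
    return target
```

Calling `prefetch("MOVA-360p")` from the directory that contains ComfyUI/ would download the 360p weights; the resulting folder can then be entered in local_model_path if the loader does not pick it up on its own.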

Nodes

MOVA Model Loader

Loads the MOVA pipeline and applies memory offloading.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_variant | Dropdown | MOVA-360p | Which model to load |
| local_model_path | STRING | "" | Override with a local path (leave empty for auto-download) |
| offload_mode | Dropdown | group | group (recommended for 16GB), cpu, or none |
| dtype | Dropdown | bf16 | bf16 or fp16 |

Output: MOVA_PIPE — connect to MOVA Generate.

MOVA Generate

Runs inference and outputs video frames + audio.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mova_pipe | MOVA_PIPE | | Pipeline from the loader |
| prompt | STRING | "" | Text prompt |
| negative_prompt | STRING | (Chinese default) | Negative prompt |
| seed | INT | 42 | Random seed |
| width | INT | 640 | Video width (step 16) |
| height | INT | 352 | Video height (step 16) |
| num_frames | INT | 193 | Number of frames. Ignored when auto_duration is on. |
| auto_duration | BOOLEAN | False | Calculate frame count from input audio length |
| fps | FLOAT | 24.0 | Video frame rate |
| num_inference_steps | INT | 25 | Denoising steps (MOVA default is 50; 25 is a good speed/quality trade-off) |
| cfg_scale | FLOAT | 5.0 | Classifier-free guidance scale |
| sigma_shift | FLOAT | 5.0 | Scheduler sigma shift |
| attention_mode | Dropdown | auto | auto, sage (SageAttention, ~50% less VRAM), or sdpa (PyTorch SDPA) |
| vae_tiling | BOOLEAN | False | Tile VAE decode to save VRAM. Essential for 720p on 16GB. |
| tile_size | INT | 256 | VAE tile size in pixels (only used when tiling is enabled) |
| remove_video_dit | BOOLEAN | True | Free stage-1 DiT after expert switch to save ~28GB RAM |

Optional inputs:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| image | IMAGE | | Reference image (first frame). Without this, MOVA runs in T2VA mode. |
| audio | AUDIO | | Pre-generated audio to drive lip sync |
| audio_strength | FLOAT | 0.8 | How strongly input audio conditions generation (1.0 = full, 0.0 = ignore) |

Outputs: IMAGE (video frames) and AUDIO (waveform).
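The width and height inputs move in steps of 16, so arbitrary sizes need to be snapped before they are typed into the node. The helper below is a small sketch of one way to do that; the function name and the choice to round down are assumptions, not part of the node's code.

```python
def snap_to_step(value: int, step: int = 16, minimum: int = 16) -> int:
    """Round value down to a multiple of step, never below minimum."""
    snapped = (value // step) * step
    return max(minimum, snapped)


if __name__ == "__main__":
    for requested in (360, 640, 720, 1280):
        print(requested, "->", snap_to_step(requested))
```

With a step of 16, a requested height of 360 snaps to 352, while 640, 720 and 1280 are already multiples of 16 and stay unchanged.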

Usage

Basic I2V workflow

[Load Image] → image
                        → [MOVA Generate] → images → [VHS_VideoCombine]
[MOVA Model Loader] → mova_pipe ↗            audio ↗
  1. Load a reference image with any ComfyUI image node
  2. Connect it to the image input on MOVA Generate
  3. Connect both images and audio outputs to VHS_VideoCombine
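The same graph can be written in ComfyUI's API (JSON) workflow format and queued over HTTP. This is a sketch under several assumptions: the class_type strings "MOVAModelLoader" and "MOVAGenerate" are hypothetical placeholders (check the names this package actually registers), the output order (frames at index 0, audio at index 1) is inferred from the Outputs line above, negative_prompt is left out because its default is not reproduced here, and VHS_VideoCombine may need extra format-specific inputs depending on its version.

```python
import json
import urllib.request


def build_i2v_workflow(image_name: str, prompt: str, fps: float = 24.0) -> dict:
    """Build an API-format graph: LoadImage + loader -> generate -> VHS."""
    return {
        "1": {"class_type": "LoadImage", "inputs": {"image": image_name}},
        "2": {"class_type": "MOVAModelLoader",  # hypothetical class name
              "inputs": {"model_variant": "MOVA-360p", "local_model_path": "",
                         "offload_mode": "group", "dtype": "bf16"}},
        "3": {"class_type": "MOVAGenerate",  # hypothetical class name
              "inputs": {"mova_pipe": ["2", 0], "image": ["1", 0],
                         "prompt": prompt, "seed": 42,
                         "width": 640, "height": 352, "num_frames": 193,
                         "auto_duration": False, "fps": fps,
                         "num_inference_steps": 25, "cfg_scale": 5.0,
                         "sigma_shift": 5.0, "attention_mode": "auto",
                         "vae_tiling": False, "tile_size": 256,
                         "remove_video_dit": True}},
        "4": {"class_type": "VHS_VideoCombine",
              # Assumed output order: index 0 = frames, index 1 = audio.
              "inputs": {"images": ["3", 0], "audio": ["3", 1],
                         "frame_rate": fps, "loop_count": 0,
                         "filename_prefix": "mova", "format": "video/h264-mp4",
                         "pingpong": False, "save_output": True}},
    }


def submit(workflow: dict, server: str = "http://127.0.0.1:8188") -> dict:
    """Queue the graph on a running ComfyUI server via POST /prompt."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    graph = build_i2v_workflow("first_frame.png", "a dog barking in a park")
    print(json.dumps(graph, indent=2))
```

Exporting a working graph from the ComfyUI interface in API format is the most reliable way to confirm the real class names and required inputs before calling `submit`.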

T2VA (text only)

Leave the image input disconnected. MOVA will generate from a black first frame. For better results, generate a first frame with a T2I model (e.g. Flux, SD3) and connect it as the reference image.

Audio injection (lip sync)

  1. Generate speech with ComfyUI-MOSS-TTS or any audio source
  2. Connect the AUDIO output to MOVA Generate's audio input
  3. Enable auto_duration to match video length to the audio
  4. Adjust audio_strength — 0.8 preserves most of the original audio while allowing the model to harmonize it with the video
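To make auto_duration concrete, here is a rough sketch of how a frame count could be derived from audio length. It is not the node's implementation. The rounding rule (frame counts of the form 4k + 1, rounded up so the video is not shorter than the audio) is an assumption; it happens to agree with the default of 193 frames for about 8 seconds at 24 fps.

```python
import math


def frames_for_audio(num_samples: int, sample_rate: int,
                     fps: float = 24.0, align: int = 4) -> int:
    """Estimate a frame count that covers the given audio clip."""
    duration_s = num_samples / sample_rate
    raw_frames = duration_s * fps
    # Assumption: valid counts look like align * k + 1 (e.g. 193 = 4 * 48 + 1).
    k = max(1, math.ceil((raw_frames - 1) / align))
    return align * k + 1


if __name__ == "__main__":
    # 8 seconds of 48 kHz audio at 24 fps.
    print(frames_for_audio(8 * 48_000, 48_000))
```

If the real pipeline uses a different alignment, only the `align` argument and the `+ 1` term would need to change.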

720p generation

720p requires significantly more VRAM. On 16GB:

  1. Set attention_mode to sage (requires SageAttention installed)
  2. Enable vae_tiling
  3. Keep remove_video_dit enabled
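The checklist above can be captured as a small preset helper. This is only a sketch: the function name is hypothetical, and the thresholds (720 pixels of height, 16 GB of VRAM) are taken from this README rather than from any measurement inside the node.

```python
def recommend_settings(height: int, vram_gb: float) -> dict:
    """Map resolution and VRAM to the settings suggested in this README."""
    settings = {
        "attention_mode": "auto",
        "vae_tiling": False,
        "remove_video_dit": True,  # kept enabled in every case
    }
    if height >= 720 and vram_gb <= 16:
        # 720p on a 16GB card: SageAttention plus tiled VAE decode.
        settings["attention_mode"] = "sage"
        settings["vae_tiling"] = True
    return settings


if __name__ == "__main__":
    print(recommend_settings(height=720, vram_gb=16))
```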

VRAM Guide

Tested with group offload on a 16GB GPU:

| Resolution | Settings | Peak VRAM | Notes |
| --- | --- | --- | --- |
| 360p (352x640) | defaults | ~9-12 GB | Works on 16GB out of the box |
| 360p + sage | attention_mode=sage | ~8-9 GB | Comfortable headroom |
| 720p (1280x720) | sage + tiling | ~14-15 GB | Tight on 16GB, works with sage + vae_tiling |

System RAM usage (not VRAM) is ~50-80GB, depending on the model variant and the remove_video_dit setting.

Optional Dependencies

SageAttention

Reduces attention VRAM usage by ~50% using INT8 quantized Q/K matrices. Required for 720p on 16GB GPUs.

pip install sageattention

Select sage in the attention_mode dropdown to use it.
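Before selecting sage, it can help to confirm that the package is importable in the Python environment ComfyUI runs in. The sketch below shows one plausible fallback policy; how the node's own auto mode decides is not documented here, so the logic and the function name are assumptions.

```python
import importlib.util


def pick_attention_mode(requested: str = "auto") -> str:
    """Resolve 'auto' to 'sage' when sageattention is installed, else 'sdpa'."""
    if requested not in {"auto", "sage", "sdpa"}:
        raise ValueError(f"unknown attention mode: {requested!r}")
    has_sage = importlib.util.find_spec("sageattention") is not None
    if requested == "sage" and not has_sage:
        raise RuntimeError("sage requested, but sageattention is not installed")
    if requested == "auto":
        return "sage" if has_sage else "sdpa"
    return requested


if __name__ == "__main__":
    print(pick_attention_mode("auto"))
```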

VideoHelperSuite (VHS)

Not a dependency, but the recommended way to save output. VHS_VideoCombine accepts IMAGE + AUDIO inputs and muxes them into MP4 with FFmpeg.

Install it from ComfyUI Manager or from https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite

MOVA Source Code

The MOVA source is included as a git submodule in mova_src/. If you didn't clone with --recursive, initialize it manually:

cd ComfyUI/custom_nodes/comfyui-mova
git submodule update --init --recursive

Alternatively, you can point to an existing MOVA checkout by setting the MOVA_REPO_PATH environment variable.
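The lookup order described above could be implemented roughly as follows. This is a sketch, not the package's code: the function names are hypothetical, and the assumption is that MOVA_REPO_PATH takes priority over the bundled mova_src/ folder.

```python
import os
import sys
from pathlib import Path


def resolve_mova_source(node_dir: Path) -> Path:
    """Pick the MOVA checkout: MOVA_REPO_PATH if set, else node_dir/mova_src."""
    override = os.environ.get("MOVA_REPO_PATH", "").strip()
    candidate = Path(override).expanduser() if override else node_dir / "mova_src"
    # An uninitialized submodule shows up as a missing or empty directory.
    if not candidate.is_dir() or not any(candidate.iterdir()):
        raise FileNotFoundError(
            f"MOVA source not found at {candidate}; run "
            "'git submodule update --init --recursive' or set MOVA_REPO_PATH")
    return candidate


def add_to_sys_path(path: Path) -> None:
    """Make the checkout importable without duplicating sys.path entries."""
    entry = str(path)
    if entry not in sys.path:
        sys.path.insert(0, entry)
```

An empty mova_src/ directory is the usual symptom of cloning without --recursive, which is why the sketch treats it the same as a missing directory.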

Known Limitations

  • Multi-speaker audio binding — MOVA does not currently support binding specific audio tracks to specific speakers in multi-person scenes
  • Last-image conditioning — first+last frame conditioning is not yet exposed in the upstream MOVA pipeline (inputs are prepared in code but commented out pending upstream support)
  • Windows distributed — torch.distributed is monkey-patched for single-GPU use since NCCL/gloo can be unreliable on Windows; this is transparent but worth noting

Credits