GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
January 10, 2026 ยท View on GitHub
Method Overview

This repository provides the inference code for GenMAC, enabling compositional text-to-video generation. Follow the steps below to set up and generate your videos.
๐จ Installation
Step 1: Install Environment
Set up the required environment using Anaconda:
conda create -n genmac python=3.9 -y
conda activate genmac
pip install -r requirements.txt
Step 2: Download Required Files
-
Pretrained Model
- Download the pretrained model from Hugging Face. Place the model in the following directory (checkpoints/base_512_v2/model.ckpt):
mkdir -p checkpoints/base_512_v2 huggingface-cli download VideoCrafter/VideoCrafter2 model.ckpt --local-dir checkpoints/base_512_v2
- Download the pretrained model from Hugging Face. Place the model in the following directory (checkpoints/base_512_v2/model.ckpt):
-
Tokenizer
- Download the tokenizer from this link. Place it in the following directory (checkpoints/tokenizer):
huggingface-cli download ali-vilab/text-to-video-ms-1.7b \ --include "tokenizer/*" \ --local-dir checkpoints
- Download the tokenizer from this link. Place it in the following directory (checkpoints/tokenizer):
๐ก Quick Start
Step 1: Prepare API Key
GenMAC uses GPT-4o for its multi-agent collaboration. To enable access to the model:
-
Add your OpenAI API key to the file:
utils/api_key.py -
Ensure that your system has internet access to connect to OpenAI's servers.
Step 2: Run Inference
To generate videos, use the following command:
python scripts/run_t2v.py
Note: This process requires approximately 78GB of GPU VRAM
Additional Configurations
-
Specify the Prompt
Update the prompt text in the following file:
assets/prompt/prompt.txt -
Adjust Iterations
Modify the maximum iteration number (MAX_ITER) in Line 22 of the script (scripts/run_t2v.py):
e.g.,
MAX_ITER = 5 -
Set Seed
Set the seed value in Line 13 of the script (scripts/run_t2v.py):
e.g.,
seed = 12345678
Output
Generated videos will be saved in the following directory:
results/<seed>_<timestamp>/iter_<MAX_ITER-1>/video/results_t2v_baseline_0/base_512_v2/
Example: If the following parameters are used:
- MAX_ITER = 5
- seed = 12345678
- Script executed on 2025-07-25-22-21-24 The generated videos will be located in:
results/12345678_2025-07-25-22-21-24/iter_4/video/results_t2v_baseline_0/base_512_v2/
โญ Acknowledgements
We would like to thank the following great open-source projects and research works: LVD, VideoCrafter2.
๐ Citation
@article{huang2024genmaccompositionaltexttovideogeneration,
title={GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration},
author={Kaiyi Huang and Yukun Huang and Xuefei Ning and Zinan Lin and Yu Wang and Xihui Liu},
year={2024},
eprint={2412.04440},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.04440},
}