Goku: Flow Based Video Generative Foundation Models

February 10, 2025

arXiv | project page

Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
HKU, ByteDance

Overview

Goku is a new family of joint image-and-video generation models based on rectified flow Transformers. It is designed to achieve industry-grade performance, integrating advanced techniques for high-quality visual generation, including meticulous data curation, model design, and flow formulation.

Key contributions include:

  • 📊 High-quality fine-grained image and video data curation.
  • 🔄 The pioneering use of rectified flow for enhanced interaction among video and image tokens.
  • 🌟 Superior qualitative and quantitative performance in both image and video generation tasks.
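
For readers new to rectified flow, the idea can be sketched in a few lines. This is a generic, minimal illustration of the standard formulation (a straight-line path between a noise sample and a data sample, with the model regressing the constant velocity along that path), not Goku's training code. The function names and the toy "oracle" predictor are hypothetical and exist only for this example.

```python
# Illustrative sketch of a generic rectified-flow objective; not Goku's code.
# All function names here are hypothetical.

def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Constant velocity of the straight path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = predict_velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: an oracle that already knows the (noise, data) pair.
noise = [0.5, -1.0, 2.0]
data = [1.0, 0.0, -1.0]
oracle = lambda x, t: target_velocity(noise, data)

loss = flow_matching_loss(oracle, noise, data, t=0.3)
sample = euler_sample(oracle, noise, num_steps=4)
```

With a perfect velocity predictor the loss is zero and Euler integration carries the noise sample onto the data sample. In a real model, `predict_velocity` would be a Transformer operating on image and video tokens, conditioned on the text prompt.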

Goku supports multiple generation tasks:

  • 🎬 Text-to-Video Generation
  • 🖼️ Image-to-Video Generation
  • 🎨 Text-to-Image Generation

Performance Benchmarks 🏅

Goku achieves top scores on major benchmarks:

  • 0.76 on GenEval (text-to-image generation)
  • 83.65 on DPG-Bench (text-to-image generation)
  • 84.85 on VBench (text-to-video generation)

VBench Performance 🏆

Goku-T2V scores 84.85 on VBench, ranking No. 2 as of 2024-10-07 and surpassing several leading commercial text-to-video models.

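| Method | Total Score | Quality Score | Sampling Score | Style Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Subject Quality | Imaging Quality | Object Class | Human Action | Object Relationship | Color | Scene | Prompt Style | Overall Consistency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 |
| VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 |
| OpenSora V1.2 | 79.23 | 80.71 | 73.30 | 94.45 | 97.90 | 99.47 | 98.20 | 47.22 | 56.18 | 60.94 | 83.37 | 58.41 | 85.80 | 87.49 | 67.51 | 42.47 | 23.89 |
| Show-1 | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 |
| Gen-3 | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 |
| Pika-1.0 | 80.69 | 82.92 | 71.77 | 96.94 | 97.36 | 99.74 | 99.50 | 47.50 | 62.04 | 61.87 | 88.72 | 43.08 | 86.20 | 90.57 | 61.03 | 49.83 | 22.26 |
| CogVideoX-5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 |
| Kling | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 |
| Mira | 71.87 | 78.78 | 44.21 | 96.23 | 96.92 | 98.29 | 97.54 | 60.33 | 42.51 | 60.16 | 52.06 | 12.52 | 63.80 | 42.24 | 27.83 | 16.34 | 21.89 |
| CausVid | 84.27 | 85.65 | 78.75 | 97.53 | 97.19 | 96.24 | 98.05 | 92.69 | 64.15 | 68.88 | 92.99 | 72.15 | 99.80 | 80.17 | 64.65 | 56.58 | 24.27 |
| Luma | 83.61 | 83.47 | 84.17 | 97.33 | 97.43 | 98.64 | 99.35 | 44.26 | 65.51 | 66.55 | 94.95 | 82.63 | 96.40 | 92.33 | 83.67 | 58.98 | 24.66 |
| HunyuanVideo | 83.24 | 85.09 | 75.82 | 97.37 | 97.76 | 99.44 | 98.99 | 70.83 | 60.36 | 67.56 | 86.10 | 68.55 | 94.40 | 91.60 | 68.68 | 53.88 | 19.80 |
| Goku-T2V (ours) | 84.85 | 85.60 | 81.87 | 95.55 | 96.67 | 97.71 | 98.50 | 76.11 | 67.22 | 71.29 | 94.40 | 79.48 | 97.60 | 83.81 | 85.72 | 57.08 | 23.08 |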
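
As a sanity check on the table, the Total Score column appears consistent with a 4:1 weighted average of the Quality Score and the third score column, which matches the weighting VBench uses to combine its two aggregate scores. The sketch below recomputes the total for a few rows. The 4:1 weights are an assumption as far as this README is concerned (it does not state them), and the helper name `weighted_total` is hypothetical.

```python
# Recompute Total Score from the two aggregate columns of the table above.
# Assumption: total = (4 * quality + 1 * third_column) / 5, as in VBench's scoring.

def weighted_total(quality, third, w_quality=4.0, w_third=1.0):
    """Weighted average of the two aggregate scores."""
    return (w_quality * quality + w_third * third) / (w_quality + w_third)

# (reported total, quality score, third score column), copied from the table.
rows = {
    "Goku-T2V (ours)": (84.85, 85.60, 81.87),
    "CausVid": (84.27, 85.65, 78.75),
    "Luma": (83.61, 83.47, 84.17),
    "Mira": (71.87, 78.78, 44.21),
}

recomputed = {name: weighted_total(q, s) for name, (_, q, s) in rows.items()}

for name, (reported, _, _) in rows.items():
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed[name]:.2f}")
```

Under this weighting the quality score dominates, so a method can lead on total score while trailing on individual dimensions.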

BibTeX

@article{chen2025goku,
  title={Goku: Flow Based Video Generative Foundation Models},
  author={Chen, Shoufa and Ge, Chongjian and Zhang, Yuqi and Zhang, Yida and Zhu, Fengda and Yang, Hao and Hao, Hongxiang and Wu, Hui and Lai, Zhichao and Hu, Yifei and Lin, Ting-Che and Zhang, Shilong and Li, Fu and Li, Chuan and Wang, Xing and Peng, Yanghua and Sun, Peize and Luo, Ping and Jiang, Yi and Yuan, Zehuan and Peng, Bingyue and Liu, Xiaobing},
  journal={arXiv preprint arXiv:2502.04896},
  year={2025}
}