
June 27, 2026

UniRL: A Reinforcement Learning Framework for Unified Multimodal Models


U(you)·ni(need)·RL for unified multimodal intelligence


News 🚀

  • [2026-06] DRPO released: "Rethinking the Divergence Regularization in LLM RL" (arXiv).
  • [2026-06] Flow-DPPO released: "Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models" (arXiv).
  • [2026-06] CPPO released: "Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning" (arXiv).

About 💡

UniRL applies one RL post-training loop (generate samples, score them, compute advantages, update the policy, and sync weights back to rollout workers) across multimodal model families.
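
The five steps of that loop can be sketched in a few lines of Python. This is a minimal, framework-free illustration under stated assumptions, not UniRL's implementation: `Policy`, `RolloutWorker`, `reward_fn`, `group_advantages`, and `rl_step` are hypothetical names, the "model" is a single scalar, and the group-normalized advantage (a GRPO-style choice) is assumed for the example.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Policy:
    """Toy trainer-side policy: one scalar stands in for model weights."""
    mean: float = 0.0
    version: int = 0


@dataclass
class RolloutWorker:
    """Toy rollout worker that samples from its own copy of the weights."""
    mean: float = 0.0
    version: int = 0

    def generate(self, n: int, rng: random.Random) -> list[float]:
        # 1. Generate samples from the worker's current weights.
        return [rng.gauss(self.mean, 1.0) for _ in range(n)]

    def load(self, policy: Policy) -> None:
        # 5. Sync the updated weights back to the rollout worker.
        self.mean, self.version = policy.mean, policy.version


def reward_fn(sample: float, target: float = 3.0) -> float:
    # 2. Score a sample (assumed toy reward: closer to the target is better).
    return -abs(sample - target)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # 3. Compute advantages by normalizing rewards within the group
    #    (an assumed GRPO-style choice).
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rl_step(policy: Policy, worker: RolloutWorker, rng: random.Random,
            group_size: int = 16, lr: float = 0.1) -> float:
    samples = worker.generate(group_size, rng)
    rewards = [reward_fn(s) for s in samples]
    advantages = group_advantages(rewards)
    # 4. Update the policy with a score-function gradient estimate: for a
    #    unit-variance Gaussian, d/d(mean) log p(s) = s - mean.
    grad = statistics.fmean(
        [a * (s - policy.mean) for a, s in zip(advantages, samples)]
    )
    policy.mean += lr * grad
    policy.version += 1
    worker.load(policy)
    return statistics.fmean(rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    policy, worker = Policy(), RolloutWorker()
    for _ in range(300):
        rl_step(policy, worker, rng)
    print(f"policy mean after training: {policy.mean:.2f}")
```

The worker only ever samples from weights it received through `load`, which is why the sync step closes the loop: without it, later rollouts would come from a stale policy.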

[Figure: UniRL architecture]

UniRL is a layered, composable system. Each entrypoint (train_diffusion, train_ar, train_pe, train_unified_model) loads a Hydra example config covering model, algorithm, rollout, reward, placement, and sync, then creates the matching domain trainer (DiffusionTrainer, ARTrainer, PETrainer, UnifiedModelTrainer). The trainer coordinates the RL loop across pluggable rollout engines, algorithms, model bundles, reward services, and the shared distributed runtime: Ray DevicePool, FSDP, Transfer Queue (TQ), and LoRA/full-weight sync. See unirl/README.md for the runtime loop, deployment modes, and module map.
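
The entrypoint-to-trainer pairing described above can be written down as a small lookup table. The pairs and the six config sections come from the paragraph above; `ENTRYPOINT_TO_TRAINER`, `REQUIRED_SECTIONS`, `trainer_for`, and `missing_sections` are hypothetical names for illustration, and treating each section as a top-level config key is an assumption rather than UniRL's actual schema.

```python
# Entrypoint -> trainer class name, as listed in the paragraph above.
ENTRYPOINT_TO_TRAINER = {
    "train_diffusion": "DiffusionTrainer",
    "train_ar": "ARTrainer",
    "train_pe": "PETrainer",
    "train_unified_model": "UnifiedModelTrainer",
}

# Sections an example config is said to cover (assumed to be top-level keys).
REQUIRED_SECTIONS = ("model", "algorithm", "rollout", "reward", "placement", "sync")


def trainer_for(entrypoint: str) -> str:
    """Look up the trainer class name paired with an entrypoint."""
    try:
        return ENTRYPOINT_TO_TRAINER[entrypoint]
    except KeyError:
        raise ValueError(f"unknown entrypoint: {entrypoint!r}") from None


def missing_sections(config: dict) -> list[str]:
    """Return the required sections absent from a config, in declared order."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


if __name__ == "__main__":
    print(trainer_for("train_ar"))
    print(missing_sections({"model": {}, "reward": {}}))
```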

Team-Proposed Algorithms 🌟

🌟 These algorithms are proposed by our team and are the highlight of UniRL. Each algorithm's folder holds a step-by-step tutorial and a runnable example recipe. We highly recommend trying them in our framework!

| Algorithm | Paper | Tutorial | Notes |
| --- | --- | --- | --- |
| Flow-DPPO | "Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models" | FlowDPPO/ | Diffusion/flow RL with an exact divergence-based trust-region mask. |
| DRPO | "Rethinking the Divergence Regularization in LLM RL" | DRPO/ | Token-level LLM RL with a smooth advantage-weighted quadratic regularizer. |
| CPPO | "Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning" | CPPO/ | Token-level LLM RL with a position-weighted, cumulative-prefix-budget Binary-TV mask. |

UniRL also wires in standard reference algorithms (GRPO for LLMs, DiffusionNFT, DanceGRPO, and MixGRPO) in unirl/algorithms/.

Model Support 🎨

Model and algorithm support are two independent dimensions that compose within a domain: any diffusion algorithm (see above) runs on a diffusion model, and AR algorithms run on AR models, so UniRL covers many more model × algorithm combinations than the shipped example recipes alone. The table below is the model dimension; all listed models are supported (✅).

| Model | Category | Modality | Status |
| --- | --- | --- | --- |
| Stable Diffusion 3 / 3.5 | Image diffusion | Text → Image | ✅ |
| Qwen-Image | Image diffusion | Text → Image | ✅ |
| FLUX.2-Klein | Image diffusion | Text → Image / Text + Image → Image | ✅ |
| Z-Image | Image diffusion | Text → Image | ✅ |
| WAN 2.1 | Video diffusion | Text / Image → Video | ✅ |
| WAN 2.2 | Video diffusion | Text / Image → Video | ✅ |
| HunyuanVideo 1.0 / 1.5 | Video diffusion | Text → Video | ✅ |
| LTX-Video-2 | Video diffusion | Text → Video | ✅ |
| LTX-Video-2.3 | Video diffusion | Text → Audio + Video | ✅ |
| Qwen-VL | Vision-language AR | Text + Image → Text | ✅ |
| Qwen3 | LLM AR | Text → Text | ✅ |
| Prompt-Enhancer | LLM + diffusion | Text → Text → Image | ✅ |
| HunyuanImage3 | Unified AR + diffusion | Text → Image | ✅ |
| Bagel | Unified AR + diffusion | Text / Text + Image → Image | ✅ |

Each model maps to a domain entrypoint (train_diffusion, train_ar, train_pe, train_unified_model); see Getting Started below to run any of them.
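
To make the composition claim concrete, the sketch below enumerates model × algorithm pairs within each domain. Model and algorithm names come from this README; the model lists are a subset, grouping the algorithms by domain as shown is an assumption based on their descriptions, and `MODELS`, `ALGORITHMS`, and `combinations` are hypothetical names. It illustrates the counting argument only and does not assert that every pair has been validated.

```python
from itertools import product

# Subset of the models listed above, grouped by domain.
MODELS = {
    "diffusion": ["Stable Diffusion 3 / 3.5", "Qwen-Image", "WAN 2.1"],
    "ar": ["Qwen-VL", "Qwen3"],
}

# Algorithms named in this README, grouped by domain (assumed grouping).
ALGORITHMS = {
    "diffusion": ["Flow-DPPO", "DiffusionNFT", "DanceGRPO", "MixGRPO"],
    "ar": ["GRPO", "DRPO", "CPPO"],
}


def combinations(domain: str) -> list[tuple[str, str]]:
    """All (model, algorithm) pairs inside one domain; never across domains."""
    return list(product(MODELS[domain], ALGORITHMS[domain]))


if __name__ == "__main__":
    for domain in MODELS:
        pairs = combinations(domain)
        print(f"{domain}: {len(pairs)} model x algorithm pairs")
```

Pairs are formed per domain, so the count grows as the product of the two lists inside a domain rather than across the whole model table.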

Training Modes 🧩

UniRL unifies four training modes, each with its own Hydra example bucket and entrypoint. Examples are self-contained YAML files selected with --config-name=<domain>/<example>:

| Domain | Trains | Entrypoint | Example |
| --- | --- | --- | --- |
| diffusion/ | Image / video diffusion models | train_diffusion | diffusion/sd3/sd3_sglang_rollout_colocate |
| ar/ | Autoregressive models: vision-language (VLM) + text-only (LLM) | train_ar | ar/qwen_vl_grpo_geo3k_mc_4x8, ar/qwen3_drpo_4b_base_dapo_sglang |
| pe/ | Prompt-enhancer (AR rewriter + diffusion reward) | train_pe | pe/pe_sglang_full_pickscore |
| unified_model/ | Unified AR + diffusion models | train_unified_model | unified_model/hi3_vllmomni |

See examples/README.md for the full launch guide, naming schema, and how to add a recipe.
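
Because each config name starts with its domain bucket, the matching entrypoint can be derived from the name itself. The sketch below builds the Hydra compose-check command in the form this README uses under Getting Started (`python -m unirl.<entrypoint> --config-name=... --cfg job --resolve`). The domain-to-entrypoint pairs come from the table above; `DOMAIN_TO_ENTRYPOINT` and `compose_check_command` are hypothetical helper names, not part of UniRL.

```python
import shlex

# Domain bucket -> entrypoint module, as listed in the table above.
DOMAIN_TO_ENTRYPOINT = {
    "diffusion": "train_diffusion",
    "ar": "train_ar",
    "pe": "train_pe",
    "unified_model": "train_unified_model",
}


def compose_check_command(config_name: str) -> list[str]:
    """Build the argv that prints the resolved Hydra job config for an example."""
    # The domain is everything before the first slash of the config name.
    domain, _, example = config_name.partition("/")
    if domain not in DOMAIN_TO_ENTRYPOINT or not example:
        raise ValueError(f"expected <domain>/<example>, got {config_name!r}")
    return [
        "python", "-m", f"unirl.{DOMAIN_TO_ENTRYPOINT[domain]}",
        f"--config-name={config_name}",
        "--cfg", "job", "--resolve",
    ]


if __name__ == "__main__":
    print(shlex.join(compose_check_command("pe/pe_sglang_full_pickscore")))
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.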

Getting Started ⚡

Install dependencies first; see INSTALL.md.

# compose-check, then launch a single-node example
python -m unirl.train_diffusion --config-name=diffusion/sd3/sd3_trainside --cfg job --resolve
bash examples/run_experiment_single_node.sh diffusion/sd3/sd3_trainside

Full launch guide: multi-node, every entrypoint, mooncake.

Roadmap 🗺️

We are actively expanding model and algorithm coverage. Near-term directions:

  • Broaden algorithm coverage for the newer model families: FLUX.2-Klein, HunyuanVideo 1.0 / 1.5, and Bagel.
  • Extend the team-proposed algorithms (Flow-DPPO, DRPO) to more model families.
  • Broaden reward backends and rollout-engine coverage across domains.

Want a model or algorithm prioritized? Open an issue to discuss.

Contributing 🤝

Contributions and questions are welcome. Before opening a pull request, read the repository conventions in AGENTS.md, run the pre-PR checks for the files you touched, and fill in the pull request template. For questions, bug reports, and feature requests, open an issue.

Acknowledgement 🙏

UniRL builds on ideas and infrastructure from the open-source RL and inference ecosystem. We especially thank vLLM, SGLang, slime, and verl.

Citation 📚

If you find UniRL helpful, please cite:

@misc{unirl_github,
  title        = {{UniRL: A Reinforcement Learning Framework for Unified Multimodal Models}},
  author       = {Haonan Wang and Linyu Wu and Qian Qiu and Lewei Jin and Bowen Ping and Jianghai Chen and Yiheng Du and Guangxin He and Yu Shi and Yongguang Lin and Zhuoxin Zhou and Zhanchao Zhou and Keming Wu and Rizhen Hu and Xuefei Ning and Lvfang Tao and Feiyu Hu and Xiangyan Liu and Siqi Kou and Jiarui Yao and Xiangxin Zhou and Liefeng Bo and Wenxi Zhu and Tianyu Pang},
  year         = {2026},
  howpublished = {\url{https://github.com/Tencent-Hunyuan/UniRL}},
  urldate      = {2026-06-05}
}

If you use DRPO, please also cite:

@article{yao2026rethinking,
  title={Rethinking the Divergence Regularization in LLM RL},
  author={Yao, Jiarui and Zhou, Xiangxin and Qi, Penghui and Lee, Wee Sun and Bo, Liefeng and Pang, Tianyu},
  journal={arXiv preprint arXiv:2606.09821},
  year={2026}
}

If you use Flow-DPPO, please also cite:

@article{ping2026flow,
  title={Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models},
  author={Ping, Bowen and Zhou, Xiangxin and Qi, Penghui and Luo, Minnan and Bo, Liefeng and Pang, Tianyu},
  journal={arXiv preprint arXiv:2606.11025},
  year={2026}
}