March 2, 2026

SpatialReward Logo
Enhancing Spatial Understanding in Image Generation via Reward Modeling

Zhenyu Tang1,2*, Chaoran Feng1*, Yufan Deng1,2, Jie Wu2, Xiaojie Li2,
Rui Wang2, Yunpeng Chen2, Daquan Zhou1

1Peking University    2ByteDance Seed
*Equal Contribution

Project Page arXiv Dataset hf_space License: CC BY-NC 4.0


📅 News

  • [2026.03] 📄 Paper is now available on arXiv: https://arxiv.org/abs/2602.24233
  • [TBD] 🚧 We are planning to release the SpatialReward-Dataset and SpatialScore model weights. Please stay tuned!

📖 Abstract

Text-to-image models have made significant strides in visual fidelity but still struggle with complex spatial relationships, and existing reward models frequently fail to capture these spatial constraints.

In this work, we introduce a novel method to strengthen spatial understanding in image generation:

  1. SpatialReward-Dataset: A curated dataset with over 80k preference pairs, featuring adversarial spatial perturbations verified by humans.
  2. SpatialScore: A VLM-based reward model (built on Qwen2.5-VL) that surpasses proprietary models (e.g., GPT-5, Gemini 2.5) in spatial evaluation accuracy.
  3. Spatial-RL: We demonstrate that SpatialScore effectively enables online Reinforcement Learning (specifically GRPO with Top-k filtering), yielding significant gains in spatial generation capabilities.

Performance Comparison
Figure 1: Existing reward models often assign high scores to spatially incorrect images. SpatialScore provides accurate feedback, enabling better alignment.

🔥 Highlights & Contributions

1. SpatialReward-Dataset

We constructed a large-scale dataset focusing on spatial logic. Each entry consists of a "Perfect Image" (aligned with the text) and a "Perturbed Image" (with subtle spatial violations), creating a hard negative sample for robust training.
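
The dataset has not been released yet, so its on-disk schema is unknown. As a rough sketch only, one entry could be represented as below. The class name `PreferencePair`, its field names, and the file paths are hypothetical and merely illustrate the perfect/perturbed pairing described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One hypothetical dataset entry: a prompt plus a preferred and a rejected image."""

    prompt: str           # text description containing spatial constraints
    perfect_image: str    # path to the image aligned with the prompt (hypothetical field)
    perturbed_image: str  # path to the hard negative with a spatial violation (hypothetical field)

    def as_chosen_rejected(self) -> dict:
        """Map the pair onto the chosen/rejected layout often used for reward-model training."""
        return {
            "prompt": self.prompt,
            "chosen": self.perfect_image,
            "rejected": self.perturbed_image,
        }


# Illustrative entry; the prompt and paths are made up.
pair = PreferencePair(
    prompt="a red cube to the left of a blue sphere",
    perfect_image="images/0001_perfect.png",
    perturbed_image="images/0001_perturbed.png",
)
example = pair.as_chosen_rejected()
```

Keeping both images under the same prompt is what makes the perturbed image a hard negative: the two differ only in the spatial relation, so a reward model cannot separate them by style or fidelity alone.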

2. SpatialScore: State-of-the-Art Reward Model

By fine-tuning Qwen2.5-VL, our SpatialScore achieves superior performance in evaluating spatial relationships, outperforming strong baselines including HPS v2/v3, ImageReward, and even proprietary VLM APIs on our benchmarks.

| Model | Overall Accuracy |
|---|---|
| HPS v2.1 | 46.3% |
| ImageReward | 47.9% |
| GPT-5 (API) | 89.0% |
| Gemini-2.5 Pro | 95.1% |
| SpatialScore (Ours) | 95.8% |
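
This README does not define how accuracy is computed. One plausible reading, assumed here, is pairwise preference accuracy: the fraction of pairs in which the reward model scores the perfect image strictly higher than the perturbed one. The function name and the toy scores below are illustrative and not taken from the paper.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (perfect_score, perturbed_score) pairs ranked correctly.

    A tie counts as an error, since the model failed to prefer the perfect image.
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for perfect, perturbed in pairs if perfect > perturbed)
    return correct / len(pairs)


# Toy scores (made up for illustration): three correct rankings, one miss.
toy_scores = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8), (0.95, 0.1)]
accuracy = pairwise_accuracy(toy_scores)
print(f"{accuracy:.1%}")  # 75.0%
```

Under this reading, a model that scores images at random would land near 50%, which gives some context for the HPS v2.1 and ImageReward rows.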

3. Reinforcement Learning with Top-k Filtering

We apply GRPO (Group Relative Policy Optimization) using SpatialScore as the feedback signal. To handle reward noise and prompt difficulty variance, we introduce a Top-k filtering strategy, which significantly stabilizes training and improves convergence.
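
To make the idea concrete, here is a minimal sketch of group-relative advantages followed by a filtering step. The advantage computation follows the usual GRPO recipe of standardizing rewards within the group of samples drawn for one prompt. The filtering rule is an assumption: this README does not say what Top-k ranks, so the sketch keeps the k samples with the largest absolute advantage. The function names and reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def top_k_mask(advantages: List[float], k: int) -> List[bool]:
    """ASSUMED filtering rule: keep the k samples with the largest |advantage|."""
    if not 0 < k <= len(advantages):
        raise ValueError("k must be between 1 and the group size")
    ranked = sorted(range(len(advantages)), key=lambda i: abs(advantages[i]), reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(advantages))]


# Toy rewards for six images sampled from one prompt (values are made up).
rewards = [0.91, 0.15, 0.52, 0.48, 0.88, 0.20]
advantages = group_relative_advantages(rewards)
mask = top_k_mask(advantages, k=4)
# Samples outside the mask are zeroed so they contribute no gradient signal.
filtered = [a if keep else 0.0 for a, keep in zip(advantages, mask)]
```

In this toy group the two samples with rewards close to the group mean are dropped, which is one way a filter could reduce the influence of noisy, near-average rewards. The actual criterion used in the paper may differ.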

RL Pipeline

🖼️ Visual Results

Our method significantly improves the spatial layout capability of Flux.1-dev.

Visual Results
Comparison of generated images using complex spatial prompts.

✍️ Citation

If you find our work useful, please cite our paper:

@article{tang2026enhancing,
  title={Enhancing Spatial Understanding in Image Generation via Reward Modeling},
  author={Tang, Zhenyu and Feng, Chaoran and Deng, Yufan and Wu, Jie and Li, Xiaojie and Wang, Rui and Chen, Yunpeng and Zhou, Daquan},
  journal={arXiv preprint arXiv:2602.24233},
  year={2026}
}

⚖️ License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more details, please refer to the LICENSE file.