LLaVA-Reward
July 30, 2025
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation (ICCV 2025)
arXiv Paper | Models
Model Architecture Overview

Installation
git clone https://github.com/sjz5202/LLaVA-Reward
cd LLaVA-Reward
conda create -n llavareward python=3.10
conda activate llavareward
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
Quick Start
Example usage (also available at eval/simple_inference.py):
import torch
from eval.reward_adaptor_loader import load_reward_adaptor, inference_process_phi3v, preference_compute
import os
class Args:
    pass
args = Args()
args.pm_path = "/code/llava-reward-ckpt/alignment/llavareward_phi_alignment"
args.pretrain = "microsoft/Phi-3.5-vision-instruct"
args.cache_dir = None
args.ft_projector = True
args.seed = 1234
args.disable_fast_tokenizer = False
# load model
args, model, processor, tokenizer = load_reward_adaptor(args,model_type='phi3v',reward_config_path=os.path.join(args.pm_path, "reward_config.yaml"),load_tokenizer=True)
model.to('cuda')
model.eval()
# prepare example
caption = "perfect white haired egyptian goddess wearing white dove wings, warframe armor, regal, attractive, ornate, sultry, beautiful, ice queen, half asian, pretty face, blue eyes, detailed, scifi platform, 4 k, ultra realistic, epic lighting, illuminated, cinematic, masterpiece, art by akihito tsukushi, voidstar"
img_dir_list = ["data/sample_test/sample_img/0_1_id_000904-0035.jpg", "data/sample_test/sample_img/4_3_id_000904-0035.jpg"]
img_inputs = inference_process_phi3v(args,processor,tokenizer,img_dir_list,caption,device='cuda')
img_inputs_c = img_inputs[0]
img_inputs_r = img_inputs[1]
# inference
with torch.no_grad():
    chosen_rewards, _ = model.custom_forward(**img_inputs_c)
    reject_rewards, _ = model.custom_forward(**img_inputs_r)
    prob = preference_compute(args, chosen_rewards, reject_rewards)
    # scalar rewards are only printed when the checkpoint is not a general-preference (GPM) model
    if not args.is_general_preference:
        print("image0 reward:", chosen_rewards.item())
        print("image1 reward:", reject_rewards.item())
    print('Predict probability that image0 is better than image1:', prob)
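For checkpoints that output a scalar reward per image, the printed probability can be read through a Bradley-Terry model: the chance that image0 beats image1 is the sigmoid of the reward difference. The sketch below illustrates that relationship in plain Python. It assumes this is what `preference_compute` does in the non-GPM case, which the example above does not confirm, and the helper name `bt_preference_probability` is hypothetical rather than part of the repository.

```python
import math


def bt_preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B.

    Hypothetical helper for illustration; not part of LLaVA-Reward.
    """
    diff = reward_a - reward_b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Equal rewards give no preference either way.
    print(bt_preference_probability(0.0, 0.0))  # 0.5
```

Under this reading, only the difference between the two rewards matters, so adding the same constant to both rewards leaves the probability unchanged.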
Evaluation
Prepare the evaluation data first, for example MJ-Bench.
./eval/ contains the corresponding evaluation scripts for LLaVA-Reward with different MLLM backbones. To try LLaVA-Reward on a few of your own samples, run bash eval/batch_inference_rm_phi_user_input.sh.
Reformatted training data
In data/, we provide the reformatted training data from ImageReward and UnsafeBench. We also include sample test data in data/sample_test, which shows the data format for pairwise and non-pairwise test data.
Training
scripts/ contains training scripts.
- Loss variants: BT/GPM/CLS (note that GPM mode requires pairwise images as input)
- Architecture: with and without SkipCA
- Different backbones: Phi-3.5-v, Qwen2.5-VL and LLaVA-v1.6.
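To illustrate why GPM mode needs image pairs, here is a minimal sketch of a general-preference score for 2-dimensional embeddings (matching `--value_head_dim 2` in the example script). It assumes the skew-symmetric formulation from the GPM work, where the score is defined between two embeddings and scaled by a temperature before the sigmoid. The function names are hypothetical, and the actual implementation in this repository may differ.

```python
import math
from typing import Sequence


def gpm_score(v_a: Sequence[float], v_b: Sequence[float]) -> float:
    """Skew-symmetric preference score of A over B for 2-D embeddings.

    Equals <R v_a, v_b> with R = [[0, -1], [1, 0]] (assumed form).
    """
    return v_a[0] * v_b[1] - v_a[1] * v_b[0]


def gpm_probability(v_a: Sequence[float], v_b: Sequence[float], tau: float = 0.1) -> float:
    """Probability that A is preferred over B; tau is a temperature (assumed role)."""
    x = gpm_score(v_a, v_b) / tau
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


if __name__ == "__main__":
    a, b = (1.0, 0.0), (0.0, 1.0)
    print(gpm_score(a, b), gpm_score(b, a))  # 1.0 -1.0
```

Because the score of an embedding against itself is always zero, a single image yields no standalone scalar reward in this formulation, which is consistent with the pairwise-input requirement. BT and CLS, by contrast, work with one scalar or class prediction per image.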
Example Training script
# phi3 gpm lora
deepspeed train_rm_general_preference.py \
--save_path your-path \
--save_steps 2 \
--logging_steps 1 \
--eval_steps 10000 \
--accumulated_gradient 4 \
--micro_train_batch_size 4 \
--pretrain microsoft/Phi-3.5-vision-instruct \
--bf16 \
--max_epochs 3 \
--max_len 2048 \
--zero_stage 3 \
--learning_rate 2e-4 \
--general_preference_tau 0.1 \
--dataset your-json \
--dataset_probs 1 \
--flash_attn \
--gradient_checkpointing \
--group_size 1 \
--value_head_dim 2 \
--save_best_model 2 \
--train_split_ratio 1 \
--freeze_vision_model \
--lora_rank 128 \
--lora_alpha 256 \
--ft_projector \
--add_cross_attention \
--is_general_preference \
--lora_dropout 0.05
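A quick sanity check on the flags above, as a rough sketch: under the usual DeepSpeed convention the effective batch size per optimizer step is the per-GPU micro batch times the gradient accumulation steps times the number of GPUs, and standard LoRA scales its update by alpha divided by rank. That `--micro_train_batch_size` and `--accumulated_gradient` map onto those quantities is an assumption, and the GPU count of 8 is a hypothetical value that the script does not specify.

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


def lora_scaling(lora_alpha: float, lora_rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_rank


if __name__ == "__main__":
    # micro batch 4 and accumulation 4 come from the script; 8 GPUs is hypothetical.
    print(effective_batch_size(4, 4, 8))  # 128
    # alpha 256 and rank 128 come from the script.
    print(lora_scaling(256, 128))  # 2.0
```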
Checkpoints
We provide checkpoints for text-to-image alignment, fidelity, and safety evaluation, as well as checkpoints for general T2I rewarding. They are available at LLaVA-Reward.
To perform inference, use the evaluation scripts in eval/. Make sure the script name matches the backbone of the selected checkpoint.
Examples of inference-time scaling generations (Fk-Diffusion-Steering) using LLaVA-Reward

Acknowledgement
LLaVA-Reward is mainly based on GPM and OpenRLHF.
Citation
@misc{zhou2025multimodalllmscustomizedreward,
title={Multimodal LLMs as Customized Reward Models for Text-to-Image Generation},
author={Shijie Zhou and Ruiyi Zhang and Huaisheng Zhu and Branislav Kveton and Yufan Zhou and Jiuxiang Gu and Jian Chen and Changyou Chen},
year={2025},
eprint={2507.21391},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.21391},
}