Q-ReAlign

June 24, 2026

qalign — Pro (9B) · Lite (4B) · Mini (0.8B)

Q-ReAlign: Lightweight Human-Aligned Multimodal Judges Built on Modern Vision-Language Models

Q-Align-level performance with 50% fewer parameters. Easy to use, easy to train.

The 0.8B model requires less than 4 GB of GPU memory and is CPU-runnable.

News

  • 🆕 Jun 2026 — Q-ReAlign has been adopted by IQA-PyTorch! All three sizes are now available as the qrealign metric (qrealign-mini / qrealign-lite / qrealign-pro) — score any image in two lines via pyiqa.create_metric. See Use via IQA-PyTorch.

Results

We train three sizes — Mini (0.8B) · Lite (4B) · Pro (9B) (the mascots above) — and compare against the original Q-Align.

All three are trained on the same ONE-ALIGN mix of six public datasets — KonIQ + SPAQ + KADID + AGIQA-20K + AVA + LSVQ (IQA + IAA + VQA combined).

Performance — SRCC across 7 benchmarks

SRCC across KonIQ, SPAQ, KADID, AGI, LIVE, AVA, LSVQ

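On average SRCC across KonIQ, SPAQ, KADID, AGI, LIVE, AVA, and LSVQ, all three variants beat the original Q-Align, and the average rises with model size: Mini (0.8B) reaches 0.879, Lite (4B) 0.889, and Pro (9B) 0.896, against Q-Align's 0.869. Per dataset, Lite and Pro match or beat Q-Align's SRCC everywhere, while Mini trails it slightly on KonIQ, SPAQ, KADID, and AVA. The largest gain for every size is on AGI.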

Per-dataset SRCC / PLCC on seven QA benchmarks — bold = best, italic = second-best:

| Model | KonIQ | SPAQ | KADID | AGI | LIVE | AVA | LSVQ | Avg. |
|---|---|---|---|---|---|---|---|---|
| Q-Align | 0.942 / 0.944 | 0.932 / 0.933 | 0.912 / 0.920 | 0.738 / 0.781 | 0.897 / 0.870 | 0.798 / 0.796 | 0.867 / 0.866 | 0.869 / 0.873 |
| Mini (0.8B) | 0.935 / 0.938 | 0.931 / 0.933 | 0.903 / 0.907 | 0.811 / 0.848 | 0.907 / 0.873 | 0.797 / 0.794 | 0.869 / 0.869 | 0.879 / 0.880 |
| Lite (4B) | 0.943 / 0.941 | 0.932 / 0.934 | 0.928 / 0.931 | 0.829 / 0.871 | 0.899 / 0.862 | 0.814 / 0.804 | 0.880 / 0.879 | 0.889 / 0.889 |
| Pro (9B) | 0.950 / 0.952 | 0.935 / 0.937 | 0.934 / 0.939 | 0.843 / 0.885 | 0.902 / 0.876 | 0.832 / 0.828 | 0.883 / 0.884 | 0.896 / 0.900 |

Each cell is SRCC / PLCC. Rankings use full-precision values before rounding; avg. SRCC for Pro (9B) and Mini (0.8B) uses the reported 0.8956 / 0.8792. In the original table, every result that beats Q-Align is highlighted in red; here, all but a handful of cells qualify.
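
For readers new to the two metrics: SRCC is the Spearman rank correlation and PLCC is the Pearson linear correlation between predicted scores and human mean opinion scores (MOS). The sketch below computes both in plain Python. The sample values are made up for illustration, and the repository's own scorer.py may compute these differently (for example via SciPy).

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model predictions (0..1) and human MOS (1..5) for five images.
predicted = [0.91, 0.35, 0.62, 0.10, 0.78]
mos = [4.6, 2.1, 3.0, 1.2, 3.3]
print(f"SRCC = {spearman(predicted, mos):.3f}, PLCC = {pearson(predicted, mos):.3f}")
```

SRCC only looks at ordering, so it is insensitive to the scale of the predictions. PLCC also rewards a linear relationship with the MOS values.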

Speed — throughput vs. batch size

Throughput (images/sec) on RTX 4090 and H200 141GB

Sustained throughput on a consumer RTX 4090 and a datacenter H200 141GB, measured on the SPAQ dataset.
Mini (0.8B) tops out at 26.7 img/s @ bs=4 (4090) and 61.1 img/s @ bs=14 (H200); OOM points are marked where the batch size exceeds device memory.

Models

Pretrained weights are on the Hugging Face Hub — q-future/Q-ReAlign collection:

| Model | Params | Hugging Face |
|---|---|---|
| Pro | 9B | q-future/Q-ReAlign-Pro-9B |
| Lite | 4B | q-future/Q-ReAlign-Lite-4B |
| Mini | 0.8B | q-future/Q-ReAlign-Mini-0.8B |

Pass any of these repo IDs as CKPT / --model and the weights download automatically.

Quick Run

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
# requires transformers >= 5.2.0 for Qwen3.5 support

CKPT, IMAGE = "q-future/Q-ReAlign-Mini-0.8B", "photo.jpg"   # auto-downloads from the Hub
LEVELS, WEIGHTS = ["excellent", "good", "fair", "poor", "bad"], [1.0, 0.75, 0.5, 0.25, 0.0]

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModelForImageTextToText.from_pretrained(CKPT, dtype="auto").to(device).eval()

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "How would you rate the quality of this image?"}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True) + "The quality of the image is"
inputs = processor(text=[text], images=[Image.open(IMAGE).convert("RGB")], return_tensors="pt").to(device)

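# first token id of each level word (the leading space matches how it follows the answer stem)
ids = [processor.tokenizer(" " + w, add_special_tokens=False).input_ids[0] for w in LEVELS]
with torch.no_grad():   # inference only: no autograd graph needed
    probs = model(**inputs).logits[0, -1, ids].softmax(-1)
# expected value of the level weights under the level-token distribution
score = (probs * torch.tensor(WEIGHTS, device=device)).sum().item()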
print(f"quality score: {score:.4f}")   # 0 (worst) .. 1 (best)
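
The final lines of the script reduce to a softmax over the five level-token logits followed by a weighted average. The worked example below reproduces that arithmetic without loading a model. The logits are made up and do not come from any real checkpoint, and `level_score` is a name introduced here for illustration.

```python
import math

LEVELS = ["excellent", "good", "fair", "poor", "bad"]
WEIGHTS = [1.0, 0.75, 0.5, 0.25, 0.0]

def level_score(logits):
    """Softmax over the level logits, then the probability-weighted mean of WEIGHTS."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * w for p, w in zip(probs, WEIGHTS)), probs

# Hypothetical logits for the five level tokens at the last position.
score, probs = level_score([2.0, 3.0, 1.0, -1.0, -2.0])
for name, p in zip(LEVELS, probs):
    print(f"{name:>9}: {p:.3f}")
print(f"quality score: {score:.4f}")
```

Because the softmax is restricted to the five level tokens, the score always lies in [0, 1]: equal logits give 0.5, and a distribution concentrated on "excellent" approaches 1.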

Use via IQA-PyTorch

Q-ReAlign is built into IQA-PyTorch (pyiqa) as the qrealign metric — the easiest way to score images without touching the training stack. The checkpoints download automatically from the Hub.

pip install pyiqa
pip install -U "transformers>=5.2"   # Qwen3.5-VL backbone needs transformers >= 5.2

import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# qrealign defaults to Mini (0.8B); use qrealign-lite (4B) or qrealign-pro (9B)
iqa_metric = pyiqa.create_metric("qrealign", device=device)
# iqa_metric = pyiqa.create_metric("qrealign-pro", device=device)

# higher is better; scores are in [0, 1]
print(iqa_metric.lower_better)        # False

# image path or a (N, 3, H, W) RGB tensor in [0, 1]
score = iqa_metric("photo.jpg")
print(score)

| pyiqa name | Model | Params |
|---|---|---|
| qrealign (= qrealign-mini) | Mini | 0.8B |
| qrealign-lite | Lite | 4B |
| qrealign-pro | Pro | 9B |

The metric also exposes an aesthetic task (IAA) alongside the default quality (IQA) scoring. See the IQA-PyTorch Model Card for details.

Install

pip install -e .            # core (CPU): template / cache / config layer
pip install -e ".[runtime]" # + ms-swift, transformers, deepspeed, decord (GPU box)
pip install -e ".[dev]"     # + pytest

import qalign works without a GPU; ms-swift / torch are imported lazily and only needed for train / eval / infer.

Docker

The fastest way to a working train / eval / infer box. The image builds on the official, verified ModelScope + ms-swift image — which already ships torch 2.10, ms-swift 4.0.3, transformers, deepspeed, and vLLM on CUDA 12.8.1 / Python 3.11 — and just layers qalign (+ decord for video) on top:

# build (qalign on top of the verified runtime base)
docker build -t qalign .

# run with GPUs, mounting the repo + your datasets
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/qalign \
  -v /path/to/datasets:/data \
  qalign

# inside the container the CLI is ready:
qalign build --config configs/example_iqa.yaml
qalign train --config configs/my.yaml mini

The base image is the source of truth for the swift runtime, so the build never touches torch / ms-swift. For a region-local pull, override the base:

docker build -t qalign \
  --build-arg BASE=modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.10.0-vllm0.17.1-modelscope1.34.0-swift4.0.3 .

5-minute tour (no GPU, bundled toy data)

The CPU stages run out of the box on a tiny synthetic dataset:

pip install -e ".[dev]"
pytest -q                                            # unit tests for the core
qalign build --config configs/example_iqa.yaml       # -> examples/runs/{data,eval_manifests}/toy_iqa.jsonl
qalign cache --config configs/example_iqa.yaml --which train   # -> examples/runs/cache/{images.blob,index.json}

Open the produced toy_iqa.jsonl to see the Q-Align conversation records. To actually train/score, set model.path in the config to a real VL model and run on a GPU box.

The one CLI

qalign build  --config CFG    # labelled sources -> Q-Align training + eval manifests
qalign frames --config CFG    # videos -> N sampled frames-as-images (+ manifest)
qalign cache  --config CFG    # pack images into a RAM-resident blob (big dataloading win)
qalign train  --config CFG    # full-parameter SFT via ms-swift  (append `mini` for a smoke test)
qalign eval   --config CFG    # SRCC / PLCC over the eval sets
qalign infer  --config CFG IMG/VIDEO ...   # label-free quality score per file

Every command takes --config and optional --set key.path=value overrides (e.g. --set train.lr=1e-5).
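
One plausible way to apply such dotted overrides to a nested config is sketched below. It is an illustration only: `apply_override` is a hypothetical helper working on plain dicts, not the code in config.py, which maps the YAML onto nested dataclasses and may parse values differently.

```python
import ast

def apply_override(config, assignment):
    """Apply one 'key.path=value' override to a nested dict, in place."""
    path, _, raw = assignment.partition("=")
    try:
        value = ast.literal_eval(raw)         # numbers, lists, quoted strings
    except (ValueError, SyntaxError):
        value = raw                           # bare words such as lora stay strings
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})       # create intermediate levels on demand
    node[leaf] = value
    return config

config = {"train": {"lr": 2e-5, "train_type": "full"}}
for item in ["train.lr=1e-5", "train.train_type=lora", "train.lora_rank=32"]:
    apply_override(config, item)
print(config)
```

In this sketch a bare word such as `true` stays a string; a YAML-aware parser would turn it into a boolean.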

End-to-end (real run)

# 1) edit a config: point model.path at your VL model, list your datasets + mix
cp configs/example_iqa.yaml configs/my.yaml && $EDITOR configs/my.yaml

# 2) prepare data  (build = images + eval manifests; frames = video)
qalign build  --config configs/my.yaml
qalign frames --config configs/my.yaml          # only if you have video datasets
qalign cache  --config configs/my.yaml          # optional but recommended at scale

# 3) train (smoke test first), then full run
qalign train  --config configs/my.yaml mini
qalign train  --config configs/my.yaml

# 4) evaluate a checkpoint  (or rely on the in-training eval curve in logging.jsonl)
qalign eval   --config configs/my.yaml --model runs/.../checkpoint-XXXX

# 5) score arbitrary media
qalign infer  --config configs/my.yaml --model runs/.../best/checkpoint-XXXX photo.jpg clip.mp4

configs/onealign.yaml is the full multi-task reference (IQA + IAA + VQA = ONE-ALIGN).

Full-parameter or LoRA

Training defaults to full-parameter SFT. To fine-tune with LoRA instead, set train.train_type: lora in the config (or override it once on the command line) — everything else in the pipeline is unchanged:

train:
  train_type: lora          # full (default) | lora
  lora_rank: 16             # LoRA rank
  lora_alpha: 32            # LoRA alpha
  # target_modules default to all-linear; LR defaults to 2e-4 for LoRA (2e-5 full)

# one-off LoRA run without editing the file
qalign train --config configs/my.yaml --set train.train_type=lora
qalign train --config configs/my.yaml --set train.train_type=lora --set train.lora_rank=32

ms-swift attaches the adapters to all-linear targets and automatically uses the default LoRA learning rate (2e-4); the eval, infer, and cache stages are identical to a full-parameter run.
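
As a rough guide to what lora_rank and lora_alpha control: in the standard LoRA formulation, a frozen weight matrix W of shape (d_out, d_in) receives a trainable update (alpha / r) · B · A, with A of shape (r, d_in) and B of shape (d_out, r). The sketch below counts the adapter parameters for one linear layer. The 4096 x 4096 layer size is an illustrative assumption, not a dimension taken from any Q-ReAlign backbone, and `lora_layer_stats` is a hypothetical helper.

```python
def lora_layer_stats(d_in, d_out, rank, alpha):
    """Adapter size and scaling for one linear layer under standard LoRA."""
    adapter_params = rank * (d_in + d_out)    # A: (rank, d_in), B: (d_out, rank)
    full_params = d_in * d_out
    return {
        "adapter_params": adapter_params,
        "fraction_of_full": adapter_params / full_params,
        "scaling": alpha / rank,
    }

for rank in (16, 32):
    stats = lora_layer_stats(4096, 4096, rank, alpha=32)
    print(rank, stats)
```

Doubling the rank doubles the adapter parameters and, with alpha held at 32, halves the scaling factor. That interaction is worth keeping in mind when overriding train.lora_rank on its own.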

Repository layout

qalign/
  config.py    one YAML -> nested dataclasses; --set overrides           (the control surface)
  levels.py    level vocabulary, weights, MOS -> level binning           (the science)
  prompts.py   per-task prompt pools + answer stems (iqa / iaa / vqa)
  template.py  the universal template generator: record -> swift jsonl
  datasets.py  csv / jsonl / qalign_json adapters -> train & eval manifests
  frames.py    video -> N frames-as-images + video training manifest
  cache.py     pack images into a byte-exact blob; mmap RAM-resident load hooks
  model.py     load (model, template, tokenizer) from config            (the only swift coupling)
  scorer.py    level-token softmax score + SRCC/PLCC eval
  callback.py  in-training eval (distributed-safe) + best-checkpoint keeping
  train.py     compose & launch `swift sft` from the config
  infer.py     label-free scoring of arbitrary media
  cli.py       the `qalign` dispatcher
configs/   example_iqa.yaml (minimal)  ·  onealign.yaml (full reference)
docs/      METHOD.md (the method)      ·  ADAPTING.md (new model / new dataset)
examples/  a tiny synthetic dataset so the CPU stages run immediately
tests/     pure-python unit tests (no GPU)
scripts/   train.sh  ·  launch_pod.example.sh
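
levels.py is described above as handling MOS -> level binning. In the Q-Align paper, a MOS is converted to one of five levels using equal-width intervals between the dataset's minimum and maximum score. A minimal sketch of that idea follows. `mos_to_level` is a hypothetical name, and the boundary handling here (each bin closed on the left) is a simplification that may differ from levels.py.

```python
LEVELS = ["bad", "poor", "fair", "good", "excellent"]    # worst to best

def mos_to_level(mos, mos_min, mos_max):
    """Map a MOS to one of five equal-width bins over [mos_min, mos_max]."""
    if mos_max <= mos_min:
        raise ValueError("mos_max must be greater than mos_min")
    position = (mos - mos_min) / (mos_max - mos_min)     # 0.0 at the min, 1.0 at the max
    index = min(int(position * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[max(index, 0)]                         # clamp out-of-range scores

# Example on a 1..5 MOS scale.
for mos in (1.0, 2.0, 3.0, 4.3, 5.0):
    print(mos, "->", mos_to_level(mos, 1.0, 5.0))
```

At inference time the mapping runs the other way: the level probabilities are combined with the weights [1, .75, .5, .25, 0] to recover a continuous score.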

Design notes

  • Model-agnostic by construction. Records use swift's generic <image> placeholder; the scorer only needs the level tokens. The backbone lives entirely in model: of the YAML.
  • Faithful Q-Align defaults, all overridable: 5 levels with weights [1, .75, .5, .25, 0], 8 frames/video, full-parameter FT in bf16, LR 2e-5, cosine schedule, warmup 0.03, 2 epochs, vision tower + projector trainable.
  • The dataloading cache packs every image byte-for-byte into one mmap'd blob so training/eval read RAM slices instead of hundreds of thousands of tiny files — the difference between a starved GPU and a fed one at QA dataset scale.
  • Robust in-training eval. The callback runs the same scorer on the live model at each save, logs the SRCC/PLCC curve, and hard-links the top-N checkpoints out of the rotation path so the peak is never lost — and it is distributed-correct under DeepSpeed ZeRO-2/3 (no rank-desync hangs).
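
The packing idea behind the dataloading cache can be illustrated with a few lines of standard-library Python. This is a sketch under assumptions, not cache.py itself: the index layout (path -> [offset, length]) and the helper names are hypothetical, and only the file names images.blob and index.json mirror the ones shown in the tour.

```python
import json
import mmap
import os
import tempfile

def pack(paths, blob_path, index_path):
    """Concatenate files byte-for-byte into one blob; record each offset and length."""
    index, offset = {}, 0
    with open(blob_path, "wb") as blob:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            blob.write(data)
            index[path] = [offset, len(data)]
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)

class BlobReader:
    """Serve the original bytes of each file as a slice of one memory map."""
    def __init__(self, blob_path, index_path):
        with open(index_path) as f:
            self.index = json.load(f)
        self._file = open(blob_path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def read(self, path):
        offset, length = self.index[path]
        return self._map[offset:offset + length]

    def close(self):
        self._map.close()
        self._file.close()

# Demo with two tiny stand-in "images".
workdir = tempfile.mkdtemp()
sources = []
for name, payload in [("a.jpg", b"\xff\xd8first"), ("b.jpg", b"\xff\xd8second")]:
    path = os.path.join(workdir, name)
    with open(path, "wb") as f:
        f.write(payload)
    sources.append(path)

blob_path = os.path.join(workdir, "images.blob")
index_path = os.path.join(workdir, "index.json")
pack(sources, blob_path, index_path)
reader = BlobReader(blob_path, index_path)
restored = [reader.read(p) for p in sources]
reader.close()
print(restored)
```

Reading a sample becomes one slice of an already-mapped file instead of an open/read/close cycle per image, which is where the saving comes from when a dataset holds hundreds of thousands of small files.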

Requirements

Python ≥ 3.9. Runtime training/eval needs ms-swift (4.0.3), transformers, deepspeed, and decord (for video); see requirements.txt. These are provided by your training image or by pip install -e ".[runtime]".

Repo Maintained by

@Yushuo Zheng, @Zicheng Zhang

Acknowledgements

qalign stands on the shoulders of three projects — this repository would not exist without them, and we are grateful to their teams.

  • Q-Align (Q-Future) — the method this toolkit modernizes. Q-Align introduced visual scoring via discrete text-defined levels and unified IQA + IAA + VQA into ONE-ALIGN. Everything in levels.py, prompts.py, and scorer.py follows their recipe. Thanks to Haoning Wu and the entire Q-Align / ONE-ALIGN team.

  • ms-swift (the ModelScope SWIFT team) — the training/inference backbone we build on. SWIFT's generic multimodal templates and swift sft pipeline are what let qalign stay model-agnostic: a new backbone is a config edit, not a code change. Thanks to the ms-swift team.

  • Qwen3.5 (the Qwen team, Alibaba) — the reference vision-language backbone (model_type: qwen3_5) behind our results. Thanks to the Qwen team for releasing such a capable open VL model.

If you use this toolkit, please also cite the original works:

@inproceedings{wu2024qalign,
  title     = {Q-Align: Teaching {LMM}s for Visual Scoring via Discrete Text-Defined Levels},
  author    = {Wu, Haoning and Zhang, Zicheng and Zhang, Weixia and Chen, Chaofeng and
               Liao, Liang and Li, Chunyi and Gao, Yixuan and Wang, Annan and Zhang, Erli and
               Sun, Wenxiu and Yan, Qiong and Min, Xiongkuo and Zhai, Guangtao and Lin, Weisi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning (ICML)},
  year      = {2024}
}

@inproceedings{swift2025,
  title     = {{SWIFT}: A Scalable Lightweight Infrastructure for Fine-Tuning},
  author    = {ModelScope Team},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2025},
  note      = {\url{https://github.com/modelscope/ms-swift}}
}

@misc{qwen3_5,
  title        = {Qwen3.5: Towards Native Multimodal Agents},
  author       = {Qwen Team},
  year         = {2025},
  howpublished = {\url{https://github.com/QwenLM/Qwen3-VL}}
}