Short RL

April 16, 2026

Short RL: Shorten After You're Right: Lazy Length Penalties for Reasoning RL

Code | arXiv

News

[2025/05/23] We released our paper Efficient RL Training for Reasoning Models via Length-Aware Optimization and updated the main branch.

[2025/03/19] We released our GitHub project (branch old).

Overview

Code of paper "Efficient RL Training for Reasoning Models via Length-Aware Optimization"

We introduce Short-RL, a simple yet effective technique to control response length during the RL training process of R1-like models, while maintaining stable performance.
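The exact reward formulation lives in the training scripts below; as a rough illustration of the idea suggested by the title ("lazy" length penalties, i.e. shortening only after the model is already right), here is a toy Python sketch. The function name, the linear penalty shape, and the alpha coefficient are all our own illustrative assumptions, not the paper's definition.

```python
def shaped_reward(is_correct: bool, length: int, max_length: int,
                  alpha: float = 0.1) -> float:
    """Toy sketch of a lazy length penalty (NOT the paper's exact reward).

    Incorrect responses get no length pressure at all, so the model first
    learns to be right; only correct responses pay a penalty that grows
    with response length, nudging them to become shorter.
    """
    if not is_correct:
        return 0.0  # still wrong: optimize for correctness only
    # correct: base reward of 1.0 minus a penalty proportional to length
    penalty = alpha * min(length / max_length, 1.0)
    return 1.0 - penalty
```

Under this sketch, a wrong answer scores 0 regardless of length, while a correct answer's score decays gently from 1.0 as it grows toward the length budget.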

Getting Started 🚀

Installation & Training Scripts

Logic-RL Setup

To begin working with Short-RL for the Logic-RL dataset, just run:

cd Logic-RL
conda create -n logic python=3.9
conda activate logic
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3 ray
pip3 install flash-attn --no-build-isolation
pip install -e .  # For verl integration
pip install wandb IPython matplotlib

Math-RL Setup

To begin working with Short-RL for the three math settings, just run:

cd deepscaler
bash setup.sh

Start Logic-RL Training

We directly use the data from Logic-RL, located at Logic-RL/data/kk/instruct.

Train Short-RL

cd Logic-RL
bash sh/Short-RL.sh # Normal-RL.sh for baseline comparison

The performance of Logic-RL is sensitive to the learning rate. In our experiments, a learning rate of 1e-6 with a batch size of 8 yields the best convergence within 3 epochs, which is the setting used in the paper. However, this configuration can be unstable, sometimes leading to sudden drops in test accuracy during training, regardless of whether standard RL or Short-RL is used. For reliable reproduction of the paper results, multiple runs may be necessary.

For more stable training, a learning rate of 4e-7 is a robust alternative, though it requires more epochs to converge.

Eval

cd eval_kk
bash eval.sh

cd Math_eval
bash test_aime.sh
bash test_amc.sh

Start Math-RL Training

Data preparation:

You can directly use the data at deepscaler/data/orzmath, deepscaler/data/ThinkDeepScaler, and deepscaler/data/ThinksimpleRL.

Or, if you want to prepare it yourself, taking Open Reasoner Zero as an example: first download the curated 57k training data from ORZ to ./deepscaler/data, then run

bash ./scripts/data/data.sh

Train Short-RL

cd deepscaler
# Open Reasoner Zero
bash scripts/train/Short-RL.sh # Normal-RL.sh for baseline comparison
# DeepScaleR
bash scripts/deepscaler/Short-RL.sh # Normal-RL.sh for baseline comparison
# SimpleRL-Math
bash scripts/simplerl/Short-RL.sh # Normal-RL.sh for baseline comparison

Evaluation

The evaluation curves can be viewed in wandb during training.

Or, if you want to evaluate after training, you can run:

bash ./scripts/eval/eval_model.sh

Acknowledgements

Our training framework is built on Logic-RL, deepscaler, verl and ray.

Citation

@misc{yuan2025efficientrltrainingreasoning,
      title={Efficient RL Training for Reasoning Models via Length-Aware Optimization}, 
      author={Danlong Yuan and Tian Xie and Shaohan Huang and Zhuocheng Gong and Huishuai Zhang and Chong Luo and Furu Wei and Dongyan Zhao},
      year={2025},
      eprint={2505.12284},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.12284}, 
}