Second-Moment Trust Policy Optimization (M2PO)
March 8, 2026
Last updated: Oct 23, 2025
Author: Jingyuan Ma

Second-Moment Trust Policy Optimization (M2PO) (Zheng et al., 2025) is an RL method that achieves stable off-policy training even with data stale by at least 256 model updates, while matching on-policy performance. It does so by constraining the second moment of the importance weights, which suppresses only extreme outliers while preserving informative updates.
The first step of M2PO is to compute the second moment:
The second step is to compute the second-moment mask:
The final step is to optimize the objective:
Here, the mask is the one computed in the second step.
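The three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not AReaL's implementation: it assumes the per-token second moment is the squared log importance ratio, that the mask drops the largest values until the mean over the kept tokens falls to the threshold, and that the objective averages the masked ratio-times-advantage terms over all tokens. The function names and the sample numbers are hypothetical.

```python
import math

def second_moments(logp_new, logp_behav):
    # Assumption: per-token second moment = (log importance ratio)^2.
    return [(n - b) ** 2 for n, b in zip(logp_new, logp_behav)]

def m2_mask(m2, threshold):
    # Assumption: mask out the largest values one at a time until the
    # mean over the kept tokens is at or below the threshold.
    order = sorted(range(len(m2)), key=lambda i: m2[i], reverse=True)
    mask = [1] * len(m2)
    total, kept = sum(m2), len(m2)
    for i in order:
        if kept <= 1 or total / kept <= threshold:
            break
        mask[i] = 0
        total -= m2[i]
        kept -= 1
    return mask

def masked_objective(logp_new, logp_old, advantages, mask):
    # Assumption: average of mask * importance ratio * advantage over all tokens.
    terms = [
        m * math.exp(n - o) * a
        for n, o, a, m in zip(logp_new, logp_old, advantages, mask)
    ]
    return sum(terms) / len(terms)

# Hypothetical token log-probabilities; the last token is an extreme outlier.
logp_new = [-1.0, -1.1, -0.9, -3.0]
logp_behav = [-1.0, -1.0, -1.0, -1.0]
advantages = [1.0, 1.0, -1.0, 1.0]

mask = m2_mask(second_moments(logp_new, logp_behav), threshold=0.04)
print(mask)  # -> [1, 1, 1, 0]
print(masked_objective(logp_new, logp_behav, advantages, mask))
```

Only the outlier token is masked here; the three tokens whose ratios stay close to 1 still contribute to the objective.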
For more details:
- AReal Detail: Paper of AReal
- M2PO Detail: Paper of M2PO
Core Parameters
- actor.m2_threshold: The threshold on the mean of the second moment, used when computing the M2PO mask.
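To get a feel for how the threshold behaves, the sketch below counts how many tokens survive masking at different threshold values. It assumes the same masking rule as described above (drop the largest second-moment values until the mean of the remaining ones is at or below the threshold); `kept_fraction` and the sample values are hypothetical and are not part of AReaL.

```python
def kept_fraction(m2_values, threshold):
    # Assumption: drop the largest values until the mean of the kept
    # values is at or below the threshold; always keep at least one.
    kept = sorted(m2_values)  # ascending, so the largest value is last
    total = sum(kept)
    while len(kept) > 1 and total / len(kept) > threshold:
        total -= kept.pop()  # remove the current largest value
    return len(kept) / len(m2_values)

# Hypothetical per-token second-moment values.
m2_values = [0.0, 0.01, 0.02, 0.05, 0.5, 2.0]
for threshold in (0.015, 0.04, 1.0):
    print(threshold, kept_fraction(m2_values, threshold))
```

Under this rule, a smaller threshold masks more tokens, while a large enough threshold keeps all of them.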
Example Usage
We recommend changing the parameters in the configuration file (i.e., gsm8k_m2po.yaml).
| Backend | CMD |
|---|---|
| local | python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=local --<other_args_to_overwrite> |
| ray | python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=ray --<other_args_to_overwrite> |
| slurm | python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=slurm --<other_args_to_overwrite> |
Test Result

In this test, the trials are named according to the following rules:
- stale: the value of max_head_offpolicyness
- dx+dy: x is the number of rollout workers and y is the number of training workers
- rollout: the value of max_concurrent_rollout
The setting for GRPO is stale 256 d2+d1 rollout 96.
The key findings from the trials are as follows:
- The grad_norm of GRPO is higher than that of M2PO, which may cause training instability.
- The evaluation reward of M2PO is higher than that of GRPO.