Second-Moment Trust Policy Optimization (M2PO)

March 8, 2026

Last updated: Oct 23, 2025

Author: Jingyuan Ma

m2po figure

Second-Moment Trust Policy Optimization (M2PO) (Zheng et al., 2025) is an RL method that achieves stable off-policy training even with data stale by at least 256 model updates, matching on-policy performance. It does so by constraining the second moment of the importance weights, which suppresses only extreme outliers while preserving informative updates.

The first step of M2PO is to compute the second moment:

$$\hat{M_2}=\frac{1}{N}\sum_{i=1}^N M_{2,i}=\frac{1}{N}\sum_{i=1}^N(\log r_i)^2=\frac{1}{N}\sum_{i=1}^N\left(\log\frac{\pi_\theta(a_i|s_i)}{\pi_{behav}(a_i|s_i)}\right)^2$$
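As a minimal sketch of this first step, the snippet below computes the per-token values and their batch mean from two lists of log-probabilities. The function name `second_moment` and its arguments are hypothetical and chosen for illustration; they are not taken from the codebase.

```python
import math


def second_moment(logp_new, logp_behav):
    """Hypothetical helper: per-token M2 values and their batch mean.

    logp_new[i]   = log pi_theta(a_i | s_i)  (current policy)
    logp_behav[i] = log pi_behav(a_i | s_i)  (behavior policy)
    """
    # log r_i is the difference of the two log-probabilities.
    log_ratios = [ln - lb for ln, lb in zip(logp_new, logp_behav)]
    # M_{2,i} = (log r_i)^2
    m2_per_token = [lr * lr for lr in log_ratios]
    # M2_hat is the mean over all N tokens.
    m2_hat = sum(m2_per_token) / len(m2_per_token)
    return m2_per_token, m2_hat


# Toy example: the first token is on-policy, the second has ratio 2.
logp_new = [math.log(0.5), math.log(0.2)]
logp_behav = [math.log(0.5), math.log(0.1)]
m2_per_token, m2_hat = second_moment(logp_new, logp_behav)
print(m2_per_token, m2_hat)
```

Working with log-probabilities directly avoids forming the ratio and then taking its logarithm, which is typically more stable numerically.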

The second step is to compute the second-moment mask:

m2po masking
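The masking procedure is given in the figure and is not spelled out in the text. The sketch below shows one plausible reading, under the assumption that tokens with the largest second-moment values are masked one at a time until the mean over the remaining tokens no longer exceeds the threshold. The helper name `m2_mask` and the threshold value used in the example are illustrative assumptions, not values from the codebase.

```python
def m2_mask(m2_per_token, m2_threshold):
    """Hypothetical helper: return a 0/1 mask (1 keeps a token, 0 masks it).

    Assumption: tokens are dropped from the largest M2 value downwards
    until the mean M2 of the kept tokens is <= m2_threshold.
    """
    n = len(m2_per_token)
    mask = [1] * n
    # Token indices ordered from largest to smallest M2 value.
    order = sorted(range(n), key=lambda i: m2_per_token[i], reverse=True)
    kept_sum = sum(m2_per_token)
    kept_count = n
    for i in order:
        # Stop as soon as the kept tokens satisfy the constraint.
        if kept_count == 0 or kept_sum / kept_count <= m2_threshold:
            break
        mask[i] = 0
        kept_sum -= m2_per_token[i]
        kept_count -= 1
    return mask


# Toy example: a single extreme outlier is masked, the rest are kept.
# The threshold 0.04 is an illustrative value only.
print(m2_mask([0.0, 0.01, 0.02, 1.0], 0.04))
```

Under this reading, a batch whose mean is already within the threshold is left untouched, so only extreme outliers are removed.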

The final step is to optimize the objective:

$$\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}M_{i,t}\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_{i,t}.$$

where $M$ is computed in the second step and

$$A_{i,t}=\frac{R_i-\mathrm{mean}(\{R_i\}_{i=1}^G)}{\mathrm{std}(\{R_i\}_{i=1}^G)}.$$
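The objective and the group-normalized advantage can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the function names are hypothetical, the importance ratio is taken per token, the standard deviation is the population form, and a small `eps` is added to the denominator to avoid division by zero when all rewards in a group are equal.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Hypothetical helper: normalize response-level rewards within a group.

    Every token t of response i shares the same advantage value.
    eps is an assumption added for numerical safety.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def m2po_objective(ratios, masks, advantages):
    """Hypothetical helper: masked, token-averaged surrogate objective.

    ratios[i][t]  : importance ratio for token t of response i (assumed per token)
    masks[i][t]   : 0/1 second-moment mask from the second step
    advantages[i] : advantage shared by all tokens of response i
    """
    total_tokens = sum(len(r) for r in ratios)
    total = 0.0
    for ratio_i, mask_i, adv_i in zip(ratios, masks, advantages):
        for ratio, m in zip(ratio_i, mask_i):
            total += m * ratio * adv_i
    # Normalize by the total number of tokens across the group.
    return total / total_tokens


# Toy group of two responses with rewards 1 and 0.
adv = group_advantages([1.0, 0.0])
ratios = [[1.0, 1.2], [0.9]]
masks = [[1, 0], [1]]  # the second token of the first response is masked
print(m2po_objective(ratios, masks, adv))
```

Note that the normalization uses the total token count, including masked tokens, which matches the denominator in the objective as written.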

For more details, see the M2PO paper (Zheng et al., 2025).

Core Parameters

  • actor.m2_threshold: The threshold for the mean of the second moment, used as $\tau_{M_2}$ when computing the M2PO mask.

Example Usage

We recommend changing the parameter in the configuration file (i.e., gsm8k_m2po.yaml).

| Backend | CMD |
| --- | --- |
| local | `python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=local --<other_args_to_overwrite>` |
| ray | `python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=ray --<other_args_to_overwrite>` |
| slurm | `python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_m2po.yaml scheduler.type=slurm --<other_args_to_overwrite>` |

Test Result

m2po test figure

In this test, we name the trials according to the following rules:

  • stale: the value of max_head_offpolicyness
  • dx+dy: x for the number of rollout workers and y for the number of training workers
  • rollout: the value of max_concurrent_rollout

The setting for GRPO is stale 256 d2+d1 rollout 96.

The key findings from the trials are as follows:

  • The grad_norm of GRPO is higher than that of M2PO, which may cause training instability.
  • The evaluation reward of M2PO is higher than that of GRPO.