The Simplest Policy Gradient Derivation

May 1, 2019

For a stochastic parameterized policy, our goal is to maximize the expected return: J\left(\pi_{\theta}\right)=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}[R(\tau)]. To keep the derivation simple, R(\tau) here is the finite-horizon undiscounted return; the derivation for the discounted return is the same.

We can optimize the policy by gradient ascent:

\theta_{k+1}=\theta_{k}+\alpha\left.\nabla_{\theta} J\left(\pi_{\theta}\right)\right|_{\theta_{k}}
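As a minimal sketch of this update rule, here is gradient ascent on a made-up concave objective J(\theta) = -(\theta - 3)^2 (an illustrative stand-in, not an actual policy objective):

```python
# Gradient ascent on a toy concave objective J(theta) = -(theta - 3)^2,
# whose gradient is -2 * (theta - 3); the maximizer is theta = 3.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta = 0.0           # theta_0
alpha = 0.1           # step size
for _ in range(100):  # theta_{k+1} = theta_k + alpha * grad J(theta_k)
    theta = theta + alpha * grad_J(theta)

print(theta)  # converges to ~3.0
```

The same iteration drives policy optimization; the only difference is that \nabla_\theta J must be estimated from sampled trajectories, which is what the rest of the derivation provides.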

\nabla_{\theta} J\left(\pi_{\theta}\right) is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms, e.g. vanilla policy gradient, TRPO, and PPO.

First, let us list a few formulas the derivation will use.

1. Probability of a trajectory. Given that actions come from the policy \pi_\theta, the probability of a trajectory \tau=\left(s_{0}, a_{0}, \dots, s_{T+1}\right) is

P(\tau | \theta)=\rho_{0}\left(s_{0}\right) \prod_{t=0}^{T} P\left(s_{t+1} | s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} | s_{t}\right)
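To make the product concrete, here is a sketch that computes P(\tau | \theta) in a tiny 2-state, 2-action MDP; the tables rho0, P, and pi below are made-up illustrative numbers, not from the post:

```python
import numpy as np

rho0 = np.array([0.6, 0.4])                    # initial state distribution rho_0(s0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3],                     # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])

def traj_prob(states, actions):
    """P(tau | theta) = rho0(s0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    p = rho0[states[0]]
    for t, a in enumerate(actions):
        p *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return p

# Trajectory tau = (s0=0, a0=1, s1=1, a1=0, s2=0)
print(traj_prob([0, 1, 0], [1, 0]))  # 0.6 * 0.3 * 0.8 * 0.4 * 0.5 = 0.0288
```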

2. The log-derivative trick.

\nabla_{\theta} P(\tau | \theta)=P(\tau | \theta) \nabla_{\theta} \log P(\tau | \theta), which uses the chain rule together with the fact that the derivative of \log x is 1/x.
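The identity \nabla_\theta P = P \nabla_\theta \log P holds for any positive differentiable function of \theta, and is easy to check numerically; the sigmoid below is an arbitrary illustrative choice:

```python
import math

# Numeric check of the identity  dP/dtheta = P * d(log P)/dtheta
# using central finite differences on an arbitrary positive function.
def P(theta):
    return 1.0 / (1.0 + math.exp(-theta))   # sigmoid, always in (0, 1)

theta, h = 0.7, 1e-6
dP = (P(theta + h) - P(theta - h)) / (2 * h)                       # left-hand side
dlogP = (math.log(P(theta + h)) - math.log(P(theta - h))) / (2 * h)
print(abs(dP - P(theta) * dlogP) < 1e-8)   # the two sides agree
```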

3. Log-probability of a trajectory.

\log P(\tau | \theta)=\log \rho_{0}\left(s_{0}\right)+\sum_{t=0}^{T}\left(\log P\left(s_{t+1} | s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} | s_{t}\right)\right)

4. Gradients of environment functions. The environment does not depend on \theta, so the gradients of \rho_{0}\left(s_{0}\right), P\left(s_{t+1} | s_{t}, a_{t}\right), and R(\tau) with respect to \theta are all zero.

5. Combining 3 and 4, the gradient of the log-probability of a trajectory is

\begin{aligned} \nabla_{\theta} \log P(\tau | \theta) &=\nabla_{\theta} \log \rho_{0}\left(s_{0}\right)+\sum_{t=0}^{T}\left(\nabla_{\theta} \log P\left(s_{t+1} | s_{t}, a_{t}\right)+\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right) \\ &=\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) \end{aligned}

Putting the steps above together, the full derivation is

\begin{aligned} \nabla_{\theta} J\left(\pi_{\theta}\right) &=\nabla_{\theta} \underset{\tau \sim \pi_{\theta}}{\mathrm{E}}[R(\tau)] \\ &=\nabla_{\theta} \int_{\tau} P(\tau | \theta) R(\tau) \\ &=\int_{\tau} \nabla_{\theta} P(\tau | \theta) R(\tau) \\ &=\int_{\tau} P(\tau | \theta) \nabla_{\theta} \log P(\tau | \theta) R(\tau) \\ &=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}\left[\nabla_{\theta} \log P(\tau | \theta) R(\tau)\right] \end{aligned}

\therefore \nabla_{\theta} J\left(\pi_{\theta}\right)=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) R(\tau)\right]

This is an expectation, so we can estimate it by sampling: collect a set of trajectories \mathcal{D} by running the policy \pi_{\theta} in the environment, and take the sample mean

\hat{g}=\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) R(\tau)
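The estimator can be sketched end to end with a tabular softmax policy. Everything about the environment below (2 states, 2 actions, reward 1 for taking action 1, uniform random transitions) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 5
theta = np.zeros((n_states, n_actions))        # policy logits

def pi(s):
    z = np.exp(theta[s] - theta[s].max())      # softmax pi_theta(. | s)
    return z / z.sum()

def grad_log_pi(s, a):
    g = np.zeros_like(theta)                   # grad_theta log pi_theta(a | s)
    g[s] = -pi(s)                              # softmax gradient: one_hot(a) - pi
    g[s, a] += 1.0
    return g

def rollout():
    """Sample one trajectory; toy reward R(tau) counts how often action 1 is taken."""
    s, traj, R = rng.integers(n_states), [], 0.0
    for _ in range(T + 1):
        a = rng.choice(n_actions, p=pi(s))
        traj.append((s, a))
        R += float(a)
        s = rng.integers(n_states)             # environment transition (uniform here)
    return traj, R

# g_hat = (1/|D|) * sum_{tau in D} sum_t grad log pi(a_t|s_t) * R(tau)
D = [rollout() for _ in range(1000)]
g_hat = sum(sum((grad_log_pi(s, a) for s, a in traj), np.zeros_like(theta)) * R
            for traj, R in D) / len(D)
print(g_hat)  # the column for action 1 comes out positive: raising its logit raises R
```

Taking an ascent step theta += alpha * g_hat and repeating is exactly the vanilla policy gradient loop; practical implementations reduce the variance of \hat{g} (e.g. reward-to-go, baselines), which this sketch omits.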