multistep_forward_view.md

October 17, 2018 · View on GitHub

Multistep forward view

The multistep_forward_view function computes mixed multistep returns in terms of the instantaneous rewards, discount factors pcontinues, state_value estimates, and mixing weights lambda_. In the math that follows we will replace these by r0:T1r_{0:T-1}, γ0:T1\gamma_{0:T-1}, V1:TV_{1:T}, and λ0:T1\lambda_{0:T-1}, respectively. Note that in the implementation, the state_values array is offset in time by 1 relative to the other arrays.

The mixed returns MtM_t are computed by the following recurrence, using a backwards scan:

Mt=rt+γt(λtMt+1+(1λt)Vt+1)\labeleq:recurrenceMT1=rT1+γT1VT(1)M_t = r_t + \gamma_t (\lambda_t M_{t+1} + (1-\lambda_t) V_{t+1}) \label{eq:recurrence} \tag{1}\\ M_{T-1} = r_{T-1} + \gamma_{T-1} V_T

Here we can see why MtM_t is a valid estimate of the return at time tt. The recurrence is applying the Bellman Equation, using the λt\lambda_t-weighted mixture of Mt+1M_{t+1} and Vt+1V_{t+1}, both of which are valid estimates of the expected return at time t+1t+1.

What's left is to show that we are computing the right mixture of multistep returns. Let R(t,k)R(t, k) be the γt:k\gamma_{t:k}-discounted return from time tt to kk:

R(t,k)=rt+γtrt+1++(γtγt+1γk1)rk+(γtγt+1γk)Vk+1=i=tk(j=ti1γj)ri+(i=tkγi)Vk+1R(t, k) = r_t + \gamma_t r_{t+1} + \cdots + (\gamma_t \gamma_{t+1} \cdots \gamma_{k-1}) r_k + (\gamma_t \gamma_{t+1} \cdots \gamma_k) V_{k+1}\\ = \sum_{i=t}^k \left(\prod_{j=t}^{i-1} \gamma_j \right) r_i + \left(\prod_{i=t}^k \gamma_i \right) V_{k+1}

We should mention that R(t,k)R(t, k) would correspond to the ktk-t step return GtktG_t^{k-t} in Sutton and Barto's notation. Note that RR satisfies a Bellman-style recurrence:

R(t1,k)=rt1+γt1R(t,k)R(t,t)=rt+γtVt+1R(t-1, k) = r_{t-1} + \gamma_{t-1} R(t, k)\\ R(t, t) = r_t + \gamma_t V_{t+1}

The desired1 λ\lambda-weighted mixture is given by:

L(t,k)=(1λt)R(t,t)+(1λt+1)λtR(t,t+1)++((1λk)λk1λt)R(t,k)+λkλtR(t,k)=i=tk((1λi)j=ti1λj)R(t,i)+(i=tkλi)R(t,k)L(t, k) = (1-\lambda_t) R(t, t) + (1-\lambda_{t+1}) \lambda_t R(t, t+1) + \cdots + ((1-\lambda_k) \lambda_{k-1} \cdots \lambda_t) R(t, k) + \lambda_k\cdots\lambda_t R(t, k)\\ = \sum_{i=t}^k \left((1-\lambda_i) \prod_{j=t}^{i-1} \lambda_j \right) R(t, i) + \left(\prod_{i=t}^k \lambda_i \right) R(t, k) \\

We have that

\right) R(t-1, i) + \left(\prod_{i=t-1}^k \lambda_i \right) R(t-1, k) \\ = (1-\lambda_{t-1})R(t-1, t-1) + \lambda_{t-1}\sum_{i=t}^k \left((1-\lambda_i) \prod_{j=t}^{i-1} \lambda_j \right) (r_{t-1} + \gamma_{t-1} R(t, i)) + \lambda_{t-1}\left(\prod_{i=t}^k \lambda_i \right) \left(r_{t-1} + \gamma_{t-1} R(t, k)\right) \\ = r_{t-1} + (1-\lambda_{t-1})\gamma_{t-1} V_t + \lambda_{t-1}\gamma_{t-1} L(t, k)

Thus, L(t,k)L(t, k) also satisfies recurrence \eqrefeq:recurrence\eqref{eq:recurrence}, as desired.

Footnotes

  1. Sutton and Barto, equation 7.3