The
multistep_forward_view
function computes mixed multistep returns in terms of the instantaneous
rewards, discount factors pcontinues, state_value estimates, and mixing
weights lambda_. In the math that follows we will replace these by
r0:T−1, γ0:T−1, V1:T, and λ0:T−1,
respectively. Note that in the implementation, the state_values array is
offset in time by 1 relative to the other arrays.
The mixed returns Mt are computed by the following recurrence, using a
backwards scan:
Mt=rt+γt(λtMt+1+(1−λt)Vt+1)\labeleq:recurrenceMT−1=rT−1+γT−1VT(1)
Here we can see why Mt is a valid estimate of the return at time t. The
recurrence is applying the Bellman Equation, using the λt-weighted
mixture of Mt+1 and Vt+1, both of which are valid estimates of the
expected return at time t+1.
What's left is to show that we are computing the right mixture of multistep
returns. Let R(t,k) be the γt:k-discounted return from time
t to k:
R(t,k)=rt+γtrt+1+⋯+(γtγt+1⋯γk−1)rk+(γtγt+1⋯γk)Vk+1=i=t∑k(j=t∏i−1γj)ri+(i=t∏kγi)Vk+1
We should mention that R(t,k) would correspond to the k−t step return
Gtk−t in Sutton and Barto's notation. Note that R satisfies a
Bellman-style recurrence:
R(t−1,k)=rt−1+γt−1R(t,k)R(t,t)=rt+γtVt+1
The desired1 λ-weighted mixture is given by:
L(t,k)=(1−λt)R(t,t)+(1−λt+1)λtR(t,t+1)+⋯+((1−λk)λk−1⋯λt)R(t,k)+λk⋯λtR(t,k)=i=t∑k((1−λi)j=t∏i−1λj)R(t,i)+(i=t∏kλi)R(t,k)
We have that
\right) R(t-1, i) + \left(\prod_{i=t-1}^k \lambda_i \right) R(t-1, k) \\
= (1-\lambda_{t-1})R(t-1, t-1) + \lambda_{t-1}\sum_{i=t}^k \left((1-\lambda_i) \prod_{j=t}^{i-1} \lambda_j \right) (r_{t-1} + \gamma_{t-1} R(t, i)) + \lambda_{t-1}\left(\prod_{i=t}^k \lambda_i \right) \left(r_{t-1} + \gamma_{t-1} R(t, k)\right) \\
= r_{t-1} + (1-\lambda_{t-1})\gamma_{t-1} V_t + \lambda_{t-1}\gamma_{t-1} L(t, k)
Thus, L(t,k) also satisfies recurrence \eqrefeq:recurrence, as
desired.