Lecture 4.5: n-step Bootstrapping

$n$-step TD Prediction

The idea of $n$-step TD


Monte Carlo target $$ G_{t} \doteq R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots+\gamma^{T-t-1} R_{T} $$

1-step TD target $$ G_{t: t+1} \doteq R_{t+1}+\gamma V_{t}\left(S_{t+1}\right) $$

2-step TD target $$ G_{t: t+2} \doteq R_{t+1}+\gamma R_{t+2}+\gamma^{2} V_{t+1}\left(S_{t+2}\right) $$

$n$-step TD target $$ G_{t: t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^{n} V_{t+n-1}\left(S_{t+n}\right) $$

The state-value learning algorithm that uses $n$-step returns is $$ V_{t+n}\left(S_{t}\right) \doteq V_{t+n-1}\left(S_{t}\right)+\alpha\left[G_{t: t+n}-V_{t+n-1}\left(S_{t}\right)\right], \quad 0 \leq t<T $$


The n-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
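To make the update rule above concrete, here is a minimal tabular sketch of $n$-step TD prediction in Python. The environment interface (`env.reset()` returning an integer state, `env.step(action)` returning `(next_state, reward, done)`), the `policy(state)` callable, and integer-indexed states are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes, num_states):
    """Sketch of n-step TD for estimating V ~ v_pi (tabular, assumed env interface)."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        state = env.reset()
        states, rewards = [state], [0.0]    # rewards[0] is a dummy so rewards[i] = R_i
        T = float('inf')                    # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                 # time whose state estimate is updated
            if tau >= 0:
                # n-step return: R_{tau+1} + ... + gamma^{n-1} R_{tau+n}, truncated at T
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:             # bootstrap with the current value estimate
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Setting `n = 1` recovers one-step TD, while any `n >= T` for every episode reduces the target to the full Monte Carlo return.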

$n$-step Sarsa

$$ G_{t: t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^{n} Q_{t+n-1}\left(S_{t+n}, A_{t+n}\right), n \geq 1,0 \leq t<T-n $$

$$ Q_{t+n}\left(S_{t}, A_{t}\right) \doteq Q_{t+n-1}\left(S_{t}, A_{t}\right)+\alpha\left[G_{t: t+n}-Q_{t+n-1}\left(S_{t}, A_{t}\right)\right], \quad 0 \leq t<T $$
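An on-policy tabular sketch of this $n$-step Sarsa update is given below, using the same assumed `env.reset()`/`env.step()` interface as the prediction sketch and an $\varepsilon$-greedy policy derived from $Q$; the $\varepsilon$-greedy choice and integer state/action indexing are illustrative assumptions.

```python
import numpy as np

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes, num_states, num_actions):
    """Sketch of on-policy n-step Sarsa (tabular, assumed env interface)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        # Epsilon-greedy action selection with respect to the current Q
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        state = env.reset()
        states, actions, rewards = [state], [eps_greedy(state)], [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(next_state))
            tau = t - n + 1                 # time whose (state, action) estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:             # bootstrap with Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```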

$n$-step Off-policy Learning

$n$-step Tree Backup Algorithm
