Lecture 3: Model-Free Policy Evaluation

Lecture 3 covers how to evaluate a policy when the parameters of the model are unknown.

Recall

  • Definition of Return
  • Definition of State Value Function
  • Definition of State-Action Value Function
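
For reference, these are the standard definitions (standard textbook notation, stated here for completeness rather than reproduced from the lecture):

$$ G_{t}=r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\cdots, \qquad V^{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s\right], \qquad Q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s, a_{t}=a\right] $$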

Dynamic programming for policy evaluation

$$ V_{k}^{\pi}(s) \leftarrow \mathbb{E}_{\pi}\left[r_{t}+\gamma V_{k-1}\left(s_{t+1}\right) \mid s_{t}=s\right] $$
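
As a concrete illustrative sketch of this update, assuming tabular arrays `P[s, a, s']`, `R[s, a]`, and `pi[s, a]` are available (these names are chosen here, not from the lecture):

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: V_k(s) = E_pi[ r_t + gamma * V_{k-1}(s_{t+1}) | s_t = s ].

    P:  (S, A, S) array with P[s, a, s'] = P(s' | s, a)
    R:  (S, A) array with R[s, a] = expected immediate reward r(s, a)
    pi: (S, A) array with pi[s, a] = pi(a | s)
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)
    while True:
        r_pi = np.einsum("sa,sa->s", pi, R)    # expected immediate reward under pi
        P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transition matrix under pi
        V_new = r_pi + gamma * P_pi @ V        # one Bellman expectation backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```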

Policy Evaluation without a Model

Monte Carlo Policy Evaluation

  • If trajectories are all finite, sample set of trajectories & average returns
  • Does not require MDP dynamics/rewards
  • No bootstrapping
  • Does not assume state is Markov (handles non-Markovian domains)
  • Can only be applied to episodic MDPs
  • Averaging over returns from a complete episode
  • Requires each episode to terminate

Monte Carlo methods can be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.

Monte Carlo is particularly useful when only the values of a subset of states are needed: one can generate many sample episodes starting from the states of interest, averaging returns from only these states and ignoring all others.

First-Visit

Initialize $N(s)=0, G(s)=0 \;\; \forall s \in S$. Loop:

  • Sample episode $i=s_{i, 1}, a_{i, 1}, r_{i, 1}, s_{i, 2}, a_{i, 2}, r_{i, 2}, \ldots, s_{i, T_{i}}$
  • Define $G_{i, t}=r_{i, t}+\gamma r_{i, t+1}+\gamma^{2} r_{i, t+2}+\cdots+\gamma^{T_{i}-t-1} r_{i, T_{i}-1}$ as the return from time step $t$ onwards in the $i$-th episode
  • For each time step $t$ until the end of episode $i$
    • If this is the first time $t$ that state $s$ is visited in episode $i$
      • Increment counter of total first visits: $N(s)=N(s)+1$
      • Increment total return $G(s)=G(s)+G_{i, t}$
      • Update estimate $V^{\pi}(s)=G(s) / N(s)$
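
A minimal first-visit MC sketch, assuming each episode is given as a list of `(state, reward)` pairs generated under $\pi$ (this data layout and the names are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """episodes: list of episodes, each a list of (state, reward) tuples."""
    N = defaultdict(int)        # number of first visits to each state
    G_sum = defaultdict(float)  # cumulative first-visit returns per state
    V = {}
    for episode in episodes:
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Record the first visit time of each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Only the first visit to each state contributes
        for s, t in first_visit.items():
            N[s] += 1
            G_sum[s] += returns[t]
            V[s] = G_sum[s] / N[s]
    return V
```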

Properties

  • Unbiased

  • Consistent

By the strong law of large numbers (SLLN), the average of the sampled returns converges to their expected value, $V^{\pi}(s)$.

Every-Visit

Initialize $N(s)=0, G(s)=0 \; \forall s \in S$. Loop:

  • Sample episode $i=s_{i, 1}, a_{i, 1}, r_{i, 1}, s_{i, 2}, a_{i, 2}, r_{i, 2}, \ldots, s_{i, T_{i}}$
  • Define $G_{i, t}=r_{i, t}+\gamma r_{i, t+1}+\gamma^{2} r_{i, t+2}+\cdots+\gamma^{T_{i}-t-1} r_{i, T_{i}-1}$ as the return from time step $t$ onwards in the $i$-th episode
  • For each time step $t$ until the end of episode $i$
    • Let $s$ be the state visited at time step $t$ in episode $i$
    • Increment counter of total visits: $N(s)=N(s)+1$
    • Increment total return $G(s)=G(s)+G_{i, t}$
    • Update estimate $V^{\pi}(s)=G(s) / N(s)$
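
The every-visit variant only drops the first-visit check from the sketch above; every occurrence of a state in an episode contributes to its average (same assumed data layout):

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Same inputs as first_visit_mc; every occurrence of a state contributes."""
    N, G_sum, V = defaultdict(int), defaultdict(float), {}
    for episode in episodes:
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for t, (s, _) in enumerate(episode):  # no first-visit check
            N[s] += 1
            G_sum[s] += returns[t]
            V[s] = G_sum[s] / N[s]
    return V
```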

Properties

  • Biased

  • Consistent, and typically has lower MSE than first-visit

Incremental Monte Carlo

A more computationally efficient way is: $$ V^{\pi}(s)=V^{\pi}(s) \frac{N(s)-1}{N(s)}+\frac{G_{i, t}}{N(s)}=V^{\pi}(s)+\frac{1}{N(s)}\left(G_{i, t}-V^{\pi}(s)\right) $$

$$ V^{\pi}(s)=V^{\pi}(s)+\alpha\left(G_{i, t}-V^{\pi}(s)\right) $$

Incremental MC with a constant $\alpha>\displaystyle\frac{1}{N\left(s\right)}$ weights recent returns more heavily, which can help in non-stationary domains.
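
A small sketch of this update rule, with the constant step size as an option (the helper name and dictionary layout are assumptions):

```python
def incremental_mc_update(V, N, s, G_t, alpha=None):
    """One incremental MC update for state s given a sampled return G_t.

    alpha=None uses the running-mean step 1/N(s); a constant alpha > 1/N(s)
    weights recent returns more heavily (useful in non-stationary domains).
    """
    N[s] = N.get(s, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[s]
    v = V.get(s, 0.0)
    V[s] = v + step * (G_t - v)
    return V[s]
```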

Monte Carlo Policy Evaluation Key Limitations

  • Generally high variance estimator
    • Reducing variance can require a lot of data
  • Requires episodic settings
    • Episode must end before data from that episode can be used to update the value function

Problem of maintaining exploration

  • Many state–action pairs may never be visited

Monte Carlo with Exploring Starts

Specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start.

MC off-policy evaluation

Aim: estimate target policy $\pi$ given episodes generated under behavior policy $b$

Requirement $$ \pi(a \mid s)>0 \Longrightarrow b(a\mid s) > 0 \tag{coverage} $$ Importance-sampling ratio $$ \rho_{t: T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi\left(A_{k} \mid S_{k}\right) p\left(S_{k+1} \mid S_{k}, A_{k}\right)}{\prod_{k=t}^{T-1} b\left(A_{k} \mid S_{k}\right) p\left(S_{k+1} \mid S_{k}, A_{k}\right)}=\prod_{k=t}^{T-1} \frac{\pi\left(A_{k} \mid S_{k}\right)}{b\left(A_{k} \mid S_{k}\right)} $$ Given episodes from $b$ $$ \mathbb{E}\left[\rho_{t: T-1} G_{t} \mid S_{t}=s\right]=v_{\pi}(s) $$ Unbiased and consistent.

  • Ordinary importance sampling — unbiased (in the first-visit case), but the variance can be very large, even infinite

    $$ V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t: T(t)-1} G_{t}}{|\mathcal{T}(s)|} $$

  • Weighted importance sampling — biased but lower variance $$ V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t: T(t)-1} G_{t}}{\sum_{t \in \mathcal{T}(s)} \rho_{t: T(t)-1}} $$

The estimates of ordinary importance sampling will typically have infinite variance, and thus unsatisfactory convergence properties, whenever the scaled returns have infinite variance.
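
Both estimators can be sketched as follows, assuming that for the state of interest we have collected pairs $(\rho_{t:T-1}, G_t)$ of importance ratios and returns (this pairing is an assumed preprocessing step, not from the lecture):

```python
def ordinary_is(samples):
    """Ordinary importance sampling: average of rho * G over all visits."""
    if not samples:
        return 0.0
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling: biased but lower variance."""
    total_weight = sum(rho for rho, _ in samples)
    if total_weight == 0:
        return 0.0
    return sum(rho * G for rho, G in samples) / total_weight
```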

Temporal Difference Learning

“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” – Sutton and Barto 2017

Incremental MC

$$ V^{\pi}(s)=V^{\pi}(s)+\alpha\left(G_{i, t}-V^{\pi}(s)\right) $$ Replace $G_{i,t}$ with the bootstrapped estimate $r_t + \gamma V^\pi(s_{t+1})$: $$ V^{\pi}\left(s_{t}\right)=V^{\pi}\left(s_{t}\right)+\alpha(\underbrace{\left[r_{t}+\gamma V^{\pi}\left(s_{t+1}\right)\right]}_{\text {TD target }}-V^{\pi}\left(s_{t}\right)) $$

  • TD error $$ \delta_{t}=r_{t}+\gamma V^{\pi}\left(s_{t+1}\right)-V^{\pi}\left(s_{t}\right) $$

  • Can immediately update value estimate after $\left(s, a, r, s^{\prime}\right)$ tuple

  • Don't need episodic setting

  • Biased, but generally lower variance than MC

TD methods are often more efficient than Monte Carlo methods.
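
A minimal TD(0) update over a stream of `(s, r, s', done)` transitions (names chosen for illustration, not from the lecture):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) update from a single transition; returns the TD error delta_t."""
    v_next = 0.0 if done else V.get(s_next, 0.0)
    td_target = r + gamma * v_next        # bootstrapped TD target
    td_error = td_target - V.get(s, 0.0)  # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```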

Complex convergence properties

  • TD(0) converges in the mean for a small constant $\alpha$
  • TD(0) converges a.s. if the step size $\alpha$ decreases appropriately (e.g., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$)
  • TD(0) does not always converge with function approximation

In the batch setting (only a finite amount of experience is available), TD(0) converges to the value function $V^\pi$ that DP would compute for the MDP defined by the maximum-likelihood model estimates (the certainty-equivalence estimate).

Maximum likelihood Markov decision process model $$ \begin{gathered} \hat{P}\left(s^{\prime} \mid s, a\right)=\frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{L_{k}-1} \mathbb{1}\left(s_{k, t}=s, a_{k, t}=a, s_{k, t+1}=s^{\prime}\right) \\ \hat{r}(s, a)=\frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{L_{k}-1} \mathbb{1}\left(s_{k, t}=s, a_{k, t}=a\right) r_{k, t} \end{gathered} $$
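
A sketch of how these counts and estimates could be computed from logged episodes of `(s, a, r, s')` transitions (the data layout is assumed):

```python
from collections import defaultdict

def ml_mdp_model(episodes):
    """episodes: list of episodes, each a list of (s, a, r, s_next) tuples."""
    N = defaultdict(int)          # N(s, a)
    trans = defaultdict(int)      # counts of (s, a, s')
    rew_sum = defaultdict(float)  # summed rewards per (s, a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            N[(s, a)] += 1
            trans[(s, a, s_next)] += 1
            rew_sum[(s, a)] += r
    P_hat = {(s, a, s2): c / N[(s, a)] for (s, a, s2), c in trans.items()}
    r_hat = {(s, a): total / N[(s, a)] for (s, a), total in rew_sum.items()}
    return P_hat, r_hat
```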

TD exploits the Markov structure. Consider the A–B example, where the following eight (undiscounted) episodes are observed:

  • A, 0, B, 0
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 0

Batch MC estimates $V(A)=0$, since the only return ever observed from A is 0. Batch TD(0) instead matches the maximum-likelihood (certainty-equivalence) model: $V(B)=6/8=0.75$, and since A always transitions to B with reward 0, $V(A)=0.75$.
