Lecture 4 covers model-free control, including MC control and TD control.
- On-policy learning
- Direct experience
- Learn to estimate and evaluate a policy from experience obtained from following that policy
- Off-policy learning
- Learn to estimate and evaluate a policy using experience gathered from following a different policy
Monte Carlo Control
Monte Carlo with Exploring Starts

A Blackjack example is used to illustrate Monte Carlo with Exploring Starts (MCES).
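A minimal tabular sketch of MCES for a generic episodic task (not the full Blackjack example), assuming a hypothetical environment with discrete sizes `nS`/`nA`, a `reset_to(s)` method for starting an episode in an arbitrary state (exploring starts require this), and a Gym-style `step(a)` returning `(next_state, reward, done)`:

```python
import numpy as np

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit, tabular).

    Assumed interface: env.nS, env.nA, env.reset_to(s) -> s (hypothetical,
    starts an episode in state s), env.step(a) -> (next_state, reward, done).
    """
    Q = np.zeros((env.nS, env.nA))
    counts = np.zeros((env.nS, env.nA))
    policy = np.zeros(env.nS, dtype=int)            # deterministic greedy policy

    for _ in range(num_episodes):
        # Exploring start: random initial state-action pair.
        state = env.reset_to(np.random.randint(env.nS))
        action = np.random.randint(env.nA)

        # Generate the rest of the episode by following the current policy.
        episode, done = [], False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            action = policy[state]

        # First-visit MC updates, computing returns backwards.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / counts[s, a]
                policy[s] = int(np.argmax(Q[s]))    # policy improvement
    return Q, policy
```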
On-policy MC Control
Maintain an $\epsilon$-greedy policy

With a fixed $\epsilon$, this only achieves the best policy among the $\epsilon$-greedy policies, not the overall optimal policy.
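A small helper (hypothetical, reused by the sketches below) that samples from an $\epsilon$-greedy policy derived from a tabular `Q` array:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon take a uniformly random action,
    otherwise take the greedy action argmax_a Q[state, a]."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])     # explore
    return int(np.argmax(Q[state]))              # exploit
```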
Greedy in the Limit of Infinite Exploration (GLIE)
- All state-action pairs are visited an infinite number of times $$ \lim _{i \rightarrow \infty} N_{i}(s, a) = \infty $$
- Behavior policy (policy used to act in the world) converges to greedy policy $\lim _{i \rightarrow \infty} \pi(a \mid s) \rightarrow \arg \max _{a} Q(s, a)$ with probability 1
A simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is reduced to 0 with the following rate: $\epsilon_{i}=1 / i$
GLIE Monte-Carlo control converges to the optimal state-action value function $Q(s, a) \to Q^\ast(s, a)$.
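A sketch of GLIE Monte Carlo control with the $\epsilon_i = 1/i$ schedule, reusing `epsilon_greedy_action` above and assuming a Gym-style environment (`reset()`, `step(a) -> (next_state, reward, done)`, sizes `nS`/`nA`):

```python
def glie_mc_control(env, num_episodes, gamma=1.0):
    """Every-visit GLIE MC control: epsilon-greedy behavior with epsilon_i = 1/i
    and incremental averaging Q(s,a) += (G - Q(s,a)) / N(s,a)."""
    Q = np.zeros((env.nS, env.nA))
    N = np.zeros((env.nS, env.nA))

    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i                          # GLIE: epsilon decays to 0
        state, done, episode = env.reset(), False, []
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0.0
        for s, a, r in reversed(episode):          # returns computed backwards
            G = gamma * G + r
            N[s, a] += 1
            Q[s, a] += (G - Q[s, a]) / N[s, a]
    return Q
```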
Off-policy MC Control
Requires that the behavior policy be soft, so that every state-action pair continues to be visited.
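The notes state only the soft-behavior requirement; one common concrete scheme (incremental weighted importance sampling with a uniform random behavior policy, as in Sutton & Barto) is sketched below under the same assumed environment interface:

```python
def off_policy_mc_control(env, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling.

    Behavior policy b: uniform random (soft, so every (s, a) keeps being tried).
    Target policy: greedy with respect to Q."""
    Q = np.zeros((env.nS, env.nA))
    C = np.zeros((env.nS, env.nA))                 # cumulative importance weights

    for _ in range(num_episodes):
        # Generate an episode with the behavior policy b.
        state, done, episode = env.reset(), False, []
        while not done:
            action = np.random.randint(env.nA)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            if a != int(np.argmax(Q[s])):          # greedy target would not take a
                break                              # all earlier weights are zero
            W *= env.nA                            # W *= 1 / b(a|s), b uniform
    return Q
```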
TD Control
On-policy SARSA
The quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ gives the algorithm its name: SARSA.
$$ Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)\right] $$
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s, a) \rightarrow Q^{\ast}(s, a)$, under the following conditions:
- The policy sequence $\pi_{t}(a \mid s)$ satisfies the condition of GLIE
- The step-sizes $\alpha_{t}$ satisfy the Robbins-Monro conditions $$ \begin{aligned} &\sum_{t=1}^{\infty} \alpha_{t}=\infty \\ &\sum_{t=1}^{\infty} \alpha_{t}^{2}<\infty \end{aligned} $$
A typical choice is $\alpha_t = 1/t$, which satisfies both conditions.
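A tabular SARSA sketch that combines the GLIE schedule $\epsilon_i = 1/i$ with per-pair step sizes $\alpha = 1/N(s, a)$ (one way to satisfy the Robbins-Monro conditions), reusing `epsilon_greedy_action` and the assumed Gym-style interface:

```python
def sarsa(env, num_episodes, gamma=0.99):
    """On-policy SARSA: bootstrap from the action actually taken next."""
    Q = np.zeros((env.nS, env.nA))
    N = np.zeros((env.nS, env.nA))                 # visit counts -> step sizes

    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i                          # GLIE schedule
        state, done = env.reset(), False
        action = epsilon_greedy_action(Q, state, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, epsilon)
            N[state, action] += 1
            alpha = 1.0 / N[state, action]         # Robbins-Monro step size
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```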
Off-policy Q-learning
$$ Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right] $$
Q-learning directly approximates $q^\ast$, the optimal state-action value function.
It converges to the optimal $q^\ast$ if all $(s, a)$ pairs are visited infinitely often and the step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions.
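A tabular Q-learning sketch under the same assumptions; constant `alpha` and `epsilon` are used here for brevity, even though the convergence statement above needs decaying step sizes:

```python
def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy Q-learning: behave epsilon-greedily, but bootstrap from
    max_a Q(S_{t+1}, a), i.e. learn about the greedy target policy."""
    Q = np.zeros((env.nS, env.nA))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```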
Expected Sarsa
$$ \begin{aligned} Q\left(S_{t}, A_{t}\right) & \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \mathbb{E}_{\pi}\left[Q\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right]-Q\left(S_{t}, A_{t}\right)\right] \\ & \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \sum_{a} \pi\left(a \mid S_{t+1}\right) Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right] \end{aligned} $$
Expected Sarsa eliminates the variance due to the random selection of $A_{t+1}$ and therefore generally performs better than Sarsa.
When the state transitions are deterministic (so that all randomness comes from the policy), it can safely use $\alpha=1$ without any degradation of asymptotic performance.
If the target policy $\pi$ is greedy with respect to $Q$, Expected Sarsa is exactly Q-learning.
In a nutshell, Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.
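A sketch of a single Expected Sarsa update with an $\epsilon$-greedy target policy, using the same tabular `Q` conventions as above; with `epsilon=0` (greedy target) it reduces to the Q-learning target:

```python
def expected_sarsa_update(Q, s, a, r, s_next, done, alpha, gamma, epsilon):
    """Replace the sampled bootstrap Q(S_{t+1}, A_{t+1}) by its expectation
    sum_a' pi(a' | S_{t+1}) Q(S_{t+1}, a') under an epsilon-greedy pi."""
    nA = Q.shape[1]
    pi = np.full(nA, epsilon / nA)                 # epsilon-greedy probabilities
    pi[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
    expected_q = 0.0 if done else float(np.dot(pi, Q[s_next]))
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```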
Maximization Bias
Consider a single-state MDP ($|S|=1$) with two actions, where both actions have zero-mean random rewards, i.e. $\mathbb{E}\left[r \mid a=a_{1}\right]=\mathbb{E}\left[r \mid a=a_{2}\right]=0$.
Then $Q\left(s, a_{1}\right)=Q\left(s, a_{2}\right)=0=V(s)$ for any policy.
However, the estimate can be biased upward:
$$ \hat{V}^{\hat{\pi}}(s)=\mathbb{E}\left[\max \left\{\hat{Q}\left(s, a_{1}\right), \hat{Q}\left(s, a_{2}\right)\right\}\right] \geq \max \left\{\mathbb{E}\left[\hat{Q}\left(s, a_{1}\right)\right], \mathbb{E}\left[\hat{Q}\left(s, a_{2}\right)\right]\right\}=\max \{0,0\}=0=V^{\pi}(s) $$
The greedy policy w.r.t. estimated $Q$ values can yield a maximization bias during finite-sample learning.
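A small numerical illustration (hypothetical sample sizes) of the inequality above: both rewards are drawn from $N(0, 1)$, so the true values are 0, yet the expected maximum of the sample-mean estimates is clearly positive:

```python
import numpy as np

def maximization_bias_demo(num_trials=10_000, samples_per_action=5, seed=0):
    """Estimate E[max(Q_hat(s, a1), Q_hat(s, a2))] when both true values are 0
    and each Q_hat is the sample mean of a few N(0, 1) rewards."""
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(num_trials):
        q1_hat = rng.normal(0.0, 1.0, samples_per_action).mean()
        q2_hat = rng.normal(0.0, 1.0, samples_per_action).mean()
        maxima.append(max(q1_hat, q2_hat))
    return float(np.mean(maxima))                  # well above the true value 0
```

With 5 samples per action the estimate comes out around $1/\sqrt{5\pi} \approx 0.25$ rather than the true value 0.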
Double Q-learning
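Double Q-learning removes this bias by keeping two independent estimates $Q_1$ and $Q_2$: on each step one of them is chosen at random to select the maximizing action while the other evaluates it, so the same noisy estimate is never used for both selection and evaluation. A tabular sketch reusing `epsilon_greedy_action` and the same assumed interface:

```python
def double_q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Double Q-learning; behaves epsilon-greedily w.r.t. Q1 + Q2."""
    Q1 = np.zeros((env.nS, env.nA))
    Q2 = np.zeros((env.nS, env.nA))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q1 + Q2, state, epsilon)
            next_state, reward, done = env.step(action)
            # Randomly pick which estimate to update this step.
            A, B = (Q1, Q2) if np.random.rand() < 0.5 else (Q2, Q1)
            a_star = int(np.argmax(A[next_state]))          # select with A
            target = reward + (0.0 if done else gamma * B[next_state, a_star])  # evaluate with B
            A[state, action] += alpha * (target - A[state, action])
            state = next_state
    return Q1, Q2
```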
