Lecture 4 covers model-free control, including MC control and TD control.
- On-policy learning
- Direct experience
- Learn to estimate and evaluate a policy from experience obtained from following that policy
- Off-policy learning
- Learn to estimate and evaluate a policy using experience gathered from following a different policy
Monte Carlo Control
Monte Carlo with Exploring Starts

A Blackjack game is presented to elucidate MCES.
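In MCES, exploring starts means every episode begins from a randomly chosen state-action pair, so all pairs are guaranteed to be tried; the rest of the episode follows the current greedy policy. Below is a minimal first-visit sketch under an assumed generative interface (`env.states`, `env.actions`, `env.step(s, a) -> (next_state, reward, done)`); names and constants are illustrative, not the lecture's code.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, num_episodes=10000, gamma=1.0):
    """First-visit Monte Carlo control with exploring starts.
    Assumes a generative model: env.states, env.actions, and
    env.step(s, a) -> (next_state, reward, done)."""
    Q = defaultdict(float)      # action-value estimates Q[(s, a)]
    N = defaultdict(int)        # visit counts for incremental averaging
    policy = {s: random.choice(env.actions) for s in env.states}

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has nonzero start probability.
        s = random.choice(env.states)
        a = random.choice(env.actions)
        episode, done = [], False
        while not done:
            next_s, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = next_s
            if not done:
                a = policy[s]               # follow the current greedy policy

        # Index of the first visit of each (s, a) pair in this episode.
        first_visit = {}
        for t, (s_t, a_t, _) in enumerate(episode):
            first_visit.setdefault((s_t, a_t), t)

        # Backward pass: accumulate returns and update first-visit estimates.
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if first_visit[(s_t, a_t)] == t:
                N[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]
                # Policy improvement: greedy w.r.t. the updated estimates.
                policy[s_t] = max(env.actions, key=lambda b: Q[(s_t, b)])
    return Q, policy
```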
On-policy MC Control
Maintain an $\epsilon$-greedy ($\epsilon$-soft) policy and, after each episode, improve it greedily with respect to the current Monte Carlo estimate of $Q(s, a)$.

With a fixed $\epsilon$, this can only achieve the best policy among the $\epsilon$-soft policies, not the overall optimal policy, which motivates decaying the exploration.
Greedy in the Limit of Infinite Exploration (GLIE)
- All state-action pairs are visited an infinite number of times: $\lim_{i \to \infty} N_i(s, a) = \infty$
- Behavior policy (policy used to act in the world) converges to the greedy policy with probability 1: $\lim_{i \to \infty} \pi_i(a \mid s) = \mathbb{1}\big(a = \arg\max_{a'} Q_i(s, a')\big)$
A simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is decayed to zero with $\epsilon_i = \frac{1}{i}$, $i$ being the episode index.
GLIE Monte-Carlo control converges to the optimal state-action value function, $Q(s, a) \to q^*(s, a)$.
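A minimal sketch of GLIE Monte Carlo control with $\epsilon_k = 1/k$, under an assumed episodic interface (`env.reset()`, `env.step(a) -> (next_state, reward, done)`, `env.actions`); the constants are illustrative.

```python
import random
from collections import defaultdict

def glie_mc_control(env, num_episodes=10000, gamma=1.0):
    """GLIE Monte Carlo control: epsilon-greedy with epsilon_k = 1/k, so every
    (s, a) keeps being visited while the policy becomes greedy in the limit."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def epsilon_greedy(s, eps):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                        # GLIE schedule: epsilon -> 0
        s, done, episode = env.reset(), False, []
        while not done:
            a = epsilon_greedy(s, eps)
            next_s, r, done = env.step(a)
            episode.append((s, a, r))
            s = next_s

        # Every-visit incremental Monte Carlo update of Q along the episode.
        G = 0.0
        for s_t, a_t, r_t in reversed(episode):
            G = gamma * G + r_t
            N[(s_t, a_t)] += 1
            Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]
    return Q
```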
Off-policy MC Control
Requires that the behavior policy be soft, i.e., $\mu(a \mid s) > 0$ for all $s$ and $a$, to ensure every state-action pair the target policy might select is visited; returns generated by the behavior policy are then reweighted by importance sampling to evaluate and improve the target policy.
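A sketch of one common realization, off-policy MC control with weighted importance sampling (an $\epsilon$-soft behavior policy, a greedy target policy); the environment interface and constants are assumptions for illustration.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10000, gamma=1.0, eps=0.1):
    """Off-policy MC control with weighted importance sampling.
    Behavior policy: epsilon-soft w.r.t. Q; target policy: greedy w.r.t. Q."""
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative importance-sampling weights

    for _ in range(num_episodes):
        s, done, episode = env.reset(), False, []
        while not done:
            # Epsilon-soft behavior policy; record the probability of the chosen action.
            greedy = max(env.actions, key=lambda b: Q[(s, b)])
            probs = {b: eps / len(env.actions) + ((1.0 - eps) if b == greedy else 0.0)
                     for b in env.actions}
            a = random.choices(env.actions, weights=[probs[b] for b in env.actions])[0]
            next_s, r, done = env.step(a)
            episode.append((s, a, r, probs[a]))
            s = next_s

        # Backward pass with weighted importance sampling toward the greedy target.
        G, W = 0.0, 1.0
        for s_t, a_t, r_t, b_prob in reversed(episode):
            G = gamma * G + r_t
            C[(s_t, a_t)] += W
            Q[(s_t, a_t)] += (W / C[(s_t, a_t)]) * (G - Q[(s_t, a_t)])
            if a_t != max(env.actions, key=lambda b: Q[(s_t, b)]):
                break                      # greedy target policy would never take a_t
            W *= 1.0 / b_prob              # pi(a_t | s_t) = 1 for the greedy target
    return Q
```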
TD Control
On-policy SARSA
Quintuple of events $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which gives the algorithm its name; the update is $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$.
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s, a) \to q^*(s, a)$, if
- The policy sequence $\pi_t(a \mid s)$ satisfies the condition of GLIE
- The step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

A typical selection is $\alpha_t = \frac{1}{t}$.
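A sketch of tabular SARSA with an $\epsilon$-greedy behavior policy; the environment interface and hyperparameters are illustrative assumptions (a fixed $\epsilon$ and $\alpha$, as used here, do not satisfy the GLIE and Robbins-Monro conditions above, but are common in practice).

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: update toward R + gamma * Q(S', A') where A' is
    the action actually taken next (the S, A, R, S', A' quintuple)."""
    Q = defaultdict(float)

    def epsilon_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            next_s, r, done = env.step(a)
            next_a = epsilon_greedy(next_s) if not done else None
            target = r + (0.0 if done else gamma * Q[(next_s, next_a)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # TD(0) update
            s, a = next_s, next_a
    return Q
```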
Off-policy Q-learning
Directly approximates the optimal action-value function $q^*$, independent of the policy being followed: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)]$
Converges to the optimal $q^*$ provided all state-action pairs continue to be visited and the step-sizes satisfy the Robbins-Monro conditions.
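A sketch of tabular Q-learning under the same assumed interface; it differs from the SARSA sketch only in the target, which bootstraps from $\max_a Q(S_{t+1}, a)$ rather than from the action actually taken.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD control: bootstrap from max_a Q(S', a) regardless of the
    action the epsilon-greedy behavior policy will actually take next."""
    Q = defaultdict(float)

    def epsilon_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            next_s, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(next_s, b)] for b in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = next_s
    return Q
```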
Expected Sarsa
Expected Sarsa eliminates the variance due to the random selection of $A_{t+1}$ by taking the expectation over next actions under the policy: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)]$
It can safely set $\alpha = 1$ when state transitions are deterministic (e.g., cliff walking), since the only randomness then comes from the policy.
If the target policy $\pi$ is greedy with respect to $Q$ while behavior follows a more exploratory policy, Expected Sarsa is exactly Q-learning.
In a nutshell, Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.
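A sketch of the Expected Sarsa backup under an $\epsilon$-greedy policy; only the target differs from SARSA and Q-learning, so just that piece is shown (the function name, the dict-based tabular Q, and the hyperparameters are assumptions for illustration).

```python
from collections import defaultdict

def expected_sarsa_update(Q, s, a, r, next_s, done, actions,
                          alpha=0.5, gamma=0.99, eps=0.1):
    """One Expected Sarsa backup: average Q(S', .) under the epsilon-greedy
    policy instead of sampling A' (SARSA) or maximizing over it (Q-learning)."""
    if done:
        expected_next = 0.0
    else:
        greedy = max(actions, key=lambda b: Q[(next_s, b)])
        expected_next = sum(
            (eps / len(actions) + ((1.0 - eps) if b == greedy else 0.0)) * Q[(next_s, b)]
            for b in actions
        )
    Q[(s, a)] += alpha * (r + gamma * expected_next - Q[(s, a)])

# Minimal usage with a tabular Q stored as a defaultdict:
Q = defaultdict(float)
expected_sarsa_update(Q, s="s0", a=0, r=1.0, next_s="s1", done=False, actions=[0, 1])
```

Setting `eps = 0` inside the expectation (a greedy target policy) while the behavior policy stays exploratory makes `expected_next` equal to `max_b Q[(next_s, b)]`, which is exactly the Q-learning target.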
Maximization Bias
Consider a single-state MDP ($|S| = 1$) with two actions $a_1, a_2$, each giving zero-mean random rewards: $\mathbb{E}[r \mid a_1] = \mathbb{E}[r \mid a_2] = 0$.
Then the true values are $Q(s, a_1) = Q(s, a_2) = 0 = V(s)$.
However, the estimate can be biased: since $\max$ is convex, Jensen's inequality gives $\mathbb{E}[\max(\hat Q(s, a_1), \hat Q(s, a_2))] \ge \max(\mathbb{E}[\hat Q(s, a_1)], \mathbb{E}[\hat Q(s, a_2)]) = 0$, with strict inequality under sampling noise.
The greedy policy w.r.t. the estimated $\hat Q$ therefore favors whichever action happens to be overestimated, a systematic overestimation known as maximization bias.
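A tiny simulation of this example (the sample counts, noise distribution, and seed are arbitrary choices): both actions have true mean reward 0, yet the max of the two sample means is positive in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_runs, samples_per_action = 100000, 5

# Both actions have true mean reward 0; estimate each mean from a few samples.
q1_hat = rng.normal(0.0, 1.0, size=(num_runs, samples_per_action)).mean(axis=1)
q2_hat = rng.normal(0.0, 1.0, size=(num_runs, samples_per_action)).mean(axis=1)

# E[max(Q1_hat, Q2_hat)] is clearly positive although max(E[Q1_hat], E[Q2_hat]) = 0.
print(np.maximum(q1_hat, q2_hat).mean())   # roughly 0.25 with these settings
```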
Double Q-learning

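Double Q-learning removes the maximization bias by keeping two independent estimates $Q_1$ and $Q_2$ and decoupling action selection from action evaluation: on each step, a coin flip picks one table to update toward $R_{t+1} + \gamma\, Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a))$, and symmetrically for the other table. A minimal tabular sketch follows, with the same assumed environment interface and illustrative hyperparameters as the earlier snippets.

```python
import random
from collections import defaultdict

def double_q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Double Q-learning: one table picks the argmax action, the other evaluates
    it, which removes the maximization bias of the single-table estimator."""
    Q1, Q2 = defaultdict(float), defaultdict(float)

    def epsilon_greedy(s):
        # Behave w.r.t. the sum of the two estimates.
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            next_s, r, done = env.step(a)
            # Flip a coin to decide which table to update this step.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = r
            else:
                a_star = max(env.actions, key=lambda b: A[(next_s, b)])  # select with A
                target = r + gamma * B[(next_s, a_star)]                 # evaluate with B
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = next_s
    return Q1, Q2
```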