Lecture 4 covers model-free control, including MC control and TD control.
- On-policy learning
- Direct experience
- Learn to estimate and evaluate a policy from experience obtained from following that policy
- Off-policy learning
- Learn to estimate and evaluate a policy using experience gathered from following a different policy
Monte Carlo Control
Monte Carlo with Exploring Starts

A Blackjack example is used to illustrate Monte Carlo with Exploring Starts (MCES).
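A minimal tabular sketch of MCES for a generic episodic task (not the full Blackjack example), assuming a hypothetical environment with discrete sizes `nS`/`nA`, a `reset_to(s)` method for starting an episode in an arbitrary state (exploring starts require this), and a Gym-style `step(a)` returning `(next_state, reward, done)`:

```python
import numpy as np

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit, tabular).

    Assumed interface: env.nS, env.nA, env.reset_to(s) -> s (hypothetical,
    starts an episode in state s), env.step(a) -> (next_state, reward, done).
    """
    Q = np.zeros((env.nS, env.nA))
    counts = np.zeros((env.nS, env.nA))
    policy = np.zeros(env.nS, dtype=int)            # deterministic greedy policy

    for _ in range(num_episodes):
        # Exploring start: random initial state-action pair.
        state = env.reset_to(np.random.randint(env.nS))
        action = np.random.randint(env.nA)

        # Generate the rest of the episode by following the current policy.
        episode, done = [], False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            action = policy[state]

        # First-visit MC updates, computing returns backwards.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / counts[s, a]
                policy[s] = int(np.argmax(Q[s]))    # policy improvement
    return Q, policy
```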
On-policy MC Control
Maintain an $\epsilon$-greedy policy

With a fixed $\epsilon$, this only achieves the best policy among the $\epsilon$-greedy policies, not the overall optimal policy.
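A small helper (hypothetical, reused by the sketches below) that samples from an $\epsilon$-greedy policy derived from a tabular `Q` array:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon take a uniformly random action,
    otherwise take the greedy action argmax_a Q[state, a]."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])     # explore
    return int(np.argmax(Q[state]))              # exploit
```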
Greedy in the Limit of Infinite Exploration (GLIE)
- All state-action pairs are visited an infinite number of times $$ \lim _{i \rightarrow \infty} N_{i}(s, a) = \infty $$
- Behavior policy (policy used to act in the world) converges to greedy policy $\lim _{i \rightarrow \infty} \pi(a \mid s) \rightarrow \arg \max _{a} Q(s, a)$ with probability 1
A simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is reduced to 0 with the following rate: $\epsilon_{i}=1 / i$
GLIE Monte-Carlo control converges to the optimal state-action value function $Q(s, a) \to Q^\ast(s, a)$.
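A sketch of GLIE Monte Carlo control with the $\epsilon_i = 1/i$ schedule, reusing `epsilon_greedy_action` above and assuming a Gym-style environment (`reset()`, `step(a) -> (next_state, reward, done)`, sizes `nS`/`nA`):

```python
def glie_mc_control(env, num_episodes, gamma=1.0):
    """Every-visit GLIE MC control: epsilon-greedy behavior with epsilon_i = 1/i
    and incremental averaging Q(s,a) += (G - Q(s,a)) / N(s,a)."""
    Q = np.zeros((env.nS, env.nA))
    N = np.zeros((env.nS, env.nA))

    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i                          # GLIE: epsilon decays to 0
        state, done, episode = env.reset(), False, []
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0.0
        for s, a, r in reversed(episode):          # returns computed backwards
            G = gamma * G + r
            N[s, a] += 1
            Q[s, a] += (G - Q[s, a]) / N[s, a]
    return Q
```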
Off-policy MC Control
Requires that the behavior policy be soft, so that every state-action pair continues to be visited.
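The notes state only the soft-behavior requirement; one common concrete scheme (incremental weighted importance sampling with a uniform random behavior policy, as in Sutton & Barto) is sketched below under the same assumed environment interface:

```python
def off_policy_mc_control(env, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling.

    Behavior policy b: uniform random (soft, so every (s, a) keeps being tried).
    Target policy: greedy with respect to Q."""
    Q = np.zeros((env.nS, env.nA))
    C = np.zeros((env.nS, env.nA))                 # cumulative importance weights

    for _ in range(num_episodes):
        # Generate an episode with the behavior policy b.
        state, done, episode = env.reset(), False, []
        while not done:
            action = np.random.randint(env.nA)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            if a != int(np.argmax(Q[s])):          # greedy target would not take a
                break                              # all earlier weights are zero
            W *= env.nA                            # W *= 1 / b(a|s), b uniform
    return Q
```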
TD Control
On-policy SARSA
The quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ gives the algorithm its name: SARSA.
$$ Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)\right] $$
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s, a) \rightarrow Q^{\ast}(s, a)$, under the following conditions:
- The policy sequence $\pi_{t}(a \mid s)$ satisfies the condition of GLIE
- The step-sizes $\alpha_{t}$ satisfy the Robbins-Monro conditions $$ \begin{aligned} &\sum_{t=1}^{\infty} \alpha_{t}=\infty \\ &\sum_{t=1}^{\infty} \alpha_{t}^{2}<\infty \end{aligned} $$
A typical choice is $\alpha_t = 1/t$, which satisfies both conditions.
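A tabular SARSA sketch that combines the GLIE schedule $\epsilon_i = 1/i$ with per-pair step sizes $\alpha = 1/N(s, a)$ (one way to satisfy the Robbins-Monro conditions), reusing `epsilon_greedy_action` and the assumed Gym-style interface:

```python
def sarsa(env, num_episodes, gamma=0.99):
    """On-policy SARSA: bootstrap from the action actually taken next."""
    Q = np.zeros((env.nS, env.nA))
    N = np.zeros((env.nS, env.nA))                 # visit counts -> step sizes

    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i                          # GLIE schedule
        state, done = env.reset(), False
        action = epsilon_greedy_action(Q, state, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, epsilon)
            N[state, action] += 1
            alpha = 1.0 / N[state, action]         # Robbins-Monro step size
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```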
Off-policy Q-learning
$$ Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right] $$
Q-learning directly approximates $q^\ast$, the optimal state-action value function.
It converges to the optimal $q^\ast$ if all $(s, a)$ pairs are visited infinitely often and the step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions.
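A tabular Q-learning sketch under the same assumptions; constant `alpha` and `epsilon` are used here for brevity, even though the convergence statement above needs decaying step sizes:

```python
def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy Q-learning: behave epsilon-greedily, but bootstrap from
    max_a Q(S_{t+1}, a), i.e. learn about the greedy target policy."""
    Q = np.zeros((env.nS, env.nA))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```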
Expected Sarsa
$$ \begin{aligned} Q\left(S_{t}, A_{t}\right) & \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \mathbb{E}_{\pi}\left[Q\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right]-Q\left(S_{t}, A_{t}\right)\right] \\ & \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \sum_{a} \pi\left(a \mid S_{t+1}\right) Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right] \end{aligned} $$
Expected Sarsa eliminates the variance due to the random selection of $A_{t+1}$ and therefore generally performs better than Sarsa.
When the state transitions are deterministic (so that all randomness comes from the policy), it can safely use $\alpha=1$ without any degradation of asymptotic performance.
If the target policy $\pi$ is greedy with respect to $Q$, Expected Sarsa is exactly Q-learning.
In a nutshell, Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.
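A sketch of a single Expected Sarsa update with an $\epsilon$-greedy target policy, using the same tabular `Q` conventions as above; with `epsilon=0` (greedy target) it reduces to the Q-learning target:

```python
def expected_sarsa_update(Q, s, a, r, s_next, done, alpha, gamma, epsilon):
    """Replace the sampled bootstrap Q(S_{t+1}, A_{t+1}) by its expectation
    sum_a' pi(a' | S_{t+1}) Q(S_{t+1}, a') under an epsilon-greedy pi."""
    nA = Q.shape[1]
    pi = np.full(nA, epsilon / nA)                 # epsilon-greedy probabilities
    pi[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
    expected_q = 0.0 if done else float(np.dot(pi, Q[s_next]))
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```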
Maximization Bias
Consider a single-state MDP ($|S|=1$) with two actions, where both actions have zero-mean random rewards, i.e. $\mathbb{E}\left[r \mid a=a_{1}\right]=\mathbb{E}\left[r \mid a=a_{2}\right]=0$.
Then $Q\left(s, a_{1}\right)=Q\left(s, a_{2}\right)=0=V(s)$ for any policy.
However, the estimate can be biased upward:
$$ \hat{V}^{\hat{\pi}}(s)=\mathbb{E}\left[\max \left\{\hat{Q}\left(s, a_{1}\right), \hat{Q}\left(s, a_{2}\right)\right\}\right] \geq \max \left\{\mathbb{E}\left[\hat{Q}\left(s, a_{1}\right)\right], \mathbb{E}\left[\hat{Q}\left(s, a_{2}\right)\right]\right\}=\max \{0,0\}=0=V^{\pi}(s) $$
The greedy policy w.r.t. estimated $Q$ values can yield a maximization bias during finite-sample learning.
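A small numerical illustration (hypothetical sample sizes) of the inequality above: both rewards are drawn from $N(0, 1)$, so the true values are 0, yet the expected maximum of the sample-mean estimates is clearly positive:

```python
import numpy as np

def maximization_bias_demo(num_trials=10_000, samples_per_action=5, seed=0):
    """Estimate E[max(Q_hat(s, a1), Q_hat(s, a2))] when both true values are 0
    and each Q_hat is the sample mean of a few N(0, 1) rewards."""
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(num_trials):
        q1_hat = rng.normal(0.0, 1.0, samples_per_action).mean()
        q2_hat = rng.normal(0.0, 1.0, samples_per_action).mean()
        maxima.append(max(q1_hat, q2_hat))
    return float(np.mean(maxima))                  # well above the true value 0
```

With 5 samples per action the estimate comes out around $1/\sqrt{5\pi} \approx 0.25$ rather than the true value 0.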
Double Q-learning
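Double Q-learning removes this bias by keeping two independent estimates $Q_1$ and $Q_2$: on each step one of them is chosen at random to select the maximizing action while the other evaluates it, so the same noisy estimate is never used for both selection and evaluation. A tabular sketch reusing `epsilon_greedy_action` and the same assumed interface:

```python
def double_q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Double Q-learning; behaves epsilon-greedily w.r.t. Q1 + Q2."""
    Q1 = np.zeros((env.nS, env.nA))
    Q2 = np.zeros((env.nS, env.nA))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q1 + Q2, state, epsilon)
            next_state, reward, done = env.step(action)
            # Randomly pick which estimate to update this step.
            A, B = (Q1, Q2) if np.random.rand() < 0.5 else (Q2, Q1)
            a_star = int(np.argmax(A[next_state]))          # select with A
            target = reward + (0.0 if done else gamma * B[next_state, a_star])  # evaluate with B
            A[state, action] += alpha * (target - A[state, action])
            state = next_state
    return Q1, Q2
```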
