Lecture 4: Model Free Control

Lecture 4 mainly covers model-free control, including MC control and TD control.

  • On-policy learning
    • Direct experience
    • Learn to estimate and evaluate a policy from experience obtained by following that policy
  • Off-policy learning
    • Learn to estimate and evaluate a policy using experience gathered by following a different policy

Monte Carlo Control

Monte Carlo with Exploring Starts

A Blackjack game is presented to elucidate MCES.
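
As a rough illustration (not code from the lecture), here is a minimal tabular sketch of MC control with exploring starts. The environment interface (`nS`, `nA`, `reset_to(s, a)` and `step(a)` both returning `(next_state, reward, done)`) is a hypothetical one chosen for the example.

```python
import numpy as np
from collections import defaultdict

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Every-visit MC control with exploring starts (hypothetical env interface)."""
    Q = defaultdict(lambda: np.zeros(env.nA))      # action-value estimates
    N = defaultdict(lambda: np.zeros(env.nA))      # visit counts for incremental means
    policy = defaultdict(int)                      # greedy action per state

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair can begin an episode.
        s, a = np.random.randint(env.nS), np.random.randint(env.nA)
        s_next, r, done = env.reset_to(s, a)       # assumed: force the first (s, a)
        episode = [(s, a, r)]
        s = s_next
        while not done:
            a = policy[s]                          # act greedily after the first step
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Walk the episode backwards, update Q toward observed returns, improve greedily.
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
            policy[s] = int(np.argmax(Q[s]))
    return Q, policy
```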

On-policy MC Control

Maintain an $\epsilon$-greedy policy.

This only achieves the best policy among the $\epsilon$-greedy policies, not the overall optimal policy.
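
A small $\epsilon$-greedy action-selection helper, reused by the sketches below (the interface is my own choice, not from the lecture):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick an action epsilon-greedily w.r.t. a 1-D array of action values."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: current greedy action
```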

Greedy in the Limit of Infinite Exploration (GLIE)

  • All state-action pairs are visited an infinite number of times: $\lim_{i \to \infty} N_i(s, a) \to \infty$
  • The behavior policy (the policy used to act in the world) converges to the greedy policy: $\lim_{i \to \infty} \pi_i(a \mid s) \to \arg\max_a Q(s, a)$ with probability 1

A simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is reduced to 0 with the following rate: $\epsilon_i = 1/i$.

GLIE Monte-Carlo control converges to the optimal state-action value function: $Q(s, a) \to Q^*(s, a)$.
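
A sketch of GLIE Monte-Carlo control with the $\epsilon_i = 1/i$ schedule, reusing the `epsilon_greedy` helper above and assuming a hypothetical tabular env with `reset()` and `step(a) -> (next_state, reward, done)`:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0):
    """Every-visit GLIE MC control: epsilon_i = 1/i on the i-th episode."""
    Q = defaultdict(lambda: np.zeros(env.nA))
    N = defaultdict(lambda: np.zeros(env.nA))

    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i                          # GLIE schedule: epsilon -> 0
        s, done, episode = env.reset(), False, []
        while not done:
            a = epsilon_greedy(Q[s], epsilon)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # MC update of Q toward the observed returns.
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q
```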

Off-policy MC Control

Requires that the behavior policy be soft (i.e., $\pi(a \mid s) > 0$ for all $s$ and $a$), to ensure that every state-action pair is visited.
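
One standard way to do this (a sketch, not necessarily the variant used in the lecture) is off-policy MC control with weighted importance sampling: a soft, uniform-random behavior policy generates episodes, and a greedy target policy is improved from them. Same hypothetical env interface as above.

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling and a greedy target."""
    Q = defaultdict(lambda: np.zeros(env.nA))
    C = defaultdict(lambda: np.zeros(env.nA))       # cumulative importance weights

    for _ in range(num_episodes):
        # Behavior policy: uniform random, hence soft (b(a|s) = 1/nA > 0 for all a).
        s, done, episode = env.reset(), False, []
        while not done:
            a = np.random.randint(env.nA)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Backward pass with importance weights W = prod pi(a|s) / b(a|s).
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            if a != int(np.argmax(Q[s])):           # greedy target puts 0 weight on a
                break                               # remaining weights are all zero
            W *= env.nA                             # pi(a|s)/b(a|s) = 1 / (1/nA)
    return Q
```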

TD Control

On-policy SARSA

The quintuple of events $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ gives the algorithm its name: SARSA.

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s, a) \to Q^*(s, a)$, under the following conditions:

  1. The policy sequence $\pi_t(a \mid s)$ satisfies the GLIE condition
  2. The step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

A typical selection is $\alpha_t = O(1/t)$; for example, $\alpha_t = 1/t$ satisfies both conditions.
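
A tabular SARSA sketch with fixed $\alpha$ and $\epsilon$ (common in practice, even though the guarantee above asks for a GLIE $\epsilon$ schedule and Robbins-Monro step sizes). It reuses the `epsilon_greedy` helper and the hypothetical env interface from the MC sketches.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: bootstrap from the action actually taken next."""
    Q = defaultdict(lambda: np.zeros(env.nA))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], epsilon)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q[s_next], epsilon)
            # SARSA target: R + gamma * Q(S', A'), with A' sampled from the same policy.
            target = r + gamma * Q[s_next][a_next] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```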

Off-policy Q-learning

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

Directly approximates $q_*$, the optimal state-action value function.

Converges to the optimal $q_*$ if all $(s, a)$ pairs are visited infinitely often and the step-sizes $\alpha_t$ satisfy the Robbins-Monro conditions.
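
The corresponding Q-learning sketch differs from SARSA only in the target, which maximizes over next actions instead of using the action the behavior policy takes (same assumptions and helper as above):

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, bootstrap from the max."""
    Q = defaultdict(lambda: np.zeros(env.nA))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)            # exploratory behavior policy
            s_next, r, done = env.step(a)
            # Q-learning target: R + gamma * max_a Q(S', a), i.e. a greedy target policy.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```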

Expected Sarsa

$$\begin{aligned} Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, \mathbb{E}_{\pi}\!\left[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1} \right] - Q(S_t, A_t) \right] \\ &= Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right] \end{aligned}$$

Expected Sarsa eliminates the variance due to the random selection of $A_{t+1}$, and thus generally outperforms Sarsa.

In problems where all randomness comes from the policy (e.g., deterministic state transitions), Expected Sarsa can safely set $\alpha = 1$ without suffering any degradation of asymptotic performance.

If $\pi$ is the greedy policy (while the behavior policy remains more exploratory), Expected Sarsa is exactly Q-learning.

In a nutshell, Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.
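
A sketch of Expected Sarsa with an $\epsilon$-greedy target policy (same assumed interface and helper as above); setting `epsilon = 0` makes the expectation collapse to the max, recovering the Q-learning target.

```python
import numpy as np
from collections import defaultdict

def expected_sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Expected Sarsa: bootstrap from the expectation of Q under the policy."""
    Q = defaultdict(lambda: np.zeros(env.nA))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)
            s_next, r, done = env.step(a)
            # pi(.|S'): epsilon/nA on every action plus (1 - epsilon) on the greedy one.
            probs = np.full(env.nA, epsilon / env.nA)
            probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
            expected_q = float(np.dot(probs, Q[s_next]))
            Q[s][a] += alpha * (r + gamma * expected_q * (not done) - Q[s][a])
            s = s_next
    return Q
```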

Maximization Bias

Consider a single-state MDP ($|S| = 1$) with two actions, where both actions have zero-mean random rewards, i.e. $\mathbb{E}[r \mid a = a_1] = \mathbb{E}[r \mid a = a_2] = 0$.

Then $Q(s, a_1) = Q(s, a_2) = 0 = V(s)$ for any policy.

However, the estimate can be biased:

$$\hat{V}^{\hat{\pi}}(s) = \mathbb{E}\left[ \max\left( \hat{Q}(s, a_1), \hat{Q}(s, a_2) \right) \right] \ge \max\left( \mathbb{E}[\hat{Q}(s, a_1)],\ \mathbb{E}[\hat{Q}(s, a_2)] \right) = \max(0, 0) = 0 = V^{\pi}$$

The greedy policy w.r.t. estimated Q values can yield a maximization bias during finite-sample learning.
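
A tiny self-contained simulation of the example above (the sample sizes and reward distribution are arbitrary choices): both true action values are 0, yet the max of the noisy sample means is positive in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, samples_per_action = 10_000, 5

greedy_value_estimates = []
for _ in range(num_trials):
    # Two actions, both with zero-mean (standard normal) rewards.
    q_hat_a1 = rng.normal(0.0, 1.0, samples_per_action).mean()
    q_hat_a2 = rng.normal(0.0, 1.0, samples_per_action).mean()
    # Greedy value estimate: max over the two noisy Q estimates.
    greedy_value_estimates.append(max(q_hat_a1, q_hat_a2))

print(np.mean(greedy_value_estimates))   # noticeably > 0, although V(s) = 0
```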

Double Q-learning
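
Double Q-learning tackles the maximization bias by keeping two independent estimates, $Q_1$ and $Q_2$: one selects the maximizing action and the other evaluates it, so the same estimation noise is not used for both selection and evaluation. A minimal sketch (same assumed interface and helper as above):

```python
import numpy as np
from collections import defaultdict

def double_q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Double Q-learning: decouple action selection from action evaluation."""
    Q1 = defaultdict(lambda: np.zeros(env.nA))
    Q2 = defaultdict(lambda: np.zeros(env.nA))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q1[s] + Q2[s], epsilon)   # behave w.r.t. the combined estimate
            s_next, r, done = env.step(a)
            if np.random.random() < 0.5:
                # Q1 selects the argmax action, Q2 evaluates it.
                a_star = int(np.argmax(Q1[s_next]))
                target = r + gamma * Q2[s_next][a_star] * (not done)
                Q1[s][a] += alpha * (target - Q1[s][a])
            else:
                # Symmetric update: Q2 selects, Q1 evaluates.
                a_star = int(np.argmax(Q2[s_next]))
                target = r + gamma * Q1[s_next][a_star] * (not done)
                Q2[s][a] += alpha * (target - Q2[s][a])
            s = s_next
    return Q1, Q2
```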
