Lecture 8: Policy Gradient

In this lecture we consider methods that learn a parameterized policy capable of selecting actions without consulting a value function.

We denote the policy's parameter vector by $\boldsymbol{\theta}$ and write the policy as

$$ \pi(a \mid s, \boldsymbol{\theta})=\operatorname{Pr}\left\{A_t=a \mid S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\right\} $$
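One common way to realize such a parameterization (for discrete actions) is a softmax over linear action preferences. The sketch below is illustrative only; the feature function `x(s, a)` and all names are assumptions, not something specified in the lecture.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """pi(a | s, theta) for a linear softmax parameterization.

    Action preferences are h(s, a, theta) = theta . x(s, a), where x is a
    (hypothetical) function returning a feature vector for a state-action pair.
    """
    prefs = np.array([theta @ x(s, a) for a in actions])
    prefs -= prefs.max()               # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()         # probability of each action under pi
```

Because the probabilities are produced directly from $\boldsymbol{\theta}$, actions can be sampled from this distribution without ever evaluating a value function.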

Given a scalar performance measure $J(\boldsymbol{\theta})$, the method seeks to maximize performance by approximating the gradient of $J(\boldsymbol{\theta})$ and performing the gradient-ascent update:

$$ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\alpha \widehat{\nabla J\left(\boldsymbol{\theta}_t\right)} $$

where $\widehat{\nabla J\left(\boldsymbol{\theta}_t\right)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to $\boldsymbol{\theta}_t$.
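The update itself is ordinary stochastic gradient ascent. The minimal sketch below assumes the gradient estimate is supplied from elsewhere (for example, a sampled return-weighted score-function estimate); the dummy random estimate in the usage loop is purely a placeholder.

```python
import numpy as np

def policy_gradient_step(theta, grad_J_hat, alpha):
    """One iteration: theta_{t+1} = theta_t + alpha * grad_J_hat(theta_t).

    grad_J_hat is any stochastic estimate whose expectation approximates
    the true gradient of the performance measure J at theta_t.
    """
    return theta + alpha * grad_J_hat

# Illustrative usage with a stand-in (random) gradient estimate:
theta = np.zeros(4)
for t in range(10):
    grad_hat = np.random.randn(4)      # placeholder for a sampled gradient estimate
    theta = policy_gradient_step(theta, grad_hat, alpha=0.01)
```

Note the plus sign: unlike loss minimization, we ascend the gradient because $J(\boldsymbol{\theta})$ measures performance to be maximized.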
