In this lecture we consider methods that learn a parameterized policy that can select actions without consulting a value function.
We denote the policy's parameter vector by $\boldsymbol{\theta}$ and write the policy as
$$ \pi(a \mid s, \boldsymbol{\theta})=\operatorname{Pr}\left\{A_t=a \mid S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\right\} $$
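As a concrete illustration (not part of the lecture itself), one common parameterization is a softmax over linear action preferences $h(s,a,\boldsymbol{\theta})=\boldsymbol{\theta}^\top\mathbf{x}(s,a)$. Below is a minimal NumPy sketch under that assumption; the names `softmax_policy`, `theta`, and `x` are illustrative, not from the notes.

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities pi(a | s, theta) for a linear softmax
    parameterization: preferences h(s, a, theta) = theta^T x(s, a),
    where x[a] is the feature vector of the (s, a) pair."""
    preferences = x @ theta                   # h(s, a, theta) for each action a
    preferences -= preferences.max()          # subtract max for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()

# Example: 3 actions, 4-dimensional features per (state, action) pair
rng = np.random.default_rng(0)
theta = rng.normal(size=4)                    # policy parameter vector
x = rng.normal(size=(3, 4))                   # feature matrix for the current state s
probs = softmax_policy(theta, x)              # a valid probability distribution over actions
action = rng.choice(len(probs), p=probs)      # sample A_t ~ pi(. | S_t, theta)
```

Any differentiable parameterization works here; the softmax form is convenient because it guarantees a valid distribution and has a simple gradient.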
If the performance measure is $J(\boldsymbol{\theta})$, the method approximates the gradient of $J(\boldsymbol{\theta})$ and performs the gradient-ascent update
$$ \boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\alpha \widehat{\nabla J\left(\boldsymbol{\theta}_t\right)} $$
where $\alpha$ is a step size and $\widehat{\nabla J\left(\boldsymbol{\theta}_t\right)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to $\boldsymbol{\theta}_t$.
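One classical choice of stochastic estimate is the Monte Carlo policy gradient (REINFORCE), which uses the return $G_t$ times $\nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$. The sketch below assumes the linear softmax policy from the earlier example and an episode given as a list of `(features, action, reward)` tuples; `grad_log_softmax`, `reinforce_update`, `alpha`, and `gamma` are illustrative names, not definitions from the lecture.

```python
import numpy as np

def grad_log_softmax(theta, x, a):
    """Gradient of log pi(a | s, theta) for a linear softmax policy:
    nabla log pi = x(s, a) - sum_b pi(b | s, theta) x(s, b)."""
    prefs = x @ theta
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return x[a] - probs @ x

def reinforce_update(theta, episode, alpha, gamma=1.0):
    """One gradient-ascent pass over an episode (REINFORCE).
    `episode` is a list of (features, action, reward) tuples; the return G_t
    plays the role of the stochastic gradient estimate for each time step."""
    returns = []
    G = 0.0
    for x, a, r in reversed(episode):         # accumulate returns backwards in time
        G = r + gamma * G
        returns.append((x, a, G))
    for t, (x, a, G) in enumerate(reversed(returns)):   # forward pass over time steps
        theta = theta + alpha * (gamma ** t) * G * grad_log_softmax(theta, x, a)
    return theta

# Usage sketch (with hypothetical episode data):
# theta = reinforce_update(theta, episode, alpha=0.01)
```

Each term $\gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$ is an unbiased sample of the policy gradient under this formulation, so the updates implement stochastic gradient ascent on $J(\boldsymbol{\theta})$.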