From dalao: shusen Wang

Probability Theory

Random Variable

unknown; its values depend on outcomes of random events

Uppercase letter XX for a random variable.

Lowercase letter xx for an observed value.

e.g. x1=1,x2=1,x3=0,x4=1x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1

Probability Density Function, PDF

PDF provides a relative likelihood that the value of the random variable would equal that sample.

Example: Gaussian distribution is a continuous distribution, its pdf representation

p(x)=12πσ2exp((xμ)22σ2)p(x) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \text{exp} (-\dfrac{(x - \mu)^2}{2\sigma^2})

Probability Mass Function (PMF)

PMF is a function that gives the probability that a discrete random variable is exactly equal to some value.

Example: Discrete random variable: X{1,3,7}X \in \{1, 3, 7\}, its pdf representation

p(1)=0.2p(3)=0.5p(7)=0.3\begin{aligned} p(1) = 0.2 \\ p(3) = 0.5 \\ p(7) = 0.3 \\ \end{aligned}

Properties of PDF/PMF

Random variable XX is in the domain χ\chi

For continuous distributions, $$\int_\chi p(x) dx = 1$$

For discrete distributions, $$\sum_{x\in\chi} p(x) = 1$$


Random variable XX is in the domain χ\chi

For continuous distributions, the expectation of f(X)f(X) is:

E[f(x)]=χp(x)f(x)dxE[f(x)] = \int_\chi p(x) \cdot f(x) dx

For discrete distributions, the expectation of f(X)f(X) is:

E[f(X)]=xχp(x)f(x)E[f(X)] = \sum_{x\in\chi} p(x) \cdot f(x)

Random Sampling

e.g. There are 10 balls in the bin: 2 are red, 5 are green, and 3 are blue. We take one from the box. What is the color of it?

Construct a random sampling: Sample red ball w.p. 0.2, green ball w.p. 0.5, and blue ball w.p. 0.3.

from numpy.random import choice

samples = choice(['R', 'G', 'B'], size = 100, p = [0.2, 0.5, 0.3])



state ss

e.g. the frame in SuperMario

state contain some info like role position, enemy position, …


action aa

e.g. in SuperMario, Mario can do a{left, right, up}a \in \{\text{left, right, up}\}

What the ego agent can do to affect the environment?


policy π\pi

Policy function π:(s,a)[0,1]\pi: (s,a) \to [0,1], where π(as)=P(A=aS=s)\pi(a|s) = \mathbb{P}(A = a|S = s) is defination.

π(as)\pi(a|s) is the probability of taking action A=aA = a given state S=sS = s

e.g. π(lefts)=0.2,π(rights)=0.1,π(ups)=0.7\pi(\text{left} | s) = 0.2, \pi(\text{right} | s) = 0.1, \pi(\text{up} | s) = 0.7

Upon observing state S=sS = s, the agent’s action AA can be random, which bring the advantage that our next action can’t be deterministic, especially in the game like rock-paper-scissors games.


reward RR

You can get reward after you do some action to change the state.

e.g. Collect a coin: R=+1R=+1; Win the game: R=+10000R = + 10000;
Touch a Goomba: R=10000R=-10000(game over); Nothing happens: R=0R=0

state transition

when old state make an action, it will turn to the new state

e.g. “up” action leads to a new state.

State transition can be random. But the randomness is from the environment instead of the action or old state.

p(ss,a)=P(S=sS=s,A=a)p(s'|s,a) = P(S'=s'|S=s,A=a)

e.g. Goomba have w.p. 0.8 back instead of w.p. 0.2 continuing to go when Mario chooses the “up” action

Two Sources of Randomness

Randomness in Actions

Given state ss, the action can be random, e.g.

π(lefts)=0.2,π(rights)=0.1,π(ups)=0.7\begin{aligned} &\pi(\text{left} | s) = 0.2, \\ &\pi(\text{right} | s) = 0.1, \\ &\pi(\text{up} | s) = 0.7 \end{aligned}

Randomness in States

State transition can be random. The environment generates the new state SS' by

Sp(s,a)S' \sim p(\cdot | s, a)

Two Sources of Randomness

The randomness in action is from the policy function

Aπ(s)A \sim \pi(\cdot | s)

The randomness in state is from the state-transition function

Sp(s,a)S' \sim p(\cdot | s, a)

Agent-Environment Interaction

Observe state sts_t, select action atπ(st)a_t \sim \pi(\cdot | s_t), and execute ata_t

The environment gives new state st+1s_{t+1} and reward rtr_t

From the pic above, we can get a (state, action, reward) trajectory:

s1,a1,r1,s2,a2,r2,,sn,an,rns_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n

One episode is from the the beginning to the end (Mario wins or dies).

What is a good policy?

A good policy leads to big cumulative reward: t=1nγt1rt\sum_{t=1}^n \gamma^{t-1} \cdot r_t

Use the rewards to guide the learning of policy, which I will introduce in the following.

Rewards and Returns


Definition: Return (aka cumulative future reward) UtU_t

Ut=Rt+Rt+1+Rt+2+U_t = R_{t} + R_{t + 1} + R_{t + 2} + \cdots

Question: At time tt, are RtR_t and Rt+1R_{t + 1} equally important?

We think future reward is less valuable than present reward. So Rt+1R_{t + 1} should be given less weight than RtR_t.

Definition: Discounted return (aka cumulative discounted future reward).

γ\gamma: discount factor (tuning hyper-parameter).

Ut=Rt+γRt+1+γ2Rt+2+U_t = R_{t} + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \cdots

Discounted Returns

Definition: Discounted return (at time tt).

Ut=Rt+γRt+1+γ2Rt+2++γntRnU_t = R_{t} + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \cdots + \gamma^{n-t} R_n

Randomness in Returns

Suppose at the end of the game, we observe utu_t.

We observe all the rewards, rt,rt+1,rt+2,,rnr_t, r_{t+1}, r_{t+2}, \cdots, r_n

We thereby know the discounted return:

ut=rt+γrt+1+γ2rt+2++γntrnu_t = r_{t} + \gamma r_{t + 1} + \gamma^2 r_{t + 2} + \cdots + \gamma^{n-t} r_n

At time tt, the rewards, Rt,,RnR_t, ⋯ , R_n, are random, so the return UtU_t is random.

  • Reward RiR_i depends on SiS_i and AiA_i

  • States can be random: Sip(si1,ai1)S_i \sim p(\cdot | s_{i-1}, a_{i - 1})

  • Actions can be random: Aiπ(si)A_i \sim \pi(\cdot| s_i)

  • If either SiS_i or AiA_i is random, then RiR_i is random.

  • UtU_t depends on Rt,Rt+1,,RnR_t, R_{t+1}, \cdots, R_n

  • So UtU_t also depends on St,At,St+1,At+1,,Sn,AnS_t, A_t, S_{t+1}, A_{t+1}, \cdots, S_n, A_n

Action-Value Function

Definition: Action-value function Qπ(st,at)Q_\pi(s_t,a_t)

Qπ(st,at)=ESt+1,At+1,,Sn,An[UtSt=st,At=at]Q_\pi(s_t,a_t) = \mathbb{E}_{S_{t+1},A_{t+1},\cdots,S_{n}, A_{n}} [U_t | S_t = s_t, A_t = a_t]

where St+1,,SnS_{t+1}, \cdots, S_n and At+1,,AnA_{t+1}, \cdots, A_n as random variables and st,ats_t, a_t are observed values

because St+1p(st,at),Snp(sn1,an1)S_{t+1}\sim p(\cdot|s_t,a_t), \cdots S_{n}\sim p(\cdot|s_{n-1},a_{n-1}), also At+1π(st+1),Anp(sn)A_{t+1}\sim \pi(\cdot|s_{t+1}), \cdots A_{n}\sim p(\cdot|s_n).

So Qπ(st,at)Q_\pi(s_t,a_t) depends on st,at,π,ps_t, a_t, \pi, p and Qπ(st,at)Q_\pi(s_t,a_t) is dependent of St+1,,SnS_{t+1}, \cdots, S_n and At+1,,AnA_{t+1}, \cdots, A_n.

Then we want to erase the effect of π\pi, then we get Optimal action-value function.

Definition: Optimal action-value function

Q(st,at)=maxπQπ(st,at)Q^{*}(s_t,a_t) = \max_{\pi} Q_{\pi} (s_t, a_t)

  • Whatever policy function π\pi is used, the result of taking ata_t at state sts_t cannot be better than Q(st,at)Q^*(s_t,a_t)

State-Value Function

Definition: State-value function Vπ(st)V_\pi (s_t)

Vπ(st)=EA[Qπ(st,A)]=aπ(ast)Qπ(st,a)V_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a)

where Aπ(st)A \sim \pi(\cdot | s_t)

given actions are discrete.

Vπ(st)=EA[Qπ(st,A)]=π(ast)Qpi(st,a)daV_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \int \pi(a|s_t) \cdot Q_pi(s_t, a) da

given actions are continuous.

Understanding the Value Functions

Action-value function: Qπ(s,a)=E[UtSt=st,At=at]Q_\pi(s,a) = \mathbb{E} [U_t | S_t = s_t, A_t = a_t]

Given policy π\pi, Qπ(s,a)Q_\pi(s,a)evaluates how good it is for an agent to pick action aa while being in state ss.

State-value function: Vπ(s)=EA[Qπ(s,A)]V_\pi (s) = \mathbb{E}_A [Q_\pi (s, A)]

For fixed policy π\pi, Vπ(s)V_\pi (s) evaluates how good the situation is in state ss.

ES[Vπ(S)]\mathbb{E}_S [V_\pi (S)] evaluates how good the policy π\pi is.

Evaluating Reinforcement Learning

OpenAI Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms.

Task classification:

  • Classical control problems
  • Atari Games
  • MuJoCo (Continuous control tasks.)

Play CartPole Game

import gym
env = gym.make('CartPole-v0')

Get the environment of CartPole from Gym.

“env” provides states and reward.

Complete Code:

import gym
env = gym.make('CartPole-v0')
state = env.reset()

for t in range(100):
env.render() ## a window pops up rendering CartPole

action = env.action_space.sample() ## A random action
state, reward, done, info = env.step(action)

if done: ## done=1 means finished (win or lose the game)

Value-Based Reinforcement Learning

Approximate the Q Function

Goal: Win the game(~ maximize the total reward).

Question: If we know Q(s,a)Q^*(s,a), what is the best action?

  • Obviously, the best action is a=argmaxa Q(s,a)a^* = \underset{a}{\operatorname{argmax}} ~ Q^*(s,a). That is our policy.

Challenge: We do not know Q(s,a)Q^*(s,a).

Solution: Deep Q Network(DQN).

  • Use neural network Q(s,a;w)Q(s,a;\mathbf{w}) to approximate Q(s,a)Q^*(s,a).

Deep Q Network(DQN)

Q(s,a;w)Q(s,a;\mathbf{w}) is a neural network parameterized by w\mathbf{w}

input shape: observed state ss.

output shape: scores for all the action αA\alpha \in \Alpha.

Question: Based on the predictions, what should be the action?

just choose the best action aa^* satisfying a=argmaxa Q(s,a)a^* = \underset{a}{\operatorname{argmax}} ~ Q^*(s,a) in each decision stage.

Temporal Difference(TD) Learning

example of driving to the Atlanta showed in the video is a good explaination

I want to drive from NYC to Atlanta. Model Q(w)Q(\mathbf{w}) estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

Make a prediction: q=Q(w)q = Q(\mathbf w), e.g., q=1000q = 1000. Finish the trip and get the target yy, e.g., y=860y=860.

Then define the Loss function: L=12(qy)2L = \dfrac{1}{2}(q-y)^2.

Calculate the gradient: Lw=qwLq=(qy)Q(w)w\dfrac{\partial L}{\partial \mathbf{w}} = \dfrac{\partial q}{\partial \mathbf{w}} \cdot \dfrac{\partial L}{\partial q} = (q - y) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}.

Do the gradient descent: wt+1=wtαLww=wt\mathbf{w}_{t+1} = \mathbf{w}_{t} - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}|_{\mathbf{w} = \mathbf{w}_t}

Can I update the model before finishing the trip? Can I get a better W as soon as I arrived at DC bewteen NYC and Atlanta?


  • Model’s estimate: NYC to Atlanta 1000 minutes(estimate).
  • I arrive at DC; actual time cost: NYC to DC 300 minutes(actual).
  • Model now updates its estimate: DC to Atlanta 600 minutes(estimate).

Model’s estimate: Q(w)=1000Q(w) = \color{red}{1000} minutes. Updated estimate: 300+600=900\color{blue}300\color{black} + \color{red}{600}\color{black} = \color{purple}{900} minutes.

  • the number 900 is called TD target.

  • TD target 900 is a more reliable estimate than 1000.

  • Loss: L=12(Q(w)y)2L = \dfrac{1}{2}(Q(\mathbf{w}) - \color{purple}{y}\color{black})^2, where (Q(w)y)(Q(\mathbf{w}) - \color{purple}{y}\color{black}) is called TD error.

  • Gradient: Lw=(1000900)Q(w)w\dfrac{\partial L}{\partial \mathbf{w}} = (\color{red}{1000} \color{black}- \color{purple}{900} \color{black}) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}

  • Gradient descent: wt+1=wtαLww=wt\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}|_{\mathbf{w} = \mathbf{w}_t}

Model’s estimates: NYC to DC is 400 minutes. But the Gound truth is 300. Get the TD error: δ=400300=100\color{purple} \delta \color{black} = \color{red} 400 \color{black} - \color{blue} 300 \color{black} = 100.

How to apply TD learning to DQN?

In the “driving time” example, we have the equation:

TNYCATLTNYCDC+TDCATLModel’s estimateGround Truth+Model’s estimate\begin{aligned} \color{red} T_{NYC \rightarrow ATL} \color{black} &\approx \color{blue} T_{NYC \rightarrow DC} \color{black} + \color{red} T_{DC \rightarrow ATL} \\ \color{red} \text{Model's estimate} \color{black} &\approx \color{blue} \text{Ground Truth} \color{black} + \color{red} \text{Model's estimate} \end{aligned}

In deep reinforcement learning:

Q(st,at;w)rt+γQ(st+1,at+1;w)\color{red} Q(s_t, a_t; \mathbf{w}) \color{black} \approx \color{blue} r_t \color{black} + \color{red} \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})

Given the proof:

Ut=Rt+γRt+1+γ2Rt+2+γ3Rt+3+γ4Rt+4+=Rt+(γRt+1+γ2Rt+2+γ3Rt+3+γ4Rt+4+)=Rt+γ(Rt+1+γRt+2+γ2Rt+3+γ3Rt+4+)=Rt+γUt+1\begin{aligned} U_t &= R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots \\ &= R_t + (\gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots) \\ &= R_t + \gamma \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \gamma^3 \cdot R_{t+4} + \cdots) \\ &= R_t + \gamma \cdot U_{t + 1} \\ \end{aligned}

identity: Ut=Rt+γUt+1U_t = R_t + \gamma \cdot U_{t + 1}

TD learning for DQN:

  • DQN’s output, Q(st,at;w)Q(s_t, a_t; \mathbf{w}), is an estimate of UtU_t.

  • DQN’s output, Q(st+1,at+1;w)Q(s_{t + 1}, a_{t + 1}; \mathbf{w}), is an estimate of Ut+1U_{t + 1}.

Thus, Q(st,at;w)E[Rt+γQ(St+1,At+1;w)]Q(s_t, a_t; \mathbf{w}) \approx \mathbb{E} [R_t + \gamma \cdot Q(S_{t + 1}, A_{t + 1}; \mathbf{w})], which is equal to estimate of UtE[Rt+γestimate of Ut+1]\color{purple} \text{estimate of } U_t \color{black}\approx \mathbb{E} [R_t + \gamma \cdot \color{purple}\text{estimate of } U_{t + 1}\color{black}]

Given the value, we get Q(st,at;w)rt+γQ(st+1,at+1;w)Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma \cdot Q(s_{t + 1}, a_{t + 1}; \mathbf{w}), whick is equal to PredictionTD target\text{Prediction} \approx \text{TD target}

for Prediction: Q(st,at;w)Q(s_t, a_t; \mathbf{w}), we set TD target: yt=rt+γQ(st+1,at+1;wt)=rt+γmaxaQ(st+1,a;wt)y_t = r_t + \gamma \cdot Q(s_{t+1},a_{t+1};\mathbf{w}_t) = r_t + \gamma \cdot \underset{a}{\max} Q(s_{t+1},a;\mathbf{w}_t)

then we can set loss: Lt=12[Q(st,at;w)yt]2L_t = \dfrac{1}{2}[Q(s_t,a_t;\mathbf{w}) - y_t]^2.

finally do the Gradient descent: wt+1=wtαLtww=wt\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L_t}{\partial \mathbf{w}} |_{\mathbf{w = w}_t}.

In summary: Algorithm for One iteration of TD learning.

  1. Observe state St=stS_t = s_t and perform action At=atA_t = a_t.

  2. Predict the value: qt=Q(st,at;wt)q_t = Q(s_t, a_t;\mathbf{w}_t)

  3. Differentiate the value network: dt=Q(st,at;wt)ww=wt\mathbf{d}_t = \dfrac{\partial Q(s_t, a_t;\mathbf{w}_t)}{\partial \mathbf{w}}|_{\mathbf{w=w}_t}

  4. Environment provides new state st+1s_{t+1} and reward rtr_t

  5. Compute TD target: yt=rt+γmaxaQ(st+1,a;wt)y_t = r_t + \gamma \cdot \underset{a}{\max}Q(s_{t+1}, a;\mathbf{w}_t)

  6. Gradient descent: wt+1=wtα(qtyt)dt\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t) \cdot \mathbf{d}_t.

Policy-Based Reinforcement Learning

Policy Function Approximation

Policy function π(as)\pi(a|s) is a probability density function (PDF). It takes state ss as input. It output the probabilities for all the actions, e.g.,

π(lefts)=0.2π(rights)=0.1π(ups)=0.7\begin{aligned} \pi(\text{left} | s) = 0.2 \\ \pi(\text{right} | s) = 0.1 \\ \pi(\text{up} | s) = 0.7 \end{aligned}

Randomly sample action aa random drawn from the distribution.

Can we directly learn a policy function π(as)\pi(a|s)?

If there are only a few states and actions, then yes, we can. Draw a table (matrix) and learn the entries.

What if there are too many (or infinite) states or actions? It’s impossible to finish the task by this way. Let’s try another way involving he Policy Network.

Policy Network

Policy network: Use a neural net to approximate π(as)\pi(a|s)

  • Use policy network π(as;θ)\pi(a|s;\theta) to approximate π(as)\pi(a|s)

  • θ\theta: trainable parameters of the neural net.

Some properties of the policy network:

  • aAπ(as;θ)=1\sum_{a \in \Alpha} \pi (a|s;\theta) = 1

  • Here, A=\Alpha = {“left”, “right”, “up”} is the set of all actions.

  • That is why we use softmax activation.

Let’s turn back to see the former knowledge for the purpose of exploiting the way to train the Policy Network.

State-Value Function Approximation

Definition: Discounted return.

Ut=Rt+γRt+1+γ2Rt+2+γ3Rt+3+γ4Rt+4+U_t = R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots

  • The return depends on actions At,At+1,At+2,A_t, A_{t+1}, A_{t+2}, \cdots and states St,St+1,St+2,S_t, S_{t+1}, S_{t+2}, \cdots

  • Actions are random, given P[A=aS=s]=π(as)\mathbb{P}[A=a | S=s] = \pi(a|s) (Policy function)

  • States are random, given P[S=sS=s,A=a]=π(ss,a)\mathbb{P}[S' = s' | S=s, A=a] = \pi(s'|s,a) (State function)

Definition: Action-value function.

Qπ(st,at)=E[UtSt=st,At=at]Q_\pi(s_t,a_t) = \mathbb{E} [U_t | S_t = s_t, A_t = a_t]

  • The expectation is taken w.r.t. actions At+1,A_{t+1}, \cdots and states St+1,S_{t+1}, \cdots

Definition: State-value function.

Vπ(st)=EA[Qπ(st,A)]=aπ(ast)Qπ(st,a)V_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a)

  • Integrate out action Aπ(st)A \sim \pi(\cdot|s_t)

How to approximate state-value function ?

  • Approximate policy function π(ast)\pi(a|s_t) by policy network π(ast;θ)\pi(a|s_t;\theta)

  • Approximate value function Vπ(st)V_{\pi}(s_t) by:

Vπ(st;θ)=aπ(ast;θ)Qπ(st,a)V_\pi (s_t;\theta) = \sum_a \pi(a|s_t;\theta) \cdot Q_\pi(s_t, a)

Here is the Approximate state-value function w.r.t. sts_t.

Policy-based learning: Learn θ\theta that maximizes J(θ)=ES[V(S;θ)]J(\theta) = \mathbb{E}_S [V(S; \theta)]

How to improve θ\theta ? Policy gradient ascent!

  • Observe state ss.

  • Update policy by: θθ+βV(s;θ)θ\theta \leftarrow \theta + \beta \cdot \dfrac{\partial V(s;\theta)}{\partial \theta}

Policy Gradient

Definition: Approximate state-value function.

Vπ(s;θ)=aπ(as;θ)Qπ(s,a)V_\pi (s;\theta) = \sum_a \pi(a|s;\theta) \cdot Q_\pi(s, a)

Policy gradient: Derivative of Vπ(st;θ)V_\pi (s_t;\theta) w.r.t. θ\theta.

Vπ(s;θ)θ=aπ(as;θ)Qπ(s,a)θ=aπ(as;θ)Qπ(s,a)θ=aπ(as;θ)θQπ(s,a)+aQπ(s,a)θπ(as;θ)=aπ(as;θ)logπ(as;θ)θQπ(s,a)+EAπ(s;θ)[Qπ(s,A)θ]=aπ(as;θ)logπ(as;θ)θQπ(s,a)+x=EAπ(s;θ)[logπ(As;θ)θQπ(s,A)]+x\begin{aligned} \dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta} &= \dfrac{\partial \sum_a \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\ &= \sum_a \dfrac{\partial \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\ &= \sum_a \dfrac{\partial \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \sum_a \dfrac{\partial Q_\pi(s, a)}{\partial \theta} \cdot \pi(a|s;\theta)\\ &= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \mathbb{E}_{A\sim \pi(\cdot|s;\theta)}\Big[\dfrac{\partial Q_\pi(s, A)}{\partial \theta}\Big] \\ &= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + x \\ &= \mathbb{E}_{A\sim \pi(\cdot|s;\theta)} \Big[ \dfrac{\partial \log \pi(A|s;\theta)}{\partial \theta} \cdot Q_\pi(s, A)\Big] + x \\ \end{aligned}

Given the equation J(θ)=ES[Vπ(S;θ)]J(\theta) = \mathbb{E}_S [V_{\pi}(S;\theta)], we can get

J(θ)θ=ES[Vπ(S;θ)θ]=ES[EAπ(S;θ)[logπ(AS;θ)θQπ(S,A)]]+ES[x]\begin{aligned} \dfrac{\partial J(\theta)}{\partial \theta} &= \mathbb{E}_S \Big[ \dfrac{\partial V_{\pi}(S;\theta)}{\partial \theta} \Big]\\&= \mathbb{E}_S \Big[ \mathbb{E}_{A\sim \pi(\cdot|S;\theta)} \big[\dfrac{\partial \log \pi(A|S;\theta)}{\partial \theta} \cdot Q_\pi(S, A)\big] \Big] + \mathbb{E}_S[x] \end{aligned}

For the Policy Gradient, we can do:

  1. Randomly sample an action a^\hat a according to π(s;θ)\pi(\cdot | s;\theta)

  2. Calculate g(a^,θ)=logπ(a^θ)θQπ(s,a^)\mathbf{g}(\hat a, \theta) = \dfrac{\partial \log \pi(\hat a| \theta)}{\partial \theta} \cdot Q_\pi(s,\hat a)

  3. Use g(a^,θ)\mathbf{g}(\hat a, \theta) as an approximation to the policy gradient Vπ(s;θ)θ\dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta}

Here we do not to compute the EAπ(S;θ)[logπ(AS;θ)θQπ(S,A)]\mathbb{E}_{A\sim \pi(\cdot|S;\theta)} \big[\dfrac{\partial \log \pi(A|S;\theta)}{\partial \theta} \cdot Q_\pi(S, A)\big]. Instead, we use Monte Carlo method, just computing one sampling g(a^,θ)=logπ(a^θ)θQπ(s,a^)\mathbf{g}(\hat a, \theta) = \dfrac{\partial \log \pi(\hat a| \theta)}{\partial \theta} \cdot Q_\pi(s,\hat a) to update the gradient for approximating.

Update policy network using policy gradient

  1. Observe the state sts_t.

  2. Randomly sample action ata_t according to π(st;θt)\pi(\cdot|s_t;\theta_t).

  3. Compute qtQπ(st,at)q_t \approx Q_\pi(s_t, a_t) (some estimate).

  4. Differentiate policy network: dθ,t=logπ(atst,θ)θθ=θt\mathbf{d}_{\theta, t} = \dfrac{\partial \log \pi(a_t|s_t, \theta)}{\partial \theta} |_{\theta=\theta_t}

  5. (Approximate) policy gradient: g(at,θt)=qtdθ,t\mathbf{g}(a_t, \theta_t) = q_t \cdot \mathbf{d}_{\theta, t}.

  6. Update policy network: θt+1=θt+βg(at,θt)\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a_t, \theta_t).

Another question about the operation 3, how to construct the Q function?


Play the game to the end and generate the trajectory:

s1,a1,r1,s2,a2,r2,,sT,aT,rT.s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T.

Compute the discounted return ut=k=tTγktrku_t = \sum_{k=t}^T \gamma^{k-t} r_k, for all tt.

Since Qπ(st,at)=E[Ut]Q_\pi(s_t,a_t) = \mathbb{E} [U_t], we can use utu_t to approximate Qπ(st,at)Q_{\pi}(s_t, a_t). Then we can use qt=utq_t = u_t in the above algorithm.

Option 2: Approximate QπQ_\pi using a neural network.

This leads to the actor-critic method.

Actor-Critic Methods

Value Network and Policy Network

Definition: State-value function.

V(s;θ,w)=aπ(as)Qπ(s,a)aπ(as;θ)q(s,a;w)V(s;\theta,\mathbf{w}) = \sum_a \pi(a|s) \cdot Q_{\pi}(s,a) \approx \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})

Policy network(actor)

Use neural net π(as;θ)\pi(a|s;\theta) to approximate π(as)\pi(a|s)

θ\theta is the trainable parameters of the neural net.

Value network(critic)

Use neural net Qπ(s,a;w)Q_{\pi}(s,a;\mathbf{w}) to approximate Qπ(s,a)Q_{\pi}(s,a)

w\mathbf{w} is the trainable parameters of the neural net.

Policy Network(Actor)


  • Input: stats ss, e.g., a screenshot of Super Mario.

  • Output: probability distribution over the actions.

  • Let A\Alpha be the set all actions, e.g., A=\Alpha= {“left”, “right”, “up”}.

  • aAπ(as;θ)=1\sum_{a\in\Alpha}\pi(a|s;\theta)=1. (That is why we use softmax activation.)

Value Network(Critic)


  • Input: stats ss

  • Outpt: action-values of all the actions.

Train the Neural Networks

Definition: State-value function approximated using neural networks.

V=aπ(as;θ)q(s,a;w)V = \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})

Training: Update the parameters θ\theta and w\mathbf{w}.

Update policy network π(as;θ)\pi(a|s;\theta) to increase the state-value V(s;θ,w)V(s;\theta, \mathbf{w})

  • Actor gradually performs better.

  • Supervision is purely from the value network (critic).

Update value network q(s,a;w)q(s,a;\mathbf{w}) to better estimate the return.

  • Critic’s judgement becomes more accurate.

  • Supervision is purely from the rewards.


  1. Observe the state sts_t.

  2. Randomly sample action ata_t according to π(st;θt)\pi(\cdot|s_t;\theta_t).

  3. Perform ata_t and observe new state st+1s_{t+1} and reward rtr_t.

  4. Updata w\mathbf{w} (in value network) using temporal difference(TD).

  5. Update θ\theta (in policy network) using policy gradient.

Update value network using TD

  1. Comupte q(st,at;wt)q(s_t, a_t; \mathbf{w}_t) and q(st+1,at+1;wt)q(s_{t+1}, a_{t+1};\mathbf{w}_t).

  2. TD target: yt=rt+γq(st+1,at+1;wt)y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1};\mathbf{w}_t).

  3. Compute the Loss: L(w)=12[q(st,at;wt)yt]2L(\mathbf{w}) = \dfrac{1}{2} [q(s_t, a_t; \mathbf{w}_t) - y_t]^2

  4. Do gradient descent: wt+1=wtαL(w)ww=wt\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L(\mathbf{w})}{\partial \mathbf{w}} |_{\mathbf{w=w}_t}

Update policy network using policy network

Definition: State-value function approximated using neural networks.

V(s;θ,w)=aπ(as;θ)q(s,a;w)V(s;\theta,\mathbf{w}) = \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})

Policy gradient: Derivative of V(s;θ,w)V(s;\theta,\mathbf{w}) w.r.t. θ\theta.

Set g(a,θ)=logπ(as,θ)θq(st,a;w)\mathbf{g}(a,\theta) = \dfrac{\partial \log \pi(a|s,\theta)}{\partial \theta} \cdot q(s_t, a;\mathbf{w}). Then we can get V(s;θ,w)θ=EA[g(A,θ)]\dfrac{\partial V(s;\theta,\mathbf{w})}{\partial \theta} = \mathbb{E}_A[\mathbf{g}(A,\theta)].

By Monte Carlo method, once we get a state ss, we use it to update the gradient directly, to approximate the expectation.

Algorithm: Update policy network using stochastic policy gradient.

Random sampling aπ(st;θt).a \sim \pi(\cdot|s_t; \theta_t). (Thus g(a,θ)\mathbf{g}(a,\theta) is unbiased.)

Stochastic gradient ascent: θt+1=θt+βg(a,θt)\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a,\theta_t)

Summary of Algorithm

  1. Observe state sts_t and randomly sample atπ(st;θt)a_t \sim \pi(\cdot|s_t;\theta_t).

  2. Perform ata_t; then environment gives new state st+1s_{t+1} and reward rtr_t.

  3. Randomly sample a~t+1π(st+1;θt)\tilde{a}_{t+1} \sim \pi(\cdot|s_{t+1};\theta_t). (Do not perform a~t+1\tilde{a}_{t+1})

  4. Evalueate value network: qt=q(st,at;wt)q_t = q(s_t, a_t; \mathbf{w}_t) and qt+1=q(st+1,at+1;wt)q_{t+1} = q(s_{t+1}, a_{t+1}; \mathbf{w}_t).

  5. Compute TD error: δt=qt(rt+γqt+1)\delta_t = q_t - (r_t + \gamma \cdot q_{t+1}).

  6. Differentiate value network: dw,t=q(st,at;w)w\mathbf{d}_{\mathbf{w},t} = \dfrac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}