Notes based on the lecture series by Shusen Wang.

## Probability Theory

### Random Variable

unknown; its values depend on outcomes of random events

Uppercase letter $X$ for a random variable.

Lowercase letter $x$ for an observed value.

e.g. $x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1$

### Probability Density Function (PDF)

A PDF gives the relative likelihood that the value of a continuous random variable falls near a given sample.

Example: the Gaussian distribution is a continuous distribution with PDF

$p(x) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$

### Probability Mass Function (PMF)

PMF is a function that gives the probability that a discrete random variable is exactly equal to some value.

Example: for the discrete random variable $X \in \{1, 3, 7\}$, a possible PMF is

\begin{aligned} p(1) = 0.2 \\ p(3) = 0.5 \\ p(7) = 0.3 \\ \end{aligned}

### Properties of PDF/PMF

Random variable $X$ is in the domain $\chi$

For continuous distributions, $$\int_\chi p(x) dx = 1$$

For discrete distributions, $$\sum_{x\in\chi} p(x) = 1$$
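As a numerical sanity check of the continuous case, the sketch below (pure Python; the grid bounds and step size are arbitrary choices) integrates the Gaussian PDF from the example above:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the Gaussian distribution N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Riemann sum of p(x) over [-10, 10]; the tails outside this range are negligible
dx = 1e-3
total = sum(gaussian_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(total)  # close to 1, as the normalization property requires
```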

### Expectation

Random variable $X$ is in the domain $\chi$

For continuous distributions, the expectation of $f(X)$ is:

$E[f(X)] = \int_\chi p(x) \cdot f(x) dx$

For discrete distributions, the expectation of $f(X)$ is:

$E[f(X)] = \sum_{x\in\chi} p(x) \cdot f(x)$

### Random Sampling

e.g. There are 10 balls in a bin: 2 red, 5 green, and 3 blue. We draw one at random. What color is it? Construct a random sample: draw a red ball w.p. 0.2, a green ball w.p. 0.5, and a blue ball w.p. 0.3.
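This sampling procedure can be sketched in Python with the standard library (the seed and sample size are arbitrary choices):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed for reproducibility

colors = ["red", "green", "blue"]
probs = [0.2, 0.5, 0.3]  # must sum to 1

# Draw 10,000 balls; the empirical frequencies approach the probabilities
draws = random.choices(colors, weights=probs, k=10_000)
freq = {c: n / 10_000 for c, n in Counter(draws).items()}
print(freq)
```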

## Terminologies

### state

state $s$

e.g. a frame in Super Mario

A state contains information such as Mario's position, the enemies' positions, …

### action

action $a$

e.g. in SuperMario, Mario can do $a \in \{\text{left, right, up}\}$

Actions are what the agent can do to affect the environment.

### policy

policy $\pi$

Policy function $\pi: (s,a) \to [0,1]$, defined by $\pi(a|s) = \mathbb{P}(A = a|S = s)$.

$\pi(a|s)$ is the probability of taking action $A = a$ given state $S = s$

e.g. $\pi(\text{left} | s) = 0.2, \pi(\text{right} | s) = 0.1, \pi(\text{up} | s) = 0.7$

Upon observing state $S = s$, the agent's action $A$ can be random. This randomness makes the agent's next action unpredictable, which is an advantage in games like rock-paper-scissors.

### reward

reward $R$

The agent receives a reward after taking an action that changes the state.

e.g. Collect a coin: $R=+1$; Win the game: $R = + 10000$;
Touch a Goomba: $R=-10000$ (game over); Nothing happens: $R=0$

### state transition

Taking an action in the current state moves the agent to a new state, e.g., the "up" action leads to a new state.

State transition can be random. But the randomness is from the environment instead of the action or old state.

$p(s'|s,a) = P(S'=s'|S=s,A=a)$

e.g. when Mario chooses the "up" action, the Goomba turns back w.p. 0.8, and keeps going forward w.p. 0.2.

## Two Sources of Randomness

### Randomness in Actions

Given state $s$, the action can be random, e.g.

\begin{aligned} &\pi(\text{left} | s) = 0.2, \\ &\pi(\text{right} | s) = 0.1, \\ &\pi(\text{up} | s) = 0.7 \end{aligned}

### Randomness in States

State transition can be random. The environment generates the new state $S'$ by

$S' \sim p(\cdot | s, a)$

### Two Sources of Randomness

The randomness in action is from the policy function

$A \sim \pi(\cdot | s)$

The randomness in state is from the state-transition function

$S' \sim p(\cdot | s, a)$

### Agent-Environment Interaction

Observe state $s_t$, select action $a_t \sim \pi(\cdot | s_t)$, and execute $a_t$.

The environment then gives the new state $s_{t+1}$ and reward $r_t$. Repeating this interaction yields a (state, action, reward) trajectory:

$s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n$

One episode runs from the beginning to the end (Mario wins or dies).

What is a good policy?

A good policy leads to a large cumulative reward: $\sum_{t=1}^n \gamma^{t-1} \cdot r_t$

The rewards are used to guide the learning of the policy, as introduced in the following sections.

## Rewards and Returns

### Return

Definition: Return (aka cumulative future reward) $U_t$

$U_t = R_{t} + R_{t + 1} + R_{t + 2} + \cdots$

Question: At time $t$, are $R_t$ and $R_{t + 1}$ equally important?

We think future reward is less valuable than present reward. So $R_{t + 1}$ should be given less weight than $R_t$.

Definition: Discounted return (aka cumulative discounted future reward).

$\gamma$: discount factor (tuning hyper-parameter).

$U_t = R_{t} + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \cdots$

### Discounted Returns

Definition: Discounted return (at time $t$).

$U_t = R_{t} + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \cdots + \gamma^{n-t} R_n$
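A small Python sketch (made-up rewards and discount factor) that computes the discounted return for every $t$ by accumulating backward from the end of the episode:

```python
def discounted_returns(rewards, gamma):
    """u_t = r_t + gamma * r_{t+1} + ... + gamma^(n-t) * r_n, for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the last reward
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

u = discounted_returns([1.0, 0.0, -1.0, 10.0], gamma=0.9)
print(u)  # u[0] = 1 + 0.9*0 + 0.81*(-1) + 0.729*10 = 7.48
```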

### Randomness in Returns

Suppose at the end of the game, we observe $u_t$.

We observe all the rewards, $r_t, r_{t+1}, r_{t+2}, \cdots, r_n$

We thereby know the discounted return:

$u_t = r_{t} + \gamma r_{t + 1} + \gamma^2 r_{t + 2} + \cdots + \gamma^{n-t} r_n$

At time $t$, the rewards $R_t, \cdots, R_n$ are random, so the return $U_t$ is random.

• Reward $R_i$ depends on $S_i$ and $A_i$

• States can be random: $S_i \sim p(\cdot | s_{i-1}, a_{i - 1})$

• Actions can be random: $A_i \sim \pi(\cdot| s_i)$

• If either $S_i$ or $A_i$ is random, then $R_i$ is random.

• $U_t$ depends on $R_t, R_{t+1}, \cdots, R_n$

• So $U_t$ also depends on $S_t, A_t, S_{t+1}, A_{t+1}, \cdots, S_n, A_n$

### Action-Value Function

Definition: Action-value function $Q_\pi(s_t,a_t)$

$Q_\pi(s_t,a_t) = \mathbb{E}_{S_{t+1},A_{t+1},\cdots,S_{n}, A_{n}} [U_t | S_t = s_t, A_t = a_t]$

where $S_{t+1}, \cdots, S_n$ and $A_{t+1}, \cdots, A_n$ are random variables, while $s_t, a_t$ are observed values,

because $S_{t+1}\sim p(\cdot|s_t,a_t), \cdots, S_{n}\sim p(\cdot|s_{n-1},a_{n-1})$, and $A_{t+1}\sim \pi(\cdot|s_{t+1}), \cdots, A_{n}\sim \pi(\cdot|s_n)$.

So $Q_\pi(s_t,a_t)$ depends on $s_t, a_t, \pi$, and $p$, and is independent of $S_{t+1}, \cdots, S_n$ and $A_{t+1}, \cdots, A_n$ (they are integrated out by the expectation).

To erase the effect of $\pi$, we take the maximum over policies, which gives the optimal action-value function.

Definition: Optimal action-value function

$Q^{*}(s_t,a_t) = \max_{\pi} Q_{\pi} (s_t, a_t)$

• Whatever policy function $\pi$ is used, the result of taking $a_t$ at state $s_t$ cannot be better than $Q^*(s_t,a_t)$

### State-Value Function

Definition: State-value function $V_\pi (s_t)$

$V_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a)$

where $A \sim \pi(\cdot | s_t)$

given actions are discrete.

$V_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \int \pi(a|s_t) \cdot Q_\pi(s_t, a) da$

given actions are continuous.
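For the discrete case, the weighted sum can be computed directly (the probabilities and Q-values below are made-up illustration numbers):

```python
# pi(a|s) for a fixed state s, and made-up action values Q_pi(s, a)
pi = {"left": 0.2, "right": 0.1, "up": 0.7}
q = {"left": 1.0, "right": -2.0, "up": 3.0}

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
v = sum(pi[a] * q[a] for a in pi)
print(v)  # 0.2*1.0 + 0.1*(-2.0) + 0.7*3.0 = 2.1
```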

### Understanding the Value Functions

Action-value function: $Q_\pi(s_t,a_t) = \mathbb{E} [U_t | S_t = s_t, A_t = a_t]$

Given policy $\pi$, $Q_\pi(s,a)$ evaluates how good it is for an agent to pick action $a$ while in state $s$.

State-value function: $V_\pi (s) = \mathbb{E}_A [Q_\pi (s, A)]$

For fixed policy $\pi$, $V_\pi (s)$ evaluates how good the situation is in state $s$.

$\mathbb{E}_S [V_\pi (S)]$ evaluates how good the policy $\pi$ is.

## Evaluating Reinforcement Learning

OpenAI Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. https://gym.openai.com/

• Classical control problems
• Atari Games

Play the CartPole game:

Get the CartPole environment from Gym.

"env" provides states and rewards.

Complete Code:
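A minimal interaction loop is sketched below, assuming the `gym` package is installed; the exact return shapes of `reset()` and `step()` changed across Gym versions, so the sketch handles both the classic 4-tuple and the newer 5-tuple `step` API.

```python
import gym

env = gym.make("CartPole-v1")
out = env.reset()
state = out[0] if isinstance(out, tuple) else out  # newer versions return (obs, info)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a random policy, just to drive the loop
    result = env.step(action)
    if len(result) == 5:                # newer API: (obs, reward, terminated, truncated, info)
        state, reward, terminated, truncated, _ = result
        done = terminated or truncated
    else:                               # classic API: (obs, reward, done, info)
        state, reward, done, _ = result
    total_reward += reward
env.close()
print(total_reward)  # episode return of the random policy
```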

## Value-Based Reinforcement Learning

### Approximate the Q Function

Goal: Win the game (i.e., maximize the total reward).

Question: If we know $Q^*(s,a)$, what is the best action?

• Obviously, the best action is $a^* = \underset{a}{\operatorname{argmax}} ~ Q^*(s,a)$. That is our policy.

Challenge: We do not know $Q^*(s,a)$.

Solution: Deep Q Network (DQN).

• Use neural network $Q(s,a;\mathbf{w})$ to approximate $Q^*(s,a)$.

### Deep Q Network (DQN)

$Q(s,a;\mathbf{w})$ is a neural network parameterized by $\mathbf{w}$

Input: the observed state $s$.

Output: scores for all the actions $a \in \mathcal{A}$.

Question: Based on the predictions, what should be the action?

Simply choose the best action $a^* = \underset{a}{\operatorname{argmax}} ~ Q(s,a;\mathbf{w})$ at each decision stage.

### Temporal Difference (TD) Learning

The driving-to-Atlanta example from the lecture video is a good illustration.

I want to drive from NYC to Atlanta. Model $Q(\mathbf{w})$ estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

Make a prediction: $q = Q(\mathbf w)$, e.g., $q = 1000$. Finish the trip and get the target $y$, e.g., $y=860$.

Then define the Loss function: $L = \dfrac{1}{2}(q-y)^2$.

Calculate the gradient: $\dfrac{\partial L}{\partial \mathbf{w}} = \dfrac{\partial q}{\partial \mathbf{w}} \cdot \dfrac{\partial L}{\partial q} = (q - y) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$.

Do the gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_{t} - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}|_{\mathbf{w} = \mathbf{w}_t}$

Can I update the model before finishing the trip? Can I get a better $\mathbf{w}$ as soon as I arrive at DC, which lies between NYC and Atlanta?

Suppose:

• Model’s estimate: NYC to Atlanta, 1000 minutes (estimate).
• I arrive at DC; actual time cost: NYC to DC, 300 minutes (actual).
• Model now updates its estimate: DC to Atlanta, 600 minutes (estimate).

Model’s estimate: $Q(w) = \color{red}{1000}$ minutes. Updated estimate: $\color{blue}300\color{black} + \color{red}{600}\color{black} = \color{purple}{900}$ minutes.

• the number 900 is called TD target.

• TD target 900 is a more reliable estimate than 1000.

• Loss: $L = \dfrac{1}{2}(Q(\mathbf{w}) - \color{purple}{y}\color{black})^2$, where $(Q(\mathbf{w}) - \color{purple}{y}\color{black})$ is called TD error.

• Gradient: $\dfrac{\partial L}{\partial \mathbf{w}} = (\color{red}{1000} \color{black}- \color{purple}{900} \color{black}) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$

• Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}|_{\mathbf{w} = \mathbf{w}_t}$

Model’s estimate: NYC to DC is 400 minutes, but the ground truth is 300. The TD error is: $\color{purple} \delta \color{black} = \color{red} 400 \color{black} - \color{blue} 300 \color{black} = 100$.

### How to apply TD learning to DQN?

In the “driving time” example, we have the equation:

\begin{aligned} \color{red} T_{NYC \rightarrow ATL} \color{black} &\approx \color{blue} T_{NYC \rightarrow DC} \color{black} + \color{red} T_{DC \rightarrow ATL} \\ \color{red} \text{Model's estimate} \color{black} &\approx \color{blue} \text{Ground Truth} \color{black} + \color{red} \text{Model's estimate} \end{aligned}

In deep reinforcement learning:

$\color{red} Q(s_t, a_t; \mathbf{w}) \color{black} \approx \color{blue} r_t \color{black} + \color{red} \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})$

This relies on the following identity:

\begin{aligned} U_t &= R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots \\ &= R_t + (\gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots) \\ &= R_t + \gamma \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \gamma^3 \cdot R_{t+4} + \cdots) \\ &= R_t + \gamma \cdot U_{t + 1} \\ \end{aligned}

identity: $U_t = R_t + \gamma \cdot U_{t + 1}$

TD learning for DQN:

• DQN’s output, $Q(s_t, a_t; \mathbf{w})$, is an estimate of $U_t$.

• DQN’s output, $Q(s_{t + 1}, a_{t + 1}; \mathbf{w})$, is an estimate of $U_{t + 1}$.

Thus, $Q(s_t, a_t; \mathbf{w}) \approx \mathbb{E} [R_t + \gamma \cdot Q(S_{t + 1}, A_{t + 1}; \mathbf{w})]$, i.e., $\color{purple} \text{estimate of } U_t \color{black}\approx \mathbb{E} [R_t + \gamma \cdot \color{purple}\text{estimate of } U_{t + 1}\color{black}]$

Given observed values, we get $Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma \cdot Q(s_{t + 1}, a_{t + 1}; \mathbf{w})$, i.e., $\text{Prediction} \approx \text{TD target}$.

For the prediction $Q(s_t, a_t; \mathbf{w})$, set the TD target: $y_t = r_t + \gamma \cdot Q(s_{t+1},a_{t+1};\mathbf{w}_t) = r_t + \gamma \cdot \underset{a}{\max}~ Q(s_{t+1},a;\mathbf{w}_t)$, where the second equality holds because DQN takes the greedy action $a_{t+1} = \underset{a}{\operatorname{argmax}}~ Q(s_{t+1},a;\mathbf{w}_t)$.

Then we set the loss: $L_t = \dfrac{1}{2}[Q(s_t,a_t;\mathbf{w}) - y_t]^2$.

Finally, do gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L_t}{\partial \mathbf{w}} |_{\mathbf{w = w}_t}$.

In summary, one iteration of TD learning:

1. Observe state $S_t = s_t$ and perform action $A_t = a_t$.

2. Predict the value: $q_t = Q(s_t, a_t;\mathbf{w}_t)$

3. Differentiate the value network: $\mathbf{d}_t = \dfrac{\partial Q(s_t, a_t;\mathbf{w}_t)}{\partial \mathbf{w}}|_{\mathbf{w=w}_t}$

4. Environment provides new state $s_{t+1}$ and reward $r_t$

5. Compute TD target: $y_t = r_t + \gamma \cdot \underset{a}{\max}Q(s_{t+1}, a;\mathbf{w}_t)$

6. Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t) \cdot \mathbf{d}_t$.
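The six steps above can be sketched with a toy linear value network $Q(s,a;\mathbf{w}) = \mathbf{w}_a^\top s$ in numpy (the shapes, the fake transition, and the learning rate are all made up; a real DQN would use a deep net and a real environment):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.01

w = rng.normal(size=(n_actions, n_states))  # one weight row per action

def Q(s, w):
    return w @ s  # vector of Q(s, a; w) over all actions a

# 1. Observe s_t and take the greedy action a_t
s_t = rng.normal(size=n_states)
a_t = int(np.argmax(Q(s_t, w)))
# 2. Predict the value
q_t = Q(s_t, w)[a_t]
# 3. Differentiate: for a linear net, dQ(s_t, a_t; w)/dw is s_t on row a_t
d_t = np.zeros_like(w)
d_t[a_t] = s_t
# 4. Environment provides reward r_t and new state s_{t+1} (faked here)
r_t = 1.0
s_next = rng.normal(size=n_states)
# 5. TD target
y_t = r_t + gamma * np.max(Q(s_next, w))
# 6. Gradient descent on the TD error
w = w - alpha * (q_t - y_t) * d_t

q_after = Q(s_t, w)[a_t]
print(q_t, y_t, q_after)  # the prediction moves toward the TD target
```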

## Policy-Based Reinforcement Learning

### Policy Function Approximation

Policy function $\pi(a|s)$ is a probability density function (PDF). It takes state $s$ as input and outputs the probabilities for all the actions, e.g.,

\begin{aligned} \pi(\text{left} | s) = 0.2 \\ \pi(\text{right} | s) = 0.1 \\ \pi(\text{up} | s) = 0.7 \end{aligned}

Randomly sample an action $a$ from this distribution.

Can we directly learn a policy function $\pi(a|s)$?

If there are only a few states and actions, then yes: draw a table (matrix) and learn the entries. But if there are too many (or infinitely many) states or actions, this approach is infeasible. Let's try another way: the policy network.

### Policy Network

Policy network: Use a neural net to approximate $\pi(a|s)$

• Use policy network $\pi(a|s;\theta)$ to approximate $\pi(a|s)$

• $\theta$: trainable parameters of the neural net.

Some properties of the policy network:

• $\sum_{a \in \mathcal{A}} \pi (a|s;\theta) = 1$

• Here, $\mathcal{A} =$ {“left”, “right”, “up”} is the set of all actions.

• That is why we use softmax activation.

Let's revisit the earlier definitions in order to derive a way to train the policy network.

### State-Value Function Approximation

Definition: Discounted return.

$U_t = R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots$

• The return depends on actions $A_t, A_{t+1}, A_{t+2}, \cdots$ and states $S_t, S_{t+1}, S_{t+2}, \cdots$

• Actions are random, given $\mathbb{P}[A=a | S=s] = \pi(a|s)$ (Policy function)

• States are random, given $\mathbb{P}[S' = s' | S=s, A=a] = p(s'|s,a)$ (state-transition function)

Definition: Action-value function.

$Q_\pi(s_t,a_t) = \mathbb{E} [U_t | S_t = s_t, A_t = a_t]$

• The expectation is taken w.r.t. actions $A_{t+1}, \cdots$ and states $S_{t+1}, \cdots$

Definition: State-value function.

$V_\pi (s_t) = \mathbb{E}_A [Q_\pi (s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a)$

• Integrate out action $A \sim \pi(\cdot|s_t)$

How do we approximate the state-value function?

• Approximate policy function $\pi(a|s_t)$ by policy network $\pi(a|s_t;\theta)$

• Approximate value function $V_{\pi}(s_t)$ by:

$V_\pi (s_t;\theta) = \sum_a \pi(a|s_t;\theta) \cdot Q_\pi(s_t, a)$

This is the approximate state-value function for state $s_t$.

Policy-based learning: Learn $\theta$ that maximizes $J(\theta) = \mathbb{E}_S [V(S; \theta)]$

How to improve $\theta$ ? Policy gradient ascent!

• Observe state $s$.

• Update policy by: $\theta \leftarrow \theta + \beta \cdot \dfrac{\partial V(s;\theta)}{\partial \theta}$

Definition: Approximate state-value function.

$V_\pi (s;\theta) = \sum_a \pi(a|s;\theta) \cdot Q_\pi(s, a)$

Policy gradient: Derivative of $V_\pi (s_t;\theta)$ w.r.t. $\theta$.

\begin{aligned} \dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta} &= \dfrac{\partial \sum_a \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\ &= \sum_a \dfrac{\partial \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\ &= \sum_a \dfrac{\partial \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial Q_\pi(s, a)}{\partial \theta}\\ &= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \mathbb{E}_{A\sim \pi(\cdot|s;\theta)}\Big[\dfrac{\partial Q_\pi(s, A)}{\partial \theta}\Big] \\ &= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + x \\ &= \mathbb{E}_{A\sim \pi(\cdot|s;\theta)} \Big[ \dfrac{\partial \log \pi(A|s;\theta)}{\partial \theta} \cdot Q_\pi(s, A)\Big] + x \\ \end{aligned}

Here $x = \mathbb{E}_{A\sim \pi(\cdot|s;\theta)}\big[\frac{\partial Q_\pi(s, A)}{\partial \theta}\big]$ abbreviates the extra term that arises because $Q_\pi$ itself depends on $\theta$; in the simplified derivation used in practice, $Q_\pi$ is treated as independent of $\theta$, so $x$ is dropped.

Given the equation $J(\theta) = \mathbb{E}_S [V_{\pi}(S;\theta)]$, we can get

\begin{aligned} \dfrac{\partial J(\theta)}{\partial \theta} &= \mathbb{E}_S \Big[ \dfrac{\partial V_{\pi}(S;\theta)}{\partial \theta} \Big]\\&= \mathbb{E}_S \Big[ \mathbb{E}_{A\sim \pi(\cdot|S;\theta)} \big[\dfrac{\partial \log \pi(A|S;\theta)}{\partial \theta} \cdot Q_\pi(S, A)\big] \Big] + \mathbb{E}_S[x] \end{aligned}

For the Policy Gradient, we can do:

1. Randomly sample an action $\hat a$ according to $\pi(\cdot | s;\theta)$

2. Calculate $\mathbf{g}(\hat a, \theta) = \dfrac{\partial \log \pi(\hat a|s; \theta)}{\partial \theta} \cdot Q_\pi(s,\hat a)$

3. Use $\mathbf{g}(\hat a, \theta)$ as an approximation to the policy gradient $\dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta}$

Here we do not need to compute the expectation $\mathbb{E}_{A\sim \pi(\cdot|S;\theta)} \big[\dfrac{\partial \log \pi(A|S;\theta)}{\partial \theta} \cdot Q_\pi(S, A)\big]$ exactly. Instead, we use the Monte Carlo method: compute $\mathbf{g}(\hat a, \theta) = \dfrac{\partial \log \pi(\hat a|s; \theta)}{\partial \theta} \cdot Q_\pi(s,\hat a)$ from a single sample $\hat a$, and use it as an approximation to the policy gradient.

### Update policy network using policy gradient

1. Observe the state $s_t$.

2. Randomly sample action $a_t$ according to $\pi(\cdot|s_t;\theta_t)$.

3. Compute $q_t \approx Q_\pi(s_t, a_t)$ (some estimate).

4. Differentiate policy network: $\mathbf{d}_{\theta, t} = \dfrac{\partial \log \pi(a_t|s_t, \theta)}{\partial \theta} |_{\theta=\theta_t}$

5. (Approximate) policy gradient: $\mathbf{g}(a_t, \theta_t) = q_t \cdot \mathbf{d}_{\theta, t}$.

6. Update policy network: $\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a_t, \theta_t)$.
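A numpy sketch of these six steps for a linear-softmax policy $\pi(a|s;\theta)=\mathrm{softmax}(\theta s)_a$ (the state, the shapes, and the stand-in value $q_t$ are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, beta = 4, 3, 0.1

theta = rng.normal(size=(n_actions, n_states))

def policy(s, theta):
    """pi(.|s; theta) via a softmax over linear scores."""
    logits = theta @ s
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

# 1-2. Observe s_t and sample a_t ~ pi(.|s_t; theta)
s_t = rng.normal(size=n_states)
p = policy(s_t, theta)
a_t = int(rng.choice(n_actions, p=p))
# 3. Stand-in estimate q_t ~ Q_pi(s_t, a_t); positive means "good action"
q_t = 2.0
# 4. d log pi(a_t|s_t; theta)/dtheta for linear-softmax:
#    row a gets (1{a == a_t} - pi(a|s_t)) * s_t
d = -np.outer(p, s_t)
d[a_t] += s_t
# 5-6. Gradient ascent on the approximate policy gradient
theta = theta + beta * q_t * d

p_new = policy(s_t, theta)
print(p[a_t], p_new[a_t])  # with q_t > 0, the sampled action becomes more probable
```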

A remaining question about step 3: how do we obtain the estimate $q_t \approx Q_\pi(s_t, a_t)$?

Option 1: REINFORCE

Play the game to the end and generate the trajectory:

$s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T.$

Compute the discounted return $u_t = \sum_{k=t}^T \gamma^{k-t} r_k$, for all $t$.

Since $Q_\pi(s_t,a_t) = \mathbb{E} [U_t]$, we can use $u_t$ to approximate $Q_{\pi}(s_t, a_t)$. Then we can use $q_t = u_t$ in the above algorithm.
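As a sketch (made-up episode rewards), the REINFORCE targets $u_t$ can be computed with one backward pass over the trajectory's rewards:

```python
gamma = 0.9
rewards = [0.0, 1.0, 0.0, 0.0, 10.0]  # made-up rewards r_1 .. r_T of one episode

# u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed via u_t = r_t + gamma * u_{t+1}
q = [0.0] * len(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    q[t] = running
print(q)  # each u_t serves as q_t in step 3 of the algorithm above
```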

Option 2: Approximate $Q_\pi$ using a neural network.

This leads to the actor-critic method.

## Actor-Critic Methods

### Value Network and Policy Network

Definition: State-value function.

$V(s;\theta,\mathbf{w}) = \sum_a \pi(a|s) \cdot Q_{\pi}(s,a) \approx \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})$

Policy network (actor):

Use neural net $\pi(a|s;\theta)$ to approximate $\pi(a|s)$

$\theta$ is the trainable parameters of the neural net.

Value network (critic):

Use neural net $Q_{\pi}(s,a;\mathbf{w})$ to approximate $Q_{\pi}(s,a)$

$\mathbf{w}$ is the trainable parameters of the neural net.

### Policy Network (Actor)

$\pi(a|s;\theta)$

• Input: state $s$, e.g., a screenshot of Super Mario.

• Output: probability distribution over the actions.

• Let $\mathcal{A}$ be the set of all actions, e.g., $\mathcal{A}=$ {“left”, “right”, “up”}.

• $\sum_{a\in\mathcal{A}}\pi(a|s;\theta)=1$. (That is why we use softmax activation.)

### Value Network (Critic)

$Q_{\pi}(s,a;\mathbf{w})$

• Input: state $s$

• Output: action-values of all the actions.

### Train the Neural Networks

Definition: State-value function approximated using neural networks.

$V = \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})$

Training: Update the parameters $\theta$ and $\mathbf{w}$.

Update the policy network $\pi(a|s;\theta)$ to increase the state-value $V(s;\theta, \mathbf{w})$:

• Supervision is purely from the value network (critic).

Update the value network $q(s,a;\mathbf{w})$ to better estimate the return:

• The critic’s judgement becomes more accurate.

• Supervision is purely from the rewards.

Steps:

1. Observe the state $s_t$.

2. Randomly sample action $a_t$ according to $\pi(\cdot|s_t;\theta_t)$.

3. Perform $a_t$ and observe new state $s_{t+1}$ and reward $r_t$.

4. Update $\mathbf{w}$ (in the value network) using temporal difference (TD).

5. Update $\theta$ (in policy network) using policy gradient.

#### Update value network using TD

1. Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1};\mathbf{w}_t)$.

2. TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1};\mathbf{w}_t)$.

3. Compute the Loss: $L(\mathbf{w}) = \dfrac{1}{2} [q(s_t, a_t; \mathbf{w}_t) - y_t]^2$

4. Do gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L(\mathbf{w})}{\partial \mathbf{w}} |_{\mathbf{w=w}_t}$

#### Update policy network using policy gradient

Definition: State-value function approximated using neural networks.

$V(s;\theta,\mathbf{w}) = \sum_a \pi(a|s;\theta) \cdot q(s,a;\mathbf{w})$

Policy gradient: Derivative of $V(s;\theta,\mathbf{w})$ w.r.t. $\theta$.

Set $\mathbf{g}(a,\theta) = \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot q(s, a;\mathbf{w})$. Then $\dfrac{\partial V(s;\theta,\mathbf{w})}{\partial \theta} = \mathbb{E}_A[\mathbf{g}(A,\theta)]$.

By the Monte Carlo method, once we observe a state $s$ and sample one action, we use that single sample to approximate the expectation and update the parameters directly.

Algorithm: Update policy network using stochastic policy gradient.

Randomly sample $a \sim \pi(\cdot|s_t; \theta_t)$. (Thus $\mathbf{g}(a,\theta)$ is an unbiased estimate.)

Stochastic gradient ascent: $\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a,\theta_t)$

### Summary of Algorithm

1. Observe state $s_t$ and randomly sample $a_t \sim \pi(\cdot|s_t;\theta_t)$.

2. Perform $a_t$; then environment gives new state $s_{t+1}$ and reward $r_t$.

3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot|s_{t+1};\theta_t)$. (Do not perform $\tilde{a}_{t+1}$)

4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.

5. Compute TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.

6. Differentiate value network: $\mathbf{d}_{\mathbf{w},t} = \dfrac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$