Based on lectures by Shusen Wang.
Probability Theory
Random Variable
A random variable is a quantity whose value is unknown before observation; its value depends on the outcome of a random event.
Uppercase letter $X$ denotes a random variable.
Lowercase letter $x$ denotes an observed value.
e.g. $x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1$
Probability Density Function (PDF)
The PDF gives the relative likelihood that the value of a continuous random variable would be close to a given sample.
Example: The Gaussian distribution is a continuous distribution; its PDF is
$$p(x) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp\Big(-\dfrac{(x - \mu)^2}{2\sigma^2}\Big)$$
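As a quick numerical check (a minimal sketch that is not in the original notes; the function name is mine), the PDF above can be evaluated directly:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # p(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(gaussian_pdf(0.0))  # ~0.3989 for the standard normal (mu=0, sigma=1)
```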
Probability Mass Function (PMF)
PMF is a function that gives the probability that a discrete random variable is exactly equal to some value.
Example: Discrete random variable $X \in \{1, 3, 7\}$; its PMF is
$$\begin{aligned}
p(1) &= 0.2 \\
p(3) &= 0.5 \\
p(7) &= 0.3
\end{aligned}$$
Properties of PDF/PMF
Random variable $X$ takes values in the domain $\chi$.
For continuous distributions, $$\int_\chi p(x) dx = 1$$
For discrete distributions, $$\sum_{x\in\chi} p(x) = 1$$
Expectation
Random variable $X$ takes values in the domain $\chi$.
For continuous distributions, the expectation of $f(X)$ is:
$$\mathbb{E}[f(X)] = \int_\chi p(x) \cdot f(x)\, dx$$
For discrete distributions, the expectation of $f(X)$ is:
$$\mathbb{E}[f(X)] = \sum_{x\in\chi} p(x) \cdot f(x)$$
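A minimal sketch (my own illustration, not from the notes) of the discrete case, using the PMF example above with an arbitrary $f(x) = x^2$:

```python
# E[f(X)] = sum_x p(x) * f(x) for the discrete PMF p(1)=0.2, p(3)=0.5, p(7)=0.3
pmf = {1: 0.2, 3: 0.5, 7: 0.3}
f = lambda x: x ** 2                       # f is arbitrary; x^2 is just for illustration
expectation = sum(p * f(x) for x, p in pmf.items())
print(expectation)                         # 0.2*1 + 0.5*9 + 0.3*49 = 19.4
```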
Random Sampling
e.g. There are 10 balls in a box: 2 are red, 5 are green, and 3 are blue. We draw one at random. What is its color?
Construct a random sample: draw a red ball w.p. 0.2, a green ball w.p. 0.5, and a blue ball w.p. 0.3.
```python
from numpy.random import choice

samples = choice(['R', 'G', 'B'], size=100, p=[0.2, 0.5, 0.3])
print(samples)
```
Terminologies
state
state $s$
e.g. a frame in Super Mario
The state contains information such as Mario's position, the enemies' positions, etc.
action
action $a$
e.g. in Super Mario, Mario can take $a \in \{\text{left}, \text{right}, \text{up}\}$
An action is what the agent can do to affect the environment.
policy
policy $\pi$
Policy function $\pi: (s, a) \to [0, 1]$ is defined by $\pi(a|s) = \mathbb{P}(A = a \mid S = s)$.
$\pi(a|s)$ is the probability of taking action $A = a$ given state $S = s$.
e.g. $\pi(\text{left} \mid s) = 0.2$, $\pi(\text{right} \mid s) = 0.1$, $\pi(\text{up} \mid s) = 0.7$
Upon observing state $S = s$, the agent's action $A$ can be random. This keeps the agent unpredictable, which is an advantage in games like rock-paper-scissors where a deterministic policy can be exploited.
reward
reward $R$
The agent receives a reward after taking an action that changes the state.
e.g. Collect a coin: $R = +1$; win the game: $R = +10000$;
touch a Goomba: $R = -10000$ (game over); nothing happens: $R = 0$.
state transition
When the agent takes an action in the old state, the environment transitions to a new state.
e.g. “up” action leads to a new state.
State transitions can be random; given the old state and the action, the randomness comes from the environment.
$$p(s' \mid s, a) = \mathbb{P}(S' = s' \mid S = s, A = a)$$
e.g. when Mario chooses the "up" action, the Goomba turns back w.p. 0.8 and keeps going forward w.p. 0.2.
Two Sources of Randomness
Randomness in Actions
Given state $s$, the action can be random, e.g.
$$\begin{aligned}
&\pi(\text{left} \mid s) = 0.2, \\
&\pi(\text{right} \mid s) = 0.1, \\
&\pi(\text{up} \mid s) = 0.7
\end{aligned}$$
Randomness in States
State transitions can be random. The environment generates the new state $S'$ by
$$S' \sim p(\cdot \mid s, a)$$
Two Sources of Randomness
The randomness in the action is from the policy function:
$$A \sim \pi(\cdot \mid s)$$
The randomness in the state is from the state-transition function:
$$S' \sim p(\cdot \mid s, a)$$
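A minimal sketch of these two sampling steps (the transition table below is made up for illustration; only the policy numbers come from the example above):

```python
import numpy as np

actions = ['left', 'right', 'up']
pi = [0.2, 0.1, 0.7]                                  # pi(a | s), as in the example above

# Randomness in actions: A ~ pi(. | s)
a = np.random.choice(actions, p=pi)

# Randomness in states: S' ~ p(. | s, a); hypothetical transition probabilities
next_states = ['goomba_turns_back', 'goomba_keeps_going']
p_transition = {'up': [0.8, 0.2], 'left': [0.5, 0.5], 'right': [0.5, 0.5]}
s_next = np.random.choice(next_states, p=p_transition[a])
print(a, s_next)
```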
Agent-Environment Interaction
Observe state $s_t$, select action $a_t \sim \pi(\cdot \mid s_t)$, and execute $a_t$.
The environment gives the new state $s_{t+1}$ and reward $r_t$.
This interaction produces a (state, action, reward) trajectory:
$$s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n$$
One episode runs from the beginning to the end of the game (Mario wins or dies).
What is a good policy?
A good policy leads to a large cumulative reward: $\sum_{t=1}^n \gamma^{t-1} \cdot r_t$
The rewards are used to guide the learning of the policy, as introduced in the following sections.
Rewards and Returns
Return
Definition: Return (aka cumulative future reward) $U_t$
$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$
Question: At time $t$, are $R_t$ and $R_{t+1}$ equally important?
A future reward is less valuable than a present reward, so $R_{t+1}$ should be given less weight than $R_t$.
Definition: Discounted return (aka cumulative discounted future reward), with discount factor $\gamma$ (a tuning hyper-parameter):
$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
Discounted Returns
Definition: Discounted return (at time $t$).
$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{n-t} R_n$$
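A small sketch (my own, not from the notes) for computing the discounted return at every time step from an observed reward sequence, using a single backward pass:

```python
# Compute u_t = r_t + gamma*r_{t+1} + ... + gamma^(n-t)*r_n for every t.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # u_t = r_t + gamma * u_{t+1}
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # [8.29, 8.1, 9.0, 10.0]
```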
Randomness in Returns
Suppose at the end of the game, we observe $u_t$.
We observe all the rewards $r_t, r_{t+1}, r_{t+2}, \cdots, r_n$.
We thereby know the discounted return:
$$u_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$$
At time $t$, the rewards $R_t, \cdots, R_n$ are random, so the return $U_t$ is random.
Reward $R_i$ depends on $S_i$ and $A_i$.
States can be random: $S_i \sim p(\cdot \mid s_{i-1}, a_{i-1})$
Actions can be random: $A_i \sim \pi(\cdot \mid s_i)$
If either $S_i$ or $A_i$ is random, then $R_i$ is random.
$U_t$ depends on $R_t, R_{t+1}, \cdots, R_n$.
So $U_t$ also depends on $S_t, A_t, S_{t+1}, A_{t+1}, \cdots, S_n, A_n$.
Action-Value Function
Definition: Action-value function $Q_\pi(s_t, a_t)$
$$Q_\pi(s_t, a_t) = \mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_n, A_n} [U_t \mid S_t = s_t, A_t = a_t]$$
where $S_{t+1}, \cdots, S_n$ and $A_{t+1}, \cdots, A_n$ are random variables, and $s_t, a_t$ are observed values,
because $S_{t+1} \sim p(\cdot \mid s_t, a_t), \cdots, S_n \sim p(\cdot \mid s_{n-1}, a_{n-1})$, and likewise $A_{t+1} \sim \pi(\cdot \mid s_{t+1}), \cdots, A_n \sim \pi(\cdot \mid s_n)$.
So $Q_\pi(s_t, a_t)$ depends on $s_t$, $a_t$, $\pi$, and $p$, but is independent of $S_{t+1}, \cdots, S_n$ and $A_{t+1}, \cdots, A_n$, since they are integrated out by the expectation.
To remove the dependence on $\pi$, we take the maximum over all policies, which gives the optimal action-value function.
Definition: Optimal action-value function
$$Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)$$
Whatever policy function $\pi$ is used, the result of taking $a_t$ at state $s_t$ cannot be better than $Q^*(s_t, a_t)$.
State-Value Function
Definition: State-value function $V_\pi(s_t)$
$$V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$
where $A \sim \pi(\cdot \mid s_t)$, given that the actions are discrete.
$$V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \int \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\, da$$
given that the actions are continuous.
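For the discrete case, the sum is easy to compute explicitly; a toy sketch with made-up numbers (not from the notes):

```python
import numpy as np

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for a discrete action set; numbers are illustrative.
pi_given_s = np.array([0.2, 0.1, 0.7])     # pi(left|s), pi(right|s), pi(up|s)
q_given_s = np.array([1.0, -0.5, 2.0])     # Q_pi(s, left), Q_pi(s, right), Q_pi(s, up)
v = float(np.sum(pi_given_s * q_given_s))
print(v)                                    # 0.2*1.0 + 0.1*(-0.5) + 0.7*2.0 = 1.55
```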
Understanding the Value Functions
Action-value function: $Q_\pi(s, a) = \mathbb{E}[U_t \mid S_t = s, A_t = a]$
Given policy $\pi$, $Q_\pi(s, a)$ evaluates how good it is for the agent to pick action $a$ while being in state $s$.
State-value function: $V_\pi(s) = \mathbb{E}_A[Q_\pi(s, A)]$
For a fixed policy $\pi$, $V_\pi(s)$ evaluates how good the situation is in state $s$.
$\mathbb{E}_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.
Evaluating Reinforcement Learning
OpenAI Gym
Gym is a toolkit for developing and comparing reinforcement learning algorithms. https://gym.openai.com/
Task classification:
Classical control problems
Atari Games
MuJoCo (continuous control tasks)
Play CartPole Game
```python
import gym

env = gym.make('CartPole-v0')
```
Get the environment of CartPole from Gym.
“env” provides states and reward.
Complete Code:
```python
import gym

env = gym.make('CartPole-v0')
state = env.reset()
for t in range(100):
    env.render()
    print(state)
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    if done:
        print('Finished')
        break
env.close()
```
Value-Based Reinforcement Learning
Approximate the Q Function
Goal: Win the game (i.e., maximize the total reward).
Question: If we know $Q^*(s, a)$, what is the best action?
Obviously, the best action is $a^* = \underset{a}{\operatorname{argmax}}\, Q^*(s, a)$. That is our policy.
Challenge: We do not know $Q^*(s, a)$.
Solution: Deep Q Network (DQN).
Use a neural network $Q(s, a; \mathbf{w})$ to approximate $Q^*(s, a)$.
Deep Q Network (DQN)
$Q(s, a; \mathbf{w})$ is a neural network parameterized by $\mathbf{w}$.
Input: the observed state $s$.
Output: a score for each action $a \in \mathcal{A}$.
Question: Based on the predictions, what should be the action?
Just choose the best action $a^* = \underset{a}{\operatorname{argmax}}\, Q(s, a; \mathbf{w})$ at each decision step.
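A minimal sketch of this greedy rule (the "network" here is just a random linear model standing in for a trained DQN; sizes and names are mine):

```python
import numpy as np

actions = ['left', 'right', 'up']
state_dim = 4

# Stand-in "Q-network": a linear model Q(s, a; W) = W[a] . s with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(len(actions), state_dim))

def greedy_action(state, W):
    q_values = W @ state                 # one score per action
    return int(np.argmax(q_values))      # a* = argmax_a Q(s, a; w)

s = rng.normal(size=state_dim)
print(actions[greedy_action(s, W)])
```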
Temporal Difference (TD) Learning
The driving-to-Atlanta example shown in the video is a good explanation.
I want to drive from NYC to Atlanta. The model $Q(\mathbf{w})$ estimates the time cost, e.g., 1000 minutes.
Question: How do I update the model?
Make a prediction: $q = Q(\mathbf{w})$, e.g., $q = 1000$. Finish the trip and observe the target $y$, e.g., $y = 860$.
Then define the loss function: $L = \dfrac{1}{2}(q - y)^2$.
Calculate the gradient: $\dfrac{\partial L}{\partial \mathbf{w}} = \dfrac{\partial q}{\partial \mathbf{w}} \cdot \dfrac{\partial L}{\partial q} = (q - y) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$.
Do gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$
Can I update the model before finishing the trip? Can I get a better $\mathbf{w}$ as soon as I arrive at DC, between NYC and Atlanta?
Suppose:
Model's estimate: NYC to Atlanta, 1000 minutes (estimate).
I arrive at DC; actual time cost: NYC to DC, 300 minutes (actual).
Model's updated estimate for the rest of the trip: DC to Atlanta, 600 minutes (estimate).
Model's estimate: $Q(\mathbf{w}) = 1000$ minutes. Updated estimate: $300 + 600 = 900$ minutes.
The number 900 is called the TD target.
The TD target 900 is a more reliable estimate than 1000, because part of it (the 300 minutes) was actually observed.
Loss: $L = \dfrac{1}{2}(Q(\mathbf{w}) - y)^2$, where $Q(\mathbf{w}) - y$ is called the TD error.
Gradient: $\dfrac{\partial L}{\partial \mathbf{w}} = (1000 - 900) \cdot \dfrac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$
Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$
For example, if the model's estimate for NYC to DC is 400 minutes but the ground truth is 300, the TD error is $\delta = 400 - 300 = 100$.
How to apply TD learning to DQN?
In the “driving time” example, we have the equation:
$$\begin{aligned}
T_{\text{NYC} \rightarrow \text{ATL}} &\approx T_{\text{NYC} \rightarrow \text{DC}} + T_{\text{DC} \rightarrow \text{ATL}} \\
\text{Model's estimate} &\approx \text{Ground truth} + \text{Model's estimate}
\end{aligned}$$
In deep reinforcement learning:
$$Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})$$
This rests on the following identity, derived from the definition of the discounted return:
$$\begin{aligned}
U_t &= R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots \\
&= R_t + (\gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots) \\
&= R_t + \gamma \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \gamma^3 \cdot R_{t+4} + \cdots) \\
&= R_t + \gamma \cdot U_{t+1}
\end{aligned}$$
Identity: $U_t = R_t + \gamma \cdot U_{t+1}$
TD learning for DQN:
DQN's output $Q(s_t, a_t; \mathbf{w})$ is an estimate of $U_t$.
DQN's output $Q(s_{t+1}, a_{t+1}; \mathbf{w})$ is an estimate of $U_{t+1}$.
Thus, $Q(s_t, a_t; \mathbf{w}) \approx \mathbb{E}[R_t + \gamma \cdot Q(S_{t+1}, A_{t+1}; \mathbf{w})]$, which is the statement $\text{estimate of } U_t \approx \mathbb{E}[R_t + \gamma \cdot \text{estimate of } U_{t+1}]$.
Once $r_t$ and $s_{t+1}$ are observed, this becomes $Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})$, i.e., $\text{Prediction} \approx \text{TD target}$.
For the prediction $Q(s_t, a_t; \mathbf{w})$, we set the TD target $y_t = r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w}_t) = r_t + \gamma \cdot \underset{a}{\max}\, Q(s_{t+1}, a; \mathbf{w}_t)$.
Then we set the loss $L_t = \dfrac{1}{2}[Q(s_t, a_t; \mathbf{w}) - y_t]^2$.
Finally, do gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L_t}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$.
In summary, one iteration of TD learning for DQN:
Observe state $S_t = s_t$ and perform action $A_t = a_t$.
Predict the value: $q_t = Q(s_t, a_t; \mathbf{w}_t)$
Differentiate the value network: $\mathbf{d}_t = \dfrac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$
The environment provides the new state $s_{t+1}$ and reward $r_t$.
Compute the TD target: $y_t = r_t + \gamma \cdot \underset{a}{\max}\, Q(s_{t+1}, a; \mathbf{w}_t)$
Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t) \cdot \mathbf{d}_t$.
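A minimal sketch of this iteration with a toy linear "Q-network" $Q(s, a; W) = W_a^\top s$, so that the gradient $\mathbf{d}_t$ can be written by hand; all quantities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, state_dim = 3, 4
W = rng.normal(size=(num_actions, state_dim))   # toy Q-network parameters
alpha, gamma = 0.01, 0.99

s_t = rng.normal(size=state_dim)
a_t = int(np.argmax(W @ s_t))                   # act greedily w.r.t. the current Q
r_t = 1.0                                       # pretend the environment returned this reward
s_next = rng.normal(size=state_dim)             # ... and this new state

q_t = W[a_t] @ s_t                              # prediction q_t = Q(s_t, a_t; w_t)
d_t = np.zeros_like(W)
d_t[a_t] = s_t                                  # dQ/dW for the linear model
y_t = r_t + gamma * np.max(W @ s_next)          # TD target
W = W - alpha * (q_t - y_t) * d_t               # gradient descent step
```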
Policy-Based Reinforcement Learning
Policy Function Approximation
Policy function $\pi(a|s)$ is a probability density function (PDF). It takes state $s$ as input and outputs probabilities for all the actions, e.g.,
$$\begin{aligned}
\pi(\text{left} \mid s) &= 0.2 \\
\pi(\text{right} \mid s) &= 0.1 \\
\pi(\text{up} \mid s) &= 0.7
\end{aligned}$$
The action $a$ is randomly drawn from this distribution.
Can we directly learn a policy function $\pi(a|s)$?
If there are only a few states and actions, then yes, we can. Draw a table (matrix) and learn the entries.
What if there are too many (or infinitely many) states or actions? A table no longer works, so we use function approximation instead: the policy network.
Policy Network
Policy network: use a neural net to approximate $\pi(a|s)$.
Some properties of the policy network:
$\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1$
Here, $\mathcal{A}$ = {"left", "right", "up"} is the set of all actions.
That is why we use softmax activation.
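A toy sketch of such a network (a single linear layer followed by softmax; the sizes are illustrative, not the course's architecture):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
state_dim, num_actions = 4, 3    # actions: left, right, up
theta = rng.normal(size=(num_actions, state_dim))

s = rng.normal(size=state_dim)
pi = softmax(theta @ s)          # pi(. | s; theta): non-negative, sums to 1
print(pi, pi.sum())
```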
Let's revisit the earlier definitions in order to derive a way to train the policy network.
State-Value Function Approximation
Definition: Discounted return.
$$U_t = R_t + \gamma \cdot R_{t+1} + \gamma^2 \cdot R_{t+2} + \gamma^3 \cdot R_{t+3} + \gamma^4 \cdot R_{t+4} + \cdots$$
The return depends on actions $A_t, A_{t+1}, A_{t+2}, \cdots$ and states $S_t, S_{t+1}, S_{t+2}, \cdots$
Actions are random: $\mathbb{P}[A = a \mid S = s] = \pi(a \mid s)$ (policy function).
States are random: $\mathbb{P}[S' = s' \mid S = s, A = a] = p(s' \mid s, a)$ (state-transition function).
Definition: Action-value function.
$$Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$$
The expectation is taken w.r.t. actions $A_{t+1}, \cdots$ and states $S_{t+1}, \cdots$
Definition: State-value function.
$$V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$
Integrate out the action $A \sim \pi(\cdot \mid s_t)$.
How do we approximate the state-value function?
$$V_\pi(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta) \cdot Q_\pi(s_t, a)$$
This is the approximate state-value function at $s_t$, with the policy replaced by the policy network $\pi(a \mid s_t; \theta)$.
Policy-based learning: learn the $\theta$ that maximizes $J(\theta) = \mathbb{E}_S[V(S; \theta)]$.
How to improve $\theta$? Policy gradient ascent!
Policy Gradient
Definition: Approximate state-value function.
$$V_\pi(s; \theta) = \sum_a \pi(a \mid s; \theta) \cdot Q_\pi(s, a)$$
Policy gradient: the derivative of $V_\pi(s; \theta)$ w.r.t. $\theta$.
$$\begin{aligned}
\dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta} &= \dfrac{\partial \sum_a \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\
&= \sum_a \dfrac{\partial\, \pi(a|s;\theta) \cdot Q_\pi(s, a)}{\partial \theta}\\
&= \sum_a \dfrac{\partial \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \sum_a \dfrac{\partial Q_\pi(s, a)}{\partial \theta} \cdot \pi(a|s;\theta)\\
&= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + \mathbb{E}_{A\sim \pi(\cdot|s;\theta)}\Big[\dfrac{\partial Q_\pi(s, A)}{\partial \theta}\Big] \\
&= \sum_a \pi(a|s;\theta) \cdot \dfrac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot Q_\pi(s, a) + x \\
&= \mathbb{E}_{A\sim \pi(\cdot|s;\theta)} \Big[ \dfrac{\partial \log \pi(A|s;\theta)}{\partial \theta} \cdot Q_\pi(s, A)\Big] + x
\end{aligned}$$
where $x := \mathbb{E}_{A\sim \pi(\cdot|s;\theta)}\big[\partial Q_\pi(s, A)/\partial \theta\big]$ is shorthand for the second term, which accounts for the dependence of $Q_\pi$ on $\theta$.
Given the equation $J(\theta) = \mathbb{E}_S[V_{\pi}(S;\theta)]$, we can get
$$\begin{aligned}
\dfrac{\partial J(\theta)}{\partial \theta} &= \mathbb{E}_S \Big[ \dfrac{\partial V_{\pi}(S;\theta)}{\partial \theta} \Big]\\
&= \mathbb{E}_S \Big[ \mathbb{E}_{A\sim \pi(\cdot|S;\theta)} \big[\dfrac{\partial \log \pi(A|S;\theta)}{\partial \theta} \cdot Q_\pi(S, A)\big] \Big] + \mathbb{E}_S[x]
\end{aligned}$$
To compute the policy gradient in practice, we can:
Randomly sample an action $\hat a$ according to $\pi(\cdot \mid s; \theta)$.
Calculate $\mathbf{g}(\hat a, \theta) = \dfrac{\partial \log \pi(\hat a \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, \hat a)$.
Use $\mathbf{g}(\hat a, \theta)$ as an approximation to the policy gradient $\dfrac{\partial V_{\pi}(s;\theta)}{\partial \theta}$.
Here we do not compute the expectation $\mathbb{E}_{A\sim \pi(\cdot|s;\theta)} \big[\dfrac{\partial \log \pi(A|s;\theta)}{\partial \theta} \cdot Q_\pi(s, A)\big]$ exactly. Instead, we use the Monte Carlo method: the single sample $\mathbf{g}(\hat a, \theta)$ serves as a stochastic approximation of the gradient.
Update policy network using policy gradient
Observe the state $s_t$.
Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$.
Compute $q_t \approx Q_\pi(s_t, a_t)$ (some estimate).
Differentiate the policy network: $\mathbf{d}_{\theta, t} = \dfrac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\Big|_{\theta = \theta_t}$
(Approximate) policy gradient: $\mathbf{g}(a_t, \theta_t) = q_t \cdot \mathbf{d}_{\theta, t}$.
Update the policy network: $\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a_t, \theta_t)$.
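A minimal sketch of one such update with a softmax-linear policy, for which $\partial \log \pi(a_t \mid s_t; \theta)/\partial \theta$ has the closed form used below; the value $q_t$ is a placeholder (it would come from REINFORCE or a critic), and all numbers are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
state_dim, num_actions, beta = 4, 3, 0.01
theta = rng.normal(size=(num_actions, state_dim))   # policy parameters

s_t = rng.normal(size=state_dim)
pi = softmax(theta @ s_t)                           # pi(. | s_t; theta_t)
a_t = rng.choice(num_actions, p=pi)                 # sample a_t ~ pi(. | s_t; theta_t)
q_t = 2.5                                           # placeholder estimate of Q_pi(s_t, a_t)

# d log pi(a_t | s_t; theta) / d theta for the softmax-linear policy:
onehot = np.zeros(num_actions)
onehot[a_t] = 1.0
d_theta = np.outer(onehot - pi, s_t)

g = q_t * d_theta                                   # approximate policy gradient
theta = theta + beta * g                            # gradient ascent step
```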
Another question concerns step 3 above: how do we obtain the estimate $q_t \approx Q_\pi(s_t, a_t)$?
Option 1: REINFORCE
Play the game to the end and generate the trajectory:
$$s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T.$$
Compute the discounted return $u_t = \sum_{k=t}^T \gamma^{k-t} r_k$ for all $t$.
Since $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$, we can use the observed $u_t$ to approximate $Q_\pi(s_t, a_t)$, i.e., set $q_t = u_t$ in the algorithm above.
Option 2: Approximate $Q_\pi$ using a neural network.
This leads to the actor-critic method.
Actor-Critic Methods
Value Network and Policy Network
Definition: State-value function.
$$V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_{\pi}(s, a) \approx \sum_a \pi(a \mid s; \theta) \cdot q(s, a; \mathbf{w}) = V(s; \theta, \mathbf{w})$$
Policy network (actor)
Use the neural net $\pi(a \mid s; \theta)$ to approximate $\pi(a \mid s)$.
$\theta$ denotes the trainable parameters of the neural net.
Value network (critic)
Use the neural net $q(s, a; \mathbf{w})$ to approximate $Q_{\pi}(s, a)$.
$\mathbf{w}$ denotes the trainable parameters of the neural net.
Policy Network (Actor)
$\pi(a \mid s; \theta)$
Input: state $s$, e.g., a screenshot of Super Mario.
Output: a probability distribution over the actions.
Let $\mathcal{A}$ be the set of all actions, e.g., $\mathcal{A}$ = {"left", "right", "up"}.
$\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1$. (That is why we use softmax activation.)
Value Network (Critic)
$q(s, a; \mathbf{w})$, the critic's approximation of $Q_{\pi}(s, a)$.
Train the Neural Networks
Definition: State-value function approximated using neural networks.
$$V(s; \theta, \mathbf{w}) = \sum_a \pi(a \mid s; \theta) \cdot q(s, a; \mathbf{w})$$
Training: update the parameters $\theta$ and $\mathbf{w}$.
Update the policy network $\pi(a \mid s; \theta)$ to increase the state value $V(s; \theta, \mathbf{w})$.
Update the value network $q(s, a; \mathbf{w})$ to better estimate the return.
Steps:
Observe the state $s_t$.
Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$.
Perform $a_t$ and observe the new state $s_{t+1}$ and reward $r_t$.
Update $\mathbf{w}$ (the value network) using temporal difference (TD).
Update $\theta$ (the policy network) using the policy gradient.
Update value network using TD
Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
Compute the loss: $L(\mathbf{w}) = \dfrac{1}{2}[q(s_t, a_t; \mathbf{w}) - y_t]^2$
Do gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \dfrac{\partial L(\mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$
Update policy network using policy gradient
Definition: State-value function approximated using neural networks.
$$V(s; \theta, \mathbf{w}) = \sum_a \pi(a \mid s; \theta) \cdot q(s, a; \mathbf{w})$$
Policy gradient: the derivative of $V(s; \theta, \mathbf{w})$ w.r.t. $\theta$.
Set $\mathbf{g}(a, \theta) = \dfrac{\partial \log \pi(a \mid s; \theta)}{\partial \theta} \cdot q(s, a; \mathbf{w})$. Then $\dfrac{\partial V(s; \theta, \mathbf{w})}{\partial \theta} = \mathbb{E}_A[\mathbf{g}(A, \theta)]$.
Following the Monte Carlo method, we sample a single action $a$ and use $\mathbf{g}(a, \theta)$ directly as a stochastic approximation of this expectation.
Algorithm: Update policy network using stochastic policy gradient.
Randomly sample $a \sim \pi(\cdot \mid s_t; \theta_t)$. (Thus $\mathbf{g}(a, \theta)$ is unbiased.)
Stochastic gradient ascent: $\theta_{t+1} = \theta_t + \beta \cdot \mathbf{g}(a, \theta_t)$
Summary of Algorithm
Observe state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \theta_t)$.
Perform $a_t$; the environment then gives the new state $s_{t+1}$ and reward $r_t$.
Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \theta_t)$. (Do not perform $\tilde{a}_{t+1}$.)
Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
Differentiate the value network: $\mathbf{d}_{\mathbf{w}, t} = \dfrac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_t}$