ML 1: Reinforcement Learning, So Hot Right Now

Published on
12 min read––– views

Introduction to Reinforcement Learning

This page is a summary of the cool symbols and my layman's understanding of what they mean in the context of Reinforcement Learning. Praise be to Sergey and BHoag. A more complete, coherent glossary is given in the glossary of this point; this post remains on the more shit-posty end of the spectrum.


t\mathcal{t}time step
rt\mathcal{r_{t}}reward that scales with the state e.g. st,rtst+1,rt+1\mathcal{s_{t}, r_{t} \to s_{t+1}, r_{t+1} }
rt\mathcal{r_{t}} get provided to the agent in order to line
this process of incrementing st,at,rt\mathcal{s_{t}, a_{t}, r_{t}} to approach a better and better policy π\mathcal{\pi} is repeated until the optimal policy π\mathcal{\pi^{*}} is achieved by maximizing rtr_{t}
S\mathcal{S}the set of states and the distribution of starting states p(s0)p(\mathcal{s}_{0})
A\mathcal{A}the set of actions that can be taken in a given state
T(st+1|st,at)\mathcal{T(s_{t+1}\text{\textbar} s_{t}, {a}_{t} ) }Transition dynamics function which maps a state-action pair at time t\mathcal{t} to a distribution of states at t+1\mathcal{t+1}
// the probability "|\text{\textbar}" of st+1\mathcal{s_{t+1}} given st,at\mathcal{s_{t}, a_{t}}
can't make any assumptions about future states, just that \exists a function T\mathcal{T} that determines the probability of any state
R(st,at,st+1)\mathcal{R(s_{t}, a_{t}, s_{t+1})}immediate reward function
γ[0,1]\gamma \in [0,1]discount factor where limγ0\displaystyle{\lim_{\gamma \to 0}} places emphasis on the immediacy of the reward
π:Sp(A=a|s)\mathcal{\pi:S \to p(A=a\text{\textbar}s)}π\mathcal{\pi} is a mapping from the set of states to a probability distribution over the set of actions
TTan episode. A Markov Decision Process is said to be "episodic" if, after length TT, the state resets
if T=T=\infin, then the MDP is not episodic
if TT, then the MDP is episodic
R=t=0T1γtrt+1R = \displaystyle\sum_{t=0}^{T-1} \gamma^tr_{t+1}brings the policy us closer to the goal of Reinforcement Learning: π\mathcal{\pi^*}
π=arg maxπE[R|π]\mathcal{\pi^*=\argmax\limits_\pi}\mathbb{E}[R \text{\textbar}\mathcal{\pi}]the optimal policy/control strategy based on the given possibility and rewards

Problem Solving in Reinforcement Learning: Evaluative Functions

Vπ(s)=E[R|s,π]V\mathcal{^{\pi}(s)=\mathbb{E}[}R\mathcal{\text{\textbar}s, \pi]}Expected return, given the state and probability
Vπ(s)=arg maxπVπ(s)V\mathcal{^{\pi}(s)=\argmax\limits_\pi}V^{\pi}(s)
 sS\space \forall s \in\mathcal{S}
the optimal policy for all states in the set of states.
sidebar: saying that something holds true for all states in the set of all possible states is really stupid and redundant for a generalized solution like RL. Why not say aA\forall a \in\mathcal{A} whenever you mention a\mathcal{a}?
the set of all possible sets, policies, models, gradients, loss functions, backup operations, transition dynamics, action, Turing Machines, and symbols. To be used:
x where x is literally any variable. ever.\forall x \in\mathcal{勿} \text { where x is literally any variable. ever.}
Qπ(s,a)=E[R|s,a,π]Q\mathcal{^\pi(s,a)=\mathbb{E}[}R\mathcal{\text{\textbar}s,a,\pi]}quality function which is like Vπ(s)V\mathcal{^\pi(s)}, except the intial action aa is provided
Qπ(st,at)=Est+1[rt+1+γQπ(st+1,π(st+1)]Q\mathcal{^\pi(s_{t},a_{t})=\mathbb{E_{s_{t+1}}}[r_{t+1}+\gamma}Q\mathcal{^{\pi}(s_{t+1}, \pi(s_{t+1})]}the expected output of the next state determined by the anticipated reward plus the discount times the quality of the next state, provided the next state and action
Qπ(st,at)Qπ(st,at)+αδQ\mathcal{^\pi(s_{t},a_{t}) \leftarrow }Q\mathcal{^{\pi}(s_{t},a_{t}) + \alpha\delta}The value of Q is assigned according to the learning rate times the Temporal Difference Error
α\alphathe learning rate
δ=YQπ(st,at)\delta = \text {Y} - Q^{\pi}(\mathcal{s_{t}, a_{t}})the Temporal Difference Errror
Y\text {Y}the target of standard regression maximized with respect to the action aa.
Y=rt+γmaxaQπ(st,at)\text {Y} = \mathcal {r_{t} + \gamma \max_{a}} Q^{\pi}(\mathcal{s_{t}, a_{t}})
QY=rt+γmaxaQπ(st+1,at+1)Q^{*} \approx \text {Y} = \mathcal {r_{t} + \gamma \max_{a}} Q^{\pi}(\mathcal{s_{t+1}, a_{t+1}})the optimal Quality Function is approximated by the target of standard regression
sπa1,a2,a3πa1,2,3πasπb1,2,3πbsπb1,b2,b3s_{\pi_{a_{1}, a_{2}, a_{3}}}\xleftrightharpoons[\overline{\pi_{a_{1,2,3}}}]{\pi_{a}} \LARGE{s} \normalsize \xrightleftharpoons[\overline{\pi_{b_{1,2,3}}}]{\pi_{b}} s_{\pi_{b_{1}, b_{2}, b_{3}}}Monte Carlo Methods
a policy is rolled out and pursued for an arbitrary amount of episodes after which the mean value of the policy is returned.
//looking into the possible future based on the current state and a given policy and choosing the best policy to be taken based on the mean
TD(λ)\text {TD}(\lambda)a combination of Temporal Difference bootstrapping and monte carlo methods where λ\lambda is an interpolation factor between TD and MCM
Aπ(s)=QπVπ\text{A}^{\pi}\mathcal{(s)} = Q^{\pi} - V^{\pi}the Advantage Function which produces relative state-action values instead of absolute state-action values like QπQ^{\pi}.
this is useful as it's easier to make comparative evaluations between actual return calculations e.g. we can tell that x>yx > y without knowing of xx or yy are actually any good.
the Advantage Function asks/answers the question "How much better is this action compared to the average action taken in this state."
{Policy evaluationPolicy improvement \infin \circlearrowright \begin{cases} \footnotesize \text {Policy evaluation} \\ \footnotesize \text{Policy improvement } \end{cases}Generalized Policy Improvement
πθ; θ=the parameters\pi_{\theta} ; \space \theta = \small \text{the parameters}means of policy search which is typically chosen that is parametrized to maximize E[R|θ]\mathbb{E}[\text{R}\text{\textbar}\theta] using either gradient-based or gradient free optimization
gradient-based optimizationapplicable to Deep Reinforcement Learning scenarios denoted by high-dimensional action spaces
gradient-free optimizationgood for low-dimensional problems
θE[f(x,θ)]=Ex[f(x,θ)θlogp(x)]\nabla_{\theta}\mathbb{E}[f(x,\theta)]=\mathbb{E}_{x}[f(x,\theta)\nabla_{\theta}\log p(x)]REINFORCE rule: gradient-estimation (aka "score function", "likelihood-ratio estimator") computes the gradient of the expectation over a function ff of a random variable xx with respect to parameters θ\theta.
logx\log x in RLlogxb=a  s.t.  xlog_{x}b=a \space \text{ s.t. } \exists \space x \in
hint: it does

A Brief Note on Gradients

f(x,y)=x2+2xy+y2f(x,y) = x^2 + 2xy + y^2
f=2x+2y,2x+2y Take the partial derivative with respect to each variable \nabla f = \langle 2x + 2y, 2x + 2y \rangle \qquad \ \scriptsize \text{Take the partial derivative with respect to each variable}
θx,yx,y are assigned to the parameter θ\qquad\theta \leftarrow x,y \qquad \qquad \quad \qquad \scriptsize x,y \text{ are assigned to the parameter } \theta
θf=2x+2y,2x+2y=f\nabla_{\theta}f = \langle 2x + 2y, 2x + 2y \rangle = \nabla f

Closing Big Brain Ideas

Markov's PropertyOnly the current state effects the next state
Actor Critic Methodsuses π\mathcal{\pi^*} as a baseline for policy gradients
Reinforcement Learningfocuses on learning without knowing he underlying model of the environment, but they can learn from interacting with it
Convolutional Neural Networkcommonly used as components of RL agents allowing them to learn directly from raw, high-dimensional visual input values
The Goal of RLto train Deep Neural Networks to approximate π,V,Q,A\mathcal{\pi^*}, V^*, Q^*, A^*
Experience Relay Memorystores st,at,st+1,rt+1\mathcal{s_{t}, a_{t}, s_{t+1}, r_{t+1}} in a cyclic buffer, enabling RL agents to sample from and train on previously observed data (batches) offline
Target Networkcorrectly-weighted, but frozen neural nets. The "active" policy network pulls TD error values from the cached and comparatively stable target network, rather than having to calculate the TD error based on its own rapidly fluctuating estimates of QQ values
Hard Attentionusing RL to make discrete stochastic decisions over inputs via back propagation and re-parameterization
Re-parameterizationallows neural networks to be treated as stochastic computational graphs -- a key concept in algorithms involving Stochastic Value Gradients
Dynamic Programmingusing the current QπQ^{\pi} to improve the next QπQ^{\pi} via "bootstrapping"
Monte Carlo Methodsestimating the expected return from a state by averaging return from multiple rollouts of a policy.
used to find optimal trajectories.
limited to episodic Markov Decision Processes
Policy Searchdoesn't maintain a value-function model, instead this approach directly searches for an optimal policy π\pi^{*}
Actor-Critic ModelThe Actor (policy) receives a state from the environment and chooses an action to perform. At the same time, some critic (VπV^\pi) receives the state and reward resulting from the previous interaction. The critic uses the RD error δ\delta calculated from this information to update itself and the actor.

It's trivial, really.