ML 1: Reinforcement Learning, So Hot Right Now

Published on

Introduction to Reinforcement Learning

This page is a summary of the cool symbols and my layman's understanding of what they mean in the context of Reinforcement Learning. Praise be to Sergey and BHoag. A more complete, coherent glossary is given in the glossary of this point; this post remains on the more shit-posty end of the spectrum.

Glossary

symbolterm
$\mathcal{s}_{t}$state
$\mathcal{a}_{t}$action
$\mathcal{t}$time step
$\mathcal{r_{t}}$reward that scales with the state e.g. $\mathcal{s_{t}, r_{t} \to s_{t+1}, r_{t+1} }$
$\mathcal{r_{t}}$ get provided to the agent in order to line
this process of incrementing $\mathcal{s_{t}, a_{t}, r_{t}}$ to approach a better and better policy $\mathcal{\pi}$ is repeated until the optimal policy $\mathcal{\pi^{*}}$ is achieved by maximizing $r_{t}$
$\mathcal{S}$the set of states and the distribution of starting states $p(\mathcal{s}_{0})$
$\mathcal{A}$the set of actions that can be taken in a given state
$\mathcal{T(s_{t+1}\text{\textbar} s_{t}, {a}_{t} ) }$Transition dynamics function which maps a state-action pair at time $\mathcal{t}$ to a distribution of states at $\mathcal{t+1}$
// the probability "$\text{\textbar}$" of $\mathcal{s_{t+1}}$ given $\mathcal{s_{t}, a_{t}}$
can't make any assumptions about future states, just that $\exists$ a function $\mathcal{T}$ that determines the probability of any state
$\mathcal{R(s_{t}, a_{t}, s_{t+1})}$immediate reward function
$\gamma \in [0,1]$discount factor where $\displaystyle{\lim_{\gamma \to 0}}$ places emphasis on the immediacy of the reward
$\mathcal{\pi:S \to p(A=a\text{\textbar}s)}$$\mathcal{\pi}$ is a mapping from the set of states to a probability distribution over the set of actions
$T$an episode. A Markov Decision Process is said to be "episodic" if, after length $T$, the state resets
if $T=\infin$, then the MDP is not episodic
if $T$, then the MDP is episodic
$R = \displaystyle\sum_{t=0}^{T-1} \gamma^tr_{t+1}$brings the policy us closer to the goal of Reinforcement Learning: $\mathcal{\pi^*}$
$\mathcal{\pi^*=\argmax\limits_\pi}\mathbb{E}[R \text{\textbar}\mathcal{\pi}]$the optimal policy/control strategy based on the given possibility and rewards

Problem Solving in Reinforcement Learning: Evaluative Functions

functionpurpose
$V\mathcal{^{\pi}(s)=\mathbb{E}[}R\mathcal{\text{\textbar}s, \pi]}$Expected return, given the state and probability
$V\mathcal{^{\pi}(s)=\argmax\limits_\pi}V^{\pi}(s)$
$\space \forall s \in\mathcal{S}$
the optimal policy for all states in the set of states.
sidebar: saying that something holds true for all states in the set of all possible states is really stupid and redundant for a generalized solution like RL. Why not say $\forall a \in\mathcal{A}$ whenever you mention $\mathcal{a}$?
the set of all possible sets, policies, models, gradients, loss functions, backup operations, transition dynamics, action, Turing Machines, and symbols. To be used:
$\forall x \in\mathcal{勿} \text { where x is literally any variable. ever.}$
$Q\mathcal{^\pi(s,a)=\mathbb{E}[}R\mathcal{\text{\textbar}s,a,\pi]}$quality function which is like $V\mathcal{^\pi(s)}$, except the intial action $a$ is provided
$Q\mathcal{^\pi(s_{t},a_{t})=\mathbb{E_{s_{t+1}}}[r_{t+1}+\gamma}Q\mathcal{^{\pi}(s_{t+1}, \pi(s_{t+1})]}$the expected output of the next state determined by the anticipated reward plus the discount times the quality of the next state, provided the next state and action
$Q\mathcal{^\pi(s_{t},a_{t}) \leftarrow }Q\mathcal{^{\pi}(s_{t},a_{t}) + \alpha\delta}$The value of Q is assigned according to the learning rate times the Temporal Difference Error
$\alpha$the learning rate
$\delta = \text {Y} - Q^{\pi}(\mathcal{s_{t}, a_{t}})$the Temporal Difference Errror
$\text {Y}$the target of standard regression maximized with respect to the action $a$.
$\text {Y} = \mathcal {r_{t} + \gamma \max_{a}} Q^{\pi}(\mathcal{s_{t}, a_{t}})$
$Q^{*} \approx \text {Y} = \mathcal {r_{t} + \gamma \max_{a}} Q^{\pi}(\mathcal{s_{t+1}, a_{t+1}})$the optimal Quality Function is approximated by the target of standard regression
$s_{\pi_{a_{1}, a_{2}, a_{3}}}\xleftrightharpoons[\overline{\pi_{a_{1,2,3}}}]{\pi_{a}} \LARGE{s} \normalsize \xrightleftharpoons[\overline{\pi_{b_{1,2,3}}}]{\pi_{b}} s_{\pi_{b_{1}, b_{2}, b_{3}}}$Monte Carlo Methods
a policy is rolled out and pursued for an arbitrary amount of episodes after which the mean value of the policy is returned.
//looking into the possible future based on the current state and a given policy and choosing the best policy to be taken based on the mean
$\text {TD}(\lambda)$a combination of Temporal Difference bootstrapping and monte carlo methods where $\lambda$ is an interpolation factor between TD and MCM
$\text{A}^{\pi}\mathcal{(s)} = Q^{\pi} - V^{\pi}$the Advantage Function which produces relative state-action values instead of absolute state-action values like $Q^{\pi}$.
this is useful as it's easier to make comparative evaluations between actual return calculations e.g. we can tell that $x > y$ without knowing of $x$ or $y$ are actually any good.
the Advantage Function asks/answers the question "How much better is this action compared to the average action taken in this state."
$\infin \circlearrowright \begin{cases} \footnotesize \text {Policy evaluation} \\ \footnotesize \text{Policy improvement } \end{cases}$Generalized Policy Improvement
functionpurpose
$\pi_{\theta} ; \space \theta = \small \text{the parameters}$means of policy search which is typically chosen that is parametrized to maximize $\mathbb{E}[\text{R}\text{\textbar}\theta]$ using either gradient-based or gradient free optimization
gradient-based optimizationapplicable to Deep Reinforcement Learning scenarios denoted by high-dimensional action spaces
$\nabla_{\theta}\mathbb{E}[f(x,\theta)]=\mathbb{E}_{x}[f(x,\theta)\nabla_{\theta}\log p(x)]$REINFORCE rule: gradient-estimation (aka "score function", "likelihood-ratio estimator") computes the gradient of the expectation over a function $f$ of a random variable $x$ with respect to parameters $\theta$.
$\log x$ in RL$log_{x}b=a \space \text{ s.t. } \exists \space x \in$
hint: it does

$f(x,y) = x^2 + 2xy + y^2$
$\nabla f = \langle 2x + 2y, 2x + 2y \rangle \qquad \ \scriptsize \text{Take the partial derivative with respect to each variable}$
$\qquad\theta \leftarrow x,y \qquad \qquad \quad \qquad \scriptsize x,y \text{ are assigned to the parameter } \theta$
$\nabla_{\theta}f = \langle 2x + 2y, 2x + 2y \rangle = \nabla f$

Closing Big Brain Ideas

termdefinition
Markov's PropertyOnly the current state effects the next state
Actor Critic Methodsuses $\mathcal{\pi^*}$ as a baseline for policy gradients
Reinforcement Learningfocuses on learning without knowing he underlying model of the environment, but they can learn from interacting with it
Convolutional Neural Networkcommonly used as components of RL agents allowing them to learn directly from raw, high-dimensional visual input values
The Goal of RLto train Deep Neural Networks to approximate $\mathcal{\pi^*}, V^*, Q^*, A^*$
Experience Relay Memorystores $\mathcal{s_{t}, a_{t}, s_{t+1}, r_{t+1}}$ in a cyclic buffer, enabling RL agents to sample from and train on previously observed data (batches) offline
Target Networkcorrectly-weighted, but frozen neural nets. The "active" policy network pulls TD error values from the cached and comparatively stable target network, rather than having to calculate the TD error based on its own rapidly fluctuating estimates of $Q$ values
Hard Attentionusing RL to make discrete stochastic decisions over inputs via back propagation and re-parameterization
Re-parameterizationallows neural networks to be treated as stochastic computational graphs -- a key concept in algorithms involving Stochastic Value Gradients
Dynamic Programmingusing the current $Q^{\pi}$ to improve the next $Q^{\pi}$ via "bootstrapping"
Monte Carlo Methodsestimating the expected return from a state by averaging return from multiple rollouts of a policy.
used to find optimal trajectories.
limited to episodic Markov Decision Processes
Policy Searchdoesn't maintain a value-function model, instead this approach directly searches for an optimal policy $\pi^{*}$
Actor-Critic ModelThe Actor (policy) receives a state from the environment and chooses an action to perform. At the same time, some critic ($V^\pi$) receives the state and reward resulting from the previous interaction. The critic uses the RD error $\delta$ calculated from this information to update itself and the actor.

It's trivial, really.