ML 2: A Discussion of Action Spaces

| $\color{blue}\text{Discrete}$ | $\color{red}\text{Continuous}$ |
| --- | --- |
| Finite number of actions which can be taken | Infinite number of actions which can be taken |
| e.g. $\color{blue}\text{Left}$ or $\color{blue}\text{Right}$ | e.g. $\color{red}\text{Amount of torque}$ to apply to a wheel |
| Easier to conceptualize and evaluate, as the action set is finite and therefore iterable | The action space can be differentiated, which is advantageous because it allows us to identify similarities between actions |
| Cannot be differentiated, so actions like $\color{blue}\text{North, South}$ may be both adjacent and highly dissimilar | e.g. $\color{red}\frac{d}{da_{n}} = 1.02, \frac{d}{da_{n+1}} = 1.03$ can be grouped by trend |
| $\color{blue}\pi^{*}$ is easier to find, as there is an exhaustible set of actions and policies to be evaluated | Continuous action spaces are superior because, theoretically, $\exists$ an action $\color{red}a \in \mathcal{A}$ which immediately solves the given problem. With the stipulation that most $\color{red}a \in \mathcal{A}$ will be so downright poor that you'll want to terminate the simulation, the infinite size of $\color{red}\mathcal{A}$ dictates that $\color{red}\exists\, a \leftarrow \pi^{*} \gg \color{blue}\pi^{*}$ |

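
To make the contrast concrete, here is a minimal sketch that is not part of the original post. It assumes a toy reward table for the discrete case and a made-up quadratic reward over torque for the continuous case; the names `best_discrete_action` and `gradient_ascent` are hypothetical. The discrete set is searched by plain enumeration, while the continuous space is searched by following a finite-difference estimate of the derivative.

```python
from typing import Callable, Sequence


def best_discrete_action(actions: Sequence[str],
                         reward: Callable[[str], float]) -> str:
    """Exhaustively evaluate a finite action set and return the best action."""
    return max(actions, key=reward)


def gradient_ascent(reward: Callable[[float], float], a0: float,
                    lr: float = 0.1, steps: int = 200,
                    eps: float = 1e-5) -> float:
    """Climb a differentiable reward over a continuous action.

    The derivative is estimated with a central finite difference, so no
    analytic gradient is required.
    """
    a = a0
    for _ in range(steps):
        grad = (reward(a + eps) - reward(a - eps)) / (2 * eps)
        a += lr * grad
    return a


# Assumed toy rewards, for illustration only.
discrete_rewards = {"Left": 0.2, "Right": 0.7}
best = best_discrete_action(list(discrete_rewards), discrete_rewards.get)


def torque_reward(torque: float) -> float:
    # Hypothetical reward that peaks at a torque of 1.5.
    return -(torque - 1.5) ** 2


best_torque = gradient_ascent(torque_reward, a0=0.0)
```

Enumeration only works because the discrete set can be exhausted; the gradient step only works because nearby torques have nearby rewards, which is the similarity between actions that the continuous column describes.
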
We can compensate for the limitations of the discrete action space by identifying the region around the global maximum of $\color{red}\pi^{*}$ at any time $\color{red}t$ and discretizing it via some function $\color{orange}\gamma(x) = \beta\lfloor x \rfloor$, where $\beta$ is some factor that creates a distribution of actions rather than a set of identical actions. We end up with a dense action space $\color{blue}\mathcal{A}$, each member of which is better, on average, than a random $\color{red}a \in \mathcal{A}$ (see Fig. 1).
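
As a rough illustration of the discretization step, the sketch below applies $\gamma(x) = \beta\lfloor x \rfloor$ exactly as written to evenly spaced samples from a region around a best-known continuous action. The region (`center`, `radius`), the sample count, and the helper name `discretize_region` are assumptions made for the example, not details from the post.

```python
import math
from typing import List


def gamma(x: float, beta: float) -> float:
    """gamma(x) = beta * floor(x), taken literally from the text."""
    return beta * math.floor(x)


def discretize_region(center: float, radius: float, beta: float,
                      n_samples: int = 101) -> List[float]:
    """Map evenly spaced continuous actions in [center - radius,
    center + radius] through gamma and keep the distinct results."""
    lo = center - radius
    step = 2 * radius / (n_samples - 1)
    samples = (lo + i * step for i in range(n_samples))
    return sorted({gamma(x, beta) for x in samples})


# Hypothetical region around a best-known torque of 2.0.
actions = discretize_region(center=2.0, radius=1.5, beta=0.5)
# actions == [0.0, 0.5, 1.0, 1.5]
```

Because the floor is taken before scaling, $\beta$ rescales the resulting integer levels instead of setting a grid spacing on the original axis. A grid-snapping variant such as $\beta\lfloor x / \beta \rfloor$ would be a different function from the one the text defines.
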

This solution, however elegant, is also constrained by $\infty$ because no matter how "global" a theoretical maximum $\color{blue}\pi^{*}$ is, $\color{blue}\exists\, \pi^{*} + n$. We can resolve this caveat and potential resource leak (forever searching for a global maximum in an infinite space) by defining a $\color{orange}\text{global maximum satisfaction rate } \omega$ such that the system is satisfied if a better maximum $\color{blue}\pi^{*} + \color{orange}\omega$ has not been found within an arbitrary $t + n$ time steps.
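
One way to read the satisfaction rate is as a patience-style stopping rule. The sketch below assumes that reading: a candidate counts as a new maximum only if it beats the current best by at least `omega`, and the search stops once `n` consecutive steps pass without such an improvement. The function name and the example scores are hypothetical.

```python
from typing import Iterable, Tuple


def search_until_satisfied(scores: Iterable[float], omega: float,
                           n: int) -> Tuple[float, int]:
    """Consume candidate scores until no improvement of at least `omega`
    has been seen for `n` consecutive steps.

    Returns the best accepted score and the number of steps consumed.
    """
    best = float("-inf")
    since_improvement = 0
    steps = 0
    for score in scores:
        steps += 1
        if score >= best + omega:
            # Large enough improvement: accept it and reset the counter.
            best = score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= n:
                break  # satisfied: stop searching
    return best, steps


# Made-up sequence of policy scores, for illustration only.
result = search_until_satisfied(
    [1.0, 1.05, 2.0, 2.01, 2.05, 2.08, 5.0], omega=0.1, n=3)
# result == (2.0, 6); the later score of 5.0 is never examined
```

The example also shows the trade-off: the rule bounds the search, but it can stop before a much better maximum that lies further along.
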

Figure 1 - discretizing a "Global Maximum"