Digest: Occam's Razor



This article aims to qualitatively describe the notation from Kevin Kelly's paper on "Simplicity and Truth Conduciveness" which offers painful perversions of standard set notation. The paper itself strives to provide a formal proof of the validity of Occam's Razor which Wikipedia defines to be:

"the simplest explanation is most likely the right one".

in light of the claim that

The standard literature on Occam's razor contains no plausible, non-circular, explanation of how simplicity helps science to arrive at true theories.

Kelly asks, then, "how could Ockham's razor help one find true theory?"

Formal Definitions of Simplicity

When striving to define the "simplest" explanation of a claim, we start with the benchmarks of


  • Kolmogorov Complexity - which can be thought of as the length of the optimal encoding of a piece of information, or the lower bound on the length of a program needed to recreate some input, roughly speaking.

For example, the Kolmogorov complexity of a simple string like aaaaaaaaaaaaaaaaaaaaaaaaaa (26 'a's) could be expressed in fewer characters as:

a = 'a' * 26

which isn't necessarily the smallest program that expresses the input string, but it's closer.
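Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a rough upper bound on it (up to the fixed cost of the decompressor). The sketch below uses Python's `zlib` as that stand-in. The helper name `compressed_length` and the seeded random comparison string are illustrative assumptions, not anything from Kelly's paper.

```python
import random
import zlib


def compressed_length(text: str) -> int:
    """Bytes needed by zlib to encode `text`: a crude upper bound, not K(x) itself."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# The repetitive string from the example above.
repetitive = "a" * 26

# A hypothetical comparison string: 26 letters drawn pseudo-randomly (seeded for repeatability).
rng = random.Random(0)
scrambled = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(26))

print(compressed_length(repetitive), compressed_length(scrambled))
```

The repetitive string compresses to fewer bytes than the scrambled one, which is the intuition behind calling it "simpler". On strings this short the compressor's fixed overhead dominates, so only the comparison is meaningful, not the absolute numbers.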

Though not entirely relevant to the discussion, the formal definition of Kolmogorov complexity is given in terms of Turing machines, so you know I had to include it:

The Kolmogorov complexity $K$ of a string $x$, relative to a Turing machine $f$, is:

$$
K_f(x) = \min \{|p| : f(p) = x\}
$$

Another fun example of Kolmogorov complexity is given by the Berry paradox, which claims that every positive integer can be described in English in fewer than eleven words. If that were not true, there would be a smallest positive integer that could not be described in fewer than eleven words. Punchline: that very number is described by "The smallest positive integer not definable in under eleven words", which is only ten words long.

Kelly asks, "what's the fastest way to reach a true theory?" and answers, in far more words, that the theory with the fewest retractions, given new data, is the simplest.

A Side Note on the Problem of Induction

In real-world applications, induction is limited by the fact that we want to derive general principles given only immediate specifics. We can never know that our inductive hypothesis is true, let alone an absolute truth, but we can gain or lose confidence in it with each opportunity to be wrong, and hopefully come to know that our hypothesis is not wrong.

Fundamentally, that's all science is: not being wrong lots of times.

Amongst other things, Kelly adds in his supplemental conference slides that if you know that the "future" is simple, you can drop the heuristic and skip straight to a solid principle. However, the future is typically complex, hence the need for a formal means of reaching the simplest hypothesis to describe truth.


If you were looking for a refresher on Set Theory, Kelly does not have your back. Luckily, Daniel does.

Let $E$ be the set of all effects that might be realized:

$$
E = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9\}
$$

Let $K$ be the set of all sets of effects that might be a complete description of the ⭐world⭐:

$$
K = \Big\{\{e_1\}, \{e_1, e_2\}, \{e_1, e_2, e_3\}, \ldots, \{e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9\} \Big\}
$$

Let $Q$ be a partition of $K$ containing theories $T$ that explain the ⭐world⭐:

$$
Q = \Big\{
\underbrace{\big\{\{e_1\},\ \{e_1, e_2, e_3, e_4\}\big\}}_{T_1},\
\underbrace{\big\{\{e_1, e_2\},\ \{e_1, e_2, e_3, e_4, e_5\}\big\}}_{T_2},\
\underbrace{\big\{\{e_1, e_2, e_3\},\ \{e_1, e_2, e_3, e_4, e_5, e_6\}\big\}}_{T_3}
\Big\}
$$

So, to play with this a lil bit:

  • if we see $e_5$, then $T_1$ is immediately falsified as it does not consider $e_5$.

  • if we see $e_1$, then $T_1$ is not falsified, but the longer that time goes by before we see $e_2$, the lower our confidence in $T_1$ will be. However, Occam's razor, by definition, will not retract $T_1$ as no new information has been presented to contradict it thus far. More on this later.
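To make the falsification rule concrete, here is a minimal Python sketch of the example $Q$. It assumes a theory is falsified exactly when none of its effect sets contains everything observed so far. The names `prefix`, `is_falsified`, and `surviving_theories` are made up for illustration.

```python
def prefix(n):
    """Hypothetical helper: the effect set {e1, ..., en}."""
    return frozenset(f"e{i}" for i in range(1, n + 1))


# The example Q from above: each theory is a set of effect sets.
Q = {
    "T1": {prefix(1), prefix(4)},
    "T2": {prefix(2), prefix(5)},
    "T3": {prefix(3), prefix(6)},
}


def is_falsified(theory, observed):
    """Assumed rule: falsified once no effect set in the theory contains every observed effect."""
    return not any(observed <= effects for effects in theory)


def surviving_theories(observed):
    """Names of the theories in Q that the observations have not ruled out."""
    return sorted(name for name, theory in Q.items() if not is_falsified(theory, observed))


print(surviving_theories({"e5"}))  # T1 drops out
print(surviving_theories({"e1"}))  # nothing is ruled out yet
```

Note that seeing only $e_1$ rules nothing out here; the loss of confidence described in the second bullet is not something this yes/no check captures.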

Let $M$ be a strategy that chooses some $T \in Q$ at an arbitrary point in time, which helps us converge on a true theory: a description of the ⭐world⭐ $w$.

For example, let the truth $w$ be $\{e_1, e_2, e_3, e_4, e_5\}$; then $M$ would point to $T_2$.

Now let $\pi$ be the "skeptical path" - a sequence of effects that nature can reveal such that we're taken to a new theory with each piece of added information:

$$
\pi = \underset{S}{\{e_1\}}, \underset{S'}{\{e_1, e_2\}}, \{e_1, e_2, e_3\}, \ldots
$$

such that $S \subseteq S'$, and $S$ must conflict with $S'$ (they fall under different theories, e.g. $T_1 \neq T_2$). Each step in the skeptical path necessarily takes you to a new theory.


The complexity of a set $c(S)$, then, is the length of the longest skeptical path that terminates on $S$, minus 1:

$$
\begin{aligned}
\pi &= \{e_1\} \\
\therefore c(\pi) &= c(\{e_1\}) = 0 \\
&\ \ \vdots \\
c(T_2) &= c(\{e_1, e_2, e_3, e_4, e_5\}) = 4 \\
\because \pi &= \underset{1}{\{e_1\}}, \underset{2}{\{e_1, e_2\}}, \underset{3}{\{e_1, e_2, e_3\}}, \underset{4}{\{e_1, e_2, e_3, e_4\}}, \underset{5}{\{e_1, e_2, e_3, e_4, e_5\}}
\end{aligned}
$$
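The definition of $c(S)$ can be sketched as a small recursive search. This assumes the only possible effect sets are the six that appear in the example $Q$, and that a step from $S$ to $S'$ counts as "skeptical" when $S$ is a proper subset of $S'$ and the two fall under different theories. The names `prefix`, `THEORY_OF`, `longest_skeptical_path`, and `c` are illustrative, not Kelly's.

```python
from functools import lru_cache


def prefix(n):
    """Hypothetical helper: the effect set {e1, ..., en}."""
    return frozenset(f"e{i}" for i in range(1, n + 1))


# The example Q: each theory is a set of effect sets.
Q = {
    "T1": {prefix(1), prefix(4)},
    "T2": {prefix(2), prefix(5)},
    "T3": {prefix(3), prefix(6)},
}

# Reverse lookup: which theory does each effect set belong to?
THEORY_OF = {effects: name for name, theory in Q.items() for effects in theory}


@lru_cache(maxsize=None)
def longest_skeptical_path(effects):
    """Number of sets on the longest skeptical path that ends at `effects`."""
    best = 1  # the path consisting of `effects` alone
    for smaller in THEORY_OF:
        # A valid previous step is a strictly smaller set under a different theory.
        if smaller < effects and THEORY_OF[smaller] != THEORY_OF[effects]:
            best = max(best, 1 + longest_skeptical_path(smaller))
    return best


def c(effects):
    """Complexity of an effect set: longest skeptical path length, minus 1."""
    return longest_skeptical_path(effects) - 1


print(c(prefix(1)), c(prefix(5)))
```

Under these assumptions the search reproduces the two values worked out above: 0 for $\{e_1\}$ and 4 for $\{e_1, e_2, e_3, e_4, e_5\}$.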

Suppose $e_4$ is revealed in isolation. We have no theory (from our example $Q$) that captures this until $e_1$, $e_2$, and $e_3$ are also revealed. However, this would not be a skeptical path, nor does it falsify any of our theories. This revelation, beginning with $e_4$ in isolation, would just decrease our confidence in our theories, indicating that we might need a new theory. For example, let $e_4$ represent the Higgs boson particle; it wouldn't make sense for nature to reveal this to us without first accounting for $e_2$: the Large Hadron Collider.

The next natural step is to define the complexity of a Theory, which is nothing more than a set of sets containing effects. Kelly defines the complexity of a Theory with two stipulations, the first being that a Theory's complexity is the minimum complexity of the sets in $T$:

$$
c(T) = \min_{S \in T} c(S)
$$

The second stipulation is that $S$ must be compatible with experience, meaning that if, for example, $e_6$ is revealed, $T_1$ must be eliminated.
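Both stipulations fit in a few lines of Python. This is only a sketch under simplifying assumptions: for the nested example sets, $c(S)$ is taken to be the number of effects minus one (which matches the values 0 and 4 worked out earlier), and it is held fixed rather than recomputed relative to what has been observed, which Kelly's full treatment would do. The names `c_set`, `c_theory`, and `occam_choice` are hypothetical.

```python
def prefix(n):
    """Hypothetical helper: the effect set {e1, ..., en}."""
    return frozenset(f"e{i}" for i in range(1, n + 1))


# The example Q: each theory is a set of effect sets.
Q = {
    "T1": {prefix(1), prefix(4)},
    "T2": {prefix(2), prefix(5)},
    "T3": {prefix(3), prefix(6)},
}


def c_set(effects):
    """Assumed shortcut for the nested example: c({e1..en}) = n - 1."""
    return len(effects) - 1


def c_theory(theory, observed=frozenset()):
    """Minimum c(S) over the sets in the theory that contain everything observed.

    Returns None when no set is compatible, i.e. the theory is eliminated.
    """
    compatible = [effects for effects in theory if observed <= effects]
    if not compatible:
        return None
    return min(c_set(effects) for effects in compatible)


def occam_choice(observed=frozenset()):
    """Name of the surviving theory with the smallest complexity, or None."""
    scores = {name: c_theory(theory, observed) for name, theory in Q.items()}
    alive = {name: score for name, score in scores.items() if score is not None}
    return min(alive, key=alive.get) if alive else None


print(occam_choice(frozenset({"e5"})))
```

With $e_6$ observed, `c_theory(Q["T1"], ...)` returns `None`, which is the elimination described by the second stipulation.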

Occam's Razor

So, Occam's razor is the strategy of picking the theory with minimal $c(T)$, but this doesn't necessarily get us to the truth... The strength of Occam's razor is that any other truth-getting strategy will eventually be forced to make a retraction in the absence of new information (like waiting on the Higgs boson, given $\{e_1, e_2, e_3, ?\}$). Occam's razor, by contrast, solves the problem $(K_e, Q)$ each time new information $e_i$ is revealed by choosing the simplest, most complete Theory, and it makes no more retractions than any other arbitrary strategy. In this sense, Occam's razor leads you to the truth the fastest, in the worst case.

The new idea is that these [alternative theories] are not exhaustive alternatives, for it may be that Ockham's razor somehow converges to the truth along the straightest or most direct path, where directness is, roughly speaking, a matter of not altering one's opinion more often or later than necessary.

Say one theory $T$ leaps to the correct answer while Occam's razor takes the long path to get there. As soon as nature reveals $e_i$, $T$ could be invalidated, but Occam's razor still holds, at the cost of one additional retraction.

On pain of non-convergence, there is always some amount of time after which new $e_i$, or the lack thereof, will force us to retract our theory. We could be waiting for the Higgs boson $e_4$ for eternity, then finally retract our Theory $\{e_1, e_2, e_3, e_4\}$, and the devilish nature of $\pi$ could, the very next day, reveal $e_4$. $T$ would have made an excessive retraction: once to retract $e_4$, then rightfully again to re-include $e_4$. Occam's razor dictates patience, and never makes the first retraction. In fact, Occam's razor never included $e_4$ prior to its revelation. Hence, it only ever retracts theories in light of new $e_i$. Occam's razor is the upper bound for the worst case, but again, the generality of this approach makes it convergent in all cases. Whereas other strategies might work "faster" than Occam's razor in some specific cases, they may also fail in other fields, leading to excess retractions, and therefore less simplicity.

Nature is recalcitrant, and can pause at $e_3$ for infinity, then as soon as we retract, $e_4$ is revealed. Occam's razor does not retract on pain of convergence: it is the most pessimistic approach with respect to nature.
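The Higgs boson story can be played out as a toy simulation. Everything here is an illustrative assumption rather than Kelly's formalism: a strategy's answer is simply the set of effects it currently asserts, a retraction is any change of answer, `occam` asserts exactly what has been observed, and `leaper` bets on $e_4$ ahead of time but gives up after `patience` steps with no news.

```python
def occam(observed, stale):
    """Assert exactly the effects seen so far; never anticipate."""
    return observed


def leaper(observed, stale, anticipated="e4", patience=3):
    """Hypothetical rival: assert `anticipated` early, drop it after waiting too long."""
    if anticipated in observed or stale >= patience:
        return observed
    return observed | {anticipated}


def count_retractions(strategy, stream):
    """Count how many times the strategy changes its answer along the stream."""
    retractions = 0
    previous_answer = None
    previous_observed = None
    stale = 0  # consecutive steps without any new effect
    for observed in stream:
        stale = stale + 1 if observed == previous_observed else 0
        answer = strategy(observed, stale)
        if previous_answer is not None and answer != previous_answer:
            retractions += 1
        previous_answer, previous_observed = answer, observed
    return retractions


# Nature pauses at e3 for a while, then reveals e4 right after the leaper gives up.
seen_three = frozenset({"e1", "e2", "e3"})
seen_four = seen_three | {"e4"}
stream = [seen_three] * 5 + [seen_four]

print(count_retractions(occam, stream), count_retractions(leaper, stream))
```

On this stream `occam` changes its answer once, when $e_4$ actually arrives, while `leaper` changes twice: once to drop $e_4$ and once to re-include it. Had nature revealed $e_4$ before the leaper's patience ran out, the leaper would have come out ahead, which is why the guarantee is only about the worst case.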

Critiques of Occam's Razor and Some Takeaways

"Any theory worth its salt has an underlying mechanism, where's Occam's?"

Occam's razor is useful when no current Theory contains $\pi$. The worst case you get stuck in is the one where nature falsifies your "best" theory per $c(T)$, and the one retraction you make increases $c(T)$ by only one. Your mechanism is not obliterated.

Take $T_4 = \Big\{ \{e_7, e_8\}\Big\}$ along with $T_1 = \Big\{\{e_1\}, \{ e_1, e_2, e_3, e_4 \}\Big\}$. $T_4$ is simpler than $T_1$ given no $\pi$ yet, as there is no path we could be given that takes us to both $e_4$ and $e_7$. $T_4$ is bolder because it's more true or false, and that is the definition of simplicity. The fragility of $T_4$ is directly proportional to its simplicity and therein lies the profundity of Occam's razor.

Since methods that approach the truth more directly have superior connection to the truth or are more conducive to find the truth, it is a relevant and non-circular response to the simplicity puzzle to prove that Occam strategies approach the truth more directly than all competitors.

  • What makes your theory complex is the presence of other theories that predict a subset of what your theory predicts

  • What makes your theory simple is how indivisible your predictions are among known possible theories