Statistical Geometry


Introduction

In which we discuss three cases of the same problem regarding optimal selection.

The Trivial Case of Googol

A problem presented by information theorist Thomas Cover in his book *Open Problems in Communication and Computation* [1] deals with a specific case of the betting game called 'Googol', in which you ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number – as big as he pleases! – giving rise to the name. The values on the slips are hidden, and the slips are shuffled. One at a time you turn the slips face up, with the aim being to stop once you've flipped what you guess to be the largest number of all the slips. You cannot go back and pick a previously turned slip. If you turn over all the slips, then of course you must pick the last one turned.

A (maybe?) more practical exercise where this might occur is in pulling straws of various lengths to decide who gets stuck with some unfavorable task that must be done. Now, obviously, you'd need really big hands to conceal a googol-lengthéd straw, let alone arbitrarily many of them, but we can still recover some value from even the "trivial" – and much more germane – case with two straws.

Martin Gardner mentioned it in his February 1960 column in Scientific American [2], but he disregarded the case of just two slips of paper, considering it to be trivial. This is exactly the case that Cover investigates in his chapter.

Player 1 devises two straws of different lengths; Player 2 chooses one of the straws and must decide whether it's the longer of the two. Cover narrates our intuition:

He can be right with probability one-half. It seems absurd that he can do better.

The question posed is whether or not there exists some strategy for Player 2 to improve upon the random guess.

The solution requires the selection of a splitting number $T$ according to a density $f(t) > 0,\ t \in (-\infty, \infty)$. If the length of the straw in hand is less than $T$, decide that it is the smaller; otherwise, decide that it is the larger.

While this at first doesn't seem to provide an apparent advantage over the naïve strategy, since there are infinitely many numbers greater than $T$, we can develop an improved intuition about this strategy by understanding the $T$-selection function $f$. If we map the real line onto the open interval $(0, 1)$, we can preserve the cardinality of the reals without having to deal with infinity. Such a mapping can be achieved via a function like the sigmoid:

$$f(x) = \frac{1}{1 + e^{-x}}$$

which satisfies:

  1. $f: \mathbb{R} \rightarrow (0,1)$
  2. strict monotonicity

Now, after picking our straw (or one of the slips of paper in the original, non-trivial case(s) of Googol), call it $A$, we compute $f(A)$ and compare it with a random number $T \sim \text{Unif}(0,1)$. If $T < f(A)$, then our strategy dictates that we guess that $A$ is the bigger of the two numbers $A, B$.

With this strategy, the probability of guessing the larger number correctly is:

$$P = \frac{1}{2} f(\text{lg}) + \frac{1}{2}\big(1 - f(\text{sm})\big)$$

where $\frac{1}{2}f(\text{lg})$ is the probability that we draw the larger straw (half of the time) and correctly stick with it, and $\frac{1}{2}(1 - f(\text{sm}))$ is the probability that we draw the smaller straw (the other half of the time) and correctly switch our guess. Computing these probabilities, we get:

$$\begin{aligned} P &= \frac{1}{2} f(\text{lg}) + \frac{1}{2}\big(1 - f(\text{sm})\big) \\ &= \frac{1}{2} + \frac{1}{2}\big(f(\text{lg}) - f(\text{sm})\big) \end{aligned}$$

But since $f$ is strictly monotonic, we have $f(\text{lg}) > f(\text{sm})$, so the resultant term corresponding to the probability of finding the larger number will always be strictly greater than one half.
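To make this concrete, here's a minimal Monte Carlo sketch in Python of the two-straw strategy, using the sigmoid above as the $T$-selection mapping. The straw lengths and the trial count are arbitrary values chosen purely for illustration:

```python
import math
import random

def sigmoid(x: float) -> float:
    """Map the real line onto (0, 1), strictly monotonically."""
    return 1 / (1 + math.exp(-x))

def play_round(a: float, b: float) -> bool:
    """One round of two-number Googol; returns True if we guess correctly."""
    # We're handed one of the two straws uniformly at random.
    in_hand, other = (a, b) if random.random() < 0.5 else (b, a)
    # Draw T ~ Unif(0, 1); guess "larger" exactly when T < f(in_hand).
    guess_larger = random.random() < sigmoid(in_hand)
    return guess_larger == (in_hand > other)

a, b = 1.2, 2.5          # hypothetical straw lengths chosen by Player 1
trials = 200_000
wins = sum(play_round(a, b) for _ in range(trials))
print(f"empirical win rate: {wins / trials:.3f}")
# theory: 1/2 + (sigmoid(2.5) - sigmoid(1.2)) / 2 ≈ 0.578 > 0.5
```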

Cover leaves as an exercise to the reader whether this solution generalizes to the secretary problem as well, which has resulted in some fascinating research into adjacent areas of study [3] as well as general strategies for maximizing the likelihood of winning an instance of Googol from either player's perspective [4].

Secretary Problems

For no reason in particular, I have been thinking about secretary problems (though I didn't know them by that name, shout out to this poor guy [5]), in which a hiring manager interviews a sequence of candidates and must determine a policy governing when to stop and settle, with the added constraints that:

  1. The manager is unable to evaluate future candidates until making a decision about the current candidate,
  2. Previous candidates cannot and should not be contacted lest the hiring manager lose whatever semblance of face he might still maintain.

Real-world examples of this problem abound in any sort of market: labor, housing, romantic, etc., wherein we must exercise some degree of executive function about when to cease spending resources evaluating and get busy realizing the value of whatever asset is under consideration.

In general, the solution to the secretary problem indicates both:

  1. that we ought to evaluate roughly $1/e \approx 0.37 \approx 37\%$ of the estimated candidate pool before selecting the next candidate we encounter which outperforms all previous candidates, and
  2. that we will encounter the best possible candidate with probability $1/e$ using this policy (both claims are borne out by the quick simulation below).
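Before we derive anything, here's a quick sanity check: a Python sketch of the policy with an arbitrary pool size and trial count (both just illustrative choices). It rejects roughly the first $n/e$ candidates, then takes the first subsequent candidate who beats all of them, falling back to the last candidate if nobody does:

```python
import math
import random

def simulate_once(n: int, k: int) -> bool:
    """Run one hiring sequence; return True if we end up with the overall best."""
    ranks = list(range(n))                   # larger value = better candidate
    random.shuffle(ranks)
    threshold = max(ranks[:k], default=-1)   # best among the k rejected
    for candidate in ranks[k:-1]:
        if candidate > threshold:            # first one to beat every reject
            return candidate == n - 1
    return ranks[-1] == n - 1                # nobody qualified: take the last

n, trials = 1000, 100_000
k = round(n / math.e)
wins = sum(simulate_once(n, k) for _ in range(trials))
print(f"k = {k}, P(best) ≈ {wins / trials:.3f}")   # prints ≈ 0.37
```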

We arrive at this peculiar occurrence of Euler's transcendental constant by finding the probability of selecting one of $n$ possible candidates and that the chosen candidate is the best. The probability of optimal selection, then, is given by the product of the probability that the absolute best option is encountered at some index $j$ (which occurs with probability $1/n$) and the probability that that candidate is selected, given that it is found at index $j$. Of course, a priori we have no idea how many candidates ought to be evaluated, nor should we suspect (unless we're afflicted with the $e$-tism) that $e$ would surface, so we'll start with some arbitrary splitting point $k$ and levy calculus to find the optimal value for this point, after which we should transition from exploration to exploitation.

$$i: \underbrace{1, 2, 3, \cdots, k}_{\text{reject}}\quad\underbrace{k+1, k+2, \cdots, n}_{\text{candidates}}$$

We'll denote the probability of selecting the best candidate $p(k)$ and define it as the sum, over each candidate $i$, of the probability of $i$ being chosen times the probability that $i$ is the best possible choice:

$$p(k) = \sum^n_{i=1} p(\text{chosen})\,p(\text{best | chosen})$$

Note that for the first $k$ applicants, which we're uniformly rejecting, the resultant term will be 0, so it makes sense to split this sum about the partition index $k$:

$$\begin{aligned} p(k) &= \sum^k_{i=1} p(\text{chosen})\,p(\text{best | chosen}) + \sum^n_{i=k+1} p(\text{chosen})\,p(\text{best | chosen}) \\ &= 0 + \sum^n_{i=k+1} p(\text{chosen})\,p(\text{best | chosen}) \\ &= 0 + \sum^n_{i=k+1} \frac{1}{n} \cdot p(\text{best | chosen}) \end{aligned}$$

For the less-trivial $p(\text{best | chosen})$ term, we have some cases:

$$p(\text{best | chosen}) = \begin{cases} k+1 = \text{best} \\ k+1 \neq \text{best} \end{cases}$$

First, suppose that the best candidate is the $(k+1)$th element in the sequence. Then we'll select it with probability $1$ since it satisfies the criteria, namely: being better than all the rest. Otherwise, $k+1$ is not the best and is hopefully somewhere further on in the sequence. However, there is a possibility that both of the following are true:

  1. that $k+1 \succ j \quad \forall j \in [1..k]$,
  2. AND that $k+1$ is not the actual best of all $n$ candidates (e.g., $k+2 \succ k+1$).

That is: the first candidate we consider after rejecting the first $k$ elements is better than all those $k$ rejects, but $k+1$ is not the supremum of the $n$ possible candidates, so we mistakenly settle for some good candidate that may not be the best. It's hard to quantify how likely it is for this to happen, but we can more easily devise an expression for the counterfactual: the probability that we don't mistakenly stop at candidate $k+1$, which is given by:

$$p(\text{best | chosen}) = \begin{cases} k+1 = \text{best} \implies 1 \\ k+1 \neq \text{best} \implies 1 - \frac{1}{k+1} = \frac{k}{k+1} \end{cases}$$

And this expression generalizes to any position in the sequence: if the best candidate sits at position $i > k$, we end up selecting it precisely when the best of the preceding $i-1$ candidates falls among the first $k$ rejects, which happens with probability $\frac{k}{i-1}$. So our summation expands to:

$$\begin{aligned} p(k) &= \sum^k_{i=1} p(\text{chosen})\,p(\text{best | chosen}) + \sum^n_{i=k+1} p(\text{chosen})\,p(\text{best | chosen}) \\ &= \Bigg(\frac{1}{n}\cdot 0 + \frac{1}{n}\cdot 0 + \cdots \Bigg) + \Bigg(\frac{1}{n}\cdot 1 + \frac{1}{n}\cdot \frac{k}{k+1} + \frac{1}{n}\cdot \frac{k}{k+2} + \cdots + \frac{1}{n}\cdot \frac{k}{n-1} \Bigg) \\ &= 0 + \Bigg(\frac{1}{n}\cdot 1 + \frac{1}{n}\cdot \frac{k}{k+1} + \frac{1}{n}\cdot \frac{k}{k+2} + \cdots + \frac{1}{n}\cdot \frac{k}{n-1} \Bigg) \\ &= \frac{k}{n} \Bigg(\frac{1}{k} + \frac{1}{k+1} + \frac{1}{k+2} + \cdots + \frac{1}{n-1} \Bigg) \end{aligned}$$

the inner part of which roughly approximates the area under $\frac{1}{x}$. Now, if we treat this inner part as a discrete probability density function, where the likelihood of each index $i$ being the optimal stopping point (having rejected the first $k$ of $n$ total candidates) is given by the area under the curve, we can see that we get pretty bad approximations of that area outside of the limit.

We can compute this area by taking the integral from our splitting point $k$ to $n$:

$$\begin{aligned} f(x) &= \frac{1}{x} \\ \implies \int_k^n f(x)\,dx &= \int_k^n \frac{1}{x}\,dx \\ &= \ln x \,\Big\vert^n_k = \ln n - \ln k \\ &= \ln \frac{n}{k} \end{aligned}$$

We'll substitute this value back inside the parentheses of our expanded summation and call this whole approximation $\tilde p$:

$$\tilde p(k) = \frac{k}{n}\ln \frac{n}{k}$$

We'll make a notational substitution $x = \frac{k}{n}$:

$$\tilde p(x) = x \ln \frac{1}{x} = -x \ln x$$

Now it's not unreasonable to resort to our intuition to imagine that it's unlikely that we'll have encountered the global supremum of our sequence at position $k+1$, and also that after we pass the midpoint between $k+1$ and $n$, it becomes increasingly likely that we have passed the trophy wife, err I mean the best candidate, and so the shape of this optimal stopping policy approximation function will resemble some sort of parabolic curve on $k$:

![](/static/images/statistical-geometry/Geometric Statistics/graph-2.png)

To find this optimum, we differentiate $\tilde p$ with respect to $x$ via the product rule, set the resulting expression equal to zero, and solve for $x$ before re-substituting in our definition of $x$:

$$\begin{aligned} \tilde p(x) &= -x \ln x \\ \tilde p'(x) &= -\ln x - x\cdot\frac{1}{x} \\ &= -\ln x - 1 \\ 0 &= -\ln x - 1 \\ \ln x &= -1 \\ x &= \frac{1}{e} \\ \frac{k}{n} &= \frac{1}{e} \end{aligned}$$

This tells us that a good value of $k$ for arbitrarily large $n$ (excluding the trivial cases of 1, 2, and sometimes 3) is $0.37n$, and furthermore that choosing the next candidate who outperforms all of the first $0.37n$ rejected candidates will result in picking the optimal candidate with probability $\approx 0.37$.
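As a sanity check on the approximation, we can also evaluate the exact sum $p(k) = \frac{k}{n}\sum_{i=k}^{n-1}\frac{1}{i}$ from earlier and pick the maximizing $k$ directly; the pool size below is just an example:

```python
from fractions import Fraction

def p_exact(k: int, n: int) -> Fraction:
    """Exact p(k) = (k/n) * (1/k + 1/(k+1) + ... + 1/(n-1))."""
    if k == 0:
        return Fraction(1, n)   # rejecting nobody: take the first candidate
    return Fraction(k, n) * sum(Fraction(1, i) for i in range(k, n))

n = 100
best_k = max(range(n), key=lambda k: p_exact(k, n))
print(best_k, float(p_exact(best_k, n)))
# 37 0.371... — and n/e ≈ 36.8, so the 1/e rule lands on the exact optimum
```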


For $n$ choices, we have a $1/n$ chance of selecting the best candidate at random, so as $n$ approaches infinity, our chances of random selection go to zero, duh:

$$\lim_{n\rightarrow \infty} \frac{1}{n} = 0$$

Outside of the limit there still exists a tradeoff: the more samples we gather, the less likely it is that we are going to get stuck with a suboptimal candidate, but the more likely we are to accidentally sample the supremum during the calibration phase (thus sticking us with the last candidate).

Alternatively, we can model this as:

$$p(x) = x\int_x^1\frac{1}{t}\,dt$$

where $x$ is the proportion of the population that we sample before earnestly seeking our "candidate" and $\frac{1}{t}$ can be thought of as the probability of any individual candidate being the best one that we've seen so far, which we take to be uniformly random across the sequence.

Integrating, then differentiating to find the optimum yields the same result of $e^{-1}$ as before, implying that roughly 37% of the time we'll pick the best possible candidate, 37% of the time we'll accidentally sample the best candidate during the initial calibration phase, and the remaining 26.4% of the time we'll end up with something pretty good.
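Spelling that step out (this is the same calculation as before, just stated directly in terms of $x$):

$$p(x) = x\int_x^1\frac{1}{t}\,dt = x\big(\ln 1 - \ln x\big) = -x\ln x, \qquad p'(x) = -\ln x - 1 = 0 \implies x = e^{-1}$$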

Furthermore, it's worth noting that we can tune the distribution of these outcomes by loosening the requirement of finding the best possible candidate in favor of not getting stuck with the last candidate, by decreasing our sampling fraction $x = k/n$ down from $e^{-1}$. For example, for $x = 0.2$, we'll find the best candidate only about 32% of the time, but get stuck with the last candidate only 20% of the time, which feels sensible in many regards.
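To see how the outcome distribution shifts, here's a small extension of the earlier simulation sketch (again with illustrative values for the pool size and trial count) that tallies the three outcomes for the $e^{-1}$ rule and for a more conservative 20% calibration phase:

```python
import math
import random

def outcome(n: int, k: int) -> str:
    """Classify one run: 'best', 'settled' (good but not best), or 'last'."""
    ranks = list(range(n))
    random.shuffle(ranks)
    threshold = max(ranks[:k], default=-1)
    for candidate in ranks[k:-1]:
        if candidate > threshold:
            return "best" if candidate == n - 1 else "settled"
    return "best" if ranks[-1] == n - 1 else "last"   # forced to take the last

n, trials = 1000, 50_000
for frac in (1 / math.e, 0.2):
    counts = {"best": 0, "settled": 0, "last": 0}
    for _ in range(trials):
        counts[outcome(n, round(frac * n))] += 1
    print(round(frac, 3), {key: round(v / trials, 2) for key, v in counts.items()})
# 0.368 → roughly {'best': 0.37, 'settled': 0.26, 'last': 0.37}
# 0.2   → roughly {'best': 0.32, 'settled': 0.48, 'last': 0.20}
```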

Secretary Problem with Insufficient Priors

Conclusion

really the title of this post is just a setup for the next post.

Footnotes

  1. Cover, Thomas. "Pick the Largest Number." Chapter in *Open Problems in Communication and Computation*. Springer-Verlag, 1987.

  2. Gardner, Martin. "Mathematical Games." Scientific American, Vol. 202, No. 2, 1960.

  3. Samuels, Stephen. "Sufficiently Noninformative Priors for the Secretary Problem." Purdue University, 1994.

  4. Gnedin, Alexander. "Guess the Larger Number." Polish Digital Mathematics Library, 2016. 10.14708/ma.v44i1.1205.

  5. robb. "The Secretary Problem Explained: Dating Mathematically." rs.io, 2014.