Geometric Statistics


Introduction

This post is motivated by two others. The first is an @littmath-esque piece of rage bait responding to the medical statistician and associate professor Maarten van Smeden, who dared to ask:

Challenge: explain the degrees of freedom in one tweet

This post went "mini-viral" on math twitter a few years back, which is to say it garnered triple digits of interaction at the time, with the most infuriatingly concise answer coming from the Yale PhD and now-economics professor at UBC Vadim Marmer, who responded simply:

The rank of the projection matrix inside the quadratic form in the definition of a statistic.

The second post was an asinine attempt at quantifying morality as a vector space on r/catholicism. I will defer treatment of the latter post until we've sufficiently digested the first, which will be no small task.

Two Dimensional Statistics

To work towards the above definition of degrees of freedom, we will build up a rudimentary understanding of the relevant statistical definitions. Beginning with a conventional, rote definition of degrees of freedom we might have encountered in high school statistics, suppose we have a random variable $X$ with observations measured to be $X_1, X_2, ..., X_n$. We can plot the observations as points on a line like so:

If we take just two of these points, the ones highlighted in green which we'll say are located at positions $X_1 = 2$ and $X_2 = 1$ for convenience, we can visualize this total measurement as a single point in 2D space, or alternatively a vector in the same space:

And since $X$ is a random variable, we'll attain a different value each time we sample it, corresponding to different point(s) and thus different vectors. We'll call the corresponding vector formed by taking two samples from our random variable a random vector.

Under this interpretation, we can intuit that degrees of freedom refers to the number of dimensions of space that this vector can land in across different samples. When constructed from two samples, the resultant random vector will always fall in two-dimensional space and since those points are free to vary across samples, the random vector is said to have two degrees of freedom. To formalize this notion a bit more clearly, we'll call our collection of samples our data vector which can be expressed as

$$X = \begin{bmatrix}X_1 \\ X_2\end{bmatrix}$$

The sample mean of our data vector (comprised of samples) is denoted $\bar X$, and we can incorporate that into our expression of the data vector to decompose it into additional quantities which will be valuable for our dissection of the term "degrees of freedom":

$$\underbrace{X = \begin{bmatrix}X_1 \\ X_2\end{bmatrix}}_{\color{green}\text{data vector}} = \begin{bmatrix}X_1 - \bar X + \bar X \\ X_2 - \bar X + \bar X\end{bmatrix} = \underbrace{\begin{bmatrix}X_1 - \bar X \\ X_2 - \bar X \end{bmatrix}}_{\color{red}\text{residual vector}} + \underbrace{\begin{bmatrix}\bar X \\ \bar X \end{bmatrix}}_{\color{blue}\text{sample mean}}$$

The residual vector is what's left over after subtracting the sample mean from each component. By decomposing our original random variable with two degrees of freedom into two separate random variables, we disjoin the degrees of freedom – we get one degree of freedom per instance of the sample mean $\bar X$. Why is that?

If we plot each of these decomposed expressions of $X$, we can observe some useful properties of how they behave:

Note that the sample mean vector ${\color{blue}\bar X}$ will always be a scalar multiple of the unit vector $\bar 1$ (in 2D, $\bar 1 = [1 \; 1]^\top$); in other words, it is not free to be just anywhere on the plane. Therefore it has a single degree of freedom.

Similarly, the residual vector ${\color{red}X - \bar X}$ always falls on a specific line (parallel to the line connecting the sample mean vector to the data vector) because its components, by definition of the mean above, always sum to zero no matter what the data points are: the residuals must cancel each other out because the sample mean lies in the middle of the data.

We can recover this relationship between the data vector and the sample mean from the textbook definition of the arithmetic mean:

$${\color{blue} \bar X} = \frac{1}{n}\sum_i {\color{green} X_i} \implies n{\color{blue} \bar X} =\sum_i {\color{green} X_i}$$

The sum of all the residuals is given by:

$$\sum_i ({\color{green} X_i} - {\color{blue} \bar X})$$

Substituting in our identity for $n{\color{blue} \bar X}$, and noting that since the sample mean doesn't change with the index, the summation of a constant is just $n$ instances of that constant, we can express the sum of the residuals as:

$$\sum_i ({\color{green} X_i} - {\color{blue} \bar X}) = \sum_i {\color{green} X_i} - n{\color{blue} \bar X} = n{\color{blue} \bar X} - n{\color{blue} \bar X} = 0$$

Therefore, the residual vector falls somewhere on the line spanned by $[1 \; -1]^\top$, where the vertical coordinate is the negation of the horizontal. Thus, the residual vector also only has a single degree of freedom. This is maybe-convenient, because it means that if we're given one of the residuals, we can instantly compute the other (or, in general, given $n-1$ residuals, we can compute the $n$th residual).
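As a quick sanity check, here's a minimal numpy sketch (the observations are made up) of that last point: the residuals always sum to zero, so the final residual is fully determined by the other $n-1$:

```python
import numpy as np

# two made-up observations, matching the example above
X = np.array([2.0, 1.0])
X_bar = X.mean()              # sample mean
residuals = X - X_bar         # residual vector

print(residuals)              # [ 0.5 -0.5] -- lies along the [1, -1] direction
print(residuals.sum())        # 0.0
print(-residuals[:-1].sum())  # recovers the last residual from the first n-1
```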

And these two directions (sample mean, residual) span our two-dimensional space, so we can construct any data vector from some combination of mean and residual vectors. It's worth noting at this time that these properties of degrees of freedom apply only to the observations comprising the sample mean, not the actual mean of the distribution that the observations came from. We denote the actual mean:

$${\color{LightSeaGreen}\mu} = \mathbb E[{\color{green} X}]$$

which may not be intuitive at first since it seems like we ought to be able to perform the same computation just with the population mean ${\color{LightSeaGreen}\mu}$ instead of the sample mean ${\color{blue}\bar X}$. Let's trace out why this fails by first taking some observations ${\color{green}X}$, then adding and subtracting the expected value ${\color{LightSeaGreen}\mu}$, then splitting the vector in two as we did before:

$$\begin{aligned} {\color{green}X} = \begin{bmatrix}X_1 - {\color{LightSeaGreen}\mu} + {\color{LightSeaGreen}\mu} \\ X_2 - {\color{LightSeaGreen}\mu} + {\color{LightSeaGreen}\mu}\end{bmatrix} &= \begin{bmatrix}X_1 - {\color{LightSeaGreen}\mu} \\ X_2 - {\color{LightSeaGreen}\mu} \end{bmatrix} + \begin{bmatrix}{\color{LightSeaGreen}\mu} \\ {\color{LightSeaGreen}\mu} \end{bmatrix} \\ \\ &= \underbrace{\begin{bmatrix}X_1 - {\color{LightSeaGreen}\mu} \\ X_2 - {\color{LightSeaGreen}\mu} \end{bmatrix}}_{\color{orange}\text{error vector}} + \underbrace{{\color{LightSeaGreen}\mu} \begin{bmatrix}1 \\ 1\end{bmatrix}}_{\color{LightSeaGreen}\text{expected value vector}} \end{aligned}$$

Again the data vector has two degrees of freedom, but they are inherited entirely from the error vector, for the expected value vector is a scalar multiple of the unit vector and thus has zero degrees of freedom. This is because the ${\color{LightSeaGreen}\mu}$ vector always points to the exact same spot, regardless of the points sampled. Visually, this looks like:

and we can see how, if ${\color{LightSeaGreen}\mu}$ is fixed, and the data vector can fall anywhere in 2D space based on the samples from the underlying distribution, then the error vector must also have two degrees of freedom in order to relate the data vector to the expected value vector. As "payment" for this degree of freedom in the error vector, though, we lose the property that errors must sum to zero (which was the case with the residuals) because the data vector ${\color{green}X}$ may not be centered around ${\color{LightSeaGreen}\mu}$, whereas it is by definition centered around its mean ${\color{blue} \bar X}$.

Three Dimensional Statistics

Time to break da brain as we try to see how this generalizes to higher dimensions. For a data vector consisting of now three samples

we can plot the resultant vector in 3D:

Once more the sample mean vector ${\color{blue} \bar X}$ is a scalar multiple of the unit vector:

$$\bar 1 = \begin{bmatrix} 1 \\ 1 \\ 1\end{bmatrix}$$

and we can decompose the data vector accordingly:

$${\color{green} X} = \begin{bmatrix} X_1 - \bar X + \bar X \\ X_2 - \bar X + \bar X \\ X_3 - \bar X + \bar X \\ \end{bmatrix} = {\color{red}\begin{bmatrix} X_1 - \bar X \\ X_2 - \bar X \\ X_3 - \bar X \\ \end{bmatrix}} + {\color{blue}\bar X \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}}$$

Again the sample mean vector has a single degree of freedom, and the residual vector contributes the other two degrees of freedom since its components must still sum to zero. Note that it is not the case that all points sum to zero, just those on the plane formed by $x + y + z = 0$, which is orthogonal to the sample mean vector. Just like before, if we exchange the sample mean for the expected value, we can define two other terms (an error vector with three degrees of freedom and an expected value vector with none).

These relationships hold for an arbitrary number of dimensions (determined by the number of observations sampled from the underlying random variable): a data vector constructed from $n$ samples will have $n$ degrees of freedom, one of which can be attributed to the sample mean vector, and the remaining $n-1$ coming from the residuals.

And herein lies the rub explaining why we "divide by $n-1$" in the textbook definition of the sample variance of a distribution, which is defined:

$$\begin{aligned} s^2 &= \frac{1}{n-1}\sum_i (X_i - \bar X)^2 \\ \\ \mathbb E[s^2] &= \sigma^2 \end{aligned}$$

Bessel's Correction

To make sense of the above diagram, let's examine the definition of sample variance $s^2$. To compute this quantity, we take every sampled point $X_i$ and find the distance between it and the sample mean ${\color{blue} \bar X}$, square this distance, and take the average:

$$s^2 = \frac{1}{n-1}\sum_i (X_i - \bar X)^2$$

The astute reader now wonders why we're dividing by $n-1$ rather than $n$, and the high school statistics answer usually involves a hand-wavey invocation of "degrees of freedom" which is insufficient for this post. In this section, we'll begin to develop a geometric intuition for our vectors to understand where this disparity arises.
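Before the geometry, a quick empirical teaser; this is a minimal Monte Carlo sketch (numpy, with an arbitrary normal distribution and sample size) showing that dividing the squared residuals by $n$ systematically undershoots the true variance, while dividing by $n-1$ nails it on average:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 5, 200_000, 4.0          # arbitrary choices

samples = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=(trials, n))
residuals = samples - samples.mean(axis=1, keepdims=True)
ss = (residuals ** 2).sum(axis=1)            # squared length of each residual vector

print(ss.mean() / n)        # ~3.2 -- biased low by a factor of (n-1)/n
print(ss.mean() / (n - 1))  # ~4.0 -- matches the true sigma^2
```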

First, it will prove instructive to understand why a right triangle is formed between the residual, error, and expected value vectors:

Recall that we assume a random variable $X$ from which we draw $n$ i.i.d. samples $X_1, X_2, ..., X_n$ where $\mu = \mathbb E[X]$ and $\mathrm{Var}(X) = \sigma^2$. The question is: if we didn't know the parameters $\mu, \sigma^2$ of the underlying distribution, how could we figure them out using only our samples? I.e. how can we estimate those parameters using only our samples? For the two-dimensional case we've been focusing on, the key to understanding comes from the visualization where we can see the right triangle emerge.

We know that the data vector can fall anywhere on the plane, and that the true value of $\mu$ is unknown, leaving us with equivalently infinite choices for an estimate $\hat \mu$... Suppose we did know $\mu$, but didn't know $X$, and were shown two candidate data vectors:

Ceteris paribus, it would be fair to assume that $X_1$ is more likely to have come from the distribution with expected value $\mu$ simply by the fact that it is closer to $\mu$. We can invert this reasoning to see how we might be able to improve our estimate $\hat \mu$ of the true expected value. Suppose this time that we did know the data vector $X$ and were tasked with picking which expected value estimate is more likely to correspond to the distribution from which $X$ was sampled:

For this arrangement, $\mu_1$ is clearly the better choice, again by virtue of its proximity to the only data samples we have which tell us anything about the distribution. So, if we take proximity between data and estimated mean to be an indication of accuracy, then it makes sense that we should pick the expected value estimator $\hat \mu$ (constrained to lie along $\bar 1$) closest to our data vector. This selection criterion minimizes the quantity $|{\color{red} X- \hat\mu}|$, which is smallest when the angle between $\hat\mu$ and ${\color{red} X- \hat\mu}$ is 90º.

To measure the length (or magnitude) of ${\color{red} X- \hat\mu}$, we use the Pythagorean theorem, decomposing the vector into its constituent parts:

$$|X| = \sqrt{X_1^2 + X_2^2}$$

And for algebraic peace of mind as well as efficiency, we'll defer the computation of the square root till the last moment (a thing I've decided to call JITSQRT, not to be confused with what your Floridian homies make each other do) and work in terms of squared lengths:

$$|X|^2 = X_1^2 + X_2^2$$

which is the Pythagorean theorem. It's worth noting at this point, if it's not obvious, that the Pythagorean theorem holds in $n$-dimensional space, not just two-dimensional space, so for an arbitrary vector we can write the squared length as

$$|X|^2 = \sum_i X_i^2$$

So now our goal is to minimize the squared distance which might start to sound familiar in ~the realm of statistics.

$$\begin{aligned} |{\color{blue}\hat\mu}|^2 + |{\color{red}X-\hat\mu}|^2 &= |{\color{green}X}|^2 \\ \\ \implies \sum{\color{blue}\hat\mu}^2 + \sum({\color{red}X_i-\hat\mu})^2 &= \sum {\color{green}X_i}^2 \\ \\ \sum{\color{blue}\hat\mu}^2 + \sum({\color{red}X_i^2 -2\hat\mu X_i + \hat\mu^2}) &= \sum {\color{green}X_i}^2 \\ \\ \sum 2\hat\mu^2 - \sum 2\hat\mu X_i &= 0 \\ \\ \sum \hat\mu^2 &= \hat\mu \sum X_i \\ \\ n\hat\mu^2 &= \hat\mu \sum X_i \\ \\ n\hat\mu &= \sum X_i \\ \\ \hat\mu &= \frac{1}{n}\sum X_i \\ \\ &= {\color{blue}\bar X} \end{aligned}$$

Therefore, by recovering the sample mean from our Pythagorean expansion, we arrive at the conclusion that the best estimation strategy for $\mu$ is the Least Squared Error estimator.
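The same conclusion falls out numerically; here's a tiny sketch (made-up observations, a brute-force grid of candidate $\hat\mu$ values) showing the squared length $|X - \hat\mu \bar 1|^2$ bottoming out at the sample mean:

```python
import numpy as np

X = np.array([2.0, 1.0, 4.0])                     # made-up observations
candidates = np.linspace(X.min(), X.max(), 1001)  # brute-force grid of mu-hat values

# squared length of X - mu_hat * [1, ..., 1] for each candidate
sq_len = ((X[None, :] - candidates[:, None]) ** 2).sum(axis=1)

best = candidates[sq_len.argmin()]
print(best, X.mean())   # ~2.332 vs 2.333... -- the least-squares pick lands on the sample mean
```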

And this all still holds in higher dimensions when we include more than two samples. All that changes is that the residual vector becomes free to rotate about the sample mean vector ${\color{blue}\bar X}$ in those higher dimensions.

Let's see how we can continue to leverage Pythagoras to estimate the variance $\sigma^2$ using our data vector. Suppose we have some random vector $X$ for which we know $\sigma^2$ to be small (the data is tightly grouped, for some definition of "tight" and "group"). Following the same pattern as before: given two candidate observations, we'd take the observation closer to the sample mean to be the data point more likely to be a member of $X$. Again inverting this reasoning, if we find an observation $X$ to be far away from the sample mean, or estimated expected value, or actual expected value (any of these north star quantities), then we should conclude that $\sigma^2$ is larger than if $X$ were nearer to the mean-ish quantity.

Referring back to our taxonomy diagram:

we can infer that the larger the magnitude of the error vector, the larger the variance of our underlying distribution is. But oftentimes we don't know the expected value of the distribution $\mu$, which is sort of the whole point...

${\color{red} X-\bar X}$ is a natural estimator to use for the error vector, using ${\color{blue}\bar X}$ as our estimate of ${\color{LightSeaGreen}\mu}$. The residual and error vectors share the desirable characteristic that they are both small whenever ${\color{green}X}$ is near ${\color{LightSeaGreen}\mu}$ which, in turn, is more likely whenever $\sigma^2$ is small. Keep in mind though that it is entirely possible to have a small residual vector and a large error vector if our $\hat\mu$ estimate sucks:

So, the lengths of the error and residual vectors are both related to the variance of the distribution, but for a few reasons the length itself is not equal to the variance:

  1. Variance is related to squared length
  2. Length varies from sample to sample, but variance is constant for a fixed distribution
  3. Length is dependent on dimension, but variance is not.

Recall that the number of dimensions a vector can inhabit is equivalent to its degrees of freedom. So, though not true in general (the trivial counterexample is the zero vector $\mathbf {\bar 0}$ in $\mathbb R^n$ as $n \rightarrow \infty$), a good rule of thumb is that the expected squared length of a random vector is proportional to its degrees of freedom. In other words, variance does not correspond to the total squared length, which scales with dimension, but rather to the squared length averaged over the number of dimensions.
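Here's a small sketch of that scaling (numpy again; the normal distribution and its parameters are arbitrary choices): the average squared length of the error vector grows linearly with $n$, while the squared length per dimension stays pinned near $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, trials = 0.0, 4.0, 100_000            # arbitrary choices

for n in (2, 10, 100):
    X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
    mean_sq_len = ((X - mu) ** 2).sum(axis=1).mean()   # estimate of E[|X - mu|^2]
    print(n, round(mean_sq_len, 1), round(mean_sq_len / n, 2))
    # squared length grows like 4 * n, but per-dimension it hovers around 4.0
```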


And now it is here that we can address the stupid-ass r/catholicism post. The average post there tends to be the most neurotic perversion of moral teaching you've ever encountered, e.g. quantifying the sinfulness of any given sexual act and mapping it somewhere onto a vector space. The claim is:

Sodomy would be worse because it's not just a sin of lust but also an affront to the divine order of creation. In other words, neither an anus nor a mouth are meant to be sex organs.

One might visualize this as a vector in $N$-space, where the length of the vector (magnitude of the sin) is influenced by all $N$ dimensions. In this case, fornication is a vector only along one axis $(1, 0, 0)$, and so is shorter than sodomy, which is a vector to the point $(1,1,0)$, if we define three axes as 'lust', 'unnaturalness', and 'interpersonal violence' [lol]. Heterosexual rape would then be a $(1, 0, 1)$ – equal in magnitude to sodomy – and homosexual rape a $(1,1,1)$, worse than all of the above. You can add more dimensions if you want to look at other factors (for example, define a fourth dimension 'unitive ends', and masturbation becomes a $(1,1,0,1)$ sin, since it involves lust, abuse of the sex organs, and personal isolation, but no interpersonal violence), though practically all we really need to know is that we should try to stay as close to the non-sinful origin point $(0, 0, 0, ..., 0)$ as possible.

This is primarily wrong because of the misguided effort to map moral actions into a vector space. What follows from this is a mistaken or lazy attempt to use comparable quantities here. We invented algebra to allow us to make these analogies without immediately opening ourselves to being broadsided with the moral equivocation of any rape and any onanist action being mapped into fucking vector space with equivalent magnitude.

This is an obviously broken model of sin. Consider a sin $\bar s$ in a vector space of unknowably many dimensions, a sin which is uniformly and arbitrarily small in all directions. Since it is non-zero along every axis of sin-space, it gets a squared length (insinuated to be proportional to moral impermissibility by OP) of $|\bar s|^2$. Now, crucially, $\bar s$ is fixed (its components cannot have lengths that depend on anything other than the degree of objectionability along each axis): each component can be arbitrarily close to zero, but once its values are set, they cannot move any closer to the holy origin $\bar 0$, and with infinitely many dimensions contributing, the sin will have infinite magnitude.

$$|\bar s|^2 = \lim_{n \to \infty} \sum_{i=1}^{n} s_i^2$$

The sum is divergent for any non-zero, fixed action – meaning that our sin which is as arbitrarily inoffensive as we like, as close to purity in all dimensions as we want, is nevertheless infinitely bad under this jackass's model.

Additionally, the notion that we ought to stay as close to the holy origin as possible is stupid because it reduces to the conclusion that "the only way to win is not to play," which is still sinful under its own model given some axes of talent-burial and withdrawal from society, which is as easily construed to be "interpersonal violence" as anything we may label along any other axis of bigotry. Inaction is just as bad as action in many cases. The point being that trying to form a theodicy with linear algebra is not only stupid, but its own form of intellectual, shibbolethic sin.

Milan Kundera speaks of this, Greg Egan speaks of this, everyone with a brain and a soul knows this. I don't understand what depraved band of mathematical IQ you have to fall within in order to be able to think about morality this way without also seeing the obvious counterexample. Seems like a hell of its own to think you're smart but actually be that much of a moron.


To our taxonomic triangle of vectors, we can also add the mean deviation vector, which I've shifted away from the ${\color{LightSeaGreen}\mu}$ vector it is actually coincident with.

We now want some statistic which, on average, should give us the true value of the variance parameter we're trying to estimate: $\mathbb E[???] = \sigma^2$. Such a statistic would be called an unbiased estimator, and to find an unbiased estimator for $\sigma^2$, we consider the squared length of each component of the triangle and then take the expected value of that.

Previously, we claimed that the vectors forming the shaded triangle have lengths related to the underlying distribution variance $\sigma^2$. This is good since it means that we can use them to estimate it. However, we just saw that the expected squared length of any non-zero vector in arbitrarily high-dimensional space is also proportional to the number of dimensions it can inhabit, so we must correct for this scaling. To accomplish this, we divide the expected squared length of each vector by the degrees of freedom associated with it. And, we know that although the error vector is free to land anywhere in the $n$-dimensional space, the residual is bound by ${\color{blue} \bar X}$ and thus has $n-1$ degrees of freedom.

Compare:

$$X = \underbrace{\begin{bmatrix} X_1 - \mu \\ X_2 - \mu \\ X_3 - \mu \\ \end{bmatrix}}_{n \text{ df}} + \underbrace{\mu \begin{bmatrix} 1 \\ 1\\ 1 \end{bmatrix}}_{0 \text{ df}}$$

vs.

$$X = \underbrace{\begin{bmatrix} X_1 - \bar X \\ X_2 - \bar X \\ X_3 - \bar X \\ \end{bmatrix}}_{n-1 \text{ df}} + \underbrace{\bar X \begin{bmatrix} 1 \\ 1\\ 1 \end{bmatrix}}_{1 \text{ df}}$$

Because the residual by definition sums to 0 under the constraint of the sample mean, and higher dimensional vectors will tend to be longer, the residual vector in its fewer dimensions will tend to be shorter than the error vector. This should make sense since the error vector is the hypotenuse of the shaded right triangle formed with the residual vector and the mean deviation vector, so the only case when the error vector would not be longer is if the sample mean coincides with the expected value vector, implying a mean deviation of 0 and a residual vector equivalent to the error vector.

This is where the $n-1$ comes from. Dividing by $n$ is fine for the error vector, but we don't ever actually know what the error vector is (otherwise all questions about statistics are moot, we have achieved omniscience, time to touch grass). Instead, we still have to estimate the error vector with the residual vector, which has one fewer degree of freedom. Division by $n$ would give us an estimator which is systematically too small, therefore we correct for this bias by dividing by a slightly smaller number: $n-1$.

We usually call this quantity (squared length of the residual vector, per degree of freedom) the sample variance, expressed:

$${\color{red} s^2} = \frac{1}{n-1}\sum_{i=1}^n({\color{red} X_i - \bar X})^2$$

Comparatively, the expected squared length of the error vector is given by:

$$\begin{aligned} \mathbb E[|{\color{orange} X - \mu}|^2] &= \mathbb E\Big[\sum_i(X_i - \mu)^2\Big] \\ &= \sum_i\mathbb E[(X_i - \mu)^2] \\ &= \sum_i \sigma^2 = n\sigma^2 \end{aligned}$$

So, the expected value of the squared length of the error vector is proportional to $\sigma^2$ and the number of dimensions we're working in. If we were to calculate $\sigma^2$ using the true mean ${\color{LightSeaGreen}\mu}$, we would then want to divide by $n$ rather than $n-1$:

$$\sigma^2 = \mathbb E\Big[\frac{1}{n}\sum_i(X_i - \mu)^2\Big]$$

but again, we never know ${\color{LightSeaGreen}\mu}$, we just like to talk about it like we do. Turning to that purple vector from the taxonomy triangle, the mean deviation, we can quantify it as the expected squared difference between the sample mean and the true mean:

$$\begin{aligned} \mathbb E[|{\color{blue}\bar X} - {\color{LightSeaGreen}\mu}|^2] &= \sum \mathbb E\Big[({\color{blue}\bar X} - \mathbb E [{\color{blue}\bar X}])^2\Big] \\ \\ &= \sum \mathrm{Var}({\color{blue}\bar X}) = \sum\frac{\sigma^2}{n} = \frac{n\sigma^2}{n} = \sigma^2 \end{aligned}$$
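Another quick Monte Carlo sketch (arbitrary normal parameters) to back this up: the squared length of the mean deviation vector, which is just $n(\bar X - \mu)^2$, averages out to $\sigma^2$ no matter the sample size:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, trials = 1.0, 4.0, 200_000       # arbitrary choices

for n in (2, 5, 50):
    X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
    X_bar = X.mean(axis=1)
    sq_len = n * (X_bar - mu) ** 2           # the mean deviation vector has n identical entries
    print(n, round(sq_len.mean(), 2))        # ~4.0 every time, i.e. sigma^2
```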

So we can see that the squared length of the mean deviation vector is itself an unbiased estimator of $\sigma^2$, with just two problems for us to deal with:

  1. It only has a single degree of freedom because it's a scalar multiple of the expected value vector
  2. We still don't know ${\color{LightSeaGreen}\mu}$

Nevertheless, all together we can push these pieces into the Pythagorean theorem:

$$\begin{aligned} |{\color{red} X - \bar X}|^2 &= |X - {\color{LightSeaGreen}\mu}|^2 - |{\color{blue}\bar X} - {\color{LightSeaGreen}\mu}|^2 \\ \\ \mathbb E \Big[ |{\color{red} X - \bar X}|^2\Big] &= \mathbb E\Big[ |X - {\color{LightSeaGreen}\mu}|^2\Big] - \mathbb E\Big[|{\color{blue}\bar X} - {\color{LightSeaGreen}\mu}|^2\Big] \\ &= n\sigma^2 - \sigma^2 \\ &= (n-1)\sigma^2 \\ \\ \implies \mathbb E \Big[\underbrace{\frac{\sum_i({\color{green}X_i} - {\color{blue} \bar X})^2}{n-1}}_{\color{red}s^2} \Big] &= \sigma^2 \end{aligned}$$

If we now divide by the pivotal $n-1$ and substitute the components of the residual vector into the squared length, we fully recover the textbook formula for sample variance $s^2$.
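The Pythagorean bookkeeping above isn't only true in expectation; the decomposition $|X - \mu|^2 = |X - \bar X|^2 + |\bar X - \mu|^2$ holds exactly for every individual sample, as this little sketch (made-up numbers, pretending we know $\mu$) confirms:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.0                                     # pretend we know the true mean
X = rng.normal(mu, 2.0, size=7)              # made-up sample

X_bar = X.mean()
err_sq = ((X - mu) ** 2).sum()               # squared length of the error vector
res_sq = ((X - X_bar) ** 2).sum()            # squared length of the residual vector
dev_sq = len(X) * (X_bar - mu) ** 2          # squared length of the mean deviation vector

print(err_sq, res_sq + dev_sq)               # equal up to floating point error
```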

This takes us about a third of the way towards grokking (hate that word now, sad!) Marmer's definition. I'll spare you the hassle of scrolling back to the top – this is our end goal:

Degrees of Freedom is the rank of the projection matrix inside the quadratic form in the definition of a statistic.

Quadratic Forms

Thus far we've established that the degrees of freedom of a statistic correspond to the number of dimensions which a vector can inhabit across samples. We should now have a half-decent intuition for the relationships between the lengths of and angles between these vectors, as well as how we derive least squared error vectors from the arrangement of the fundamental vectors in a right triangle.

In this section, we'll begin to develop an understanding of what a quadratic form is and how quadratic forms will help us to algebraically grapple with vector lengths, which we've seen are already closely related to degrees of freedom. We'll begin with a brief description of what quadratic forms are in general, then we'll see how to express them in terms of matrix multiplications so as to play nicely with our working expressions of linear-algebraic terms. If we can express each of our terms as matrix multiplications, then we can move on to the third piece of the puzzle: the rank of the projection matrix therein.

In general, a quadratic form is just a polynomial where all terms are degree two, e.g.

$$f(x) = x^2; \quad f(x,y) = xy$$

And secretly we've been working with plenty of quadratic forms already:

| term | working expression | QF MM |
| --- | --- | --- |
| data vector | $\sum {\color{green} X_i}^2$ | ${\color{green}X}^\top {\color{green}X}$ |
| sample mean | $\sum {\color{blue} \bar X}^2$ | ${\color{blue}\bar X}^\top {\color{blue}\bar X}$ |
| expected value | $\sum {\color{LightSeaGreen}\mu}^2$ | ${\color{LightSeaGreen}\mu}^\top {\color{LightSeaGreen}\mu}$ |
| error vector | $\sum ({\color{orange}X - \mu})^2$ | $({\color{orange}X - \mu})^\top ({\color{orange}X - \mu})$ |
| residual vector | $\sum ({\color{red}X - \bar X})^2$ | $({\color{red}X - \bar X})^\top ({\color{red}X - \bar X})$ |

To visualize quadratic forms, we plot a multivariate function of $x, y$ into a third dimension $z$, for example:

We can parameterize this quadratic form by some value $a$ which will modulate the width of our parabolic surface, where:

$$z=ax^2 \begin{cases} a \rightarrow \infty &\implies \text{narrow upward} \\ a = 0 &\implies \text{a plane} \\ a \rightarrow -\infty &\implies \text{narrow downward} \\ \end{cases}$$

Similarly, a quadratic form of the form $z=by^2$ will bend in the $y$ direction, with its fold line running along the $x$ axis:

And we can combine both of these terms with $a = b = 1$, which produces a paraboloid opening about both axes; its cross sections will be circular when $a=b$, otherwise it will bend elliptically in either direction:

Since its cross sections are elliptical, we can talk about elliptical things such as major and minor axes. For example, when just one of the parameterizing constants $a$ or $b$ is negative, we get a surface which opens upward along the axis corresponding to the positive constant, and downward along the negative one:

E.g. for $b = -a$ we get the above saddle shape. Turns out that we have now exhausted all the shapes which can be built from two variables in three dimensions (parabola, bowl, saddle), and any combination of these shapes is guaranteed to produce another one of the existing shapes, just rotated about the $z$ axis by some amount.

The general form for any quadratic form in two variables (plotted in three dimensions) is:

$$z = ax^2 + by^2 + cxy$$
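If you want to poke at these shapes yourself, here's a minimal plotting sketch (numpy plus matplotlib assumed available; the coefficients are arbitrary) that evaluates the general form on a grid:

```python
import numpy as np
import matplotlib.pyplot as plt

a, b, c = 1.0, -1.0, 0.0        # try (1, 1, 0) for a bowl, (1, -1, 0) for a saddle
x, y = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
z = a * x**2 + b * y**2 + c * x * y

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(x, y, z, cmap="viridis")
plt.show()
```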

But dealing with easier-to-plot expressions like these becomes algebraically tedious, especially when considering quadratic forms in higher dimensions. Instead, we prefer to compactly express them as matrices.

Recall that for some vector $\bar v$, the following expressions are equivalent

$$\begin{aligned} \bar v^\top \bar v &= \bar v \cdot \bar v = |\bar v|^2 = |\bar v||\bar v|\cos \theta \\ \\ \bar u^\top \bar v &= \bar u \cdot \bar v = |\bar u||\bar v|\cos \theta \end{aligned}$$

And if the two vectors $\bar u$ and $\bar v$ are perpendicular to each other, then their dot product is zero:

$$\bar u \perp \bar v \implies \bar u^\top \bar v = 0$$

because $\cos 90º = 0$. For our purposes, matrix multiplication is a largely syntactical abbreviation to simplify operations and save the trees. For an exhaustive refresher about matrix multiplication, see any of the related posts on linear algebra.

Returning to our quadratic forms of interest, instead of referring to our two variables as $x$ and $y$, let's instead call them $X_1$ and $X_2$, which can be neatly collected into a data vector:

$$X = \begin{bmatrix}\colorbox{yellow}{$X_1$} \\ \colorbox{aqua}{$X_2$}\end{bmatrix}$$

Recall that

$$X^\top X = [\colorbox{yellow}{$X_1$} \; \colorbox{aqua}{$X_2$}]\begin{bmatrix}\colorbox{yellow}{$X_1$} \\ \colorbox{aqua}{$X_2$}\end{bmatrix} = \colorbox{yellow}{$X_1$}^2 + \colorbox{aqua}{$X_2$}^2 = \sum_i X_i^2$$

So the reflexive form $X^\top X$ is a concise way of expressing a quadratic form as a matrix product. In order to express any arbitrary quadratic form as a matrix multiplication, we first need to introduce another matrix into the product:

$$X^\top AX$$

where $A$ is a matrix of the form:

$$A = \begin{bmatrix} \colorbox{yellow}{$a$} & \colorbox{orange}{$c_1$} \\ \colorbox{orange}{$c_2$} & \colorbox{aqua}{$b$} \\ \end{bmatrix}$$

We use these entry labels for reasons that will become obvious momentarily. When we expand the aforementioned QF MM shorthand, we now get:

$$\begin{aligned} X^\top AX &= [\colorbox{yellow}{$X_1$} \; \colorbox{aqua}{$X_2$}]\begin{bmatrix} \colorbox{yellow}{$a$} & \colorbox{orange}{$c_1$} \\ \colorbox{orange}{$c_2$} & \colorbox{aqua}{$b$} \\ \end{bmatrix}\begin{bmatrix} \colorbox{yellow}{$X_1$} \\ \colorbox{aqua}{$X_2$} \\ \end{bmatrix} \\ &=\colorbox{yellow}{$aX_1$}^2 +\colorbox{orange}{$c_1$}\colorbox{yellow}{$X_1$}\colorbox{aqua}{$X_2$}+\colorbox{orange}{$c_2$}\colorbox{yellow}{$X_1$}\colorbox{aqua}{$X_2$}+\colorbox{aqua}{$bX_2$}^2 \\ &=\colorbox{yellow}{$aX_1$}^2 + (\colorbox{orange}{$c_1$} + \colorbox{orange}{$c_2$})\colorbox{yellow}{$X_1$}\colorbox{aqua}{$X_2$}+\colorbox{aqua}{$bX_2$}^2 \\ \end{aligned}$$

Note that the $\colorbox{orange}{middle}$ term only depends on the sum of the off-diagonal entries of $A$, and this now looks an awful lot like the general equation for any quadratic form, parameterized by the values of $A$. There exist infinitely many real-valued matrices $A$ with a satisfactory off-diagonal sum. When given the choice among matrices whose values differ only in how the off-diagonal $c$ terms are split, we can simply prefer the matrix which splits them evenly, allowing us to write each as a single term $c/2$.

For example, while all three of these matrices produce the same quadratic form, I would prefer $A_2$ because it is symmetric ($A_2^\top = A_2$):

$$\begin{aligned} A_1 = \begin{bmatrix} 2 & 2 \\ 4 & 1 \\ \end{bmatrix} \\ \\ A_2 = \begin{bmatrix} 2 & 3 \\ 3 & 1 \\ \end{bmatrix} \\ \\ A_3 = \begin{bmatrix} 2 & 1 \\ 5 & 1 \\ \end{bmatrix} \end{aligned}$$
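A quick numerical check (random test points via numpy) that all three matrices really do define the same quadratic form, and that the symmetric choice is just the symmetrized version $(A + A^\top)/2$ of the others:

```python
import numpy as np

A1 = np.array([[2.0, 2.0], [4.0, 1.0]])
A2 = np.array([[2.0, 3.0], [3.0, 1.0]])
A3 = np.array([[2.0, 1.0], [5.0, 1.0]])

rng = np.random.default_rng(4)
for _ in range(3):
    x = rng.normal(size=2)
    print(x @ A1 @ x, x @ A2 @ x, x @ A3 @ x)   # three identical values per row

print(np.allclose(A2, (A1 + A1.T) / 2))          # True: symmetrizing A1 recovers A2
```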

And note as well that this truism about the role of the off-diagonal values generalizes to any sized vector $X$, so long as we scale accordingly:

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \implies X^\top AX =X^\top \begin{bmatrix} a & p/2 & \cdots & u/2 \\ p/2 & b & \cdots & v/2 \\ \vdots & \vdots & \ddots &\vdots \\ u/2 & v/2 & \cdots & c \\ \end{bmatrix} X$$

This operation becomes harder to visualize if you're washed but conceptually we're just adding parabolas in more directions.

Timmy T wit da Statistics

Now that we have a way of expressing Quadratic Forms as Matrix Multiplications, we can see how to apply them to the vectors of interest we've been studying thus far. We can model $n$ random i.i.d. observations as a data vector ${\color{green} X}$ with sample mean ${\color{blue} \bar X}$ taken from a distribution with true mean ${\color{LightSeaGreen}\mu}$. We also have the vectors measuring the deviations from these means:

  • ${\color{orange} X -\mu}$: the error vector measuring the deviation of our data vector from the true mean
  • ${\color{red} X -\bar X}$: the residual vector measuring the deviation of the data vector from the sample mean.

Now to algebraically express the lengths of these vectors, we use quadratic forms formed by taking the inner product of each vector with itself which produces a sum of squares:

| term | working expression | QF MM |
| --- | --- | --- |
| data vector | $\sum {\color{green} X_i}^2$ | ${\color{green}X}^\top {\color{green}X}$ |
| sample mean | $\sum {\color{blue} \bar X}^2$ | ${\color{blue}\bar X}^\top {\color{blue}\bar X}$ |
| expected value | $\sum {\color{LightSeaGreen}\mu}^2$ | ${\color{LightSeaGreen}\mu}^\top {\color{LightSeaGreen}\mu}$ |
| error vector | $\sum ({\color{orange}X - \mu})^2$ | $({\color{orange}X - \mu})^\top ({\color{orange}X - \mu})$ |
| residual vector | $\sum ({\color{red}X - \bar X})^2$ | $({\color{red}X - \bar X})^\top ({\color{red}X - \bar X})$ |

Each of these is in terms of those particular vectors, which is less useful than rewriting them in terms of the underlying data they represent, i.e. the residual vector and sample mean can and ought to be written in terms of the data vector ${\color{green} X}$.

Simplifying these expressions such that they're in terms of their basest quadratic form will improve our understanding of the related statistic, which for us so far has been sample variance, which we saw is just the squared length per degree of freedom of the residual vector:

$${\color{red}s^2} = \frac{1}{n-1}\sum_i({\color{red}X_i - \bar X})^2$$

Additionally, we observed that the expected value of the sample variance is the true variance of the underlying distribution, which makes sample variance an unbiased estimator.

Since sample variance is based on vector length, we can write it as a quadratic form, ideally in terms of the data vector, which is known. For a small number of observations like $n=2$, we can go straight from sample variance to quadratic form:

$$\begin{aligned} {\color{red}s^2} &= \frac{1}{n-1}\sum^n_{i=1}({\color{red}X_i - \bar X})^2 \\ &=\frac{1}{2-1} \Big[(X_1 - \bar X)^2 + (X_2 - \bar X)^2\Big] \\ &= \Big(X_1 - \frac{X_1 + X_2}{2}\Big)^2 + \Big(X_2 - \frac{X_1 + X_2}{2}\Big)^2 \\ &= \frac{1}{4} X_1^2 + \frac{1}{4} X_2^2 - \frac{1}{2}X_1X_2 + \frac{1}{4}X_1^2 + \frac{1}{4}X_2^2 - \frac{1}{2}X_1X_2 \\ &= \frac{1}{2} X_1^2 + \frac{1}{2} X_2^2 -X_1X_2 \\ \end{aligned}$$

And we can then read off these coefficients and plug them into the equivalent quadratic form:

$${\color{red}s^2} = [X_1 \; X_2] \begin{bmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \\ \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
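A tiny sanity check (made-up observations) that this matrix really does reproduce the textbook sample variance for $n = 2$:

```python
import numpy as np

X = np.array([2.0, 1.0])                         # made-up observations
A = np.array([[0.5, -0.5], [-0.5, 0.5]])

qf = X @ A @ X                                   # the quadratic form X^T A X
s2 = ((X - X.mean()) ** 2).sum() / (len(X) - 1)  # textbook sample variance

print(qf, s2)                                    # both 0.5
```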

And this is about as minimally tedious as things get, but it quickly fails to scale for larger sample sizes. Before tackling problems of higher dimensionality, which are part and parcel with our aspirations for higher fidelity samples from our distribution, let's take this quadratic form and visualize it:

As is hopefully expected, it's a parabolic surface with a fold line on $y=x$ where $X_1 = X_2$. The height of this surface is the sample variance, measuring the spread of the data. When $X_1=X_2$, the data are the same and so we get $z=0$. The further away our data stray from this fold line, the bigger the gap between $X_1$ and $X_2$ and the larger the sample variance.

Let's see how this unfolds for $n > 2$. First recall that the residual vector ${\color{red} X- \bar X}$ is perpendicular to the mean vectors ${\color{LightSeaGreen}\mu}, {\color{blue}\bar X}$, forming a right triangle with the data vector:

Per the Pythagorean theorem, we know that the squared length of the residual vector is equal to the squared length of the data vector minus that of the sample mean:

$$\begin{aligned} |{\color{red} X- \bar X}|^2 &= |{\color{green} X}|^2 - |{\color{blue} \bar X}|^2 \\ & = {\color{green}X}^\top {\color{green}X} - |{\color{blue} \bar X}|^2 \\ & = {\color{green}X}^\top I{\color{green}X} - |{\color{blue} \bar X}|^2 \end{aligned}$$

Here we introduce the identity matrix $I$ as our $A$ to ensure our representation of the squared length of the data vector has the appropriate form $X^\top A X$. The shape produced by the quadratic form expression of the data vector will be a circular paraboloid given by the general equation $z = x^2 + y^2$; $z$ will only be zero when $x = y = 0$. To represent the squared length of the sample mean as a quadratic form in terms of ${\color{green} X}$, we begin by expanding the sample mean:

$$|{\color{blue} \bar X}|^2 = \left|\begin{bmatrix} {\color{blue} \bar X} \\ {\color{blue} \bar X} \\ \vdots \\ {\color{blue} \bar X} \end{bmatrix}\right|^2$$

So the squared length is just the sum of the squared instances of itself, ideally expressed as a QF in terms of the observations. First we have to be able to write ${\color{blue} \bar X}$ in terms of the data vector. Algebraically, this means we need to find some object which satisfies the form:

$${\color{blue} \bar X} = \bar v {\color{green} X}$$

where $\bar v$ is some object which is just $\frac{1}{n}$, repeated $n$ times:

$$\bar v = \Big[\frac{1}{n} \; \frac{1}{n} \; \cdots \; \frac{1}{n}\Big]$$

such that $\bar v {\color{green} X}$ is the mean as a single number:

$$\begin{aligned} {\color{blue} \bar X} &= \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n \\ &= \frac{1}{n}\sum X_i \end{aligned}$$

but we'd like a vector, not just a scalar, so we need this quantity in every entry, which is attainable by multiplying it by the column vector of ones:

$$\begin{aligned} \mathbb 1 \cdot \bar v {\color{green} X} &= {\color{blue} \bar X} \\ \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \Big[\frac{1}{n} \; \frac{1}{n} \; \cdots \; \frac{1}{n}\Big]\begin{bmatrix} {\color{green} X_1} \\ {\color{green} X_2} \\ \vdots \\ {\color{green} X_n} \end{bmatrix} &= \begin{bmatrix} {\color{blue} \bar X} \\ {\color{blue} \bar X} \\ \vdots \\ {\color{blue} \bar X} \\ \end{bmatrix} \end{aligned}$$

To simplify the number of operations, we can just combine the first two factors into a single matrix:

$$\begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\ \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\ \vdots & \vdots & \ddots &\vdots \\ \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\ \end{bmatrix} \begin{bmatrix} {\color{green} X_1} \\ {\color{green} X_2} \\ \vdots \\ {\color{green} X_n} \end{bmatrix}$$

which produces a square $n \times n$ matrix where each entry is $\frac{1}{n}$. Furthermore, we can factor out the constant $\frac{1}{n}$, leaving us with a matrix of all 1s which we write as $J$, so the matrix we're after is $\frac{1}{n}J$:

$${\color{blue} \bar X} = \frac{1}{n}J {\color{green} X}$$
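In code (made-up data), $\frac{1}{n}J$ is just an averaging matrix: hitting the data vector with it stamps the sample mean into every entry.

```python
import numpy as np

X = np.array([2.0, 1.0, 4.0, 3.0])       # made-up data vector
n = len(X)
J = np.ones((n, n))                       # the all-ones matrix

print((J / n) @ X)                        # [2.5 2.5 2.5 2.5] -- X_bar repeated n times
print(X.mean())                           # 2.5
```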

Now, we want the squared length of ${\color{blue} \bar X}$, which we can obtain via the inner product:

$$\begin{aligned} \Big(\frac{1}{n}J {\color{green} X}\Big)^2 &= \Big(\frac{1}{n}J {\color{green} X}\Big)^\top\Big(\frac{1}{n}J {\color{green} X}\Big) \\ &= \frac{1}{n}{\color{green} X}^\top J^\top \frac{1}{n}J{\color{green}X} \\ \\ &= \frac{1}{n^2}{\color{green} X}^\top J^\top J{\color{green}X} \end{aligned}$$

Note that when distributing the transposition $\top$ through the first factor on the RHS, the scalar $\frac{1}{n}$ passes through unchanged (transposing a scalar does nothing), while the matrix and vector factors get transposed and their order reversed.

Additionally, since $J$ is symmetric, $J^\top J = JJ$, and, for example:

$$\begin{bmatrix} 1 \\ 1 \end{bmatrix}^\top \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 1\cdot 1+ 1\cdot 1 = 2 = n$$

i.e. the dot product of the ones vector with itself is $n$, the number of entries. Similarly, $JJ$ is an $n\times n$ square matrix where each entry is $n$, which we can factor out to get $nJ$, so all together for $n=2$:

$$\begin{aligned} |{\color{blue} \bar X}|^2 &= \frac{1}{n^2}{\color{green} X}^\top J^\top J{\color{green}X} \\ &= \frac{1}{n^2}{\color{green} X}^\top n J{\color{green}X} \\ \\ &= \frac{1}{n}{\color{green} X}^\top J{\color{green}X} \\ \\ &=\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^\top \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \\ \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \end{aligned}$$

which is itself a quadratic form, with a fold at $y = -x$ since the sample mean is $0$ when $X_2$ is the negation of $X_1$:

Now with each component of our Pythagorean equation decomposed, we can visualize all of them, yielding two parabolic surfaces which combine to the aforementioned circular paraboloid:

Returning to the stated goal of finding a quadratic form for the sample variance, which is the squared length per degree of freedom of the residual vector and which we saw can be written in terms of the Pythagorean theorem:

$$\begin{aligned} |{\color{red} X- \bar X}|^2 &= |{\color{green} X}|^2 - |{\color{blue} \bar X}|^2 \\ & = {\color{green}X}^\top I{\color{green}X} - {\color{green}X}^\top \frac{J}{n}{\color{green}X} \\ & = {\color{green}X}^\top \Big(I - \frac{J}{n}\Big) {\color{green}X} \\ \\ \implies {\color{red} s^2} = \frac{1}{n-1}\sum_i ({\color{red} X_i- \bar X})^2 &= \frac{1}{n-1}|{\color{red} X- \bar X}|^2 \\ \\ &= \frac{1}{n-1} {\color{green}X}^\top \Big(I - \frac{J}{n}\Big) {\color{green}X} \end{aligned}$$
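Putting the whole chain together in numpy (made-up data): $I - \frac{J}{n}$ is the centering matrix, and the quadratic form built from it, scaled by the $n-1$ degrees of freedom, recovers the sample variance exactly.

```python
import numpy as np

X = np.array([2.0, 1.0, 4.0, 3.0, 7.0])   # made-up data vector
n = len(X)
I = np.eye(n)
J = np.ones((n, n))
C = I - J / n                              # centering matrix: C @ X is the residual vector

s2_qf = (X @ C @ X) / (n - 1)              # quadratic form, per degree of freedom
print(s2_qf)                               # 5.3
print(np.var(X, ddof=1))                   # 5.3 -- the textbook sample variance agrees
```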

This brings us another third of the way towards unpacking ~the definition. Next we'll extract some useful properties of these matrices as projection matrices in order to complete the definition we've been working towards.

Projection Matrices

workin on it