I ask the reader to call to mind some of the great rivalries over the centuries: Achilles vs. Hector, Tesla vs. Edison, Ford vs. Ferrari, ... and of course Vassilevska Williams vs. Le Gall.
In the past decade, a handful of researchers have been duking it out over matrix multiplication algorithm performance and it seems like no one is paying attention?!?! Today we're gonna poast about this paper "A Refined Laser Method and Faster Matrix Multiplication,"1 by Josh Alman and Virginia Vassilevska Williams which shatters, okay well ... it doesn't really shatter the previous best method – but it makes a notable improvement which can likely be generalized to other classes of tensors to push the upper bound even lower.
The Problem of Matrix Multiplication
Given $n \times n$ matrices $A, B$ over some field $F$, we want to compute $C_{n \times n}$ s.t.
$$C_{ij} = \sum_{k} A_{ik} \cdot B_{kj}$$
but we wanna go fast. How many operations over the field (addition, multiplication, division) are needed to compute C? The authors of this paper posit that there are tight bounds on the number of operations:
$$n^{\omega + o(1)}, \qquad 2 \le \omega \le 3$$
but if you've ever taken a linear algebra exam, you know that we can certainly do worse than cubic time applying the standard inner-product algorithm, making several arithmetic mistakes along the way to compute the result matrix.
The Inner Product Algorithm
The first (and usually last) method most of us pedestrian matrix multipliers learn just takes the dot product of each row of the first matrix with each column of the second matrix. This is fine, but slow, as it requires many multiplication operations, which are more expensive than additions. I won't get into the weeds (perhaps my next muse...) since there's plenty of weeds at the current layer of abstraction, but suffice it to say that the additions become asymptotically irrelevant: multiplication at the ALU level typically requires several additions, whereas the converse is not true.2
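To make the baseline concrete, here's a minimal sketch of the inner-product algorithm (the function name and the use of numpy are my own choices for illustration):

```python
import numpy as np

def naive_matmul(A, B):
    """Inner-product algorithm: entry C[i, k] is the dot product of
    row i of A with column k of B -- n*s*t scalar multiplications total."""
    n, s = A.shape
    _, t = B.shape
    C = np.zeros((n, t))
    for i in range(n):
        for k in range(t):
            for j in range(s):
                C[i, k] += A[i, j] * B[j, k]
    return C

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(naive_matmul(A, B), A @ B)
```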
If we're just limited to pen and paper, and compelled to multiply two 2×2 matrices X and Y, the approach we'd probably use would look something like this:
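Presumably something like the full inner-product expansion, which costs eight scalar multiplications and four additions:

$$
XY = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{pmatrix}
\begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{pmatrix}
= \begin{pmatrix}
x_{11}y_{11} + x_{12}y_{21} & x_{11}y_{12} + x_{12}y_{22} \\
x_{21}y_{11} + x_{22}y_{21} & x_{21}y_{12} + x_{22}y_{22}
\end{pmatrix}
$$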
Coppersmith and Winograd are heavily cited as a "black box" for other computational performance benchmarks, or as a building block in computational theory, since matrix multiplication is the foundation of pretty much every applied linear algebraic question: matrix inversion, rank, determinant, lower-upper factorization/decomposition, eigen anything-ing, etc. Matrix multiplication also has applications in other fields such as graph theory, complexity theory, etc., so the motivation for wanting to go fast is well established.
Up until 1969, it was thought that it wasn't possible to compute the product of two matrices in faster-than-cubic time. In fact, the discovery of a faster method was made by Volker Strassen while trying to prove that standard Gaussian elimination (an $n^3$ approach) was optimal. This discovery was published in a paper aptly titled Gaussian elimination is not optimal.2 And this was a watershed moment for investigation into faster means of matrix multiplication using various methods, up until the past decade or so, when only minor improvements to the exponent, on the order of hundred-thousandths, are being shaved off. This might not seem like much, but for massive matrices like the ones you print out to help you on your final exam, these add up. Additionally, the academic heart strives to know the theoretic optimum. Ultimately, the question of Matrix Multiplication optimization boils down to one of the asymptotic rank of a Matrix Multiplication Tensor.
The Matrix Multiplication Tensor
There are many representations or interpretations of what a tensor is:
a multidimensional array: $(T_{ijk})_{ijk} \in F^{n \times m \times p}$
a bilinear map, taking two vectors to a third vector: $T: F^n \times F^m \to F^p$
a trilinear map, taking three vectors to a scalar: $T: F^n \times F^m \times F^p \to F$
a multilinear polynomial: $T: F^n \times F^m \times F^p \to F$
The last interpretation is the one widely used in the Matrix Multiplication Problem since it's useful for encoding the notion of a "problem" within another problem.
We define a tensor $T$ as a multilinear polynomial over sets of formal variables $X = \{x_1, \dots, x_n\}$, $Y = \{y_1, \dots, y_m\}$, $Z = \{z_1, \dots, z_p\}$ so that
$$T = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{p} \alpha_{ijk} \cdot x_i y_j z_k,$$
where the $\alpha_{ijk}$ are elements of some field $F$, e.g. $\mathbb{C}, \mathbb{R}, \mathbb{Q}$. As mentioned before, every tensor under this interpretation (which is equivalent to any of the others) defines a computational problem: for given vectors $x, y$ and every $k \in [p]$ (the number of polynomials we want to compute), we want to compute the coefficient of $z_k$, which is a bilinear polynomial:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_{ijk} \cdot x_i y_j$$
The Matrix Multiplication Tensor (MMT), then, is the one that encodes our problem. For $m, s, t \in \mathbb{N}$, define:
$$\langle m, s, t \rangle := \sum_{i=1}^{m} \sum_{j=1}^{s} \sum_{k=1}^{t} x_{ij}\, y_{jk}\, z_{ki}$$
to be the MMT for multiplying two matrices $A_{m \times s}$, $B_{s \times t}$.
In general, the coefficient of $z_{ki}$ in $\langle m, s, t \rangle$ is the $(i,k)$-th entry of the product $A \cdot B$.
The formal problem that we encode as an MMT is the Matrix Multiplication Problem:
$$\langle n, n, n \rangle := \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} x_{ij}\, y_{jk}\, z_{ki},$$
whose asymptotic rank is what we care about.
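As a quick sanity check on this encoding, here's a small sketch (the `mmt` helper and its index conventions are mine, not the paper's) that builds $\langle m,s,t \rangle$ as a 0/1 array and verifies that contracting it with the entries of $A$ and $B$ yields exactly the entries of $AB$:

```python
import numpy as np

def mmt(m, s, t):
    """The matrix multiplication tensor <m,s,t> as a 3-D 0/1 array:
    axes index the flattened variables x_ij, y_jk, z_ki, and an entry is 1
    exactly when the monomial x_ij * y_jk * z_ki appears in the tensor."""
    T = np.zeros((m * s, s * t, t * m))
    for i in range(m):
        for j in range(s):
            for k in range(t):
                T[i * s + j, j * t + k, k * m + i] = 1
    return T

m, s, t = 2, 3, 4
A, B = np.random.rand(m, s), np.random.rand(s, t)
T = mmt(m, s, t)
coeffs = np.einsum("abc,a,b->c", T, A.ravel(), B.ravel())  # one value per z_ki
C = coeffs.reshape(t, m).T  # z_ki is indexed by (k, i), so transpose back
assert np.allclose(C, A @ B)
```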
Tensor Rank
In general, Rank measures the complexity of a tensor, and a tensor is just the natural generalization of a matrix. The total Rank of a tensor $T$ is the minimum number of Rank-one tensors whose sum equals $T$. We say a tensor has Rank one if there exist $a_i, b_j, c_k \in F$ such that the tensor can be expressed as a product of linear combinations:
$$T = \Big(\sum_i a_i x_i\Big)\Big(\sum_j b_j y_j\Big)\Big(\sum_k c_k z_k\Big).$$
We often express this tensor $T$ as $a \otimes b \otimes c$ (read "a tensor b tensor c"). Matrix Rank is the same, just without the third $z_k, c_k$ dimension.
Using "Rank" loosely to mean the number of dimensions: a Rank-$0$ mathematical object is just a scalar, a Rank-$1$ object is a vector, Rank $2$ gives an $n \times n$ matrix, and Rank-$\ge 3$ objects are tensors.
At a high level, the decomposition of $T_n$ into $R$ rank-one terms provides an algorithm for multiplying arbitrary $n \times n$ matrices using $R$ scalar multiplications.
Properties of Tensor Rank
Lemma: if $\mathrm{Rank}(\langle m, s, t \rangle) = r$, then $\omega \le 3 \log_{mst} r$, via the recursive application of $\mathrm{Rank}(\langle m, m, m \rangle) = r$ implying $\omega \le \log_m r$. We can express this as the sum of $r$ terms that are linear combinations of the $x$ entries of our tensor times linear combinations of the $y$ entries ... and so on:
$$\langle m, s, t \rangle = \sum_{\ell=1}^{r} \Big(\sum_{i,j} a_{\ell ij}\, x_{ij}\Big)\Big(\sum_{j,k} b_{\ell jk}\, y_{jk}\Big)\Big(\sum_{k,i} c_{\ell ki}\, z_{ki}\Big)$$
and we use this expression to develop an algorithm for constant-sized matrices, e.g. $\langle 2, 2, 2 \rangle$ representing two $2 \times 2$ matrices, and we use it to recursively solve for the product of larger matrices whose entries are themselves matrices:
$$\begin{bmatrix} [\,\cdot\,] & [\,\cdot\,] \\ [\,\cdot\,] & [\,\cdot\,] \end{bmatrix} \times \begin{bmatrix} [\,\cdot\,] & [\,\cdot\,] \\ [\,\cdot\,] & [\,\cdot\,] \end{bmatrix}$$
The algorithm is:
For all $\ell = 1, \dots, r$, recursively compute
$$P_\ell = \Big(\sum_{i',j} a_{\ell i' j}\, x_{i'j}\Big)\Big(\sum_{j',k'} b_{\ell j' k'}\, y_{j'k'}\Big)$$
For the coefficient of each $z_{ki}$, compute
$$\sum_{\ell=1}^{r} c_{\ell k i}\, P_\ell$$
This algorithm can be used to multiply block matrices. Via recursive application, we can multiply matrices of arbitrary size, with the rank $R$ controlling the asymptotic complexity of the algorithm: a matrix $M_{n \times n}$ can be multiplied with asymptotic complexity $O(n^{\log_N R})$.
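Here's a small non-recursive sketch of that procedure at a fixed size, assuming we're handed the coefficient arrays $a, b, c$ of a rank-$r$ decomposition (all names here are mine, for illustration only):

```python
import numpy as np

def bilinear_multiply(A, B, a, b, c):
    """Multiply A (m x s) by B (s x t) given a rank-r decomposition of <m,s,t>:
    a, b, c have shapes (r, m*s), (r, s*t), (r, t*m), one row per rank-one term."""
    # Step 1: the r products P_l -- each costs a single scalar multiplication.
    P = (a @ A.ravel()) * (b @ B.ravel())          # shape (r,)
    # Step 2: each z_ki coefficient is just a linear combination of the P_l.
    m, t = A.shape[0], B.shape[1]
    return (c.T @ P).reshape(t, m).T               # coefficient of z_ki -> C[i, k]

# Trivial rank-(m*s*t) decomposition: one rank-one term per monomial of <m,s,t>.
m, s, t = 2, 2, 2
terms = [(i, j, k) for i in range(m) for j in range(s) for k in range(t)]
a = np.array([np.eye(m * s)[i * s + j] for i, j, k in terms])
b = np.array([np.eye(s * t)[j * t + k] for i, j, k in terms])
c = np.array([np.eye(t * m)[k * m + i] for i, j, k in terms])
A, B = np.random.rand(m, s), np.random.rand(s, t)
assert np.allclose(bilinear_multiply(A, B, a, b, c), A @ B)
```

Swapping a rank-7 decomposition of $\langle 2,2,2 \rangle$ in for the trivial one gives exactly Strassen's algorithm at the block level.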
Thus, $P_\ell$ is a "block" and a comparatively cheap quantity to compute, since it's just a summation with few multiplications compared to our naive $n^3$ approach. Additionally, $\sum_{\ell}^{r} c_{\ell ki} P_\ell$ is cheap to compute, since it's just another linear combination which can be computed in linear time. So, the whole runtime of this algorithm depends on the number of recursive calls needed to compute our blocks, which is $r$; hence why we are concerned with the asymptotic rank of the MMT, and also how Strassen got
$$\mathrm{Rank}(\langle 2, 2, 2 \rangle) \le 7 \implies \omega \le \log_2 7 \approx 2.81$$
Similarly, in 1978 Pan achieved $\omega \le 2.80$ via
$$\mathrm{Rank}(\langle 70, 70, 70 \rangle) \le 143{,}640 \implies \omega \le \log_{70} 143{,}640 \approx 2.80$$
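Checking those exponents is a one-liner (just evaluating $\log_m r$ for the two decompositions above):

```python
import math

print(math.log(7, 2))         # Strassen: ~2.807
print(math.log(143_640, 70))  # Pan 1978: ~2.795
```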
It was around this time that people stopped analyzing fixed-size Matrix Multiplication Problems, because it quickly becomes unwieldy to work with such large objects. So, the method of analysis changed. In 1987, Coppersmith and Winograd published a new method for designing Matrix Multiplication Algorithms.5 These methods –which have become more or less standard as far as I can tell– rely on two components:
an Algebraic Identity: corresponding to a Rank representation of some small tensor
a Method for Analyzing the Identity: which is the method of extracting a Matrix Multiplication Algorithm from other tensors
| Publication | Algebraic Identity | Analysis |
| --- | --- | --- |
| Strassen, 1969 | $2 \times 2$ MM using only 7 multiplications | Simple, recursive approach |
| ⋮ | ⋮ | ⋮ |
| Coppersmith, Winograd 1987 | The $CW^{\otimes 2}$ tensor and its powers | Laser Method, introduced by Coppersmith, Winograd, and Strassen |
| ⋮ | ⋮ | ⋮ |
| Stothers, 2010 | $CW^{\otimes 4}$ | Laser Method |
| Vassilevska Williams, 2011 | $CW^{\otimes 8}$ | Laser Method |
| Le Gall, 2014 | $CW^{\otimes 32}$ | Laser Method |
| Alman, Vassilevska Williams, 2020 | $CW^{\otimes 32}$ | Refined Laser Method |
The Laser Method
The Laser Method is an indirect technique for extracting Matrix Multiplication Algorithms out of other objects. It was developed by Strassen and optimized by Coppersmith and Winograd. Crucially, as recently as 2019,6 Josh Alman proved an optimality property of the Laser Method, showing that for tensors $T$ to which it can be applied, if any other method could achieve the optimum of $\omega = 2$ using $T$, then so can the Laser Method. Therefore, the Laser Method is optimal for a certain subset of tensors.
How does it work?
The Laser Method is a composition of reduction problems:
Take some algebraic problem P that we have an efficient algorithm for
Show how to reduce the problem of Matrix Multiplication to P
This is nearly identical to Strassen's original approach to proving sub-cubic time. E.g., given the algebraic problem instance of Matrix Multiplication for two $2 \times 2$ matrices:
At a high level, our goal is to embed matrix multiplication into this similar, and notably larger (by the introduction of the $x_{33}, y_{33}$ terms), problem $P$. Our measure of success is removal of some inputs and outputs of $P$ to transform it into the Matrix Multiplication Problem.
The way we achieve this reduction is by setting some of the $X, Y, Z$ variables of our instance of $P$ to zero. Setting $x_{33}, y_{33}, z_{3} \leftarrow 0$ yields our desired Matrix Multiplication tensor exactly (which is not always the case, as we'll see shortly).
This is called zeroing-out a tensor, which is similar to the notion of reducibility in complexity theory. We denote that $P$ is reducible to $\langle 2,2,2 \rangle$ via zeroing as "$\langle 2,2,2 \rangle \le_{zo} P$", which does have a well-defined Rank bound.
Lemma: If $Q \le_{zo} P$, then $\mathrm{Rank}(Q) \le \mathrm{Rank}(P)$, so
if $\langle m, s, t \rangle \le_{zo} P$, then $\omega \le 3 \log_{mst} \mathrm{Rank}(P)$.
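In coefficient-array terms, zeroing out a variable just means zeroing the corresponding slice; a tiny sketch (the tensor here is a random stand-in, not the actual $P$ from above):

```python
import numpy as np

# A random stand-in tensor over 3 x-, 3 y-, and 3 z-variables.
P = np.random.randint(0, 2, size=(3, 3, 3))

Q = P.copy()
Q[2, :, :] = 0  # set x_3 <- 0: kill every monomial containing x_3
Q[:, 2, :] = 0  # set y_3 <- 0
Q[:, :, 2] = 0  # set z_3 <- 0
# Q is a zero-out of P: any rank decomposition of P restricts to one of Q
# with the same number of terms, so Rank(Q) <= Rank(P).
```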
Brief aside about the Strassen method
Strassen showed that it is possible to compute those polynomials using the following identities:
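In Strassen's original formulation (one standard way to write the identities, using the $X, Y$ naming from earlier; the Winograd variant rearranges them to save a few additions), the seven products and the four output entries are:

$$
\begin{aligned}
M_1 &= (x_{11} + x_{22})(y_{11} + y_{22}) \\
M_2 &= (x_{21} + x_{22})\,y_{11} \\
M_3 &= x_{11}\,(y_{12} - y_{22}) \\
M_4 &= x_{22}\,(y_{21} - y_{11}) \\
M_5 &= (x_{11} + x_{12})\,y_{22} \\
M_6 &= (x_{21} - x_{11})(y_{11} + y_{12}) \\
M_7 &= (x_{12} - x_{22})(y_{21} + y_{22})
\end{aligned}
\qquad
\begin{aligned}
(XY)_{11} &= M_1 + M_4 - M_5 + M_7 \\
(XY)_{12} &= M_3 + M_5 \\
(XY)_{21} &= M_2 + M_4 \\
(XY)_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}
$$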
which computes the product using only 7 multiplications instead of 8, at the cost of extra additions: 18 in this form, which the Winograd variant reduces to 15!
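As a sketch of the "simple, recursive approach", here's Strassen's recursion in code, assuming the dimension is a power of two and falling back to the naive product below a cutoff (both assumptions are mine):

```python
import numpy as np

def strassen(X, Y, cutoff=64):
    """Strassen's recursion for n x n matrices, n a power of two."""
    n = X.shape[0]
    if n <= cutoff:                      # small blocks: naive product
        return X @ Y
    h = n // 2
    x11, x12, x21, x22 = X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]
    y11, y12, y21, y22 = Y[:h, :h], Y[:h, h:], Y[h:, :h], Y[h:, h:]
    m1 = strassen(x11 + x22, y11 + y22, cutoff)
    m2 = strassen(x21 + x22, y11, cutoff)
    m3 = strassen(x11, y12 - y22, cutoff)
    m4 = strassen(x22, y21 - y11, cutoff)
    m5 = strassen(x11 + x12, y22, cutoff)
    m6 = strassen(x21 - x11, y11 + y12, cutoff)
    m7 = strassen(x12 - x22, y21 + y22, cutoff)
    return np.block([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

X, Y = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen(X, Y), X @ Y)
```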
Laser Method
The Laser Method is a bit more complicated than just the above reduction, namely because, for larger input matrices, it becomes increasingly challenging to zero out terms without removing terms we still need for the Matrix Multiplication proper.
Thus, the Laser Method instead reduces direct sums of Matrix Multiplication Tensors to powers of tensors with a special structure via the zero-out reduction illustrated above. The process looks like this:
$$\underbrace{\langle n,n,n \rangle \implies \bigoplus_i \langle m_i, s_i, t_i \rangle}_{\text{MM reducible to a direct summation of other MMTs}} \;\overset{\text{laser method}}{\implies}\; \underbrace{T^{\otimes N} \implies T}_{\text{bounding } T \text{ via its powers}}$$
and composing all of these gives us a reduction from Matrix Multiplication to a tensor T.
Direct Summation
For some tensor, e.g. $T = x_1 y_1 z_1 + x_1 y_2 z_2$, we can add it to a slightly modified copy of itself, yielding a "regular" sum.
Observe that both copies share a $y_2$ variable. Conversely, a direct sum of two copies of $T$ is completely disjoint: the copies have no variables in common, even though they are sums of the same form.
Schönhage showed that, because of this disjoint-ness, upper bounds for MMTs also yield upper bounds for ω via the following theorem:7
If the direct sum of $k$ copies of $\langle m, s, t \rangle$ has $\mathrm{Rank} \le r$, then $\omega \le 3 \log_{mst} \frac{r}{k}$.
This means that we just have to show that the rank of k disjoint copies of some MMT is smaller than k times the rank of the original tensor. This is the first step of the method diagrammed above:
$$\underbrace{\langle n,n,n \rangle \implies \bigoplus^{k} \langle m,s,t \rangle}_{\mathrm{Rank}\, =\, r\ \implies\ \omega \le 3\log_{mst}\frac{r}{k}} \implies \dots$$
Tensor Exponentiation
Now, for the other half of the method, what the hell does this mean: $T^{\otimes N} \implies T$. We need to define what it means to exponentiate a tensor. First, we define the product of two tensors via the Kronecker Product.
For tensors $S, T$ over variable sets $\{X, Y, Z\}$, $\{X', Y', Z'\}$, respectively, with $S = \sum_{a,b,c} \alpha_{abc}\, x_a y_b z_c$ and $T = \sum_{a',b',c'} \beta_{a'b'c'}\, x'_{a'} y'_{b'} z'_{c'}$, the Kronecker product is the tensor over pairs of variables
$$S \otimes T = \sum_{a,b,c} \sum_{a',b',c'} \alpha_{abc}\, \beta_{a'b'c'}\, (x_a x'_{a'})(y_b y'_{b'})(z_c z'_{c'}),$$
which we immediately generalize into the definition of the power of a tensor:
$$T^{\otimes N} = \underbrace{T \otimes \cdots \otimes T}_{N\ \text{copies}}$$
Lemma: The product of any two MMTs is an MMT whose dimensions are the products of the originals': $\langle a_1, a_2, a_3 \rangle \otimes \langle b_1, b_2, b_3 \rangle = \langle a_1 b_1, a_2 b_2, a_3 b_3 \rangle$, which precisely corresponds to our idea of "blocking."
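A sketch of the Kronecker product on coefficient arrays (reusing the hypothetical `mmt` helper from earlier); the assert only checks shape and monomial count, since the two sides agree up to a relabeling of the variables:

```python
import numpy as np

def mmt(m, s, t):
    T = np.zeros((m * s, s * t, t * m))
    for i in range(m):
        for j in range(s):
            for k in range(t):
                T[i * s + j, j * t + k, k * m + i] = 1
    return T

def tensor_kron(S, T):
    """Kronecker product of two order-3 tensors: each variable becomes a pair."""
    return np.einsum("abc,def->adbecf", S, T).reshape(
        S.shape[0] * T.shape[0], S.shape[1] * T.shape[1], S.shape[2] * T.shape[2])

S = mmt(2, 3, 4)
square = tensor_kron(S, S)   # <2,3,4> tensor <2,3,4>
M = mmt(4, 9, 16)            # <4,9,16>
assert square.shape == M.shape and square.sum() == M.sum()
```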
Observe that $T^{\otimes N}$ will contain $N$-tuples over each variable set of the original tensor, so we get larger and larger variable sets, but the rank is guaranteed to be bounded by the individual ranks of the tensors being multiplied, per the following:
Lemma: $\mathrm{Rank}(S \otimes T) \le \mathrm{Rank}(S) \cdot \mathrm{Rank}(T)$, so $\mathrm{Rank}(T^{\otimes N}) \le \mathrm{Rank}(T)^N$,
and sometimes, $\mathrm{Rank}(T^{\otimes N}) \ll \mathrm{Rank}(T)^N$.
From these lemmas, we can get an expression for asymptotic rank:
$$R = \lim_{N \to \infty} \mathrm{Rank}(T^{\otimes N})^{1/N}$$
For example, for the $2 \times 2$ MMT, $\mathrm{Rank}(\langle 2,2,2 \rangle) = 7$, but the rank of its $N$th power grows like $n^{\omega + o(1)}$ for $n = 2^N$! Thus, asymptotic rank is the secret workhorse of all modern Matrix Multiplication algorithms. This completes the right half of our methodology diagram:
$$\dots \overset{\text{laser method}}{\implies} \underbrace{T^{\otimes N}}_{R \le r^N} \implies \underbrace{T}_{\mathrm{Rank}\, =\, r}$$
We can interpret this to mean: if we have a bound on the rank of the tensor, we also have a bound on the rank of its $N$th power. So now we just have to show the reduction via the Laser Method, allowing us to compose the left and right halves. If we can compose these steps for sufficiently large $m, s, t, k$, then we can get a really good bound on $\omega$.
Laser Method
Finally, the middle bit: $\dots \overset{\text{laser method}}{\implies} \dots$
The Laser Method takes a tensor $T$ with a "special structure" and uses that structure to reduce the direct sum $\bigoplus^{k} \langle m, s, t \rangle$ to $T^{\otimes N}$, for large $m, s, t, k$ that are functions of $N$, by zeroing out many variables of $T^{\otimes N}$, yielding huge disjoint sums.
Special Sauce: Tensor Partitioning
What is the aforementioned "special structure?" We start with some tensor:
$$T = \sum_{a=1}^{n} \sum_{b=1}^{m} \sum_{c=1}^{p} \alpha_{abc}\, x_a y_b z_c$$
and partition its variable sets $X = X_1 \cup \dots \cup X_\ell$, $Y = Y_1 \cup \dots \cup Y_\ell$, $Z = Z_1 \cup \dots \cup Z_\ell$ into $\ell$ parts. Then, for $i, j, k \in [\ell]$, let the subtensor $T_{ijk}$ be
$$T_{ijk} = \sum_{x_a \in X_i} \sum_{y_b \in Y_j} \sum_{z_c \in Z_k} \alpha_{abc}\, x_a y_b z_c,$$
which contains only the triples $\alpha_{abc} \cdot x_a y_b z_c$ with $x_a \in X_i$, $y_b \in Y_j$, $z_c \in Z_k$.
Our whole tensor T can then be expressed as a sum of subtensors:
$$T = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \sum_{k=1}^{\ell} T_{ijk}$$
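A sketch of partitioning in coefficient-array terms (the partition here is arbitrary, purely to illustrate that the $\ell^3$ subtensors sum back to $T$):

```python
import numpy as np

def subtensor(T, Xi, Yj, Zk):
    """Keep only the monomials x_a y_b z_c with a in Xi, b in Yj, c in Zk."""
    S = np.zeros_like(T)
    S[np.ix_(Xi, Yj, Zk)] = T[np.ix_(Xi, Yj, Zk)]
    return S

T = np.random.rand(4, 4, 4)
X = Y = Z = [[0, 1], [2, 3]]          # l = 2 parts per variable set
parts = [subtensor(T, X[i], Y[j], Z[k])
         for i in range(2) for j in range(2) for k in range(2)]
assert np.allclose(sum(parts), T)     # the subtensors partition the monomials of T
```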
And if all the subtensors Tijk are MMTs, T is a sum of MMTs, though not yet a direct sum. We want them to be a disjoint, direct sum, which is provably possible for all such partitioned tensors. Trivially, we could achieve this by partitioning each variable set into m,s,t distinct parts such that we have a direct sum of 1×1×1 Matrix Multiplication Tensors. Such a decomposition isn't very useful though.
Let's take a look instead at a useful partitioning of a nontrivial tensor such as the Coppersmith-Winograd tensor, parameterized over some natural number $q$:
$$CW_q = \sum_{i=1}^{q} \big( x_0 y_i z_i + x_i y_0 z_i + x_i y_i z_0 \big) + x_0 y_0 z_{q+1} + x_0 y_{q+1} z_0 + x_{q+1} y_0 z_0.$$
This tensor is fascinating because its asymptotic rank is $q+2$, which is optimal since that's the number of $x$, $y$, or $z$ variables that appear in it; e.g. there are always precisely $q+2$ distinct $x$ variables in it.
If we had an optimal asymptotic rank for an MMT, since the number of variables is $n^2$, we'd be getting an asymptotic rank of $n^2$, or $\omega = 2$, so – we'd really like to embed the Matrix Multiplication Problem in this tensor.
The optimal partitioning for this tensor puts $x_0$ in its own part, $x_1, \dots, x_q$ together in a second part, and $x_{q+1}$ in a third (and likewise for the $y$ and $z$ variables), so that every nonzero subtensor $T_{ijk}$ is an MMT.
The inner products are nontrivial MMTs, and therefore challenging to get any fruitful zeroations out of. Instead, the Laser Method doesn't zero out the tensor itself, but takes the Nth power of it to give us more freedom for zeroing out.
The image of this process in my head is: taking two squares (matrices) we want to multiply, transforming their representation into a cube (tensor), exponentiating that cube into a rectangular prism, and then scratching out a bunch of the numbers we don't need until we're left with something "close" to the number of elements of just two slices of the rectangular prism which would be our initial matrices.
Taking a Large Power
$T$ is constant-sized, and partitioned s.t. the $T_{ijk}$ subtensors are matrix products. We take the $N$th power of $T$:
$$T^{\otimes N} = \Big( \sum_{i,j,k} T_{ijk} \Big)^{\otimes N} = \sum_{(T_1, \dots, T_N) \in \{T_{ijk}\}^N} T_1 \otimes \cdots \otimes T_N$$
When we take the $N$th power, we get the $N$th power of a sum of subtensors, which in turn gives a sum over $N$-tuples $T_1 \otimes \dots \otimes T_N$, where each $T_i$ is one of the subtensors $T_{ijk}$. From the assumption that our subtensors are matrix products, we can conclude that each term of $T^{\otimes N}$ is a tensor product of matrix products, which we've established is also a matrix product. So, it's a sum of matrix products, just really really big ones. Recall that what we really want is a disjoint sum of matrix products – all with the same dimensions. Ours currently do not have the same dimensions, unless our subtensors all had the same dimensions to begin with, which was not a constraint of the partition.
So, we zero out some of our $T_i$. We do this by associating a distribution $\pi$ to each term $T_1 \otimes \dots \otimes T_N \in \{T_{ijk}\}^N$, where $\pi$ is the frequency of each term, or the fraction of times we used a given subtensor $T_{ijk}$ in the tensor power:
$$\pi_{ijk} = \frac{1}{N}\,\Big|\big\{\, p \in [N] \mid T_p = T_{ijk} \,\big\}\Big|$$
Now, all of the terms $T_1 \otimes \dots \otimes T_N$ of $T^{\otimes N}$ whose $N$-tuple corresponds to the same $\pi$ are matrix products with the same dimensions: $\langle m_\pi, n_\pi, t_\pi \rangle$. This is because
$$\langle a, b, c \rangle^{\otimes t} \otimes \langle p, q, r \rangle^{\otimes u} \;\cong\; \langle a^t p^u,\; b^t q^u,\; c^t r^u \rangle.$$
The location of these terms in the sum does not matter, just the number of occurrences – hence the ability to proxy dimension with our distribution ascription. So, we just need to pick a $\pi$ which allows us to zero out the most terms, i.e. matrix products of disparate dimension.
Goal: pick $\pi$ and zero out variables in $T^{\otimes N}$ to get a direct sum of subtensors $T_1 \otimes \dots \otimes T_N$ which correspond to $\pi$. This is the Laser Method.
Suppose lasering thusly gives us $L$ such tensors in the direct summation – we can use Schönhage's theorem and get
$$\omega \le 3 \log_{m_\pi n_\pi t_\pi} \frac{\mathrm{Rank}(T^{\otimes N})}{L},$$
yielding an inversely proportionate relationship between $L$ and $\omega$: $L \uparrow\; \omega \downarrow$. We want to maximize $L$ for a fixed distribution $\pi$ and express it in terms of our distribution. Then, we can minimize the upper bound on $\omega$ over the choices for $\pi$:
$$\omega \le \min_{\pi}\, 3 \log_{m_\pi n_\pi t_\pi} \frac{\mathrm{Rank}(T^{\otimes N})}{L(\pi)}.$$
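Evaluating that bound is mechanical once you have the numbers; a toy sketch (the function and its toy inputs are mine, not values from the paper):

```python
import math

def omega_upper_bound(m_pi, n_pi, t_pi, L, rank_TN):
    """Schoenhage-style bound from L disjoint copies of <m_pi, n_pi, t_pi>
    found inside a tensor of rank at most rank_TN."""
    return 3 * math.log(rank_TN / L) / math.log(m_pi * n_pi * t_pi)

# Toy numbers: one copy of <2,2,2> inside a tensor of rank 7 recovers
# Strassen's exponent, ~2.807.
print(omega_upper_bound(2, 2, 2, 1, 7))
```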
We're left to determine $L, m_\pi, n_\pi, t_\pi$ as functions of $\pi, N, T$. And already we've got a problem: we're scheming to zero out variables, but $\pi$ is currently a distribution over triples, not individual variables. So, we have to do something to convert $\pi$ from a distribution over the triples $T_{ijk}$ into a distribution over single variables.
Marginalizing π
A marginal corresponds to the fraction of times that a term of shape $T_{i**}$ appears in $T_1 \otimes \dots \otimes T_N$ for a given index $i$. We define the "x" marginals of $\pi$ as
$$\pi_i := \sum_{j,k \in [\ell]} \pi_{ijk}$$
Recall that $T^{\otimes N}$ is a tensor over the $N$-tuples $X^N, Y^N, Z^N$, so its $X$ variables are partitioned into sets $X_{c_1} \times \dots \times X_{c_N}$ for $c_1, \dots, c_N \in [\ell]$, and each subtensor $T_1 \otimes \dots \otimes T_N$ corresponding to $\pi$ uses the $X$ variables from a set
$$X_{c_1} \times \dots \times X_{c_N} \quad \text{where} \quad \forall i \in [\ell],\; \Big|\{\, p \in [N] \mid c_p = i \,\}\Big| = \pi_i N,$$
e.g. $T_1 = T_{c_1 jk}$ for some $j, k$. So, every term in our product(s) has some marginals it obeys. The Laser Method zeroes out the variables in $T^{\otimes N}$ that do not have the marginals $\pi_i$ we desire, yielding a sum of terms $T_1 \otimes \dots \otimes T_N$ consistent with those marginals.
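To make the bookkeeping concrete, here's a tiny sketch computing $\pi$ and its "x" marginals for a made-up sequence of subtensor labels (the labels are arbitrary placeholders, not taken from any real analysis):

```python
from collections import Counter
from fractions import Fraction

# A made-up choice of N = 8 subtensor labels (i, j, k), with l = 2 parts.
blocks = [(0, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0),
          (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1)]
N = len(blocks)

# pi_ijk: fraction of positions p with T_p = T_ijk.
pi = {ijk: Fraction(count, N) for ijk, count in Counter(blocks).items()}

# The "x" marginals pi_i: sum over j and k.
pi_x = Counter()
for (i, j, k), p in pi.items():
    pi_x[i] += p

print(pi)    # (0, 1, 1) has frequency 1/2; the other two labels 1/4 each
print(pi_x)  # x-marginals: parts 0 and 1 each account for half the positions
```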