2.3728596


Introduction

I ask the reader to call to mind some of the great rivalries over the centuries: Achilles vs. Hector, Tesla vs. Edison, Ford vs. Ferrari, ... and of course Vassilevska Williams vs. Le Gall.

In the past decade, a handful of researchers have been duking it out over matrix multiplication algorithm performance and it seems like no one is paying attention?!?! Today we're gonna poast about this paper "A Refined Laser Method and Faster Matrix Multiplication,"1 by Josh Alman and Virginia Vassilevska Williams which shatters, okay well ... it doesn't really shatter the previous best method – but it makes a notable improvement which can likely be generalized to other classes of tensors to push the upper bound even lower.

The Problem of Matrix Multiplication

Given $n \times n$ matrices $A, B$ over some field $\mathbb F$, we want to compute $C_{n\times n}$ s.t.

$$C_{ij} = \sum_k A_{ik} \cdot B_{kj}$$

but we wanna go fast. How many operations over the field (addition, multiplication, division) are needed to compute $C$? The authors of this paper posit that there are tight bounds on the number of operations:

$$n^{\omega + o(1)}, \quad 2 \leq \omega \leq 3$$

but if you've ever taken a linear algebra exam, you know that we can certainly do worse than cubic time, applying the standard inner-product algorithm and making several arithmetic mistakes along the way to compute the result matrix.

https://twitter.com/O42nl/status/1630420005765390336?s=20

The Inner Product Algorithm

The first (and usually last) method most of us pedestrian matrix multipliers learn just takes the dot product of all the rows of the first matrix with the columns of the second matrix. This is fine, but slow, as it requires many multiplication operations, which are more expensive than additions. I won't get into the weeds (perhaps my next muse...) since there are plenty of weeds at the current layer of abstraction, but suffice it to say that the additions become asymptotically irrelevant and multiplication at the ALU level typically requires several additions, whereas the converse is not true.2

If we're just limited to pen and paper, and compelled to multiply two $2\times 2$ matrices $X$ and $Y$, the approach we'd probably use would look something like this:

$$\begin{aligned} X \times Y &= \begin{bmatrix} \color{blue} x_{11} & \color{blue}x_{12} \\ \color{blue}x_{21} & \color{blue}x_{22} \end{bmatrix} \begin{bmatrix} \color{orange} y_{11} & \color{orange}y_{12} \\ \color{orange}y_{21} & \color{orange}y_{22} \end{bmatrix} \\ &= \begin{bmatrix} \color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21} & \color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22} \\ \color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21} & \color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22} \end{bmatrix} \end{aligned}$$

which is $n^3 = 8$ multiplications. Meh.
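
For the pedestrians among us, here's what that looks like as a few lines of plain Python (my own illustrative sketch, not anything from the paper), burning the full $n^3$ scalar multiplications:

```python
def naive_matmul(A, B):
    """Textbook inner-product multiplication: C[i][j] = sum_k A[i][k] * B[k][j].
    Uses n*m*p scalar multiplications (n^3 for square matrices)."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

# the 2x2 case above: 2^3 = 8 multiplications
print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```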

History of the Problem

Coppersmith and Winograd are heavily cited as a "black box" for other computational performance benchmarks or building blocks in the field of computational theory, since matrix multiplication is the foundation of pretty much all applied linear algebraic questions: matrix inversion, rank, determinant, lower-upper factorization/decomposition, eigen anything-ing, etc. all reduce to it. Matrix multiplication also has applications in other fields such as graph theory, complexity theory, etc., so the motivation for wanting to go fast is well established.

Up until 1969, it was thought that it wasn't possible to compute the product of two matrices in faster-than-cubic time. In fact, the discovery of a faster method was made by Volker Strassen when trying to prove that standard Gaussian elimination (an $n^3$ approach) was optimal. This discovery was published in a paper aptly titled Gaussian elimination is not optimal.2 And this was a watershed moment for investigation into faster means of matrix multiplication using various methods:

[Image: timeline of improvements to the upper bound on the matrix multiplication exponent ω over time]3

That progress continued until the past decade or so, when only minor improvements, on the order of hundred-thousandths, are being shaved off of $\omega$. This might not seem like much, but for massive matrices like the ones you print out to help you on your final exam, these add up. Additionally, the academic heart strives to know the theoretic optimum. Ultimately, the question of Matrix Multiplication optimization boils down to one of the asymptotic rank of a Matrix Multiplication Tensor.

The Matrix Multiplication Tensor

There are many representations or interpretations of what a tensor is:

  • a multidimensional array: $(T_{ijk})_{ijk} \in \mathbb F^{n\times m\times p}$
  • a bilinear map, taking two vectors to a third vector: $T: \mathbb F^n \times \mathbb F^m \rarr \mathbb F^p$
  • a trilinear map, taking three vectors to a scalar: $T: \mathbb F^n \times \mathbb F^m \times \mathbb F^p \rarr c$
  • a multilinear polynomial: $T: \mathbb F^n \times \mathbb F^m \times \mathbb F^p \rarr \mathbb F$

The last interpretation is the one widely used in the Matrix Multiplication Problem since it's useful for encoding the notion of a "problem" within another problem.

We define a tensor $T$ as a multilinear polynomial over sets of formal variables $X = \{x_1, ..., x_n\}$, $Y = \{y_1, ..., y_m\}$, $Z = \{z_1, ..., z_p\}$ so that

$$T = \sum_{i=1}^n \sum_{j=1}^m \sum_{k=1}^p \alpha_{ijk} \cdot x_i y_j z_k,$$

where $\alpha_{ijk}$ are elements of some field $\mathbb F$, e.g. $\mathbb {C, R, Q}$. As mentioned before, every tensor under this interpretation (which is equivalent to any of the others) defines a computational problem: for given vectors $\mathbf{x}, \mathbf{y}$, and every $k \in [p]$ (the number of polynomials we want to compute), we want to compute the coefficient of $z_k$, which is a bilinear polynomial:

$$\sum^n_i \sum^m_j \alpha_{ijk} \cdot x_i y_j$$

The Matrix Multiplication Tensor (MMT), then, is the one to solve our problem. For $m,s,t \in \mathbb N$, define:

$$\langle m,s,t\rangle := \sum_{i=1}^m \sum_{j=1}^s \sum_{k=1}^t x_{ij} y_{jk} z_{ki}$$

to be the MMT for multiplying two matrices $A_{m\times s}, B_{s\times t}$.

In general, the coefficient of $z_{ki}$ in $\langle m,s,t\rangle$ is the $(i,k)$th entry of

$$\begin{bmatrix} x_{11} & \cdots & x_{1s} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{ms} \\ \end{bmatrix} \times \begin{bmatrix} y_{11} & \cdots & y_{1t} \\ \vdots & \ddots & \vdots \\ y_{s1} & \cdots & y_{st} \\ \end{bmatrix}$$
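
To make the indexing concrete, here's a small NumPy sketch (my own illustration; `mmt` is a made-up helper name) that builds the coefficient array of $\langle m,s,t\rangle$ and checks that the coefficient of each $z_{ki}$ really is the $(i,k)$ entry of the product:

```python
import numpy as np

def mmt(m, s, t):
    """Coefficient array of <m,s,t>: axis 0 indexes x_ij, axis 1 indexes y_jk,
    axis 2 indexes z_ki (all flattened row-major)."""
    T = np.zeros((m * s, s * t, t * m))
    for i in range(m):
        for j in range(s):
            for k in range(t):
                T[i * s + j, j * t + k, k * m + i] = 1   # the term x_ij * y_jk * z_ki
    return T

m, s, t = 2, 3, 4
X, Y = np.random.randn(m, s), np.random.randn(s, t)

# contracting the tensor with the flattened inputs yields the coefficient of each z_ki ...
z_coeffs = np.einsum('abc,a,b->c', mmt(m, s, t), X.reshape(-1), Y.reshape(-1))
# ... which is exactly (XY)_{ik}, laid out in (k, i) order
assert np.allclose(z_coeffs.reshape(t, m), (X @ Y).T)
```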

The Matrix Multiplication Problem

The formal problem that we encode as an MMT is the Matrix Multiplication Problem:

$$\langle n, n, n \rangle := \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n x_{ij} y_{jk} z_{ki}$$

whose asymptotic rank is the quantity we care about.

Tensor Rank

In general, $Rank$ measures the complexity of a tensor, and a tensor is just the natural generalization of a matrix. The total $Rank$ of a tensor $T$ is the minimum number of $Rank$-one tensors whose sum equals $T$. We say a tensor has $Rank$ one if there exist $a_i, b_j, c_k \in \mathbb F$ such that it can be expressed as the product of linear combinations:

$$T = \Big(\sum^n a_i x_i\Big) \Big(\sum^m b_j y_j\Big) \Big(\sum^p c_k z_k\Big) = \sum_i^n \sum_j^m \sum_k^p \underbrace{a_i b_j c_k}_{\text{our coeff}} \cdot x_i y_j z_k$$

We often express this tensor $T$ as $a \otimes b \otimes c$ (read "$a$ tensor $b$ tensor $c$"). Matrix $Rank$ is the same, just without the third $z_k, c_k$ dimension.

A $Rank \; r = 0$ mathematical object is just a scalar, an $r = 1$ object is a vector, $r = 2$ an $n \times n$ matrix, and $Rank \geq 3$ objects are tensors (here "rank" is being used loosely for the number of dimensions, not the decomposition rank defined above).

At a high level, the decomposition of $T_n$ into $R$ rank-one terms provides an algorithm for multiplying arbitrary $n \times n$ matrices using $R$ scalar multiplications.

Properties of Tensor Rank

  • Lemma: if $Rank(\langle m, s, t\rangle) = r$, then $\omega \leq 3 \log_{mst} r$; via recursive application, $Rank(\langle m, m, m\rangle) = r$ implies $\omega \leq \log_m r$. We can express the tensor as the sum of $r$ terms that are linear combinations of the $x$ entries of our tensor times linear combinations of the $y$ entries ... and so on:
$$\langle m, s, t\rangle = \sum_{\ell=1}^r \Big(\sum_{i, j} a_{\ell ij}x_{ij} \Big) \Big(\sum_{j, k} b_{\ell jk}y_{jk} \Big) \Big(\sum_{k, i} c_{\ell ki}z_{ki} \Big)$$

and we use this expression to develop an algorithm for constant-sized matrices, e.g. $\langle 2, 2, 2\rangle$ representing the product of two $2\times 2$ matrices, and we use it to recursively solve for the product of larger matrices whose entries are themselves matrices:

$$\begin{bmatrix} \begin{bmatrix} \; & \; \end{bmatrix} & \begin{bmatrix} \; & \; \end{bmatrix} \\ \\ \begin{bmatrix} \; & \; \end{bmatrix} & \begin{bmatrix} \; & \; \end{bmatrix} \end{bmatrix} \times \begin{bmatrix} \begin{bmatrix} \; & \; \end{bmatrix} & \begin{bmatrix} \; & \; \end{bmatrix} \\ \\ \begin{bmatrix} \; & \; \end{bmatrix} & \begin{bmatrix} \; & \; \end{bmatrix} \end{bmatrix}$$

The algorithm is:

  1. For all $\ell = 1, ..., r$ recursively compute
$$P_\ell = \Big(\sum_{i, j} a_{\ell ij}x_{ij} \Big) \Big(\sum_{j, k} b_{\ell jk}y_{jk} \Big)$$
  2. For the coefficient of each $z_{ki}$, compute
$$\sum_{\ell=1}^r c_{\ell k i} P_\ell$$

This algorithm can be used to multiply block matrices. Via recursive application, we can multiply matrices of arbitrary size, with the rank $R$ controlling the asymptotic complexity: if the base tensor is $\langle N, N, N \rangle$ with rank $R$, two $n\times n$ matrices can be multiplied in $O(n^{\log_N R})$ time.

Thus, $P_\ell$ is a "block" and a comparatively cheap quantity to compute, since it's just a summation with few multiplications compared to our naive $n^3$ approach. Additionally, $\sum_{\ell} c_{\ell ki} P_\ell$ is cheap to compute since it's just another linear combination, which can be computed in linear time. So, the whole runtime of this algorithm is dependent on the number of recursive calls to compute our blocks, which is $r$, hence why we are concerned with the asymptotic rank of the MMT, and also how Strassen got

$$Rank(\langle 2, 2, 2 \rangle) \leq 7 \implies \omega \leq \log_2 7 \approx 2.81$$

Similarly, in 1978 Pan achieved $\omega \leq 2.80$ via

$$Rank(\langle 70, 70, 70 \rangle) \leq 143{,}640 \implies \omega \leq \log_{70} 143{,}640 \approx 2.80$$
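
Both exponents drop out of the lemma above mechanically; a quick sanity check in Python (using the figures quoted here):

```python
from math import log

def omega_bound(m, s, t, rank):
    """omega <= 3 * log_{m*s*t}(rank), per the rank lemma above."""
    return 3 * log(rank) / log(m * s * t)

print(omega_bound(2, 2, 2, 7))           # Strassen: ~2.807
print(omega_bound(70, 70, 70, 143_640))  # Pan '78:  ~2.795
```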

It was around this time that people stopped analyzing fixed size Matrix Multiplication Problems because it quickly becomes unwieldy to work with such large objects. So, the method for analysis changed. In 1987, Coppersmith and Winograd published a new method for designing Matrix Multiplication Algorithms.4 These methods –which have become more or less standard as far as I can tell– rely on two components:

  1. an Algebraic Identity: corresponding to a $Rank$ representation of some small tensor
  2. a Method for Analyzing the Identity: which is the method of extracting a Matrix Multiplication Algorithm from other tensors

| Publication | Algebraic Identity | Analysis |
| --- | --- | --- |
| Strassen, 1969 | $2\times 2$ MM using only 7 multiplications | Simple, recursive approach |
| $\vdots$ | $\vdots$ | $\vdots$ |
| Coppersmith, Winograd 1987 | The $CW^{\otimes 2}$ tensor and its powers | Laser Method, introduced by Coppersmith, Winograd, and Strassen |
| $\vdots$ | $\vdots$ | $\vdots$ |
| Stothers, 2010 | $CW^{\otimes 4}$ | Laser Method |
| Vassilevska Williams, 2011 | $CW^{\otimes 8}$ | Laser Method |
| Le Gall, 2014 | $CW^{\otimes 32}$ | Laser Method |
| Vassilevska Williams, 2020 | $CW^{\otimes 32}$ | Refined Laser Method |

The Laser Method

The Laser Method is an indirect technique for extracting Matrix Multiplication Algorithms out of other objects. It was developed by Strassen, and optimized by Coppersmith and Winograd. Crucially, as recently as 2019,5 Josh Alman proved the correctness of the Laser Method, showing that for tensors $T$ for which it can be applied, if any other method could achieve the optimum of $\omega = 2$ using $T$, then so can the Laser Method. Therefore, the Laser Method is optimal for a certain subset of tensors.

How does it work?

The Laser Method is a composition of reduction problems:

  1. Take some algebraic problem $P$ that we have an efficient algorithm for
  2. Show how to reduce the problem of Matrix Multiplication to $P$

This is nearly identical to Strassen's original approach to proving sub-cubic time. E.g. given the algebraic problem instance of Matrix Multiplication for two $2 \times 2$ matrices:

$$X = \begin{bmatrix} \color{blue} x_{11} & \color{blue}x_{12} \\ \color{blue}x_{21} & \color{blue}x_{22} \end{bmatrix}, Y = \begin{bmatrix} \color{orange} y_{11} & \color{orange}y_{12} \\ \color{orange}y_{21} & \color{orange}y_{22} \end{bmatrix},$$

evaluate the four polynomials:

$$\color{blue}x_{11}\color{orange}y_{11} \color{black} + \color{blue}x_{12}\color{orange}y_{21}, \; \color{blue}x_{11}\color{orange}y_{12} \color{black} + \color{blue}x_{12}\color{orange}y_{22} \\ \color{blue}x_{21}\color{orange}y_{11} \color{black} + \color{blue}x_{22}\color{orange}y_{21}, \; \color{blue}x_{21}\color{orange}y_{12} \color{black} + \color{blue}x_{22}\color{orange}y_{22}$$

Hell, we may as well throw them into the matrix box that they belong in:

$$\begin{bmatrix} \color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21} & \color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22} \\ \color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21} & \color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22} \end{bmatrix}$$

and suppose we can solve some other problem $P$: given terms

$$\color{blue}x_{11}, x_{12}, x_{21}, x_{22},\color{black} x_{33}, \\ \color{orange} y_{11}, y_{12}, y_{21}, y_{22}, \color{black} y_{33}$$

we want to compute the 5 polynomials:

$$\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21}, \\ \color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22} \color{black}+ \color{black} x_{33}y_{33}, \\ \color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21}, \\ \color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}, \\ \color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}$$

At a high level, our goal is to embed matrix multiplication into this similar, and notably larger (by the introduction of the $x_{33}, y_{33}$ terms), problem $P$. Our measure of success is removal of some inputs and outputs of $P$ to transform it into the Matrix Multiplication Problem:

$$\color{blue}x_{11}\color{orange}y_{11} \color{black} + \color{blue}x_{12}\color{orange}y_{21}, \\ \color{blue}x_{11}\color{orange}y_{12} \color{black} + \color{blue}x_{12}\color{orange}y_{22} + \color{red} \cancel{\color{black}x_{33}y_{33}}, \\ \color{blue}x_{21}\color{orange}y_{11} \color{black} + \color{blue}x_{22}\color{orange}y_{21}, \\ \color{blue}x_{21}\color{orange}y_{12} \color{black} + \color{blue}x_{22}\color{orange}y_{22}, \\ \color{red}\cancel{\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}}$$

which would leave us with just the polynomials needed for Matrix Multiplication as part of $P$. But how can we cancel those inputs and outputs?

If we phrase $P$ in terms of a tensor that we have a known, good $Rank$ bound for, such as the $2 \times 2$ MMT:

$$(\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21}\color{black})z_{11} \color{black}+ (\color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22}\color{black})z_{12} + \\ (\color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21}\color{black})z_{21} \color{black}+ (\color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}\color{black})z_{22}$$

we must now show how to reduce it to $P$, where:

$$\begin{aligned} P = &(\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21}\color{black})z_{11} + \\ &(\color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22} \color{black}+ \color{black} x_{33}y_{33})z_{12} + \\ &(\color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21}\color{black})z_{21} + \\ &(\color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}\color{black})z_{22} + \\ &(\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}\color{black})z_{3} \end{aligned}$$

The way we achieve this reduction is by setting some of the $X, Y, Z$ variables of our instance of $P$ to zero. Setting $x_{33}, y_{33}, z_3 \leftarrow 0$ yields our desired Matrix Multiplication tensor exactly (which is not always the case, as we'll see shortly):

$$\begin{aligned} P = &(\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21}\color{black})z_{11} + \\ &(\color{blue}x_{11}\color{orange}y_{12} \color{black}+ \color{blue}x_{12}\color{orange}y_{22} \color{black} + \color{red} \cancel{\color{black} x_{33}y_{33}}\color{black})z_{12} + \\ &(\color{blue}x_{21}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{21}\color{black})z_{21} + \\ &(\color{blue}x_{21}\color{orange}y_{12} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}\color{black})z_{22} + \\ &(\color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{22}\color{orange}y_{22}\color{black})\color{red} \cancel{z_{3}} \end{aligned}$$

This is called zeroing-out a tensor, which is similar to the notion of reducibility in complexity theory. We denote that $\langle 2, 2, 2 \rangle$ is obtainable from $P$ via zeroing as "$\langle 2, 2, 2 \rangle \leq_{zo} P$", and $P$ does have a well-defined $Rank$ bound.

Lemma: If $Q \leq_{zo} P$, then $Rank(Q) \leq Rank(P)$, so

if $\langle m, s, t \rangle \leq_{zo} P$, then $\omega \leq 3 \log_{mst} Rank(P)$.

Brief aside about the Strassen method

Strassen showed that it is possible to compute those polynomials using the following identities:

$$\begin{bmatrix} \color{blue} x_{11} & \color{blue}x_{12} \\ \color{blue}x_{21} & \color{blue}x_{22} \end{bmatrix} \begin{bmatrix} \color{orange} y_{11} & \color{orange}y_{12} \\ \color{orange}y_{21} & \color{orange}y_{22} \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}$$

where

$$\begin{aligned} m_{11} &= \color{blue}x_{11}\color{orange}y_{11} \color{black}+ \color{blue}x_{12}\color{orange}y_{21} \color{black} \\ m_{12} &= w + v + (x_{11} + x_{12} - x_{21} - x_{22})y_{22} \\ m_{21} &= w + u + x_{22}(y_{21} + y_{12} - y_{11} - y_{22}) \\ m_{22} &= w + u + v \end{aligned}$$

where

$$\begin{aligned} u &= (x_{21} - x_{11})(y_{12} - y_{22}) \\ v &= (x_{21} + x_{22})(y_{12} - y_{11}) \\ w &= x_{11}y_{11} + (x_{21} + x_{22} - x_{11})(y_{11} + y_{22} - y_{12}) \\ \end{aligned}$$

which uses only 7 multiplications instead of the naive 8, and 15 additions compared to the 18 in Strassen's original formulation!
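
If you don't trust the algebra (I didn't), here's a quick numerical check of those identities (my own sanity check); note that $x_{11}y_{11}$ is computed once and reused, so there really are only 7 multiplications. The same identities applied to blocks give the recursive algorithm.

```python
import numpy as np

def strassen_winograd_2x2(X, Y):
    """The 7-multiplication identities above, applied to 2x2 inputs."""
    (x11, x12), (x21, x22) = X
    (y11, y12), (y21, y22) = Y
    p1 = x11 * y11                                     # 1
    p2 = x12 * y21                                     # 2
    u = (x21 - x11) * (y12 - y22)                      # 3
    v = (x21 + x22) * (y12 - y11)                      # 4
    w = p1 + (x21 + x22 - x11) * (y11 + y22 - y12)     # 5
    p6 = (x11 + x12 - x21 - x22) * y22                 # 6
    p7 = x22 * (y21 + y12 - y11 - y22)                 # 7
    return np.array([[p1 + p2, w + v + p6],
                     [w + u + p7, w + u + v]])

X, Y = np.random.randn(2, 2), np.random.randn(2, 2)
assert np.allclose(strassen_winograd_2x2(X, Y), X @ Y)
```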


Laser Method

The Laser Method is a bit more complicated than just the above reduction, namely because, for larger input matrices, it becomes increasingly challenging to zero out terms without also removing terms we still need for the Matrix Multiplication proper.

Thus, the Laser Method instead reduces direct sums of Matrix Multiplication Tensors to powers of tensors with a special structure via the zero-out reduction illustrated above. The process looks like this:

$$\underbrace{\langle n, n, n \rangle \implies \oplus_i \langle m_i, s_i, t_i \rangle}_{\text{MM reducible to a direct summation of other MMTs}} \overset{\text{laser method}}{\implies} \underbrace{T^{\otimes N} \implies T}_{\text{bounding } T \text{ via its powers}}$$

and composing all of these gives us a reduction from Matrix Multiplication to a tensor $T$.

Direct Summation

For some tensor, e.g. $T = x_1y_1z_1 + x_1y_2z_2$, we can add it to a slightly modified copy of itself, yielding a "regular" sum:

$$A = T + T' = (x_1y_1z_1 + x_1\color{red}y_2\color{black}z_2) + (x_2\color{red}y_2\color{black}z_2 + x_2y_3z_3)$$

Observe that both copies share a $y_2$ variable. Conversely, a direct sum of two copies of $T$ is completely disjoint: the copies have no variables in common even though they have the same structure:

$$B = T + T' = (x_1y_1z_1 + x_1y_2z_2) + (x_3y_3z_3 + x_3y_4z_4)$$

Schönhage showed that, because of this disjoint-ness, rank upper bounds for direct sums of MMTs also yield upper bounds for $\omega$ via the following theorem:6

If the direct sum of $k$ copies of $\langle m, s, t \rangle$ has $Rank \leq r$, then $\omega \leq 3 \log_{mst} \frac{r}{k}$.

This means that we just have to show that the rank of $k$ disjoint copies of some MMT is smaller than $k$ times the rank of the original tensor. This is the first step of the method diagrammed above:

$$\underbrace{\langle n, n, n \rangle}_{\omega \leq 3 \log_{mst} \frac{r}{k} } \implies \underbrace{\oplus^k \langle m, s, t \rangle}_{Rank = r} \implies ...$$

Tensor Exponentiation

Now, for the other half of the method, what the hell does this mean: $T^{\otimes N} \implies T$? We need to define what it means to exponentiate a tensor. First, we define the product of two tensors via the Kronecker Product.

For tensors $S, T$ over variable sets $\{X, Y, Z\}, \{X', Y', Z'\}$, respectively, where

$$S = \sum_{i,j,k} \alpha_{ijk}x_iy_jz_k, \qquad T = \sum_{i',j',k'} \beta_{i'j'k'}x'_{i'}y'_{j'}z'_{k'}$$

not only does our notation converge on Sanskrit, but we get a definition of a tensor product:

$S \otimes T$ over $X \times X'$, $Y \times Y'$, and $Z \times Z'$ is given by:

$$\otimes : S^{X \times Y \times Z} \times T^{X' \times Y' \times Z'} \rarr \mathbb F, \\ S \otimes T = \sum_{\color{blue}i,j,k, \; \color{red}i',j',k'} \color{black}\alpha_{\color{blue}ijk}\beta_{\color{red}i'j'k'}\color{black}(\color{blue}x_i \color{red} x'_{i'}\color{black})(\color{blue}y_j \color{red} y'_{j'}\color{black})(\color{blue}z_k \color{red} z'_{k'}\color{black})$$

which we immediately generalize into the definition of the power of a tensor:

$$T^{\otimes N} = T \otimes \underbrace{\cdots}_{N \text{ copies}} \otimes T$$

Lemma: The product of any two MMTs is an MMT whose dimensions multiply: $\langle a_1, a_2, a_3 \rangle \otimes \langle b_1, b_2, b_3 \rangle = \langle a_1b_1, a_2b_2, a_3 b_3\rangle$, which precisely corresponds to our idea of "blocking."

Observe that $T^{\otimes N}$ will contain $N$-tuples over each variable set of the original tensor, so we get larger and larger variable sets, but the rank is guaranteed to be bounded by the product of the individual ranks of the tensors being multiplied, per the following:

Lemma: $Rank(S \otimes T) \leq Rank(S)\cdot Rank(T)$, so $Rank(T^{\otimes N}) \leq Rank(T)^N$

and sometimes, $Rank(T^{\otimes N}) \ll Rank(T)^N$.
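
The inequality is constructive: glue a rank decomposition of $S$ to one of $T$ factor-wise and you get a decomposition of $S \otimes T$ with $Rank(S)\cdot Rank(T)$ terms. A small NumPy sketch of that (my own illustration, with made-up helper names):

```python
import numpy as np

def tensor_kron(S, T):
    """Coefficient array of S (x) T: index pairs (i,i'), (j,j'), (k,k') get flattened together."""
    n1, m1, p1 = S.shape
    n2, m2, p2 = T.shape
    return np.einsum('ijk,abc->iajbkc', S, T).reshape(n1 * n2, m1 * m2, p1 * p2)

def rank_one(a, b, c):
    return np.einsum('i,j,k->ijk', a, b, c)

# random rank-3 and rank-2 decompositions of two small tensors
S_factors = [tuple(np.random.randn(d) for d in (2, 3, 4)) for _ in range(3)]
T_factors = [tuple(np.random.randn(d) for d in (3, 2, 2)) for _ in range(2)]
S = sum(rank_one(*f) for f in S_factors)
T = sum(rank_one(*f) for f in T_factors)

# combining every pair of rank-one terms factor-wise (np.kron) gives a
# decomposition of S (x) T with 3 * 2 terms, witnessing the rank bound
combined = sum(rank_one(np.kron(a, d), np.kron(b, e), np.kron(c, f))
               for (a, b, c) in S_factors for (d, e, f) in T_factors)
assert np.allclose(combined, tensor_kron(S, T))
```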

From these lemmas, we can get an expression for asymptotic rank:

$$\overrightharpoon{R} = \lim_{N \rarr \infty} Rank(T^{\otimes N})^{1/N}$$

For example, for the $2 \times 2$ MMT, $Rank(\langle 2, 2, 2\rangle) = 7$, but the rank of its $N$th power $\langle 2^N, 2^N, 2^N \rangle$ grows like $(2^N)^{\omega + o(1)}$, i.e. like $n^\omega$ for $n = 2^N$! Thus, asymptotic rank is the secret workhorse of all modern Matrix Multiplication algorithms. This completes the right half of our methodology diagram:

$$... \overset{\text{laser method}}{\implies} \underbrace{T^{\otimes N}}_{\overrightharpoon{R} \leq r^N} \implies \underbrace{T}_{Rank = r}$$

We can interpret this to mean: if we have a bound on the rank of the tensor, we also have a bound on the rank of its $N$th power. So now we just have to show the reduction via the Laser Method, allowing us to compose the left and right halves. If we can compose these steps for sufficiently large $m, s, t, k$, then we can get a really good bound on $\omega$.

Laser Method

Finally, the middle bit: $... \overset{\text{laser method}}{\implies} ...$

The Laser Method takes a tensor $T$ with a "special structure" and uses that structure to reduce the direct sum $\oplus^k \langle m, s, t\rangle$ to $T^{\otimes N}$, for large $m,s,t,k$ which are functions of $N$, by zeroing out many variables of $T^{\otimes N}$, yielding huge disjoint sums.

Special Sauce: Tensor Partitioning

What is the aforementioned "special structure?" We start with some tensor:

$$T = \sum^n_a \sum^m_b \sum^p_c \alpha_{abc}\, x_a y_b z_c$$

and partition its variable sets $X = X_1 \cup ... \cup X_\ell$, $Y = Y_1 \cup ... \cup Y_\ell$, $Z = Z_1 \cup ... \cup Z_\ell$ into $\ell$ parts. Then, for $i,j,k \in [\ell]$, let the subtensor $T_{ijk}$ be

$$T_{ijk} = \sum_{x_a \in X_i} \sum_{y_b \in Y_j} \sum_{z_c \in Z_k} \alpha_{abc} \cdot x_a y_b z_c$$

which contains only the triples $\alpha_{abc} \cdot x_a y_b z_c$ with $x_a \in X_i, y_b \in Y_j, z_c \in Z_k$.

Our whole tensor $T$ can then be expressed as a sum of subtensors:

$$T = \sum^\ell_i \sum^\ell_j \sum^\ell_k T_{ijk}$$

And if all the subtensors $T_{ijk}$ are MMTs, $T$ is a sum of MMTs, though not yet a direct sum. We want them to form a disjoint, direct sum, which is provably possible for all such partitioned tensors. Trivially, we could achieve this by partitioning each variable set into singletons so that we have a direct sum of $1 \times 1 \times 1$ Matrix Multiplication Tensors. Such a decomposition isn't very useful though.

Let's take a look instead at a useful partitioning of a nontrivial tensor such as the Coppersmith-Winograd tensor, parameterized over some natural number $q$:

$$CW_q = x_0y_0z_{q+1} + x_0y_{q+1}z_0 + x_{q+1}y_0z_0 + \sum^q_{i=1} (x_0y_iz_i + x_iy_0z_i + x_iy_iz_0)$$

This tensor is fascinating because its asymptotic rank is $q+2$, which is optimal since that's the number of $x$ (or $y$, or $z$) variables that appear in it; e.g. $x_0$ shows up in exactly $q+2$ of its terms:

$$CW_q = \color{blue}\underbrace{x_0}_{+1}\color{black}y_0z_{q+1} + \color{blue}\underbrace{x_0}_{+1}\color{black}y_{q+1}z_0 + x_{q+1}y_0z_0 + \sum^q_{i=1} (\color{blue}\underbrace{x_0}_{+q}\color{black}y_iz_i + x_iy_0z_i + x_iy_iz_0)$$

If an MMT had optimal asymptotic rank, then since its number of variables is $n^2$, we'd be getting an asymptotic rank of $n^2$, or $\omega = 2$, so we'd really like to embed the Matrix Multiplication Problem in this tensor.
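
Writing $CW_q$ down as a coefficient array makes its structure (and the partition below) easy to poke at; a tiny sketch, with `cw` being my own made-up helper:

```python
import numpy as np

def cw(q):
    """Coefficient array of the Coppersmith-Winograd tensor CW_q;
    each axis indexes the variables 0, 1..q, q+1."""
    T = np.zeros((q + 2, q + 2, q + 2), dtype=int)
    T[0, 0, q + 1] = 1           # x_0 y_0 z_{q+1}
    T[0, q + 1, 0] = 1           # x_0 y_{q+1} z_0
    T[q + 1, 0, 0] = 1           # x_{q+1} y_0 z_0
    for i in range(1, q + 1):
        T[0, i, i] = 1           # x_0 y_i z_i
        T[i, 0, i] = 1           # x_i y_0 z_i
        T[i, i, 0] = 1           # x_i y_i z_0
    return T

T = cw(5)
print(T.sum())                                       # 3 + 3q = 18 terms for q = 5
print(T[0].sum(), T[:, 0].sum(), T[..., 0].sum())    # x_0, y_0, z_0 each appear q + 2 = 7 times
```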

The optimal partitioning for this tensor resembles the following:

$$X_0 = \{x_0\}, \quad X_1 = \{x_1, ..., x_q\}, \quad X_2 = \{x_{q+1}\}, \\ Y_0 = \{y_0\}, \quad Y_1 = \{y_1, ..., y_q\}, \quad Y_2 = \{y_{q+1}\}, \\ Z_0 = \{z_0\}, \quad Z_1 = \{z_1, ..., z_q\}, \quad Z_2 = \{z_{q+1}\}$$

The tensor then becomes

$$\begin{aligned} CW_q &= T_{002} + T_{020} + T_{200} + T_{011} + T_{101} + T_{110} \\ &= \underbrace{\langle 1,1,1 \rangle + \langle 1,1,1 \rangle + \langle 1,1,1 \rangle}_{\text{scalars}} + \underbrace{\langle 1,1,q \rangle + \langle q,1,1 \rangle + \langle 1,q,1 \rangle}_{\text{inner products}} \end{aligned}$$

The inner products are nontrivial MMTs, and therefore challenging to get any fruitful zeroations out of. Instead, the Laser Method doesn't zero out the tensor itself, but takes the $N$th power of it to give us more freedom for zeroing out.

The image of this process in my head is: taking two squares (matrices) we want to multiply, transforming their representation into a cube (tensor), exponentiating that cube into a rectangular prism, and then scratching out a bunch of the numbers we don't need until we're left with something "close" to the number of elements of just two slices of the rectangular prism which would be our initial matrices.

Taking a Large Power

$T$ is constant sized, and partitioned s.t. the $T_{ijk}$ subtensors are matrix products. We take the $N$th power of $T$:

$$T^{\otimes N} = \Big (\sum_{i,j,k} T_{ijk}\Big )^{\otimes N} = \sum_{(T_1, ..., T_N) \in \{T_{ijk} \vert i,j,k \in [\ell]\}^N} T_1 \otimes ... \otimes T_N$$

When we take the $N$th power, we get the $N$th power of a sum of subtensors, which expands into a sum over $N$-tuples $(T_1, ..., T_N)$, where each $T_p$ is one of the subtensors $T_{ijk}$. Since our subtensors are matrix products, each term $T_1 \otimes ... \otimes T_N$ is a tensor product of matrix products, which we've established is itself a matrix product. So $T^{\otimes N}$ is a sum of matrix products, just really really big ones. Recall that what we really want is a disjoint sum of matrix products – all with the same dimensions. Ours currently do not have the same dimensions unless our subtensors all had the same dimensions to begin with, which was not a constraint of the partition.

So, we zero out some of the terms. We do this by associating to each term $T_1 \otimes ... \otimes T_N$, $(T_1, ..., T_N) \in \{ T_{ijk}\}^N$, a distribution $\pi$ recording the frequency of each subtensor, i.e. the fraction of times we used a given subtensor $T_{ijk}$ in the tensor power:

$$\pi_{ijk} = \frac{1}{N} \Big \vert \{ p \in [N] \vert T_p = T_{ijk}\} \Big \vert$$

Now, all the terms $T_1 \otimes ... \otimes T_N$ of $T^{\otimes N}$ whose $N$-tuples correspond to the same $\pi$ are matrix products with the same dimensions $\langle m_\pi, n_\pi, t_\pi \rangle$. This is because

$$\langle a, b, c \rangle^{\otimes t} \otimes \langle p, q, r \rangle^{\otimes u} \overset{iso}{\equiv} \langle a^tp^u, b^tq^u, c^tr^u \rangle$$

so the location of these factors in the product does not matter, just the number of occurrences, hence the ability to proxy dimension with our distribution ascription. So, we just need to pick a $\pi$ and zero out the matrix products of disparate dimension which don't correspond to it.
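
As a toy illustration of that bookkeeping (my own sketch; the block dimensions are hypothetical values echoing the $\langle 1,1,q \rangle, \langle q,1,1 \rangle, \langle 1,q,1 \rangle$ blocks above with $q = 3$): $\pi$ is just a normalized histogram of which blocks were used, and the dimensions of the resulting matrix product multiply up per the isomorphism above.

```python
from collections import Counter
from math import prod

# toy dimensions <m,s,t> for each block T_ijk we might pick from the base tensor
block_dims = {
    (0, 1, 1): (1, 1, 3),
    (1, 0, 1): (3, 1, 1),
    (1, 1, 0): (1, 3, 1),
}

def pi_and_dims(term):
    """term: the N-tuple of block labels used in one T_1 (x) ... (x) T_N."""
    N = len(term)
    counts = Counter(term)
    pi = {ijk: c / N for ijk, c in counts.items()}
    # <a,b,c>^t (x) <p,q,r>^u ~= <a^t p^u, b^t q^u, c^t r^u>
    dims = tuple(prod(block_dims[ijk][axis] ** c for ijk, c in counts.items())
                 for axis in range(3))
    return pi, dims

term = ((0, 1, 1),) * 4 + ((1, 0, 1),) * 4 + ((1, 1, 0),) * 4   # N = 12
pi, dims = pi_and_dims(term)
print(pi)     # each block used 1/3 of the time
print(dims)   # (81, 81, 81): an <81, 81, 81> matrix product
```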

Goal: pick $\pi$ and zero out variables in $T^{\otimes N}$ to get a direct sum of subtensors $T_1 \otimes ... \otimes T_N$ which correspond to $\pi$. This is the Laser Method.

Suppose lasering thusly gives us $L$ such tensors in the direct summation – we can use Schönhage's theorem and get

$$\omega \leq 3\, \frac{\log \frac{Rank(T^{\otimes N})}{L}}{\log (m_\pi n_\pi t_\pi) }$$

yielding an inverse relationship between $L$ and $\omega$: $L\uparrow\, \omega \downarrow$. We want to maximize $L$ for a fixed distribution $\pi$ and express it in terms of our distribution. Then, we can minimize the upper bound on $\omega$ over the choices for $\pi$:

$$\underbrace{\langle n, n, n \rangle}_{\omega \leq 3 \log_{mst} \frac{r}{k} } \overset{\text{Schönhage}}{\implies} \underbrace{\oplus^k \langle m, s, t \rangle}_{Rank \leq r} \overset{\text{laser method}}{\implies} \underbrace{T^{\otimes N}}_{Rank \leq r^N} \overset{\text{recursion}}{\implies} \underbrace{T=\sum_{i,j,k} T_{ijk}}_{\text{partitioned tensor with } Rank = r}$$

We're left to determine $L, m_\pi, n_\pi, t_\pi$ as functions of $\pi, N, T$. And already we've got a problem: we're scheming to zero out variables, but $\pi$ is currently a distribution over triples, not individual variables. So, we have to do something to convert $\pi$ from triples of $T_{ijk}$ into a distribution over single variables.

Marginalizing $\pi$

A marginal corresponds to the fraction of times that a term of shape $T_{i**}$ appears in $T_1 \otimes ... \otimes T_N$ for a given index $i$. We define the "$x$" marginals of $\pi$ as

$$\pi_i :=\sum_{j,k \in [\ell]} \pi_{ijk}$$

Recall that $T^{\otimes N}$ is a tensor over the $N$-tuples $X^N, Y^N, Z^N$, so its $X$ variables are partitioned by $X_{c_1} \times ... \times X_{c_N}$ for $c_1, ..., c_N \in [\ell]$, and each subtensor $T_1 \otimes ... \otimes T_N$ corresponding to $\pi$ uses the $X$ variables from a set

$$X_{c_1} \times ... \times X_{c_N} \text{ where } \forall i \in [\ell], \Big| \{ p \in [N] \vert c_p = i\}\Big| = \pi_i N$$

e.g. $T_1 = T_{c_1 jk}$ for some $j,k$. So, every term in our product(s) obeys some marginals. The Laser Method zeroes out the variables in $T^{\otimes N}$ that do not match the marginals $\pi_i$ we desire, yielding a sum of terms $T_1 \otimes ... \otimes T_N$ that obey those marginals; all remaining variables have exactly the correct frequency for every block $T_i$. This effectively converts the sum into a direct sum. The size of this direct sum – the number of $X$ variable sets matching $\pi_i$ – is the multinomial coefficient:

$$M := { N \choose \pi_1 N, ..., \pi_\ell N}$$

and it is an upper bound on the number of disjoint subtensors we can keep, since by definition of a disjoint sum each needs its own variable sets. So the number of disjoint subtensors is $L \leq M$, and we want to maximize the number of disjoint subtensors. To do this, we should try to find nontrivial partitions s.t. $L \rarr M$. The Laser Method achieves $L \approx M$ if and only if the linear system defined by the marginals has a unique solution:

$$\pi^x_i :=\sum_{j,k \in [\ell]} \pi_{ijk}; \quad \pi^y_j :=\sum_{i,k \in [\ell]} \pi_{ijk}; \quad \pi^z_k :=\sum_{i,j \in [\ell]} \pi_{ijk}$$

If this linear system has a unique solution, the Laser Method guarantees that we'll get the maximum number of disjoint tensors: $L \approx M$. And even if the solution is not unique – that is, the marginals of $\pi$ don't determine $\pi$ – the Laser Method still gives roughly $M$ "good" subtensors of $T^{\otimes N}$ which are consistent with $\pi$.
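
Computing the marginals (and the count $M$) is mechanical; a small sketch, assuming $\pi$ is stored as an $\ell \times \ell \times \ell$ array:

```python
import numpy as np
from math import factorial, prod

def marginals(pi):
    """x-, y-, and z-marginals of a distribution pi over block triples (i, j, k)."""
    return pi.sum(axis=(1, 2)), pi.sum(axis=(0, 2)), pi.sum(axis=(0, 1))

def count_M(pi_x, N):
    """Multinomial coefficient: number of X variable sets matching the x-marginals."""
    parts = [round(p * N) for p in pi_x]
    assert sum(parts) == N, "pi_x * N must be integral"
    return factorial(N) // prod(factorial(c) for c in parts)

# uniform weight on the three "inner product" blocks of the CW partition
pi = np.zeros((3, 3, 3))
pi[0, 1, 1] = pi[1, 0, 1] = pi[1, 1, 0] = 1 / 3

pi_x, pi_y, pi_z = marginals(pi)
print(pi_x, pi_y, pi_z)        # each comes out to [1/3, 2/3, 0]
print(count_M(pi_x, N=12))     # 12! / (4! 8! 0!) = 495
```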

Additionally, if there are other distributions with the same marginals as $\pi$, then there will also be "extra" subtensors consistent with those marginals which share variables with our good ones. We have to remove them, which adds a lot of noise to the zeroing process.

E.g., suppose $T^{ijk}$ uses $X_i, Y_j, Z_k$, and the $T^{iii}$ are our "good" subtensors. The Laser Method gives us

$$\Big(\sum^M_i T^{iii}\Big) + \underbrace{T^{123}}_{\stackrel{\text{shares } x \text{ var}}{\text{with 1st }T^{iii}}} + \underbrace{T^{253}}_{\stackrel{\text{shares } y \text{ var}}{\text{with 5th }T^{iii}}} + T^{859}$$

Removing these noisy terms costs a sometimes/usually significant penalty since we have to remove $s\cdot M$ noisy/bad tensors via zeroing, which also removes some of our $M$ good subtensors. Previous approaches using the Laser Method yielded the following bound on the remaining viable tensors:

$$L := \Theta(M/s)$$

The latest development, introduced by Alman and Vassilevska Williams, shows that it's possible to attain the following improvement:

$$L := \Theta(M/\sqrt s)$$

which impacts the overall $\omega$ bound as follows:

$$\begin{aligned} \omega &\leq 3\, \frac{\log \frac{Rank(T)^N}{L}}{\log (m_\pi n_\pi t_\pi)} \\ L = \Theta(M/s) &\implies \omega \leq 3\, \frac{\log \frac{Rank(T)^N \cdot s}{M}}{\log (m_\pi n_\pi t_\pi)} \\ L = \Theta(M/\sqrt s) &\implies \omega \leq 3\, \frac{\log \frac{Rank(T)^N \cdot \sqrt{s}}{M}}{\log (m_\pi n_\pi t_\pi)} \end{aligned}$$

It's Hip to be Square (rooted)

So, how did the authors make a categorical improvement, from $s$ down to $\sqrt s$, on the penalty from bad tensors?

It must be super sophisticated and Peter I can only take so many more confounded summations!

Nay, dear reader.

Whereas the prior method:

  1. greedily selected a good subtensor $T^{iii}$, which shared variables with $O(s)$ extra subtensors requiring removal,
  2. zeroed out $O(s)$ variables not in the good tensors to remove those $O(s)$ bad terms,
  3. where each zeroed-out variable would zero out at most one good subtensor,

thus, it would remove $O(s)$ good subtensors for each one it would keep, resulting in $L = \Theta(M/s)$ good remaining subtensors.

The Refined Laser Method instead employs a probabilistic selection of the terms to keep:

  1. Pick $\Theta(M/\sqrt s)$ good subtensors at random, zeroing out every variable they don't use

This removes all the extra subtensors with non-zero probability.

B-b-but non-zero could be really bad

The analysis is optimal for a large family of tensors. This refined method improves upon the previous greedy approach whenever the marginals of $\pi$ do not uniquely determine $\pi$, which is most of the time for sufficiently large partitioned tensors, including $CW^{\otimes 8}$ and its powers. Since the previous best Matrix Multiplication algorithms use $CW^{\otimes 32}$, this immediately improves $\omega \leftarrow 2.37286$.

The Throne Beckons

Notably, the authors concede that they have not applied their refined Laser Method to larger powers of the Coppersmith-Winograd tensor e.g. $CW^{\otimes 64}$... Presumably, this is a computationally intensive task, as searching for a good $\pi$ with that many terms could become intractable,,, maybe? Maybe not. Maybe they just wanted to publish their findings via the minimum necessary candidate tensor. Maybe the academic heart striving for optimality is tempered only by the need to publish.

Even cooler still is that it seems like only ... a few hundred? people are aware of this advancement – judging based on the number of citations the paper currently has. I also suspect that many of these citations might be similar to those referencing the CW paper: handwavey black box appropriation of optimal matrix multiplication performance to pad the literature survey of other adjacent publications.

Sitting next to me, generating heat but not doing too much unless a new Smite god gets dropped anytime soon, is a desktop harboring a GPU that's just begging to mine $\pi$ distributions so that my name could be listed among the ranks of Strassen, Le Gall, and Vassilevska Williams.

This probably betrays some fundamental misunderstanding of the problem at a computational level, but I like the idea...

it's free real estate

The authors do note that this method is yielding diminishing returns, on the order of hundred-thousandths being shaved off of $\omega$. In 2014,7 Le Gall proved that 2.3725 is a lower bound on what can be achieved using $CW^{\otimes 32}_5$, which has been the tensor d'être for advancements made in the past decade:

  • 2.37288 by Vassilevska Williams in '12
  • 2.37287 by Le Gall in '14
  • 2.37286 by Vassilevska Williams in '20

but just imagine:

  • 2.37285 by [your name here] in '23

Additionally, we know that universal generalizations of the Laser Method can't improve past $\omega = 2.18$, so past this lofty point we'd need a completely different tensor structure than $CW$, and a different approach, since the Laser Method can find the optimal bound for any family of tensors to which it applies.

AlphaTensor: Those fuckers at DeepMind did it again

AlphaTensor discovers, from scratch, many provably correct Matrix Multiplication Algorithms that improve upon existing algorithms in terms of the number of scalar multiplications required.8

AlphaTensor is a Reinforcement Learning agent derived from the famed Go agent AlphaZero for solving planning problems. AlphaTensor is trained to play a single-player game whose objective is to find a low-rank decomposition of the Matrix Multiplication Tensor for a given instance of the Matrix Multiplication Problem, within a finite factor space.

The results are pretty neat, showing that AlphaTensor succeeded in finding better algorithm embeddings than several State of the Art techniques, including for the 4x4 problem (in characteristic-2 arithmetic), for which it was previously thought that Strassen's algorithm from 1969 was optimal.

The task of finding Matrix Multiplication algorithm embeddings lends itself to DRL since it boils down to finding a low-rank decomposition of a specific three-dimensional tensor (which is our MMT). While not explicitly mentioned in Vassilevska Williams' paper, it may have been obvious that the underlying problem of computing tensor rank, and hence finding such decompositions, is $NP$-Hard. The search space is so large ($|A| \approx 10^{12}$ possible actions for most interesting instances of the Matrix Multiplication Problem) that the optimal algorithm for even the 3x3x3 problem is unknown. All previous attempts have relied on human search, continuous optimization, and combinatorial searches, aided by human-designed heuristics which are probably not optimal.


Algorithm 1

(for reference)

$$\boxed{ \begin{aligned} &\text{Input:} \\ &\quad 1. \; \{\mathbf u^r, \mathbf v^r, \mathbf w^r\}^R_{r=1} \text{ vectors of length } n^2 \text{ s.t. } T_n = \sum_{r=1}^R \mathbf u^r \otimes \mathbf v^r \otimes \mathbf w^r\\ &\quad 2. \; A_{n \times n},B_{n \times n} \\ &\text{Output: } C_{n\times n} \\ \hline &\text{Initialize } C_{n\times n} \leftarrow \mathbf 0 \\ &\text{for } r = 1, ..., R: \\ &\qquad m_r \leftarrow (u_1^ra_1 + ... + u_{n^2}^ra_{n^2})(v_1^rb_1 + ... + v_{n^2}^rb_{n^2})\\ &\text{for } i = 1, ..., n^2: \\ &\qquad C_i \leftarrow (w_i^1m_1 + ... + w_i^Rm_R) \\ &\text{Return } C \\ \end{aligned} }$$
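
And as a runnable companion, a minimal NumPy version of Algorithm 1 (my own sketch; the flattening conventions, e.g. $z_{ki} \mapsto$ index $kn + i$, are my assumption), demonstrated with the trivial rank-$n^3$ decomposition of $T_n$, i.e. the standard algorithm:

```python
import numpy as np

def bilinear_matmul(U, V, W, A, B):
    """Algorithm 1: multiply A @ B given a rank-R decomposition (U, V, W) of T_n.
    U, V, W have shape (R, n*n); A flattens as x_ij -> i*n+j, B as y_jk -> j*n+k,
    and the output coefficients are indexed by z_ki -> k*n+i."""
    n = A.shape[0]
    m = (U @ A.reshape(-1)) * (V @ B.reshape(-1))   # the R scalar products m_r
    c = W.T @ m                                     # coefficient of each z_ki
    return c.reshape(n, n).T                        # c[k*n+i] = (AB)_{ik}

def trivial_decomposition(n):
    """Rank-n^3 decomposition of T_n: one rank-one term x_ij (x) y_jk (x) z_ki per (i, j, k)."""
    U, V, W = (np.zeros((n**3, n * n)) for _ in range(3))
    for r, (i, j, k) in enumerate((i, j, k) for i in range(n)
                                  for j in range(n) for k in range(n)):
        U[r, i * n + j] = V[r, j * n + k] = W[r, k * n + i] = 1
    return U, V, W

n = 4
A, B = np.random.randn(n, n), np.random.randn(n, n)
assert np.allclose(bilinear_matmul(*trivial_decomposition(n), A, B), A @ B)
```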

TensorGame

The formalization of the problem as an RL environment or "game" is as follows (a minimal code sketch follows the list):

  • At each timestep $t$, the player selects entries of the input matrices to multiply, and is rewarded based on the number of operations required to get the correct result
  • The game state after each $t$ is described by the tensor $S_t$, which is initialized to the target tensor we want to decompose: $S_0 \leftarrow T_n$
  • At each time step, the agent selects a triple $(\mathbf u^t, \mathbf v^t, \mathbf w^t)$ and $S_t$ is updated by subtracting the resultant rank-one tensor
$$S_t \leftarrow S_{t-1} - \mathbf u^t \otimes \mathbf v^t \otimes \mathbf w^t$$
  • The terminal state is the zero tensor $S_t = \mathbf 0$; the game ends when it is reached, ideally in the smallest number of moves, or when the step limit $R_{\ell}$ is hit
    • Each step incurs a penalty of $-1$, and non-solution terminal states also receive an additional negative reward of $-\gamma(S_{R_{\ell}})$, where $\gamma(\cdot)$ is an upper bound on the rank of the terminal tensor, in order to encourage optimization for rank. However, this reward/penalty function can also be structured around other desirable, practical properties such as runtime
    • When a terminal state is reached and is a valid solution, the sequence of selected factors satisfies
$$T_n = \sum_{t=1}^R \mathbf u^t \otimes \mathbf v^t \otimes \mathbf w^t$$
  • Additionally, the $\{ \mathbf u^t, \mathbf v^t, \mathbf w^t \}$ are constrained to a discrete set of coefficients $F$, e.g. $\{ -2, -1, 0, 1, 2 \} \subset \mathbb Z$, to avoid issues with floating point arithmetic
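
Here's a toy sketch of that environment (emphatically my own illustration, not DeepMind's implementation; the rank penalty $\gamma$ is stood in for by a crude slice-rank upper bound), plus a demo where the 8 "standard algorithm" moves solve the $2\times 2$ instance:

```python
import numpy as np

class TensorGame:
    """Toy TensorGame: state is the residual tensor, a move subtracts u (x) v (x) w,
    and the episode ends at the zero tensor or after R_limit moves."""

    def __init__(self, target, R_limit=64, coeffs=(-2, -1, 0, 1, 2)):
        self.S = target.copy()           # S_0 <- T_n
        self.R_limit = R_limit
        self.coeffs = coeffs             # allowed entries of the factor vectors
        self.t = 0

    def step(self, u, v, w):
        """Apply one move; returns (state, reward, done)."""
        assert all(x in self.coeffs for x in np.concatenate([u, v, w]))
        self.S = self.S - np.einsum('i,j,k->ijk', u, v, w)
        self.t += 1
        solved, out_of_moves = not self.S.any(), self.t >= self.R_limit
        reward = -1.0                    # every step costs one multiplication
        if out_of_moves and not solved:
            # stand-in for -gamma(S): an easy rank upper bound, the sum of the
            # matrix ranks of the first-mode slices of the residual
            reward -= sum(np.linalg.matrix_rank(self.S[i]) for i in range(self.S.shape[0]))
        return self.S, reward, solved or out_of_moves

# demo: decompose T_2 with the 8 moves of the standard algorithm
n = 2
T = np.zeros((n * n, n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[i * n + j, j * n + k, k * n + i] = 1    # term x_ij * y_jk * z_ki

game, basis, done = TensorGame(T, R_limit=8), np.eye(n * n), False
for i in range(n):
    for j in range(n):
        for k in range(n):
            _, reward, done = game.step(basis[i * n + j], basis[j * n + k], basis[k * n + i])
print("solved:", done and not game.S.any())           # True, in exactly 8 moves
```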

Applying TensorGame turns up thousands of viable algorithms for even well-studied Matrix Multiplication Problems like the 3x3x3 and 4x4x4 instances, demonstrating that the solution space is richer than previously thought.


Results

In all cases, AlphaTensor discovers algorithms that match or exceed existing State of the Art approaches.

Footnotes

  1. Alman, Josh and Vassilevska Williams, Virginia. "A Refined Laser Method and Faster Matrix Multiplication." October 13, 2020. ACM-SIAM.

  2. Roth/Martin. "Computer Organization and Design: Integer Arithmetic." University of Pennsylvania.

  3. Image from "Matrix multiplication algorithm." Wikipedia

  4. Coppersmith and Winograd. "Matrix multiplication via arithmetic progressions." January 1987. Association for Computing Machinery

  5. Alman, Josh. "Limits on the Universal Method for Matrix Multiplication." 2019. Theory of Computing, Vol. 17.

  6. Schönhage, Arnold; Strassen, Volker. "Schnelle Multiplikation großer Zahlen" [Fast multiplication of large numbers]. Computing. Vol. 7.

  7. Le Gall, François. "Powers of Tensors and Fast Matrix Multiplication." ISSAC 2014. https://arxiv.org/abs/1401.7714

  8. Alhussein Fawzi et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." October 5, 2022. Nature, vol. 610.