GPUs

120 min read

Introduction

Continuing with the other posts1 about ~how computers work at an ELI24 level, this post aims to understand GPU devices and how their architecture differs from CPUs (and why), as well as to offer some rudimentary lessons in writing software for these devices.

I'll begin with an architectural comparison against the maybe-more-familiar CPU:

First of all, we can see that a GPU consists of a larger number of cores and a shallower cache hierarchy. It's worth noting that CPU cores, in general, are much more capable than those present in GPU devices, and the presence of additional caches vastly reduces memory access latency. The GPU paradigm of parallelization is evidenced at the hardware design level by the presence of singular control units (purple) responsible for entire thread groups of cores, implying that each core in a thread group must execute the same instruction.

The differences in architectures serve different purposes. CPUs are designed for low-latency, low-throughput serial operations, whereas GPUs are optimized for high-latency, high-throughput parallel operations.

Sample Benchmarks

We can gain some intuition about the ramifications of these hardware differences via benchmarks on modern devices2,3:

| | CPU | GPU |
| --- | --- | --- |
| cores | 8 | 10,752 |
| L3 cache | 32 MB | N/A |
| L2 cache | 512 KB/core | 6 MB (shared) |
| L1 cache | 64 KB/core | 128 KB/SM† or 1 KB/core |

† per Streaming Multiprocessor (a row of connected cores)
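
If you'd like to pull the analogous numbers for your own card, most of them are exposed through the CUDA runtime's cudaGetDeviceProperties call. A minimal sketch (assuming device 0; cores-per-SM isn't in the properties struct, so it isn't printed here):

#include <cstdio>

int main() {
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0); // query device 0
	printf("SMs:                  %d\n", prop.multiProcessorCount);
	printf("L2 cache:             %d bytes\n", prop.l2CacheSize);
	printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
	printf("global memory:        %zu bytes\n", prop.totalGlobalMem);
	return 0;
}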

Parallel Operations

An example class of a parallel operation for which GPUs outperform CPUs is vector arithmetic:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} ? \\ ? \\ ? \\ ? \\ \vdots \\ ? \end{bmatrix}$$

An implementation of vector addition designed for execution on a CPU might look something like:

void add(int n, float* x, float* y, float* z) {
	for (int i = 0; i < n; i++)
		z[i] = x[i] + y[i];
}

This code iterates over every single element in each input vector x, y, which is $O(n)$. However, since each element in z is independent of every other element in z, we can improve on this approach by parallelizing the summation and computing each entry simultaneously. The GPU code (which we call a kernel) that achieves this looks like:

__global__ void add(int n, float* x, float* y, float* z) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		z[i] = x[i] + y[i];
}

Here, we use the __global__ keyword, which indicates to the compiler that this function should be executed on the GPU. The next bit of GPU-spice is the index variable i, which is computed from "thread local" variables that are contextually available in each instance of this function being run in parallel. The way CUDA achieves this is with the concept of thread blocks.4 When we launch our GPU kernel, we must specify the number of blocks to launch and how many threads each block contains – this pattern informs how we index into all of our data structures to be accessed by the GPU. For example, if we launch $n/2$ blocks each containing two threads, the resulting thread organization will resemble:

$$\begin{matrix} B_0 \begin{cases} T_0 \\ T_1 \end{cases}\\ B_1 \begin{cases} T_0 \\ T_1 \end{cases}\\ \vdots \\ B_{\frac{n}{2}} \begin{cases} T_0 \\ T_1 \end{cases} \end{matrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} ? \\ ? \\ ? \\ ? \\ \vdots \\ ? \end{bmatrix}$$

Each thread executes the same code and is now able to compute the appropriate vector index via its own block- and thread-index.

The last notable difference between the two implementations is the existence of a bounds check on the computed index i. This is necessary in the event that the dimension of our input vectors does not align with our block dimension, e.g. for an odd value of $n$, our block alignment would be:

$$\begin{matrix} B_0 \begin{cases} T_0 \\ T_1 \end{cases}\\ B_1 \begin{cases} T_0 \\ T_1 \end{cases}\\ \vdots \\ B_{\frac{n}{2}} \begin{cases} T_0 \\ \color{red}T_1\color{black} \end{cases} \end{matrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} ? \\ ? \\ ? \\ ? \\ \vdots \\ ? \end{bmatrix}$$

With a GPU kernel in hand, we still need to actually delegate execution from the CPU. This is a multistep process following the procedure:

  1. Allocate memory on the GPU
  2. Copy the data to the GPU (since it doesn't have direct access itself)
  3. Run the kernel
  4. Copy the results back to the CPU
  5. Free the memory

1. Allocate memory on the GPU

// we use the _d suffix to denote that this will be sent to the device (GPU)
float* a_d;
float* b_d;
float* c_d;
int N = 4096;
cudaMalloc((void**) &a_d, N * sizeof(float));
cudaMalloc((void**) &b_d, N * sizeof(float));
cudaMalloc((void**) &c_d, N * sizeof(float));
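
The host-side buffers a, b (and c for the results) referenced in the next step aren't shown in the snippets above; a minimal sketch to allocate and fill them with arbitrary test values might be:

// host-side counterparts to a_d, b_d, c_d (requires <stdlib.h>)
float* a = (float*) malloc(N * sizeof(float));
float* b = (float*) malloc(N * sizeof(float));
float* c = (float*) malloc(N * sizeof(float));
for (int i = 0; i < N; i++) {
	a[i] = 1.0f; // arbitrary values
	b[i] = 2.0f;
}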

2. Copy the data to the GPU

cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, N * sizeof(float), cudaMemcpyHostToDevice);

This function is nearly identical to its C standard library counterpart, with an added argument denoting the type or direction of the copy. Read more about it here.5

3. Run the kernel

To run the kernel, we specify how many blocks to launch and how many threads each block contains. CUDA exposes these configurables with some funky syntax,6 but it's basically just division.

As an example, a function declared as __global__ void Func(float* parameter); must be called like this: Func<<< Dg, Db, Ns >>>(parameter);

where Dg, Db, Ns are the dimensions of the grid, the dimensions of each block in the grid, and the number of bytes in shared memory per block (defaulted to 0).

int BLOCK_SIZE = 1024; // the maximum number of threads per block on current hardware
add<<<ceil(N / (float) BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d);

4. Copy the results back to the CPU

Same as step 2, we just reverse the direction:

cudaMemcpy(c, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);

5. Free the memory

Similar to standard memory management APIs, CUDA provides a method to free those device allocations once we're done with them: cudaFree.

cudaFree(a_d);
cudaFree(b_d);
cudaFree(c_d);
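
One practical aside: every CUDA runtime call above returns a cudaError_t which is easy to ignore silently. A common pattern (a sketch, not part of the original snippets) is to wrap calls in a checking macro:

#include <cstdio>
#include <cstdlib>

// abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                            \
	do {                                                            \
		cudaError_t err_ = (call);                                  \
		if (err_ != cudaSuccess) {                                  \
			fprintf(stderr, "CUDA error: %s (%s:%d)\n",             \
			        cudaGetErrorString(err_), __FILE__, __LINE__);  \
			exit(EXIT_FAILURE);                                     \
		}                                                           \
	} while (0)

// usage
CUDA_CHECK(cudaMalloc((void**) &a_d, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice));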

Comparison

We can observe how the two implementations compare on each respective processing unit, noting that the GPU pays a penalty up front for fewer, slower caches, but remains leveled off for far longer than the CPU, and also that for sufficiently large $n$, the CPU-based computation's runtime begins to grow steeply.

Comparing the ratio of execution times in $n$, this tradeoff becomes more evident:

Here we can see that even our naive GPU kernel is nearly 200x faster for larger inputs. We can drive this number even higher by reducing memory accesses per operation.

Kernel Grids

Returning to a fixed instance of our vector addition kernel operating on vectors of size 6:

$$\begin{matrix} B_0 \begin{cases} T_0 \\ T_1 \end{cases}\\ B_1 \begin{cases} T_0 \\ T_1 \end{cases}\\ B_2 \begin{cases} T_0 \\ T_1 \end{cases} \end{matrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} + \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} x_0 + y_0 \\ x_1 + y_1 \\ x_2 + y_2 \\ x_3 + y_3 \\ x_4 + y_4 \\ x_5 + y_5 \end{bmatrix}$$

The code to delegate execution of our kernel to the GPU would resemble:

int N = 6;
int BLOCK_SIZE = 2;
add<<<ceil(N / (float) BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d);

and our resulting kernel grid would be:

Note that the blockDim along the x dimension is constant for all blocks, so the only values which change within our kernel code from earlier are the block index (valued 0, 1, or 2) and the block's thread index (valued 0 or 1). We can observe then that our computed index is a bijection onto the indices of our input vectors:

$$i : B \times T \rightarrow \{0, 1, 2, 3, 4, 5\}$$

__global__ void add(int n, float* x, float* y, float* z) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		z[i] = x[i] + y[i];
}

e.g. the computed index of the first thread in the second block would correspond to

int i = 1 * 2 + 0;

computing:

$$\begin{matrix} B_0 \begin{cases} T_0 \\ T_1 \end{cases}\\ \colorbox{yellow}{$B_1$} \begin{cases} \colorbox{yellow}{$T_0$} \\ T_1 \end{cases}\\ B_2 \begin{cases} T_0 \\ T_1 \end{cases} \end{matrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} + \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} x_0 + y_0 \\ x_1 + y_1 \\ \colorbox{yellow}{$x_2 + y_2$} \\ x_3 + y_3 \\ x_4 + y_4 \\ x_5 + y_5 \end{bmatrix}$$

Our example kernel currently only references the .x dimension of our thread- and block-index variables, but CUDA allows us to configure multidimensional kernel grids. In some scenarios, we might prefer a two or even three dimensional kernel grid:

For multidimensional ($n > 1$) kernel grids, we specify the kernel parameters via dim3 types like so:

dim3 dimGrid(3,3,3);
dim3 dimBlock(3,3,3);
add<<<dimGrid, dimBlock>>>(N, a_d, b_d, c_d);

This abstraction is mainly syntactic sugar for index-readability. Since arrays of arbitrary dimension are still stored as flattened, contiguous chunks of memory, we still need to make use of the dimension variables in order to compute our row/column/... indices. Indexing into a two-dimensional array via dimension variables is given by the formula:

$$i = y \times \text{width} + x$$

A naive (oh, so terribly, terribly naive7) matrix multiplication kernel on a 2-dimensional kernel grid might then resemble:

__global__ void matmul(int n, float* a, float* b, float* c) {
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	if (row < n && col < n) {
		float dot_prod = 0.f;
		for (int i =0; i < n; i++) {
			dot_prod += a[row*n + i] * b[i*n + col];
		}
		c[row * n + col] = dot_prod;
	}
}

We could similarly achieve the same result with a 1-dimensional kernel grid like so:

__global__ void matmul(int n, float* a, float* b, float* c) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int row = idx / n;
	int col = idx % n;
	if (row < n && col < n) {
		float dot_prod = 0.f;
		for (int i =0; i < n; i++) {
			dot_prod += a[row*n + i] * b[i*n + col];
		}
		c[row * n + col] = dot_prod;
	}
}

This adds a negligible amount of overhead (an extra index register plus a division and a modulo per thread), as well as a bit more cognitive overhead, hence the preference for leveraging kernel grid syntactic sugar wherever possible.

This formula generalizes to higher dimensions as well like so:

$$i = z \times \text{width} \times \text{height} + y \times \text{width} + x$$
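
As a quick sanity check of the flattened-index formula, here's a small sketch (the 2×3×4 dimensions are arbitrary) confirming that it points at the same element as nested indexing:

// requires <assert.h>
int width = 4, height = 3, depth = 2;
float a[2][3][4]; // a[z][y][x], stored contiguously in memory

for (int z = 0; z < depth; z++)
	for (int y = 0; y < height; y++)
		for (int x = 0; x < width; x++) {
			int i = z * width * height + y * width + x;
			// the flattened offset and the nested index refer to the same element
			assert(&a[0][0][0] + i == &a[z][y][x]);
		}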

Practical GPU usage: writing kernels for neural nets

Feed Forward

The feed forward step for an individual neuron $y_k$ boils down to computing the dot product of an input vector $\vec{x}$ and the network weights $\vec{w}$, plus some scalar bias $b_k$ associated with the neuron:

$$\begin{aligned} y_k &= \vec{x} \cdot \vec{w} + b_k\\ &= [x_0 \; x_1 \; x_2 \; \cdots \; x_n ] \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} + b_k \end{aligned}$$

We can generalize the computations for all neurons by treating both the inputs and the weights as matrices, and the bias values as a vector:

$$\begin{aligned} y &= X W + \vec{b} \\ &= \begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,m} \\ x_{1,0} & x_{1,1} & \cdots & x_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{b,0} & x_{b,1} & \cdots & x_{b,m} \end{bmatrix} \begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,0} & w_{m,1} & \cdots & w_{m,n} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \end{aligned}$$

The kernel for this computation is similar to the matrix multiplication kernel, except our matrices are no longer uniform/square, so we need to take additional params to describe their dimensions:

__global__ void forward(
	int batch_sz, int n, int out_w,
	float* inputs, float* weights, float* biases, float* outputs
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < batch_sz && col < out_w) {
		// initialize the outputs to the corresponding biases
		outputs[row * out_w + col] = biases[col];
		for (int i =0; i < n; i++) {
			outputs[row * out_w + col] +=
				weights[i * out_w + col] * inputs[row * n + i];
		}
	}
}
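
To launch this kernel we'd cover the output matrix (batch_sz rows by out_w columns) with a two-dimensional grid. A sketch of the launch, assuming a 16×16 block size and device pointers named inputs_d, weights_d, biases_d, outputs_d:

dim3 dimBlock(16, 16);
dim3 dimGrid(
	(out_w + dimBlock.x - 1) / dimBlock.x,    // columns of the output
	(batch_sz + dimBlock.y - 1) / dimBlock.y  // rows (batch entries) of the output
);
forward<<<dimGrid, dimBlock>>>(batch_sz, n, out_w, inputs_d, weights_d, biases_d, outputs_d);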

Activation

In order to prevent the network from collapsing to a single linear layer, we introduce a non-linearity such as ReLU:8

$$\text{ReLU}(x) = \begin{cases} x \; \text{ if } x > 0 \\ 0 \; \text{ o.w.} \end{cases}$$

The $\text{ReLU}$ kernel is pretty straightforward:

__global__ void relu(int w, int h, float* inputs, float* outputs) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w) {
		float activation = inputs[row * w + col];
		outputs[row * w + col] = activation > 0.f ? activation : 0.f;
	}
}

We apply this activation to every layer in the network barring the final output, which represents e.g. the probability that a given input corresponds to a given output.

Clamping

We need to translate a $\text{ReLU}$'d activation into a probability, using a function such as $\text{softmax}$:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum^K_{j=1}e^{x_j}}$$

The caveat of using an exponential translation like $\text{softmax}$ is that, for many inputs, we run the risk of numerical overflow, since we're exponentiating (and summing) potentially-huge numbers. The workaround is to subtract the largest element of the input vector $\vec{x}$ that we're applying the $\text{softmax}$ to from the exponent of each element within:

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(\vec{x})}}{\sum^K_{j=1}e^{x_j - \max(\vec{x})}}$$

The corresponding kernel (hopefully starting to look like a familiar pattern) is:

__global__ void softmax(int w, int h, float* inputs, float* outputs) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w) {
		// find the maximum value in this row
		float max_val = inputs[row * w];
		for (int i = 1; i < w; i++) {
			max_val = max(max_val, inputs[row * w + i]);
		}

		float divisor = 1e-9f; // prevent div by zero
		for (int i = 0; i < w; i++) {
			divisor += exp(inputs[row * w + i] - max_val);
		}
		outputs[row * w + col] =
			exp(inputs[row * w + col] - max_val) / divisor;
	}
}

Loss

The final function we might be interested in for a feed forward pass would be a loss function to measure the network's performance. A cross-entropy loss9 function takes two vectors and iterates over all $n$ possible output values, comparing the true distribution $\vec{p}$ to the model's predicted probabilities $\vec{q}$:

$$\mathcal{L}(\vec{p}, \vec{q}) = -\sum^n_{x=0} p(x) \log q(x)$$

The lower our predicted probability is for a target score, the higher the loss:

Our loss kernel can be unidimensional since our output is a single dimension:

__global__ void cross_entropy_loss(
	int w, int h, float* preds, float* reals, float* outputs
) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < h) {
		float loss = 0.f;
		for (int i = 0; i < w; i++) {
			loss -= reals[idx * w + i] * log(max(1e-6, preds[idx * w + i]));
		}
		outputs[idx] = loss;
	}
}

Backpropagation

Backpropagation is the process of making incremental updates to the initially-random weights and biases of a neural network, based on the partial derivative of the loss with respect to the weight or bias of each neuron:

$$\begin{aligned} b_k &\leftarrow b_k - \frac{\partial \mathcal L}{\partial b_k} \\ w_k &\leftarrow w_k - \frac{\partial \mathcal L}{\partial w_k} \end{aligned}$$

We can compute the partials at each layer by leveraging the chain rule: starting with the derivative of the loss with respect to the activations connecting the last layer to the output, and working our way backwards, multiplying each interior partial by the partials already computed for the layer in front of it:

$$\frac{\partial \mathcal L}{\partial x^n} = \frac{\partial \mathcal L}{\partial a^n}\frac{\partial a^n}{\partial x^n}$$

For our $\text{softmax}$ activation and cross-entropy loss function $\mathcal L$, we can compute the partials with respect to the inputs:

$$\begin{aligned} &s = \text{softmax} \\ &\begin{bmatrix} \frac{\partial s_1}{\partial x_1} & \frac{\partial s_1}{\partial x_2} & \cdots & \frac{\partial s_1}{\partial x_K} \\ \frac{\partial s_2}{\partial x_1} & \frac{\partial s_2}{\partial x_2} & \cdots & \frac{\partial s_2}{\partial x_K} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial s_K}{\partial x_1} & \frac{\partial s_K}{\partial x_2} & \cdots & \frac{\partial s_K}{\partial x_K} \end{bmatrix} \end{aligned}$$

Since each element of the softmax depends on every other element in the input, the derivative will be a matrix. To simplify all the constituent partials, we make use of properties of logarithms to rearrange the inner expressions like so:

$$\begin{aligned} \frac{\partial \log s_i}{\partial x_i} &= \frac{1}{s_i} \frac{\partial s_i}{\partial x_i}\\ s_i \cdot \frac{\partial \log s_i}{\partial x_i} &= \frac{\partial s_i}{\partial x_i} \end{aligned}$$

This lets us express the derivative of a $\text{softmax}$ element as that element times the partial of its logarithm. This is useful since it also lets us remove the division operator as well as one of the exponentiations:

$$\begin{aligned} \log s_i &= \log \frac{e^{x_i}}{\sum^K_{j=0}e^{x_j}} \\ &= \log e^{x_i} - \log \sum^K_{j=0} e^{x_j} \\ &= x_i - \log \sum^K_{j=0} e^{x_j} \end{aligned}$$

Now, taking the partial derivative:

$$\frac{\partial \log s_i}{\partial x_k} = \frac{\partial x_i}{\partial x_k} - \frac{\partial \log \sum^K_{j=0} e^{x_j}}{\partial x_k}$$

The first term simplifies to 1 if the element index matches the derivative index, otherwise it disappears:

$$\frac{\partial x_i}{\partial x_k} = \begin{cases} 1 \; \text{ if } k = i \\ 0 \; \text{ if } k \neq i \end{cases}$$

As for the more complicated second term, we can again apply the chain rule to simplify the computation for ourselves:

$$\begin{aligned} \frac{\partial \log \sum^K_{j=0} e^{x_j}}{\partial x_k} &= \frac{1}{\sum^K_{j=0} e^{x_j}}\frac{\partial \sum^K_{j=0} e^{x_j}}{\partial x_k} \\ &= \frac{1}{\sum^K_{j=0} e^{x_j}} \sum^K_{j=0} \frac{\partial e^{x_j}}{\partial x_k} \end{aligned}$$

The term inside the summation depends on $x_k$ for only a single element, $e^{x_k}$; the rest are constants whose derivatives vanish, so we can further simplify:

$$= \frac{e^{x_k}}{\sum^K_{j=0} e^{x_j}} = s_k$$

So the complete equation for the partial derivative of $s_i$ becomes:

$$\frac{\partial s_i}{\partial x_k} = s_i\cdot \frac{\partial \log s_i}{\partial x_k} = s_i\left(\frac{\partial x_i}{\partial x_k} - s_k\right)$$

and we can plug this back into our matrix of softmax back propagation updates:

$$\begin{aligned} &s = \text{softmax} \\ &\begin{bmatrix} s_1(1 - s_1) & s_1(0 - s_2) & \cdots & s_1(0 - s_n) \\ s_2(0 - s_1) & s_2(1 - s_2) & \cdots & s_2(0 - s_n) \\ \vdots & \vdots & \ddots & \vdots \\ s_n(0 - s_1) & s_n(0 - s_2) & \cdots & s_n(1 - s_n) \end{bmatrix} \end{aligned}$$

Now for the derivatives of our loss function w.r.t. the activations:

$$\begin{aligned} &s = \text{softmax} & \mathcal{L}(\vec{p}, \vec{s}) = -\sum^K_{i=0} p_i \log s_i\\ &\begin{bmatrix} s_1(1 - s_1) & s_1(0 - s_2) & \cdots & s_1(0 - s_n) \\ s_2(0 - s_1) & s_2(1 - s_2) & \cdots & s_2(0 - s_n) \\ \vdots & \vdots & \ddots & \vdots \\ s_n(0 - s_1) & s_n(0 - s_2) & \cdots & s_n(1 - s_n) \end{bmatrix} &\begin{bmatrix} \frac{\partial \mathcal L}{\partial s_1} \\ \frac{\partial \mathcal L}{\partial s_2} \\ \vdots \\ \frac{\partial \mathcal L}{\partial s_n} \end{bmatrix} \end{aligned}$$

which is a bit simpler:

$$\begin{aligned} &s = \text{softmax} & \mathcal{L}(\vec{p}, \vec{s}) = -\sum^K_{i=0} p_i \log s_i\\ &\begin{bmatrix} s_1(1 - s_1) & s_1(0 - s_2) & \cdots & s_1(0 - s_n) \\ s_2(0 - s_1) & s_2(1 - s_2) & \cdots & s_2(0 - s_n) \\ \vdots & \vdots & \ddots & \vdots \\ s_n(0 - s_1) & s_n(0 - s_2) & \cdots & s_n(1 - s_n) \end{bmatrix} &\begin{bmatrix} \frac{-p_1}{s_1} \\ \frac{-p_2}{s_2} \\ \vdots \\ \frac{-p_n}{s_n} \end{bmatrix} \end{aligned}$$

Multiplying the two collections, we obtain the derivative of our loss w.r.t. the inputs:

$$\begin{bmatrix} s_1(1 - s_1) & s_1(0 - s_2) & \cdots & s_1(0 - s_n) \\ s_2(0 - s_1) & s_2(1 - s_2) & \cdots & s_2(0 - s_n) \\ \vdots & \vdots & \ddots & \vdots \\ s_n(0 - s_1) & s_n(0 - s_2) & \cdots & s_n(1 - s_n) \end{bmatrix} \begin{bmatrix} \frac{-p_1}{s_1} \\ \frac{-p_2}{s_2} \\ \vdots \\ \frac{-p_n}{s_n} \end{bmatrix} = \begin{bmatrix} -(p_1 - s_1(p_1 + p_2 + \cdots + p_n)) \\ -(p_2 - s_2(p_1 + p_2 + \cdots + p_n)) \\ \vdots\\ -(p_n - s_n(p_1 + p_2 + \cdots + p_n)) \end{bmatrix}$$

and since $\vec{p}$ is a probability distribution, the sum over all elements $(p_1 + p_2 + \cdots + p_n)$ will be 1, so we can simplify the resultant expression:

$$\begin{aligned} \begin{bmatrix} s_1(1 - s_1) & s_1(0 - s_2) & \cdots & s_1(0 - s_n) \\ s_2(0 - s_1) & s_2(1 - s_2) & \cdots & s_2(0 - s_n) \\ \vdots & \vdots & \ddots & \vdots \\ s_n(0 - s_1) & s_n(0 - s_2) & \cdots & s_n(1 - s_n) \end{bmatrix} \begin{bmatrix} \frac{-p_1}{s_1} \\ \frac{-p_2}{s_2} \\ \vdots \\ \frac{-p_n}{s_n} \end{bmatrix} &= \begin{bmatrix} -(p_1 - s_1) \\ -(p_2 - s_2) \\ \vdots\\ -(p_n - s_n) \end{bmatrix} \\ &= \begin{bmatrix} s_1 - p_1 \\ s_2 - p_2 \\ \vdots\\ s_n - p_n \end{bmatrix} \end{aligned}$$

so our loss w.r.t. inputs is just the vector difference between our softmax'd outputs and the true distribution. Accordingly, the kernel to compute this is much friendlier than all the algebra above:

__global__ void cross_entropy_backwards(
	int w, int h, float* preds, float* reals, float* outputs
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w)
		outputs[row * w + col] = preds[row * w + col] - reals[row * w + col];
}

With the derivative of our loss w.r.t. the inputs from the output layer, we can compute each preceding derivative as well via the chain rule, multiplying the expression by the partial derivative of the inputs with respect to the activations of the preceding layer:

$$\frac{\partial \mathcal L}{\partial x^n} = \frac{\partial \mathcal L}{\partial a^n}\frac{\partial a^n}{\partial x^n} \implies \frac{\partial \mathcal L}{\partial a^{n-1}} = \frac{\partial \mathcal L}{\partial a^n}\frac{\partial a^n}{\partial x^n}\frac{\partial x^n}{\partial a^{n-1}}$$

Recalling that our equation for the inputs to a layer is the matrix multiplication of the previous layer's activations and the weights, plus the biases:

$$\begin{aligned} y &= X W + \vec{b} \\ &= \begin{bmatrix} x_{0,0} & x_{0,1} & \cdots & x_{0,m} \\ x_{1,0} & x_{1,1} & \cdots & x_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{b,0} & x_{b,1} & \cdots & x_{b,m} \end{bmatrix} \begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,0} & w_{m,1} & \cdots & w_{m,n} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \\ x^n &= A^{n-1}W^n + \vec{b}^n \end{aligned}$$

Now here I've made an egregious mess of notation – the superscripts are indices, not exponents. Taking the derivative of this expression yields just the weights of that layer:

$$\frac{\partial x^n}{\partial a^{n-1}} = \frac{\partial (A^{n-1}W^n + \vec{b}^n)}{\partial A^{n-1}} = W^n$$

The corresponding kernel then just needs to perform matrix multiplication between the weights and the error from the next layer:

__global__ void backward(
	int batch_size, int n, int out_w,
	float* weights, float* biases, float* d_loss, float* out_d_loss
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < batch_size && col < out_w) {
		float dloss = 0.f;
		for (int i = 0; i < n; i++) {
			float w = weights[i * out_w + col];
			dloss += w * d_loss[row * n + i];
		}
		out_d_loss[row * out_w + col] = dloss;
	}
}

The derivative of our chosen non-linearity is straightforward:

$$\frac{\partial \text{ReLU}(x)}{\partial x} = \begin{cases} 1 \; \text{ if } x > 0 \\ 0 \; \text{ if } x \leq 0 \end{cases}$$

To apply backpropagation, we again apply the chain rule: multiplying by the partial derivative of our loss function w.r.t. the previous activation:

$$\frac{\partial \mathcal L}{\partial a} \frac{\partial \text{ReLU}(x)}{\partial x} = \begin{cases} \frac{\partial \mathcal L}{\partial a} \; \text{ if } x > 0 \\ 0 \; \text{ if } x \leq 0 \end{cases}$$

in kernel form:

__global__ void relu_backwards(
	int w, int h, int ns, float* a, float* d_loss, float* b
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w) {
		float activation = a[row * w + col];
		b[row * w + col] = activation > 0.f ? d_loss[row * w + col] : 0.f;
	}
}

Loss of weights and biases

To compute the error of our weights and biases, we take the partial derivative with respect to the desired quantity:

$$\begin{aligned} x^n &= A^{n-1}W^n + \vec{b}^n \implies \\ \frac{\partial x^n}{\partial w^n} &= \frac{\partial (A^{n-1}W^n + \vec{b}^n)}{\partial w^n} = A^{n-1} \\ &\implies \frac{\partial \mathcal L}{\partial w^n} = \frac{\partial \mathcal L}{\partial x^n} \frac{\partial x^n}{\partial w^n} = A^{n-1}\frac{\partial \mathcal L}{\partial x^n} \\ \frac{\partial x^n}{\partial b^n} &= \frac{\partial (A^{n-1}W^n + \vec{b}^n)}{\partial b^n} = 1 \\ &\implies \frac{\partial \mathcal L}{\partial b^n} = \frac{\partial \mathcal L}{\partial x^n} \frac{\partial x^n}{\partial b^n} = \frac{\partial \mathcal L}{\partial x^n} \end{aligned}$$

To update these values, we augment the existing weight or bias according to the derivative times some learning rate divided by our batch size:

$$\begin{aligned} w &\leftarrow w - \frac{\eta}{bs}\frac{\partial \mathcal L}{\partial w^n} \\ b &\leftarrow b - \frac{\eta}{bs}\frac{\partial \mathcal L}{\partial b^n} \end{aligned}$$

and the last kernel to perform this gradient descent:

__global__ void update_layer(
	int w, int h, int batch_size, float lr, // lr = the learning rate η above
	float* weights, float* biases, float* activations, float* d_loss
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w) {
		float d_w = 0.f;
		float d_b = 0.f;
		// accumulate the gradients over the batch
		for (int i = 0; i < batch_size; i++) {
			float activation = activations[i * h + row];
			float d_l = d_loss[i * w + col];
			d_w += activation * d_l;
			d_b += d_l;
		}
		weights[row * w + col] -= lr * d_w / batch_size;
		biases[col] -= lr * d_b / batch_size;
	}
}

Measuring GPU Kernel Performance

When we run code designed for the GPU, there are three primary devices that we're concerned with:

  • the GPU itself
  • the GPU's high bandwidth memory (DRAM)
  • the CPU responsible for scheduling kernels and copying data to/from the HBM

Each of these devices has a corresponding penalty, referred to as compute time, memory access latency, and overhead, respectively.

Continuing with the arbitrary example of training a network on the MNIST handwritten digit dataset, suppose we profiled our application and found that it takes 2s to load the data and 9.5s to train our neural net for 10 epochs:

We might be able to bring down the CPU overhead by using more sophisticated methods of optimized string parsing to ingest the MNIST csv data,10 bringing its overall time down to 1s, but we're still left with ~1s epochs:

The next step might be to observe that we don't need all of the data during the first epoch, just the data required for the first batch, so we can run these two tasks nearly simultaneously:

However, notice that this does not eliminate the overhead time, so even if we were (hypothetically *cough* *cough*) able to optimize our GPU-based training computations, our first epoch would still be bottlenecked by the time required to load the first batch of training data:

CUDA API

A brief note about the CUDA API, which helps us understand precisely how we're able to parallelize these first two tasks. By design, CUDA is not a synchronous API, so when we launch a kernel with the method<<< ... >>>(params) syntax, the CPU does not await the results of that task. Calling additional kernels adds them to a queue of work which the CUDA driver schedules onto the GPU for us. Making a synchronizing call (such as the default cudaMemcpy) after kernels have been queued does wait for the scheduled kernels to complete before executing.
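
One consequence of this asynchrony: wrapping a kernel launch in a CPU timer measures only the launch overhead, not the kernel itself. A sketch of timing the earlier add kernel with CUDA events (the launch arguments are assumed from the vector addition example):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<ceil(N / (float) BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d);
cudaEventRecord(stop);

cudaEventSynchronize(stop); // block the CPU until the kernel and stop event complete
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);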

E.g. using the neural net methods implemented above, our training loop might look something like this:

load_mnist();

for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
	for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		softmax<<<dimGrid, dimBlock>>>(...);
		cross_entropy_loss<<<dimGrid, dimBlock>>>(...);

		cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
		compute_accuracy(...);

		cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);

		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);
	}
}

So, in order to parallelize the training and loading tasks, we can first schedule all of our kernels up front:

load_mnist();

for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
	for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		softmax<<<dimGrid, dimBlock>>>(...);
		cross_entropy_loss<<<dimGrid, dimBlock>>>(...);

		cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);

		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);

		cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
		compute_accuracy(...);
	}
}

Then, instead of loading the full data set up front, we can define a batch loader to be called once before scheduling our kernels, and again before our synchronization point:

load_next_batch();

for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
	for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		relu<<<dimGrid, dimBlock>>>(...);
		forward<<<dimGrid, dimBlock>>>(...);
		softmax<<<dimGrid, dimBlock>>>(...);
		cross_entropy_loss<<<dimGrid, dimBlock>>>(...);

		cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);
		backward<<<dimGrid, dimBlock>>>(...);
		relu_backwards<<<dimGrid, dimBlock>>>(...);

		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);
		update_layer<<<dimGrid, dimBlock>>>(...);

		if (epoch == 0) load_next_batch();

		cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
		compute_accuracy(...);
	}
}

GPU perf

When evaluating and improving code targeting the GPU, we're interested in the two other metrics besides CPU overhead. Our GPU compute time is bound by the throughput of the datatype being used on the device, typically measured in floating point operations per second (FLOPS). The 3090 Ti3 we've been using as a benchmark has a theoretical limit of 40 TFLOPS on 32-bit floating point numbers, and a memory bandwidth of 1.01 TB/s. We can therefore perform computations much faster than we can actually load data from memory, so we might introduce a third meta-metric of computational intensity to measure how many floating point operations we execute per byte loaded from memory.
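
Dividing those two figures gives a rough sense of the break-even point for this card:

$$\frac{40 \text{ TFLOPS}}{1.01 \text{ TB/s}} \approx 40 \text{ FLOPs per byte}$$

i.e. a kernel needs on the order of 40 floating point operations per byte of memory traffic before compute, rather than memory, becomes the bottleneck.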

To find the optimal performance ratio, we can plot the theoretical FLOPs against computational intensity to find the desired intersection:

If we're on the right side, we're compute bound: the GPU is working at full power. This is good. Here, we gain performance improvements by using datatypes with higher throughput, optimizing our algorithms, or using more expensive/sophisticated hardware such as tensor cores for a higher throughput limit.

The left side is bad: the GPU is being underworked, waiting for data from memory. To move from left to right, we can use different kinds of memory, or do some kernel fusion.

Kernel fusion

Consider one of the numerous bottlenecks in our relatively naive CUDA implementation of a neural network:

__global__ void relu(int w, int h, float* inputs, float* outputs) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < h && col < w) {
		float activation = inputs[row * w + col];
		outputs[row * w + col] = activation > 0.f ? activation : 0.f;
	}
}

This function has two memory accesses (reading a float from inputs and writing a float to outputs), both of size 4 bytes. We're also only doing two operations: the comparison and the assignment of activation. This gives us a computational intensity of $\frac{2 \text{ FLOPs}}{8 \text{ B}}$ (not very intense imo).

Since our relu function is exclusively used after invoking the forward function, we can fuse these two functions together, completely eliminating the need for the memory accesses and operations added by relu:

__global__ void forward_relu(
	int batch_sz, int n, int out_w,
	float* inputs, float* weights, float* biases, float* outputs
) {
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	if (row < batch_sz && col < out_w) {
		float out = biases[col]; // accumulate the dot product in a register
		for (int i =0; i < n; i++) {
			out += weights[i * out_w + col] * inputs[row * n + i];
		}
		outputs[row * out_w + col] = out > 0.f ? out : 0.f;
	}
}

We can similarly fuse the backwards pass. This achieves the "hypothetical" improvements alluded to earlier, reducing our total execution time by a factor of 2. At the moment, the rinky-dink example is hardly utilizing the GPU. Since our data and network are very small, the kernels aren't even filling all possible threads. If we increase the batch_size from 16 to 64, we can achieve an even higher rate of parallelization, bringing our per-epoch training time down from ~half a second to a tenth of a second, yielding a 5x improvement over the initial implementation.

GPU Memory Hierarchy

With respect to programming for a GPU with CUDA, there are two kinds of memory that we're concerned with: on- and off-chip. There are some segments of memory connected to the actual GPU chip, as well as smaller amounts of memory within the chip itself. Each thread within a block is able to access the global VRAM which resides off-chip:

Memory allocated via e.g. cudaMalloc(...) or statically initialized globally (with the __device__ keyword) lives in global memory, which is the largest and slowest memory pool the GPU can access. Each thread also has access to a local set of on-chip registers. Variables created within our kernels are stored in these registers. Of course, if we use too many registers within our kernels (more than ~255 per thread for modern commercial hardware), values get spilled from the registers to "local memory", which actually resides off-chip ("local" indicates that the variables are local to that thread).

We also have a small (on the order of 64 KB) section of VRAM called "constant memory" which is off-chip, cached, and read-only. Accesses to different addresses within this special segment are serialized, but if we access the same address from multiple threads, we can achieve faster reads than if we made identical accesses to global memory.

To use constant memory, we use the __constant__ keyword when declaring a variable, and then use the cudaMemcpyToSymbol function to move data from the CPU to constant memory.
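
A minimal sketch of that pattern (the symbol name and size here are arbitrary):

// read-only coefficients resident in constant memory (declared at file scope)
__constant__ float coeffs_c[64];

// host side: copy into the constant symbol rather than through a device pointer
float coeffs_h[64] = { /* ... */ };
cudaMemcpyToSymbol(coeffs_c, coeffs_h, 64 * sizeof(float));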

Finally, we have shared memory which is block-local memory:

Shared memory is on-chip, and therefore preferable to local, constant, and global memory, especially when the data to be stored will be used by multiple threads. We specify __shared__ when declaring kernel variables in order to store them in this region.

| kind | on-/off-chip | access | scope | lifetime |
| --- | --- | --- | --- | --- |
| registers | on | r/w | thread | thread |
| shared | on | r/w | block | block |
| local | off | r/w | thread | thread |
| global | off | r/w | global | dynamic |
| constant | off | r | global | dynamic |

Optimizing w.r.t. memory layout

With deeper knowledge of the memory hierarchy on our GPU, we can return to one of our previous kernels to see how we might take advantage of the layout in order to milk out some performance gains.

Recall the naive matrix multiplication kernel from earlier:

__global__ void matmul(int n, float* a, float* b, float* c) {
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	if (row < n && col < n) {
		float dot_prod = 0.f;
		for (int i =0; i < n; i++) {
			dot_prod += a[row*n + i] * b[i*n + col];
		}
		c[row * n + col] = dot_prod;
	}
}

For two $n \times n$ square matrices, this code will access each row of a $n$ times and each column of b $n$ times, storing the dot product between corresponding elements in the thread's registers.

$$\begin{aligned} & &\begin{bmatrix} \colorbox{yellow}{$b_{0,0}$} & b_{0,1} & b_{0,2} & b_{0,3} \\ \colorbox{yellow}{$b_{1,0}$} & b_{1,1} & b_{1,2} & b_{1,3} \\ \colorbox{yellow}{$b_{2,0}$} & b_{2,1} & b_{2,2} & b_{2,3} \\ \colorbox{yellow}{$b_{3,0}$} & b_{3,1} & b_{3,2} & b_{3,3} \end{bmatrix} \\ \\ &\begin{bmatrix} \colorbox{yellow}{$a_{0,0}$} & \colorbox{yellow}{$a_{0,1}$} & \colorbox{yellow}{$a_{0,2}$} & \colorbox{yellow}{$a_{0,3}$} \\ a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} &\begin{bmatrix} \colorbox{yellow}{$c_{0,0}$} & c_{0,1} & c_{0,2} & c_{0,3} \\ c_{1,0} & c_{1,1} & c_{1,2} & c_{1,3} \\ c_{2,0} & c_{2,1} & c_{2,2} & c_{2,3} \\ c_{3,0} & c_{3,1} & c_{3,2} & c_{3,3} \end{bmatrix} \end{aligned}$$

This is bad! To reduce the number of reads from global memory, we can utilize shared memory by splitting our input matrices into block-sized tiles:

$$\begin{aligned} & &\begin{bmatrix} \colorbox{lavender}{$b_{0,0}$} & \colorbox{lavender}{$b_{0,1}$} & b_{0,2} & b_{0,3} \\ \colorbox{lavender}{$b_{1,0}$} & \colorbox{lavender}{$b_{1,1}$} & b_{1,2} & b_{1,3} \\ b_{2,0} & b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,0} & b_{3,1} & b_{3,2} & b_{3,3} \end{bmatrix} \\ \\ &\begin{bmatrix} \colorbox{lavender}{$a_{0,0}$} & \colorbox{lavender}{$a_{0,1}$} & a_{0,2} & a_{0,3} \\ \colorbox{lavender}{$a_{1,0}$} & \colorbox{lavender}{$a_{1,1}$} & a_{1,2} & a_{1,3} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} &\begin{bmatrix} \colorbox{lavender}{$c'_{0,0}$} & \colorbox{lavender}{$c'_{0,1}$} & c_{0,2} & c_{0,3} \\ \colorbox{lavender}{$c'_{1,0}$} & \colorbox{lavender}{$c'_{1,1}$} & c_{1,2} & c_{1,3} \\ c_{2,0} & c_{2,1} & c_{2,2} & c_{2,3} \\ c_{3,0} & c_{3,1} & c_{3,2} & c_{3,3} \end{bmatrix} \end{aligned}$$

Each thread in the block now reads the corresponding values from each tile into shared memory, and proceeds to compute the partial dot product over that tile. We then shift our tiles on a and b to finish the partial dot products for those rows and columns (denoted $c'_{i,j}$):

$$\begin{aligned} & &\begin{bmatrix} b_{0,0} & b_{0,1} & b_{0,2} & b_{0,3} \\ b_{1,0} & b_{1,1} & b_{1,2} & b_{1,3} \\ \colorbox{lavender}{$b_{2,0}$} & \colorbox{lavender}{$b_{2,1}$} & b_{2,2} & b_{2,3} \\ \colorbox{lavender}{$b_{3,0}$} & \colorbox{lavender}{$b_{3,1}$} & b_{3,2} & b_{3,3} \end{bmatrix} \\ \\ &\begin{bmatrix} a_{0,0} & a_{0,1} & \colorbox{lavender}{$a_{0,2}$} & \colorbox{lavender}{$a_{0,3}$} \\ a_{1,0} & a_{1,1} & \colorbox{lavender}{$a_{1,2}$} & \colorbox{lavender}{$a_{1,3}$} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} &\begin{bmatrix} \colorbox{lavender}{$c_{0,0}$} & \colorbox{lavender}{$c_{0,1}$} & c_{0,2} & c_{0,3} \\ \colorbox{lavender}{$c_{1,0}$} & \colorbox{lavender}{$c_{1,1}$} & c_{1,2} & c_{1,3} \\ c_{2,0} & c_{2,1} & c_{2,2} & c_{2,3} \\ c_{3,0} & c_{3,1} & c_{3,2} & c_{3,3} \end{bmatrix} \end{aligned}$$

We repeat this process of tiling till we've covered the whole matrix. This vastly reduces the number of reads from global memory and improves our kernel's performance. The code is a bit psychopathic, but it looks like this:

__shared__ float a_tile[TILE_WIDTH][TILE_WIDTH];
__shared__ float b_tile[TILE_WIDTH][TILE_WIDTH];

int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
int tx = threadIdx.x;
int ty = threadIdx.y;
float dot_prod = 0.f;

for (int tile_offset = 0; tile_offset < n; tile_offset += TILE_WIDTH) {
	int a_chk = tile_offset + tx < n && row < n;
	a_tile[ty][tx] = a_chk ? a[row * n + tile_offset + tx] : 0.f;

	int b_chk = tile_offset + ty < n && col < n;
	b_tile[ty][tx] = b_chk ? b[(tile_offset + ty) * n + col] : 0.f;

	// wait for all threads to complete the above boundary checks for tiles
	__syncthreads();
	for (int i = 0; i < TILE_WIDTH; i++) {
		dot_prod += a_tile[ty][i] * b_tile[i][tx];
	}
	// wait for all threads to compute their partial `dot_prod`
	__syncthreads();
}

// write the accumulated dot product to the output matrix
if (row < n && col < n)
	c[row * n + col] = dot_prod;