
Why a separate post to introduce the library?

In the previous post in the BLAS series, we looked at how NumPy significantly outperformed a naive IJK-ordered three-loop C++ implementation on the same matrix sizes. That post was an announcement for both the blog series and the library we will implement as the series progresses. Over the past two weeks, I spent some time setting up the library and plugging our naive C++ implementation into the codebase.

However, the effort has produced a codebase large enough to deserve its own introductory post, so that you do not feel lost when I use code from the library to demonstrate the concepts we will cover in upcoming posts.

In its entirety, the library currently consists of a naive IJK-ordered IGEMM/SGEMM implementation exposed through a C API but implemented internally in C++, so that we can use templates to write reusable code. I have deliberately avoided classes and other OOP constructs to keep runtime overhead down, and the API is exposed in C to maintain a stable ABI.

The library also provides a generic benchmarking harness (which we'll use a lot) that allows the user to pass in a function pointer to their own implementation of a GEMM for benchmarking, as long as it follows the specified function signature.

So, to avoid cluttering other posts with implementation details from the library, I decided to write an intro to it. In this post, we'll discuss the implementation and design details, and how to use it to benchmark your own experimental kernels.

The library lives here on GitHub. The README has build instructions, but I'll give a brief overview here.

Building and linking

Let's say that you've cloned the repository to ~/devblas (a simple git clone in the home dir).

Here's a quick overview before we move into the internal details. (These instructions are also in the repository README; I am repeating them here for completeness.)

To build, we first generate the build files with CMake and then run the build command. I recommend having reasonably recent versions of CMake and Ninja installed. Run CMake at the top level of the repository:

$ cmake -S. -Bbuild -GNinja -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=devblas_libs

You can plug in your own C and C++ compilers here. I mainly use Clang on WSL, so I specify it in the command. Now let's build and install the library.

$ cmake --build build && cmake --install build

This will build the code, produce a .so file, and install the library under devblas_libs/devblas in the source root.

To link a C file with the .so file, compile the file like this (from the repository root):

$ clang test.c -o test -Iinclude -Ldevblas_libs/devblas -ldevblas

To run the test binary, preload the shared object file and invoke the executable.

$ LD_PRELOAD=devblas_libs/devblas/libdevblas.so ./test

To make experimentation easier, I've provided an example file called bench_driver.c at the top level. It links automatically against the produced .so file, so you won't have to go through the hassle of linking manually. To enable compilation of bench_driver.c, add -DDEVBLAS_BUILD_BENCH=ON to the CMake command; it will produce a binary called bench_driver in the build directory.

$ cmake -S. -Bbuild -GNinja -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_BUILD_TYPE=Release \
  -DDEVBLAS_BUILD_BENCH=ON
$ cmake --build build
$ build/bench_driver
naive_gemm_ijk_driver:
        Average GFLOP/s: 0.601536
naive_gemm_ijk_driver:
        Average GFLOP/s: 0.611877

The Benchmarking Harness

The benchmarking harness is a crucial part of the library: we will use it extensively to benchmark our kernel implementations as the series progresses. Since we will always go through the C bindings, the harness's internals are not relevant to this post (it is much easier to read them directly on GitHub), so we will omit them here.

The harness provides two functions: bench_igemm(...) for IGEMMs and bench_sgemm(...) for SGEMMs.

Let's look at the bench_driver.c file to see how they're used.

#include "devblas/c_api/blas.h"
#include <assert.h>
#include <stdio.h>

int main(void) {
  bench_igemm(naive_igemm_ijk, "naive_gemm_ijk_driver", 1, 3,
              DEVBLAS_LAYOUT_ROW_MAJOR, 1024, 1024, 1024, 1024, 1024, 1024);
  bench_sgemm(naive_sgemm_ijk, "naive_gemm_ijk_driver", 1, 3,
              DEVBLAS_LAYOUT_COLUMN_MAJOR, 1024, 1024, 1024, 1024, 1024, 1024);
  return 0;
}

Both functions share the same signature, differing only in the element types of the matrices. We'll talk about bench_sgemm(...) from now on; everything applies equally to bench_igemm(...).

The first argument is a function pointer, naive_sgemm_ijk. It points to our implementation of the naive SGEMM that we called matmul() in the previous post. Its signature is:

void naive_sgemm_ijk(devblas_layout_t, const float *, const float *, float *,
                     int, int, int, int, int, int);

It takes a layout (we'll talk about that very soon), pointers to the two input matrices, a pointer to the output matrix, the integers M, N, and K (the matrix dimensions), and lda, ldb, and ldc (the leading dimensions, which I'll also get to very soon).

The second argument is the benchmark's name, which makes the logging easier to read; it defaults to default_naive_sgemm if you pass NULL. The third argument is the number of warmup iterations to run, so that we are not timing cold runs, and the fourth is the number of iterations to actually run and measure. The remaining arguments are simply the values to forward to naive_sgemm_ijk, so that the harness can invoke it with the correct arguments.

Let's write a very simple example "kernel" and plug that into the benchmark function.

#include "devblas/c_api/blas.h"
#include <stdio.h>

void my_experimental_kernel(devblas_layout_t layout, const float *a,
                            const float *b, float *c, int m, int n, int k,
                            int lda, int ldb, int ldc) {
  printf("Hello, World!\n");
}

int main(void) {
  bench_sgemm(my_experimental_kernel, "my_experimental_kernel", 0, 10,
              DEVBLAS_LAYOUT_ROW_MAJOR, 0, 0, 0, 0, 0, 0);
  return 0;
}

In this example, we just print "Hello, World!" in our kernel and tell the benchmark function to skip warmup and run the kernel 10 times. The matrix dimensions and leading dimensions (what we called lda, ldb, and ldc) are all 0 so that no matrices are allocated inside the benchmark function.

This outputs something like this:

Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
my_experimental_kernel:
        Average GFLOP/s: -nan

The function got called 10 times, as we expected, and gave -nan as the average GFLOP/s count. That's expected; we didn't give it anything to compute.
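
For reference, GEMM benchmarks conventionally count $2 \cdot M \cdot N \cdot K$ floating-point operations (one multiply and one add per innermost loop iteration). I won't detail the harness's bookkeeping here, but assuming that convention, the reported figure is roughly $$ \text{GFLOP/s} = \frac{2 \cdot M \cdot N \cdot K}{t \cdot 10^9} $$ where $t$ is the measured time per iteration in seconds. With every dimension set to 0, both the flop count and the elapsed time collapse toward zero, and the division degenerates into the -nan we see above.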

For a legitimate kernel, the harness will report a valid GFLOP/s figure, which gives us a fair basis for comparing what slows a kernel down and what speeds it up.

That is all we need to know about benchmarking for now. Let's move to...

Layout

No reasonable library represents a matrix as a nested 2-D array. Instead, matrices are stored as flat 1-D arrays, and we use strides to compute which index in the 1-D array corresponds to a given position in 2-D.

Now, anyone familiar with C or C++ thinks of a 2-D matrix as a collection of row arrays. However, the early BLAS implementations, and Fortran, which was (and still is) heavily used for scientific computing, represent 2-D matrices in column-major order: instead of traversing the rows first, you traverse the columns first.

In a setup that stores matrices as 1-D arrays, we can easily change our access pattern to switch between a row-major (as we have it in C, arrays of rows) or a column-major (as we have it in Fortran, array of columns) layout.

Let's consider the following matrix: $$ \begin{pmatrix} a & b \\ c & d \end{pmatrix} $$

In a row-major format, the stored 1-D array would look like this $$ \begin{bmatrix} a & b & c & d \end{bmatrix} $$

So the element after a given one is either the next element in the same row or the first element of the next row.

However, in a column-major layout, we would walk the matrix column first, something like this: $$ \begin{bmatrix} a & c & b & d \end{bmatrix} $$ so that c (just below a) becomes the element next to a in the 1-D array.

That is what layouts do. They just change the storage and access pattern of the 1-D array that represents the 2-D matrix.
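
To make the mapping concrete, here is a minimal sketch of the index arithmetic. These helper functions are illustrative and not part of the library:

#include <stddef.h>

/* Illustrative helpers (not part of devblas): map the 2-D index (i, j)
 * to a position in the backing 1-D array. */
static size_t row_major_index(size_t i, size_t j, size_t n_cols) {
  return i * n_cols + j; /* elements of a row sit next to each other */
}

static size_t col_major_index(size_t i, size_t j, size_t n_rows) {
  return j * n_rows + i; /* elements of a column sit next to each other */
}

For the 2x2 matrix above, c sits at (1, 0): row_major_index(1, 0, 2) yields 2 (the third slot), while col_major_index(1, 0, 2) yields 1 (the second slot), matching the two 1-D arrays we wrote out.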

Leading dimensions

Another concept we set aside earlier is the leading dimension (the lda, ldb, and ldc arguments above).

Now, we won't always want to multiply whole matrices in one go. Very often you'll need to multiply submatrices, and that is exactly what leading dimensions enable.

Let's extend the matrix we wrote above to a larger 3x3 size $$ \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} $$

Now let's say we have another matrix of the same size, but we don't want to multiply the whole matrix. Suppose we want to multiply the top left 2x2 submatrix. $$ S = \begin{pmatrix} a & b \\ d & e \end{pmatrix} $$

Assuming a row-major layout, we pass in the dimensions of the submatrix, M=2, N=2, K=2 into the GEMM function. However, our matrices are not 2x2 in size, they are 3x3. Therefore, to actually land on the first element of the next logical row, we pass in a leading dimension of 3 for both the input matrices.

This means that for the first matrix, of size MxK, we read the first K successive elements of each row, and the moment we cross K, we jump from the end of this logical row to the start of the next one by skipping (lda - K) elements. In other words, accessing the elements requires a stride: the number of elements we must jump to get from the first element of row $R$ to the first element of row $R+1$.

And leading dimensions are exactly that: the stride values used when operating on submatrices.
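
To tie layouts and leading dimensions together, here is a hedged sketch of what a naive row-major IJK SGEMM looks like once leading dimensions enter the picture. The function name and loop body are mine for illustration; the library's actual kernel may differ in its details:

/* Illustrative naive row-major SGEMM: C (MxN) += A (MxK) * B (KxN).
 * lda, ldb, ldc are the leading dimensions of the parent matrices the
 * operands live inside; for full matrices they equal the row widths. */
void sketch_sgemm_ijk(const float *a, const float *b, float *c,
                      int m, int n, int k, int lda, int ldb, int ldc) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      for (int p = 0; p < k; ++p)
        /* i * lda skips a full parent-matrix row of A, even though we
         * only read its first k columns. */
        c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

Calling this with M = N = K = 2 and lda = ldb = ldc = 3 on two 3x3 row-major matrices multiplies exactly the top-left 2x2 submatrices from the example.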

Closing comments

This was a text-heavy post, but it was necessary to introduce you to the architecture of the library, because I will be using it heavily in the upcoming posts.

I did not want to clutter subsequent posts (which will be theory- and visualization-heavy) with library implementation details, so it seemed better to introduce them up front rather than have you get lost in the clutter later.

I'll introduce the memory hierarchy, cache behavior, and locality in the next post, and we'll modify our naive GEMM kernel and benchmark the modifications as we discuss the concepts, so that we can observe their effects.

The next post will be out in a while. Stay tuned!