The Motivation
For years, I’ve relied on highly optimized BLAS libraries like OpenBLAS, BLIS, and MKL without ever really understanding the machinery inside them.
Recently, while working on some multithreaded code, I found myself benchmarking NumPy operations against naive C++ implementations. The results, though not surprising, were drastic enough to trigger a deeper curiosity about how NumPy achieves such speed. We all know NumPy is backed by C, and often links against BLAS libraries like OpenBLAS, but that only raises a more interesting question:
What makes a BLAS implementation so much faster than straightforward C++ loops?
To find a real answer to that question (not a hand-wavy one), I decided to build a simple (but fully working) BLAS library of my own, which I’m calling devblas.
Gathering Evidence
Of course, claims mean nothing without data, so I also wrote a small benchmarking setup to compare NumPy’s performance against raw C++ loops. The goal isn’t to show off numbers, but to illuminate the motivation behind diving into BLAS internals and to make this announcement a bit more engaging than a plain “I started a new project” post.
Before we get into the code, a quick disclaimer: all of these snippets were written and executed on WSL2 running on Windows 11, so the results might not be as eye-catching as they would be on a native Linux box or a tuned machine. The trends, however, are still relevant and, in my opinion, show a large enough gap to be useful.
Let's write the naive C++ GEMM (matrix multiplication) first. We start with a basic function called matmul that takes three int pointers (the first two are the input matrices, and the third is the matrix the results will be written into) along with the matrix dimensions M, N, and K.
The code is simple enough to be self-explanatory: we iterate in three nested loops and accumulate the products into C in the innermost loop. Here is the function:
void matmul(const int *A, const int *B, int *C, int M, int N, int K) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[i * N + j] = 0;
      for (int k = 0; k < K; ++k) {
        C[i * N + j] += A[i * K + k] * B[k * N + j];
      }
    }
  }
}
The reasoning is simple: the matrices are stored in row-major order, so the number of columns acts as the stride, and element (i, j) of a matrix with N columns lives at offset i * N + j. Once we have the relevant offsets, we read the values from A and B, multiply them, and accumulate the result in C.
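If the indexing feels abstract, here is a tiny standalone sketch (using a made-up 2x3 matrix, not part of the benchmark) showing how the row-major offsets map to elements:

#include <iostream>

int main() {
  // A hypothetical 2x3 matrix stored row-major in a flat array:
  // row 0: 10 11 12
  // row 1: 13 14 15
  int M[6] = {10, 11, 12, 13, 14, 15};
  int cols = 3;
  // Element (1, 2) lives at offset 1 * cols + 2 = 5.
  std::cout << M[1 * cols + 2] << std::endl; // prints 15
  return 0;
}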
But, on its own, this function gives us nothing! So we'll have to write a main function to actually call it. Here it is:
int main() {
  int M = 2000;
  int N = 2000;
  int K = 2000;
  // Let's use std::vector<int> for simplicity here.
  std::vector<int> A(M * K);
  std::vector<int> B(K * N);
  std::vector<int> C(M * N);
  // Fill up the vectors with increasing values.
  for (int i = 0; i < M * K; ++i) {
    A[i] = i;
  }
  for (int i = 0; i < K * N; ++i) {
    B[i] = i;
  }
  Timer t = Timer();
  matmul(A.data(), B.data(), C.data(), M, N, K);
  std::cerr << std::fixed;
  std::cerr << "Elapsed time: " << t.elapsed() << std::endl;
  return 0;
}
So essentially, we create two matrices A and B, each of size 2000x2000, and initialize their elements with increasing values from 0 up to 2000*2000 - 1.
Notice the Timer object in the function? Yep, since we don't have a convenient utility like Python's time module, we'll roll our own really simple timer that tells us how much time elapsed during the operation.
Here is the implementation:
struct Timer {
  Timer() : start_(std::chrono::steady_clock::now()) {}

  // Returns the elapsed time in milliseconds: count the elapsed
  // nanoseconds, then scale by 1e-6.
  double elapsed() {
    using namespace std::literals;
    auto now = std::chrono::steady_clock::now();
    return (double)((now - start_) / 1ns) * (double)1e-6;
  }

private:
  std::chrono::time_point<std::chrono::steady_clock> start_;
};
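If the division-by-1ns trick looks cryptic, here is an equivalent (and arguably more explicit) way to express the same conversion; this is just an illustrative alternative, not what the benchmark below uses:

#include <chrono>

// Same idea as Timer::elapsed(): build a double-precision millisecond
// duration directly from the steady_clock difference.
double elapsed_ms(std::chrono::steady_clock::time_point start) {
  auto now = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(now - start).count();
}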
So now, our full benchmark code looks like this:
#include <chrono>
#include <iostream>
#include <vector>

void matmul(const int *A, const int *B, int *C, int M, int N, int K) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[i * N + j] = 0;
      for (int k = 0; k < K; ++k) {
        C[i * N + j] += A[i * K + k] * B[k * N + j];
      }
    }
  }
}

struct Timer {
  Timer() : start_(std::chrono::steady_clock::now()) {}

  // Returns the elapsed time in milliseconds: count the elapsed
  // nanoseconds, then scale by 1e-6.
  double elapsed() {
    using namespace std::literals;
    auto now = std::chrono::steady_clock::now();
    return (double)((now - start_) / 1ns) * (double)1e-6;
  }

private:
  std::chrono::time_point<std::chrono::steady_clock> start_;
};

int main() {
  int M = 2000;
  int N = 2000;
  int K = 2000;
  // Let's use std::vector<int> for simplicity here.
  std::vector<int> A(M * K);
  std::vector<int> B(K * N);
  std::vector<int> C(M * N);
  // Fill up the vectors with increasing values.
  for (int i = 0; i < M * K; ++i) {
    A[i] = i;
  }
  for (int i = 0; i < K * N; ++i) {
    B[i] = i;
  }
  Timer t = Timer();
  matmul(A.data(), B.data(), C.data(), M, N, K);
  std::cerr << std::fixed;
  std::cerr << "Elapsed time: " << t.elapsed() << std::endl;
  return 0;
}
And that's it! Now to run this, we need to compile the code first, so let's do that.
$ clang++ naive_sgemm.cc -o naive_sgemm
And on running the binary, something like this should appear (note that the results are in ms):
$ ./naive_sgemm
Elapsed time: 21758.562206
The binary took a whopping 21.7 seconds to do a 2000x2000 GEMM!
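To put that number in perspective, here is a rough back-of-the-envelope calculation (assuming the usual count of 2*M*N*K scalar operations for a GEMM, one multiply and one add per innermost iteration):

#include <cstdio>

int main() {
  // 2 * 2000^3 = 16 billion scalar operations for the naive GEMM above.
  double ops = 2.0 * 2000 * 2000 * 2000;
  double seconds = 21.758; // the measured elapsed time from the run above
  std::printf("~%.2f G ops/s\n", ops / seconds / 1e9); // roughly 0.74
  return 0;
}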
And now, to compare, we write the same computation in Python, this time using NumPy. The code is small enough to fit in just one code block.
import numpy as np
import time


def matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    return A @ B


def main():
    M = 2000
    K = 2000
    N = 2000
    a = [i for i in range(M * K)]
    b = [i for i in range(K * N)]
    A = np.array(a).reshape(M, K)
    B = np.array(b).reshape(K, N)
    start = time.perf_counter()
    _ = matmul(A, B)
    print(f"Elapsed time: {(time.perf_counter() - start) * 1000:.6f}")


if __name__ == "__main__":
    main()
Running this is simple (assuming you have a virtual environment sourced and NumPy installed in it):
$ python np_sgemm.py
Elapsed time: 16247.785098
That's a difference of over 5 seconds, or roughly a 1.3x speedup in NumPy's favor!
And this is where the idea came from. Just to confirm that my NumPy installation was actually linked against a BLAS library, I ran
$ python -c "import numpy; numpy.__config__.show()"
whose output contained this snippet:
"Build Dependencies": {
"blas": {
"name": "scipy-openblas",
"found": true,
"version": "0.3.30",
"detection method": "pkgconfig",
"include directory": "/opt/_internal/cpython-3.12.12/lib/python3.12/site-packages/scipy_openblas64/include",
"lib directory": "/opt/_internal/cpython-3.12.12/lib/python3.12/site-packages/scipy_openblas64/lib",
"openblas configuration": "OpenBLAS 0.3.30 USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64",
"pc file directory": "/project/.openblas"
},
"lapack": {
"name": "scipy-openblas",
"found": true,
"version": "0.3.30",
"detection method": "pkgconfig",
"include directory": "/opt/_internal/cpython-3.12.12/lib/python3.12/site-packages/scipy_openblas64/include",
"lib directory": "/opt/_internal/cpython-3.12.12/lib/python3.12/site-packages/scipy_openblas64/lib",
"openblas configuration": "OpenBLAS 0.3.30 USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64",
"pc file directory": "/project/.openblas"
}
}
indicating that NumPy was linked against OpenBLAS during its build.
The Conclusion
This was enough evidence to motivate me to write a BLAS implementation. Of course, it shall not be a production-level library intended as a drop-in replacement for OpenBLAS; my goal is to take it far enough, in an educational context, to demonstrate visible performance gains.
As I mentioned earlier, this is really an announcement post for a series of upcoming posts in which we shall implement the library kernel by kernel, starting with SGEMM and following up with double precision and more kernels.
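For a taste of what "implementing SGEMM" means, here is a sketch of the operation the kernel performs, together with a hypothetical, simplified declaration (single precision, row-major, no transpose options yet); devblas's actual API may well end up looking different:

// The operation every *GEMM kernel implements:
//   C = alpha * op(A) * op(B) + beta * C
// where op() optionally transposes its argument.
//
// Hypothetical simplified signature; lda/ldb/ldc are the leading
// dimensions (row strides) of the respective matrices.
void sgemm(int M, int N, int K,
           float alpha, const float *A, int lda,
           const float *B, int ldb,
           float beta, float *C, int ldc);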
I shall link all the resources I follow or study from at the relevant time and place.
There's more to come, and lots to learn.
Till then, sit tight!