Building BLAS, Part 2: Introducing devblas

vinayakdev.sci@gmail.com — Mon, 01 Dec 2025 00:00:00 +0000

Why a separate post to introduce the library?

In the previous post in the BLAS series, we saw NumPy significantly outperform a naive, IJK-ordered, triple-loop C++ implementation on the same matrix sizes. That post announced both the blog series and the library we will build as the series progresses. Over the past two weeks, I spent some time setting up the library and plugging our naive C++ implementation into the codebase.
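For readers who have not seen the previous post, here is a minimal sketch of what a naive IJK-ordered triple loop typically looks like. It is not taken from devblas: the function name `matmul_ijk`, the row-major flat `std::vector<double>` layout, and the `m`, `k`, `n` parameters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not the devblas API.
// Computes C = A * B for row-major matrices stored in flat vectors.
// A is m x k, B is k x n, and the returned C is m x n.
std::vector<double> matmul_ijk(const std::vector<double>& A,
                               const std::vector<double>& B,
                               std::size_t m, std::size_t k, std::size_t n) {
    std::vector<double> C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {          // rows of A and C
        for (std::size_t j = 0; j < n; ++j) {      // columns of B and C
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {  // dot product of row i and column j
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

In this ordering the innermost loop walks down a column of `B` with a stride of `n` elements, which is unfriendly to the cache for large matrices and is one common reason such a loop falls behind an optimized BLAS.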

Blog Series: Building a BLAS Library From Scratch

vinayakdev.sci@gmail.com — Mon, 17 Nov 2025 00:00:00 +0000

The Motivation

For years, I’ve relied on highly optimized BLAS libraries like OpenBLAS, BLIS, and MKL without ever really understanding the machinery inside them.

Recently, while working on some multithreaded code, I found myself benchmarking NumPy operations against naive C++ implementations. The results, though not surprising, were drastic enough to trigger a deeper curiosity about how NumPy achieves such speed. We all know that NumPy is backed by C and often links against BLAS libraries like OpenBLAS, but that only raises a more interesting question:
