Diving Deeper into ML Kernel Design

This is a detailed walkthrough of designing and optimizing machine-learning kernels for the NVIDIA Blackwell architecture, centered on the new NVFP4 4-bit data type. It covers four kernel families (Batched GEMV, GEMM, Dual GEMM, and Group GEMM), from a naive baseline through aggressive optimization. It also includes a standalone glossary of the hardware and programming concepts involved, and a distilled set of PTX and hardware lessons that the official docs tend to leave out.

It is written both as a record of my own process and as a resource for anyone working close to the metal on Blackwell. If you spot something technically inaccurate, I'd genuinely like to hear about it — naregmegan@gmail.com.

Sections
