Diving Deeper into ML Kernel Design

NVFP4 on NVIDIA Blackwell · GPU Mode × NVIDIA Competition

A detailed walkthrough of designing and optimizing machine-learning kernels for the NVIDIA Blackwell architecture around the new NVFP4 4-bit data type. It covers four kernel families — Batched GEMV, GEMM, Dual GEMM, and Group GEMM — from a naive baseline through aggressive optimization, plus a standalone glossary of the hardware and programming concepts involved and a distilled set of PTX / hardware lessons that the official docs tend to leave out.

It is written both as a record of my own process and as a resource for anyone working close to the metal on Blackwell. If you spot something technically inaccurate, I'd genuinely like to hear about it — naregmegan@gmail.com.


Sections

00 Introduction
What this series covers, the GPU Mode × NVIDIA competition, and a primer on the NVFP4 data type and block scaling.

01 Batched GEMV
Iterative optimization of a batched matrix-vector kernel: kernel fusion, split-K, coalescing, and chasing the memory-bound wall.

02 GEMM
Using tcgen05 tensor-core instructions on Blackwell, scale-factor formatting, the 'core matrix' concept, and TMA data movement.

03 Dual GEMM
Fusing silu(A @ B1) * (A @ B2) into a single kernel that keeps both accumulators live in TMEM.

04 Group GEMM
Fusing a group of differently-shaped GEMMs into one launch with per-tile tensormap patching and careful scheduling.

05 Glossary
Standalone reference entries for the architectural and programming concepts used throughout the series.

06 Kernel Dev Notes
Process and tooling lessons: branch hygiene, tracking performance data, reading Nsight Compute for tcgen05, and using LLMs.

07 PTX & HW Lessons
The hardest-won, least-documented lessons about PTX, CUDA, and Blackwell hardware, grouped by topic.