Diving Deeper into ML Kernel Design
NVFP4 on NVIDIA Blackwell · GPU Mode × NVIDIA Competition
A detailed walkthrough of designing and optimizing machine-learning kernels for
the NVIDIA Blackwell architecture around the new NVFP4 4-bit data type. It covers
four kernel families — Batched GEMV, GEMM, Dual GEMM, and Group GEMM — from a
naive baseline through aggressive optimization, plus a standalone glossary of the
hardware and programming concepts involved and a distilled set of PTX / hardware
lessons that the official docs tend to leave out.
It is written both as a record of my own process and as a resource for anyone
working close to the metal on Blackwell. If you spot something technically
inaccurate, I'd genuinely like to hear about it: naregmegan@gmail.com.
Sections
00
Introduction
What this series covers, the GPU Mode × NVIDIA competition, and a primer on the NVFP4 data type and block scaling.
01
Batched GEMV
Iterative optimization of a batched matrix-vector kernel: kernel fusion, split-K, coalescing, and chasing the memory-bound wall.
02
GEMM
Using tcgen05 tensor-core instructions on Blackwell, scale-factor formatting, the 'core matrix' concept, and TMA data movement.
03
Dual GEMM
Fusing silu(A @ B1) * (A @ B2) into a single kernel that keeps both accumulators live in TMEM.
04
Group GEMM
Fusing a group of differently-shaped GEMMs into one launch with per-tile tensormap patching and careful scheduling.
05
Glossary
Standalone reference entries for the architectural and programming concepts used throughout the series.
06
Kernel Dev Notes
Process and tooling lessons: branch hygiene, tracking performance data, reading Nsight Compute for tcgen05, and using LLMs.
07
PTX & HW Lessons
The hardest-won, least-documented lessons about PTX, CUDA, and Blackwell hardware, grouped by topic.
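For orientation, the expression named in the Dual GEMM entry, silu(A @ B1) * (A @ B2), can be written as a tiny reference in plain Python. This is only a sketch of the math under stated assumptions, not the kernel from the series: it uses ordinary floats rather than NVFP4, ignores block scaling entirely, and the function names (`silu`, `matmul`, `dual_gemm_reference`) are hypothetical names chosen for illustration.

```python
import math


def silu(x):
    # SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    # Naive (m x k) @ (k x n) product over nested Python lists.
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def dual_gemm_reference(a, b1, b2):
    # Two GEMMs share the same A operand; their results are combined
    # elementwise. A fused kernel would avoid materializing both
    # intermediates in memory, which this reference does not attempt.
    left = matmul(a, b1)
    right = matmul(a, b2)
    return [[silu(l) * r for l, r in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]
```

A reference like this is mainly useful as a correctness oracle on small shapes: a fused kernel's output can be compared against it within a tolerance appropriate to the low-precision format.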