Kernel Dev Notes

NVFP4 Kernel Series · Section 6 of 7

In this section I cover some procedural and engineering related learnings and improvements that will help me, and hopefully the reader as well, in any future kernel development endeavors.

Organization of different dev branches

Understanding how hardware works and interacts with software is both a theoretical and experimental process. All optimizations should be driven by theoretical improvement; however, the implementation in code, complex interactions with other hardware features, and timing dependent quirks can all influence how different the real performance impact is from the theoretical impact.

Thus, it's important to track both the theory and implementation alongside one another in a cohesive manner. This is definitely something I struggled with throughout this process because there were so many complex features to implement and track. One way this process could've been improved is by using git to track all of the modifications in different dev branches.

For example, instead of creating iterative development files, a git branch would track the development of different features, and merging features would then become much easier to perform and track.

Having this organizational structure would have allowed me to more easily merge and iterate on successful optimizations, while setting failed experiments aside (with the ability to easily revisit them later by merging the feature in with later iterations of the kernel).

Organization of performance data

Building off of the last section it's crucial to keep performance data in tandem with code versions. Additionally, I often found it helpful to collect the same set of information for all iterations of the kernel just in case seemingly irrelevant information becomes useful later on.

Building a framework to collect my own metrics, which could be carried forward to further kernel iterations, was also crucial especially for kernels which use tensor cores (particularly tcgen05) and warp specialization.

Analyzing performance of tensor cores, especially tcgen05 (normal NCU numbers can be very misleading)

(See Glossary for further details on Nsight Compute)

As NVIDIAs focus has shifted towards accelerating ML workloads, in particular deep learning, recent generations of NVIDIA architectures have started devoting more chip resources to tensor cores (tcgen05 and TMEM for Blackwell) and large bandwidth data movement between SMs and global memory (TMA and async transactions). This means fewer and less of a focus on CUDA cores: the array of tiny ALUs that previously composed most of the compute for NVIDIA chips.

With this shift has come a change in the way modern kernels are programmed for NVIDIA chips. Optimizations using tcgen05, TMA, and concepts such as warp specialization have become far more prominent and useful (read more about these in the glossary and previous sections). However, some of the tooling shipped for NVIDIA developers hasn't been updated to properly reflect these changes. The main example I encountered in developing these NVFP4 kernels was using Nsight Compute to analyze warp specialized tcgen05 kernels.

If you attempt to analyze one of these kernels using NCU in the "traditional" way many of the statistics will be misleading:

Stall statistics can be skewed by the initial filling of the warp specialized pipeline. For example, an MMA kernel with three warp specialties of load, compute, and store will record the compute and store warps as stalled / idle until the first processing batch has been passed to it. This skew is especially pronounced for smaller problem sizes since those initial stalls make up a much larger portion of the overall runtime. Thus it's important to take into account these skews when analyzing where the kernel is experiencing the most stalls outside of these "priming" stalls.

The software paradigm consists of launching asynchronous operations and waiting on their completion. This can skew active warps per cycle counts because although warps are technically active after initiating asynchronous work (TMA, tcgen05 MMA, etc...), they get recorded as idle for those interim cycles where warps are waiting for that work to complete.

Occupancy is a less reliable signal for how well hardware is being utilized. Due to the asynchronous nature of the instructions being used in combination with the demand for large portions of SMEM per CTA, these both drive down the theoretical and achieved occupancy below what one would expect for how well the hardware is actually being used via the async operations (see glossary for further details on theoretical and achieved occupancy).

Something I need to look further into is what tools NCU provides to analyze tcgen05 and warp specialized kernels, or developing a tool for this myself.

Notes on using LLMs for kernel development

I periodically used Claude Code and Codex to help in kernel development in the following ways:

Debug

There were two use cases I found particularly useful when it came to AI assisted debugs:

1) Simple logical and syntatical errors: Usually the LLM could identify the root cause very quickly. This dramatically reduced the amount of time and brain-power spent identifying trivial mis-types or PTX syntax errors (like swapping of impact parameters).
2) Deciphering long and complex system error reports: Error messages can be quite verbose, especially when the written code is at the assembly level. I used these agents to consolidate long system error log dumps and decipher or track register or memory corruption through those error reports. This made tracking down the root cause of those errors much more efficient.

LLMs struggled the most when it came to more complex bugs related to synchronization and more recent hardware features (like the new features introduced in the Blackwell chip). Providing materials to be used in the model with RAG or grep improved the model's understanding, but without a full contextual understanding of the features being discussed, it often missed out on key ideas that could cause bugs.

An example debug that LLMs struggled with was the one I discuss in V15 of the group_gemm optimization log. The models didn't recognize the race condition between SMEM stores and TMA stores to GMEM due to their operation on separate memory proxies.

Assisted Ideation

Brainstorming with the help of AI can be helpful as it can double check the logic behind your ideas and suggest things you hadn't thought of yet. I found this method of using AI better than having it code outright because implementing the key features yourself forces you to both fully understand everything about the code and verify any new ideas the AI generated (which could potentially be incorrect).

Code refactoring and organization

LLMs are very good at refactoring and cleaning up the syntax of code to make it much cleaner and more readable, while keeping the underlying functionality the same. Of course, the larger the portion of code being re-written the more risk there is of small bugs or performance changes being introduced. For this reason I had the LLM restructure code in a chunked iterative manner, tackling smaller chunks of code to re-organize a larger section of code. That allowed me to verify correctness and performance remained unchanged.