← NVFP4 Kernels View source on GitHub ↗

Glossary

This glossary collects the architectural and programming concepts referenced throughout the kernel walkthroughs. The earlier sections introduce each concept in the context in which it first becomes useful; the entries below are intended as standalone references that you can read out of order. Where a concept maps directly to a PTX instruction or hardware feature I include the relevant syntax block.

Coalesced Memory Access and Memory Broadcasting

Coalesced memory access is the property that when a warp issues a memory operation, the threads within that warp address a contiguous, aligned range of bytes, so the hardware can service all 32 threads with the minimum number of memory transactions (ideally one cache line). On Blackwell a 128B transaction is the natural unit, so a warp loading 32 consecutive 4-byte values from an aligned address gets one transaction; the same load with a stride of 1 between threads becomes 32 transactions and runs ~32x slower. Coalescence is the single biggest knob on memory-bound kernels — the GEMV kernels in this blog gain ~20-30% just by reordering rows so that adjacent warps own adjacent slices of A.

Memory broadcasting is the dual property: when all threads in a warp read the *same* address, the hardware services the request as a single load and fans the value out to every thread. Broadcasts are also one-transaction operations, but for a different reason — the address is identical across the warp rather than contiguous. SFB scale-factor loads in GEMV/GEMM benefit from broadcasting because the same scalar applies to every result column for a given K block; that pattern means the compiler can fold the load into a single broadcast even when it appears inside a per-thread loop.

Register Spilling

Each thread on an SM gets a fixed-size slice of the SM-wide register file (generally 255 32-bit registers on Blackwell). If the compiler's live-range analysis decides the kernel needs more registers than that, the excess values "spill" to a per-thread region of local memory, which physically lives in L1/L2/HBM. Spilled values look like normal local variables in C++ but every read or write goes through the memory system; a few spilled variables in an inner loop can easily double a kernel's runtime.

Two signals tell you spilling is happening: ptxas warnings at compile time (-Xptxas -v will print <N> bytes stack frame, <M> bytes spill stores) and Nsight Compute's "Local Memory Accesses" counter being non-zero. The common remedies are reducing the unroll factor of inner loops, moving large intermediate arrays into SMEM, or shrinking the number of simultaneously-live values (e.g., by accumulating results in TMEM rather than registers). The GEMV V6_3 experiment explicitly trades a small SMEM round-trip for a register-count reduction in an attempt to lift occupancy.

Thread, Warp, CTA/Thread-block, Cluster, Grid

The CUDA execution hierarchy is five levels deep on Blackwell:

  Grid     -- the whole kernel launch
   Cluster  -- a group of CTAs (1-16) that share a fast interconnect (DSMEM)
    CTA     -- a thread block; runs entirely on one SM; owns SMEM
     Warp   -- 32 threads that issue instructions in lockstep
      Thread -- the smallest schedulable unit

A thread is the basic unit of execution; each thread has its own register state and program counter. Threads are grouped 32 at a time into warps; warps are the unit of instruction issue on the hardware (every cycle the scheduler picks a ready warp and issues one instruction across its 32 lanes). Warps are bundled into CTAs (also called thread blocks); a CTA runs entirely on one SM, can have between 1 and 32 warps, and owns the SMEM allocated to that CTA. CTAs are bundled into clusters of 1-16 (the default cluster size is 1, in which case the cluster level is invisible); all CTAs in a cluster co-reside on physically adjacent SMs and can talk to each other through the cluster interconnect (DSMEM, distributed SMEM). Clusters are bundled into a grid, which is the unit of work launched by a single kernel invocation.

Synchronization scopes are aligned with this hierarchy: __syncthreads() syncs the CTA, barrier.cluster.* syncs the cluster, no built-in mechanism syncs the grid (you have to launch a separate kernel for grid synchronization, or use the cooperative-groups API).

Spatial/Temporal Cache Locality

Cache locality refers to the property that recently used or nearby data is more likely to be needed again soon. Spatial locality means that if a program accesses address X, it is likely to access addresses near X soon after — this is what coalesced loads exploit, and what makes 128B cache lines a useful granularity. Temporal locality means that if a program accesses address X, it is likely to access X again soon — this is what keeps "hot" data in L1/L2.

Cache-aware kernel design tries to maximize both. In GEMM kernels the B-operand cache locality is the easy win: the same B tile is multiplied against every A row of the result tile, so as long as the B tile fits in SMEM (or L2 for a multi-CTA case), you only pay for one B load per output column. In GEMV the A matrix is read once and discarded, so spatial locality (coalesced loads) matters but temporal locality on A does not. The Cache Hint feature (covered separately) lets the kernel give the hardware explicit guidance about temporal locality on specific operands.

Operator/Kernel Fusion

Kernel fusion is the practice of combining multiple logical operations into a single GPU kernel launch, eliminating the intermediate round-trips through GMEM and the launch overhead between them. The naive implementation of silu(A @ B1) * (A @ B2) in the Dual GEMM section is three kernels: one for each matmul, plus one for the elementwise silu/multiply. The fused implementation does the same work in one kernel that keeps both GEMM accumulators live in TMEM and applies the elementwise ops on the way to GMEM, saving two kernel launches and two round-trips of the intermediate matrices.

The trade-off is per-kernel resource usage. A fused kernel needs SMEM / TMEM / registers for *all* of the fused operands at once; if any of those resources becomes the occupancy limiter, performance can regress. The Group GEMM kernel in this blog is the most extreme example of fusion in the project — g independent GEMM operations get fused into one kernel launch, with per-tile descriptor patching (tensormap.replace) or per-group pre-encoded tensormaps used to route each CTA to the right data buffers. The launch overhead saved at G=8 is ~2us, which is significant on problems where the kernel itself only takes ~10us.

Occupancy / Occupancy Limiter

Occupancy is the number of warps a single SM is running concurrently, expressed either as an absolute count or as a fraction of the SM's maximum supported warps (64 on Blackwell). High occupancy lets the SM hide latency: while one warp is stalled on a memory load or a TMA completion, the scheduler can pick another ready warp and keep issuing instructions.

A kernel's occupancy is set at compile/launch time by whichever resource runs out first. The three usual occupancy limiters are:

  Registers per thread -- 65536 per SM / threads-per-CTA / CTAs-per-SM
  SMEM per CTA          -- ~228KB per SM on Blackwell
  Threads per CTA       -- 2048 thread per SM

The compiler reports per-thread register usage with -Xptxas -v; Nsight Compute's "Occupancy" page reports which resource is the active limiter. Note that high occupancy isn't an end in itself — it's only useful when the kernel actually has parallelizable work to interleave. Many tcgen05 kernels run at low occupancy (a few warps per SM) and still saturate the tensor cores because the in-flight TMA and MMA operations carry their own internal pipelining.

Difference between theoretical and achieved occupancy - the above is "theoretical occupancy" defined statically at compile time by the resource demand of the kernel. "Achieved occupancy" is the average percentage of warps out of the theoretical maximum that remain "active" on the SM. A warp is "active" if it is running or ready to run (i.e. not stalled).

General NVIDIA GPU Compute Architecture (Blackwell)

Blackwell is the NVIDIA architecture introduced with the B100 / B200 / GB200 chips. From the compute side the relevant building block is the streaming multiprocessor (SM); a B200 chip has 148 SMs.

Each SM contains:

- 128 CUDA cores (FP32/INT lanes) organized into 4 sub-partitions - 4 warp schedulers, one per sub-partition - One tensor core unit, accessed via the tcgen05 instruction family - 256KB of tensor memory (TMEM), accessed only by tensor cores - 228KB of shared memory (configurable carveout shared with L1 cache) - 65536 32-bit registers in a unified register file

The tensor core is what makes Blackwell distinctive for ML workloads: it implements block-scaled MMA operations natively for FP4/FP6/FP8 inputs, with the scale factors and the accumulator both living in TMEM. The tcgen05.mma instruction (covered in detail in the GEMM section) is the gateway to this hardware.

General NVIDIA GPU Memory Architecture (Blackwell)

The memory hierarchy on Blackwell is six levels deep, from largest / slowest to smallest / fastest:

  HBM3e (GMEM)   -- 192 GB on a B200; bandwidth ~8 TB/s; latency ~1us
  L2              -- ~60MB on a B200; bandwidth ~14 TB/s; latency ~200ns
  L1 / SMEM      -- 228KB per SM; configurable carveout; latency ~30ns
  TMEM           -- 256KB per SM; accessible only by tensor cores
  Registers      -- 256KB per SM (65536 x 32b); per-thread; latency ~5ns
  DSMEM          -- the cluster interconnect that lets one CTA address another CTA's SMEM in the same cluster; latency is in between L1 and L2.

Kernel design is largely about choosing which level of the hierarchy each piece of data lives in. The naive approach (everything in GMEM, loaded into registers as needed) is dominated by HBM bandwidth on most workloads; the Blackwell-optimized approach uses TMA to bring tiles GMEM->SMEM asynchronously, uses tcgen05 to keep accumulators in TMEM, uses SMEM as a staging buffer between TMEM and the TMA store engine, and only touches the register file for elementwise post-processing.

Tail Effect / Waves

A "wave" is one round of CTAs running concurrently on the GPU. If a B200 with 148 SMs can hold one CTA each at the kernel's resource requirements, then launching 296 CTAs runs as two waves: wave 1 fills all 148 SMs, then wave 2 takes their place.

Tail effect is the inefficiency that shows up when the total number of CTAs isn't an even multiple of the wave size. Launching 149 CTAs runs the same wall-clock time as 296 CTAs, because the leftover CTA spends a whole wave's worth of time blocking the kernel from finishing. The fewer total waves a kernel has, the more pronounced the tail effect is in relative terms — a kernel that takes 5 waves with a 30%-full final wave loses 6% of its theoretical throughput; a kernel that takes 1 wave with a 30%-full single wave loses 70%.

Persistent kernels (covered in the Group GEMM section) eliminate the tail effect entirely by having each CTA loop over multiple work tiles, so the kernel's runtime is determined by total work divided by wave size rather than by ceil(work / wave_size).

Split-K

Split-K is a parallelization strategy for matrix multiplication that splits the K-dimension across multiple CTAs (or threads). Each shard computes a partial accumulation A[m, k0:k1] @ B[k0:k1, n], and at the end the partial results are summed into the final result. Split-K trades a small amount of extra work (the reduction) for more parallelism on problem shapes where M and N are small relative to K.

The reduction can happen at several scopes:

- Within a CTA: each thread/warp produces a partial sum, which is reduced via SMEM and warp shuffles. This is what GEMV V2/V3 do. - Across CTAs in a cluster: partial sums land in DSMEM and are reduced cluster-locally before the final write. This is what the failed Dual GEMM V5 experiment attempted. - Across the whole grid: partial sums land in a GMEM workspace and a second kernel reduces them. This is what Dual GEMM V3 and the failed Group GEMM split-K experiment use.

The cost of split-K scales with the reduction scope: in-CTA is nearly free, cluster-local adds DSMEM bandwidth, grid-wide adds GMEM bandwidth and a kernel launch. Whether split-K wins depends entirely on whether the parallelism gain outweighs the reduction cost for the given problem shape.

Shared Memory

Shared memory (SMEM) is per-CTA scratchpad memory that lives on the SM. On Blackwell each SM has 228KB of combined L1/SMEM, configurably split; in practice the kernels in this blog set the L1/SMEM carveout via cudaFuncSetAttribute(..., cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared) so that ~228KB is available to SMEM.

SMEM is organized into 32 banks, each 4 bytes wide. A warp can read or write 32 different banks in one cycle without conflict; if two threads in a warp hit the same bank with different addresses, the access serializes (a "bank conflict") and the warp pays a cycle for each conflict. The Swizzle modes covered separately are how the hardware remaps SMEM accesses to avoid bank conflicts on tcgen05 / TMA accesses.

SMEM is also the only memory level that participates in the shared::cluster qualifier — when a TMA instruction loads to shared::cluster, the destination is a DSMEM address that may be on a peer CTA's SMEM. This is what enables TMA multicast and DSMEM-backed reductions.

Swizzling and Swizzle Modes

Swizzling is the practice of remapping the SMEM addresses used to store a matrix tile so that subsequent warp-level loads/stores don't collide on SMEM banks. The PTX tcgen05 / TMA APIs support several swizzle modes:

  CU_TENSOR_MAP_SWIZZLE_NONE   -- linear layout
  CU_TENSOR_MAP_SWIZZLE_32B    -- 32-byte swizzle
  CU_TENSOR_MAP_SWIZZLE_64B    -- 64-byte swizzle
  CU_TENSOR_MAP_SWIZZLE_128B   -- 128-byte swizzle

The swizzle mode is encoded into the TMA descriptor at host-side encode time; it also affects the matrix descriptor's LBO/SBO computation (the "core matrix" shape changes from 8x16B in no-swizzle mode to 8x128B in 128B-swizzle mode). The relevant detail at the kernel level is that choosing a swizzle mode is mostly a question of which tile widths fit the available bandwidth: 128B swizzle gives the best bank-conflict behavior for large tiles, while smaller swizzle modes are useful when the tile width is narrow enough that 128B would waste SMEM.

For NVFP4 GEMM the kernels in this blog default to 128B swizzle for the A and B operand tiles, because both are large enough (typically 128x256 or larger) to benefit. The Group GEMM V18 transposed-output experiment uses 64B swizzle for a narrower tile.

Hardware Utilization

Hardware utilization is the measure of how much of the GPU's theoretical peak you're actually getting. It can be sliced several ways:

- SM occupancy: what fraction of SMs are running any work - Per-SM throughput: what fraction of the SM's tensor-core / FP32 / memory peak the kernel is hitting on the SMs it does occupy - Wave utilization: how full the last wave of CTAs is

A kernel that launches 8 CTAs on a 148-SM GPU has at most 5% SM occupancy regardless of how efficient those 8 CTAs are; a kernel that launches 148 CTAs but only uses 5% of each SM's tensor core throughput has 100% SM occupancy and 5% per-SM throughput. The right metric to chase depends on the workload: GEMV-class problems usually have plenty of work to fill SMs but low per-SM throughput because matrix-vector products are bandwidth-bound; tcgen05 GEMM on small shapes can have near-peak per-SM throughput but few CTAs to launch.

The persistent kernel design pattern (and CLC) addresses both at once: launch exactly enough CTAs to fill the GPU once, then have each CTA loop over multiple output tiles.

NVCC Compiler Flags for Optimization

The kernels in this blog use a handful of NVCC and ptxas flags worth calling out:

  -gencode=arch=compute_100a,code=sm_100a
       Targets the Blackwell (sm_100a) architecture specifically. The `a` suffix means "architecture-accelerated" — features like tcgen05 only compile under sm_XX0a.

  -Xptxas, --allow-expensive-optimizations=true
       Allows ptxas to spend more compile time on instruction scheduling, register allocation, and constant folding. Produced a >2x speedup on the GEMV V6_2_2 kernel by collapsing FP4/FP8 conversion sequences into more compact PTX.

  --relocatable-device-code=false (i.e., -rdc=false)
       Disables device-side linking, which lets ptxas perform whole-program optimization rather than emitting linkable stubs. Generally improves register allocation slightly.

  -Xptxas -v
       Prints register usage, SMEM usage, and spill statistics per function. The standard diagnostic for "why isn't this compiling at the occupancy I expected?"

The GEMV V6_2_2 entry has the most striking demonstration: a single-line code change combined with the --allow-expensive-optimizations=true flag was responsible for a ~50% latency reduction.

Nsight Compute (NCU)

Nsight Compute is NVIDIA's kernel-level profiler. It instruments a single kernel launch and produces a structured report with sections for SM utilization, memory bandwidth, occupancy, source-level instruction mix, roofline analysis, and per-warp stall reasons.

Important caveats for kernels that use tcgen05 and single-thread launches:

(a) NCU sees the executing thread, not the tensor core. tcgen05.mma is issued by one thread; the actual MMA work happens on hardware that NCU's normal counters don't expose. You see a long sequence of mbarrier waits and very few "useful" instructions per cycle, which the profiler may mis-flag as "stalled" even when the kernel is near-peak. The metrics worth looking at are mbarrier stalls (which section of the pipeline is the bottleneck), TMA throughput, and the L2 / HBM traffic counters.

(b) NCU significantly perturbs kernel timing because it inserts instrumentation. Don't trust the wall-clock timing reported by NCU; use CUDA events outside the profiled kernel for that.

(c) The "Source" view of NCU on tcgen05-heavy kernels can be misleading because much of the work happens in a single asm volatile block, so all the SASS-level counters bucket into that single line.

In the GEMV V6_2_4 entry I describe how mis-reading NCU's "Arithmetic instruction" attribution sent me in the wrong direction for several kernel iterations before I uncovered the actual bottleneck (FP4/FP8 conversion intrinsics generating dozens of LOP3/PRMT/IMAD instructions).

Latency Hiding and Software Pipelining

Latency hiding is the practice of issuing slow operations early enough that something else can run while they complete. GPUs hide latency in two main ways:

(a) Warp-level: the SM has many concurrent warps, and when one warp stalls (e.g., on a memory load) the scheduler picks another. This is what high occupancy buys you — more warps means more candidates to swap in.

(b) Software-level: explicit pipelining where the producer (TMA) and consumer (MMA) run on different warps with multiple SMEM buffers, and the producer races ahead to fill future buffers while the consumer is still working on current ones. This is what N-stage ring buffers and pipeline depth (PIPE_STAGES) achieve.

The N-stage software pipeline is the dominant pattern in tcgen05 kernels. The producer warp (TMA) fills PIPE_STAGES buffers ahead; the consumer warp (MMA) processes them in order; mbarriers track which stages are ready for consumption and which buffers are free to overwrite. Tuning PIPE_STAGES is a trade-off between latency hiding (higher is better) and SMEM occupancy (higher means less SMEM for other things).

Cross-tile pipelining (introduced in Group GEMM V11/V15) extends this across work-tile boundaries: the producer doesn't drain the pipeline between tiles, so TMA loads for the next tile begin while the current tile's MMA is still running. The complementary technique is TMEM ping-pong (Group GEMM V13/V16/V22), which decouples the MMA->epilogue handoff in the same way.

Asynchronous Mechanisms Details

"Asynchronous" on Blackwell means hardware-level: the instruction returns to the issuing thread immediately, and the actual work runs on a separate hardware unit (TMA engine, tensor core, etc.) that completes some time later. The kernel synchronizes with that hardware unit via an mbarrier.

The async instructions used in this blog are:

  cp.async.bulk{.tensor}.*        -- TMA loads/stores; tracked by mbarrier
  cp.async.bulk.commit_group +
  cp.async.bulk.wait_group        -- bulk-group tracking for TMA stores
  tcgen05.mma                     -- tensor core MMA; tracked by mbarrier via tcgen05.commit
  tcgen05.cp                      -- SMEM->TMEM copy; ordered against subsequent tcgen05.mma by hardware rules, no explicit fence needed
  tcgen05.ld                      -- TMEM->register load; tracked by tcgen05.wait::ld
  clusterlaunchcontrol.try_cancel -- CLC work-steal; tracked by mbarrier

Each of these has a slightly different completion-tracking mechanism, but the pattern is consistent: issue → wait → consume. The producer-consumer mbarrier dance (covered under Memory Consistency) is the glue.

TMA Details

The Tensor Memory Accelerator (TMA) is a dedicated hardware unit on each SM that performs asynchronous, multi-dimensional tensor transfers between GMEM and SMEM/DSMEM. TMA is the modern replacement for the older cp.async family — same idea but with native support for strided multi-D tiles, swizzling, and built-in mbarrier completion tracking.

Key TMA instructions:

  cp.async.bulk.tensor.{1,2,3,4,5}d.{...}    -- multi-D tile transfer
  cp.async.bulk.tensor.*.multicast::cluster  -- TMA Multicast variant
  cp.async.bulk{.tensor}                     -- 1D bulk (used for SF)

Every TMA tensor instruction takes a *tensor map* (CUtensorMap, also called a "TMA descriptor"). The tensor map is a 128-byte opaque struct encoded by the CUDA driver API (cuTensorMapEncodeTiled) that describes:

- The base GMEM address - The full GMEM shape (3D, 4D, etc.) and per-dim strides - The tile shape that this descriptor produces in SMEM - The swizzle mode - The element type and interleaving format

The tensor map is opaque because its layout is hardware-dependent and not part of the stable ABI; you can't initialize one in device code by hand. You can, however, modify selected fields of an existing tensor map from device code using tensormap.replace (introduced in Group GEMM V6), which is what enables per-CTA per-tile descriptor patching for variable-shape problems like Group GEMM.

Completion tracking: TMA loads use the mbarrier::complete_tx::bytes mechanism — every TMA load tells its associated mbarrier how many bytes it transferred, and the mbarrier flips its phase when the expected total has arrived (set by mbarrier.arrive.expect_tx). TMA stores use a different mechanism, cp.async.bulk.commit_group + cp.async.bulk.wait_group N, which batches all in-flight TMA stores into a group and waits for the group as a whole to complete.

Tensor Memory (TMEM)

Tensor Memory (TMEM) is a per-SM 256KB memory region that exists specifically to feed and receive data from the Blackwell tensor cores. Architecturally it sits between SMEM and the register file in the memory hierarchy; functionally it's only addressable by tcgen05 instructions, not by ordinary loads/stores.

TMEM is organized as 128 rows (called "lanes") of 512 columns, where each cell is 32 bits. Allocation happens via tcgen05.alloc, which carves out a contiguous range of *columns* (all 128 lanes) and returns the column-index of the allocation. Per the PTX rules, allocations must be a power-of-2 number of columns with a minimum of 32.

What goes in TMEM:

- The MMA accumulator (the d-tmem operand of tcgen05.mma) — laid out as M rows x N columns of FP32 values for an MxN tile. - Block scale factors (SFA, SFB) — laid out in a special 32x4x4-style format dictated by the MMA hardware.

What does NOT go in TMEM:

- Input matrix data (A, B); these stay in SMEM and are passed to tcgen05.mma via matrix descriptors that point into SMEM.

The TMEM Multi-buffering glossary entry (later) covers how the kernels in this blog allocate 2 or 3 distinct result regions in TMEM to overlap MMA on the current tile with the epilogue's TMEM read of the previous tile.

CTA_GROUP Details

CTA_GROUP is a PTX qualifier (.cta_group::1, .cta_group::2) that appears on many tcgen05 instructions and on TMA instructions. It controls how many CTAs in a cluster collaborate on a single operation:

  .cta_group::1 -- single-CTA scope (the default for most use cases). The instruction's effect is local to this CTA.
  .cta_group::2 -- two-CTA scope. The two CTAs of a 2-CTA cluster collaborate on the operation, splitting the work across both SMs.

For tcgen05.mma.cta_group::2 ("2SM MMA" in NVIDIA marketing materials), the M dimension of the MMA is doubled relative to the single-CTA form, with each CTA owning half of the resulting TMEM accumulator. The MMA hardware on the two SMs collaborates over the cluster interconnect to combine A from one CTA's SMEM with B from the other's. The Dual GEMM V2 kernel uses this to halve the per-CTA SMEM pressure for B, which lifts occupancy and lets the software pipeline go deeper.

For TMA, .cta_group::2 similarly indicates that the load should be attributed to both CTAs of the cluster — important for the mbarrier arrival accounting when the two CTAs are sharing the loaded tile.

Note that CTA_GROUP is distinct from TMA Multicast. Multicast also fans data out to multiple CTAs, but the *MMA itself* still runs on one CTA; CTA_GROUP::2 changes the MMA semantics.

DSMEM Details

DSMEM (Distributed Shared Memory) is the cluster-scope view of SMEM: every CTA in a cluster can read and write the SMEM of every other CTA in the same cluster, via a special address space tag (shared::cluster). Physically the access goes over the cluster interconnect, which is much faster than going through L2/GMEM but slower than local SMEM.

DSMEM is the mechanism that lets:

- TMA Multicast deliver the same data to multiple CTAs (the destination address is a DSMEM address that resolves to the same SMEM offset on each receiver). - Multi-CTA reductions write partial results into DSMEM and reduce them cluster-locally, without going through GMEM. - 2SM tcgen05.mma instructions share A or B tiles between the two CTAs of a 2-CTA cluster.

A few syntactic markers signal DSMEM usage:

  cp.async.bulk.tensor.*.shared::cluster.*    -- DSMEM destination
  mbarrier.arrive.*.shared::cluster.*          -- arrival on a DSMEM mbar
  tcgen05.commit.*.shared::cluster.*           -- DSMEM-scope commit

The trade-off with DSMEM is twofold: clusters must be co-launched onto physically adjacent SMs, which constrains scheduling; and the cluster size is capped at 16, so DSMEM doesn't scale as a substitute for L2/GMEM communication. For matrix multiplication it's a near-perfect fit because clusters of 2-8 CTAs map naturally onto the cooperative-MMA patterns the tensor cores already support.

Memory Consistency and Synchronization Details

The Blackwell memory consistency model has three pieces that the kernels in this blog interact with: mbarriers, fences, and proxies.

Mbarriers (memory barriers, but they're more like a hybrid of a barrier and an asynchronous counter than a classical lock). An mbarrier is a 64-bit object in SMEM with three pieces of state:

- Expected arrival count (initialized via mbarrier.init) - Expected byte count (set by mbarrier.arrive.expect_tx) - Phase bit (flipped each time the barrier "completes")

A producer signals arrival with mbarrier.arrive (decrements arrival count) or, for TMA, via the mbarrier::complete_tx::bytes qualifier on the TMA instruction (decrements arrival count by 1 *and* decrements byte count by the bytes transferred). When both counters hit zero, the phase bit flips. Consumers spin on mbarrier.try_wait.parity, which returns true once they observe the expected phase. The phase tracking is what allows a single mbarrier to be reused for many producer-consumer rounds without re-initialization — each round flips the phase, and the consumer expects the next phase value.

Fences. The Blackwell ISA has several kinds of fences:

  fence.proxy.{kind}.{scope}         -- memory-proxy fence
  fence.mbarrier_init.release.cluster -- mbarrier init visibility
  tcgen05.fence::after_thread_sync   -- tcgen05 instruction ordering

Proxies. Blackwell has multiple "memory access proxies", each with its own coherence domain:

  generic proxy   -- ordinary loads/stores
  async proxy     -- TMA / cp.async.bulk operations
  tensormap proxy -- CUtensorMap descriptor accesses

A release through one proxy is not automatically visible to a subsequent access through another. The standard pattern is a fence.proxy.{target_proxy}.{scope} between writes through one proxy and reads through the other. The tensormap.cp_fenceproxy instruction (Group GEMM V6) is a specialized combined copy + proxy fence used specifically for tensormap.replace flows.

Warp Specialization

Warp specialization is the practice of assigning different roles to different warps within a CTA. Pre-tcgen05 kernels typically used the same code path for every warp (one warp = one block of threads); tcgen05 kernels generally don't, because most tcgen05 operations are issued by a single thread anyway, so dedicating warps to disjoint roles lets each warp's instruction stream focus on one phase of the software pipeline.

The canonical tcgen05 GEMM kernel has three roles:

  TMA warp           -- 1 warp; issues TMA loads, manages mbar_addr_tma / mbar_addr_mma producer-consumer
  MMA warp           -- 1 warp; issues tcgen05.cp + tcgen05.mma, signals mbar_addr_epi when done
  Epilogue warps     -- 4 warps; load TMEM via tcgen05.ld, convert to FP16, store to GMEM (either directly or via a TMA store)

Group GEMM V17 adds a fifth role:

  Tile-descriptor warp -- 1 warp; resolves (group, m_off, n_off, sfa, sfb, M) for upcoming tiles and publishes via SMEM slots gated by mbar_addr_tile_ready

The roles communicate exclusively through mbarriers (and, for the tile-descriptor warp, through SMEM-published metadata). Each role's code path is a tight loop optimized for that role's instruction mix — the TMA warp is dominated by cp.async.bulk.tensor instructions, the MMA warp by tcgen05.mma, the epilogue warps by tcgen05.ld + half2 stores. Crossing role boundaries inside the same warp would force the compiler to merge those instruction streams and produces worse scheduling.

Cache Hint Details

PTX TMA and bulk-async instructions accept an optional cache hint operand that tells the L2 cache replacement policy how the data is expected to be reused. The hint is supplied via the L2::cache_hint qualifier and a 64-bit policy value:

  EVICT_NORMAL    -- default behavior; data competes for cache space normally
  EVICT_FIRST     -- data is unlikely to be reused; evict early
  EVICT_LAST      -- data is likely to be reused; keep in cache longer
  EVICT_UNCHANGED -- don't update the eviction priority

For matrix multiplication these hints map naturally onto operand re-use patterns. In Dual GEMM, the A matrix is read multiple times (once for each of A @ B1 and A @ B2 within a single tile, and again across multiple output tiles); B1 and B2 are read once per tile and then discarded. So A loads get EVICT_LAST and B loads get EVICT_FIRST. The hint doesn't change correctness — it's a performance optimization that becomes meaningful once L2 pressure is non-trivial.

Data Layout

Several non-obvious data layout constraints come up in tcgen05 kernels:

(a) Core matrix layout for A/B in SMEM. The tensor core hardware expects tiles in SMEM laid out as "core matrices" — 8x16B (or 8x32B / 8x64B / 8x128B depending on swizzle mode) atomic units, with K-major core-matrix ordering. The TMA's 3D transfer mode is the canonical way to produce this layout from a row-major GMEM matrix. The Group GEMM section discusses the {32, mn_dim, k_dim/32} 3D shape that gives the right layout.

(b) Scale factor 32x4x4 layout in TMEM. SFA / SFB scale factors have an unusual repeated-row layout in TMEM dictated by the MMA hardware. The competition inputs were pre-reshuffled into this layout, so the kernels in this blog can load them with simple 1D contiguous TMA transfers (tcgen05_1dtma_g2s_sf); a more general kernel would need a 3D TMA with the right strides.

(c) Output tile layout for C in SMEM. The TMA store engine expects the SMEM output tile in a layout compatible with the target tensor map. In the Group GEMM kernel the SMEM C tile is laid out as TD_SMEM_M x OUT_N_CHUNK halves, written by the epilogue's tcgen05.ld + half2 stores, then transferred to GMEM by a 2D TMA store with the C tensor map.

The transposed-output experiments (Group GEMM v6_3 / V18) are the most layout-intensive parts of the blog because they swap many of the assumptions about who is row-major vs column-major.

Host/Device Data Transfer Details

Three mechanisms appear in this blog for moving data between host and device memory:

(a) cudaMemcpy / cudaMemcpyAsync. The standard DMA path. The driver issues a DMA over PCIe (or NVLink, if available); the transfer goes through HBM on the device side. Async variants let the H2D copy overlap with subsequent host-side work.

(b) Pinned (page-locked) host memory via cudaMallocHost. Pinning removes the page-fault and intermediate-staging-buffer overheads from DMA, so cudaMemcpyAsync on pinned memory is ~2x faster than on pageable memory. The pinning cost is paid once at allocation, so it's only a win for buffers that get reused across many kernel launches.

(c) __grid_constant__ kernel arguments. A CUDA C++ qualifier on a kernel parameter that places the argument in a constant memory bank broadcast to all CTAs in the grid. The argument is folded into the kernel launch parameters, capped at ~32KB. The Group GEMM V14 entry uses this to ship the entire GroupDescs table (including per-group pre-encoded tensormaps) without a separate cudaMemcpy.

There is a fourth mechanism not used in this blog: zero-copy mapped memory via cudaHostAlloc(..., cudaHostAllocMapped), which gives the GPU a pointer into pinned host memory that it can read directly over PCIe. Zero-copy is useful for very small transfers where the DMA setup overhead exceeds the transfer time, but the per-access PCIe latency makes it a poor fit for the kernel hot path.

Device Memory Management

The kernel hot path should not contain allocator calls. cudaMalloc on the order of 5-10 microseconds — comparable to the kernel runtime itself for the small-shape benchmarks — and cudaFree is similar.

Two patterns avoid allocator overhead:

(a) Persistent buffers across calls. The kernel wrapper holds device-side buffers as static globals, allocates them lazily on the first call, and re-uses them on subsequent calls. The Group GEMM V3 entry (PersistentBuffers struct) is the canonical example. When the required size grows, the wrapper frees and re-allocates; otherwise it just re-uses. The trade-off is that the buffers stay allocated until process exit, which is generally fine for inference workloads.

(b) __grid_constant__ arguments (covered under Host/Device Data Transfer Details) sidestep allocation entirely for small metadata tables.

The competition harness in this blog deliberately re-runs the kernel many times for benchmarking, which makes static persistent buffers a near-free optimization (the first call pays the allocation cost; subsequent calls reuse). For a one-shot kernel the trade-off is less clear-cut.

Cluster Launch Control (CLC)

CLC is a Blackwell hardware feature that lets a kernel implement *dynamic* persistent scheduling: each CTA in the grid attempts to "steal" work from CTAs that haven't been launched yet, and the hardware atomically cancels the launch of the stolen CTA.

The PTX entry point is clusterlaunchcontrol.try_cancel.async:

clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.b128 [addr], [mbar];

The instruction writes an opaque 128-bit response to SMEM (success plus the stolen CTA's ctaid, or failure) and arrives on an mbarrier when complete. The CTA then queries the response with clusterlaunchcontrol.query_cancel.is_canceled and, if it succeeded, extracts the stolen ctaid with clusterlaunchcontrol.query_cancel.get_first_ctaid::x.b32.b128.

CLC is the dynamic complement to static persistent scheduling (which assigns work tiles to CTAs based on tile_idx % gridDim.x). The advantage is automatic load balancing across CTAs whose tiles have varying work; the disadvantage is one mbarrier round-trip per tile boundary, which on small problem shapes can cost more than the static scheduler's worst-case imbalance.

The Group GEMM V8 entry introduces CLC; V11 uses it; V12 backs out of CLC for the static-persistent path because the per-tile overhead dominated on the small benchmark shapes.

Cluster Multicast TMA (Group GEMM V17+)

cp.async.bulk.tensor.{...}.multicast::cluster.{...} issues a single TMA load from HBM and fans the data out to multiple CTAs in a cluster via the cluster interconnect (DSMEM path). A 16-bit ctaMask operand selects which CTAs in the cluster receive the data: bit i set means "deliver to CTA i in the cluster and arrive on its mbarrier".

The mbarrier that tracks completion must be initialized with an arrival count equal to the number of receiving CTAs, because each receiving CTA records its own arrival. The corresponding tcgen05.commit uses .multicast::cluster so a single MMA completion releases the shared mbarrier on every CTA in the cluster.

Used in Group GEMM V17 to halve HBM traffic on the A operand of a 2-CTA cluster: only the leader CTA issues the TMA, and the multicast delivers the same A tile to both CTAs. Distinct from 2SM tcgen05.mma (.cta_group::2), which is a different collaboration pattern entirely — multicast TMA only changes how input data is delivered; the MMA itself still runs on a single CTA.

TMEM Multi-buffering / Ping-pong (Group GEMM V13+)

Allocating 2 or 3 distinct TMEM regions for results, indexed by a rotating tmem_buf counter (XOR for 2-deep, modulo for 3-deep). Lets the MMA warp write into one region while the epilogue warps read from another, decoupling the long TMEM->reg->half->SMEM->TMA chain from the MMA critical path.

tcgen05_alloc_tmem<1>(tmem_addr_ptr, TD_MMA_N * 2 * NUM_BUFFERS);
const int tmem_result_ptrs[NUM_BUFFERS] = { ... };
// MMA warp: tmem_buf = (tmem_buf + 1) % NUM_BUFFERS; write to result_ptrs[tmem_buf]
// Epilogue:  tmem_buf = (tmem_buf + 1) % NUM_BUFFERS; read from result_ptrs[tmem_buf]
// Synchronization: NUM_BUFFERS pairs of (epi, epi_done) mbarriers,
//                  each pair indexed by tmem_buf

Each TMEM region needs its own epi / epi_done mbarrier pair with independent phase counters. Getting the phase tracking wrong is the most common correctness/hang trap in this blog — see sub_v6_tmem_circ and submission_v5_static_overlap. For 2-deep, XOR with a first_tile guard on the consumer side is enough; for 3-deep, modulo arithmetic plus a primed flag works.

2-deep introduced in Group GEMM V13; 3-deep in V22. The depth is constrained by total TMEM capacity (512 columns) and by what's worth pipelining — the third buffer in V22 was a small win on top of the 2-deep design.

Cross-tile TMA Pipelining (Group GEMM V11+)

Carrying TMA pipeline state across work-tile boundaries using a monotonic glob_k_off counter plus a first_tile prologue:

int glob_k_off = 0;
bool first_tile = true;
for (int tile_idx = blockIdx.x; tile_idx < total_tiles; tile_idx += gridDim.x) {
    if (first_tile) {
        // Pre-fill PIPE_STAGES of TMA loads
        for (int s = 0; s < PIPE_STAGES; s++) tma_load_stage(s * TD_SMEM_K, s);
        glob_k_off = PIPE_STAGES * TD_SMEM_K;
        first_tile = false;
    }
    for (; k_off < K; k_off += TD_SMEM_K) {
        int stage = (glob_k_off / TD_SMEM_K) % PIPE_STAGES;
        // mbar_wait + tma_load_stage; glob_k_off keeps incrementing
    }
}

Eliminates the per-tile pipe-fill stall. The TMA producer mbarriers must be reusable across tiles (not re-init'd at each tile boundary), and the consumer's mbarrier phase must be derived from glob_k_off / TD_SMEM_K so it stays in sync after a tile transition.

Introduced in Group GEMM V11 (where it regressed slightly on small problem shapes), stabilized in V15 (where the per-tile overhead it removed dominated the new serial dependency it introduced).

Tile-descriptor Producer Warp (Group GEMM V17+)

A specialization of the Warp Specialization pattern. A dedicated warp resolves (group, m_off, n_off, sfa_addr, sfb_addr, M) for each upcoming work tile and publishes the values via SMEM slots, gated by an mbar_addr_tile_ready mbarrier. The TMA, MMA, and epilogue warps consume the published metadata instead of re-computing it.

// Tile-descriptor warp
for (int t = 0; t < tiles_per_cta; t++) {
    int tile_idx = blockIdx.x + t * gridDim.x;
    // Binary search groups[] for which group this tile belongs to
    int group = ...; int m_off = ...; int n_off = ...;
    // Publish to SMEM slots
    group_smem[t % SLOTS] = group; m_off_smem[t % SLOTS] = m_off; ...
    // Signal readiness
    mbar_arrive(mbar_tile_ready + (t % SLOTS) * 8, ...);
}

Removes redundant binary-search-for-group work from every other warp, and the resolution happens in parallel with the previous tile's MMA. Distinct from CLC (which dynamically rebalances work across CTAs); the tile-descriptor warp only handles per-CTA tile sequencing, with the tile-to-CTA mapping fixed at launch by either static scheduling or CLC.

__grid_constant__ Kernel Arguments

A CUDA C++ qualifier on a kernel parameter that places the argument in a constant-memory bank broadcast to all CTAs in the grid:

__global__ void nvfp4_group_gemm_kernel(
    const __grid_constant__ GroupDescs groups,
    const __grid_constant__ CUtensorMap tmap_a_temp,
    ...
);

Properties:

- Free to read on the device side (constant memory cache hit) - Free to ship from host (folded into the existing kernel launch parameter path; no separate cudaMemcpy) - Capped at ~32KB total per kernel - Always read-only from device code

Used in Group GEMM V14 to pass the per-group descriptor table (GroupDescs struct of up to MAX_G slots) as a kernel argument rather than a cudaMallocHost + cudaMemcpyAsync pair. By V17 the struct also embeds the pre-encoded per-group CUtensorMaps, eliminating the in-kernel tensormap.replace machinery entirely. The 32KB cap is well above the MAX_G = 8 slot requirement (8 * (3 * 128B + small data) ≈ 3.5 KB).

Named Barriers (bar.sync N, M)

bar.sync with explicit barrier-resource ID N (0..15) and thread count M:

bar.sync     a {, b};

  a -- barrier resource (0..15)
  b -- number of threads participating

CUDA's __syncthreads() compiles to bar.sync 0 (with all threads in the CTA participating, so the thread-count operand is implicit). The two-operand form lets a *subset* of warps synchronize among themselves without involving the rest of the CTA. The kernels in Group GEMM V12+ use this to let the epilogue warps ping-pong among themselves on barrier 2, while the TMA + MMA warps independently ping-pong on barrier 3:

// inside the epilogue warps
asm volatile("bar.sync 2, %0;" :: "r"(WARP_SIZE * (NUM_WARPS - 2)));

// inside the TMA / MMA warps
asm volatile("bar.sync 3, %0;" :: "r"(WARP_SIZE * 2));

The threads participating in barrier N must agree on the thread-count value — if they don't, the barrier deadlocks. Up to 16 named barriers are available per CTA, which is more than enough for the warp-specialization patterns used in this blog.

Cluster-scope Mbarrier Arrivals

The default mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 helper used in the basic GEMM kernel can only arrive on a barrier local to this CTA. When a producer CTA needs to signal a barrier that lives on a peer CTA (or that is read by multiple CTAs via multicast), the helper is upgraded to cluster scope:

mbarrier.arrive.expect_tx.release.cta.shared::cluster.b64 _, [%0], %1;

The .shared::cluster qualifier reinterprets the mbarrier address as a DSMEM address — an SMEM address that may live on a peer CTA in the same cluster. The hardware routes the arrival through the cluster interconnect.

Used in Group GEMM V17 alongside TMA Multicast and cluster-scope tcgen05.commit.multicast::cluster.b64 so a single MMA completion arrives on the mbarriers of every CTA in the cluster. Without the cluster-scope arrival, the peer CTA's mbarrier arrival count would under-count and mbar_wait would never return.

Reach me at naregmegan@gmail.com