A detailed, multi-part walkthrough of designing and optimizing machine-learning kernels for the NVIDIA Blackwell architecture around the new NVFP4 4-bit data type. The work grew out of an NVIDIA-hosted competition that ranked the best-performing kernels. It covers four kernel families (Batched GEMV, GEMM, Dual GEMM, and Group GEMM), from naive baselines through aggressive tcgen05 / TMA optimization, plus a standalone glossary of the hardware concepts involved and a distilled set of PTX / hardware lessons the official docs leave out.
Read the series →
Notes on optimizing a tree-traversal kernel for a simulated VLIW SIMD machine and, more broadly, on programming and optimizing alongside LLMs: how to handle model over-certainty and sycophancy, when to separate concept from implementation, and how appealing to a third-party authority can knock a model out of a local minimum.
Read the writeup →
SME2/SSVE (Scalable Matrix Extension / Streaming Scalable Vector Extension) are newer hardware features on ARM processor architectures, so library support for hardware-specific, high-performance operations is still limited (particularly for BLAS operations, which this hardware is designed to target). This project attempts to implement BLAS routines specifically for SME2/SSVE using the intrinsics provided by the Clang compiler.
Verilog generator for HFT algorithms
High-frequency trading and quantitative trading rely largely on the speed of decision making and execution. FPGAs can be used to create task-specific hardware and to replicate that hardware thousands of times over, allowing for highly efficient, parallelized computation. In this project I wrote a Verilog generator. Given a JSON file that describes mathematical rules for executing trades, the generator creates both the hardware description and a test bench that can be used to run the hardware in simulation with Icarus Verilog.
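To make the idea concrete, here is a minimal sketch of what such a generator could look like. It is not the project's actual code: the JSON schema (`name`, `width`, `op`, `threshold`), the function names, and the single threshold-comparison rule are all assumptions chosen for illustration.

```python
import json

# Hypothetical rule format (an assumption, not the project's real schema):
# raise `trigger` when `price` compares true against a fixed threshold.
RULE_JSON = '{"name": "threshold_buy", "width": 16, "op": ">", "threshold": 100}'

ALLOWED_OPS = (">", "<", ">=", "<=", "==")


def generate_module(rule):
    """Emit a combinational Verilog module comparing `price` to a constant."""
    name, width = rule["name"], rule["width"]
    op, threshold = rule["op"], rule["threshold"]
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return "\n".join([
        f"module {name} (",
        f"    input  wire [{width - 1}:0] price,",
        "    output wire trigger",
        ");",
        f"    assign trigger = (price {op} {width}'d{threshold});",
        "endmodule",
    ])


def generate_testbench(rule, samples):
    """Emit a test bench that drives each sample price and prints the result."""
    name, width = rule["name"], rule["width"]
    lines = [
        "`timescale 1ns/1ps",
        f"module {name}_tb;",
        f"    reg  [{width - 1}:0] price;",
        "    wire trigger;",
        f"    {name} dut (.price(price), .trigger(trigger));",
        "    initial begin",
    ]
    for value in samples:
        lines.append(f"        price = {width}'d{value}; #1;")
        lines.append('        $display("price=%0d trigger=%b", price, trigger);')
    lines += ["        $finish;", "    end", "endmodule"]
    return "\n".join(lines)


if __name__ == "__main__":
    rule = json.loads(RULE_JSON)
    print(generate_module(rule))
    print(generate_testbench(rule, [50, 150]))
```

If the two outputs are saved as `threshold_buy.v` and `threshold_buy_tb.v` (file names chosen here for illustration), they can be compiled with `iverilog -o sim threshold_buy.v threshold_buy_tb.v` and run with `vvp sim`.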
Open Source Contributions
Updates Pending...
Some Machine Learning Projects
Studying emergent behavior between competing RL Agents:
https://github.com/callaunchpad/emergence
Experimenting with various methods for a Visual Q&A system (from k-nearest neighbors to DNNs):
https://github.com/callaunchpad/Musecage
I wrote a very simple neural network library in Java:
https://github.com/NaregAmirianMegan/Artificial-Neural-Network
Ancestry mapping using Markov Chains:
https://github.com/callaunchpad/LocalAncestry
Atari Breakout Agent:
https://github.com/NaregAmirianMegan/atari-breakout-ML