
The authors describe their efforts to optimize AI compute on NVIDIA's H100 GPU, highlighting the importance of Hopper-specific instructions such as wgmma.mma_async (asynchronous warpgroup matrix multiply-accumulate) and TMA (the Tensor Memory Accelerator) for achieving high utilization. They also introduce ThunderKittens, an embedded DSL within CUDA that simplifies writing kernels for AI workloads by abstracting away low-level details behind a mini-PyTorch-like interface built around small tiles of data.