HuggingFace built an agent skill that teaches AI coding agents (Claude, Codex) to write production-ready CUDA kernels with PyTorch bindings. The skill packages domain expertise about GPU architectures, memory patterns, and library integration into ~550 tokens of structured guidance. Testing on LTX-Video (diffusers) and Qwen3-8B (transformers) showed the agent-generated RMSNorm kernels achieved 1.88-1.94x speedup over PyTorch baselines, with 6% end-to-end improvement in video generation. The skill integrates with HuggingFace's Kernel Hub for distribution, enabling developers to generate, benchmark, and publish optimized kernels without deep CUDA expertise.
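For context, RMSNorm (the operation benchmarked above) scales each element of a vector by the reciprocal root-mean-square of the vector, then applies a learned per-element weight. The generated CUDA kernels fuse this into one GPU pass; the sketch below is only a pure-Python reference for the math, not the kernel itself (function name and epsilon default are illustrative):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: y_i = weight_i * x_i / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

# With unit weights, the output has (approximately) unit mean square.
y = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

A fused CUDA implementation avoids materializing the intermediate mean-square tensor that a naive PyTorch composition would, which is where the reported speedups come from.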
Table of contents
- Why a skill for kernels?
- Installing the skill
- What is in the skill
- Benchmarking the kernels: Diffusers (LTX-Video on H100)
- Benchmarking the kernels: Transformers (Qwen3-8B on H100)
- Publishing your kernel to the Hub
- Conclusion
- Resources