Rewrite the attention kernel to be persistent. This gives better performance at low contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the so...

Hacker News

A pull request rewrites Triton's attention kernel as a persistent kernel, improving performance at low context lengths. The implementation reveals an interesting quirk: fp8 kernels run ~100 TFLOPS faster when the kernel name contains 'cutlass', due to hardcoded optimizations in NVIDIA's ptxas assembler. While fp8 and low-context scenarios see significant gains, fp16 performance at large context lengths decreased because of a ptxas instruction scheduling issue in the softmax partition.
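
For readers unfamiliar with the term, a persistent kernel generally launches a fixed pool of programs once and has each one loop over many work tiles, rather than launching one program per tile. The sketch below illustrates only that scheduling idea in plain Python. It is not the pull request's code, and the function name `persistent_schedule` and its parameters are hypothetical names chosen for illustration.

```python
def persistent_schedule(num_tiles: int, num_workers: int) -> list[list[int]]:
    """Return, for each worker, the tile indices it would process.

    Hypothetical illustration of a strided persistent loop; a real GPU
    kernel would compute attention for each tile instead of recording it.
    """
    schedule = []
    for worker_id in range(num_workers):
        # Each worker starts at its own id and strides by the pool size,
        # so the pool covers every tile exactly once.
        tiles = list(range(worker_id, num_tiles, num_workers))
        schedule.append(tiles)
    return schedule


if __name__ == "__main__":
    for worker_id, tiles in enumerate(persistent_schedule(10, 4)):
        print(f"worker {worker_id}: tiles {tiles}")
```

In this sketch the number of workers stays fixed no matter how many tiles there are, which is the property that distinguishes a persistent launch from a one-program-per-tile launch.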

[Gluon][Tutorial] Persistent attention by Mogball · Pull Request #7298 · triton-lang/triton