A pull request rewrites Triton's attention kernel to use persistent execution, improving performance at low context lengths. The implementation reveals an interesting quirk: fp8 kernels run about 100 TFLOPS faster when the kernel name contains 'cutlass', because of optimizations hardcoded in NVIDIA's ptxas. fp8 and low-context workloads see significant gains, but fp16 performance at large context lengths regressed because of instruction scheduling issues in the softmax partition.
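
For readers unfamiliar with the term, persistent execution generally means launching a fixed pool of programs (often about one per SM) that each loop over several work tiles, instead of launching one program per tile. The sketch below illustrates that tile assignment in plain Python. It is not the PR's code: `persistent_schedule`, `num_tiles`, and `num_programs` are hypothetical names chosen for illustration, and the strided assignment is only one common way to distribute tiles.

```python
def persistent_schedule(num_tiles: int, num_programs: int) -> list[list[int]]:
    """Assign work tiles to a fixed pool of programs using a strided loop.

    Hypothetical helper for illustration only; a real kernel would run
    the inner loop on the GPU, one loop per resident program.
    """
    schedule = []
    for pid in range(num_programs):
        # Each program starts at its own id and strides by the pool size,
        # i.e. the host-side analogue of
        # `for tile in range(pid, num_tiles, num_programs)` inside a kernel.
        schedule.append(list(range(pid, num_tiles, num_programs)))
    return schedule


# 10 tiles shared by 4 long-lived programs instead of 10 short-lived ones.
schedule = persistent_schedule(num_tiles=10, num_programs=4)
for pid, tiles in enumerate(schedule):
    print(f"program {pid}: tiles {tiles}")
```

Because each program stays resident and handles several tiles, per-program setup cost is paid once per program rather than once per tile, which is one plausible reason the approach helps when there is little work per tile.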

5m read time. From github.com