In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use…


NVIDIA Developer

NVIDIA achieved single-digit microsecond latency for LSTM inference on the GH200 Grace Hopper Superchip, matching or beating specialized FPGA hardware in the STAC-ML Markets (Inference) Tacana benchmark. Key results include p99 latencies of 4.61–4.70 µs for LSTM_A and 6.88–7.10 µs for LSTM_B. The post details the custom CUDA kernel techniques behind these results: persistent kernels that keep weights in shared memory and registers, CUDA green contexts for multi-instance serving, GDRCopy for low-overhead CPU-GPU signaling, and a two-phase precompute/inference split. An open-source reference implementation (dl-lowlat-infer) targeting the RTX PRO 6000 Blackwell GPU is provided with build and run instructions; it achieves 2.5–14.2 µs p99 latency across small, medium, and large LSTM models.
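The figures above are p99 latencies: the value that 99% of individual inference calls complete within. As a minimal sketch of how such a tail percentile can be computed from raw timing samples, the Python below uses the nearest-rank method. It is not the STAC-ML harness or the dl-lowlat-infer measurement code; the function names and the `dummy_inference` stand-in are hypothetical, and host-side Python timing is far too coarse to resolve single-digit microseconds on real hardware.

```python
import math
import time


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest sample that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def measure_latencies_us(fn, iterations=10_000, warmup=1_000):
    """Call fn repeatedly and return per-call wall-clock latencies in microseconds."""
    # Warm-up calls are discarded so one-time startup costs do not skew the tail.
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        latencies.append((time.perf_counter_ns() - start) / 1_000)
    return latencies


def dummy_inference():
    """Hypothetical stand-in for a single inference call (assumption, not a real model)."""
    return sum(i * i for i in range(100))


if __name__ == "__main__":
    lat = measure_latencies_us(dummy_inference)
    print(f"p50 = {percentile_nearest_rank(lat, 50):.2f} us")
    print(f"p99 = {percentile_nearest_rank(lat, 99):.2f} us")
```

Reporting p99 rather than the mean matters in this setting because a trading system is judged by its slowest responses, and a handful of outliers can hide behind a low average.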

Achieving Single-Digit Microsecond Latency Inference for Capital Markets

STAC-ML benchmarking in financial services

How to build and run the low-latency LSTM inference reference implementation