Grab migrated Catwalk, its ML model serving platform, to NVIDIA Triton Inference Server, achieving a 50% reduction in tail latency and 20% cost savings. The team built a Triton Manager component to ensure backward compatibility and a zero-downtime migration, and moved over 50% of online deployments within 10 days. Triton's multi-framework support, unified API, and advanced features such as dynamic batching addressed the performance issues that came from maintaining multiple legacy inference engines across ONNX, PyTorch, and TensorFlow.
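
For readers unfamiliar with dynamic batching, the idea can be illustrated with a small simulation. This is only a minimal sketch of the general technique, not Triton's scheduler and not Grab's code: the function `form_batches` and its parameters `max_batch_size` and `max_queue_delay_ms` are hypothetical names chosen for illustration. It groups incoming requests into one batch until either the batch is full or the next request arrives too long after the batch was opened.

```python
from typing import List, Tuple


def form_batches(
    arrivals: List[Tuple[float, str]],
    max_batch_size: int,
    max_queue_delay_ms: float,
) -> List[List[str]]:
    """Group requests into batches (illustrative model only).

    arrivals: (arrival_time_ms, request_id) pairs sorted by time.
    A batch is closed when it is full, or when the next request
    arrives more than max_queue_delay_ms after the batch was opened.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0

    for arrival_ms, request_id in arrivals:
        batch_full = len(current) >= max_batch_size
        waited_too_long = arrival_ms - window_start > max_queue_delay_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)  # flush the open batch
            current = []
        if not current:
            window_start = arrival_ms  # this request opens a new batch
        current.append(request_id)

    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


# Hypothetical traffic: three requests close together, then two later ones.
example = [(0.0, "a"), (1.0, "b"), (2.0, "c"), (10.0, "d"), (10.5, "e")]
result = form_batches(example, max_batch_size=2, max_queue_delay_ms=5.0)
print(result)
```

A larger delay window generally yields bigger batches and better accelerator utilisation at the cost of added queueing time per request, which is the trade-off a serving team tunes when chasing tail latency.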

8m read time · From engineering.grab.com
Table of contents
- Introduction
- Evaluation and implementation
- Exploratory benchmark results
- Adopting Triton at scale
- Rollout result
- Triton’s impacts on critical models
- Early cost impact of the migration
- Takeaways
- Join us
