Grab's data streaming team (Coban) built a Shadow Testing stage into their Apache Flink deployment pipeline to eliminate production downtime caused by deployment failures. The approach deploys a new version of a Flink application (Shadow) in production, in parallel with the current version (Main). The two are kept apart by isolated Kubernetes namespaces, distinct consumer group IDs, separate Kafka brokers, and dedicated S3 sinks, all controlled via an `isShadow` environment variable. The Shadow app runs for an observation window that defaults to 1 hour; if it stays stable, the Main deployment proceeds. This reduces Change Failure Rate and increases deployment confidence by catching production-specific issues (such as checkpoint incompatibility or traffic volume problems) before they affect live traffic.
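
As a rough illustration of the `isShadow` switch, the sketch below derives isolated resource names from a single environment flag. Only the variable name `isShadow` comes from the summary; the `ConnectorConfig` type, the `resolve_config` helper, the `-shadow` suffix, and the namespace and bucket naming are hypothetical and are not taken from Grab's connector implementation.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorConfig:
    """Resource names that must differ between Main and Shadow (hypothetical)."""
    namespace: str
    consumer_group_id: str
    sink_path: str


def is_shadow(env=None) -> bool:
    # Read the flag from the given mapping, or from the process environment.
    env = os.environ if env is None else env
    return env.get("isShadow", "false").strip().lower() == "true"


def resolve_config(app_name: str, env=None) -> ConnectorConfig:
    # Assumed convention: Shadow resources reuse the Main names plus a suffix,
    # so the two deployments never share a namespace, consumer group, or sink.
    suffix = "-shadow" if is_shadow(env) else ""
    return ConnectorConfig(
        namespace=f"flink{suffix}",
        consumer_group_id=f"{app_name}{suffix}",
        sink_path=f"s3://{app_name}{suffix}-bucket/output/",
    )
```

Giving the Shadow app its own consumer group ID means it commits its own offsets and does not disturb those of Main, and a dedicated sink path keeps Shadow output away from data that downstream consumers read.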

7m read time · From engineering.grab.com
Table of contents
Introduction
Architecture overview
Deployment flow
Connector implementation
Conclusion
What’s next
Join us
