Skyscanner's platform engineering team shares how they manage OpenTelemetry Collector deployments across 24 production Kubernetes clusters running 1,000+ microservices. Key architectural decisions include a centralized DNS endpoint with Istio-based routing, two collector patterns (Gateway ReplicaSet and Agent DaemonSet), and generating platform-level HTTP/gRPC metrics from Istio service mesh spans using the span metrics connector — eliminating the need for application-level instrumentation. The Java-heavy environment uses a shared base Docker image with the OTel Java agent pre-configured, with all instrumentations disabled by default and only a curated set enabled. SDK-generated HTTP/RPC metrics are dropped in favor of lower-cardinality Istio-derived metrics. Rollouts follow a progressive promotion strategy across dev, alpha, beta, and production cluster tiers using Argo CD. Practical advice includes starting simple, adding memory limiters from day one, and using filter processors early to handle false-positive error statuses.
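The collector-side practices the summary mentions (a memory limiter from day one, an early filter processor for false-positive error statuses, and span metrics generated from Istio spans) can be sketched as a gateway collector configuration. The component names below are real OpenTelemetry Collector components, but the endpoints, the specific filter condition, and the metric dimensions are illustrative assumptions, not Skyscanner's actual config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Memory limiter first in every pipeline: start refusing data
  # before the collector is OOM-killed.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  # Drop spans whose status was marked error spuriously; the exact
  # condition here (404s flagged as errors) is a hypothetical example.
  filter/false-errors:
    error_mode: ignore
    traces:
      span:
        - attributes["http.status_code"] == 404 and status.code == STATUS_CODE_ERROR
  batch: {}

connectors:
  # Derive RED-style HTTP/gRPC metrics from Istio sidecar spans, so
  # applications need no metric instrumentation of their own.
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/false-errors, batch]
      exporters: [spanmetrics, otlp]
    metrics:
      receivers: [spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Note the span metrics connector appears as an exporter of the traces pipeline and a receiver of the metrics pipeline; that pairing is how connectors bridge pipelines in the collector's service graph.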

12 min read · From opentelemetry.io
Table of contents

- Organizational structure
- OpenTelemetry adoption
- Architecture: centralized routing, distributed collection
- Configuration: start simple, evolve gradually
- Instrumentation strategy
- Deployment and release management
- What works well
- Lessons and pain points
- Advice for others
- What's next
