STCLab migrated from a costly proprietary APM solution to an open-source observability stack built on OpenTelemetry and the LGTM stack (Loki, Grafana, Tempo, Mimir), achieving a 72% cost reduction and 100% trace coverage across all environments. The implementation uses a centralized backend fed by distributed OTel Collectors, with multi-tenancy via the X-Scope-OrgID header and per-tenant rate limiting. Key challenges included metric explosion (solved with the Target Allocator's per-node strategy), version alignment issues between the Operator, Collector, and Target Allocator components, and OOM problems on small nodes, where collectors needed 4 GB or more of memory.
Table of contents
The Challenge
Observability Architecture Overview
Key Architectural Decisions
Key Challenges
Conclusion
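
To make the multi-tenancy mechanism from the summary concrete, here is a minimal Python sketch of how a client might tag a request with the X-Scope-OrgID header that Loki, Tempo, and Mimir use to separate tenants. The gateway URL, the tenant names, and the helper functions are hypothetical and are not taken from STCLab's setup; in their architecture the OTel Collectors would set this header, not application code. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical gateway address, for illustration only.
LOKI_PUSH_URL = "http://loki-gateway.example.internal/loki/api/v1/push"


def tenant_headers(tenant_id: str) -> dict[str, str]:
    """Return HTTP headers that scope a request to a single tenant."""
    if not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")
    return {
        "Content-Type": "application/json",
        # The backend reads this header to decide which tenant owns the data.
        "X-Scope-OrgID": tenant_id,
    }


def build_push_request(tenant_id: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the tenant header."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        LOKI_PUSH_URL,
        data=body,
        headers=tenant_headers(tenant_id),
        method="POST",
    )


if __name__ == "__main__":
    # Assumed tenant name "dev"; an empty stream list keeps the payload trivial.
    request = build_push_request("dev", {"streams": []})
    print(request.get_method(), request.full_url)
```

Keeping the tenant ID in one helper means each environment or team maps to exactly one header value, which is also the key a backend would use to apply per-tenant rate limits.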