In this conference talk, Nike's Observability Platform Engineering Director covers practical ML applications in observability systems at massive scale. Key topics include: the challenge of data volume and cardinality in telemetry (metrics, logs, traces); using ML for intelligent anomaly detection; building event-to-incident funnels to reduce alert noise (Nike processes ~2M alerts/day with a 15:1 noise-to-incident ratio); predictive pre-scaling for traffic spikes (shoe drops, Black Friday, Singles Day); using LLMs for natural language querying of observability data instead of tool-specific DSLs like PromQL or SPL; AI-generated dynamic dashboards; automated postmortem generation; and AI-assisted runbook creation. The speaker emphasizes that organizational maturity and cultural readiness are prerequisites for AI-driven automation, and encourages experimenting with open-source ML tools right away.

41m watch time
