An OpenAI infrastructure engineer details how the company adopted Temporal Cloud to support long-running agentic workflows at massive scale, including the ChatGPT image generation surge that reached 1 billion images per week. The talk covers the full platform journey: starting with a narrow SDK wrapper, identifying developer friction, and then building a 'paved road' consisting of managed worker/workflow scaffolding tools, a Go-based proxy for auth and routing, a Kubernetes operator for config reconciliation, and integrated observability via a self-hosted Temporal UI with Datadog links. Key outcomes include cutting time-to-first-workflow from 1–2 weeks to under a day and 60x growth in one year, to 700+ namespaces and 100+ workers, all managed by a team of four engineers.