A practitioner with 18 years of distributed systems experience shares hard-won lessons from deploying multi-agent AI systems in production. The talk explains why scaling from one agent to several makes coordination complexity grow exponentially, illustrated by a real race condition bug in financial services caused by stale cache reads. Patterns covered include choreography vs. orchestration (with a decision framework), immutable state snapshots with versioning to eliminate race conditions, data contracts between agents, circuit breakers for failure isolation, and saga/compensation patterns for rollback. The talk also presents a reference production architecture built on LangGraph, Databricks, Delta Lake, Unity Catalog, and MLflow.

26m watch time
