Shipping multi-modal AI agents in production requires treating vision, audio, and text as distinct systems with different cost curves, latency budgets, and failure modes. Key lessons include aggressive image preprocessing to control tokenization costs, streaming everything in voice pipelines to meet sub-second latency expectations, handling real-world audio quality degradation, and building pipelines of single-modality steps rather than monolithic multi-modal calls. Evaluation must be built around observed failure modes, not broad coverage, and cost dashboards should track which modality dominates the bill. The core architectural recommendation is decomposition: most tasks are separable into typed single-modality steps that are testable, debuggable, and independently optimizable.
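To make the decomposition recommendation concrete, here is a minimal sketch of what typed single-modality steps might look like for a hypothetical receipt-processing task; the type and function names (OcrResult, ExtractedFields, ocr_step, extract_step) are illustrative assumptions, not APIs from this article.

```python
from dataclasses import dataclass

# Hypothetical intermediate types: each step consumes and produces one
# well-defined, single-modality artifact.

@dataclass
class OcrResult:
    text: str
    confidence: float

@dataclass
class ExtractedFields:
    vendor: str
    total_cents: int

def ocr_step(image_bytes: bytes) -> OcrResult:
    """Vision-only step: preprocess the image and call a vision model / OCR service."""
    raise NotImplementedError  # stub: plug in your vision backend here

def extract_step(ocr: OcrResult) -> ExtractedFields:
    """Text-only step: feed the OCR text to a text model with a structured prompt."""
    raise NotImplementedError  # stub: plug in your text backend here

def pipeline(image_bytes: bytes) -> ExtractedFields:
    # Each step is independently testable, swappable, and optimizable; failures
    # are attributable to one modality instead of one opaque multi-modal call.
    ocr = ocr_step(image_bytes)
    if ocr.confidence < 0.5:
        raise ValueError("low-confidence OCR; route to human review")
    return extract_step(ocr)
```

Because the boundaries are typed, each step can be unit-tested with fixture inputs and swapped for a cheaper or faster model without touching the rest of the pipeline.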
Table of contents
What Counts As Multi-Modal And Why It Matters
Vision: The Tokenization Trap
Vision: The Failure Modes Nobody Warns You About
Audio: Latency Is The First Bill You Pay
Audio: The Quality Tax On Real Recordings
Evaluation: Multi-Modal Tasks Need Different Evals
Cost: Multi-Modal Bills Bend Differently
The Glue: Pipelines, Not Single Calls
What This Looks Like When It Works
Where This Is Going