A production delta-index pipeline for search and ads retrieval was migrated from scheduled batch jobs to micro-batch Spark Structured Streaming. The key insight was that scheduling and orchestration delays, not processing cost, caused freshness lag. Record-level streaming was tried and abandoned because it clashed semantically with the batch-oriented indexing logic. Instead, the team adopted a time-driven micro-batch model with 30-second triggers, using partition-based watermarks in place of fragile S3 completion markers. The pipeline always advances to the latest visible partition rather than replaying intermediate ones, relying on overlapping sliding windows for correctness. Planned 24-hour restarts and a watchdog controller address memory pressure and keep operations predictable. The result: end-to-end freshness lag was roughly halved overall, and worst-case delay dropped from ~10 minutes to ~30 seconds.
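
For a sense of the shape of such a job, below is a minimal sketch of a time-driven micro-batch query with a 30-second processing-time trigger over an object-store file source. The paths, schema fields, and the `latestFirst` option are illustrative assumptions; the article's actual partition-watermark and jump-to-latest logic is custom code that is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

object DeltaIndexMicroBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-index-microbatch")
      .getOrCreate()

    // Illustrative schema; the article does not publish the real record layout.
    val updateSchema = new StructType()
      .add("doc_id", StringType)
      .add("payload", StringType)
      .add("updated_at", TimestampType)

    // File source over object storage. `latestFirst` prioritizes the newest
    // files, loosely mirroring the article's "advance to the latest visible
    // partition" pattern (the real partition-watermark logic is custom).
    val updates = spark.readStream
      .schema(updateSchema)
      .option("latestFirst", "true")
      .parquet("s3://example-bucket/delta-updates/")

    val query = updates.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/delta-index/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/delta-index/")
      // Time-driven micro-batches: fire every 30 seconds regardless of volume.
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

In production, a query like this would be paired with the article's watchdog controller and planned 24-hour restarts rather than left to run indefinitely.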

19 min read · From infoq.com
Table of contents

- Introduction
- System Scope and Use Case
- Background: Full Index and Delta Index Pipelines
- Why Streaming Was Controversial Internally
- False Start: Beginning with Record-Level Streaming
- Converging on Micro-Batch Streaming
- Source and Sink: Object Storage Was Not Optional
- False Start: Success Files and Completion Markers
- Pattern: Deterministic Progress with Rate-Based Triggers
- Pattern: Handling Lag by Choosing Freshness
- Pattern: Restarting by Jumping to the Latest
- Continuous Execution and Memory Pressure
- Pattern: Planned Restarts as an Operational Tool
- Pattern: Watchdog-Managed Streaming Jobs
- Impact on End-to-End Latency
- Results in Production
- Conclusion
- About the Author
