A production delta-index pipeline for search and ads retrieval was migrated from scheduled batch jobs to micro-batch Spark Structured Streaming. The key insight was that scheduling and orchestration delays, not processing cost, caused freshness lag. Record-level streaming was tried and abandoned because it clashed semantically with the batch-oriented indexing logic. Instead, the team adopted a time-driven micro-batch model with 30-second triggers, using partition-based watermarks in place of fragile S3 completion markers. The pipeline always advances to the latest visible partition rather than replaying intermediate ones, relying on overlapping sliding windows for correctness. Planned 24-hour restarts and a watchdog controller address memory pressure and keep operations predictable. The result: end-to-end freshness lag was roughly halved overall, and worst-case delay dropped from ~10 minutes to ~30 seconds.
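
For a sense of the shape of such a job, below is a minimal sketch of a time-driven micro-batch query with a 30-second processing-time trigger over an object-store file source. The paths, schema fields, and the `latestFirst` option are illustrative assumptions; the article's actual partition-watermark and jump-to-latest logic is custom code that is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

object DeltaIndexMicroBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-index-microbatch")
      .getOrCreate()

    // Illustrative schema; the article does not publish the real record layout.
    val updateSchema = new StructType()
      .add("doc_id", StringType)
      .add("payload", StringType)
      .add("updated_at", TimestampType)

    // File source over object storage. `latestFirst` prioritizes the newest
    // files, loosely mirroring the article's "advance to the latest visible
    // partition" pattern (the real partition-watermark logic is custom).
    val updates = spark.readStream
      .schema(updateSchema)
      .option("latestFirst", "true")
      .parquet("s3://example-bucket/delta-updates/")

    val query = updates.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/delta-index/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/delta-index/")
      // Time-driven micro-batches: fire every 30 seconds regardless of volume.
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

In production, a query like this would be paired with the article's watchdog controller and planned 24-hour restarts rather than left to run indefinitely.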

19 min read · From infoq.com
Table of contents

- Introduction
- System Scope and Use Case
- Background: Full Index and Delta Index Pipelines
- Why Streaming Was Controversial Internally
- False Start: Beginning with Record-Level Streaming
- Converging on Micro-Batch Streaming
- Source and Sink: Object Storage Was Not Optional
- False Start: Success Files and Completion Markers
- Pattern: Deterministic Progress with Rate-Based Triggers
- Pattern: Handling Lag by Choosing Freshness
- Pattern: Restarting by Jumping to the Latest
- Continuous Execution and Memory Pressure
- Pattern: Planned Restarts as an Operational Tool
- Pattern: Watchdog-Managed Streaming Jobs
- Impact on End-to-End Latency
- Results in Production
- Conclusion
- About the Author
