Shivangi Srivastava, Senior Director at Salesforce/Informatica, explains how Cloud Data Integration (CDI) evolved from a single-node engine into a distributed Spark-on-Kubernetes platform serving 5,500 enterprise customers who run 250,000 pipelines daily. Key engineering decisions include extending open-source Spark into 'Spark++' for enterprise features such as lineage tracking, preserving backward compatibility with existing graphical pipeline abstractions, and implementing a FinOps automation layer with three components (Cluster Lifecycle Manager, Cluster Tuner, Job Tuner) that reduces infrastructure costs by a factor of about 1.65. The architecture separates the control plane from the data plane to maintain 99.9% availability during compute spikes.

5m read time · From engineering.salesforce.com