A complete walkthrough of building a high-throughput entity resolution pipeline using Snowflake's newly released CORTEX_SEARCH_BATCH function. The tutorial covers creating a GICS-aligned product taxonomy (60 nodes), classifying 138 raw alt data signals from 13 vendors in ~5 seconds using the Arctic embedding model, building a Streamlit dashboard, and setting up an incremental stream + task pipeline for continuous processing. Key techniques include semantic matching via LATERAL joins, confidence thresholds with METADATA$RANK, blocking rules for large-scale search space reduction, and a cascading multi-pass matching strategy. The same pattern applies to CRM deduplication, resume matching, and supplier catalog reconciliation.
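As a rough illustration of how blocking, a confidence threshold, and a cascading multi-pass strategy fit together, here is a minimal pure-Python sketch. It does not call Snowflake or `CORTEX_SEARCH_BATCH`: string similarity from `difflib` stands in for embedding similarity, and the taxonomy rows, signal rows, the `classify` helper, and the 0.8 threshold are all hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical golden-source taxonomy: (node_id, sector, name).
TAXONOMY = [
    ("T1", "tech", "Cloud Infrastructure Services"),
    ("T2", "tech", "Semiconductor Equipment"),
    ("R1", "retail", "Online Apparel Retail"),
]

# Hypothetical raw vendor signals: (signal_id, sector, description).
SIGNALS = [
    ("S1", "tech", "cloud infrastructure services"),
    ("S2", "tech", "Semiconductor Equipmnt"),
    ("S3", "retail", "Commodity freight rates"),
]


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(signal, taxonomy, threshold=0.8):
    """Return (node_id, pass_name), or (None, 'unmatched')."""
    _, sector, text = signal
    # Blocking: compare only against taxonomy nodes in the same sector.
    block = [node for node in taxonomy if node[1] == sector]
    # Pass 1: cheap exact match on the case-normalized name.
    for node_id, _, name in block:
        if name.lower() == text.lower():
            return node_id, "exact"
    # Pass 2: fuzzy match, accepted only at or above the threshold.
    best = max(block, key=lambda n: similarity(text, n[2]), default=None)
    if best is not None and similarity(text, best[2]) >= threshold:
        return best[0], "fuzzy"
    return None, "unmatched"


results = {s[0]: classify(s, TAXONOMY) for s in SIGNALS}
for signal_id, outcome in results.items():
    print(signal_id, outcome)
```

Running the cheap exact pass first means the more expensive similarity pass only sees signals that are still unresolved, and blocking keeps each comparison set small. Signals that fall below the threshold stay unmatched rather than being forced onto the nearest node.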

17m read time · From medium.com
Table of contents
Snowflake Setup — Golden Source and Raw Products
Batch Cortex Search — How it Works
