A walkthrough of building LLM_API_COMPLETE, a vectorized Python UDF for Snowflake that enables prompt caching and parallel async execution when calling LLMs via the Cortex REST API from SQL. The UDF batches up to 20 rows per invocation, fires concurrent HTTP requests using aiohttp, and automatically applies cache_control headers for Claude models to reduce redundant system prompt token processing by up to 90%. Key implementation details include network egress/ingress rule setup, Programmatic Access Token configuration, model-aware prompt construction, and per-row token usage visibility. A real-world PII de-identification use case demonstrates the pattern, showing how a 1,100-token system prompt repeated across 10,000 rows can be cached so only the first batch pays the write cost while all subsequent batches read at 0.1x the standard token rate.
Table of contents
How I built a vectorized UDF that unlocks Prompt Caching and Parallel Execution to Cortex LLM calls from SQL.Introducing LLM_API_COMPLETEPrompt CachingHow It Works Under the HoodPrompt Caching in PracticeGet Anant Damle’s stories in your inboxCapabilities and LimitationsThe Setup: Network Rules and SecurityReal-World Usage: PII De-identificationWhen to Use LLM_API_COMPLETEGetting StartedSort: