LLM_API_COMPLETE: Optimized LLM inference at scale in Snowflake

A walkthrough of building LLM_API_COMPLETE, a vectorized Python UDF for Snowflake that enables prompt caching and parallel async execution when calling LLMs via the Cortex REST API from SQL. The UDF batches up to 20 rows per invocation, fires concurrent HTTP requests using aiohttp, and automatically applies cache_control headers for Claude models to reduce redundant system prompt token processing by up to 90%. Key implementation details include network egress/ingress rule setup, Programmatic Access Token configuration, model-aware prompt construction, and per-row token usage visibility. A real-world PII de-identification use case demonstrates the pattern, showing how a 1,100-token system prompt repeated across 10,000 rows can be cached so only the first batch pays the write cost while all subsequent batches read at 0.1x the standard token rate.

#python

#snowflake

Mar 13•11m read time•From medium.com

Table of contents

How I built a vectorized UDF that unlocks Prompt Caching and Parallel Execution to Cortex LLM calls from SQL.Introducing LLM_API_COMPLETE Prompt Caching How It Works Under the Hood Prompt Caching in Practice Get Anant Damle’s stories in your inbox Capabilities and Limitations The Setup: Network Rules and Security Real-World Usage: PII De-identification When to Use LLM_API_COMPLETE Getting Started

Comment

Bookmark

Copy

Sort: