Prompt caching in the OpenAI API allows reusing repeated parts of LLM inputs (such as system prompts) to reduce latency by up to 80% and costs by up to 90%. For caching to activate, the repeated prefix must appear at the start of the prompt and be at least 1,024 tokens long. A hands-on Python example demonstrates making two requests with the same long prefix, showing measurable latency improvements on the second call. Key pitfalls include prefixes that fall below the token threshold, dynamic content placed before the prefix (which breaks caching), and the misconception that decoding is cached when only the pre-fill phase is. The `prompt_cache_key` parameter exists in the API spec but is not yet exposed in the Python SDK. Prompt caching is less useful for highly dynamic prompts, one-off requests, or real-time personalized systems.
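The sketch below illustrates the two-request pattern described above, using the official `openai` Python SDK. The model name, the padded system prompt, and the helper function are illustrative assumptions, not the article's exact code; the cached-token field is read defensively in case the SDK version in use does not expose it.

```python
# Minimal sketch: two chat completions sharing a long static prefix.
# Assumes the official `openai` Python SDK and OPENAI_API_KEY in the environment.
import time
from openai import OpenAI

client = OpenAI()

# A static system prompt padded well past the ~1,024-token caching threshold.
long_prefix = (
    "You are a meticulous support assistant. "
    + "Follow the documented policies exactly. " * 300
)

def timed_request(question: str) -> None:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any caching-enabled model works
        messages=[
            {"role": "system", "content": long_prefix},  # identical prefix on every call
            {"role": "user", "content": question},       # dynamic part goes last
        ],
    )
    elapsed = time.perf_counter() - start
    # usage.prompt_tokens_details.cached_tokens reports how much of the prefix
    # was served from cache (read defensively in case the field is absent).
    details = getattr(response.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) if details else None
    print(f"latency: {elapsed:.2f}s, cached prompt tokens: {cached}")

timed_request("What is your refund policy?")   # cold call: cache is written
timed_request("How do I reset my password?")   # warm call: prefix should hit the cache
```

On the second call, the latency should drop and the cached-token count should roughly match the length of the shared prefix, since only the user message differs.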
Table of contents
- A brief reminder on Prompt Caching
- What about the OpenAI API?
- Prompt Caching in Practice
- So, what can go wrong?
- On my mind