
Anthropic Prompt Caching: Save 90% on API Costs
The Practical Guide to Implementing Prompt Caching for Claude
TL;DR: Anthropic prompt caching lets you cache long prompt prefixes and pay roughly 10% of the normal input price whenever a request reuses them within the 5-minute cache window. Real-world speedups hit 4x, and repeated queries drop costs by up to 90%. Add cache_control to your API calls, integrate with LangChain if needed, and watch your token spend plummet on RAG, multi-turn conversations, and agent workflows.
Quick Takeaways
- Cache large prefixes: System prompts, context blocks, and retrieval results all become reusable in seconds
- Reads cost 90% less: Cached tokens are billed at roughly one-tenth the normal input price on every hit within the 5-minute TTL
- Speed gains of 4x: Latency drops by up to 75% on cache hits, especially noticeable in agentic loops
- No major refactoring: One parameter addition (cache_control) in your existing API calls gets you started
- Best for repeated patterns: Works hardest on RAG pipelines, multi-turn chats, and batch agent workflows where the same context recycles
- Limits matter: entries expire 5 minutes after their last use (each hit refreshes the TTL), there's no manual invalidation, and prompts below the model's minimum cacheable length (roughly 1,024 tokens on Sonnet) won't cache
- Production-ready today: Available now across Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku with full SDK support
What Is Anthropic Prompt Caching?
Anthropic prompt caching is a feature that stores the processed representation of prompt prefixes (typically your system prompt and static context) on Anthropic's servers for a short window. When you send a new request with the same prefix, the API retrieves the cached version instead of reprocessing it. You pay a one-time write premium when the cache is created, then get roughly a 90% discount on subsequent reads of cached tokens.
Think of it like this: Your system prompt describing a product support agent is 2,000 tokens. Normally, every request reprocesses those 2,000 tokens at full price. With caching enabled, you pay the cache-write price once, then roughly 0.1x price for every request that reuses that prefix within the cache window. If a customer sends 10 messages in a single conversation, you save tokens on messages 2-10.
The mechanism relies on the cache_control parameter in the Anthropic API. You mark specific message blocks as cacheable, and the API handles the rest. Anthropic's official documentation outlines the full technical flow and token limits. The catch: the cache lives for 5 minutes from its last use, and cached tokens must form a contiguous prefix of your prompt.
Real cost impact: Anthropic reported up to 90% cost reduction for repeated prompts in agentic workflows. That’s not a typo. In production RAG systems where the same retrieval context recycles across dozens of queries, you see those gains.
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a product support agent for an e-commerce platform.",
            # Mark the static system prompt as cacheable
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "How do I track my order?"
        }
    ]
)

# Cache metrics are reported on the response's usage object
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
How Prompt Caching Cuts Costs and Boosts Speed
The economics are straightforward. On Claude 3.5 Sonnet, cached input tokens cost $0.30 per million to read, while regular input tokens cost $3.00 per million. That's a 90% discount on reads. Creating the cache costs 25% more than base input, but that premium is paid once. You only get the benefit if the same prefix appears in multiple requests within the cache TTL window.
The latency win matters just as much. Your first request with a new cache prefix takes normal time: the API processes all tokens. Subsequent requests with the same prefix skip that processing and serve from the cache server, cutting latency by up to 75%. Simon Willison’s testing showed 4x speedups in practice across multiple test scenarios with the Python SDK.
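To see the latency effect yourself, here is a minimal timing sketch; the padded system prompt is illustrative and just needs to exceed the model's minimum cacheable length:
import time
from anthropic import Anthropic

client = Anthropic()

# An illustrative static prefix, padded past the minimum cacheable length
BIG_PROMPT = "You are a concise assistant. " + ("Reference material. " * 1500)

def timed_request(question):
    """Send one request against the cached prefix and return elapsed seconds."""
    start = time.monotonic()
    client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=64,
        system=[{"type": "text", "text": BIG_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )
    return time.monotonic() - start

cold = timed_request("What is a cache?")  # first call writes the cache
warm = timed_request("What is a cache?")  # second call should read it
print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")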
For a typical RAG pipeline where you send 100 queries against a 10k-token knowledge base in an hour, here’s the math:
- Without caching: 100 queries × 10k context tokens = 1M tokens at $3.00 per million = $3.00
- With caching: the first query writes the cache at 1.25x ($0.0375), then 99 queries read 990k tokens at $0.30 per million ($0.297), totaling about $0.33
- Savings: roughly $2.67 per batch cycle, an ~89% reduction
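A short sketch that reproduces this arithmetic, assuming Claude 3.5 Sonnet pricing ($3.00 per million input tokens, a 1.25x write premium, and 0.1x reads):
def batch_cost(queries, ctx_tokens, base_per_mtok=3.00):
    """Return (uncached, cached) dollar cost for a batch reusing one cached prefix."""
    base = base_per_mtok / 1_000_000
    uncached = queries * ctx_tokens * base
    # One cache write at 1.25x, then reads at 0.1x for the remaining queries
    cached = ctx_tokens * base * 1.25 + (queries - 1) * ctx_tokens * base * 0.10
    return uncached, cached

uncached, cached = batch_cost(100, 10_000)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")  # uncached $3.00 vs cached $0.33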
The real-world impact compounds in multi-tenant systems, agent loops, and customer support bots that reuse context across conversations. A single implementation can save thousands monthly.
Step-by-Step Setup: Beginner Implementation
Getting started takes five minutes and three steps.
Step 1: Install or Update the SDK. Make sure you have the latest Anthropic Python package.
pip install --upgrade anthropic
Step 2: Structure Your Prompt with cache_control. Take your existing system prompt and wrap it in a structure that includes the cache_control parameter. Here’s a working example:
from anthropic import Anthropic

client = Anthropic(api_key="your-key")

# Keep this static; in production it should also exceed the model's minimum
# cacheable length (roughly 1,024 tokens on Sonnet) to be cached at all
system_prompt = "You are a helpful AI assistant specialized in Python programming. Provide clear, concise answers with code examples when relevant."

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Explain list comprehensions in Python"
        }
    ]
)

print(f"Cache creation tokens: {message.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")
print(f"Response: {message.content[0].text}")
Step 3: Test Cache Hits. Send the same request twice and watch the metrics. The second request should show cache_read_input_tokens > 0 and zero cache_creation_input_tokens.
That's it. The API handles expiration automatically 5 minutes after the last use. No explicit cleanup needed. Your usage object now includes two new fields: cache_creation_input_tokens and cache_read_input_tokens. Monitor these to confirm caching is working.
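A minimal sanity check, reusing the client and system_prompt from Step 2 (note that a prompt this short may fall below the minimum cacheable length, so run the test with your real, longer prompt):
def ask(question):
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=64,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Explain list comprehensions in Python")   # expect cache_creation_input_tokens > 0
second = ask("Explain list comprehensions in Python")  # expect cache_read_input_tokens > 0
assert second.usage.cache_read_input_tokens > 0, "no cache hit -- is the prefix identical?"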
🦉 Did You Know?
Cache prefixes must be contiguous. You can't cache the first 100 tokens, skip 50, and cache the next 100. Each cache_control breakpoint caches everything from the start of the prompt up to that point as one unbroken prefix, so structure your system messages and static context as single text blocks for maximum efficiency.
Intermediate: Advanced Caching Strategies
Once basic caching works, you can layer in more sophisticated patterns.
Multi-Turn Conversations. Cache the system prompt and conversation history together. Each new user message gets appended without triggering a new cache write. This is powerful for customer support bots where the context (account info, order history) stays static across 10+ turns.
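Here's a minimal sketch of that pattern; the reply helper and history structure are illustrative, but the key move is placing the cache_control breakpoint on the newest block so each turn extends the cached prefix rather than rebuilding it:
from anthropic import Anthropic

client = Anthropic()

# Static context (account info, order history) cached once per session
SYSTEM = [{"type": "text",
           "text": "You are a support agent. Customer context: ...",
           "cache_control": {"type": "ephemeral"}}]

def reply(history, user_text):
    # Keep every turn in block form so the serialized prefix stays stable
    history.append({"role": "user",
                    "content": [{"type": "text", "text": user_text}]})
    messages = [dict(m) for m in history]
    # Move the cache breakpoint to the newest block on each turn
    messages[-1] = {"role": "user",
                    "content": [{"type": "text", "text": user_text,
                                 "cache_control": {"type": "ephemeral"}}]}
    resp = client.messages.create(model="claude-3-5-sonnet-20241022",
                                  max_tokens=512, system=SYSTEM, messages=messages)
    history.append({"role": "assistant",
                    "content": [{"type": "text", "text": resp.content[0].text}]})
    return resp.content[0].text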
RAG + Caching. Retrieve your context chunk from a vector database, build a system message containing that context, and cache it. LangChain’s prompt caching integration automates this for LCEL chains. The retrieval step still costs time and compute, but the Claude inference becomes nearly free on repeated queries against the same knowledge base.
from anthropic import Anthropic

client = Anthropic()

# Simulate a large static knowledge base
knowledge_base = """
Product: Widget Pro
- Price: $49.99
- Warranty: 2 years
- Shipping: Free US
- Returns: 30 days
...
[This could be 50k tokens in production]
"""

# Cache the knowledge base once
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": f"Use this knowledge base to answer questions:\n\n{knowledge_base}",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "What's the warranty on Widget Pro?"}]
)
print(f"First request - Cache creation: {response1.usage.cache_creation_input_tokens}")

# Second request reuses the cached knowledge base
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": f"Use this knowledge base to answer questions:\n\n{knowledge_base}",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}]
)
print(f"Second request - Cache read: {response2.usage.cache_read_input_tokens}")
Dynamic Cache Invalidation. The 5-minute TTL works for most cases, but if your context updates frequently, plan for it. Anthropic doesn’t support manual cache flushing, so either wait for natural expiration or change your prefix slightly (add a version number) to trigger a new cache write.
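A lightweight convention for forcing that refresh, reusing the knowledge_base from the RAG example above (the version tag is an illustration, not an API feature):
CONTEXT_VERSION = 7  # bump whenever the underlying knowledge base changes

system = [{
    "type": "text",
    # Changing any byte of the prefix -- including this tag -- triggers a new cache write
    "text": f"[context v{CONTEXT_VERSION}]\n{knowledge_base}",
    "cache_control": {"type": "ephemeral"},
}]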
Monitoring and Metrics. Integrate cache metrics into your observability stack. Track the ratio of cache_read_input_tokens to cache_creation_input_tokens. A healthy system should see 80%+ of requests hitting cache once warm.
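A small helper for that ratio, assuming you collect the usage object from each response:
def cache_hit_ratio(usages):
    """Fraction of cache-eligible tokens served from cache across a window of requests."""
    read = sum(u.cache_read_input_tokens for u in usages)
    written = sum(u.cache_creation_input_tokens for u in usages)
    return read / (read + written) if (read + written) else 0.0

# Alert if the ratio drops below ~0.8 once the cache is warm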
Real-World Examples and Code Snippets
Here’s a production-ready example: a customer support bot that caches product catalogs.
from anthropic import Anthropic
import json

client = Anthropic()

# Load your product catalog (imagine this is 20k tokens)
products = {
    "laptop_pro": {"price": 1299, "stock": 45},
    "mouse_wireless": {"price": 29, "stock": 156},
    "monitor_4k": {"price": 599, "stock": 12}
}
catalog_text = json.dumps(products, indent=2)

def support_response(customer_id, query):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": f"You are a product support agent. Use only this catalog:\n{catalog_text}",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": query}
        ]
    )
    return {
        "response": response.content[0].text,
        "cache_created": response.usage.cache_creation_input_tokens,
        "cache_read": response.usage.cache_read_input_tokens
    }

# First customer: triggers cache creation
result1 = support_response("cust_001", "Do you have the laptop in stock?")
print(f"Customer 1 - Cache creation: {result1['cache_created']}")

# Second customer: same catalog, hits cache
result2 = support_response("cust_002", "What's the price of the wireless mouse?")
print(f"Customer 2 - Cache read: {result2['cache_read']}")
In production, you'd hook this into your customer support platform. The catalog caches once, then thousands of customer interactions reuse it. Calculate the savings: if each query reuses a 5,000-token cached prefix at a 90% discount and you handle 1,000 queries daily, that's the equivalent of 4.5M tokens a day you don't pay full price for, roughly $13.50/day at $3.00 per million.
Troubleshooting and Best Practices
Common issues and fixes:
Cache isn’t being read. Check that your prefix (system message text) is byte-for-byte identical on subsequent requests. Whitespace changes, punctuation shifts, or variable interpolation differences will create a new cache entry instead of reading the old one. Log the exact text you’re sending.
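One way to catch silent drift is to log a short fingerprint of the exact prefix (a sketch with a hypothetical helper):
import hashlib

def prefix_fingerprint(system_blocks):
    """Hash the exact cacheable text so prefix drift shows up in logs."""
    text = "".join(block["text"] for block in system_blocks)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# Two requests that should share a cache must log the same fingerprint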
High cache_creation_input_tokens on every request. This usually means your prefix is changing per request (timestamps, user IDs, dynamic content). Move dynamic data to the messages array, not the system prompt. Keep the system prompt static.
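For example (STATIC_AGENT_PROMPT, customer_id, and query are placeholder names):
# Anti-pattern: per-request data baked into the cached prefix
#   system_text = f"You are a support agent. Current time: {datetime.now()}"  # new cache entry every call

# Better: keep the prefix static, pass dynamic details as user content
system = [{"type": "text", "text": STATIC_AGENT_PROMPT,
           "cache_control": {"type": "ephemeral"}}]
messages = [{"role": "user", "content": f"[customer_id: {customer_id}]\n{query}"}]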
Cache hit rates are low. Evaluate whether your use case actually benefits from caching. If prefixes change constantly or you only ever send one request per prefix, you pay the cache-write premium without any reads. Use it for workflows with inherent repetition: batch processing, multi-turn chats, or shared knowledge bases.
TTL running out mid-conversation. The 5-minute TTL refreshes each time the cache is read, so an active conversation keeps its cache warm. If a conversation goes idle for more than 5 minutes, the next turn simply pays for a fresh cache write. If you need stronger freshness guarantees than that, caching isn't the right tool. Use database-backed context management instead.
Best practices to maximize ROI:
- Cache system prompts and static context separately from dynamic user input
- Consolidate context into the largest possible contiguous blocks to maximize cache efficiency
- Monitor usage metrics weekly to confirm cache performance matches your expectations
- Test cache behavior in staging first with actual production-scale payloads
- Document your cache strategy in your team wiki so others know not to refactor prompts mid-feature
Putting This Into Practice
If you’re just starting: Pick one high-volume use case (like your most-called API endpoint) and add cache_control to the system message. Test it with 100 requests from a single script to confirm cache hits and measure latency reduction. Don’t over-engineer it yet. One line of code, one monitoring metric.
To deepen your practice: Integrate caching into a real RAG pipeline or multi-turn conversation handler. Use the Anthropic Python SDK examples as templates, then layer in LangChain or LlamaIndex integration for automatic cache management. Measure before/after costs and latency on actual customer traffic.
For serious exploration: Implement dynamic cache invalidation logic (version numbers, semantic hashing of context), combine caching with fine-tuned models for specialized domains, and scale across multiple Claude models with TTL optimization per model. Build dashboards showing daily savings and cache hit ratios. Consider multi-region deployments if global latency matters.
Prompt caching brings ROI within weeks on production systems. A system handling 1M API calls monthly against a 5k-token static context sends 5B context tokens a month; at $3.00 per million that's $15,000, which cache reads at $0.30 per million cut to roughly $1,500. Implementation effort is minimal. Start today.
Final Thoughts
Anthropic prompt caching is production-ready and dramatically cuts costs for the right workloads. It’s not a silver bullet. Your use case needs repeated prefixes, stable context, and workflows that tolerate 5-minute cache windows. But if you’re running RAG systems, customer support bots, or agent loops, caching is low-hanging fruit.
Competitive analysis shows Anthropic’s caching implementation leads in pricing and simplicity, making it the practical choice for teams already on Claude. Start with the beginner setup, measure your metrics, and scale from there. The savings will compound.
Frequently Asked Questions
- Q: What is Anthropic prompt caching and how does it work?
- Anthropic prompt caching stores processed prompt prefixes (system prompts and static context) on Anthropic servers for 5 minutes, refreshed on each use. Subsequent requests with identical prefixes retrieve the cached version at roughly 10% of the normal input token cost, cutting spend and trimming latency by up to 75%.
- Q: How do you enable prompt caching in the Anthropic API?
- Add a cache_control parameter with type "ephemeral" to your system message or to content blocks in the messages array. The official Python SDK accepts "cache_control": {"type": "ephemeral"} directly. Monitor cache_creation_input_tokens and cache_read_input_tokens in the response usage object to confirm caching.
- Q: What are the costs and limits of Anthropic prompt caching?
- Cached input tokens cost $0.30 per million to read on Claude 3.5 Sonnet, a 90% discount over the regular $3.00 pricing; cache writes cost 25% more than base input. The TTL is 5 minutes, refreshed on each hit, and prompts below the model's minimum cacheable length (roughly 1,024 tokens on Sonnet) won't cache. The cached prefix is bounded only by the model's context window.
- Q: Prompt caching not working? Common troubleshooting tips.
- Ensure system prompt text is identical byte-for-byte across requests. Whitespace and punctuation changes break cache matching. Move dynamic content to the messages array, not system prompts. Verify cache_control parameter syntax and check usage metrics to confirm cache_read_input_tokens > 0.
- Q: Anthropic prompt caching vs OpenAI: Which is better?
- Anthropic prompt caching offers simpler integration and lower cached token pricing. OpenAI’s approach differs in implementation. For intermediate users building production apps, Anthropic’s method requires less refactoring and delivers faster real-world speedups based on independent benchmarks.