
Claude Prompt Caching Guide

ClaudeAIHub. For official prompt caching documentation, visit platform.claude.com.

Prompt caching lets you reuse large, stable sections of your prompt across multiple API requests. Instead of re-sending and re-processing the same system prompt or document on every call, you mark it as cacheable — and the API reuses the processed version. This reduces response latency and lowers token costs for repeated context.

What Prompt Caching Does

Every time you call the Claude API, the model processes your full prompt from scratch. If your prompt includes a 20,000-token legal document, a detailed system prompt, or a long set of tool definitions, you pay to process those tokens on every request — even if the content is identical each time.

Prompt caching solves this by storing a processed version of a prompt prefix on Anthropic’s servers. On the next request, if the prefix matches, the API uses the cached version instead of reprocessing it. You still pay for cache writes and reads, but cache reads cost significantly less than standard input tokens.

Cache Pricing

Pricing for prompt caching depends on which model you use and whether you choose a 5-minute or 1-hour cache lifetime.

Model               Standard Input   5m Cache Write   1h Cache Write   Cache Read    Output
Claude Fable 5      $10/MTok         $12.50/MTok      $20/MTok         $1/MTok       $50/MTok
Claude Opus 4.8     $5/MTok          $6.25/MTok       $10/MTok         $0.50/MTok    $25/MTok
Claude Sonnet 4.6   $3/MTok          $3.75/MTok       $6/MTok          $0.30/MTok    $15/MTok
Claude Haiku 4.5    $1/MTok          $1.25/MTok       $2/MTok          $0.10/MTok    $5/MTok

Cache reads are priced at 0.1x the standard input price, a 90% reduction compared to reprocessing the same tokens. Cache writes cost more than standard input (1.25x for the 5-minute TTL, 2x for the 1-hour TTL) because they involve both processing and storing the prefix. Verify current pricing at platform.claude.com/docs/en/about-claude/pricing.
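To make the multipliers concrete, here is a minimal sketch that estimates the input cost of a repeated prefix with and without caching. The function name and the simplified model are assumptions for illustration: the first request writes the cache, every later request reads it before it expires, and only the shared prefix is counted.

```python
def prefix_cost(prefix_tokens, requests, input_price_per_mtok, ttl="5m"):
    """Return (uncached_cost, cached_cost) in dollars for the shared prefix.

    Assumes one cache write followed by (requests - 1) cache reads.
    """
    write_multiplier = {"5m": 1.25, "1h": 2.0}[ttl]  # multipliers from this guide
    read_multiplier = 0.1
    base = prefix_tokens / 1_000_000 * input_price_per_mtok

    uncached = requests * base
    cached = base * (write_multiplier + (requests - 1) * read_multiplier)
    return uncached, cached


# 20,000-token prefix, 10 requests, Claude Sonnet 4.6 at $3/MTok input.
uncached, cached = prefix_cost(20_000, 10, 3.0)
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}")
```

Under these assumptions the 5-minute cache is cheaper from the second request onward, and the 1-hour cache from the third. A single request always costs more with caching, because it pays the write premium and never reads.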

Cache Lifetime Options

There are two TTL options for cached prefixes:

  • 5-minute TTL (default): Cache automatically refreshes at no extra cost on each hit. Best for high-frequency workflows where the same prefix is used repeatedly within short windows.
  • 1-hour TTL: Costs 2x standard input for writes but keeps the prefix cached for up to an hour. Best for prompts reused less often than every five minutes but still within the hour.

You select the TTL using the cache_control parameter:

cache_control={"type": "ephemeral"}          # 5-minute TTL (default)
cache_control={"type": "ephemeral", "ttl": "1h"}  # 1-hour TTL
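As a rough sketch, the choice between the two values can be driven by the expected gap between requests that share a prefix. The helper name and the idea of returning None when neither TTL fits are assumptions for illustration, not part of the API.

```python
def choose_cache_control(seconds_between_requests):
    """Pick a cache_control value from the expected gap between requests."""
    if seconds_between_requests < 5 * 60:
        return {"type": "ephemeral"}               # default 5-minute TTL
    if seconds_between_requests < 60 * 60:
        return {"type": "ephemeral", "ttl": "1h"}  # 1-hour TTL
    return None  # gap exceeds both TTLs, so the entry would expire unused


print(choose_cache_control(30))       # requests every 30 seconds
print(choose_cache_control(20 * 60))  # requests every 20 minutes
```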

Minimum Token Requirements

Prompt caching only applies when the prompt prefix reaches a minimum token length. Below this threshold, the request processes without caching and no error is returned:

  • 1,024 tokens minimum: Claude Sonnet 4.6
  • 4,096 tokens minimum: Claude Opus 4.8, Claude Haiku 4.5

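To verify caching is occurring, check the cache_creation_input_tokens and cache_read_input_tokens fields in the usage response. The first request with a new prefix should report a non-zero cache_creation_input_tokens (a cache write). Later requests with the same prefix should report a non-zero cache_read_input_tokens (a cache hit). If both fields are zero, nothing was cached.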
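The check on the two usage fields can be wrapped in a small helper. This is a sketch, not SDK functionality: cache_status is a hypothetical name, and the stand-in objects below use made-up token counts in place of real API responses.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify one response's usage as 'hit', 'write', or 'none'.

    Reads attributes by name, so it accepts any object exposing the two
    usage fields. Missing or None values are treated as zero.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prefix came from the cache
    if created > 0:
        return "write"  # prefix was processed and stored for later requests
    return "none"       # nothing cached, e.g. prefix below the minimum length


# Stand-ins for the usage object of a first and a second request.
first_call = SimpleNamespace(cache_creation_input_tokens=2100, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2100)
print(cache_status(first_call), cache_status(second_call))
```

In a real workflow you would pass the usage attribute of the response instead of the stand-ins and log the result during testing.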

How to Implement Prompt Caching

There are two ways to use prompt caching in the API.

Automatic Caching (Simplest)

Add cache_control at the top level of the request. The API automatically applies the cache breakpoint to the last cacheable block:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are a legal document analyst specializing in contract review...[long system prompt]",
    messages=[{"role": "user", "content": "Review this contract section."}]
)

Explicit Breakpoints

Place cache_control directly on specific content blocks for fine-grained control over what gets cached:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst...[long, stable system prompt]",
            "cache_control": {"type": "ephemeral"}  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Review this contract."}]
)

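You can define up to 4 cache breakpoints per request. Each breakpoint marks the end of a cacheable prefix. The system checks prefixes in order; if the content up to a breakpoint matches a cached entry, it reuses the cached version for everything up to that breakpoint and processes only what follows.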
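A sketch of a request that uses several breakpoints, with a helper that counts them before sending. The tool lookup_clause, the placeholder texts, and the count_breakpoints helper are hypothetical; the dictionary mirrors the keyword arguments used in the examples above.

```python
MAX_BREAKPOINTS = 4  # limit stated in this guide


def _marked(blocks):
    """Count dict blocks in a list that carry a cache_control key."""
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


def count_breakpoints(request):
    """Count explicit cache breakpoints across tools, system, and messages."""
    total = _marked(request.get("tools", []))
    system = request.get("system", [])
    if isinstance(system, list):  # a plain string system prompt has no blocks
        total += _marked(system)
    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            total += _marked(content)
    return total


request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "tools": [{
        "name": "lookup_clause",  # hypothetical tool
        "description": "Look up a contract clause by number.",
        "input_schema": {"type": "object", "properties": {"number": {"type": "string"}}},
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: tool definitions
    }],
    "system": [{
        "type": "text",
        "text": "You are a legal document analyst...[long, stable system prompt]",
        "cache_control": {"type": "ephemeral"},  # breakpoint 2: instructions
    }],
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "[full contract text]",
         "cache_control": {"type": "ephemeral"}},  # breakpoint 3: document
        {"type": "text", "text": "Review this contract."},  # varies per request
    ]}],
}

used = count_breakpoints(request)
if used > MAX_BREAKPOINTS:
    raise ValueError(f"{used} breakpoints exceeds the limit of {MAX_BREAKPOINTS}")
print(used)
```

Once the count passes, the dictionary can be unpacked into the call as client.messages.create(**request).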

When Prompt Caching Helps Most

  • Long system prompts: Detailed instructions, personas, or formatting rules that stay the same across all requests.
  • Repeated document context: Uploading the same document (legal contract, research paper, codebase) and asking multiple different questions about it.
  • Agent workflows: Tool definitions and agent instructions that repeat across every step of a multi-turn workflow.
  • Conversational context: Long multi-turn conversations where earlier messages are stable and only the latest turn changes.
  • Few-shot examples: Prompts with many examples or demonstrations that need to be included on every call.

When Caching May Not Help

  • Short prompts: Below the minimum token threshold, caching does not apply.
  • Highly variable prompts: If your prefix changes on every request (dynamic context, per-user instructions), cache writes will never turn into cache hits.
  • Low-frequency requests: If you reuse a cached prefix less often than once every five minutes, the default cache expires before the next request arrives (use the 1-hour TTL for these cases).
  • Single requests: Caching only helps when the same prefix appears across multiple requests.
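For the highly variable case, one common arrangement is to keep the stable text first and put per-request text after the breakpoint, so the cached prefix stays identical across requests. The sketch below assumes a hypothetical build_system_blocks helper and placeholder strings.

```python
def build_system_blocks(stable_instructions, per_user_context):
    """Return system blocks with the stable part cached and the variable part after it."""
    return [
        {
            "type": "text",
            "text": stable_instructions,
            "cache_control": {"type": "ephemeral"},  # prefix ends here
        },
        {
            "type": "text",
            "text": per_user_context,  # changes per request, stays outside the prefix
        },
    ]


stable = "You are a legal document analyst...[long, stable system prompt]"
blocks_a = build_system_blocks(stable, "The user is reviewing a lease.")
blocks_b = build_system_blocks(stable, "The user is reviewing an NDA.")

# The cached block is identical for both users; only the trailing block differs.
print(blocks_a[0] == blocks_b[0], blocks_a[1] == blocks_b[1])
```

If the per-user text came first instead, the prefix would differ on every request and each call would pay for a cache write without a later hit.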

Developer Safety Notes

  • Do not cache sensitive data: Cached content is stored on Anthropic’s servers. Avoid caching API keys, passwords, personal data, or confidential business information in system prompts.
  • Workspace isolation: As of February 2026, caches are isolated per workspace on the Claude API and Claude Platform on AWS. Different workspaces or organizations cannot share caches.
  • Verify cache hits in testing: Before relying on caching in production, confirm your workflow is producing cache hits by checking the usage response fields.
  • Monitor billing: Cache writes cost more than standard input (1.25x or 2x, depending on the TTL). Verify that your usage patterns produce enough cache hits to offset write costs before scaling.
  • Concurrent requests: A cache entry only becomes available after the first response begins. If you send several requests with the same new prefix in parallel, they cannot read an entry that does not exist yet, so the first batch will not all be cache hits.
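One way to handle the concurrency note above is to send a single request first and fan out the rest only after it returns. This is a conservative sketch: it waits for the whole first response rather than its start, and send is a hypothetical coroutine standing in for whatever wraps your API call.

```python
import asyncio


async def warm_then_fan_out(send, requests):
    """Send the first request alone, then the remaining ones concurrently."""
    if not requests:
        return []
    first = await send(requests[0])  # creates the cache entry
    rest = await asyncio.gather(*(send(r) for r in requests[1:]))
    return [first, *rest]


# Fake sender used only to show the ordering; it makes no network calls.
events = []


async def fake_send(request):
    events.append(("start", request))
    await asyncio.sleep(0)
    events.append(("end", request))
    return f"response to {request}"


results = asyncio.run(warm_then_fan_out(fake_send, ["q1", "q2", "q3"]))
print(results)
```

The trade-off is the added latency of one serial request in exchange for the remaining requests being able to read the cached prefix.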

Supported Models and Platforms

Prompt caching works on all current Claude models (Fable 5, Opus 4.8, Sonnet 4.6, Haiku 4.5) and legacy models. Automatic caching is available on the Claude API and Claude Platform on AWS. Explicit breakpoints are required on Amazon Bedrock (Bedrock does not support automatic caching). Vertex AI supports explicit breakpoints only in some configurations.

Related Resources