9 min read B2B FinOps

Prompt Caching Across Providers: Real Savings in 2026

How prompt caching works on Claude, OpenAI, Gemini and Bedrock in 2026, what each charges for cached reads, and how to wire it up to actually cut your API bill.

Prompt Caching Across Providers: Real Savings in 2026

Searches for “prompt caching” are up 86 percent in three months. The feature is no longer niche. Every major provider now offers some flavor of it, and for any workflow that reuses a long system prompt, document, or codebase snapshot, prompt caching is the difference between a sustainable API bill and a runaway one.

This piece walks through what prompt caching actually is, how each major provider implements it in 2026, what the real per-token savings look like, and a checklist for wiring it into your code without breaking anything.

What prompt caching actually does

Most modern LLM workloads send the same big block of context over and over: a long system prompt, a tool schema, an embedded set of docs, a repo snapshot for a coding agent. Without caching, the provider tokenizes and processes that full prefix on every call. With caching, the provider stores the intermediate KV cache for the repeated prefix and bills you a fraction of the normal input rate when you reuse it.

The mental model: you pay full input rate once to write the cache, then a heavily discounted rate to read it for some window of time. The window is typically 5 minutes to 1 hour depending on provider.

This is not a model improvement. It is a billing optimization that exposes a real underlying compute savings. The provider does less work on cached prefixes, so they pass some of that saving to you in exchange for committing to a stable prefix.

How each major provider charges in 2026

The exact rates shift, but the structure has stabilized across providers. The three numbers that matter per provider are: write multiplier (cost to put something in cache vs the base input rate), read multiplier (cost to read from cache vs the base input rate), and TTL (how long the cache lives without being refreshed).

Anthropic Claude

  • Write: 1.25x of base input rate for a 5-minute TTL, around 2x for a 1-hour TTL.
  • Read: 0.1x of base input rate (90 percent discount).
  • TTL: 5 minutes by default, configurable up to 1 hour on extended-TTL beta.
  • Granularity: explicit, controlled by cache_control: {type: "ephemeral"} markers on message blocks.

For a heavy Opus user with a 50K-token system prompt reused thousands of times per day, this is the most aggressive discount in the market. Practical effective input rate on cached portions sits near $0.30 per million tokens versus $3 base for Opus, or $0.025 per million on Haiku.

OpenAI

  • Write: 1.0x of base input rate. No write surcharge.
  • Read: 0.25x to 0.50x of base input rate depending on tier and model (75 to 50 percent discount).
  • TTL: typically 5 to 10 minutes, opportunistic.
  • Granularity: automatic. The platform decides what to cache once a prefix is reused. The minimum cacheable prefix is around 1024 tokens.

Less aggressive discount than Anthropic, but zero friction. Code that already reuses a stable prefix gets caching savings without changing a single API parameter. The downside is less predictability: you cannot pin a cache, and short pauses between calls can lead to misses.

Google Gemini

  • Write: charged per token at base input rate, plus a small storage fee per hour.
  • Read: substantially discounted, often near 0.25x of base input rate.
  • TTL: explicit and configurable, from 1 minute to several hours via the explicit cache API.
  • Granularity: explicit via the cache content API on Vertex AI and on the Gemini API.

Gemini’s caching is the most predictable and the easiest to budget against, because you explicitly create a cache object with a known TTL and known content. Best for workflows with a known long-lived prefix (large doc analysis pipelines, persistent system prompts shared across users).

Amazon Bedrock

Bedrock exposes prompt caching for both Anthropic and other supported models, with rates that mirror the underlying provider but with Bedrock’s own quotas. For enterprise teams already routing through Bedrock for compliance reasons, this means the same per-call cost math as direct Anthropic, with one fewer vendor to negotiate with.

Where the savings actually live

A back-of-envelope calculation for a typical coding agent workflow:

  • 30K-token system prompt and tool schema.
  • 200 calls per hour during active work.
  • 10 hours per day of active work, 22 days per month.

Without caching, on Claude Sonnet at $3 per million input tokens: 30K * 200 * 10 * 22 = 1.32B input tokens per month from the prefix alone, costing about $3,960.

With 5-minute caching on Anthropic at the 0.1x read rate, assuming a cache hit rate of 85 percent across the active hour: effective cost on the prefix drops to roughly $590 per month. Net saving: about $3,370 per month, or 85 percent of the prefix bill, for a workflow that needed zero behavioral change.

The numbers scale almost linearly with prefix size and call frequency. For agentic loops that pass an entire repo snapshot or a long tool catalog on every call, the savings can be even larger than the example above.

Where caching breaks down

A few common pitfalls that erase the savings:

  • Mutable prefixes. If your system prompt embeds the current timestamp, a user ID, or anything that changes per call, the cache key changes too and you pay full price every time. Fix: pull volatile data out of the cached block and put it in the user message.
  • Cache window misses. On Anthropic and OpenAI, a 5-minute TTL means you lose the cache between sessions. For sporadic workloads, the write surcharge can cost more than the savings. Fix: either batch your work into bursts, or use the extended TTL options.
  • Provider routing. If you load-balance the same logical prefix across Anthropic and OpenAI, you write two separate caches and pay double on writes. Fix: pin caching-sensitive workflows to one provider.
  • Inadvertent invalidation. Tool catalogs, retrieved docs, and few-shot examples often look stable but get reordered between calls. Any ordering change invalidates the prefix. Fix: deterministic serialization of every cached block, sorted keys, stable formatting.

A practical wiring checklist

For any pipeline where you suspect prompt caching could help:

  1. Profile the call. Log token counts for the static prefix versus the per-call user content. If the prefix is more than 1024 tokens and is reused at least 10 times per hour, caching is worth turning on.
  2. Pick the provider where the workflow already runs. Do not change provider just for caching. The 90 percent discount on Anthropic only matters if your code already calls Claude.
  3. Move volatile data out of the cached prefix. Timestamps, user IDs, request IDs, anything mutable belongs in the user message, not in the system prompt.
  4. On Anthropic, add cache_control: {type: "ephemeral"} to the last block of each cacheable section. On OpenAI, do nothing and trust the auto-cache. On Gemini, create a cached content object and reference it.
  5. Measure for one week. Track cache hit rate, total input tokens billed at cached vs base rate, and total monthly cost. If the cache hit rate is below 50 percent, your TTL or your prefix stability needs work.

What this means for your tool budget

Heavy AI users running coding agents, document pipelines, or multi-step workflows can typically cut 40 to 70 percent off their input token bill in the first month of caching, and 70 to 85 percent once they have tuned prefix stability. For founders running on a subscription that already includes generous limits, caching extends those limits in practice by reducing how many tokens each call actually consumes against your quota. For API-billed teams, the savings drop straight to the income statement.

Prompt caching is not the most exciting feature in the 2026 AI stack. It is the most underused one. For any workflow that reuses a long prefix, the work to wire it up is measured in hours and the payoff is measured in months of bills.