The Hidden Rate Limits of Every Major AI API
A practical guide to the real rate limits of OpenAI, Anthropic, Google, xAI, DeepSeek and OpenRouter in 2026, including the ones their docs do not advertise.
Every major AI provider publishes a rate limit table. None of them tells you the full story. The published numbers describe the best case for a given tier. They are not a guarantee of what the system will actually let you do, and they do not list every silent throttle sitting between you and your next response.
If you have ever stared at a 429 error from a provider whose dashboard said you were nowhere near your quota, you already know the gap exists. This is a practical map of where the real walls are in 2026, provider by provider, and what to do when you hit them.
How rate limits actually work in 2026
Before going provider by provider, three things are worth stating up front because they apply almost everywhere.
First, almost every major provider now uses tier-based limits. You start at a low tier with tight RPM (requests per minute) and TPM (tokens per minute) caps, and you graduate to higher tiers based on cumulative spend, payment history, or sometimes a manual review. The published “max” limits assume you are at the top tier. Most accounts are not.
Second, rate limits are not just RPM and TPM. There are also concurrency limits, daily token caps, input token caps per request, output token caps per request, batch queue limits, and per-organization aggregates that ride on top of per-key limits. Several providers also throttle prompt caching reads and tool calls separately from raw generation tokens.
Third, soft limits exist. When the system is under load, your request can be rejected, slowed down, or quietly degraded (smaller context window, lower priority queue) without that showing up in any published table. This is the part you cannot plan around by reading docs. You only see it in production.
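The first two points can be partly mirrored on the client side. The sketch below is one way to do it, assuming a simple 60-second sliding window: it admits a request only if both the request count and the token count stay under the caps you configure. The class name `DualWindowLimiter` and its parameters are illustrative, and real providers may count with a different algorithm (for example a token bucket), so treat this as a conservative local guard rather than a model of any provider's enforcement.

```python
import time
from collections import deque


class DualWindowLimiter:
    """Sliding 60-second window that tracks request count and token count."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock      # injectable so the limiter can be tested
        self.events = deque()   # (timestamp, tokens) for each admitted request

    def _evict(self, now):
        # Drop entries that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Return True and record the request if it fits both budgets."""
        now = self.clock()
        self._evict(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm:
            return False        # request-per-minute cap would be exceeded
        if used_tokens + tokens > self.tpm:
            return False        # token-per-minute cap would be exceeded
        self.events.append((now, tokens))
        return True
```

A caller would estimate the token cost of a request up front, call `try_acquire`, and sleep or queue when it returns False. Soft limits from the third point cannot be modeled this way, since they are not published.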
With that out of the way, let us go through the providers.
OpenAI
OpenAI runs five usage tiers (Tier 1 through Tier 5) plus a Free tier. You move up by spending money and waiting. Tier 1 unlocks after a $5 payment and 7 days; Tier 5 requires $1,000 paid and 30 days. Each tier sets different RPM, TPM, and batch queue limits per model.
The published numbers you should know:
- Tier 1 on GPT-5 class models: 500 RPM, 30,000 TPM.
- Tier 5 on the same models: 10,000 RPM, 30,000,000 TPM.
- Image generation models have their own much tighter limits, often 5 to 50 images per minute even at high tiers.
- Embeddings have a separate TPM budget, usually 1,000,000 at Tier 1 and 10,000,000 at Tier 5.
The hidden parts:
- Daily token caps exist on lower tiers, but they are not always shown in the dashboard. A Tier 1 account on a reasoning model can hit a per-day cap that the per-minute display will never warn about.
- Reasoning models bill output differently. Hidden reasoning tokens count against your TPM. A request that returns 200 visible tokens can consume 8,000 against your minute budget.
- Org-wide aggregation. If you split traffic across multiple API keys in the same organization, you do not get more capacity. The cap is org-level, not key-level.
- Background priority. The Batch API offers a 50 percent discount but has its own queue limits, typically 90,000,000 enqueued tokens at Tier 5. Hit that and new batches sit until older ones drain.
The recovery path is straightforward: pay more, wait longer. There is no support ticket that meaningfully accelerates this for solo developers.
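The reasoning-token point is easiest to see as arithmetic. The helper below is a rough sketch using the Tier 1 figures quoted above (500 RPM, 30,000 TPM). The function name `effective_rpm` is made up for illustration, and the calculation ignores input tokens, so the real number would be lower still.

```python
def effective_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests per minute that fit under both the RPM and the TPM cap."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Counting only the 200 visible output tokens per request.
print(effective_rpm(500, 30_000, 200))    # 150

# Counting 8,000 tokens per request once hidden reasoning is included.
print(effective_rpm(500, 30_000, 8_000))  # 3
```

Under these assumptions the 500 RPM figure is never the binding limit. The token budget is, and reasoning tokens shrink it by a factor of 50.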
Anthropic
Anthropic also uses tiers, named Tier 1 through Tier 4 plus a custom Enterprise tier. Movement up requires both cumulative deposit and elapsed time, same pattern as OpenAI.
Published numbers:
- Tier 1 on Claude Sonnet: 50 RPM, 30,000 input TPM, 8,000 output TPM.
- Tier 4 on Claude Sonnet: 4,000 RPM, 400,000 input TPM, 80,000 output TPM.
- Opus models have lower TPM at every tier, often half of Sonnet’s.
- Claude Haiku gets higher RPM but the same TPM, which matters when your workload is many short calls.
The hidden parts:
- Input and output TPM are tracked separately. This is unusual. Most providers count everything in one bucket. With Anthropic, a long-context summarization workload (heavy input, light output) will hit the input ceiling while output TPM sits at 5 percent used. The dashboard will show you healthy headroom on one number and a wall on the other.
- Prompt caching has its own quota. Cached reads do not count the same as fresh input tokens, but cache writes do, and the system enforces a separate cache size budget per organization that is not in the public docs.
- Concurrent request limits. Even when you are under RPM, you can be capped on simultaneous in-flight requests. Long-running Claude Code sessions on Opus can saturate this without ever approaching RPM.
- The Messages API and Bedrock paths have different limits. Same model, same prompt, different ceiling depending on whether you call Anthropic direct or go through AWS Bedrock or Google Vertex. Vertex tends to be the loosest of the three for high-volume workloads in 2026.
If you hit Anthropic limits often, the practical move is to add a Bedrock or Vertex fallback for spillover rather than wait for a tier upgrade.
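To see why separate input and output buckets matter, it helps to compute which cap binds first for a given request shape. The following is a minimal sketch using the Tier 1 Sonnet figures listed above. The `Limits` dataclass, the `binding_limit` function, and the 10,000-in / 400-out request shape are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    rpm: int
    input_tpm: int
    output_tpm: int


def binding_limit(limits, input_tokens, output_tokens):
    """Return (tightest cap name, max requests per minute under that cap).

    Token counts per request must be positive.
    """
    candidates = {
        "rpm": limits.rpm,
        "input_tpm": limits.input_tpm // input_tokens,
        "output_tpm": limits.output_tpm // output_tokens,
    }
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


tier1_sonnet = Limits(rpm=50, input_tpm=30_000, output_tpm=8_000)

# A summarization-style request: heavy input, light output.
name, max_rpm = binding_limit(tier1_sonnet, input_tokens=10_000, output_tokens=400)
output_used = max_rpm * 400 / tier1_sonnet.output_tpm
print(name, max_rpm, f"{output_used:.0%}")  # input_tpm 3 15%
```

With this request shape the workload stalls at 3 requests per minute on the input bucket while the output bucket is mostly idle, which is the pattern the dashboard makes easy to misread.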
Google (Gemini API and Vertex AI)
Google publishes two distinct surfaces: the AI Studio Gemini API (developer-friendly, free tier, smaller limits) and Vertex AI (cloud-grade, project-level quotas, higher ceilings).
Published numbers on AI Studio Gemini API for the 2.5 and 3.x Pro lines:
- Free tier: 5 RPM, 250,000 TPM, 100 requests per day.
- Tier 1 (linked billing): 1,000 RPM, 2,000,000 TPM, 10,000 RPD.
- Tier 2 and Tier 3 raise these by 5x and 10x respectively.
On Vertex AI, quotas are project-level and adjustable through quota requests in the Google Cloud console. Default new-project quotas are surprisingly low (often 60 QPM per model per region) and need a quota request to scale.
The hidden parts:
- Per-day caps are real and brutal on the free tier. “5 RPM” sounds usable; “100 RPD” means a single test loop running at that rate burns your daily budget in 20 minutes.
- Region matters. Vertex quotas are per region. If you only ask for higher quota in us-central1, your europe-west4 deployment is still at default.
- The Live API (streaming voice / video) has separate session-minute quotas that are much tighter than the text API. Easy to miss if you scaled out a multimodal product on Tier 1 limits.
- Context caching on Gemini has minimum lifetime billing. This is not technically a rate limit, but it is a hidden floor on cost that shows up the moment you cache a 1M-token context for “just one query.”
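One way to live with a per-day cap is to pace requests so the budget lasts the whole window you care about. The helper below is a small sketch of that arithmetic. The function name and the 8-hour working window are assumptions, not anything Google prescribes.

```python
def min_spacing_seconds(requests_per_day, active_hours):
    """Seconds to wait between requests so a daily cap lasts the whole window."""
    if requests_per_day <= 0 or active_hours <= 0:
        raise ValueError("both arguments must be positive")
    return active_hours * 3600 / requests_per_day


# Free-tier cap of 100 requests per day, spread over an 8-hour working day.
print(min_spacing_seconds(100, 8))  # 288.0
```

At one request roughly every five minutes, the free tier suits occasional manual testing far better than any automated loop.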
xAI (Grok)
xAI is the youngest of the major providers and has the most volatile limits in 2026. The published table on docs.x.ai shows RPM and TPM by model, but the numbers move more often here than anywhere else, and the API frequently runs hotter during product launches.
Typical published numbers for Grok 4 class models:
- 60 to 480 RPM depending on account history.
- 2,000,000 TPM ceiling on paid tiers.
The hidden parts:
- Launch-day throttling. When a new Grok model ships, paid customers routinely see effective limits well below the published ones for the first 48 to 72 hours. There is no dashboard signal for this.
- Live Search adds its own per-request quota. Search-enabled completions are billed and rate-limited differently from base completions.
- No formal tier system means manual escalation matters more. Talking to xAI support is one of the few cases where a polite enterprise email actually changes the number.
DeepSeek
DeepSeek’s pricing remains the cheapest of the major providers in 2026, but the rate limit profile is unusual.
Published position: DeepSeek officially advertises “no published rate limits, throttled based on load.” This sounds generous. It is actually a warning.
The hidden parts:
- Dynamic throttling. During Asia business hours, the API is noticeably slower and 429s appear more frequently. Off-peak (roughly 22:00 to 06:00 UTC) the same workload sails through.
- No tier escalation path. There is no way to “pay for higher priority.” You get what the system gives you that hour.
- Context length surprises. The advertised context window is honored, but very long contexts (above 100K tokens) get queued or refused first under load.
For batch workloads that are not time-sensitive, DeepSeek is excellent. For interactive product traffic that must respond within 2 seconds, it is risky as a sole provider.
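For the batch case, a scheduler can simply hold jobs until the off-peak window. This sketch checks whether a timestamp falls inside the rough 22:00 to 06:00 UTC window mentioned above. The window itself is an observation rather than a documented guarantee, and the function name is illustrative.

```python
from datetime import datetime, time, timezone

OFF_PEAK_START = time(22, 0)  # assumed start of the quiet window, UTC
OFF_PEAK_END = time(6, 0)     # assumed end of the quiet window, UTC


def is_off_peak(moment):
    """True if a timezone-aware datetime falls in the 22:00-06:00 UTC window."""
    t = moment.astimezone(timezone.utc).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


if is_off_peak(datetime.now(timezone.utc)):
    print("submit batch now")
else:
    print("hold batch until 22:00 UTC")
```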
OpenRouter
OpenRouter is a router, not a model host, so its rate limits are a composite of its own infrastructure limits plus whichever underlying provider it routes you to.
Published numbers:
- Free tier (BYOK or routed free models): 20 RPM, 50 RPD if you have less than $10 in credits, 1,000 RPD if you do.
- Paid: limited primarily by the upstream provider’s limits applied to OpenRouter’s shared pool.
The hidden parts:
- Shared pool dilution. When you hit Claude or GPT through OpenRouter, you are sharing OpenRouter’s organization-level capacity with thousands of other users. During peak hours this can mean lower effective throughput than calling Anthropic or OpenAI direct on a Tier 2 account.
- Provider routing changes silently. OpenRouter may switch you from Anthropic direct to Bedrock mid-session based on availability. Both work, but observability gets harder because your latency profile shifts.
- Free models throttle aggressively. The “free” tier on hosted open models is rate-limited far below what the docs imply, and the limits change weekly.
OpenRouter is the right choice for fallback diversity and experimentation. It is a poor choice as the single dependency for a production workload that needs predictable throughput.
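Fallback diversity, whether it comes from a router or from your own code, reduces to trying providers in order and moving on when one is rate-limited. The sketch below shows that shape with stand-in functions. `RateLimited`, `call_with_fallback`, and the two stub providers are all hypothetical, and a real version would wrap each provider's SDK and translate its 429 response into the exception.

```python
class RateLimited(Exception):
    """Raised by a provider callable when it receives HTTP 429 (hypothetical)."""


def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order; return (name, result) on first success."""
    limited = []
    for name, send in providers:
        try:
            return name, send(prompt)
        except RateLimited:
            limited.append(name)  # remember who refused, then try the next one
    raise RuntimeError(f"all providers rate-limited: {limited}")


# Stand-in providers for illustration only.
def primary(prompt):
    raise RateLimited("429 from primary")


def backup(prompt):
    return f"echo: {prompt}"


print(call_with_fallback([("primary", primary), ("backup", backup)], "hello"))
# ('backup', 'echo: hello')
```

Returning the provider name alongside the result keeps the routing decision visible, which helps with the observability problem described above.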
Mistral, Cohere, Together, Fireworks (quick takes)
- Mistral: tiered like Anthropic, with separate input and output TPM. La Plateforme limits are roughly half of what AWS Bedrock allows for the same models.
- Cohere: trial vs production keys. Trial keys have hard caps (historically 1,000 calls per month); production keys negotiate limits per account.
- Together AI: per-model RPM/TPM with relatively generous defaults for fine-tuned models. The hidden cost is the inference engine choice (vLLM vs TGI vs proprietary), which affects effective throughput more than the published rate limit.
- Fireworks: explicit “serverless” vs “dedicated” tiers. Serverless is rate-limited per project; dedicated removes the limit but you pay for idle capacity.
What to do about it
Rate limits will not disappear. The realistic options are:
- Spread load across providers. A multi-provider setup with a router (homebuilt or OpenRouter for non-critical paths) absorbs spikes that any single provider would 429 on.
- Cache aggressively. Both prompt caching (Anthropic, OpenAI, Gemini) and response caching for repeated queries pull massive load off your rate limit budget.
- Batch what can be batched. Most providers offer batch APIs at a 50 percent discount with looser limits. Anything non-interactive (summaries, classification jobs, embeddings backfills) belongs there.
- Observe upstream signals. The x-ratelimit-remaining-* headers (or equivalents) are present on almost every provider. Surface them. Most teams discover their real ceiling only after their first outage because nothing was plotting these headers.
- Know your tier honestly. Tier 1 is not a production tier on any major provider. If you are running a real workload at Tier 1, you are one viral moment away from a 429 cascade.
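Surfacing the rate-limit headers takes very little code. The sketch below pulls every "remaining" style header out of a response header mapping and picks a retry delay after a 429. Header names differ by provider (OpenAI uses the `x-ratelimit-remaining-*` form, Anthropic uses `anthropic-ratelimit-*-remaining`), so the match is deliberately loose. Both function names are illustrative, and the backoff assumes `Retry-After` carries seconds rather than an HTTP date.

```python
import random


def remaining_budgets(headers):
    """Collect every rate-limit 'remaining' header into {lowercase name: int}."""
    out = {}
    for key, value in headers.items():
        k = key.lower()
        if "ratelimit" in k and "remaining" in k:
            try:
                out[k] = int(value)
            except ValueError:
                continue  # skip non-numeric values rather than guess
    return out


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        return float(retry_after)  # assumption: the server sent seconds
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))


sample = {
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-remaining-tokens": "29000",
    "content-type": "application/json",
}
print(remaining_budgets(sample))
```

Emitting the returned dictionary as metrics on every response gives you the plot that would otherwise only exist after the first outage.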
This is also where having a single view of your usage across providers becomes useful rather than nice-to-have. The kind of dashboard tokenkarma builds exists exactly because rate limit data lives in six different consoles, in six different formats, and no provider has any incentive to tell you “you are about to hit a wall on the other guy’s API.” The numbers you actually need to make a decision are scattered, and the only way to act on them in time is to put them in one place.
The short version
The published rate limit table is the start of the conversation, not the end. The real ceiling is shaped by your tier, your region, your context length, your input-to-output ratio, whether the provider is having a load spike, and which gateway you go through. Treat every published number as optimistic, plot the response headers, and never run a serious workload against a single provider’s lowest tier.
The providers will not warn you before you hit the wall. The wall is there anyway.