Understanding Rate Limits Across AI Coding Tools
Nothing kills flow state faster than hitting a rate limit mid-session. You're deep in a refactor, the AI is producing great suggestions, and then -- "You've reached your usage limit. Please wait." No estimate of when it resets. No clarity on what you did to trigger it. Just a wall.
Rate limits are the most frustrating aspect of AI coding tools, and they're also the least well-documented. Every provider measures usage differently, communicates limits in different terms, and resets on different schedules. This guide lays out exactly how rate limits work across every major AI coding CLI so you can plan your workflow instead of reacting to interruptions.
Why Rate Limits Exist
Before we get into the specifics, it's worth understanding why these limits exist at all. AI inference is expensive. Running a large language model to generate code requires significant GPU compute, and that compute has a real cost that scales with usage.
Rate limits serve three purposes:
- Cost management. Providers need to ensure that subscription revenue covers inference costs. Without limits, heavy users would create a classic "tragedy of the commons" where a small percentage of users consume a disproportionate share of resources.
- Infrastructure protection. Sudden spikes in usage can overwhelm inference infrastructure. Rate limits smooth out demand and prevent cascading failures that would degrade the service for everyone.
- Tier differentiation. Limits create the pricing gradient between free, mid-tier, and premium plans. If every tier had unlimited access, there'd be no reason to upgrade.
The frustration isn't that limits exist -- it's that they're often opaque, inconsistent, and poorly communicated. Let's fix the communication part.
How Each Tool Measures Usage
Claude Code
Claude Code uses a token-weighted rolling window system. This is fundamentally different from message counts or monthly quotas.
How it works: Your usage is measured in tokens (both input and output), tracked over a rolling 5-hour window with a weekly cap as a secondary limit. When your 5-hour window fills up, you wait for the oldest usage to roll off. When your weekly cap fills up, you wait until the next week.
Approximate allocations by tier:
| Plan | ~Tokens per 5-Hour Window | Relative Scale |
|---|---|---|
| Pro | ~44,000 | 1x |
| Max 5x | ~88,000 | 2x |
| Max 20x | ~220,000 | 5x |
What makes it confusing: Anthropic doesn't publish exact token numbers. The figures above are community-estimated from observed usage. The token counting includes both your input (prompts, files loaded into context) and the model's output, which means a single interaction with a large codebase loaded can consume a significant chunk of your window. Loading a 50K-token file into context and getting a 5K-token response uses 55K tokens from your budget -- more than an entire Pro 5-hour window in a single interaction.
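The arithmetic above can be sketched as a quick estimator. The token budgets below are the community estimates from the table, not official figures:

```python
# Back-of-the-envelope estimate of how many interactions fit in a
# Claude Code 5-hour window. Budgets are community estimates.
WINDOW_BUDGETS = {
    "Pro": 44_000,
    "Max 5x": 88_000,
    "Max 20x": 220_000,
}

def interactions_per_window(plan: str, context_tokens: int, output_tokens: int) -> int:
    """Both input (loaded context) and output tokens count against the budget."""
    per_interaction = context_tokens + output_tokens
    return WINDOW_BUDGETS[plan] // per_interaction

# Loading a 50K-token file and getting a 5K-token response (55K total):
print(interactions_per_window("Max 5x", 50_000, 5_000))  # 1 interaction fits
print(interactions_per_window("Pro", 50_000, 5_000))     # 0 -- over a Pro window in one shot
```

The `Pro` result of zero is exactly the 55K-versus-44K mismatch described above: one careless context load can exceed the whole window.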
The rolling window is both a feature and a frustration. Unlike monthly quotas that force you to ration across 30 days, rolling windows let you use your full allocation whenever you need it -- you just can't sustain maximum throughput for more than 5 hours continuously. For sprint-style coding, this is actually better than monthly limits. For all-day-every-day usage, it requires pacing.
How the weekly cap interacts with 5-hour windows: Think of the weekly cap as a safety net preventing you from running at full 5-hour throughput around the clock. If you're a 9-to-5 developer who uses Claude Code intensively during work hours, the weekly cap likely won't affect you. If you're coding 12+ hours a day, 7 days a week, the weekly cap becomes the binding constraint.
For detailed pricing breakdowns, see our Claude Code pricing guide.
Gemini CLI
Gemini CLI has the most transparent rate limit structure of any major AI coding CLI.
Google Account (free tier):
| Limit | Value |
|---|---|
| Requests per minute | 60 |
| Requests per day | 1,000 |
| Available models | Flash only |
API key tiers scale from free (with lower limits) through pay-as-you-go (effectively unlimited). With a billing-enabled API key, you can access Pro-level models and higher rate limits.
What makes it straightforward: Gemini measures in simple request counts with clear per-minute and per-day limits. There's no token weighting, no rolling windows, no multiplier confusion. You can calculate exactly how many requests you have left at any point. If you've made 400 requests today, you have 600 left. No ambiguity.
What makes it limiting: The free tier is restricted to Flash models. For Pro-level models, you need an API key with paid billing. And 1,000 requests per day sounds generous until you're in a heavy agentic session where a single task might generate dozens of API calls behind the scenes. A "simple" request like "refactor this module" might trigger 15-20 internal API calls as the agent reads files, plans changes, writes code, and verifies results.
The 60 RPM (requests per minute) limit matters less for interactive use but becomes relevant for automated workflows and CI/CD integration where you might batch multiple requests in rapid succession.
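For those automated workflows, a client-side throttle keeps batched calls under the published 60 RPM cap. This is a minimal sketch using a trailing-minute timestamp queue; the `RpmLimiter` class is illustrative, not part of Gemini CLI or its SDK:

```python
import time
from collections import deque

class RpmLimiter:
    """Client-side throttle for a published requests-per-minute cap.
    Blocks until a slot is free within the trailing 60 seconds."""

    def __init__(self, rpm: int = 60):
        self.rpm = rpm
        self.sent = deque()  # timestamps of requests in the trailing minute

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps older than the trailing minute.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) >= self.rpm:
            # Sleep until the oldest request ages out of the window.
            time.sleep(60 - (now - self.sent[0]))
            now = time.monotonic()
        self.sent.append(now)

limiter = RpmLimiter(rpm=60)
# Call limiter.acquire() before each batched API request to stay under 60 RPM.
```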
The Gemini CLI free tier guide covers the full breakdown of what you get without paying.
Codex CLI
Codex CLI uses 5-hour windows plus weekly quotas, structurally similar to Claude Code but with different specifics.
The current state: All paid tiers are running at 2x promotional limits. The Pro tier ($200/month) is advertised as having 6x the Plus rate, but during the promotional period, it's capped at 2x -- identical to Plus and Business. This means the entire tier structure is compressed, and the value differentiation between a $20/month and $200/month plan is minimal for Codex specifically.
The pain point: Pro users report hitting 5-hour rate limits in under 2 hours of sustained coding. When you're paying $200/month and getting throttled before lunch, the value proposition gets shaky. This is a direct consequence of the promotional rate limits compressing the tier differences.
Weekly quotas act as a secondary ceiling. Even if individual 5-hour windows aren't an issue, sustained heavy usage across the week can trigger the weekly limit. Community reports suggest the weekly limit roughly corresponds to 10-12 full 5-hour windows worth of usage, though this isn't officially confirmed.
The sandbox factor: One thing to keep in mind is that Codex CLI's sandboxed execution model means the tool does a lot of work per user request. Running code, testing it, iterating on fixes -- this generates more token usage than tools that just produce code without executing it. The rate limits need to account for this higher per-request overhead.
For the full Codex pricing breakdown, see our Codex subscription guide.
GitHub Copilot
Copilot uses monthly premium request quotas with a multiplier system that makes the effective limits highly variable.
| Plan | Premium Requests/Month | Overage Rate |
|---|---|---|
| Free | 50 | Hard cap |
| Pro | 300 | $0.04/request |
| Pro+ | 1,500 | $0.04/request |
| Enterprise | 1,000/user | $0.04/request |
The multiplier problem: Not every interaction costs one premium request. Using GPT-4.5 in chat costs 50x, meaning a single interaction consumes 50 premium requests. Claude Opus 4 in chat costs 10x. Coding agent sessions cost 1x per session regardless of length.
What this means in practice: A Pro user with 300 premium requests per month gets:
- 300 coding agent sessions, OR
- 30 Claude Opus 4 chat interactions, OR
- 6 GPT-4.5 chat interactions
Those are wildly different experiences masked behind the same "300 premium requests" number. A developer who primarily uses the coding agent has essentially unlimited premium access. A developer who loves GPT-4.5 for chat burns through their allocation in less than a week.
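The multiplier arithmetic is simple enough to script. This sketch uses the multipliers stated above; the function and table names are illustrative:

```python
# Effective Copilot interactions under the multiplier system.
# Multipliers from the text: coding agent 1x, Claude Opus 4 10x,
# GPT-4.5 50x, included models such as GPT-4.1 0x.
MULTIPLIERS = {
    "coding-agent": 1,
    "claude-opus-4": 10,
    "gpt-4.5": 50,
    "gpt-4.1": 0,  # included model: draws nothing from the quota
}

def interactions_left(remaining_requests: int, model: str) -> float:
    mult = MULTIPLIERS[model]
    if mult == 0:
        return float("inf")  # effectively unlimited
    return remaining_requests // mult

# A fresh Pro quota of 300 premium requests:
print(interactions_left(300, "coding-agent"))   # 300
print(interactions_left(300, "claude-opus-4"))  # 30
print(interactions_left(300, "gpt-4.5"))        # 6
```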
The reset timing complication: Quotas reset on the 1st of every month, not your billing date. This creates an awkward mismatch -- you might be billed on the 15th but your quota resets on the 1st. It also means late-month subscribers get a partial first month at full price.
The Copilot premium requests guide breaks down strategies for navigating this system.
Side-by-Side Comparison
Here's how the four tools stack up on key rate limit characteristics:
| Characteristic | Claude Code | Gemini CLI | Codex CLI | GitHub Copilot |
|---|---|---|---|---|
| Measurement unit | Tokens (weighted) | Requests (counted) | Tokens (opaque) | Premium requests (multiplied) |
| Reset cycle | Rolling 5-hour + weekly | Per-minute + per-day | Rolling 5-hour + weekly | Monthly (1st of month) |
| Transparency | Low (estimated) | High (published) | Low (estimated) | Medium (published but multipliers confuse) |
| Free tier limits | N/A (no free tier) | 60 RPM, 1K/day | Limited-time, minimal | 50 premium/month |
| Overage option | Upgrade tier | Pay-as-you-go API | Upgrade tier | $0.04/request |
| Biggest frustration | Opaque token counting | Flash-only on free | Pro throttled during promo | 50x GPT-4.5 multiplier |
Common Frustrations Across All Tools
Before diving into scenarios, it's worth naming the frustrations that developers report across every tool:
Opaque measurement. Whether it's tokens, requests, or premium request multipliers, most developers can't predict how much a given coding task will "cost" in rate limit terms. You don't know if your next question will push you over the limit until it's too late.
Unpredictable depletion. Rolling windows mean your available capacity changes continuously. Monthly quotas mean your budget depends on the calendar. Neither is intuitive. Developers want to know "can I finish this task?" and the answer is always "it depends."
Rolling window confusion. The concept of a "rolling 5-hour window" is genuinely confusing. Does the window start from my first request? Does it move continuously? What happens if I don't use it for 3 hours -- do I get more capacity? (Yes, the oldest usage rolls off, freeing up capacity. But this isn't obvious.)
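The mechanics are easier to see as a toy model: treat the window as a queue of timestamped usage, where each entry ages out exactly 5 hours after it was incurred. The Pro budget here is the community estimate from earlier, not an official number:

```python
import collections

WINDOW_SECONDS = 5 * 3600

class RollingWindow:
    """Toy model of a rolling 5-hour token window."""

    def __init__(self, budget: int):
        self.budget = budget
        self.events = collections.deque()  # (timestamp, tokens)

    def record(self, t: float, tokens: int):
        self.events.append((t, tokens))

    def used(self, now: float) -> int:
        # Oldest usage rolls off once it is 5 hours old, freeing capacity.
        while self.events and now - self.events[0][0] >= WINDOW_SECONDS:
            self.events.popleft()
        return sum(tok for _, tok in self.events)

    def remaining(self, now: float) -> int:
        return self.budget - self.used(now)

w = RollingWindow(budget=44_000)  # community-estimated Pro budget
w.record(0, 30_000)               # heavy burst at t=0
print(w.remaining(3 * 3600))      # 3h later: 14_000 -- the burst still counts
print(w.remaining(5 * 3600))      # 5h later: 44_000 -- the burst has rolled off
```

The second print answers the common question directly: idle time does restore capacity, but only once each chunk of usage crosses the 5-hour mark.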
Multiplier math. GitHub Copilot's multiplier system requires developers to do mental arithmetic before every interaction. "If I use Claude Opus 4 for this question, that's 10 premium requests, and I've used 180 this month, so I have 120 left, which means 12 more Opus interactions..." Nobody should have to do this during a debugging session.
Real-World Impact
Let's translate these limits into actual development scenarios to understand what they mean for daily work.
Scenario: 4-Hour Feature Sprint
You're implementing a medium-sized feature -- new API endpoint, business logic, tests, and documentation. It's a focused 4-hour block.
- Claude Code (Max 5x): Comfortable. ~88K tokens per 5-hour window gives you substantial headroom for loading files, iterating on implementations, and generating tests. You'll probably use 50-70% of your window, leaving room for unexpected debugging.
- Gemini CLI (free): Tight but possible. 1,000 daily requests sounds like a lot, but agentic sessions generate many API calls per user action. You might hit daily limits toward the end of the sprint. The 60 RPM limit is unlikely to be an issue for interactive use. At a few internal calls per action, 1,000 requests translates to roughly 200-300 user-level interactions, which should cover a 4-hour sprint if you're not wasteful.
- Codex CLI (Plus): Risky. Users report hitting 5-hour limits in under 2 hours of sustained use. You might need to take forced breaks mid-sprint, which destroys productivity and context. Having another tool ready as a fallback is advisable.
- GitHub Copilot (Pro): Depends on model choice. With included models (GPT-4.1, GPT-5 mini), you're unlimited for inline suggestions. If you rely on premium models for chat guidance alongside coding, a single sprint can take a noticeable bite out of 300 monthly requests. A 4-hour sprint with 10-15 Opus interactions would consume 100-150 premium requests -- half your monthly budget in one afternoon.
Scenario: Debugging a Production Issue
It's 2 AM, production is down, and you need AI assistance to trace through logs, identify the root cause, and implement a fix. Time pressure is high, and you're going to be rapid-firing prompts.
- Claude Code: The rolling window works in your favor here. If you haven't been coding for the past few hours, your 5-hour window is fresh. You have your full allocation for this emergency session. This is one of the best aspects of rolling windows -- they reward bursty usage patterns.
- Gemini CLI: Per-minute limits (60 RPM) are generous enough for rapid debugging. Daily limits reset at midnight Pacific, so if this is a 2 AM emergency, you've got a fresh 1,000 requests. The free tier might actually be your best friend in a crisis.
- Codex CLI: Same rolling window advantage as Claude Code -- if you haven't been using it, your window is full. But with the promotional 2x limits, a sustained debugging session could still hit the ceiling. The sandbox environment is useful here since you can safely test fixes without touching production.
- GitHub Copilot: Monthly limits mean your available budget depends on what day of the month it is. If it's the 28th and you've been using premium models all month, you might be working with a nearly depleted quota. At $0.04 per overage, the cost of a debugging emergency is manageable -- maybe $5-10 in overages -- but having to think about cost during an outage is the wrong cognitive load at the wrong time.
Scenario: All-Day AI-Assisted Development
You want to use your AI coding tool as a persistent pair programmer for an entire 8-hour workday.
- Claude Code (Pro): You will hit limits. ~44K tokens per 5-hour window means you'll be throttled at least once during an 8-hour day, with a forced slowdown while your oldest usage rolls off. The slowdown typically lasts 30-60 minutes as old usage ages out. Upgrading to Max 5x or Max 20x is the intended path for this use case.
- Gemini CLI (free): 1,000 requests might last through an 8-hour day of casual use, but not heavy agentic use. For sustained pair programming, you'll want an API key with pay-as-you-go billing. Budget roughly 125 requests per hour of heavy use, a pace that exhausts 1,000 requests in about 8 hours.
- Codex CLI (Plus): Likely two or three forced breaks during an 8-hour day based on reported 5-hour limit depletion rates. The weekly quota might also become a factor by Thursday if you're doing this daily. By Friday, you might find yourself rationing usage or switching to another tool.
- GitHub Copilot (Pro): Inline suggestions are unlimited, so the pair-programming experience for tab completions is uninterrupted all day, every day. Premium model usage needs to be rationed across the month. If you rely on premium chat for 5-10 interactions per day, that's 100-200 requests per month -- feasible on Pro but tight in months with heavy debugging.
Strategies to Avoid Hitting Limits
1. Monitor Before You're Blocked
Every tool has some form of usage indicator, though quality varies:
- Claude Code shows usage status in the terminal interface with a progress indicator
- Gemini CLI provides clear request count information in API responses
- Codex CLI shows usage in the CLI interface
- GitHub Copilot has a usage dashboard on GitHub.com under Settings > Copilot
Check your usage at the start of each coding session so you know your budget. It takes 10 seconds and can save you the frustration of an unexpected limit mid-task.
2. Right-Size Your Context
As we covered in our context windows guide, larger context means more tokens consumed per interaction. Loading your entire codebase when you only need three files wastes tokens and accelerates rate limit depletion.
Be deliberate about what you include. Most AI coding CLIs let you specify files or directories. Use that control to minimize per-interaction token cost. A well-targeted 30K-token interaction gives you 2-3x more interactions per window than a sprawling 80K-token interaction with the same model.
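The throughput effect of context size is pure arithmetic. Here it is against the community-estimated window budgets from earlier, comparing a targeted interaction to a sprawling one:

```python
def interactions(budget_tokens: int, per_interaction_tokens: int) -> int:
    """How many same-sized interactions fit in one window budget."""
    return budget_tokens // per_interaction_tokens

# Community-estimated Claude Code window budgets:
for plan, budget in {"Pro": 44_000, "Max 5x": 88_000, "Max 20x": 220_000}.items():
    tight = interactions(budget, 30_000)   # well-targeted context
    sprawl = interactions(budget, 80_000)  # whole-codebase context
    print(plan, tight, "vs", sprawl, "interactions per window")
```

On the larger budgets the targeted interaction yields roughly three times the throughput of the sprawling one.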
3. Batch Your Heavy Work
If you know you have a token-intensive task (large refactor, comprehensive code review, full-feature implementation), schedule it for when your rate limits are fresh. For rolling-window tools like Claude Code and Codex, this means starting after a period of low usage. For monthly-quota tools like Copilot, this means early in the month.
Some developers adopt a rhythm: heavy AI-assisted coding in the morning when limits are fresh, manual coding and reviews in the afternoon while limits recover. This isn't ideal -- your tools should work when you need them -- but it's a practical adaptation to current constraints.
4. Use the Right Model for the Task
Not every task needs the most powerful model. Quick questions, simple refactors, and boilerplate generation work fine with smaller, faster models that consume fewer resources. Save the heavy models for tasks that genuinely require them -- complex architectural decisions, subtle bug hunting, and nuanced code review.
This applies especially to GitHub Copilot's multiplier system. Using Claude Opus 4 (10x) instead of GPT-4.5 (50x) for chat gives you 5x more interactions. Using included models like GPT-4.1 (0x) gives you unlimited interactions. The quality difference between GPT-4.1 and GPT-4.5 is real but not always relevant for the task at hand.
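One way to operationalize this is a small cost-aware routing table. Everything below -- the task categories, the fallback rule, the `ROUTES` mapping -- is a hypothetical sketch, not a Copilot feature; only the multipliers come from the text:

```python
# Hypothetical cost-aware model router for Copilot-style multipliers.
ROUTES = {
    "boilerplate": ("gpt-4.1", 0),        # included model, no quota draw
    "quick-question": ("gpt-4.1", 0),
    "code-review": ("claude-opus-4", 10),
    "architecture": ("gpt-4.5", 50),      # reserve the 50x model for hard problems
}

def pick_model(task: str, remaining: int):
    model, cost = ROUTES[task]
    if cost > remaining:
        # Fall back to an included model rather than overdraw the quota.
        return "gpt-4.1", 0
    return model, cost

print(pick_model("architecture", remaining=300))  # ('gpt-4.5', 50)
print(pick_model("architecture", remaining=20))   # ('gpt-4.1', 0) -- fallback
```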
5. Write Better Prompts
This sounds obvious, but it has a direct impact on rate limits. A well-crafted prompt that produces the right result on the first try consumes half the tokens of a vague prompt that requires three rounds of clarification. Include relevant context, be specific about what you want, and provide examples when the task is ambiguous.
Specific prompt improvements that reduce rate limit consumption:
- State the desired output format ("generate a TypeScript function that..." vs. "help me with this code")
- Include constraints upfront ("must handle null inputs and return early" vs. discovering this after two rounds)
- Provide examples of the pattern you want (one good example saves three clarification rounds)
The best practices guide covers prompt engineering strategies specific to AI coding CLIs.
6. Cache and Reuse
If your tool supports it, take advantage of context caching. Gemini's 75% caching discount means repeated context is nearly free. For tools without explicit caching, you can still reduce per-interaction cost by maintaining a consistent project context across sessions rather than rebuilding it each time.
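The caching math, as a sketch: this computes effective input-token cost under an assumed 75% discount on the cached portion of the context. It's a billing-cost approximation, not a statement about how cached tokens count toward rate limits:

```python
def effective_input_tokens(context: int, cached_fraction: float,
                           discount: float = 0.75) -> float:
    """Effective (billed-equivalent) input tokens when part of the
    context is served from cache at a discount."""
    cached = context * cached_fraction
    fresh = context - cached
    return fresh + cached * (1 - discount)

# A 100K-token project context where 80% is unchanged between interactions:
print(effective_input_tokens(100_000, cached_fraction=0.8))  # 40000.0
```

Reusing a stable context rather than rebuilding it each turn is what drives `cached_fraction` up.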
The Multi-Tool Approach
Here's the strategy an increasing number of developers are adopting: don't rely on a single tool.
Each AI coding CLI has different rate limit structures, different strengths, and different billing cycles. By distributing your workload across multiple tools, you can:
- Avoid hitting any single tool's ceiling. When Claude Code's 5-hour window runs out, switch to Gemini CLI for the next hour while Claude's oldest usage rolls off.
- Match tools to tasks. Use GitHub Copilot's unlimited inline suggestions for tab completions, Claude Code for deep agentic work, and Gemini CLI for quick explorations on the free tier.
- Reduce total cost. Instead of upgrading to the highest tier of one tool, a mid-tier subscription to two tools might give you more total throughput at lower cost.
- Eliminate single points of failure. If one provider has an outage or changes their rate limits, your workflow isn't completely blocked.
Example multi-tool day:
- 9:00-11:00 AM: Claude Code for a focused feature implementation (uses ~60% of 5-hour window)
- 11:00-12:00 PM: Gemini CLI for code review and documentation (uses ~100 of 1,000 daily requests)
- 1:00-3:00 PM: Copilot inline suggestions for implementation work (unlimited, no cost)
- 3:00-5:00 PM: Claude Code for testing and debugging (window has recovered from morning usage)
Total cost: one Claude Code subscription + zero for Gemini free tier + one Copilot subscription. Total rate limit interruptions: zero.
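The fallback logic behind a day like this can be sketched as a preference-ordered tool picker. Tool names and the headroom threshold are illustrative:

```python
def pick_tool(budgets: dict, preference: list, min_headroom: float = 0.10):
    """budgets maps tool -> fraction of its own limit remaining.
    Returns the first preferred tool with headroom, else None."""
    for tool in preference:
        if budgets.get(tool, 0.0) >= min_headroom:
            return tool
    return None  # everything exhausted: wait for a window or quota reset

budgets = {"claude-code": 0.05, "gemini-cli": 0.60, "copilot": 1.0}
print(pick_tool(budgets, ["claude-code", "gemini-cli", "copilot"]))
# Claude's window is nearly drained, so work routes to Gemini CLI.
```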
The engineering manager's guide to multi-tool workflows covers how to implement this strategy at the team level, including Git workflow considerations for teams using multiple AI tools.
What Providers Should Fix
As practitioners, we shouldn't just accept opaque rate limits as the cost of doing business. Here's what the industry needs:
Publish exact numbers. Community estimates of token budgets shouldn't be necessary. If the limit is 44,000 tokens per 5-hour window, say so. Let developers plan. Rough estimates lead to rough planning.
Show real-time usage. A progress bar showing "you've used 73% of your current window" is table stakes. Most tools offer some version of this, but it's often buried or imprecise. This should be a first-class UI element, visible at a glance.
Warn before cutting off. "You're at 90% of your 5-hour window" is infinitely better than "limit reached." Let developers wrap up their current task before hitting the wall. A 30-second warning is all it takes.
Explain what counts. Token-weighted systems should show per-interaction token costs. Multiplier systems should display the effective cost before you hit Send. Transparency builds trust, and trust builds loyalty.
Provide graceful degradation. Instead of a hard stop, offer a reduced-quality fallback. "You've hit your Opus limit, switching to Sonnet" is better than "You've hit your limit, goodbye." Some tools are starting to implement this, and it's a pattern the entire industry should adopt.
The Bottom Line
Rate limits are a fact of life with AI coding tools, but understanding how they work turns them from random interruptions into predictable constraints you can plan around.
If you want maximum transparency, Gemini CLI's published per-minute and per-day limits are the clearest in the industry. You always know where you stand.
If you want maximum throughput, Claude Code's Max 20x tier at ~220K tokens per 5-hour window offers the highest sustained ceiling for individual developers.
If you want the best free experience, Gemini CLI's 1,000 requests per day dwarfs every other free offering. See the free tier guide for maximizing it.
If you want to avoid rate limits entirely, the multi-tool approach distributes your usage across multiple providers, each with independent limits and reset cycles. It requires more setup but delivers more total AI coding capacity than any single subscription.
The complete CLI pricing guide puts rate limits in the context of overall cost, and the CLI comparison evaluates how rate limits interact with each tool's capabilities. Rate limits are just one factor -- but they're the one that interrupts your work, which makes them worth understanding thoroughly.