TL;DR

OpenAI rate limits are enforced per organization and per model; API keys share the organization’s limits rather than getting their own. To set them in production:

  1. Open platform.openai.com → Settings → Limits and choose your tier or request an upgrade.
  2. Set a monthly hard cap and a soft alert under Settings → Billing → Usage limits.
  3. In code, treat 429s as expected: read Retry-After, back off exponentially, and surface the wait in your agent’s planner.
  4. Track per-tenant token spend in your own database; cut off agents whose hourly burn exceeds budget before OpenAI does.

What rate limits OpenAI actually enforces

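For each model, OpenAI enforces limits on requests per minute (RPM) and tokens per minute (TPM); the ceilings depend on your usage tier.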

Tier 1 starts low. Tier 5 is generous but requires sustained spend. Always check your real limits in the dashboard, not the docs; they evolve.

Step-by-step: hard limits in the OpenAI dashboard

  1. Sign in at platform.openai.com.
  2. Settings → Limits: see your current per-model TPM/RPM. Submit an increase request if you need more; approval is faster with usage history.
  3. Settings → Billing → Usage limits: set a Hard limit (cuts off API calls at this dollar amount) and a Soft limit (sends an email alert). Hard limits are the seatbelt that prevents a buggy agent from spending $40K overnight.
  4. API keys: use a separate key per environment (dev / staging / prod). Each key inherits the org-level limits but is logged separately, which makes diagnosis far easier.
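
One way to keep environments on separate keys is to resolve the key from an environment-specific variable at startup. The sketch below is illustrative only: the variable names (OPENAI_API_KEY_DEV and so on) and the helper api_key_for are hypothetical, not an OpenAI convention.

```python
import os

# Hypothetical variable names; adapt them to your own secret store.
KEY_ENV_VARS = {
    "dev": "OPENAI_API_KEY_DEV",
    "staging": "OPENAI_API_KEY_STAGING",
    "prod": "OPENAI_API_KEY_PROD",
}

def api_key_for(environment, env=os.environ):
    """Return the API key configured for the given deployment environment."""
    try:
        var = KEY_ENV_VARS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    key = env.get(var)
    if not key:
        # Fail loudly instead of silently falling back to another environment's key.
        raise RuntimeError(f"{var} is not set")
    return key
```

Failing when the variable is missing, rather than falling back to a shared key, keeps each environment's usage separately attributable in the logs.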

Code: handling 429s without crashing your agent

When the API returns 429, three rules:

  1. Always read Retry-After if present; otherwise back off exponentially with jitter (e.g., 2^attempt × random(0.5, 1.5) seconds).
  2. Cap retries at 5; beyond that, surface the failure to the agent’s planner and pause the workflow rather than retry indefinitely.
  3. Distinguish RPM from TPM: RPM throttling clears in seconds; TPM throttling can take a full minute. Different waits.
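
import random
import time

from openai import RateLimitError

def call_with_backoff(client, **kwargs):
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Prefer the server's Retry-After header (assumed to be in seconds);
            # otherwise back off exponentially with jitter, capped at 60 s.
            header = e.response.headers.get("retry-after")
            wait = float(header) if header else min(60, 2 ** attempt * random.uniform(0.5, 1.5))
            time.sleep(wait)
    raise RuntimeError("rate limited 5 times; paused")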
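
To act on rule 3, the response headers can hint at which limit was hit. The sketch below assumes the x-ratelimit-remaining-* and x-ratelimit-reset-* headers are present on the response and that reset values use duration strings such as "1s" or "6m0s". The function names are illustrative, and treating "remaining is zero" as the signal is a heuristic, not a guarantee.

```python
import re

# Units seen in duration strings such as "6m0s" or "250ms".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so it is not misread as minutes followed by a stray "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")

def parse_reset(value):
    """Convert a duration string like "1m30s" into seconds."""
    return sum((float(n) * _UNIT_SECONDS[u] for n, u in _PART.findall(value)), 0.0)

def classify_throttle(headers):
    """Return (which_limit, seconds_until_reset) from rate-limit headers.

    Heuristic: whichever "remaining" counter has hit zero is assumed to be
    the limit that triggered the 429.
    """
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 1))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 1))
    if remaining_requests <= 0:
        return "rpm", parse_reset(headers.get("x-ratelimit-reset-requests", "1s"))
    if remaining_tokens <= 0:
        return "tpm", parse_reset(headers.get("x-ratelimit-reset-tokens", "60s"))
    return "unknown", 0.0
```

Inside the except block of a retry loop, something like `which, wait = classify_throttle(e.response.headers)` would give the planner both the cause and a concrete wait to report.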

Per-tenant rate limits inside your own app

OpenAI’s limit is across your whole org. If you serve multiple tenants, you need a second limit per tenant so one customer’s runaway agent can’t starve everyone else.

Pattern: store tokens_used_in_last_hour per tenant in Redis with a sliding window. Reject calls when a tenant exceeds their budget; tell the user “rate-limited by Musketeers, retry in N seconds.” This is the difference between graceful degradation and a 3am incident.
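
The sliding-window pattern above can be sketched as follows. This is an in-memory stand-in, not the Redis version: it keeps per-tenant events in a deque so the logic is easy to read and test, whereas a production setup would hold the same data in Redis (a sorted set keyed by timestamp is one option). The class name TenantTokenBudget and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque

class TenantTokenBudget:
    """Sliding-window token budget per tenant (in-memory stand-in for Redis)."""

    def __init__(self, tokens_per_window, window_seconds=3600.0, clock=time.monotonic):
        self.limit = tokens_per_window
        self.window = window_seconds
        self.clock = clock
        self._events = defaultdict(deque)  # tenant_id -> deque of (timestamp, tokens)
        self._used = defaultdict(int)      # tenant_id -> tokens inside the window

    def _expire(self, tenant_id, now):
        # Drop events that have slid out of the window.
        events = self._events[tenant_id]
        while events and events[0][0] <= now - self.window:
            _, tokens = events.popleft()
            self._used[tenant_id] -= tokens

    def try_spend(self, tenant_id, tokens):
        """Return (allowed, retry_in_seconds)."""
        now = self.clock()
        self._expire(tenant_id, now)
        if self._used[tenant_id] + tokens <= self.limit:
            self._events[tenant_id].append((now, tokens))
            self._used[tenant_id] += tokens
            return True, 0.0
        # Walk the oldest events to find when enough budget frees up.
        needed = self._used[tenant_id] + tokens - self.limit
        freed = 0
        for timestamp, spent in self._events[tenant_id]:
            freed += spent
            if freed >= needed:
                return False, max(0.0, timestamp + self.window - now)
        # The request alone exceeds the whole budget.
        return False, self.window
```

The second element of the return value is the N in “retry in N seconds,” so the rejection message can tell the user exactly how long to wait.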

What to monitor

April 24, 2026 Musketeers Tech