TL;DR

OpenAI rate limits are enforced per organization and per model; individual API keys all draw from the same org-wide pool. To manage them in production:

  1. Open platform.openai.com → Settings → Limits to review your current tier and per-model limits, or request an upgrade.
  2. Set an org-wide monthly hard cap and a soft alert under Settings → Billing → Usage limits.
  3. In code, treat 429s as expected — read Retry-After, back off exponentially, and surface the wait in your agent’s planner.
  4. Track per-tenant token spend in your own database; cut off agents whose hourly burn exceeds budget before OpenAI does.

What rate limits OpenAI actually enforces

For each model, OpenAI enforces two independent limits: RPM (requests per minute) and TPM (tokens per minute). You can exhaust either one first; a chatty agent burns through RPM, a long-context one through TPM.

Both scale with your usage tier. Tier 1 starts low; Tier 5 is generous but requires sustained spend. Always check your real limits in the dashboard, not the docs, because they evolve.
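
You can also read your live limits programmatically: every API response carries them in x-ratelimit-* headers. A minimal sketch, assuming the openai v1 Python SDK (whose with_raw_response accessor exposes raw headers); the model name is illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # illustrative; use whichever model you're limited on
    messages=[{"role": "user", "content": "ping"}],
)
for header in (
    "x-ratelimit-limit-requests", "x-ratelimit-remaining-requests",
    "x-ratelimit-limit-tokens", "x-ratelimit-remaining-tokens",
):
    print(header, raw.headers.get(header))
completion = raw.parse()  # the usual ChatCompletion object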

Step-by-step: hard limits in the OpenAI dashboard

  1. Sign in at platform.openai.com.
  2. Settings → Limits: see your current per-model TPM/RPM. Submit an increase request if you need more — approval is faster with usage history.
  3. Settings → Billing → Usage limits: set a Hard limit (cuts off API calls at this dollar amount) and a Soft limit (sends an email alert). Hard limits are the seatbelt that prevents a buggy agent from spending $40K overnight.
  4. API keys: rotate keys per environment (dev / staging / prod). Each key inherits the org-level limits but is logged separately, which makes diagnosis far easier.
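
For step 4, a minimal sketch of environment-scoped keys; the variable names and the APP_ENV switch are illustrative, not a convention the SDK imposes:

import os

from openai import OpenAI

# One key per environment: OPENAI_API_KEY_DEV / _STAGING / _PROD.
# The choice is made once, at startup, so a prod deploy can never
# silently burn the dev key's quota (or vice versa).
env = os.environ.get("APP_ENV", "dev").upper()
client = OpenAI(api_key=os.environ[f"OPENAI_API_KEY_{env}"])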

Code: handling 429s without crashing your agent

When the API returns 429, three rules:

  1. Always read Retry-After if it's present; otherwise back off exponentially with jitter (e.g., 2^attempt × random(0.5, 1.5) seconds).
  2. Cap retries at 5 — beyond that, surface the failure to the agent’s planner and pause the workflow rather than retry indefinitely.
  3. Distinguish RPM from TPM — RPM throttling clears in seconds; TPM throttling can take a full minute. Different waits.

A minimal handler, assuming the openai v1 Python SDK, where RateLimitError exposes the raw httpx response and its headers:

import random
import time

from openai import RateLimitError

def call_with_backoff(client, **kwargs):
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Rule 1: honor Retry-After when the server sends it; otherwise
            # exponential backoff with jitter, capped at the ~60s TPM window.
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else min(60, (2 ** attempt) * random.uniform(0.5, 1.5))
            time.sleep(wait)
    raise RuntimeError("rate limited 5 times; surfacing to planner")  # rule 2
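
To apply rule 3, inspect the error response's x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens headers when present (e.response.headers in the sketch above): whichever pool is exhausted tells you whether a few seconds or a full minute of waiting is appropriate.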

Per-tenant rate limits inside your own app

OpenAI’s limit is across your whole org. If you serve multiple tenants, you need a second limit per tenant so one customer’s runaway agent can’t starve everyone else.

Pattern: store tokens_used_in_last_hour per tenant in Redis with a sliding window. Reject calls when a tenant exceeds their budget; tell the user “rate-limited by Musketeers, retry in N seconds.” This is the difference between graceful degradation and a 3am incident.
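
A minimal sketch of that pattern with redis-py; the key layout, function name, and budget handling are illustrative, and a production version would wrap the check in a Lua script or pipeline so it stays atomic under concurrency:

import time
import uuid

import redis

r = redis.Redis()  # assumes a reachable Redis instance
HOUR = 3600

def try_spend(tenant_id: str, tokens: int, budget: int) -> bool:
    # Sliding one-hour window: reject if this call would exceed the budget.
    key = f"tenant:{tenant_id}:tokens"
    now = time.time()
    # Evict spend records older than an hour, then total what's left.
    r.zremrangebyscore(key, 0, now - HOUR)
    used = sum(int(m.decode().rsplit(":", 1)[1]) for m in r.zrange(key, 0, -1))
    if used + tokens > budget:
        return False
    # Record this spend: each member is a unique id plus the token count.
    r.zadd(key, {f"{uuid.uuid4().hex}:{tokens}": now})
    r.expire(key, HOUR)
    return True

The "retry in N seconds" hint falls out of the same structure: on a False return, the oldest surviving entry's score plus HOUR, minus now, is when budget frees up.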

What to monitor
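
At minimum, watch the four signals this post has already set up:

  1. 429 counts per key and per model. A rising rate means you are under-provisioned or an agent is retrying in a loop.
  2. The x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens response headers, so you see throttling coming before it arrives.
  3. Per-tenant hourly token burn against budget, straight from the Redis counter above.
  4. Spend against your soft and hard billing limits, so the hard cap never fires as a surprise.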
