TL;DR
OpenAI rate limits are enforced per organization and per model; API keys share the org's limits but are metered separately. To set them in production:
- Open platform.openai.com → Settings → Limits and choose your tier or request an upgrade.
- Set monthly hard caps and soft alerts for the org under Settings → Billing → Usage limits.
- In code, treat 429s as expected — read `Retry-After`, back off exponentially, and surface the wait in your agent’s planner.
- Track per-tenant token spend in your own database; cut off agents whose hourly burn exceeds budget before OpenAI does.
What rate limits OpenAI actually enforces
For each model, OpenAI enforces:
- Requests per minute (RPM) — total API calls
- Tokens per minute (TPM) — combined input + output tokens
- Tokens per day (TPD) — daily ceiling for some tiers
- Images per minute — for image-generation models
Tier 1 starts low. Tier 5 is generous but requires sustained spend. Always check your real limits in the dashboard, not the docs — they evolve.
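The dashboard is the source of truth, but OpenAI also reports your live limits on every API response via `x-ratelimit-*` headers. A minimal sketch of reading them — the parse function and the Tier 1 numbers below are illustrative, not real quota values:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract the x-ratelimit-* headers OpenAI attaches to each response."""
    return {
        "requests_limit": int(headers.get("x-ratelimit-limit-requests", 0)),
        "requests_remaining": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_limit": int(headers.get("x-ratelimit-limit-tokens", 0)),
        "tokens_remaining": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# Headers from a hypothetical Tier 1 response:
snapshot = parse_rate_limit_headers({
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-limit-tokens": "30000",
    "x-ratelimit-remaining-tokens": "29500",
})
tpm_utilization = 1 - snapshot["tokens_remaining"] / snapshot["tokens_limit"]
```

Logging this on every call gives you real-time utilization without polling the dashboard.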
Step-by-step: hard limits in the OpenAI dashboard
- Sign in at platform.openai.com.
- Settings → Limits: see your current per-model TPM/RPM. Submit an increase request if you need more — approval is faster with usage history.
- Settings → Billing → Usage limits: set a Hard limit (cuts off API calls at this dollar amount) and a Soft limit (sends an email alert). Hard limits are the seatbelt that prevents a buggy agent from spending $40K overnight.
- API keys: rotate keys per environment (dev / staging / prod). Each key inherits the org-level limits but is logged separately, which makes diagnosis far easier.
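One way to wire up the per-environment keys from the last step — the `OPENAI_API_KEY_DEV` / `_STAGING` / `_PROD` variable names are an assumption, not an OpenAI convention:

```python
import os

def api_key_for_env(app_env: str) -> str:
    """Pick the key for the current environment so dev traffic never burns prod quota."""
    var = f"OPENAI_API_KEY_{app_env.upper()}"
    key = os.environ.get(var)
    if not key:
        # Fail loudly rather than silently falling back to another environment's key.
        raise RuntimeError(f"missing {var}")
    return key
```

Failing hard on a missing key is deliberate: a silent fallback is how a load test ends up running against your prod quota.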
Code: handling 429s without crashing your agent
When the API returns 429, three rules:
- Always read `Retry-After` if present; otherwise back off exponentially with jitter (e.g., 2^attempt × random(0.5, 1.5) seconds).
- Cap retries at 5 — beyond that, surface the failure to the agent’s planner and pause the workflow rather than retry indefinitely.
- Distinguish RPM from TPM — RPM throttling clears in seconds; TPM throttling can take a full minute. Different waits.
    import random
    import time

    from openai import RateLimitError

    def call_with_backoff(client, **kwargs):
        for attempt in range(5):
            try:
                return client.chat.completions.create(**kwargs)
            except RateLimitError as e:
                # Honor Retry-After when the server sends it; otherwise
                # exponential backoff with jitter, capped at 60 seconds.
                retry_after = e.response.headers.get("retry-after")
                wait = float(retry_after) if retry_after else min(60, 2 ** attempt + random.uniform(0, 1))
                time.sleep(wait)
        raise RuntimeError("rate limited 5 times — pausing workflow")
Per-tenant rate limits inside your own app
OpenAI’s limit is across your whole org. If you serve multiple tenants, you need a second limit per tenant so one customer’s runaway agent can’t starve everyone else.
Pattern: store tokens_used_in_last_hour per tenant in Redis with a sliding window. Reject calls when a tenant exceeds their budget; tell the user “rate-limited by Musketeers, retry in N seconds.” This is the difference between graceful degradation and a 3am incident.
What to monitor
- 429 rate over time — sustained 429s mean you should request a tier increase.
- Average TPM utilization — keep below 80% of cap; spikes happen.
- Per-tenant token spend — flag any tenant burning >5× their median hour.
- Cost per workflow — if an agent is suddenly 10× more expensive per task, model regression or prompt drift is the usual cause.
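The ">5× median hour" check from the list above can be sketched as follows, assuming you already aggregate hourly token spend per tenant (the data shape and function name are illustrative):

```python
from statistics import median

def flag_anomalous_tenants(hourly_spend: dict[str, list[int]], factor: float = 5.0) -> list[str]:
    """Flag tenants whose latest hour burns more than `factor` x their own median hour."""
    flagged = []
    for tenant, hours in hourly_spend.items():
        if len(hours) < 2:
            continue  # not enough history to establish a baseline
        baseline = median(hours[:-1])
        if baseline > 0 and hours[-1] > factor * baseline:
            flagged.append(tenant)
    return flagged
```

Comparing each tenant against its own median, rather than a global threshold, keeps a naturally heavy tenant from drowning out a small one that suddenly goes runaway.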