TL;DR
OpenAI rate limits are enforced per organization and per model; API keys share the org's limits but are metered separately. To set them in production:
- Open platform.openai.com → Settings → Limits and choose your tier or request an upgrade.
- Set monthly hard caps and soft alerts for the org under Settings → Billing → Usage limits.
- In code, treat 429s as expected — read `Retry-After`, back off exponentially, and surface the wait in your agent’s planner.
- Track per-tenant token spend in your own database; cut off agents whose hourly burn exceeds budget before OpenAI does.
What rate limits OpenAI actually enforces
For each model, OpenAI enforces:
- Requests per minute (RPM) — total API calls
- Tokens per minute (TPM) — combined input + output tokens
- Tokens per day (TPD) — daily ceiling for some tiers
- Images per minute — for image-generation models
Tier 1 starts low. Tier 5 is generous but requires sustained spend. Always check your real limits in the dashboard, not the docs — they evolve.
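The dashboard is the source of truth, but OpenAI also reports your live limits on every API response via `x-ratelimit-*` headers. A minimal sketch of reading them — the parse function and the Tier 1 numbers below are illustrative, not real quota values:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract the x-ratelimit-* headers OpenAI attaches to each response."""
    return {
        "requests_limit": int(headers.get("x-ratelimit-limit-requests", 0)),
        "requests_remaining": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_limit": int(headers.get("x-ratelimit-limit-tokens", 0)),
        "tokens_remaining": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# Headers from a hypothetical Tier 1 response:
snapshot = parse_rate_limit_headers({
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-limit-tokens": "30000",
    "x-ratelimit-remaining-tokens": "29500",
})
tpm_utilization = 1 - snapshot["tokens_remaining"] / snapshot["tokens_limit"]
```

Logging this on every call gives you real-time utilization without polling the dashboard.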
Step-by-step: hard limits in the OpenAI dashboard
- Sign in at platform.openai.com.
- Settings → Limits: see your current per-model TPM/RPM. Submit an increase request if you need more — approval is faster with usage history.
- Settings → Billing → Usage limits: set a Hard limit (cuts off API calls at this dollar amount) and a Soft limit (sends an email alert). Hard limits are the seatbelt that prevents a buggy agent from spending $40K overnight.
- API keys: rotate keys per environment (dev / staging / prod). Each key inherits the org-level limits but is logged separately, which makes diagnosis far easier.
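One way to wire up the per-environment keys from the last step — the `OPENAI_API_KEY_DEV` / `_STAGING` / `_PROD` variable names are an assumption, not an OpenAI convention:

```python
import os

def api_key_for_env(app_env: str) -> str:
    """Pick the key for the current environment so dev traffic never burns prod quota."""
    var = f"OPENAI_API_KEY_{app_env.upper()}"
    key = os.environ.get(var)
    if not key:
        # Fail loudly rather than silently falling back to another environment's key.
        raise RuntimeError(f"missing {var}")
    return key
```

Failing hard on a missing key is deliberate: a silent fallback is how a load test ends up running against your prod quota.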
Code: handling 429s without crashing your agent
When the API returns 429, three rules:
- Always read `Retry-After` if present; otherwise back off exponentially with jitter (e.g., 2^attempt × random(0.5, 1.5) seconds).
- Cap retries at 5 — beyond that, surface the failure to the agent’s planner and pause the workflow rather than retry indefinitely.
- Distinguish RPM from TPM — RPM throttling clears in seconds; TPM throttling can take a full minute. Different waits.
    import random
    import time

    from openai import RateLimitError

    def call_with_backoff(client, **kwargs):
        for attempt in range(5):
            try:
                return client.chat.completions.create(**kwargs)
            except RateLimitError as e:
                # Honor Retry-After when the server sends it; otherwise
                # exponential backoff with jitter, capped at 60 seconds.
                retry_after = e.response.headers.get("retry-after")
                wait = float(retry_after) if retry_after else min(60, 2 ** attempt + random.uniform(0, 1))
                time.sleep(wait)
        raise RuntimeError("rate limited 5 times — pausing workflow")
Per-tenant rate limits inside your own app
OpenAI’s limit is across your whole org. If you serve multiple tenants, you need a second limit per tenant so one customer’s runaway agent can’t starve everyone else.
Pattern: store tokens_used_in_last_hour per tenant in Redis with a sliding window. Reject calls when a tenant exceeds their budget; tell the user “rate-limited by Musketeers, retry in N seconds.” This is the difference between graceful degradation and a 3am incident.
What to monitor
- 429 rate over time — sustained 429s mean you should request a tier increase.
- Average TPM utilization — keep below 80% of cap; spikes happen.
- Per-tenant token spend — flag any tenant burning >5× their median hour.
- Cost per workflow — if an agent is suddenly 10× more expensive per task, model regression or prompt drift is the usual cause.
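The ">5× median hour" check from the list above can be sketched as follows, assuming you already aggregate hourly token spend per tenant (the data shape and function name are illustrative):

```python
from statistics import median

def flag_anomalous_tenants(hourly_spend: dict[str, list[int]], factor: float = 5.0) -> list[str]:
    """Flag tenants whose latest hour burns more than `factor` x their own median hour."""
    flagged = []
    for tenant, hours in hourly_spend.items():
        if len(hours) < 2:
            continue  # not enough history to establish a baseline
        baseline = median(hours[:-1])
        if baseline > 0 and hours[-1] > factor * baseline:
            flagged.append(tenant)
    return flagged
```

Comparing each tenant against its own median, rather than a global threshold, keeps a naturally heavy tenant from drowning out a small one that suddenly goes runaway.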