Rate limits

Two layers of limits apply to every request.

Per-key limits (yours to control)

Limit	Default	Where to change
Requests per minute (RPM)	60	Dashboard > Keys > Edit
Requests per day (RPD)	10,000	Dashboard > Keys > Edit
Concurrent requests	10	Dashboard > Keys > Edit
Monthly spend cap	Rp1,000,000	Dashboard > Keys > Edit

Exceeding any of these returns HTTP 429 with a code indicating which limit was hit.

You can raise these limits in the dashboard at any time. Typical settings for production workloads, listed as RPM / RPD / concurrent requests / monthly spend cap:

Small app: 60 / 10K / 10 / Rp1,000,000 (default)
Medium app: 300 / 100K / 32 / Rp5,000,000
Large app: 1000 / 1M / 64 / Rp20,000,000+

Email hello@epithre.com if you need more than 1,000 RPM.
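A minimal sketch of keeping a threaded client under the per-key concurrent request limit, so that concurrency_exceeded is avoided rather than handled. The names `call_with_slot` and `send_request` are hypothetical and not part of any Epithre SDK; `MAX_CONCURRENT` is assumed to match whatever the key is set to in the dashboard.

```python
import threading

# Assumption: the key still uses the default concurrent request limit of 10.
MAX_CONCURRENT = 10

# One slot per allowed in-flight request.
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def call_with_slot(send_request, *args, **kwargs):
    """Run send_request while holding one of the MAX_CONCURRENT slots.

    send_request is whatever function your code uses to call the API
    (hypothetical here). Callers block until a slot is free.
    """
    with _slots:
        return send_request(*args, **kwargs)
```

Any number of worker threads can call `call_with_slot`; at most `MAX_CONCURRENT` of them are inside `send_request` at once.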

Backend capacity (shared)

Each chat model has an aggregate concurrency cap shared across all Epithre customers. The cap is sized to real serving capacity and reserves headroom for IsonAI internal services. Behavior on saturation differs per model:

Model	On saturation
`epithre-omni`	Queued for up to 45 seconds, then `HTTP 429 backend_busy` if no slot freed
`epithre-prme`	Immediate `HTTP 429 backend_busy` (long-tail generations, queueing rarely helps)
`epithre-lyt`	Immediate `HTTP 429 backend_busy`

Practical effect for Omni: short customer-side bursts (most completions take 10-30s) are absorbed transparently as a small first-byte latency increase instead of a 429. You should rarely see backend_busy on Omni unless saturation lasts longer than 45 seconds.

If you do see backend_busy regularly on Omni, email us. Sustained saturation means the pool cap needs to be raised; a longer queue would not help.
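One consequence of the Omni queue is that a client timeout shorter than the queue window can give up on a request that would still have been served. The sketch below adds the 45-second window to a caller-chosen base timeout. The function name and the idea of a separate base timeout are assumptions for illustration, not documented Epithre behavior.

```python
# Queue window for epithre-omni, taken from the saturation table above.
OMNI_MAX_QUEUE_SECONDS = 45.0


def first_byte_timeout(model: str, base_timeout_s: float) -> float:
    """Return how long to wait for the first byte of a response.

    base_timeout_s is the timeout you would use with no queueing
    (an assumption you choose). Only epithre-omni queues on saturation,
    so only it gets the extra allowance.
    """
    if model == "epithre-omni":
        return base_timeout_s + OMNI_MAX_QUEUE_SECONDS
    return base_timeout_s
```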

Header inspection

Responses do not currently include X-RateLimit-* headers. Instead, the recommended pattern is:

Get a 429 response.
Check error.code: rpm_exceeded vs rpd_exceeded vs concurrency_exceeded vs backend_busy.
Back off according to the code:
rpm_exceeded: short backoff (wait 1-2s, then retry).
rpd_exceeded: long backoff. Raise the daily cap or wait until UTC midnight.
concurrency_exceeded: reduce parallelism client-side.
backend_busy: on Omni this means sustained saturation (the 45s queue has already expired), so back off 5-10s. On PRME/LYT, retry after 1-3s.
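The recovery strategy above can be sketched as a function that maps a 429 body to a wait time. It assumes the body is JSON shaped like `{"error": {"code": "..."}}`, which is inferred from the `error.code` reference above and not a confirmed schema. The function names are hypothetical, and picking a random delay inside each stated range is a choice made for this sketch.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def seconds_until_utc_midnight(now: datetime) -> float:
    """Seconds from a timezone-aware `now` until the next UTC midnight."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()


def retry_delay(error_body: dict, model: str,
                now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None if waiting won't help.

    error_body is the parsed 429 response body (assumed shape, see above).
    """
    code = error_body.get("error", {}).get("code")
    if code == "rpm_exceeded":
        return random.uniform(1.0, 2.0)       # short backoff
    if code == "rpd_exceeded":
        # Long backoff: wait for UTC midnight unless the daily cap is raised.
        return seconds_until_utc_midnight(now or datetime.now(timezone.utc))
    if code == "backend_busy":
        if model == "epithre-omni":
            return random.uniform(5.0, 10.0)  # the 45s queue already expired
        return random.uniform(1.0, 3.0)       # PRME / LYT
    # concurrency_exceeded (or an unknown code): reduce parallelism instead.
    return None
```

A `None` result signals that the caller should change its behavior (for example, lower its parallelism) rather than sleep and retry.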

Multiple keys

Create as many keys as you want from the dashboard. Every key on the same account draws from a single shared credit balance, but each key has its own independent RPM, RPD, concurrency, and monthly spend limits. Useful for:

Sharding traffic across keys to bypass per-key concurrency caps
Per-environment keys (prod / staging / canary) with independent caps but shared billing
Per-team keys with attribution via the name field (shows up in usage events for spend slicing)

Keys are revoked independently. If one leaks, revoke only that one; the other keys keep working against the same balance.
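A minimal sketch of sharding traffic across several keys with round-robin selection. `KeyRotator` is a hypothetical helper, not part of any Epithre SDK, and the key strings in the usage example are placeholders.

```python
import itertools
import threading


class KeyRotator:
    """Hand out API keys in round-robin order, safely across threads."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # The lock keeps the rotation even when many threads ask at once.
        with self._lock:
            return next(self._cycle)


# Placeholder key values for illustration only.
rotator = KeyRotator(["key-a", "key-b", "key-c"])
first_four = [rotator.next_key() for _ in range(4)]
# first_four == ["key-a", "key-b", "key-c", "key-a"]
```

Because each key has its own limits, spreading requests evenly like this multiplies the effective per-key caps by the number of keys, while spend still comes from the one shared balance.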
