Rate limits
Two layers of limits apply to every request.
Per-key limits (yours to control)
| Limit | Default | Where to change |
|---|---|---|
| Requests per minute (RPM) | 60 | Dashboard > Keys > Edit |
| Requests per day (RPD) | 10,000 | Dashboard > Keys > Edit |
| Concurrent requests | 10 | Dashboard > Keys > Edit |
| Monthly spend cap | Rp1,000,000 | Dashboard > Keys > Edit |
Exceeding any of these returns HTTP 429 with a code indicating which limit was hit.
You can raise these in the dashboard at any time. For production workloads, typical settings (RPM / RPD / concurrent requests / monthly spend cap):
- Small app: 60 / 10K / 10 / Rp1,000,000 (default)
- Medium app: 300 / 100K / 32 / Rp5,000,000
- Large app: 1000 / 1M / 64 / Rp20,000,000+
Email hello@epithre.com if you need more than 1000 RPM.
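A client can throttle itself to stay under the per-key RPM and concurrency limits instead of waiting for a 429. The following is a minimal sketch, assuming the default limits of 60 RPM and 10 concurrent requests. `RpmWindow` and `slots` are hypothetical names, and the 60-second sliding window is an assumption, since how the server counts its per-minute window is not stated here.

```python
import threading
import time
from collections import deque


class RpmWindow:
    """Client-side sliding window admitting at most `rpm` requests per 60 seconds.

    Hypothetical helper: the server-side window may be counted differently.
    """

    def __init__(self, rpm=60, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock
        self.sent = deque()  # timestamps of admitted requests, oldest first
        self.lock = threading.Lock()

    def try_acquire(self):
        """Return True and record the request if it fits in the window."""
        with self.lock:
            now = self.clock()
            # Drop timestamps that have aged out of the 60-second window.
            while self.sent and now - self.sent[0] >= 60.0:
                self.sent.popleft()
            if len(self.sent) >= self.rpm:
                return False
            self.sent.append(now)
            return True


# Caps in-flight requests at the default concurrency limit of 10.
slots = threading.BoundedSemaphore(10)
```

A caller would check `try_acquire()` and hold `slots` (for example `with slots:`) around each request. Raise both numbers if you raise the key's limits in the dashboard.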
Backend capacity (shared)
Each chat model has an aggregate concurrent cap across all Epithre customers, sized to real serving capacity and reserving headroom for IsonAI internal services. Behavior on saturation differs per model:
| Model | On saturation |
|---|---|
| `epithre-omni` | Queued for up to 45 seconds, then HTTP 429 `backend_busy` if no slot freed |
| `epithre-prme` | Immediate HTTP 429 `backend_busy` (long-tail generations, queueing rarely helps) |
| `epithre-lyt` | Immediate HTTP 429 `backend_busy` |
Practical effect for Omni: short customer-side bursts (most completions take 10-30s) are absorbed transparently as a small first-byte latency increase instead of a 429. You should rarely see backend_busy on Omni unless saturation lasts longer than 45 seconds.
If you do see backend_busy regularly on Omni, email us; sustained saturation means the pool cap needs raising rather than queueing harder.
Header inspection
Currently we don't surface `X-RateLimit-*` headers in responses. The recommended pattern is:
- Get a 429 response.
- Check `error.code`: `rpm_exceeded` vs `rpd_exceeded` vs `concurrency_exceeded` vs `backend_busy`.
- Back off per the recovery strategy:
  - `rpm_exceeded`: short backoff (1-2s, then retry).
  - `rpd_exceeded`: long backoff. Raise daily cap or wait until UTC midnight.
  - `concurrency_exceeded`: reduce parallelism client-side.
  - `backend_busy`: on Omni this means sustained saturation (the 45s queue already expired), so back off 5-10s. On PRME/LYT, retry after 1-3s.
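The recovery strategy above can be sketched as a small lookup in Python. This is an illustration, not an official client: the function names are hypothetical, and the 429 body is assumed to be JSON shaped like `{"error": {"code": "..."}}`, inferred from the `error.code` reference above. Picking a random delay inside each documented range is also an assumption, made to avoid synchronized retries.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def error_code(body: dict) -> Optional[str]:
    """Extract error.code from a parsed 429 body (assumed JSON shape)."""
    return (body.get("error") or {}).get("code")


def seconds_until_utc_midnight(now: Optional[datetime] = None) -> float:
    """Seconds remaining until the next UTC midnight."""
    now = now or datetime.now(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()


def backoff_seconds(code: str, model: str) -> Optional[float]:
    """Return how long to wait before retrying, or None if waiting is not the fix."""
    if code == "rpm_exceeded":
        return random.uniform(1.0, 2.0)
    if code == "rpd_exceeded":
        # Long backoff; raising the daily cap in the dashboard is the alternative.
        return seconds_until_utc_midnight()
    if code == "backend_busy":
        if model == "epithre-omni":
            # The 45s server-side queue has already expired by this point.
            return random.uniform(5.0, 10.0)
        return random.uniform(1.0, 3.0)
    # concurrency_exceeded: reduce client-side parallelism instead of sleeping.
    return None
```

A retry loop would sleep for the returned value and resend, and treat `None` as a signal to lower its own parallelism.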
Multiple keys
Create as many keys as you want from the dashboard. Every key on the same account draws from a single shared credit balance, and each key has its own independent RPM / RPD / concurrency / monthly cap. Useful for:
- Sharding traffic across keys to bypass per-key concurrency caps
- Per-environment keys (prod / staging / canary) with independent caps but shared billing
- Per-team keys with attribution via the `name` field (shows up in usage events for spend slicing)
Keys are independent for revocation. If one leaks, revoke only that one; other keys keep working with the same balance.
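Sharding traffic across keys can be as simple as rotating through them per request. A minimal Python sketch, assuming plain round-robin is acceptable for your workload; `KeyRing` is a hypothetical helper and the key strings are placeholders:

```python
import itertools
import threading


class KeyRing:
    """Hands out API keys in round-robin order (hypothetical helper)."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()  # makes next_key() safe to call from threads

    def next_key(self):
        """Return the next key in rotation."""
        with self._lock:
            return next(self._cycle)


# Placeholder values; real keys come from the dashboard.
ring = KeyRing(["key-a", "key-b", "key-c"])
```

Each key still enforces its own limits, so a 429 on one key says nothing about the others. All keys draw from the same credit balance.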