Best practices
What we've seen work (and what we've seen fail) in production deployments.
Set a system prompt for your application
Your role: "system" message is where you shape:
- Persona and branding - how the assistant introduces itself in your product.
- Domain framing - hotel ops, legal research, customer support, internal tooling all need different tone, register, and topical bounds.
- Tone and style specific to your audience.
If you don't provide a system message, the platform applies a generic helpful-assistant default. That's fine for prototyping but in production you almost always want your own. See prompt engineering for Bahasa Indonesia for register-handling patterns.
A baseline content policy applies on all chat completions regardless of your system prompt. See the Acceptable Use Policy for what's covered.
Retry pattern
Standard exponential backoff with jitter. Retry only on transient errors.
import time, random
from openai import APIError, RateLimitError, APITimeoutError, APIConnectionError
def call_with_retry(call_fn, max_retries=4):
for attempt in range(max_retries):
try:
return call_fn()
except RateLimitError as e:
# 429: respect retry-after if present, otherwise exponential
wait = (2 ** attempt) + random.uniform(0, 1)
time.sleep(wait)
except (APITimeoutError, APIConnectionError) as e:
# 504, network failure: short backoff
if attempt == max_retries - 1:
raise
time.sleep(1 + attempt)
except APIError as e:
# 5xx, retry
if e.status_code >= 500 and attempt < max_retries - 1:
time.sleep(2 ** attempt)
else:
raise # 4xx (except 429): don't retry
raise Exception("retry exhausted")
What NOT to retry:
400 invalid_request_error- your code is wrong, retrying won't fix.401 authentication_error- bad key, no point.402 insufficient_quota- balance is 0, top up first.403 permission_error- account state issue.422 invalid_request_error- validation failure, fix the input.
Prompt caching playbook
Cache anything that's stable across multiple requests. Order of priority:
- System prompts - mark with
cache_controlif the system content is the same across calls. Highest ROI. - Few-shot examples - if you have stable demonstration pairs, mark the last one with
cache_control. The cache covers everything before. - Long retrieved context - in RAG, when the same documents stay relevant across follow-up questions, cache the context block.
Don't cache:
- Tiny prefixes (<100 tokens). Below the minimum, the marker is ignored anyway.
- Content that changes per call (user messages, session-specific data).
- Prefixes you'll only use once. The 1.25x write cost makes single-use caching worse than no caching.
Batch vs realtime
| Workload | Use | Why |
|---|---|---|
| Embed an existing corpus once | Batch | 50% off, no latency requirement. |
| Re-embed after model change | Batch | Same reason. |
| Bulk classification of historical data | Batch | Cost-sensitive, async OK. |
| Periodic summarization (e.g. daily emails) | Batch | Schedule overnight; results next morning. |
| Real-time user chat | Realtime | Latency required. |
| Interactive RAG queries | Realtime | Latency required. |
| Synchronous API call from your product | Realtime | Latency required. |
Stacking: batch + prompt cache compounds. A cache-read inside a batch bills at 0.1x * 0.5x = 0.05x base rate. 20x off for stable-prefix bulk workloads.
Concurrency control
Per-key concurrency limit is 10 by default. Use a semaphore client-side to keep yourself under, otherwise you'll see 429.
import asyncio
sem = asyncio.Semaphore(8) # stay 2 below the 10 cap
async def call_one(input_text):
async with sem:
return await client.embeddings.create(model="epithre-embed", input=[input_text])
results = await asyncio.gather(*(call_one(t) for t in texts))
If you need more than 10 concurrent: raise the cap in the dashboard. We can support up to ~100 concurrent per key without backend pressure for chat/embed.
Rate limit hygiene
The default per-key limits (60 RPM, 10K RPD, 10 concurrent) are conservative for B2B workloads. Raise them in the dashboard if you have:
- Steady traffic above 1 RPS.
- Burst traffic over 10 concurrent.
- Daily volume above 10K requests.
Watch for the backend-busy 429:
{"error": {"type": "rate_limit_error", "code": "backend_busy"}}
This is shared-pool back-pressure (all customers, all keys). Solution: short retry (1-3s). It clears quickly.
If you see backend_busy more than rarely, email us; we'll raise the backend pool cap for that model.
Cost monitoring
Track three metrics in your own system:
- Daily spend per endpoint: from the dashboard
#/usagepage or the/dashboard/usage/eventsAPI. - Per-feature cost attribution: tag your requests with
metadata(where available) so you can group cost by feature in your DB if you replicateusage_events. - Cache hit rate: from
usage.cache_read_input_tokens/ (usage.cache_creation_input_tokens+usage.cache_read_input_tokens+ regularprompt_tokensof cacheable size). Low hit rate means caching isn't paying off; revisit the marker placement.
Set up a monthly cap on each key. The dashboard #/keys page has per-key monthly_idr_cap. Cap a runaway 100x over your expected spend (e.g., expected Rp500,000/mo -> cap Rp50,000,000) so the cap fires only on genuine runaway.
Idempotency
Epithre doesn't currently support idempotency keys explicitly. The recommended pattern:
- For chat: pass a deterministic
seedparameter. The model still has slight non-determinism (it's a probabilistic system), but consecutive identical requests with the same seed are highly similar. - For embed: deterministic for identical input.
- For image generation: deterministic with same
seed. - For batch: each line has a
custom_idyou control. On retry, dedupe bycustom_id.
If you accidentally double-fire a request (network retry on a response that actually succeeded), the cost is paid twice. Worth handling at the application layer: store request IDs, check for duplicates before submitting.
Error handling per endpoint
Common patterns:
try:
resp = client.chat.completions.create(...)
except RateLimitError:
# back off, retry
except APITimeoutError:
# the inference took too long; reduce max_tokens or simplify prompt
except APIError as e:
if e.status_code == 402:
# balance hit zero
alert_team()
elif e.status_code == 401:
# key got revoked; rotate to backup key
switch_to_fallback_key()
elif e.status_code >= 500:
# upstream issue, retry
...
else:
# 4xx other than the above: bug in your code
log_and_raise(e)
Track request_id from response headers X-Request-ID (or chatcmpl-... from response body) in your logs. Email us with the ID when reporting issues.
Token estimation
Char-based heuristic: tokens ~= chars / 4 for English, tokens ~= chars / 3 for dense Indonesian legal/finance text. Always over-estimate for budget reasons.
Concrete: a 6000-char Indonesian legal document is roughly 2000 tokens. Plan max_tokens accordingly to leave room for the response.
For exact counts, use the usage field in the response. There's no tokenizer API on Epithre yet.
Mixed-model strategies
Don't use the same model for everything. Common cost-optimized routing:
| Task | Best model | Rationale |
|---|---|---|
| User-facing chat | epithre-omni |
Quality matters most. |
| Classification, tagging, simple extraction | epithre-lyt |
6x cheaper, fast. |
| Long document analysis (>32K input) | epithre-prme |
Only model with 180K context. |
| Embedding for retrieval | epithre-embed |
Only embedding option. |
| Re-ranking after embed search | epithre-rerank |
Cheap and substantially boosts retrieval quality. |
| Image generation | epithre-iris |
Only image option. |
For agentic tool-use chains: use epithre-omni for the planner step, epithre-lyt for cheaper sub-tasks (per-document classification, etc).
Logging and observability
Log per request:
modelusedprompt_tokens,completion_tokens,cache_read_input_tokens,cache_creation_input_tokenslatency_ms(your client-measured)request_id- The user/session that triggered
This lets you trace cost spikes back to causes and triage support issues.
What to do when something breaks
- Check the response error envelope. Often the message tells you exactly what's wrong.
- Check the Epithre status page (link to be added when live). If we're degraded, just wait.
- Try a minimal repro: same request from
curlor Postman. Eliminates SDK / proxy issues. - Email
hello@epithre.comwith: request ID, timestamp, error response. We can look up in our usage_events table.