Best practices

What we've seen work (and what we've seen fail) in production deployments.

Set a system prompt for your application

Your role: "system" message is where you shape:

Persona and branding - how the assistant introduces itself in your product.
Domain framing - hotel ops, legal research, customer support, internal tooling all need different tone, register, and topical bounds.
Tone and style specific to your audience.

If you don't provide a system message, the platform applies a generic helpful-assistant default. That's fine for prototyping but in production you almost always want your own. See prompt engineering for Bahasa Indonesia for register-handling patterns.

A baseline content policy applies on all chat completions regardless of your system prompt. See the Acceptable Use Policy for what's covered.

Retry pattern

Standard exponential backoff with jitter. Retry only on transient errors.

import time, random
from openai import APIError, RateLimitError, APITimeoutError, APIConnectionError

def call_with_retry(call_fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return call_fn()
        except RateLimitError as e:
            # 429: respect retry-after if present, otherwise exponential
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            # 504, network failure: short backoff
            if attempt == max_retries - 1:
                raise
            time.sleep(1 + attempt)
        except APIError as e:
            # 5xx, retry
            if e.status_code >= 500 and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise  # 4xx (except 429): don't retry
    raise Exception("retry exhausted")

What NOT to retry:

400 invalid_request_error - your code is wrong, retrying won't fix.
401 authentication_error - bad key, no point.
402 insufficient_quota - balance is 0, top up first.
403 permission_error - account state issue.
422 invalid_request_error - validation failure, fix the input.

Prompt caching playbook

Cache anything that's stable across multiple requests. Order of priority:

System prompts - mark with cache_control if the system content is the same across calls. Highest ROI.
Few-shot examples - if you have stable demonstration pairs, mark the last one with cache_control. The cache covers everything before.
Long retrieved context - in RAG, when the same documents stay relevant across follow-up questions, cache the context block.

Don't cache:

Tiny prefixes (<100 tokens). Below the minimum, the marker is ignored anyway.
Content that changes per call (user messages, session-specific data).
Prefixes you'll only use once. The 1.25x write cost makes single-use caching worse than no caching.

Batch vs realtime

Workload	Use	Why
Embed an existing corpus once	Batch	50% off, no latency requirement.
Re-embed after model change	Batch	Same reason.
Bulk classification of historical data	Batch	Cost-sensitive, async OK.
Periodic summarization (e.g. daily emails)	Batch	Schedule overnight; results next morning.
Real-time user chat	Realtime	Latency required.
Interactive RAG queries	Realtime	Latency required.
Synchronous API call from your product	Realtime	Latency required.

Stacking: batch + prompt cache compounds. A cache-read inside a batch bills at 0.1x * 0.5x = 0.05x base rate. 20x off for stable-prefix bulk workloads.

Concurrency control

Per-key concurrency limit is 10 by default. Use a semaphore client-side to keep yourself under, otherwise you'll see 429.

import asyncio

sem = asyncio.Semaphore(8)  # stay 2 below the 10 cap

async def call_one(input_text):
    async with sem:
        return await client.embeddings.create(model="epithre-embed", input=[input_text])

results = await asyncio.gather(*(call_one(t) for t in texts))

If you need more than 10 concurrent: raise the cap in the dashboard. We can support up to ~100 concurrent per key without backend pressure for chat/embed.

Rate limit hygiene

The default per-key limits (60 RPM, 10K RPD, 10 concurrent) are conservative for B2B workloads. Raise them in the dashboard if you have:

Steady traffic above 1 RPS.
Burst traffic over 10 concurrent.
Daily volume above 10K requests.

Watch for the backend-busy 429:

{"error": {"type": "rate_limit_error", "code": "backend_busy"}}

This is shared-pool back-pressure (all customers, all keys). Solution: short retry (1-3s). It clears quickly.

If you see backend_busy more than rarely, email us; we'll raise the backend pool cap for that model.

Cost monitoring

Track three metrics in your own system:

Daily spend per endpoint: from the dashboard #/usage page or the /dashboard/usage/events API.
Per-feature cost attribution: tag your requests with metadata (where available) so you can group cost by feature in your DB if you replicate usage_events.
Cache hit rate: from usage.cache_read_input_tokens / (usage.cache_creation_input_tokens + usage.cache_read_input_tokens + regular prompt_tokens of cacheable size). Low hit rate means caching isn't paying off; revisit the marker placement.

Set up a monthly cap on each key. The dashboard #/keys page has per-key monthly_idr_cap. Cap a runaway 100x over your expected spend (e.g., expected Rp500,000/mo -> cap Rp50,000,000) so the cap fires only on genuine runaway.

Idempotency

Epithre doesn't currently support idempotency keys explicitly. The recommended pattern:

For chat: pass a deterministic seed parameter. The model still has slight non-determinism (it's a probabilistic system), but consecutive identical requests with the same seed are highly similar.
For embed: deterministic for identical input.
For image generation: deterministic with same seed.
For batch: each line has a custom_id you control. On retry, dedupe by custom_id.

If you accidentally double-fire a request (network retry on a response that actually succeeded), the cost is paid twice. Worth handling at the application layer: store request IDs, check for duplicates before submitting.

Error handling per endpoint

Common patterns:

try:
    resp = client.chat.completions.create(...)
except RateLimitError:
    # back off, retry
except APITimeoutError:
    # the inference took too long; reduce max_tokens or simplify prompt
except APIError as e:
    if e.status_code == 402:
        # balance hit zero
        alert_team()
    elif e.status_code == 401:
        # key got revoked; rotate to backup key
        switch_to_fallback_key()
    elif e.status_code >= 500:
        # upstream issue, retry
        ...
    else:
        # 4xx other than the above: bug in your code
        log_and_raise(e)

Track request_id from response headers X-Request-ID (or chatcmpl-... from response body) in your logs. Email us with the ID when reporting issues.

Token estimation

Char-based heuristic: tokens ~= chars / 4 for English, tokens ~= chars / 3 for dense Indonesian legal/finance text. Always over-estimate for budget reasons.

Concrete: a 6000-char Indonesian legal document is roughly 2000 tokens. Plan max_tokens accordingly to leave room for the response.

For exact counts, use the usage field in the response. There's no tokenizer API on Epithre yet.

Mixed-model strategies

Don't use the same model for everything. Common cost-optimized routing:

Task	Best model	Rationale
User-facing chat	`epithre-omni`	Quality matters most.
Classification, tagging, simple extraction	`epithre-lyt`	6x cheaper, fast.
Long document analysis (>32K input)	`epithre-prme`	Only model with 180K context.
Embedding for retrieval	`epithre-embed`	Only embedding option.
Re-ranking after embed search	`epithre-rerank`	Cheap and substantially boosts retrieval quality.
Image generation	`epithre-iris`	Only image option.

For agentic tool-use chains: use epithre-omni for the planner step, epithre-lyt for cheaper sub-tasks (per-document classification, etc).

Logging and observability

Log per request:

model used
prompt_tokens, completion_tokens, cache_read_input_tokens, cache_creation_input_tokens
latency_ms (your client-measured)
request_id
The user/session that triggered

This lets you trace cost spikes back to causes and triage support issues.

What to do when something breaks

Check the response error envelope. Often the message tells you exactly what's wrong.
Check the Epithre status page (link to be added when live). If we're degraded, just wait.
Try a minimal repro: same request from curl or Postman. Eliminates SDK / proxy issues.
Email hello@epithre.com with: request ID, timestamp, error response. We can look up in our usage_events table.