Best practices

What we've seen work (and what we've seen fail) in production deployments.

Set a system prompt for your application

Your role: "system" message is where you shape:

If you don't provide a system message, the platform applies a generic helpful-assistant default. That's fine for prototyping but in production you almost always want your own. See prompt engineering for Bahasa Indonesia for register-handling patterns.

A baseline content policy applies on all chat completions regardless of your system prompt. See the Acceptable Use Policy for what's covered.

Retry pattern

Standard exponential backoff with jitter. Retry only on transient errors.

import time, random
from openai import APIError, RateLimitError, APITimeoutError, APIConnectionError

def call_with_retry(call_fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return call_fn()
        except RateLimitError as e:
            # 429: respect retry-after if present, otherwise exponential
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            # 504, network failure: short backoff
            if attempt == max_retries - 1:
                raise
            time.sleep(1 + attempt)
        except APIError as e:
            # 5xx, retry
            if e.status_code >= 500 and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise  # 4xx (except 429): don't retry
    raise Exception("retry exhausted")

What NOT to retry:

Prompt caching playbook

Cache anything that's stable across multiple requests. Order of priority:

  1. System prompts - mark with cache_control if the system content is the same across calls. Highest ROI.
  2. Few-shot examples - if you have stable demonstration pairs, mark the last one with cache_control. The cache covers everything before.
  3. Long retrieved context - in RAG, when the same documents stay relevant across follow-up questions, cache the context block.

Don't cache:

Batch vs realtime

Workload Use Why
Embed an existing corpus once Batch 50% off, no latency requirement.
Re-embed after model change Batch Same reason.
Bulk classification of historical data Batch Cost-sensitive, async OK.
Periodic summarization (e.g. daily emails) Batch Schedule overnight; results next morning.
Real-time user chat Realtime Latency required.
Interactive RAG queries Realtime Latency required.
Synchronous API call from your product Realtime Latency required.

Stacking: batch + prompt cache compounds. A cache-read inside a batch bills at 0.1x * 0.5x = 0.05x base rate. 20x off for stable-prefix bulk workloads.

Concurrency control

Per-key concurrency limit is 10 by default. Use a semaphore client-side to keep yourself under, otherwise you'll see 429.

import asyncio

sem = asyncio.Semaphore(8)  # stay 2 below the 10 cap

async def call_one(input_text):
    async with sem:
        return await client.embeddings.create(model="epithre-embed", input=[input_text])

results = await asyncio.gather(*(call_one(t) for t in texts))

If you need more than 10 concurrent: raise the cap in the dashboard. We can support up to ~100 concurrent per key without backend pressure for chat/embed.

Rate limit hygiene

The default per-key limits (60 RPM, 10K RPD, 10 concurrent) are conservative for B2B workloads. Raise them in the dashboard if you have:

Watch for the backend-busy 429:

{"error": {"type": "rate_limit_error", "code": "backend_busy"}}

This is shared-pool back-pressure (all customers, all keys). Solution: short retry (1-3s). It clears quickly.

If you see backend_busy more than rarely, email us; we'll raise the backend pool cap for that model.

Cost monitoring

Track three metrics in your own system:

  1. Daily spend per endpoint: from the dashboard #/usage page or the /dashboard/usage/events API.
  2. Per-feature cost attribution: tag your requests with metadata (where available) so you can group cost by feature in your DB if you replicate usage_events.
  3. Cache hit rate: from usage.cache_read_input_tokens / (usage.cache_creation_input_tokens + usage.cache_read_input_tokens + regular prompt_tokens of cacheable size). Low hit rate means caching isn't paying off; revisit the marker placement.

Set up a monthly cap on each key. The dashboard #/keys page has per-key monthly_idr_cap. Cap a runaway 100x over your expected spend (e.g., expected Rp500,000/mo -> cap Rp50,000,000) so the cap fires only on genuine runaway.

Idempotency

Epithre doesn't currently support idempotency keys explicitly. The recommended pattern:

If you accidentally double-fire a request (network retry on a response that actually succeeded), the cost is paid twice. Worth handling at the application layer: store request IDs, check for duplicates before submitting.

Error handling per endpoint

Common patterns:

try:
    resp = client.chat.completions.create(...)
except RateLimitError:
    # back off, retry
except APITimeoutError:
    # the inference took too long; reduce max_tokens or simplify prompt
except APIError as e:
    if e.status_code == 402:
        # balance hit zero
        alert_team()
    elif e.status_code == 401:
        # key got revoked; rotate to backup key
        switch_to_fallback_key()
    elif e.status_code >= 500:
        # upstream issue, retry
        ...
    else:
        # 4xx other than the above: bug in your code
        log_and_raise(e)

Track request_id from response headers X-Request-ID (or chatcmpl-... from response body) in your logs. Email us with the ID when reporting issues.

Token estimation

Char-based heuristic: tokens ~= chars / 4 for English, tokens ~= chars / 3 for dense Indonesian legal/finance text. Always over-estimate for budget reasons.

Concrete: a 6000-char Indonesian legal document is roughly 2000 tokens. Plan max_tokens accordingly to leave room for the response.

For exact counts, use the usage field in the response. There's no tokenizer API on Epithre yet.

Mixed-model strategies

Don't use the same model for everything. Common cost-optimized routing:

Task Best model Rationale
User-facing chat epithre-omni Quality matters most.
Classification, tagging, simple extraction epithre-lyt 6x cheaper, fast.
Long document analysis (>32K input) epithre-prme Only model with 180K context.
Embedding for retrieval epithre-embed Only embedding option.
Re-ranking after embed search epithre-rerank Cheap and substantially boosts retrieval quality.
Image generation epithre-iris Only image option.

For agentic tool-use chains: use epithre-omni for the planner step, epithre-lyt for cheaper sub-tasks (per-document classification, etc).

Logging and observability

Log per request:

This lets you trace cost spikes back to causes and triage support issues.

What to do when something breaks

  1. Check the response error envelope. Often the message tells you exactly what's wrong.
  2. Check the Epithre status page (link to be added when live). If we're degraded, just wait.
  3. Try a minimal repro: same request from curl or Postman. Eliminates SDK / proxy issues.
  4. Email hello@epithre.com with: request ID, timestamp, error response. We can look up in our usage_events table.