Prompt caching

If you send the same system prompt or few-shot examples on every request, prompt caching cuts the cost of those input tokens by up to 90% (cache reads bill at 0.1x the input rate). Mark the cacheable portion with cache_control; subsequent requests with the same prefix within 5 minutes hit the cache.

Two layers operate independently:

Explicit cache (this guide): you mark cacheable content with cache_control. You get the 1.25x write / 0.1x read billing rates and a 5-minute Redis TTL keyed per (api_key, model, prefix).
Automatic backend prefix cache: the inference backend opportunistically reuses KV state when a recent request shared a prompt prefix. This reduces latency but has no billing discount — you still pay full input rate. It surfaces in usage.cache_read_input_tokens for observability when no explicit marker was used. Best-effort, no TTL guarantees.

The two stack naturally: an explicit cache hit will almost always also be a backend prefix-cache hit, so explicit caching gives you both billing savings and the latency win.

Why it matters

Common workload: customer support bot with a 2000-token system prompt + 500-token user message. Without caching, every request pays the full input rate on all 2500 tokens.

With caching:

First call: 1.25x rate on the 2000 cached tokens (write cost), 1x on the 500 user tokens.
Calls 2+ within 5 minutes: 0.1x rate on the 2000 cached tokens (read), 1x on the user tokens.
Net savings on call 2: (2000 * 0.9) / 2500 = 72% off input.
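Over a steady-state load of 100 requests/hour (one every 36 seconds if evenly spaced, well inside the TTL), you pay 1 write + 99 reads: (1.25 + 99 * 0.1) / 100 = 0.1115x, about 89% savings on the cached portion.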
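The arithmetic above generalizes to any call count. The following is a minimal sketch, assuming call 1 writes the cache and every later call lands inside the 5-minute TTL. The function name input_savings and its parameters are illustrative and not part of the API.

```python
WRITE_MULT = 1.25  # cache write multiplier from this guide
READ_MULT = 0.10   # cache read multiplier from this guide


def input_savings(calls: int, cached_tokens: int, uncached_tokens: int) -> float:
    """Fraction of total input cost saved versus sending everything uncached.

    Assumes call 1 writes the cache and every later call is a hit.
    A negative result means caching cost more than it saved.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    baseline = calls * (cached_tokens + uncached_tokens)
    cached_cost = cached_tokens * (WRITE_MULT + (calls - 1) * READ_MULT)
    with_cache = cached_cost + calls * uncached_tokens
    return 1 - with_cache / baseline


if __name__ == "__main__":
    # Support-bot example: 2000-token prefix, 500-token user message
    for n in (1, 2, 10, 100):
        print(n, f"{input_savings(n, 2000, 500):.1%}")
```

Note that this is cumulative over all calls, so it includes the write premium paid on call 1. A single call with no re-use comes out negative, which is why caching only pays off once the prefix is actually re-used.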

How to use it

Mark the cacheable content with cache_control: {"type": "ephemeral"} on the last content block of the prefix you want cached.

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[
        {"role": "system", "content": [
            {"type": "text",
             "text": "<long stable system prompt with few-shot examples>",
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": "<user question that varies per call>"},
    ],
)

print(resp.usage)
# {
#   "prompt_tokens": 2450,
#   "completion_tokens": 180,
#   "total_tokens": 2630,
#   "cache_creation_input_tokens": 2400,  # written to cache on this call
#   "cache_read_input_tokens": 0          # this is the first call
# }

Second call within 5 minutes with the same prefix:

# Same messages, just a different user content
# usage on the response:
# {"prompt_tokens": 2450, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400}

The cached tokens bill at 0.1x input rate.

Marker placement: what the gateway actually parses

The cache layer only sees markers placed at content-block level inside a list-form content field. Anything else is silently ignored — you'll still get a successful response, just no caching.

Required shape:

{"role": "system", "content": [
    {"type": "text", "text": "<prefix>", "cache_control": {"type": "ephemeral"}}
]}

Ignored shapes (response succeeds, marker silently dropped, full input rate billed):

// Plain-string content - content must be a list, not a string
{"role": "system", "content": "<prefix>", "cache_control": {"type": "ephemeral"}}

// Message-level field outside content - parser only inspects content blocks
{"role": "system", "cache_control": {"type": "ephemeral"}, "content": [...]}

// Wrong type value - only "ephemeral" is recognized
{"type": "text", "text": "...", "cache_control": {"type": "persistent"}}
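Because ignored markers fail silently, it can help to lint a request client-side before sending it. The sketch below encodes only the shapes listed above. The function name cache_marker_problems is hypothetical, and the check for several markers reflects the one-breakpoint limit documented in this guide, not observed gateway behavior.

```python
def cache_marker_problems(messages: list[dict]) -> list[str]:
    """Return human-readable problems with cache_control placement.

    An empty list means exactly one marker was found in the required shape.
    """
    problems = []
    valid = 0
    for i, msg in enumerate(messages):
        if "cache_control" in msg:
            problems.append(
                f"messages[{i}]: cache_control on the message itself is ignored"
            )
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # string content cannot carry a marker
        for j, block in enumerate(content):
            if not isinstance(block, dict) or "cache_control" not in block:
                continue
            marker = block["cache_control"]
            if isinstance(marker, dict) and marker.get("type") == "ephemeral":
                valid += 1
            else:
                problems.append(
                    f"messages[{i}].content[{j}]: only type 'ephemeral' is recognized"
                )
    if valid == 0:
        problems.append("no valid marker found: request bills at full input rate")
    elif valid > 1:
        problems.append("more than one marker: this guide documents one breakpoint per request")
    return problems
```

Running this in a unit test or a debug-mode request wrapper surfaces placement mistakes that would otherwise only show up as a missing discount on the invoice.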

If you have an existing string-content system message, convert it to list form:

# Before
{"role": "system", "content": SYSTEM_PROMPT}

# After
{"role": "system", "content": [
    {"type": "text", "text": SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}}
]}
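If many call sites build messages with string content, the conversion can be done in one place. This is a sketch of a hypothetical helper, with_cache_marker, that is not part of any SDK. It copies the message list, converts string content to list form, and puts the marker on the last content block of the chosen message.

```python
import copy


def with_cache_marker(messages, index=0):
    """Return a copy of messages with an ephemeral marker on messages[index].

    String content is converted to list form; list content gets the marker
    on its last block. The input list is left unmodified.
    """
    out = copy.deepcopy(messages)
    msg = out[index]
    content = msg["content"]
    if isinstance(content, str):
        msg["content"] = [{"type": "text", "text": content}]
    elif not content:
        raise ValueError("cannot mark a message with empty content")
    msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out
```

For example, with_cache_marker([{"role": "system", "content": SYSTEM_PROMPT}, user_message]) yields the "After" shape shown above while leaving the user message untouched.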

What can be cached

The marker covers everything from messages[0] up to and including the message containing the marker. Concretely:

System messages with long stable content
Few-shot examples in early user/assistant turns
Document context that doesn't change between calls
RAG context (the retrieved docs portion, before the query)

What can't be cached:

The top-level tools field. Tool definitions sent as payload["tools"] are billed at full input rate every call, even with a marker present elsewhere, because the cache layer only inspects messages. Workaround for large tool specs: duplicate the tool documentation as plain text into your cached system message, and keep the structured tools field for function-call schema enforcement. The marker then covers the documentation copy; the structured spec stays full-rate (typically ~10% of the documentation token cost on JSON Schema). See Pattern 4: agentic loop with cached tools spec.
Anything after the marker. Content after the marker is full-rate input — by design, so per-call variable content (user queries, iteration turns) stays uncached.
Image content blocks, currently. A marker on a message containing an image_url block is parsed normally, but vision-token billing doesn't carry a cache discount yet. This is on the roadmap.

Cache rules

Minimum prefix length: 100 tokens (~400 chars). Below this, the marker is ignored and you are billed at full rate. There's no point caching tiny prefixes.
TTL: 5 minutes from last access. Cache hit refreshes the TTL — long-running loops stay warm as long as gaps between calls are under 5 minutes.
Cache key scope: (api_key_id, model, prefix_content) — per-key, not per-account. Two API keys on the same account with identical prefixes will each MISS on their first call. Use one consistent key per workload to maximize hit rate. No cross-key sharing means no cross-tenant billing contamination either.
Backend prefix cache is separate and shared: the inference backend's KV cache (a latency optimization) IS shared across tenants. You may see prompt_tokens_details.cached_tokens > 0 even on a cold explicit-cache miss because another tenant primed the same prefix at the backend. This affects latency only — your billing is still based on the explicit-cache (api_key_id, ...) key.
One breakpoint per request currently. Multi-breakpoint support is on the roadmap.
Identical prefix wins: cache lookup is exact hash match on prefix content. One typo, a timestamp baked into the system prompt, a session ID injected per-call — all break the cache. Pull dynamic content out of the cached prefix.
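Since lookup is an exact match on prefix content, one practical way to catch accidental drift is to log a fingerprint of the prefix with every request. The sketch below is client-side only. How the gateway actually hashes the prefix is not described in this guide, so prefix_fingerprint (a hypothetical helper) only tells you whether your own prefix changed between calls.

```python
import hashlib
import json


def prefix_fingerprint(messages):
    """Short hash of everything up to and including the marked message.

    Returns None when no content block carries a cache_control marker.
    """
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(b, dict) and "cache_control" in b for b in content
        ):
            prefix = messages[: i + 1]
            # Canonical JSON so dict key order does not affect the hash
            blob = json.dumps(prefix, sort_keys=True, ensure_ascii=False)
            return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
    return None
```

If two requests that should share a cache entry log different fingerprints, something dynamic (a timestamp, a session ID, a whitespace change) has leaked into the prefix.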

Cost math

Base input rate is the model's input_per_mtok price. The multipliers:

Mode	Multiplier	Example on epithre-omni (Rp7,000 / 1M input tokens)
Regular input	1.0x	Rp7,000 / 1M
Cache write	1.25x	Rp8,750 / 1M
Cache read	0.1x	Rp700 / 1M

Break-even is 1 hit: the write premium is (1.25 - 1.0) = 0.25 of input cost, and a single read saves (1.0 - 0.1) = 0.9 of input cost. Since 0.25 < 0.9, even one re-use is worth it.

Stack with Batch API for max savings: cache-read inside a batch = 0.1x * 0.5x = 0.05x of base input, 20x cheaper than uncached realtime input.
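To turn the multipliers into rupiah amounts, a small calculator is enough. This is a sketch using the epithre-omni price quoted above. The function names are illustrative, and applying the 0.5x batch multiplier to modes other than cache read is an assumption, since this guide only states it for cache reads.

```python
BASE_RP_PER_MTOK = 7_000  # epithre-omni input price quoted in this guide
MULTIPLIER = {"regular": 1.0, "cache_write": 1.25, "cache_read": 0.1}
BATCH_MULTIPLIER = 0.5  # batch discount quoted in this guide


def input_cost_rp(tokens: int, mode: str, batch: bool = False) -> float:
    """Input cost in rupiah for `tokens` billed under `mode`."""
    rate = BASE_RP_PER_MTOK * MULTIPLIER[mode]
    if batch:
        # Assumption: the batch multiplier composes with every mode;
        # the guide only states this for cache reads.
        rate *= BATCH_MULTIPLIER
    return tokens * rate / 1_000_000


def request_cost_rp(cached_tokens: int, uncached_tokens: int, hit: bool) -> float:
    """Input cost of one realtime request with a marked prefix."""
    prefix_mode = "cache_read" if hit else "cache_write"
    return input_cost_rp(cached_tokens, prefix_mode) + input_cost_rp(
        uncached_tokens, "regular"
    )
```

For a 2000-token prefix plus a 500-token user message, this gives about Rp21.0 for the first (write) call and about Rp4.9 for each hit, against Rp17.5 per call without caching.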

Common patterns

Pattern 1: stable system prompt

Most common. Long persona / instructions stay the same; user messages vary.

SYSTEM_PROMPT = """<long stable system prompt, maybe 1500-3000 tokens>"""

def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": user_msg},
        ],
    )

Pattern 2: stable system + few-shot examples

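Cache through the examples; only the actual query varies. The two examples below are abbreviated: a real few-shot prefix must reach the 100-token minimum for the marker to take effect.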

def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": "You classify Indonesian sentiment."},
            {"role": "user", "content": "Barang ok, ongkir cepat."},
            {"role": "assistant", "content": "positif"},
            {"role": "user", "content": "Salah kirim, refund lambat."},
            {"role": "assistant", "content": [
                {"type": "text", "text": "negatif",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # ^ everything above this is cached
            {"role": "user", "content": user_msg},  # only this varies
        ],
    )

Pattern 3: RAG context

Cache the retrieved document context; the question changes per call.

def answer(question, retrieved_docs):
    context_block = "\n\n".join(f"[{d['id']}] {d['text']}" for d in retrieved_docs)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text",
                 "text": f"Answer based on these documents:\n\n{context_block}",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": question},
        ],
    )

This pattern is most valuable when the user asks multiple follow-up questions about the same retrieved set. The first call writes the cache; follow-ups hit it.
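One caveat follows from the exact-match rule: if the retriever returns the same documents in a different rank order on a follow-up, the rendered context differs and the cache misses. A possible mitigation, sketched here with a hypothetical build_context_block helper, is to render the documents in a stable order. It assumes each document is a dict with unique 'id' and 'text' keys, as in the function above, and it gives up relevance ordering inside the prompt in exchange for a stable prefix.

```python
def build_context_block(retrieved_docs):
    """Render docs in a stable order so identical sets produce identical text."""
    # Sort on the string form of the id so mixed int/str ids do not raise
    ordered = sorted(retrieved_docs, key=lambda d: str(d["id"]))
    return "\n\n".join(f"[{d['id']}] {d['text']}" for d in ordered)
```

Swapping this in for the inline join keeps the prefix byte-identical whenever the retrieved set is the same, regardless of the order the retriever returned it in.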

Pattern 4: agentic loop with cached tools spec

Use case: multi-iteration agentic loop where a stable system prompt + tools spec is replayed each call, only the running scratchpad / observation varies.

The naive approach pays full rate on tool definitions every iteration, since payload["tools"] isn't covered by the marker. Workaround: render the tools as readable text into the cached system message, while still sending the structured tools field for function-call schema enforcement.

def render_tools_doc(tools_spec):
    """Convert structured tools spec to plain text doc for cacheable system prompt."""
    lines = []
    for t in tools_spec:
        fn = t["function"]
        # Tolerate tools declared without a parameters schema
        params = fn.get("parameters", {}).get("properties", {})
        lines.append(f"- `{fn['name']}({', '.join(params)})`")
        lines.append(f"    {fn.get('description', '')}")
    return "\n".join(lines)

SYSTEM_PROMPT = """<your stable system prompt, ~750 tokens>"""

def run_iteration(history, tools_spec):
    system_text = SYSTEM_PROMPT + "\n\n## Available tools\n" + render_tools_doc(tools_spec)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": system_text,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # iteration history (variable) - not cached
            *history,
        ],
        tools=tools_spec,  # structured spec - schema enforcement, NOT cached
    )

Across 9 iterations with a 750-token system + 600-token tools doc = 1350 token prefix:

Call 1: 1× cache write (1.25x) on 1350 tokens
Calls 2-9: 8× cache read (0.1x) on 1350 tokens each
Savings vs no-cache baseline: (9 − (1.25 + 8 × 0.1)) / 9 = 6.95 / 9 ≈ 77% on the cached prefix portion. The 0.25x write premium on call 1 is why this falls short of 8 × 0.9 / 9 = 80%.

Verify in usage.cache_creation_input_tokens (expect ~1350 on call 1, then 0) and usage.cache_read_input_tokens (expect ~1350 on calls 2-9).

Diagnostics

Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens to verify your caching is working:

print(resp.usage)

If cache_creation_input_tokens > 0 and cache_read_input_tokens == 0: first call with marker present, explicit cache being written. Expected.

If cache_creation_input_tokens == 0 and cache_read_input_tokens > 0: explicit cache hit (when you used a marker) or automatic backend prefix-cache hit (when you didn't). To distinguish, check whether you sent a cache_control marker in the request — only marker-based hits get the 0.1x billing rate.

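If both are 0: either no explicit marker was sent and there was no recent backend prefix-cache hit, or a marker was sent but ignored (wrong shape, unrecognized type, or prefix under 100 tokens). You're paying full input rate. To unlock billing savings, add a correctly placed cache_control marker on stable prefixes ≥100 tokens.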

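If you expected an explicit hit but got cache_creation_input_tokens > 0: either the prefix differs from the previous call, more than 5 minutes passed since the last access, or the request used a different API key or model. Check whitespace, message order, and the content of every block before the marker.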
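The cases above can be folded into a small helper for logging or alerting. This is a sketch with a hypothetical classify_cache_usage function. It assumes the usage object has been converted to a plain dict with the two counter fields shown in this guide, and that the caller knows whether a marker was sent.

```python
def classify_cache_usage(usage: dict, marker_sent: bool) -> str:
    """Map the two cache counters to one of the diagnostic cases."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if created > 0 and read == 0:
        return "explicit cache write"
    if created == 0 and read > 0:
        if marker_sent:
            return "explicit cache hit (0.1x billing)"
        return "backend prefix-cache hit (latency only, full rate)"
    if created == 0 and read == 0:
        if marker_sent:
            return "marker ignored: check shape, type, and 100-token minimum"
        return "no caching"
    # Both counters non-zero is not a case this guide describes
    return "unexpected: both counters non-zero"
```

Counting the results per workload makes a drop in hit rate visible: a sudden run of "explicit cache write" where hits are expected usually points at prefix drift or gaps longer than the TTL.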

Common mistakes

Putting cache_control in the wrong place: it must be on a content block, not the message itself. Wrong: {"role": "system", "content": "...", "cache_control": ...}. Right: {"role": "system", "content": [{"type": "text", "text": "...", "cache_control": {...}}]}.
Cache miss because of identifier in system prompt: if your system prompt includes the user's ID or session ID, every user's cache is separate. Pull dynamic identifiers out of the cached prefix.
Trying to cache short prefixes: below 100 tokens, the marker is silently ignored. Aim for prefixes 500+ tokens for meaningful savings.
Mixing cache_control with non-cacheable tool definitions: tool definitions today aren't cached. The marker still applies to the message content; tool tokens are just billed full-rate alongside.

Best practices guide - retry patterns, batch + cache compounding.
Cookbook: RAG - end-to-end pattern with caching for retrieved context.
Chat reference - full cache_control schema.