Prompt caching

If you send the same system prompt or few-shot examples on every request, prompt caching saves you up to 90% on those input tokens. Mark the cacheable portion with cache_control; subsequent requests that send an identical prefix within 5 minutes hit the cache.

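Two layers operate independently: the explicit cache, which you opt into with a cache_control marker and which is billed at the discounted cache-read rate, and the automatic backend prefix cache, which needs no marker and gives a latency win but no billing discount.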

The two stack naturally: an explicit cache hit will almost always also be a backend prefix-cache hit, so explicit caching gives you both billing savings and the latency win.

Why it matters

A common workload: a customer support bot with a 2000-token system prompt and a 500-token user message. Without caching, every request pays the full input rate on all 2500 tokens.

With caching:
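A rough sketch of that arithmetic for the support-bot workload, assuming the multipliers and the epithre-omni price from the Cost math section (1.25x cache write, 0.1x cache read, Rp7,000 per 1M input tokens) and assuming the 500-token user message is always billed at the regular rate. The helper name and its parameters are illustrative, not part of any SDK.

```python
# Illustrative arithmetic only. Multipliers and price come from the Cost math
# section; the helper name is hypothetical.
PRICE_PER_MTOK_RP = 7_000   # epithre-omni input price, Rp per 1M tokens
WRITE_MULT = 1.25           # cache write multiplier
READ_MULT = 0.10            # cache read multiplier


def request_input_cost_rp(prefix_tokens, user_tokens, prefix_mult):
    """Input cost in Rp for one request, with the prefix billed at prefix_mult."""
    billed_units = prefix_tokens * prefix_mult + user_tokens
    return billed_units * PRICE_PER_MTOK_RP / 1_000_000


uncached = request_input_cost_rp(2000, 500, 1.0)            # no marker
first_call = request_input_cost_rp(2000, 500, WRITE_MULT)   # cache write
later_call = request_input_cost_rp(2000, 500, READ_MULT)    # cache read

print(f"uncached:   Rp{uncached:.2f}")
print(f"first call: Rp{first_call:.2f}")
print(f"later call: Rp{later_call:.2f}")
```

Under these assumptions a cache-hit request costs Rp4.90 in input instead of Rp17.50, about 72% less, while the first (cache-writing) request costs Rp21.00.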

How to use it

Mark the cacheable content with cache_control: {"type": "ephemeral"} on the last content block of the prefix you want cached.

# Assumes `client` is an already-configured SDK client for the gateway.
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[
        {"role": "system", "content": [
            {"type": "text",
             "text": "<long stable system prompt with few-shot examples>",
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": "<user question that varies per request>"},
    ],
)

print(resp.usage)
# {
#   "prompt_tokens": 2450,
#   "completion_tokens": 180,
#   "total_tokens": 2630,
#   "cache_creation_input_tokens": 2400,  # written to cache on this call
#   "cache_read_input_tokens": 0          # this is the first call
# }

Second call within 5 minutes with the same prefix:

# Same messages, just a different user content
# usage on the response:
# {"prompt_tokens": 2450, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400}

Cached tokens are billed at 0.1x the input rate.

Marker placement: what the gateway actually parses

The cache layer only sees markers placed at content-block level inside a list-form content field. Anything else is silently ignored: you'll still get a successful response, just no caching.

Required shape:

{"role": "system", "content": [
    {"type": "text", "text": "<prefix>", "cache_control": {"type": "ephemeral"}}
]}

Ignored shapes (response succeeds, marker silently dropped, full input rate billed):

// Plain-string content - content must be a list, not a string
{"role": "system", "content": "<prefix>", "cache_control": {"type": "ephemeral"}}

// Message-level field outside content - parser only inspects content blocks
{"role": "system", "cache_control": {"type": "ephemeral"}, "content": [...]}

// Wrong type value - only "ephemeral" is recognized
{"type": "text", "text": "...", "cache_control": {"type": "persistent"}}
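Because these shapes fail silently, a small client-side check before sending can help. The following is a minimal sketch that only encodes the three ignored shapes listed above; `find_ignored_markers` is a hypothetical helper, not something the gateway or SDK provides.

```python
def find_ignored_markers(messages):
    """Return warnings for cache_control markers in one of the ignored shapes."""
    warnings = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # Marker on the message itself: ignored whether content is a string or a list.
        if "cache_control" in msg:
            if isinstance(content, str):
                warnings.append(f"messages[{i}]: marker next to plain-string content")
            else:
                warnings.append(f"messages[{i}]: message-level marker outside content blocks")
        # Marker on a content block, but with an unrecognized type value.
        if isinstance(content, list):
            for j, block in enumerate(content):
                if not isinstance(block, dict) or "cache_control" not in block:
                    continue
                marker = block["cache_control"]
                marker_type = marker.get("type") if isinstance(marker, dict) else None
                if marker_type != "ephemeral":
                    warnings.append(
                        f"messages[{i}].content[{j}]: unrecognized type {marker_type!r}"
                    )
    return warnings


bad = [
    {"role": "system", "content": "<prefix>", "cache_control": {"type": "ephemeral"}},
    {"role": "user", "content": [
        {"type": "text", "text": "...", "cache_control": {"type": "persistent"}}
    ]},
]
good = [
    {"role": "system", "content": [
        {"type": "text", "text": "<prefix>", "cache_control": {"type": "ephemeral"}}
    ]},
]
for warning in find_ignored_markers(bad):
    print(warning)
```

An empty list means no marker matched an ignored shape; it does not prove that a cache hit will occur.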

If you have an existing string-content system message, convert it to list form:

# Before
{"role": "system", "content": SYSTEM_PROMPT}

# After
{"role": "system", "content": [
    {"type": "text", "text": SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}}
]}
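If several call sites need this conversion, it could be wrapped in a helper. This is a sketch under the assumption that content is either a plain string or a list of block dicts; `add_cache_marker` is a hypothetical name, and it places the marker on the last content block, as described under How to use it.

```python
import copy


def add_cache_marker(message):
    """Return a copy of message with an ephemeral marker on its last content block."""
    marked = copy.deepcopy(message)  # leave the caller's message untouched
    content = marked["content"]
    if isinstance(content, str):
        # Plain-string content must become a one-block list first.
        content = [{"type": "text", "text": content}]
    if not content:
        raise ValueError("message has no content blocks to mark")
    content[-1]["cache_control"] = {"type": "ephemeral"}
    marked["content"] = content
    return marked


original = {"role": "system", "content": "You are a support agent."}
system_msg = add_cache_marker(original)
print(system_msg)
```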

What can be cached

The marker covers everything from messages[0] up to and including the message containing the marker. Concretely:
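To make the coverage rule concrete, here is a small sketch that splits a message list into the part covered by the marker and the part that is not. `split_at_marker` is a hypothetical helper, and the choice to use the last marked message when several markers are present is an assumption, not documented behavior.

```python
def split_at_marker(messages):
    """Split messages into (cached_prefix, uncached_rest) at the last marked message."""
    last_marked = -1
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict)
            and block.get("cache_control") == {"type": "ephemeral"}
            for block in content
        ):
            last_marked = i  # assumption: the last marker defines the prefix
    # The prefix runs from messages[0] through the marked message, inclusive.
    return messages[: last_marked + 1], messages[last_marked + 1:]


messages = [
    {"role": "system", "content": "You classify Indonesian sentiment."},
    {"role": "user", "content": "Barang ok, ongkir cepat."},
    {"role": "assistant", "content": [
        {"type": "text", "text": "positif", "cache_control": {"type": "ephemeral"}}
    ]},
    {"role": "user", "content": "Salah kirim, refund lambat."},
]
cached, rest = split_at_marker(messages)
print(len(cached), "messages in the cached prefix,", len(rest), "after it")
```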

What can't be cached:

Cache rules

Cost math

Base input rate is the model's input_per_mtok price. The multipliers:

Mode | Multiplier | Example on epithre-omni (Rp7,000 / 1M input tokens)
Regular input | 1.0x | Rp7,000 / 1M
Cache write | 1.25x | Rp8,750 / 1M
Cache read | 0.1x | Rp700 / 1M

Break-even is 1 hit. Writing the cache costs an extra 1.25 - 1.0 = 0.25 of the base input cost for the prefix; each read saves 1.0 - 0.1 = 0.9 of it. Since 0.25 < 0.9, even a single re-use is worth it.
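The same comparison can be written for a prefix that is written once and then read n times. This is a sketch using only the multipliers from the table above; costs are in units of one full-rate pass over the prefix, and n and the C symbols are introduced here for illustration.

```latex
\begin{align*}
C_{\text{cached}}(n)   &= 1.25 + 0.1\,n \\
C_{\text{uncached}}(n) &= 1 + n \\
C_{\text{cached}}(n) < C_{\text{uncached}}(n)
  &\iff 0.25 < 0.9\,n
  \iff n > \tfrac{0.25}{0.9} \approx 0.28
\end{align*}
```

Since n is a whole number, n = 1 already satisfies the inequality. With n = 0 (the prefix is never reused within the cache window), the marker costs 25% more than sending the prefix uncached.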

Stack with the Batch API for maximum savings: a cache read inside a batch is billed at 0.1x * 0.5x = 0.05x of the base input rate, 20x cheaper than uncached realtime input.

Common patterns

Pattern 1: stable system prompt

Most common. Long persona / instructions stay the same; user messages vary.

SYSTEM_PROMPT = """<long stable system prompt, maybe 1500-3000 tokens>"""

def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": user_msg},
        ],
    )

Pattern 2: stable system + few-shot examples

Cache through the examples; only the actual query varies.

def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": "You classify Indonesian sentiment."},
            {"role": "user", "content": "Barang ok, ongkir cepat."},
            {"role": "assistant", "content": "positif"},
            {"role": "user", "content": "Salah kirim, refund lambat."},
            {"role": "assistant", "content": [
                {"type": "text", "text": "negatif",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # ^ everything up to and including this message is cached
            {"role": "user", "content": user_msg},  # only this varies
        ],
    )

Pattern 3: RAG context

Cache the retrieved document context; the question changes per call.

def answer(question, retrieved_docs):
    context_block = "\n\n".join(f"[{d['id']}] {d['text']}" for d in retrieved_docs)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text",
                 "text": f"Answer based on these documents:\n\n{context_block}",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": question},
        ],
    )

This pattern is most valuable when the user asks multiple follow-up questions about the same retrieved set. The first call writes the cache; follow-ups hit it.
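Follow-ups only hit the cache if the rendered context is identical from call to call, since any difference in the prefix causes a new cache write. One way to keep it stable, sketched here under the assumption that each document dict has the `id` and `text` keys used above, is to sort and normalize before joining. The helper name `build_context_block` is hypothetical.

```python
def build_context_block(retrieved_docs):
    """Render documents in a deterministic order so the cached prefix stays identical."""
    # Sort by id so the retriever's ranking order does not change the prefix text.
    ordered = sorted(retrieved_docs, key=lambda d: str(d["id"]))
    # Strip stray leading/trailing whitespace, which would also break the match.
    return "\n\n".join(f"[{d['id']}] {d['text'].strip()}" for d in ordered)


docs_first_call = [
    {"id": "b2", "text": "Refunds take 3-5 working days. "},
    {"id": "a1", "text": "Shipping is free above Rp100,000."},
]
docs_follow_up = list(reversed(docs_first_call))  # same set, different ranking

same_prefix = build_context_block(docs_first_call) == build_context_block(docs_follow_up)
print(same_prefix)
```

The trade-off is that sorting discards the retriever's relevance order; if that order matters to answer quality, an alternative is to retrieve once and reuse the same list for the follow-up questions.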

Pattern 4: agentic loop with cached tools spec

Use case: multi-iteration agentic loop where a stable system prompt + tools spec is replayed each call, only the running scratchpad / observation varies.

The naive approach pays full rate on tool definitions every iteration, since payload["tools"] isn't covered by the marker. Workaround: render the tools as readable text into the cached system message, while still sending the structured tools field for function-call schema enforcement.

def render_tools_doc(tools_spec):
    """Convert structured tools spec to plain text doc for cacheable system prompt."""
    lines = []
    for t in tools_spec:
        fn = t["function"]
        # Parameter names only; tolerate tools that declare no parameters.
        params = ", ".join(fn.get("parameters", {}).get("properties", {}))
        lines.append(f"- `{fn['name']}({params})`")
        lines.append(f"    {fn.get('description', '')}")
    return "\n".join(lines)

SYSTEM_PROMPT = """<your stable system prompt, ~750 tokens>"""

def run_iteration(history, tools_spec):
    system_text = SYSTEM_PROMPT + "\n\n## Available tools\n" + render_tools_doc(tools_spec)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": system_text,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # iteration history (variable) - not cached
            *history,
        ],
        tools=tools_spec,  # structured spec - schema enforcement, NOT cached
    )

Across 9 iterations with a 750-token system + 600-token tools doc = 1350 token prefix:
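A rough sketch of that arithmetic, assuming the 1.25x write and 0.1x read multipliers from the Cost math section and counting only the 1350-token prefix (the iteration history is billed at the regular rate either way). Figures are in full-rate token equivalents.

```python
# Illustrative arithmetic only; the multipliers come from the Cost math section.
PREFIX_TOKENS = 750 + 600   # system prompt + rendered tools doc
ITERATIONS = 9
WRITE_MULT = 1.25           # call 1 writes the cache
READ_MULT = 0.10            # calls 2-9 read it

without_cache = PREFIX_TOKENS * ITERATIONS
with_cache = (
    PREFIX_TOKENS * WRITE_MULT
    + PREFIX_TOKENS * READ_MULT * (ITERATIONS - 1)
)
saving = 1 - with_cache / without_cache

print(f"without cache: {without_cache:,} token-equivalents")
print(f"with cache:    {with_cache:,.1f} token-equivalents")
print(f"saving:        {saving:.1%}")
```

Under these assumptions the prefix costs 2,767.5 token-equivalents instead of 12,150 across the loop, roughly 77% less.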

Verify in usage.cache_creation_input_tokens (expect ~1350 on call 1, then 0) and usage.cache_read_input_tokens (expect ~1350 on calls 2-9).

Diagnostics

Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens to verify your caching is working:

print(resp.usage)

If cache_creation_input_tokens > 0 and cache_read_input_tokens == 0: first call with marker present, explicit cache being written. Expected.

If cache_creation_input_tokens == 0 and cache_read_input_tokens > 0: explicit cache hit (when you used a marker) or automatic backend prefix-cache hit (when you didn't). To distinguish, check whether you sent a cache_control marker in the request: only marker-based hits get the 0.1x billing rate.

If both are 0: no explicit marker was recognized (none was sent, or it was sent in one of the ignored shapes) and there was no recent backend prefix-cache hit. You're paying the full input rate. To unlock billing savings, add a cache_control marker on stable prefixes of ≥100 tokens.

If you expected an explicit hit but got cache_creation_input_tokens > 0: the prefix differs from the previous call. Check whitespace, message order, and the content of every block up to and including the marked one.
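The cases above can be folded into a small helper for logging. This is a sketch that assumes usage is available as a plain dict with the two counter fields shown earlier (convert it first if your SDK returns an object); `classify_usage` and its return labels are hypothetical, and the final branch, where both counters are non-zero, is not covered by the cases above.

```python
def classify_usage(usage, sent_marker):
    """Map the two cache counters to one of the diagnostic cases."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if created > 0 and read == 0:
        # Expected on a first call; on a repeat call it means the prefix changed.
        return "cache write"
    if created == 0 and read > 0:
        # Only marker-based hits are billed at the 0.1x rate.
        return "explicit cache hit" if sent_marker else "backend prefix-cache hit"
    if created == 0 and read == 0:
        return "no caching"
    return "mixed"  # both counters non-zero; not described in this section


first = {"prompt_tokens": 2450, "cache_creation_input_tokens": 2400, "cache_read_input_tokens": 0}
second = {"prompt_tokens": 2450, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400}

print(classify_usage(first, sent_marker=True))
print(classify_usage(second, sent_marker=True))
```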

Common mistakes