Prompt caching
If you send the same system prompt or few-shot examples on every request, prompt caching saves you up to 90% on those input tokens. Mark the cacheable portion with cache_control; subsequent requests with the same prefix within 5 minutes hit the cache.
Two layers operate independently:
- Explicit cache (this guide): you mark cacheable content with cache_control. You get the 1.25x write / 0.1x read billing rates and a 5-minute Redis TTL keyed per (api_key, model, prefix).
- Automatic backend prefix cache: the inference backend opportunistically reuses KV state when a recent request shared a prompt prefix. This reduces latency but carries no billing discount: you still pay the full input rate. It surfaces in usage.cache_read_input_tokens for observability when no explicit marker was used. Best-effort, no TTL guarantees.
The two stack naturally: an explicit cache hit will almost always also be a backend prefix-cache hit, so explicit caching gives you both billing savings and the latency win.
Why it matters
Common workload: a customer support bot with a 2000-token system prompt + 500-token user message. Without caching, every request pays for the full 2500 input tokens.
With caching:
- First call: 1.25x rate on the 2000 cached tokens (write cost), 1x on the 500 user tokens.
- Calls 2+ within 5 minutes: 0.1x rate on the 2000 cached tokens (read), 1x on the user tokens.
- Net savings on call 2: (2000 * 0.9) / 2500 = 72% off input.
- Over a steady-state load of 100 requests/hour, you pay 1 write + 99 reads: (1.25 + 99 * 0.1) / 100 ≈ 0.11x on the cached portion, or roughly 89% savings.
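The arithmetic above can be reproduced with a short sketch. The function name and its arguments are illustrative and not part of any SDK; the multipliers are the 1.25x write and 0.1x read rates stated in this guide, and costs are expressed in token-equivalents at the base input rate. It assumes every call after the first lands inside the TTL.

```python
WRITE_MULT = 1.25  # cache write multiplier, as stated in this guide
READ_MULT = 0.10   # cache read multiplier, as stated in this guide


def input_cost(cached_tokens, user_tokens, calls):
    """Return (uncached_cost, cached_cost) in base-rate token-equivalents.

    Assumes call 1 writes the cache and every later call is a cache hit.
    """
    if calls < 1:
        return 0.0, 0.0
    uncached = calls * (cached_tokens + user_tokens)
    write = WRITE_MULT * cached_tokens + user_tokens
    reads = (calls - 1) * (READ_MULT * cached_tokens + user_tokens)
    return float(uncached), write + reads


# Savings on the second call alone: (2000 * 0.9) / 2500
call_2_savings = (2000 * (1 - READ_MULT)) / 2500

# Savings on the cached portion only, across 100 calls (1 write + 99 reads)
baseline, with_cache = input_cost(2000, 0, 100)
portion_savings = 1 - with_cache / baseline

print(f"call 2: {call_2_savings:.2%}, cached portion over 100 calls: {portion_savings:.2%}")
```

Passing a non-zero user_tokens value shows how the uncached part of each request dilutes the overall saving.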
How to use it
Mark the cacheable content with cache_control: {"type": "ephemeral"} on the last content block of the prefix you want cached.
# Assumes `client` is an already-configured SDK client.
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[
        {"role": "system", "content": [
            {"type": "text",
             "text": "<long stable system prompt with few-shot examples>",
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": "A user question that changes on every call..."},
    ],
)
print(resp.usage)
# {
#   "prompt_tokens": 2450,
#   "completion_tokens": 180,
#   "total_tokens": 2630,
#   "cache_creation_input_tokens": 2400,  # written to cache on this call
#   "cache_read_input_tokens": 0          # this is the first call
# }
Second call within 5 minutes with the same prefix:
# Same messages, just a different user content
# usage on the response:
# {"prompt_tokens": 2450, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400}
The cached tokens bill at 0.1x input rate.
Marker placement: what the gateway actually parses
The cache layer only sees markers placed at content-block level inside a list-form content field. Anything else is silently ignored — you'll still get a successful response, just no caching.
Required shape:
{"role": "system", "content": [
    {"type": "text", "text": "<prefix>", "cache_control": {"type": "ephemeral"}}
]}
Ignored shapes (response succeeds, marker silently dropped, full input rate billed):
// Plain-string content - content must be a list, not a string
{"role": "system", "content": "<prefix>", "cache_control": {"type": "ephemeral"}}
// Message-level field outside content - parser only inspects content blocks
{"role": "system", "cache_control": {"type": "ephemeral"}, "content": [...]}
// Wrong type value - only "ephemeral" is recognized
{"type": "text", "text": "...", "cache_control": {"type": "persistent"}}
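Because these shapes fail silently, a client-side check before sending can help. The following is a minimal sketch of such a check, based only on the three ignored shapes listed above; find_marker_problems is a hypothetical helper, not something the gateway or any SDK provides.

```python
def find_marker_problems(messages):
    """Return warnings for cache_control markers this guide says are ignored."""
    problems = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # A marker on the message itself is never inspected by the parser.
        if "cache_control" in msg:
            if isinstance(content, str):
                problems.append(f"messages[{i}]: marker on plain-string content; use list form")
            else:
                problems.append(f"messages[{i}]: marker at message level; move it onto a content block")
        if not isinstance(content, list):
            continue
        for j, block in enumerate(content):
            marker = block.get("cache_control") if isinstance(block, dict) else None
            if marker is None:
                continue
            kind = marker.get("type") if isinstance(marker, dict) else None
            if kind != "ephemeral":
                problems.append(f"messages[{i}].content[{j}]: unrecognized type {kind!r}")
    return problems
```

An empty list means no marker matches one of the known-ignored shapes; it does not prove the request will hit the cache.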
If you have an existing string-content system message, convert it to list form:
# Before
{"role": "system", "content": SYSTEM_PROMPT}
# After
{"role": "system", "content": [
    {"type": "text", "text": SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}}
]}
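If many call sites build string-content messages, the conversion can be wrapped in a small helper. This is a sketch only; with_cache_marker is a hypothetical name, and it assumes the marker belongs on the last content block of the message, as described under "How to use it".

```python
def with_cache_marker(message):
    """Return a copy of `message` whose last content block carries the ephemeral marker."""
    content = message["content"]
    if isinstance(content, str):
        # Plain-string content must become list form for the marker to be parsed.
        blocks = [{"type": "text", "text": content}]
    else:
        # Shallow-copy each block so the caller's message is left untouched.
        blocks = [dict(block) for block in content]
    if not blocks:
        raise ValueError("message has no content blocks to mark")
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**message, "content": blocks}


system_message = with_cache_marker({"role": "system", "content": "<prefix>"})
```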
What can be cached
The marker covers everything from messages[0] up to and including the message containing the marker. Concretely:
- System messages with long stable content
- Few-shot examples in early user/assistant turns
- Document context that doesn't change between calls
- RAG context (the retrieved docs portion, before the query)
What can't be cached:
- The top-level tools field. Tool definitions sent as payload["tools"] are billed at full input rate every call, even with a marker present elsewhere, because the cache layer only inspects messages. Workaround for large tool specs: duplicate the tool documentation as plain text into your cached system message, and keep the structured tools field for function-call schema enforcement. The marker then covers the documentation copy; the structured spec stays full-rate (typically ~10% of the documentation token cost on JSON Schema). See Pattern 4: agentic loop with cached tools spec.
- Anything after the marker. Content after the marker is full-rate input. This is by design, so per-call variable content (user queries, iteration turns) stays uncached.
- Image content blocks, currently. A marker on a message containing an image_url block is parsed normally, but vision-token billing doesn't carry a cache discount yet. This is on the roadmap.
Cache rules
- Minimum prefix length: 100 tokens (~400 chars). Below this, the marker is ignored and you bill at full rate. There's no point caching tiny prefixes.
- TTL: 5 minutes from last access. Cache hit refreshes the TTL — long-running loops stay warm as long as gaps between calls are under 5 minutes.
- Cache key scope: (api_key_id, model, prefix_content). Scope is per-key, not per-account. Two API keys on the same account with identical prefixes will each MISS on their first call. Use one consistent key per workload to maximize hit rate. No cross-key sharing means no cross-tenant billing contamination either.
- Backend prefix cache is separate and shared: the inference backend's KV cache (a latency optimization) IS shared across tenants. You may see prompt_tokens_details.cached_tokens > 0 even on a cold explicit-cache miss because another tenant primed the same prefix at the backend. This affects latency only; your billing is still based on the explicit-cache (api_key_id, ...) key.
- One breakpoint per request currently. Multi-breakpoint support is on the roadmap.
- Identical prefix wins: cache lookup is exact hash match on prefix content. One typo, a timestamp baked into the system prompt, a session ID injected per-call — all break the cache. Pull dynamic content out of the cached prefix.
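Since lookup is an exact match on prefix content, it can help to log a local fingerprint of the prefix and watch for unexpected changes between calls. The sketch below is a client-side debugging aid only: prefix_fingerprint is a hypothetical helper, and the SHA-256-over-JSON scheme is an assumption for illustration, not the gateway's actual hashing.

```python
import hashlib
import json


def prefix_fingerprint(messages, model):
    """Hash the model plus every message up to and including the marked one.

    Returns None when no content block carries a cache_control marker.
    Local debugging aid; the gateway's real cache key is computed server-side.
    """
    marker_index = None
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict) and "cache_control" in block for block in content
        ):
            marker_index = i
    if marker_index is None:
        return None
    prefix = messages[: marker_index + 1]
    payload = json.dumps({"model": model, "prefix": prefix},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the fingerprint differs between two calls that were expected to share a prefix, something dynamic (a timestamp, an ID, stray whitespace) has probably leaked into the cached portion.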
Cost math
Base input rate is the model's input_per_mtok price. The multipliers:
| Mode | Multiplier | Example on epithre-omni (Rp7,000 / 1M tok input) |
|---|---|---|
| Regular input | 1.0x | Rp7,000 / 1M |
| Cache write | 1.25x | Rp8,750 / 1M |
| Cache read | 0.1x | Rp700 / 1M |
Break-even is 1 hit: the write premium is (1.25 - 1.0) = 0.25 of input cost, and reading once saves (1.0 - 0.1) = 0.9 of input cost. Since 0.25 < 0.9, even one re-use is worth it.
Stack with Batch API for max savings: cache-read inside a batch = 0.1x * 0.5x = 0.05x of base input, 20x cheaper than uncached realtime input.
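To see how quickly the write premium is amortized, the average multiplier on the cached prefix can be computed for a given number of uses. This is a rough sketch using the 1.25x and 0.1x rates from the table; effective_multiplier is an illustrative name, and it assumes one write followed only by cache hits.

```python
def effective_multiplier(uses, write=1.25, read=0.10):
    """Average input multiplier on the cached prefix across `uses` calls.

    The first call pays the write rate; every later call pays the read rate.
    """
    if uses < 1:
        raise ValueError("uses must be >= 1")
    return (write + (uses - 1) * read) / uses


for n in (1, 2, 5, 10):
    print(n, round(effective_multiplier(n), 3))
```

Under these assumptions a prefix used once averages 1.25x (worse than not caching), twice averages 0.675x, and ten times averages 0.215x, which matches the break-even reasoning above.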
Common patterns
Pattern 1: stable system prompt
Most common. Long persona / instructions stay the same; user messages vary.
SYSTEM_PROMPT = """<long stable system prompt, maybe 1500-3000 tokens>"""
def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": user_msg},
        ],
    )
Pattern 2: stable system + few-shot examples
Cache through the examples; only the actual query varies.
def chat(user_msg):
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": "You classify Indonesian sentiment."},
            {"role": "user", "content": "Barang ok, ongkir cepat."},
            {"role": "assistant", "content": "positif"},
            {"role": "user", "content": "Salah kirim, refund lambat."},
            {"role": "assistant", "content": [
                {"type": "text", "text": "negatif",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # ^ everything above this is cached
            {"role": "user", "content": user_msg},  # only this varies
        ],
    )
Pattern 3: RAG context
Cache the retrieved document context; the question changes per call.
def answer(question, retrieved_docs):
    context_block = "\n\n".join(f"[{d['id']}] {d['text']}" for d in retrieved_docs)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text",
                 "text": f"Answer based on these documents:\n\n{context_block}",
                 "cache_control": {"type": "ephemeral"}}
            ]},
            {"role": "user", "content": question},
        ],
    )
This pattern is most valuable when the user asks multiple follow-up questions about the same retrieved set. The first call writes the cache; follow-ups hit it.
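Follow-ups only hit the cache if the rendered context is byte-for-byte identical, so a retriever that returns the same documents in a different order would cause a miss. One possible mitigation, sketched below, is to render the documents in a stable order. build_context_block is a hypothetical helper; sorting by id discards relevance ordering, which may or may not be acceptable for your prompts.

```python
def build_context_block(retrieved_docs):
    """Render docs in a stable order so the same set always yields the same text."""
    # De-duplicate by id (last occurrence wins), then sort by id as a string.
    unique = {doc["id"]: doc for doc in retrieved_docs}
    ordered = sorted(unique.values(), key=lambda doc: str(doc["id"]))
    return "\n\n".join(f"[{doc['id']}] {doc['text']}" for doc in ordered)


docs = [{"id": "b", "text": "two"}, {"id": "a", "text": "one"}]
context_block = build_context_block(docs)
```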
Pattern 4: agentic loop with cached tools spec
Use case: multi-iteration agentic loop where a stable system prompt + tools spec is replayed each call, only the running scratchpad / observation varies.
The naive approach pays full rate on tool definitions every iteration, since payload["tools"] isn't covered by the marker. Workaround: render the tools as readable text into the cached system message, while still sending the structured tools field for function-call schema enforcement.
def render_tools_doc(tools_spec):
    """Convert structured tools spec to plain text doc for cacheable system prompt."""
    lines = []
    for t in tools_spec:
        fn = t["function"]
        # Tolerate tools declared without a parameters schema.
        params = ", ".join(fn.get("parameters", {}).get("properties", {}).keys())
        lines.append(f"- `{fn['name']}({params})`")
        lines.append(f" {fn.get('description', '')}")
    return "\n".join(lines)
SYSTEM_PROMPT = """<your stable system prompt, ~750 tokens>"""
def run_iteration(history, tools_spec):
    system_text = SYSTEM_PROMPT + "\n\n## Available tools\n" + render_tools_doc(tools_spec)
    return client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content": [
                {"type": "text", "text": system_text,
                 "cache_control": {"type": "ephemeral"}}
            ]},
            # iteration history (variable) - not cached
            *history,
        ],
        tools=tools_spec,  # structured spec - schema enforcement, NOT cached
    )
Across 9 iterations with a 750-token system + 600-token tools doc = 1350 token prefix:
- Call 1: 1× cache write (1.25x) on 1350 tokens
- Calls 2-9: 8× cache read (0.1x) on 1350 tokens each
- Savings vs no-cache baseline: (8 × 0.9 - 0.25) × 1350 / (9 × 1350) ≈ 77% on the cached prefix portion. The 0.25 term is the write premium paid on call 1.
Verify in usage.cache_creation_input_tokens (expect ~1350 on call 1, then 0) and usage.cache_read_input_tokens (expect ~1350 on calls 2-9).
Diagnostics
Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens to verify your caching is working:
print(resp.usage)
If cache_creation_input_tokens > 0 and cache_read_input_tokens == 0: first call with marker present, explicit cache being written. Expected.
If cache_creation_input_tokens == 0 and cache_read_input_tokens > 0: explicit cache hit (when you used a marker) or automatic backend prefix-cache hit (when you didn't). To distinguish, check whether you sent a cache_control marker in the request — only marker-based hits get the 0.1x billing rate.
If both are 0: no explicit marker and no recent backend prefix-cache hit. You're paying full input rate. To unlock billing savings, add a cache_control marker on stable prefixes ≥100 tokens.
If you expected an explicit hit but got cache_creation_input_tokens > 0: the prefix differs from the previous call. Check whitespace, message order, content of every block before the marker.
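The cases above can be folded into a small helper for logging or alerting. This is a sketch, not an SDK feature: classify_cache_usage and its return labels are hypothetical, and it takes usage as a plain dict, so an SDK usage object would need converting to a dict first.

```python
def classify_cache_usage(usage, marker_sent):
    """Map the two usage counters onto the cases described in this section.

    `usage` is a plain dict; missing or null counters are treated as 0.
    `marker_sent` is whether the request carried a cache_control marker.
    """
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if created > 0:
        # First call with a marker, or the prefix changed since the last call.
        return "write"
    if read > 0:
        # Only marker-based hits are billed at the 0.1x rate.
        return "explicit_hit" if marker_sent else "backend_prefix_hit"
    return "no_cache"
```

Seeing "write" repeatedly on calls that should share a prefix is the signal to compare the content of every block before the marker.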
Common mistakes
- Putting cache_control in the wrong place: it must be on a content block, not the message itself. Wrong: {"role": "system", "content": "...", "cache_control": ...}. Right: {"role": "system", "content": [{"type": "text", "text": "...", "cache_control": {...}}]}.
- Cache miss because of identifier in system prompt: if your system prompt includes the user's ID or session ID, every user's cache is separate. Pull dynamic identifiers out of the cached prefix.
- Trying to cache short prefixes: below 100 tokens, the marker is silently ignored. Aim for prefixes 500+ tokens for meaningful savings.
- Mixing cache_control with non-cacheable tool definitions: tool definitions today aren't cached. The marker still applies to the message content; tool tokens are just billed full-rate alongside.
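One way to keep identifiers out of the cached prefix is to send them after the marker, for example at the top of the user message. The sketch below illustrates the idea; build_messages and the bracketed identifier format are assumptions for illustration, not a convention the API requires.

```python
def build_messages(stable_prompt, user_id, session_id, user_msg):
    """Keep the cached system block identical for every user; put identifiers after it."""
    return [
        {"role": "system", "content": [
            {"type": "text", "text": stable_prompt,
             "cache_control": {"type": "ephemeral"}}
        ]},
        # Content after the marker is uncached, so per-call values are safe here.
        {"role": "user",
         "content": f"[user_id={user_id} session_id={session_id}]\n{user_msg}"},
    ]


first = build_messages("<stable prompt>", "u-1", "s-1", "Where is my order?")
second = build_messages("<stable prompt>", "u-2", "s-9", "Cancel my order.")
```

Both requests share the same system block, so the second can reuse the prefix written by the first as long as the same API key and model are used.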
Related
- Best practices guide - retry patterns, batch + cache compounding.
- Cookbook: RAG - end-to-end pattern with caching for retrieved context.
- Chat reference - full cache_control schema.