Streaming responses (SSE)

Streaming returns tokens as they're generated, instead of waiting for the full response. Better UX for chat apps; required for very long outputs.

Wire format: server-sent events (SSE). Each chunk is a data: {...} line followed by a blank line, and the stream is terminated by data: [DONE]. This is the industry-standard shape, compatible with common SSE chat clients.

Enabling streaming

Set stream: true in your request body.

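from openai import OpenAI

# base_url is inferred from the endpoint used in the raw-HTTP example below.
client = OpenAI(base_url="https://api.epithre.com/v1", api_key=EPITHRE_KEY)

stream = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": "Ceritakan sejarah singkat Jakarta"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # e.g. a usage-only chunk with an empty choices array
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)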

SSE wire format

Each chunk on the wire:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Jakarta "}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"adalah "}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"ibu kota..."}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27}}

data: [DONE]

The last chunk before [DONE] carries usage, and its choices array is empty. Set stream_options: {"include_usage": true} whenever you need token counts. Set it explicitly in every client: the Python openai SDK only sends stream_options when you pass it, and raw HTTP clients must add it to the request body themselves.
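Because the usage chunk has an empty choices array (as in the sample above), a consumer that indexes choices[0] unconditionally will fail on it. The following is a minimal sketch of a tolerant consumer. It assumes chunk objects shaped like the SDK's (attributes choices, delta.content, usage); collect_stream is a hypothetical helper name, not part of any SDK.

```python
def collect_stream(chunks):
    """Return (text, usage) accumulated from an iterable of SDK-style chunks."""
    parts = []
    usage = None
    for chunk in chunks:
        # The usage chunk carries no choices, so read usage before indexing.
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts), usage
```

Called as text, usage = collect_stream(stream), it returns usage as None when the request did not ask for token counts.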

Manual SSE parsing

If you're not using the SDK:

import json
import os

import httpx

# Assumes the API key is exported as an environment variable of this name.
EPITHRE_KEY = os.environ["EPITHRE_KEY"]

with httpx.stream(
    "POST", "https://api.epithre.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {EPITHRE_KEY}", "Content-Type": "application/json"},
    json={
        "model": "epithre-omni",
        "messages": [{"role": "user", "content": "..."}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    timeout=60,
) as resp:
    resp.raise_for_status()  # surface 4xx/5xx before trying to parse events
    for line in resp.iter_lines():
        if not line.startswith("data:"):
            continue  # blank separators and non-data fields
        data_str = line[len("data:"):].strip()
        if data_str == "[DONE]":
            break
        chunk = json.loads(data_str)
        choices = chunk.get("choices", [])
        if choices:
            delta = choices[0].get("delta", {})
            if delta.get("content"):
                print(delta["content"], end="", flush=True)
        if chunk.get("usage"):  # the usage-only chunk has empty choices
            print(f"\n[usage: {chunk['usage']}]")
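The parsing step can also be pulled out of the network loop so it can be tested against canned lines. The sketch below is one way to do that. The name iter_sse_payloads is hypothetical, and it assumes every data: line carries either a JSON object or the [DONE] sentinel, as in the wire format shown earlier.

```python
import json


def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Stops at the [DONE] sentinel. Blank separator lines and any
    non-data fields are skipped.
    """
    for line in lines:
        if not line.startswith("data:"):
            continue
        # SSE allows "data:" with or without a single space after the colon.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```

With httpx, the same generator can wrap resp.iter_lines() directly, which keeps the request code and the parsing code separate.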

Handling tool calls in streams

Tool calls are split across delta chunks. Accumulate by tool_calls[].index:

tool_calls = {}  # index -> {"id", "name", "args"}
for chunk in stream:
    if not chunk.choices:  # usage-only chunk
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tc in delta.tool_calls:
            slot = tool_calls.setdefault(tc.index, {"id": "", "name": "", "args": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] += tc.function.name
            # Arguments arrive as fragments of a JSON string; concatenate now, parse later.
            if tc.function and tc.function.arguments:
                slot["args"] += tc.function.arguments
    if delta.content:
        print(delta.content, end="", flush=True)

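After the stream ends, tool_calls holds the complete calls. Each args value is a JSON string that is only guaranteed to parse once every fragment has arrived.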
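Turning the accumulator into usable calls means parsing each argument string after the stream has finished. The sketch below assumes the index -> {id, name, args} layout used above; finalize_tool_calls and the output field names are hypothetical.

```python
import json


def finalize_tool_calls(tool_calls):
    """Convert the index -> {id, name, args} accumulator into an ordered list.

    Arguments are parsed only here, once every fragment has arrived.
    """
    calls = []
    for index in sorted(tool_calls):
        slot = tool_calls[index]
        try:
            arguments = json.loads(slot["args"]) if slot["args"] else {}
        except json.JSONDecodeError:
            # Truncated stream or malformed output: leave parsing to the
            # caller, who still has the raw string.
            arguments = None
        calls.append({
            "id": slot["id"],
            "name": slot["name"],
            "arguments": arguments,
            "raw_arguments": slot["args"],
        })
    return calls
```

Returning None for unparseable arguments, rather than raising, lets the caller distinguish a call with no arguments from one that was cut off mid-stream.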

Aborting a stream

With the openai SDK, call .close() on the stream object to abort. With raw httpx, leaving the stream context closes the connection.

When you abort:

The TCP connection closes.
On Epithre's gateway, the request is persisted regardless: the LLM keeps generating to completion and the response is saved. If you reconnect and request the same conversation, you'll see the complete answer.
No tokens beyond those emitted before the abort are billed; the usage event records the tokens actually generated.

This is a deliberate design choice: a client disconnect should not waste compute. It benefits both sides: you can resume, and we don't have to handle interrupted-mid-token cleanup paths.
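A client-side abort can be as simple as breaking out of the loop and closing the stream. The sketch below stops after a character budget. It assumes only that the stream is iterable and exposes close(), as described above; stream_with_limit is a hypothetical helper name.

```python
def stream_with_limit(stream, max_chars):
    """Collect streamed text and close the stream once max_chars is reached."""
    parts = []
    total = 0
    try:
        for chunk in stream:
            if not chunk.choices:
                continue
            content = chunk.choices[0].delta.content
            if not content:
                continue
            parts.append(content)
            total += len(content)
            if total >= max_chars:
                break  # stop reading; the connection is released in finally
    finally:
        stream.close()  # runs on normal end, early break, and exceptions
    return "".join(parts)
```

Putting close() in a finally block means the connection is released on every exit path, not just the early break.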

Network reliability

For unreliable connections (mobile, VPNs):

Set a generous timeout (60-180s for long outputs).
Don't retry on partial output. A retry resubmits the whole prompt; you'll get a fresh full response. If you want incremental progress, build your client to handle partial-output state.
Buffer the partial result as it streams. If the connection dies mid-stream, you still have what was received so far. A missing finish_reason means the response was cut off; use that to decide whether to retry.
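The buffering advice above can be sketched as a consumer that keeps whatever arrived and reports whether the model actually finished. It assumes SDK-style chunk objects; consume_buffered is a hypothetical name, and the broad except clause is a placeholder that should be narrowed to the error types your HTTP client raises.

```python
def consume_buffered(chunks):
    """Read a stream into a buffer, surviving a mid-stream connection error.

    Returns (text, finish_reason, error). finish_reason stays None when
    the stream was cut off before the model finished.
    """
    buffer = []
    finish_reason = None
    error = None
    try:
        for chunk in chunks:
            if not chunk.choices:
                continue
            choice = chunk.choices[0]
            if choice.delta.content:
                buffer.append(choice.delta.content)
            if getattr(choice, "finish_reason", None):
                finish_reason = choice.finish_reason
    except Exception as exc:  # assumption: narrow this in real code
        error = exc
    return "".join(buffer), finish_reason, error
```

A result with an error and no finish_reason is a truncated response. Whether to show the partial text or resubmit the prompt for a fresh response is an application decision.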

Common gotchas

1. SSE buffering by proxies

If output arrives in large bursts rather than token by token, a proxy between you and the gateway is buffering the stream. Options:

From the client side: you cannot turn the buffering off, because it happens upstream.
Server side: Epithre's gateway sets the header X-Accel-Buffering: no, which tells nginx-like proxies not to buffer. If you self-host a proxy, make sure it honors or replicates this header.
Reduce max_tokens so responses are shorter; if the model emits its full response in 100 tokens, buffering matters less.
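To check whether a stream is being batched, one rough diagnostic is to timestamp each chunk and look at the gaps between arrivals. The sketch below is only a heuristic: both function names are hypothetical, the 2 ms threshold is an arbitrary assumption, and a high score suggests batching but does not prove it.

```python
import time


def record_arrivals(chunks, clock=time.monotonic):
    """Consume a stream and return the arrival time of each chunk."""
    return [clock() for _ in chunks]


def burst_fraction(arrivals, gap_threshold=0.002):
    """Fraction of chunks arriving within gap_threshold seconds of the
    previous one. Values near 1.0 suggest chunks are delivered in batches."""
    if len(arrivals) < 2:
        return 0.0
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return sum(1 for gap in gaps if gap < gap_threshold) / len(gaps)
```

Comparing the score for the same request with and without the suspected proxy in the path is more informative than any single number.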

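2. Thinking tags inside content

Thinking output arrives inline in delta.content, wrapped in <think>...</think> tags. To hide it, filter while streaming:

in_thinking = False
for chunk in stream:
    if not chunk.choices:  # usage-only chunk
        continue
    txt = chunk.choices[0].delta.content
    if not txt:
        continue
    # A chunk can hold a tag plus visible text on either side of it.
    while txt:
        if in_thinking:
            _, closing, txt = txt.partition("</think>")
            in_thinking = not closing  # stay hidden until the closing tag appears
        else:
            visible, opening, txt = txt.partition("<think>")
            print(visible, end="", flush=True)
            in_thinking = bool(opening)

This is rough (a tag split across two chunks is not detected); the SDK doesn't yet split thinking vs content into separate fields. We're tracking upstream SDK changes and will document when better support lands.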

When NOT to stream

Streaming adds protocol overhead. Skip it when:

Output is short (under ~50 tokens). The user won't see the difference.
You're invoking inside a batch job (no human watching).
You need structured JSON output (partial JSON can't be parsed until the last fragment arrives, so streaming gains nothing).
Your client is server-to-server (no UX gain from incremental output).
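The structured-output case is easy to demonstrate: a JSON document cut off partway fails to parse, so a streaming client has to buffer everything before it can use the result. A small sketch, with hypothetical helper names:

```python
import json


def is_parseable(text):
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def parse_streamed_json(fragments):
    """Join streamed text fragments and parse them once the stream is done."""
    return json.loads("".join(fragments))
```

Since the parse can only happen at the end, a non-streaming request returns the same result with less client code.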

See also

Chat reference - full body params for streaming chat.
Best practices guide - retry patterns for production.
Tool use guide - tool calls in streams.