Streaming responses (SSE)

Streaming returns tokens as they're generated, instead of waiting for the full response. Better UX for chat apps; required for very long outputs.

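Wire format: Server-Sent Events (SSE). Each chunk is a data: {...} line followed by a blank line (\n\n), and the stream ends with data: [DONE]. This is the same shape used by OpenAI-compatible chat APIs, so existing SSE chat clients should work without changes.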

Enabling streaming

Set stream: true in your request body. With the openai Python SDK, that is stream=True:

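# `client` is an already-configured openai.OpenAI instance.
stream = client.chat.completions.create(
    model="epithre-omni",
    # Prompt (Indonesian): "Tell me a brief history of Jakarta"
    messages=[{"role": "user", "content": "Ceritakan sejarah singkat Jakarta"}],
    stream=True,
)

for chunk in stream:
    # A usage-only chunk has an empty `choices` list; skip it here.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)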

SSE wire format

Each chunk on the wire:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Jakarta "}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"adalah "}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"ibu kota..."}}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27}}

data: [DONE]

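The last chunk before [DONE] carries usage and has an empty choices array. It is only sent when the request includes stream_options: {"include_usage": true}, so set that whenever you need token counts. The Python openai SDK does not add it for you: pass stream_options={"include_usage": True} to create(), and set it explicitly in raw HTTP requests as well.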
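A minimal sketch of collecting both the text and the usage from a stream, given the chunk shapes shown above. The helper name collect_stream and the stand-in _chunk objects are hypothetical; the only assumption about real chunks is that they expose .choices, .usage, and .choices[0].delta.content the way the SDK examples on this page do.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Return (text, usage) from an iterable of SDK-style chunk objects.

    Hypothetical helper, not part of any SDK.
    """
    parts = []
    usage = None
    for chunk in chunks:
        # The usage chunk carries token counts and an empty `choices` list.
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:  # skips role-only and empty deltas
            parts.append(content)
    return "".join(parts), usage


# Stand-in chunks so the sketch can be run without a network call.
def _chunk(content=None, usage=None):
    choices = [] if usage else [SimpleNamespace(delta=SimpleNamespace(content=content))]
    return SimpleNamespace(choices=choices, usage=usage)


demo = [
    _chunk(""),
    _chunk("Jakarta "),
    _chunk("adalah "),
    _chunk(None),
    _chunk(usage={"prompt_tokens": 18, "completion_tokens": 9, "total_tokens": 27}),
]
text, usage = collect_stream(demo)
```

With a real stream, the call would be text, usage = collect_stream(stream); usage stays None if the request did not ask for it.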

Manual SSE parsing

If you're not using the SDK:

import json

import httpx

# EPITHRE_KEY must hold your API key.
with httpx.stream(
    "POST", "https://api.epithre.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {EPITHRE_KEY}", "Content-Type": "application/json"},
    json={
        "model": "epithre-omni",
        "messages": [{"role": "user", "content": "..."}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    timeout=60,
) as resp:
    resp.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
    for line in resp.iter_lines():
        # Skip the blank separator lines and anything that is not a data field.
        if not line.startswith("data: "):
            continue
        data_str = line[6:].strip()
        if data_str == "[DONE]":  # end-of-stream sentinel, not JSON
            break
        chunk = json.loads(data_str)
        choices = chunk.get("choices", [])
        if choices:
            delta = choices[0].get("delta", {})
            if delta.get("content"):
                print(delta["content"], end="", flush=True)
        # The usage chunk has an empty choices list, so check it separately.
        if chunk.get("usage"):
            print(f"\n[usage: {chunk['usage']}]")
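Separating the line parsing from the HTTP call makes it easy to test against the wire format shown earlier. The following is a sketch only; iter_sse_json is a hypothetical helper name, and it handles just the data: lines and the [DONE] sentinel described on this page, not the full SSE field set.

```python
import json


def iter_sse_json(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Stops at the `[DONE]` sentinel. Hypothetical helper name.
    """
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        if payload == "[DONE]":
            return
        yield json.loads(payload)


# Lines copied from the wire-format example, plus one line after [DONE]
# to show that nothing past the sentinel is read.
wire = [
    'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Jakarta "}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27}}',
    "",
    "data: [DONE]",
    "",
    'data: {"ignored": true}',
]
events = list(iter_sse_json(wire))
```

With httpx, the same generator could be fed directly from the response: for event in iter_sse_json(resp.iter_lines()).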

Handling tool calls in streams

Tool calls are split across delta chunks. Accumulate by tool_calls[].index:

tool_calls = {}  # index -> {id, name, args}
for chunk in stream:
    if not chunk.choices:  # usage-only chunk has no choices
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tc in delta.tool_calls:
            slot = tool_calls.setdefault(tc.index, {"id": "", "name": "", "args": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] += tc.function.name
            if tc.function and tc.function.arguments:
                # Arguments arrive as fragments of a JSON string; concatenate now, parse later.
                slot["args"] += tc.function.arguments
    if delta.content:
        print(delta.content, end="", flush=True)

After the stream ends, tool_calls holds the complete calls. Each args value is a JSON string; parse it only at this point, once it is complete.
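One way to turn the accumulator into something dispatchable, as a sketch. finalize_tool_calls is a hypothetical helper, and the tool names and arguments in the sample data are made up for illustration; only the index -> {id, name, args} shape comes from the loop above.

```python
import json


def finalize_tool_calls(tool_calls):
    """Turn the index -> {id, name, args} accumulator into an ordered list.

    The argument string is parsed only here, after the stream has ended.
    """
    calls = []
    for index in sorted(tool_calls):
        slot = tool_calls[index]
        try:
            arguments = json.loads(slot["args"]) if slot["args"] else {}
        except json.JSONDecodeError:
            arguments = None  # truncated or malformed; the caller decides what to do
        calls.append({"id": slot["id"], "name": slot["name"], "arguments": arguments})
    return calls


# Illustrative accumulator contents (made-up tool names).
accumulated = {
    1: {"id": "call_b", "name": "get_time", "args": ""},
    0: {"id": "call_a", "name": "get_weather", "args": '{"city": "Jakarta"}'},
}
calls = finalize_tool_calls(accumulated)
```

Returning None for unparseable arguments, rather than raising, keeps one bad call (for example from an aborted stream) from discarding the others.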

Aborting a stream

With the openai SDK, call .close() on the stream object to abort. With raw httpx, leaving the with httpx.stream(...) block closes the connection.

When you abort:

This is a deliberate design choice: a client disconnect should not waste compute. That benefits both you (you can resume) and us (we don't have to handle interrupted-mid-token cleanup paths).

Network reliability

For unreliable connections (mobile, VPNs):

Common gotchas

1. SSE buffering by proxies

If your output looks like it arrives in large chunks rather than per-token, there's a buffering proxy in between. Solutions:

2. JSON-decoding partial chunks

When the model is generating JSON, an individual delta.content fragment is not valid JSON on its own, so don't call json.loads on each chunk's text. If you need guaranteed JSON, use response_format with a non-streaming request. Otherwise, collect all delta.content strings and parse once at the end.
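The "parse once at the end" approach, as a small sketch. parse_streamed_json is a hypothetical helper, and the fragments are made-up sample data standing in for successive delta.content values.

```python
import json


def parse_streamed_json(fragments):
    """Join streamed text fragments and decode them as one JSON document."""
    text = "".join(f for f in fragments if f)  # drop None and empty deltas
    return json.loads(text)


# Made-up fragments: none of them is valid JSON by itself.
fragments = ['{"city": "Jak', 'arta", "popul', None, 'ation_known": false}']
result = parse_streamed_json(fragments)
```

json.loads raises json.JSONDecodeError if the stream was cut short, so callers that abort streams should be ready to catch it.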

3. Usage chunk missed

If usage never arrives, the request did not set stream_options: {"include_usage": true}. Neither the openai SDK nor a raw HTTP client adds it for you, so check that it is actually in the request you send.

4. Empty deltas

Many SSE chunks have delta: {} or delta: {"role": "assistant"} with no content. Skip them; they're heartbeat / structural events. Only process when delta.content or delta.tool_calls is truthy.

Streaming with thinking enabled

If you set chat_template_kwargs: {"enable_thinking": true}, the thinking portion streams first (wrapped in <think>...</think> tokens on some models), then the visible content. Filter or display per your UX:

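in_thinking = False
for chunk in stream:
    if not chunk.choices:  # usage-only chunk has no choices
        continue
    txt = chunk.choices[0].delta.content
    if not txt:
        continue
    # A chunk may hold a tag plus text on either side, so consume it piece by piece.
    # Tags split across two chunks are still not handled here.
    while txt:
        if in_thinking:
            _, sep, txt = txt.partition("</think>")
            if sep:
                in_thinking = False
        else:
            visible, sep, txt = txt.partition("<think>")
            print(visible, end="", flush=True)
            if sep:
                in_thinking = True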

This is rough; the SDK doesn't yet split thinking vs content into separate fields. We're tracking upstream SDK changes and will document when better support lands.
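The loop above also misses a tag that is split across two chunks (for example "<th" followed by "ink>"). One possible workaround is a small stateful filter that holds back any trailing text that could be the start of a tag. This is a sketch only: ThinkFilter is a hypothetical name, and it assumes the literal <think> and </think> markers appear in delta.content as described above.

```python
class ThinkFilter:
    """Strip <think>...</think> spans from streamed text, tolerating split tags."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.in_thinking = False
        self._buf = ""

    def feed(self, text):
        """Accept one fragment; return the visible text that can be emitted now."""
        self._buf += text
        out = []
        while self._buf:
            tag = self.CLOSE if self.in_thinking else self.OPEN
            before, sep, after = self._buf.partition(tag)
            if sep:
                # Full tag found: emit what preceded an opening tag, flip state.
                if not self.in_thinking:
                    out.append(before)
                self.in_thinking = not self.in_thinking
                self._buf = after
                continue
            # No full tag: hold back a suffix that could be the start of one.
            keep = self._partial_suffix(self._buf, tag)
            cut = len(self._buf) - keep
            if not self.in_thinking:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            break
        return "".join(out)

    def flush(self):
        """Return held-back text at end of stream (nothing while inside a think span)."""
        rest, self._buf = self._buf, ""
        return "" if self.in_thinking else rest

    @staticmethod
    def _partial_suffix(buf, tag):
        """Length of the longest suffix of buf that is a proper prefix of tag."""
        for n in range(min(len(buf), len(tag) - 1), 0, -1):
            if buf.endswith(tag[:n]):
                return n
        return 0


f = ThinkFilter()
pieces = ["Hi <th", "ink>secret</th", "ink> there"]
visible = "".join(f.feed(p) for p in pieces) + f.flush()
```

In a real loop, each delta.content would go through f.feed(...) and the result would be printed, with f.flush() called once after the stream ends.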

When NOT to stream

Streaming adds protocol overhead. Skip it when: