Streaming responses (SSE)
Streaming returns tokens as they're generated, instead of waiting for the full response. Better UX for chat apps; required for very long outputs.
Wire format: Server-Sent Events (SSE). Each event is a data: {...} line followed by a blank line (\n\n), and the stream is terminated by data: [DONE]. This is the industry-standard shape, compatible with common SSE chat clients.
Enabling streaming
Set stream: true in your request body.
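# Assumes `client` is an openai SDK client already configured for the Epithre API.
stream = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": "Ceritakan sejarah singkat Jakarta"}],
    stream=True,
)
for chunk in stream:
    # A usage-only chunk (sent when include_usage is set) has an empty choices list.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)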
SSE wire format
Each chunk on the wire:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Jakarta "}}]}
data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"adalah "}}]}
data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"ibu kota..."}}]}
data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27}}
data: [DONE]
The final chunk before [DONE] carries usage and has an empty choices array. Include stream_options: {"include_usage": true} whenever you need token counts. Set it explicitly in both SDK calls and raw HTTP requests; the Python openai SDK does not add it unless you pass it.
Manual SSE parsing
If you're not using the SDK:
import httpx, json

# EPITHRE_KEY must hold your API key before this runs.
with httpx.stream(
    "POST", "https://api.epithre.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {EPITHRE_KEY}", "Content-Type": "application/json"},
    json={
        "model": "epithre-omni",
        "messages": [{"role": "user", "content": "..."}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    timeout=60,
) as resp:
    resp.raise_for_status()  # fail fast on 4xx/5xx before reading the body
    for line in resp.iter_lines():
        # Skip blank separator lines and anything that is not a data field.
        if not line.startswith("data: "):
            continue
        data_str = line[6:].strip()
        if data_str == "[DONE]":
            break
        chunk = json.loads(data_str)
        choices = chunk.get("choices", [])
        if choices:
            delta = choices[0].get("delta", {})
            if delta.get("content"):
                print(delta["content"], end="", flush=True)
        # The usage-only chunk has an empty choices list.
        if chunk.get("usage"):
            print(f"\n[usage: {chunk['usage']}]")
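The same loop can be factored into pure functions so it can be exercised without a network call. This is a minimal sketch, not part of any SDK: iter_sse_events and collect_text_and_usage are hypothetical helper names, and the sample lines are copied from the wire-format example in this guide. It reads only the first choice, which assumes a single completion per request.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text_and_usage(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Concatenate delta.content fragments and capture the usage chunk, if any."""
    parts = []
    usage = None
    for chunk in iter_sse_events(lines):
        # Only the first choice is read (assumes one completion per request).
        for choice in chunk.get("choices", [])[:1]:
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage


sample = [
    'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Jakarta "}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"adalah "}}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27}}',
    "",
    "data: [DONE]",
]
text, usage = collect_text_and_usage(sample)
print(text)
print(usage)
```

With httpx, resp.iter_lines() could be passed in place of sample.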
Handling tool calls in streams
Tool calls are split across delta chunks. Accumulate by tool_calls[].index:
tool_calls = {}  # index -> {id, name, args}
for chunk in stream:
    if not chunk.choices:  # e.g. the usage-only chunk
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tc in delta.tool_calls:
            slot = tool_calls.setdefault(tc.index, {"id": "", "name": "", "args": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] += tc.function.name
            if tc.function and tc.function.arguments:
                # Arguments arrive as fragments of one JSON string.
                slot["args"] += tc.function.arguments
    if delta.content:
        print(delta.content, end="")
After the stream ends, tool_calls has the complete calls.
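The accumulated args value is still a raw string at that point. A minimal sketch of turning the accumulator into ordered, parsed calls follows; finalize_tool_calls is a hypothetical helper, and the tool names and ids in the sample data are made up for illustration.

```python
import json


def finalize_tool_calls(tool_calls: dict) -> list:
    """Turn the index -> {id, name, args} accumulator into ordered, parsed calls."""
    finished = []
    for index in sorted(tool_calls):
        slot = tool_calls[index]
        try:
            # An empty string means the tool was called without arguments.
            arguments = json.loads(slot["args"]) if slot["args"] else {}
            error = None
        except json.JSONDecodeError as exc:
            # A stream cut off mid-call leaves the argument string incomplete.
            arguments, error = None, str(exc)
        finished.append(
            {"id": slot["id"], "name": slot["name"], "arguments": arguments, "error": error}
        )
    return finished


# Hypothetical accumulator contents, as the loop above would leave them.
accumulated = {
    1: {"id": "call_b", "name": "get_time", "args": ""},
    0: {"id": "call_a", "name": "get_weather", "args": '{"city": "Jak' + 'arta"}'},
}
calls = finalize_tool_calls(accumulated)
for call in calls:
    print(call["name"], call["arguments"])
```

Keeping the parse error alongside the call lets the caller decide whether to drop a truncated call or re-request it.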
Aborting a stream
The openai SDK aborts on .close(). With raw httpx, the stream context exit closes the connection.
When you abort:
- The TCP connection closes.
- On Epithre's gateway, the request is persisted regardless (the LLM keeps generating to completion, the response is saved). If you reconnect and ask for the same conversation, you'll see the complete answer.
- You aren't billed for any tokens beyond what was emitted before the abort; the usage event records the actual tokens generated.
This is a deliberate design choice: a client disconnect should not throw away work that is already under way. It helps you (you can resume) and it helps us (we don't have to handle interrupted-mid-token cleanup paths).
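A minimal sketch of aborting once a character budget is reached, assuming only that the stream object is iterable and exposes close() as described above. read_until and FakeStream are hypothetical names; FakeStream stands in for a real stream so the example runs offline, and it yields plain strings where a real SDK stream would yield chunk objects.

```python
from typing import Iterable


def read_until(stream: Iterable[str], max_chars: int) -> str:
    """Consume text pieces until max_chars is reached, then close the stream."""
    received = []
    total = 0
    try:
        for piece in stream:
            received.append(piece)
            total += len(piece)
            if total >= max_chars:
                break  # stop reading early; cleanup happens in finally
    finally:
        # Close on every exit path, including exceptions raised mid-stream.
        close = getattr(stream, "close", None)
        if callable(close):
            close()
    return "".join(received)


class FakeStream:
    """Stand-in for a stream object; only iteration and close() are modeled."""

    def __init__(self, pieces):
        self._pieces = iter(pieces)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._pieces)

    def close(self):
        self.closed = True


fake = FakeStream(["Jakarta ", "adalah ", "ibu kota..."])
partial = read_until(fake, max_chars=10)
print(repr(partial), fake.closed)
```

Putting close() in a finally block means the connection is released even if processing a piece raises.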
Network reliability
For unreliable connections (mobile, VPNs):
- Set a generous timeout (60-180s for long outputs).
- Don't retry on partial output. A retry resubmits the whole prompt; you'll get a fresh full response. If you want incremental progress, build your client to handle partial-output state.
- Buffer the partial result as it streams. If the connection dies mid-stream, you have what was received so far. Decide based on finish_reason whether to retry.
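The buffering advice can be sketched as a collector that keeps whatever arrived and reports whether a finish_reason was seen. This is an illustration under assumptions: collect_with_partial and classify are hypothetical names, chunks are plain dicts as in the manual parsing example, and connection drops are assumed to surface as the built-in ConnectionError or TimeoutError (substitute your HTTP client's exception types).

```python
from typing import Iterable, Optional, Tuple


def collect_with_partial(chunks: Iterable[dict]) -> Tuple[str, Optional[str], bool]:
    """Return (text_so_far, finish_reason, interrupted) for a stream of chunk dicts."""
    parts = []
    finish_reason = None
    interrupted = False
    try:
        for chunk in chunks:
            for choice in chunk.get("choices", [])[:1]:
                content = choice.get("delta", {}).get("content")
                if content:
                    parts.append(content)
                if choice.get("finish_reason"):
                    finish_reason = choice["finish_reason"]
    except (ConnectionError, TimeoutError):
        # Assumption: the transport raises these built-ins on a drop.
        interrupted = True
    return "".join(parts), finish_reason, interrupted


def classify(finish_reason: Optional[str]) -> str:
    """A finish_reason means the model finished; without one the text is partial."""
    return "complete" if finish_reason is not None else "partial"


def dropped_stream():
    """Simulates a connection that dies after two content chunks."""
    yield {"choices": [{"index": 0, "delta": {"content": "Jakarta "}}]}
    yield {"choices": [{"index": 0, "delta": {"content": "adalah "}}]}
    raise ConnectionError("connection reset")


text, finish_reason, interrupted = collect_with_partial(dropped_stream())
print(repr(text), classify(finish_reason), interrupted)
```

A "partial" result is the case where the caller has to choose between keeping the buffered text and resubmitting the prompt for a fresh response.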
Common gotchas
1. SSE buffering by proxies
If your output looks like it arrives in large chunks rather than per-token, there's a buffering proxy in between. Solutions:
- From the client side: there's nothing you can do (the buffering happens upstream).
- Server-side: the header X-Accel-Buffering: no is set by Epithre's gateway to tell nginx-like proxies not to buffer. If you self-host a proxy, replicate it.
- Reduce max_tokens so chunks are smaller; if the model emits its full response in 100 tokens, buffering matters less.
2. JSON-decoding partial chunks
Individual delta.content fragments are slices of the output, not valid JSON on their own, so don't call json.loads(chunk_text) on each chunk. If you need guaranteed JSON, use response_format with a non-streaming request. Otherwise, assemble all delta.content strings, then parse once at the end.
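A small illustration of the difference, using made-up fragments in place of real delta.content values. parse_streamed_json is a hypothetical helper name.

```python
import json
from typing import Iterable


def parse_streamed_json(fragments: Iterable[str]) -> object:
    """Join all content fragments, then parse the result exactly once."""
    return json.loads("".join(fragments))


# Made-up fragments, split at arbitrary points as a stream might deliver them.
fragments = ['{"city": "Jak', 'arta", "ta', 'gs": ["capital"', ', "port"]}']

# Parsing a single fragment fails because it is only a slice of the document.
try:
    json.loads(fragments[0])
    first_fragment_ok = True
except json.JSONDecodeError:
    first_fragment_ok = False

document = parse_streamed_json(fragments)
print(first_fragment_ok, document)
```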
3. Usage chunk missed
If usage is missing from the stream, the request didn't set stream_options: {"include_usage": true}. Set it explicitly, whether you call through the openai SDK or raw HTTP.
4. Empty deltas
Many SSE chunks have delta: {} or delta: {"role": "assistant"} with no content. Skip them; they're heartbeat / structural events. Only process when delta.content or delta.tool_calls is truthy.
Streaming with thinking enabled
If you set chat_template_kwargs: {"enable_thinking": true}, the thinking portion streams first (wrapped in <think>...</think> tokens on some models), then the visible content. Filter or display per your UX:
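in_thinking = False
for chunk in stream:
    if not chunk.choices:  # e.g. the usage-only chunk
        continue
    txt = chunk.choices[0].delta.content
    if not txt:
        continue
    # Assumes each tag arrives whole in its own chunk; any other text
    # in that same chunk is dropped.
    if "<think>" in txt:
        in_thinking = True
        continue
    if "</think>" in txt:
        in_thinking = False
        continue
    if not in_thinking:
        print(txt, end="")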
This is rough; the SDK doesn't yet split thinking vs content into separate fields. We're tracking upstream SDK changes and will document when better support lands.
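A somewhat sturdier sketch handles tags that are split across chunks or share a chunk with visible text. It assumes thinking arrives inline in delta.content wrapped in <think>...</think>, which the text above says holds only on some models. ThinkFilter is a hypothetical helper, not an SDK class.

```python
class ThinkFilter:
    """Strips <think>...</think> spans from text that arrives in pieces."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self._buf = ""
        self._in_thinking = False

    @staticmethod
    def _partial_tag_len(buf: str, tag: str) -> int:
        """Length of the longest proper tag prefix that buf ends with."""
        for n in range(min(len(tag) - 1, len(buf)), 0, -1):
            if buf.endswith(tag[:n]):
                return n
        return 0

    def feed(self, text: str) -> str:
        """Add a piece of streamed text; return the part safe to display now."""
        self._buf += text
        out = []
        while True:
            tag = self.CLOSE if self._in_thinking else self.OPEN
            pos = self._buf.find(tag)
            if pos != -1:
                if not self._in_thinking:
                    out.append(self._buf[:pos])
                self._buf = self._buf[pos + len(tag):]
                self._in_thinking = not self._in_thinking
                continue
            # No complete tag: hold back a tail that might be the start of one.
            keep = self._partial_tag_len(self._buf, tag)
            cut = len(self._buf) - keep
            if not self._in_thinking:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            return "".join(out)

    def flush(self) -> str:
        """Call at end of stream to release any held-back visible text."""
        rest = "" if self._in_thinking else self._buf
        self._buf = ""
        return rest


pieces = ["<thi", "nk>plan the ans", "wer</th", "ink>Jakarta ", "adalah ibu kota."]
f = ThinkFilter()
visible = "".join(f.feed(p) for p in pieces) + f.flush()
print(visible)
```

In a real loop, each delta.content value would be passed to feed() and the return value printed, with flush() called once after the stream ends.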
When NOT to stream
Streaming adds protocol overhead. Skip it when:
- Output is short (under ~50 tokens). The user won't see the difference.
- You're invoking inside a batch job (no human watching).
- You need structured JSON output (partial JSON is unparseable).
- Your client is server-to-server (no UX gain from incremental output).
Related
- Chat reference - full body params for streaming chat.
- Best practices guide - retry patterns for production.
- Tool use guide - tool calls in streams.