Vision QA on documents
Send an image and get structured data back. This is a common workflow for invoice, KTP (Indonesian identity card), and receipt processing, where you have a photo and need specific fields extracted.
Basic extraction
import base64, json
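# `client` is not defined on this page: it must be an already-configured SDK client exposing chat.completions.create.
img = base64.b64encode(open("invoice.jpg", "rb").read()).decode()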
resp = client.chat.completions.create(
model="epithre-omni",
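# Prompt (Indonesian): "Extract this invoice's data as JSON: vendor, date, invoice number,
# items (array of {name, quantity, unit price, subtotal}), subtotal, VAT (PPN), total."
messages=[{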
"role": "user",
"content": [
{"type": "text", "text":
"Ekstrak data invoice ini sebagai JSON: vendor, tanggal, "
"nomor_invoice, item (array of {nama, jumlah, harga_satuan, subtotal}), "
"subtotal, ppn, total."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
]
}],
response_format={"type": "json_object"},
)
data = json.loads(resp.choices[0].message.content)
print(data["total"])
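Most snippets on this page hardcode `image/jpeg` in the data URL. If uploads can also be PNG, the MIME type should match the file. A minimal sketch of a helper for that, using only the standard library; `image_to_data_url` is a hypothetical name, not part of any SDK:

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL with a matching MIME type."""
    # Guess the MIME type from the file extension (e.g. .jpg -> image/jpeg).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `url` value of an `image_url` content part in place of the hand-built f-string.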
With strict schema
For production reliability, lock the output shape via json_schema:
schema = {
"type": "object",
"properties": {
"vendor": {"type": "string"},
"vendor_npwp": {"type": "string"},
"invoice_number": {"type": "string"},
"date": {"type": "string", "description": "YYYY-MM-DD"},
"items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"qty": {"type": "integer", "minimum": 1},
"price": {"type": "number"},
"total": {"type": "number"},
},
"required": ["name", "qty", "price", "total"],
"additionalProperties": False,
}
},
"subtotal": {"type": "number"},
"ppn": {"type": "number"},
"grand_total": {"type": "number"},
},
"required": ["vendor", "invoice_number", "date", "items", "grand_total"],
"additionalProperties": False,
}
resp = client.chat.completions.create(
model="epithre-omni",
# Prompt (Indonesian): "Extract the invoice. Format it as JSON according to the schema."
messages=[{"role": "user", "content": [
{"type": "text", "text": "Ekstrak invoice. Format ke JSON sesuai schema."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
]}],
response_format={"type": "json_schema",
"json_schema": {"name": "invoice", "strict": True, "schema": schema}},
)
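A strict schema fixes the shape of the output, not the correctness of the numbers. A minimal sketch of an arithmetic cross-check on the parsed result follows. `check_invoice_totals` and the sample values are hypothetical, and the check assumes there are no discounts or other adjustments between the line items and the grand total.

```python
import json
import math


def check_invoice_totals(invoice: dict, tolerance: float = 0.5) -> list:
    """Return descriptions of arithmetic inconsistencies; an empty list means none found."""
    problems = []

    def close(a: float, b: float) -> bool:
        return math.isclose(a, b, abs_tol=tolerance)

    # Each line item: total should equal qty * price.
    for i, item in enumerate(invoice["items"]):
        expected = item["qty"] * item["price"]
        if not close(item["total"], expected):
            problems.append(f"items[{i}].total is {item['total']}, expected {expected}")

    # subtotal and ppn are optional in the schema, so only check them when present.
    subtotal = invoice.get("subtotal")
    ppn = invoice.get("ppn")
    items_sum = sum(item["total"] for item in invoice["items"])
    if subtotal is not None and not close(subtotal, items_sum):
        problems.append(f"subtotal is {subtotal}, line items sum to {items_sum}")
    if subtotal is not None and ppn is not None:
        if not close(invoice["grand_total"], subtotal + ppn):
            problems.append(f"grand_total is {invoice['grand_total']}, expected {subtotal + ppn}")
    return problems


# Made-up example; in practice: invoice = json.loads(resp.choices[0].message.content)
sample = json.loads("""
{"vendor": "PT Contoh", "invoice_number": "INV-001", "date": "2024-03-15",
 "items": [{"name": "Kertas A4", "qty": 2, "price": 15000, "total": 30000},
           {"name": "Pulpen", "qty": 1, "price": 5000, "total": 5000}],
 "subtotal": 35000, "ppn": 3850, "grand_total": 38850}
""")
print(check_invoice_totals(sample))  # prints []
```

A non-empty result is a signal to route the document to manual review rather than proof that the extraction is wrong, since the source invoice itself may contain the inconsistency.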
KTP / identity document
ktp_img = base64.b64encode(open("ktp.jpg", "rb").read()).decode()
resp = client.chat.completions.create(
model="epithre-omni",
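# Prompt (Indonesian): "Extract KTP data: NIK, full name, place of birth, date of birth, gender,
# address, RT/RW, kelurahan (urban village), kecamatan (district), religion, marital status,
# occupation, citizenship, valid until."
messages=[{"role": "user", "content": [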
{"type": "text", "text":
"Ekstrak data KTP: NIK, nama lengkap, tempat lahir, tanggal lahir, "
"jenis kelamin, alamat, RT/RW, kelurahan, kecamatan, agama, "
"status perkawinan, pekerjaan, kewarganegaraan, berlaku hingga."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{ktp_img}"}},
]}],
response_format={"type": "json_object"},
)
ktp = json.loads(resp.choices[0].message.content)
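The extracted NIK can be cross-checked against the extracted birth date, because a NIK has 16 digits: a 6-digit region code, the birth date as DDMMYY (with 40 added to the day for women), and a 4-digit sequence number. A minimal sketch of that check; `parse_nik` and `nik_matches_birth_date` are hypothetical helpers, the NIK in the example is made up, and parsing the model's date string into a `date` is left to the caller because the prompt above does not pin down key names or date format.

```python
import re
from datetime import date
from typing import Optional


def parse_nik(nik: str) -> Optional[dict]:
    """Split a NIK into its encoded parts, or return None if it is not exactly 16 digits."""
    if not re.fullmatch(r"[0-9]{16}", nik):
        return None
    day = int(nik[6:8])
    female = day > 40  # women have 40 added to the day of birth
    return {
        "region_code": nik[:6],
        "day": day - 40 if female else day,
        "month": int(nik[8:10]),
        "year_2digit": int(nik[10:12]),
        "female": female,
    }


def nik_matches_birth_date(nik: str, birth: date) -> bool:
    """True if the date encoded in the NIK agrees with the given birth date."""
    parts = parse_nik(nik)
    if parts is None:
        return False
    encoded = (parts["day"], parts["month"], parts["year_2digit"])
    return encoded == (birth.day, birth.month, birth.year % 100)


# Made-up NIK: region 317401, born 12 Aug 1990, female (12 + 40 = 52).
print(nik_matches_birth_date("3174015208900003", date(1990, 8, 12)))  # prints True
```

A mismatch usually points to a misread digit in either field, which makes it a cheap trigger for a retry or manual review.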
Receipt with line items
# Receipts are often tall portrait shots; epithre-lyt gives fast turnaround.
# `img` here is the base64-encoded receipt photo, prepared as in "Basic extraction".
resp = client.chat.completions.create(
model="epithre-lyt", # cheap+fast; usually sufficient for receipts
messages=[{"role": "user", "content": [
{"type": "text", "text":
"Receipt analysis: extract merchant name, date, "
"line items (name + price each), and total. Return JSON."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
]}],
response_format={"type": "json_object"},
)
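Because `epithre-lyt` is only usually sufficient, one option is to try it first and escalate to `epithre-omni` when the result fails a check. The following is a sketch of that control flow, not a prescribed pattern: `extract_with_escalation`, `run_model`, and `is_good_enough` are hypothetical names, and the API call is injected as a callable so the logic can be exercised without network access.

```python
from typing import Callable, Optional, Tuple


def extract_with_escalation(
    run_model: Callable[[str], Optional[dict]],
    is_good_enough: Callable[[dict], bool],
    models: Tuple[str, ...] = ("epithre-lyt", "epithre-omni"),
) -> Tuple[Optional[str], Optional[dict]]:
    """Try each model in order; return (model_name, result) for the first acceptable result."""
    for model in models:
        result = run_model(model)
        if result is not None and is_good_enough(result):
            return model, result
    # No model produced an acceptable result.
    return None, None


def fake_run(model: str) -> dict:
    # Stand-in for the real API call: pretend the smaller model misses the total.
    if model == "epithre-lyt":
        return {"merchant": "Toko Contoh", "total": None}
    return {"merchant": "Toko Contoh", "total": 42500}


def has_total(result: dict) -> bool:
    return result.get("total") is not None


model, result = extract_with_escalation(fake_run, has_total)
print(model, result["total"])  # prints: epithre-omni 42500
```

In real use, `run_model` would wrap the `client.chat.completions.create` call shown above and return the parsed JSON.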
Screenshots / UI
Useful for QA bots that look at app screenshots and describe issues.
resp = client.chat.completions.create(
model="epithre-omni",
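# Prompt (Indonesian): "A user reported an error in the app. Look at this screenshot and explain:
# (1) which screen it is, (2) which error appears, (3) suggested debugging steps."
messages=[{"role": "user", "content": [
{"type": "text", "text":
"User melaporkan error di app. Lihat screenshot ini, jelaskan: "
"(1) layar apa, (2) error apa yang muncul, (3) langkah debug yang disarankan."},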
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
]}],
)
Multi-image: compare two states
before = base64.b64encode(open("before.jpg", "rb").read()).decode()
after = base64.b64encode(open("after.jpg", "rb").read()).decode()
resp = client.chat.completions.create(
model="epithre-omni",
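# Prompt (Indonesian): "Image 1: the 'before' state. Image 2: the 'after' state. List the significant changes."
messages=[{"role": "user", "content": [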
{"type": "text", "text": "Gambar 1: kondisi sebelum. Gambar 2: kondisi sesudah. Sebutkan perubahan signifikan."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{before}"}},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{after}"}},
]}],
)
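The prompt refers to the images by position ("Gambar 1", "Gambar 2"), so the image parts presumably need to appear in `content` in that same order. A small sketch of a builder that keeps the ordering explicit and extends to more than two images; `build_image_comparison_content` is a hypothetical helper name.

```python
def build_image_comparison_content(prompt: str, data_urls: list) -> list:
    """Return a `content` list: the text prompt first, then one image part per URL, in order."""
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in data_urls
    )
    return content


# Placeholder data URLs; real ones carry the base64 payload.
parts = build_image_comparison_content(
    "Gambar 1: kondisi sebelum. Gambar 2: kondisi sesudah. Sebutkan perubahan signifikan.",
    ["data:image/jpeg;base64,BEFORE", "data:image/jpeg;base64,AFTER"],
)
print([p["type"] for p in parts])  # prints ['text', 'image_url', 'image_url']
```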
Quality tips
- Image quality matters: blurry photos hurt OCR. Recommend users take well-lit, focused shots.
- Crop tight when possible: a receipt photo where the receipt is 30% of the frame works less well than one where it's 80%.
- Use epithre-lyt for high volume (e.g. mobile app uploads) and epithre-omni for hard cases (poor handwriting, complex layouts).
- Validate extracted fields: NPWP format, NIK format, date plausibility. The model is usually right but always validate critical fields.
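The validation tip above can be sketched with two small checks. The assumptions are stated loudly: NPWP numbers are taken to be either 15 digits (commonly written as 00.000.000.0-000.000) or 16 digits in the newer format, the date is taken to be YYYY-MM-DD, and the 730-day window is an arbitrary illustrative choice. `normalize_npwp` and `is_plausible_invoice_date` are hypothetical helpers.

```python
import re
from datetime import date, timedelta
from typing import Optional


def normalize_npwp(raw: str) -> Optional[str]:
    """Strip punctuation from an NPWP; return the digits, or None if the length is wrong."""
    digits = re.sub(r"[^0-9]", "", raw)
    # Assumption: valid NPWPs have 15 digits (older format) or 16 digits (newer format).
    return digits if len(digits) in (15, 16) else None


def is_plausible_invoice_date(text, today: date, max_age_days: int = 730) -> bool:
    """True if `text` is a real YYYY-MM-DD date, not in the future and not too old."""
    try:
        parsed = date.fromisoformat(text)
    except (TypeError, ValueError):
        # Not a string, wrong format, or an impossible date such as 2024-02-30.
        return False
    return today - timedelta(days=max_age_days) <= parsed <= today


today = date(2024, 4, 1)
print(normalize_npwp("01.234.567.8-901.000"))          # prints 012345678901000
print(is_plausible_invoice_date("2024-03-15", today))  # prints True
print(is_plausible_invoice_date("2024-02-30", today))  # prints False
```

Format checks like these catch misread or dropped digits; they do not confirm that the number belongs to the vendor on the invoice.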
Failure mode handling
When the model can't read the document confidently, it often returns null values or admits uncertainty. To make that behavior explicit and easy to check in code, add an instruction like this to your prompt:
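# Prompt (Indonesian): "Extract the data. If a field cannot be read clearly, set it to null
# rather than guessing. If the whole document is blurry, set the 'readable' field to false
# and leave the other fields empty."
text = ("Ekstrak data. Kalau ada field yang tidak terbaca dengan jelas, "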
"set ke null jangan dikira-kira. Kalau seluruh dokumen blur, "
"set field 'readable' ke false dan kosongkan field lain.")
Cost note
Vision requests on epithre-omni cost about Rp7,000 per 1M input tokens, and an image counts as roughly 1,500-3,000 tokens depending on its size, so an invoice extraction comes to roughly Rp20. epithre-lyt is about 5x cheaper; use it when its accuracy is good enough.
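The per-invoice figure can be reproduced from the numbers above. This sketch covers only the image's input tokens at the quoted epithre-omni rate; prompt text and output tokens are not included, and `image_input_cost_rp` is a hypothetical helper name.

```python
def image_input_cost_rp(image_tokens: int, price_per_million_rp: float = 7_000) -> float:
    """Cost in rupiah of the image's input tokens at a given price per 1M tokens."""
    return image_tokens * price_per_million_rp / 1_000_000


for tokens in (1_500, 3_000):
    print(tokens, image_input_cost_rp(tokens))
# prints:
# 1500 10.5
# 3000 21.0
```

So the image alone costs between about Rp10 and Rp21, which is consistent with the rough Rp20 per extraction quoted above for larger images.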