Vision QA on documents

Send an image, get structured data back. This is a common workflow for invoice, KTP (Indonesian national ID card), and receipt processing, where you have a photo and need the fields extracted.

Basic extraction

import base64, json

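# `client` (used below) is assumed to be an already-configured API client; setup is not shown here.
# Read the photo and base64-encode it so it can be embedded in a data: URL.
img = base64.b64encode(open("invoice.jpg", "rb").read()).decode()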

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{
        "role": "user",
        "content": [
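            # Prompt (Indonesian): "Extract this invoice's data as JSON: vendor, date,
            # invoice number, items (array of {name, quantity, unit price, subtotal}),
            # subtotal, VAT (PPN), total." The field names in the prompt become the JSON keys.
            {"type": "text", "text":
                "Ekstrak data invoice ini sebagai JSON: vendor, tanggal, "
                "nomor_invoice, item (array of {nama, jumlah, harga_satuan, subtotal}), "
                "subtotal, ppn, total."},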
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
        ]
    }],
    response_format={"type": "json_object"},
)
data = json.loads(resp.choices[0].message.content)
print(data["total"])
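The snippets on this page hard-code the MIME type in the data URL (`image/jpeg` or `image/png`). A minimal sketch of a helper that derives it from the file extension instead is shown below; `to_data_url` is a hypothetical name, not part of the API, and it relies only on the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL for an image_url content part."""
    # Guess the MIME type from the file extension (.jpg -> image/jpeg, .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result can be passed directly as the `url` value, e.g. `{"type": "image_url", "image_url": {"url": to_data_url("invoice.jpg")}}`.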

With strict schema

For production reliability, lock the output shape via json_schema:

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "vendor_npwp": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "description": "YYYY-MM-DD"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name":  {"type": "string"},
                    "qty":   {"type": "integer", "minimum": 1},
                    "price": {"type": "number"},
                    "total": {"type": "number"},
                },
                "required": ["name", "qty", "price", "total"],
                "additionalProperties": False,
            }
        },
        "subtotal": {"type": "number"},
        "ppn":      {"type": "number"},
        "grand_total": {"type": "number"},
    },
    "required": ["vendor", "invoice_number", "date", "items", "grand_total"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "Extract the invoice. Format as JSON following the schema."
        {"type": "text", "text": "Ekstrak invoice. Format ke JSON sesuai schema."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
    ]}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "invoice", "strict": True, "schema": schema}},
)
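A strict schema fixes the shape of the output, not the correctness of the numbers in it. As a rough sketch of a follow-up check, the function below compares the extracted amounts against each other using the field names from the schema above. `check_invoice_totals` and the tolerance of Rp1 are assumptions for illustration, and a mismatch is only a flag for review, since real invoices can carry discounts or rounding that this check does not model.

```python
import math


def check_invoice_totals(data: dict, tol: float = 1.0) -> list[str]:
    """Return arithmetic inconsistencies found in an extracted invoice (empty list = consistent)."""
    problems = []
    # Each line item: qty * price should equal the line total.
    for i, item in enumerate(data["items"]):
        expected = item["qty"] * item["price"]
        if not math.isclose(item["total"], expected, abs_tol=tol):
            problems.append(f"items[{i}]: qty * price = {expected}, got {item['total']}")
    # subtotal and ppn are optional in the schema, so only check them when present.
    subtotal = data.get("subtotal")
    ppn = data.get("ppn")
    items_sum = sum(item["total"] for item in data["items"])
    if subtotal is not None and not math.isclose(subtotal, items_sum, abs_tol=tol):
        problems.append(f"subtotal: items sum to {items_sum}, got {subtotal}")
    if subtotal is not None and ppn is not None:
        if not math.isclose(data["grand_total"], subtotal + ppn, abs_tol=tol):
            problems.append(f"grand_total: subtotal + ppn = {subtotal + ppn}, got {data['grand_total']}")
    return problems
```

It would typically run on `json.loads(resp.choices[0].message.content)`, with a non-empty result routing the invoice to manual review.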

KTP / identity document

ktp_img = base64.b64encode(open("ktp.jpg", "rb").read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
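        # Prompt (Indonesian): "Extract the KTP data: NIK, full name, place of birth,
        # date of birth, gender, address, RT/RW, kelurahan, kecamatan, religion,
        # marital status, occupation, citizenship, valid until."
        {"type": "text", "text":
            "Ekstrak data KTP: NIK, nama lengkap, tempat lahir, tanggal lahir, "
            "jenis kelamin, alamat, RT/RW, kelurahan, kecamatan, agama, "
            "status perkawinan, pekerjaan, kewarganegaraan, berlaku hingga."},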
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{ktp_img}"}},
    ]}],
    response_format={"type": "json_object"},
)
ktp = json.loads(resp.choices[0].message.content)
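Because the NIK encodes the holder's birth date and gender, it can be cross-checked against the other extracted fields. The sketch below assumes the usual 16-digit layout (6-digit region code, then DDMMYY with 40 added to the day for women, then a 4-digit serial). The function names are hypothetical, and the keys the model returns in `json_object` mode are not fixed, so the helpers take plain values rather than the `ktp` dict.

```python
import re
from datetime import date


def parse_nik(nik: str) -> dict:
    """Split a 16-digit NIK into region code, encoded birth date, and gender."""
    if not re.fullmatch(r"[0-9]{16}", nik):
        raise ValueError("NIK must be exactly 16 digits")
    day, month, yy = int(nik[6:8]), int(nik[8:10]), int(nik[10:12])
    female = day > 40  # women's NIKs store day of birth + 40
    if female:
        day -= 40
    # Raises ValueError on an impossible day/month; 2000 is a leap year, so Feb 29 passes.
    date(2000, month, day)
    return {"region_code": nik[:6], "day": day, "month": month, "yy": yy, "female": female}


def nik_matches_birth_date(nik: str, birth: date) -> bool:
    """Compare the birth date read from the card with the one encoded in the NIK."""
    p = parse_nik(nik)
    return (p["day"], p["month"], p["yy"]) == (birth.day, birth.month, birth.year % 100)
```

The NIK only carries a two-digit year, so the century has to come from the extracted birth date itself. A mismatch usually means one of the two fields was misread.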

Receipt with line items

# Receipts are often tall portrait shots; epithre-lyt gives fast turnaround on them.
# `img` here should hold the base64-encoded receipt photo.
resp = client.chat.completions.create(
    model="epithre-lyt",   # cheap+fast; usually sufficient for receipts
    messages=[{"role": "user", "content": [
        {"type": "text", "text":
            "Receipt analysis: extract merchant name, date, "
            "line items (name + price each), and total. Return JSON."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
    ]}],
    response_format={"type": "json_object"},
)
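When epithre-lyt is usually, but not always, sufficient, one option is to try it first and escalate to epithre-omni only when the result fails a check. The following is a sketch of that pattern, not a built-in feature: `extract_with_fallback`, `extract`, and `is_valid` are hypothetical names, where `extract(model)` would wrap the `client.chat.completions.create` call above and return the parsed JSON.

```python
from typing import Callable, Optional


def extract_with_fallback(
    extract: Callable[[str], dict],
    is_valid: Callable[[dict], bool],
    models: tuple[str, ...] = ("epithre-lyt", "epithre-omni"),
) -> tuple[Optional[str], Optional[dict]]:
    """Try each model in order; return (model, data) for the first result that passes is_valid."""
    for model in models:
        data = extract(model)
        if is_valid(data):
            return model, data
    # No model produced an acceptable result; the caller can route to manual review.
    return None, None
```

A simple `is_valid` might require that the total is present and that at least one line item was found.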

Screenshots / UI

Useful for QA bots that look at app screenshots and describe issues.

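# Load the screenshot itself (hypothetical filename) so the data matches the image/png type below.
shot = base64.b64encode(open("screenshot.png", "rb").read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "A user reported an error in the app. Look at this screenshot
        # and explain: (1) which screen it is, (2) what error appears, (3) suggested debugging steps."
        {"type": "text", "text":
            "User melaporkan error di app. Lihat screenshot ini, jelaskan: "
            "(1) layar apa, (2) error apa yang muncul, (3) langkah debug yang disarankan."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{shot}"}},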
    ]}],
)

Multi-image: compare two states

before = base64.b64encode(open("before.jpg", "rb").read()).decode()
after  = base64.b64encode(open("after.jpg", "rb").read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
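        # Prompt (Indonesian): "Image 1: the state before. Image 2: the state after.
        # List the significant changes."
        {"type": "text", "text": "Gambar 1: kondisi sebelum. Gambar 2: kondisi sesudah. Sebutkan perubahan signifikan."},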
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{before}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{after}"}},
    ]}],
)
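The prompt above refers to the images by position ("Gambar 1", "Gambar 2"), so the order of the `image_url` parts matters. For more than two images, a small builder keeps that order explicit. This is a sketch with a hypothetical helper name; it only assembles the same message structure used throughout this page.

```python
def build_image_message(prompt: str, data_urls: list[str]) -> dict:
    """Build one user message: the text prompt first, then the images in the given order."""
    content = [{"type": "text", "text": prompt}]
    # Image N in the prompt corresponds to data_urls[N - 1].
    content += [{"type": "image_url", "image_url": {"url": url}} for url in data_urls]
    return {"role": "user", "content": content}
```

The returned dict can be passed as `messages=[build_image_message(prompt, urls)]`.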

Quality tips

Image quality matters: blurry photos hurt OCR, so ask users to take well-lit, focused shots.
Crop tight when possible: a photo where the receipt fills 30% of the frame works less well than one where it fills 80%.
Use epithre-lyt for high volume (e.g. mobile app uploads) and epithre-omni for hard cases (poor handwriting, complex layouts).
Validate extracted fields: NPWP format, NIK format, date plausibility. The model is usually right, but critical fields should always be validated.
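The validation tip can be sketched for two of the fields as follows. Both helpers are hypothetical, and the rules are assumptions to adjust: an NPWP is treated as valid if it has 15 or 16 digits once dots, dashes, and spaces are removed, and a date is treated as plausible if it parses as YYYY-MM-DD and falls within the last two years.

```python
import re
from datetime import date, timedelta
from typing import Optional


def normalize_npwp(raw: str) -> Optional[str]:
    """Strip punctuation from an NPWP; return the digits, or None if the length is wrong."""
    digits = re.sub(r"[.\-\s]", "", raw)
    return digits if re.fullmatch(r"[0-9]{15,16}", digits) else None


def plausible_invoice_date(raw, today: date, max_age_days: int = 730) -> bool:
    """True if raw is a real YYYY-MM-DD date that is not in the future and not too old."""
    try:
        d = date.fromisoformat(raw)
    except (TypeError, ValueError):
        # None, non-strings, and impossible dates such as 2024-02-30 all land here.
        return False
    return today - timedelta(days=max_age_days) <= d <= today
```

Passing `today` explicitly keeps the check deterministic in tests; in production it would be `date.today()`.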

Failure mode handling

When the model can't read the document confidently, it often returns null values or admits uncertainty. To make that behavior explicit and machine-checkable, add an instruction like this to your prompt:

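# Prompt (Indonesian): "Extract the data. If a field cannot be read clearly, set it to null
# rather than guessing. If the whole document is blurry, set the 'readable' field to false
# and leave the other fields empty."
text = ("Ekstrak data. Kalau ada field yang tidak terbaca dengan jelas, "
        "set ke null, jangan dikira-kira. Kalau seluruh dokumen blur, "
        "set field 'readable' ke false dan kosongkan field lain.")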

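On the receiving side, the `readable` flag and null fields requested by that prompt can drive a simple routing decision. The sketch below is one possible shape for it: `triage` and its three status strings are hypothetical, and `critical` is whatever list of keys your application cannot proceed without.

```python
def triage(data: dict, critical: list[str]) -> tuple[str, list[str]]:
    """Classify a parsed response as 'unreadable', 'needs_review', or 'ok'."""
    # The prompt asks the model to set readable=false when the whole document is blurry.
    if data.get("readable") is False:
        return "unreadable", []
    # Fields the model could not read come back as null (None) or are absent.
    missing = [key for key in critical if data.get(key) is None]
    if missing:
        return "needs_review", missing
    return "ok", []
```

An "unreadable" result is a natural point to ask the user to retake the photo, while "needs_review" can go to a human with the missing fields highlighted.
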
Cost note

epithre-omni for vision costs ~Rp7,000 per 1M input tokens, and an image counts as ~1,500-3,000 tokens depending on size, so one invoice extraction comes to roughly Rp20. epithre-lyt is ~5x cheaper; use it where its accuracy is acceptable.
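The arithmetic behind that estimate, as a small sketch: at Rp7,000 per 1M input tokens, an image of 1,500 tokens costs Rp10.5 and one of 3,000 tokens costs Rp21, which is where the "roughly Rp20" figure sits. The function name is hypothetical, and the calculation covers input tokens only, since no output price is given here.

```python
def input_cost_rp(tokens: int, price_per_million_rp: float = 7_000) -> float:
    """Input-token cost in rupiah at the given price per 1M tokens."""
    return tokens * price_per_million_rp / 1_000_000


low = input_cost_rp(1_500)    # small image
high = input_cost_rp(3_000)   # large image
monthly = 100_000 * high      # e.g. 100,000 large-image extractions per month
print(low, high, monthly)     # 10.5 21.0 2100000.0
```

The text portion of the prompt adds a few more input tokens per request on top of the image.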