Vision QA on documents

Send an image, get structured data back. This is a common workflow for invoice, KTP (Indonesian identity card), and receipt processing, where you have a photo and need specific fields extracted.

Basic extraction

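import base64, json

# `client` is assumed to be an SDK client configured elsewhere; it is not defined on this page.
with open("invoice.jpg", "rb") as f:
    img = base64.b64encode(f.read()).decode()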

resp = client.chat.completions.create(
    model="epithre-omni",
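    messages=[{
        "role": "user",
        "content": [
            # Prompt (Indonesian): "Extract this invoice's data as JSON: vendor,
            # tanggal (date), nomor_invoice (invoice number), item (array of
            # {nama, jumlah, harga_satuan, subtotal}), subtotal, ppn (VAT), total."
            {"type": "text", "text":
                "Ekstrak data invoice ini sebagai JSON: vendor, tanggal, "
                "nomor_invoice, item (array of {nama, jumlah, harga_satuan, subtotal}), "
                "subtotal, ppn, total."},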
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
        ]
    }],
    response_format={"type": "json_object"},
)
data = json.loads(resp.choices[0].message.content)
print(data["total"])
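
The snippets on this page hard-code the MIME type in the data URL. As a minimal sketch, the helpers below derive it from the file extension instead. The names `guess_image_mime`, `encode_data_url`, and `to_data_url` are hypothetical (not part of any SDK), and the sketch assumes the extension reflects the real file format.

```python
import base64
import mimetypes
from pathlib import Path


def guess_image_mime(path: str) -> str:
    """Guess an image MIME type from the file extension (assumes it is accurate)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    return mime


def encode_data_url(raw: bytes, mime: str) -> str:
    """Wrap raw image bytes in a base64 data URL."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"


def to_data_url(path: str) -> str:
    """Read an image file and return it as a data URL with a matching MIME type."""
    return encode_data_url(Path(path).read_bytes(), guess_image_mime(path))
```

With this in place, the `"url"` value in a request could be written as `to_data_url("invoice.jpg")`, which avoids sending a PNG under an `image/jpeg` label.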

With strict schema

For production reliability, lock the output shape via json_schema:

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "vendor_npwp": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "description": "YYYY-MM-DD"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name":  {"type": "string"},
                    "qty":   {"type": "integer", "minimum": 1},
                    "price": {"type": "number"},
                    "total": {"type": "number"},
                },
                "required": ["name", "qty", "price", "total"],
                "additionalProperties": False,
            }
        },
        "subtotal": {"type": "number"},
        "ppn":      {"type": "number"},
        "grand_total": {"type": "number"},
    },
    "required": ["vendor", "invoice_number", "date", "items", "grand_total"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "Extract the invoice. Format it as JSON according to the schema."
        {"type": "text", "text": "Ekstrak invoice. Format ke JSON sesuai schema."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
    ]}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "invoice", "strict": True, "schema": schema}},
)
invoice = json.loads(resp.choices[0].message.content)
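
A schema fixes the shape of the output, not whether the numbers were read correctly. One cheap sanity check is to verify the invoice arithmetic after parsing. The sketch below uses the field names from the schema above; `check_invoice_totals` is a hypothetical helper, and both the Rp1 rounding tolerance and the rule that `subtotal + ppn` should equal `grand_total` are assumptions that will not hold for invoices with discounts or other charges.

```python
def check_invoice_totals(invoice: dict, tolerance: float = 1.0) -> list[str]:
    """Return a list of arithmetic inconsistencies (empty if none were found)."""
    problems = []
    for i, item in enumerate(invoice["items"]):
        expected = item["qty"] * item["price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(f"item {i}: qty*price={expected}, total={item['total']}")

    # subtotal and ppn are optional in the schema, so only check them when present.
    subtotal = invoice.get("subtotal")
    if subtotal is not None:
        items_sum = sum(item["total"] for item in invoice["items"])
        if abs(items_sum - subtotal) > tolerance:
            problems.append(f"items sum to {items_sum}, subtotal={subtotal}")
        ppn = invoice.get("ppn")
        if ppn is not None and abs(subtotal + ppn - invoice["grand_total"]) > tolerance:
            problems.append(
                f"subtotal+ppn={subtotal + ppn}, grand_total={invoice['grand_total']}"
            )
    return problems


sample = {
    "items": [{"name": "Kertas A4", "qty": 2, "price": 50000.0, "total": 100000.0}],
    "subtotal": 100000.0,
    "ppn": 11000.0,
    "grand_total": 111000.0,
}
print(check_invoice_totals(sample))  # prints [] because every check passes
```

A non-empty result is a reasonable trigger for a retry or manual review rather than proof that the extraction is wrong.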

KTP / identity document

with open("ktp.jpg", "rb") as f:
    ktp_img = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
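    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "Extract KTP data: NIK, full name, place of birth,
        # date of birth, gender, address, RT/RW, kelurahan, kecamatan, religion,
        # marital status, occupation, citizenship, valid until."
        {"type": "text", "text":
            "Ekstrak data KTP: NIK, nama lengkap, tempat lahir, tanggal lahir, "
            "jenis kelamin, alamat, RT/RW, kelurahan, kecamatan, agama, "
            "status perkawinan, pekerjaan, kewarganegaraan, berlaku hingga."},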
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{ktp_img}"}},
    ]}],
    response_format={"type": "json_object"},
)
ktp = json.loads(resp.choices[0].message.content)
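
Misread digits are a typical failure on ID cards, and the NIK carries some built-in redundancy: it is 16 digits long, and digits 7 to 12 encode the birth date as DDMMYY, with 40 added to the day for women. The sketch below cross-checks an extracted NIK against an extracted birth date. `nik_matches_birthdate` is a hypothetical helper, and because the prompt above does not fix the JSON key names, it takes plain values rather than the `ktp` dict.

```python
import re
from datetime import date


def nik_matches_birthdate(nik: str, birthdate: date) -> bool:
    """Check that a 16-digit NIK encodes the given birth date in digits 7-12."""
    if not re.fullmatch(r"[0-9]{16}", nik):
        return False
    day, month, year2 = int(nik[6:8]), int(nik[8:10]), int(nik[10:12])
    if day > 40:  # women have 40 added to the day component
        day -= 40
    return (day, month, year2) == (birthdate.day, birthdate.month, birthdate.year % 100)


# Made-up NIK for illustration: born 17 August 1990.
print(nik_matches_birthdate("3201011708900001", date(1990, 8, 17)))  # True
```

A mismatch does not say which of the two fields is wrong, only that the card is worth a second look.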

Receipt with line items

# Receipts are often tall portrait shots; epithre-lyt gives fast turnaround.
# `img` here is the base64-encoded receipt photo, loaded as in "Basic extraction".
resp = client.chat.completions.create(
    model="epithre-lyt",   # cheap+fast; usually sufficient for receipts
    messages=[{"role": "user", "content": [
        {"type": "text", "text":
            "Receipt analysis: extract merchant name, date, "
            "line items (name + price each), and total. Return JSON."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
    ]}],
    response_format={"type": "json_object"},
)

Screenshots / UI

Useful for QA bots that look at app screenshots and describe issues.

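# `img` here must hold the base64-encoded PNG screenshot (not the JPEG invoice
# from earlier), because the data URL below declares image/png.
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "A user reported an error in the app. Look at this
        # screenshot and explain: (1) which screen it is, (2) what error appears,
        # (3) suggested debugging steps."
        {"type": "text", "text":
            "User melaporkan error di app. Lihat screenshot ini, jelaskan: "
            "(1) layar apa, (2) error apa yang muncul, (3) langkah debug yang disarankan."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},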
    ]}],
)

Multi-image: compare two states

with open("before.jpg", "rb") as f:
    before = base64.b64encode(f.read()).decode()
with open("after.jpg", "rb") as f:
    after = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
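    messages=[{"role": "user", "content": [
        # Prompt (Indonesian): "Image 1: the before state. Image 2: the after state.
        # List the significant changes."
        {"type": "text", "text": "Gambar 1: kondisi sebelum. Gambar 2: kondisi sesudah. Sebutkan perubahan signifikan."},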
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{before}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{after}"}},
    ]}],
)
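
Every request on this page uses the same message shape: one text part followed by one or more `image_url` parts. As a minimal sketch, that shape can be built for any number of images with a small function. `build_image_message` is a hypothetical helper, not an SDK function; it only assembles the dict and makes no network call.

```python
def build_image_message(prompt: str, image_urls: list[str]) -> dict:
    """Build one user message: the text part first, then one image part per URL."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

The result would be passed as `messages=[build_image_message(prompt, urls)]`. Images keep the order of `image_urls`, which matters when the prompt refers to them by position ("Gambar 1", "Gambar 2").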

Quality tips

Failure mode handling

When the model can't read a document confidently, it often returns null values or says that it is unsure. To make that behavior explicit and consistent, add an instruction like this to your prompt:

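# Prompt (Indonesian): "Extract the data. If a field cannot be read clearly,
# set it to null, do not guess. If the whole document is blurry, set the
# 'readable' field to false and leave the other fields empty."
text = ("Ekstrak data. Kalau ada field yang tidak terbaca dengan jelas, "
        "set ke null jangan dikira-kira. Kalau seluruh dokumen blur, "
        "set field 'readable' ke false dan kosongkan field lain.")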

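On the receiving side, the `readable` flag and the null fields requested by that prompt can drive a simple triage step. This is a sketch under assumptions: `triage_extraction` is a hypothetical helper, the three status labels are illustrative, and a missing `readable` key is treated as readable.

```python
def triage_extraction(result: dict, required: list[str]) -> tuple[str, list[str]]:
    """Classify a parsed extraction as 'rejected', 'needs_review', or 'ok'."""
    # Only an explicit False counts as unreadable; a missing flag is treated as readable.
    if result.get("readable") is False:
        return "rejected", []
    # A required field that is absent or null was not read with confidence.
    missing = [key for key in required if result.get(key) is None]
    if missing:
        return "needs_review", missing
    return "ok", []


status, missing = triage_extraction(
    {"readable": True, "vendor": "PT Contoh", "total": None},
    required=["vendor", "total"],
)
print(status, missing)  # needs_review ['total']
```

Rejected documents are candidates for asking the user to retake the photo; documents needing review can go to a human with only the missing fields highlighted.
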
Cost note

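For vision, epithre-omni costs about Rp7,000 per 1M input tokens, and each image counts as roughly 1,500 to 3,000 tokens depending on its size. The image alone therefore costs about Rp10 to Rp21, so budget roughly Rp20 per invoice extraction. If its accuracy is acceptable for your documents, epithre-lyt is about 5x cheaper.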
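
The per-invoice estimate is simple arithmetic on the figures quoted above, sketched here for reuse. `input_cost_rp` is a hypothetical helper; it covers input tokens only, and output tokens are not included because this page gives no price for them.

```python
PRICE_PER_MILLION_INPUT_TOKENS_RP = 7_000  # epithre-omni figure quoted above


def input_cost_rp(image_tokens: int, prompt_tokens: int = 0) -> float:
    """Estimate the input cost in rupiah for one request (output tokens excluded)."""
    total_tokens = image_tokens + prompt_tokens
    return total_tokens * PRICE_PER_MILLION_INPUT_TOKENS_RP / 1_000_000


print(input_cost_rp(1_500))  # 10.5 (small image)
print(input_cost_rp(3_000))  # 21.0 (large image)
```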

See also