Multimodal - vision, image generation, and cross-modal embeddings

Three distinct multimodal capabilities on Epithre:

Vision input on chat - send images to epithre-omni / epithre-lyt, get text responses about them.
Image generation + editing - epithre-iris text-to-image and reference-guided editing.
Cross-modal embeddings - epithre-embed produces text and image vectors in the same 4000-dim space.

Each has its own ideal use case. This guide covers all three plus when to combine them.

1. Vision input on chat

Pass image_url content blocks alongside text in your messages array.

import base64

# `client` is assumed to be an already-initialized Epithre SDK client.
with open("invoice.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{
        "role": "user",
        "content": [
            # Indonesian prompt: "Extract the vendor, total, and date from this invoice."
            {"type": "text", "text": "Ekstrak vendor, total, dan tanggal dari invoice ini."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)

epithre-omni is the stronger choice for vision tasks. epithre-lyt also accepts image input and is faster and cheaper, but its answers are less detailed.

Supported formats

PNG, JPEG, WebP, GIF
Up to 20 MB per image (decoded size)
Multiple images per message: add one image_url content block per image. Order matters; the model sees them in array order.
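As a rough sketch of the multi-image case, the helper below builds a content array with one text block followed by one image_url block per file, in the order given. The function names, the suffix-to-MIME table, and the client-side size check are illustrative assumptions, not part of the Epithre SDK; the 20 MB limit and the format list come from the section above.

```python
import base64
from pathlib import Path

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MB decoded-size limit from this guide
# Hypothetical lookup table covering the four supported formats.
MIME_BY_SUFFIX = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def image_block(path):
    """Return one image_url content block for a local image file."""
    mime = MIME_BY_SUFFIX.get(Path(path).suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported image format: {path}")
    raw = Path(path).read_bytes()
    if len(raw) > MAX_IMAGE_BYTES:
        raise ValueError(f"{path} exceeds the 20 MB limit")
    b64 = base64.b64encode(raw).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def build_content(prompt, image_paths):
    """Text block first, then one image block per path, preserving order."""
    return [{"type": "text", "text": prompt}] + [image_block(p) for p in image_paths]
```

The returned list can be used as the "content" value of a user message. Because order is preserved, a prompt can refer to "the first image" and "the second image" unambiguously.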

What it's good at

OCR + structured extraction: invoices, KTP, receipts, screenshots.
Layout-aware understanding: charts, tables, infographics, maps.
Indonesian text on images: handles Indonesian-language text in images natively; no separate OCR step needed.
Multi-image reasoning: compare two photos, find differences.
Document review: extract key claims from a screenshot of a contract.

What it's less good at

Pixel-perfect counting: counting many small items (>10) in a complex scene.
Very fine handwriting: print is fine; cursive Indonesian handwriting is hit-or-miss.
Spatial measurements: precise distances, sizes in pixels.

Pre-processing

Heavy compression or low resolution hurts OCR-style tasks. Recommendations:

Use original or near-original quality JPEG/PNG.
Aspect ratio doesn't matter; the model handles wide and tall images.
If the image is huge (>5 MB), down-sample to ~1500 px on the longest side for faster inference, usually without a noticeable drop in answer quality.
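The down-sampling suggestion above comes down to a small piece of arithmetic. The sketch below computes the target dimensions while preserving aspect ratio; the function name and the default of 1500 px are illustrative, and the actual resize would be done by whatever image library you already use.

```python
def downsample_size(width, height, max_side=1500):
    """Return (w, h) scaled so the longest side is at most max_side.

    Aspect ratio is preserved. Images that are already small enough
    are returned unchanged (this never up-scales).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a zero-pixel side on extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting tuple can be passed to a resize call such as Pillow's `Image.resize((w, h))` before base64-encoding the image.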

2. Image generation and editing

epithre-iris is our image generation model. Three operations:

Text-to-image (generation)

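import base64

resp = client.images.generate(
    model="epithre-iris",
    # Mixed Indonesian/English prompt: "roadside coffee stall in Jakarta at night, ..."
    prompt="warung kopi pinggir jalan Jakarta malam, cinematic, golden hour, wide shot",
    size="768x768",
    num_steps=20,        # 4-step default for preview; 20-30 for final
    seed=42,             # reproducibility
)

# The image is returned base64-encoded; decode it before writing to disk.
with open("warkop.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))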

Parameters:

size: "WxH" up to 960x960. Default 768x768. Rounded down to a multiple of 16.
num_steps: 1-50. Quality plateaus around 20-25 for most prompts.
seed: -1 for random; any other integer for reproducibility.
guidance_scale: default 1.0; higher values follow the prompt more literally.
lora: "none" (default), "dark" (moody/cinematic), "anime".
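To see in advance what the size rule will do to a request, the rounding can be reproduced client-side. This is a minimal sketch under the limits stated above (960 px maximum, rounded down to a multiple of 16). The function name is hypothetical, and raising on oversize input is an assumption, since the guide does not say whether the API rejects or clamps such values.

```python
def normalize_size(size, max_side=960, multiple=16):
    """Round a "WxH" string down to the nearest multiple of 16 per side."""
    w_str, h_str = size.lower().split("x")
    w, h = int(w_str), int(h_str)
    if w > max_side or h > max_side:
        # Assumption: treat oversize requests as an error rather than clamping.
        raise ValueError(f"size {size} exceeds {max_side}x{max_side}")
    w -= w % multiple
    h -= h % multiple
    if w <= 0 or h <= 0:
        raise ValueError(f"size {size} is too small")
    return f"{w}x{h}"
```

For example, a request for "770x500" would come out as "768x496", which explains output dimensions that differ slightly from what was asked for.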

Single-image edit

resp = client.images.edit(
    model="epithre-iris",
    prompt="change sky to dramatic stormy clouds with lightning",
    image=open("original.png", "rb"),
    size="512x512",
    strength=0.7,       # 0-1; higher = bigger change from source
)

Multi-reference composition

Pass up to 5 reference images; the model composes them into a single output.

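import base64
import os

import httpx

# Assumes the API key is exported in the environment under this name.
EPITHRE_KEY = os.environ["EPITHRE_KEY"]

refs = []
for i in range(3):
    with open(f"ref_{i}.png", "rb") as f:
        refs.append(base64.b64encode(f.read()).decode())

resp = httpx.post(
    "https://api.epithre.com/v1/images/edits",
    headers={"Authorization": f"Bearer {EPITHRE_KEY}"},
    json={
        "model": "epithre-iris",
        "prompt": "the product from image 1, displayed in the studio setting from image 2, in the photography style of image 3",
        "images": refs,
        "size": "640x640",
    },
    timeout=120.0,  # httpx defaults to 5 s, which image generation can easily exceed
)
resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
r = resp.json()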

Common pattern: product-in-context shots from real product photo + lifestyle scene + style reference.
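The raw HTTP call above returns parsed JSON rather than an SDK object. Assuming the edits endpoint mirrors the generation response (a "data" list whose items carry "b64_json"), which this guide does not confirm, a small helper along these lines could write the result to disk. The function name is hypothetical.

```python
import base64
from pathlib import Path


def save_first_image(payload, out_path):
    """Decode payload["data"][0]["b64_json"] and write it to out_path.

    Returns the number of bytes written. The response shape is an
    assumption based on the text-to-image example in this guide.
    """
    try:
        b64 = payload["data"][0]["b64_json"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape: {payload!r}") from exc
    data = base64.b64decode(b64)
    Path(out_path).write_bytes(data)
    return len(data)
```

Failing loudly on an unexpected shape makes it easier to spot an error payload that was returned in place of an image.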

Iris quirks

Prompts work best in English but Indonesian is also supported.
Best at: photorealistic, cinematic, illustration with lora=dark or lora=anime.
Limited at: text rendering inside images (don't ask for words on signs), faithful logo reproduction, precise color reproduction.

3. Cross-modal embeddings

epithre-embed is the key differentiator: text and images embed into the same 4000-dim vector space, so cosine similarity between a text vector and an image vector is meaningful.

import base64

import numpy as np

# Embed text and image in one call
with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.embeddings.create(
    model="epithre-embed",
    input=[
        "kucing oren tidur di kasur",  # Indonesian: "orange cat sleeping on a bed"
        {"type": "image", "image": img_b64},
    ],
)

text_vec = resp.data[0].embedding   # 4000-dim
img_vec  = resp.data[1].embedding   # 4000-dim

# Both vectors are L2-normalized, so the dot product equals cosine similarity.
sim = np.dot(text_vec, img_vec)
print(f"Text-image similarity: {sim:.3f}")
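The example relies on the vectors being L2-normalized, which is why a plain dot product works as cosine similarity. The dependency-free sketch below shows that relationship on toy 2-dim vectors; the helper names are illustrative, and the full cosine formula is the safer choice if vectors might not be normalized (for instance after averaging several embeddings).

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity that does not assume normalized inputs."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, the dot product and the cosine agree.
same = math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```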

Use cases

Pattern A: text query -> image search

Build an index of product photos, then search it by text description.

import base64
import glob

import numpy as np

# Build index once
catalog_vecs = []
for path in glob.glob("products/*.jpg"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    r = client.embeddings.create(model="epithre-embed",
                                 input=[{"type": "image", "image": b64}])
    catalog_vecs.append((path, np.array(r.data[0].embedding)))

# Search by text; vectors are L2-normalized, so the dot product is the cosine similarity
def search(query, top_k=5):
    qv = np.array(client.embeddings.create(
        model="epithre-embed", input=[query]).data[0].embedding)
    scored = [(p, float(qv @ v)) for p, v in catalog_vecs]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
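"Build index once" only holds if the vectors survive a process restart. One simple option, sketched here with hypothetical helper names and a JSON file as the store, is to save the (path, vector) pairs to disk and reload them instead of re-embedding the catalog. For large catalogs a vector database is likely a better fit.

```python
import json


def save_index(catalog_vecs, index_path):
    """Write (path, vector) pairs to a JSON file."""
    records = [
        {"path": path, "embedding": [float(x) for x in vec]}
        for path, vec in catalog_vecs
    ]
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_index(index_path):
    """Read the pairs back as (path, list_of_floats) tuples."""
    with open(index_path, encoding="utf-8") as f:
        return [(r["path"], r["embedding"]) for r in json.load(f)]
```

The loaded vectors are plain lists; wrap each in `np.array(...)` before using them with the `search` function above.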

Pattern B: image query -> text search

A user uploads a product photo; find the matching text descriptions.

# Pre-indexed text vectors
descriptions = [...]
desc_vecs = [client.embeddings.create(model="epithre-embed", input=[d]).data[0].embedding
             for d in descriptions]

def search_by_image(img_path, top_k=3):
    b64 = base64.b64encode(open(img_path, "rb").read()).decode()
    qv = np.array(client.embeddings.create(
        model="epithre-embed",
        input=[{"type": "image", "image": b64}]).data[0].embedding)
    scored = [(d, float(qv @ np.array(v))) for d, v in zip(descriptions, desc_vecs)]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

Pattern C: unified hybrid index

Store text passages and images in a single pgvector table. A query from either modality searches both.

-- Requires the pgvector extension (halfvec needs version 0.7.0 or later).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE assets (
    id BIGSERIAL PRIMARY KEY,
    kind TEXT NOT NULL,           -- 'text' or 'image'
    payload TEXT,                 -- text content or image path
    embedding halfvec(4000)       -- L2-normalized
);
CREATE INDEX ON assets USING hnsw (embedding halfvec_cosine_ops);

-- Search across both modalities at once ($1 is the query embedding):
SELECT kind, payload, 1 - (embedding <=> $1::halfvec) AS sim
FROM assets ORDER BY embedding <=> $1::halfvec LIMIT 20;

This works because all vectors live in the same space. No special routing or index segregation needed.
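To load rows into the table above from Python, an embedding can be sent as pgvector's text form, a bracketed comma-separated list, and cast to halfvec in SQL. The sketch below only prepares the parameters. The helper names are hypothetical, and the `%s` placeholder style is an assumption that matches psycopg-style drivers; other drivers use different placeholders.

```python
EMBED_DIM = 4000  # dimension of epithre-embed vectors per this guide

# Placeholder syntax is driver-specific; %s is the psycopg convention.
INSERT_SQL = (
    "INSERT INTO assets (kind, payload, embedding) "
    "VALUES (%s, %s, %s::halfvec)"
)


def to_vector_literal(vec):
    """Format a sequence of numbers as a pgvector text literal, e.g. [0.5,-1.0]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def asset_row(kind, payload, vec, dim=EMBED_DIM):
    """Build the parameter tuple for INSERT_SQL, with basic validation."""
    if kind not in ("text", "image"):
        raise ValueError(f"kind must be 'text' or 'image', got {kind!r}")
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    return (kind, payload, to_vector_literal(vec))
```

With a psycopg-style cursor this would be used as `cur.execute(INSERT_SQL, asset_row("image", path, embedding))`. Checking the dimension up front gives a clearer error than a failed cast inside the database.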

Combining the three

A realistic workflow that uses all three:

# 1. User uploads a damaged-product photo to support
#    (`uploaded` is the file object from your upload handler)
img_b64 = base64.b64encode(uploaded.read()).decode()

# 2. Search past support tickets / KB for similar issues (cross-modal embed)
#    (`vector_search` stands in for your own index lookup, e.g. the pgvector query in Pattern C)
qv = client.embeddings.create(model="epithre-embed",
                              input=[{"type": "image", "image": img_b64}]).data[0].embedding
matches = vector_search(qv, top_k=5)

# 3. Have the model look at the photo + retrieved context
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"User reported this issue. Similar past cases: {matches}. Diagnose and suggest next steps."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
)

# 4. If the resolution requires a replacement product render, generate it
#    (`extracted_description` is a product description you derive from step 3; extraction not shown)
preview = client.images.generate(
    model="epithre-iris",
    prompt=f"product replacement preview: {extracted_description}",
    size="768x768",
)
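Step 3 interpolates the raw matches list into the prompt, which sends the model a Python repr. Formatting the retrieved cases as a short numbered list usually reads better. This sketch assumes each match is a (kind, payload, sim) tuple, mirroring the columns selected in Pattern C; the real shape depends on your `vector_search` implementation, and the function name is hypothetical.

```python
def format_matches(matches, max_chars=200):
    """Render retrieved cases as a numbered list for use inside a prompt."""
    lines = []
    for i, (kind, payload, sim) in enumerate(matches, start=1):
        # Collapse whitespace so multi-line tickets stay on one line.
        snippet = " ".join(str(payload).split())
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"{i}. [{kind}, sim={sim:.2f}] {snippet}")
    return "\n".join(lines) if lines else "(no similar cases found)"
```

The result can replace `{matches}` in the step 3 prompt. Truncating each payload keeps long tickets from crowding out the instruction.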

Cookbook: vision QA on documents - invoice/KTP/receipt extraction patterns.
Cookbook: cross-modal RAG - end-to-end text<->image retrieval.
Cookbook: image generation app - Iris production patterns.
Embeddings reference - full body params.
Image reference - generation + edit params.