Multimodal: vision + cross-modal embed
Three distinct multimodal capabilities on Epithre:
- Vision input on chat - send images to epithre-omni/epithre-lyt, get text responses about them.
- Image generation + editing - epithre-iris text-to-image and reference-guided editing.
- Cross-modal embeddings - epithre-embed produces text and image vectors in the same 4000-dim space.
Each has its own ideal use case. This guide covers all three plus when to combine them.
1. Vision input on chat
Pass image_url content blocks alongside text in your messages array.
import base64

# `client` is assumed to be an already-configured API client (setup not shown).
with open("invoice.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{
        "role": "user",
        "content": [
            # Prompt (Indonesian): "Extract the vendor, total, and date from this invoice."
            {"type": "text", "text": "Ekstrak vendor, total, dan tanggal dari invoice ini."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
epithre-omni is best for vision tasks. epithre-lyt supports image input too and is faster/cheaper but less detailed.
Supported formats
- PNG, JPEG, WebP, GIF
- Up to 20 MB per image (decoded size)
- Multiple images per message: stack content blocks. Order matters; model "sees" them in array order.
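The example above hardcodes image/jpeg in the data URL. A minimal sketch of a helper that picks the MIME type from the file extension, checks the size limit, and stacks several images in order might look like the following. The names image_block, vision_content, MIME_BY_EXT, and MAX_IMAGE_BYTES are hypothetical, and reading "20 MB" as 20,000,000 bytes is a conservative assumption rather than a documented value.

```python
import base64
from pathlib import Path

# Extension -> MIME type for the formats listed above.
MIME_BY_EXT = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}

# Assumption: "20 MB" is read conservatively as 20,000,000 raw (decoded) bytes.
MAX_IMAGE_BYTES = 20_000_000


def image_block(filename: str, data: bytes) -> dict:
    """Build one image_url content block from raw image bytes."""
    ext = Path(filename).suffix.lower()
    if ext not in MIME_BY_EXT:
        raise ValueError(f"unsupported image format: {filename!r}")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"{filename} is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")
    b64 = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{MIME_BY_EXT[ext]};base64,{b64}"},
    }


def vision_content(prompt: str, images: list[tuple[str, bytes]]) -> list[dict]:
    """Text block first, then one image block per image, in the order given."""
    return [{"type": "text", "text": prompt}] + [
        image_block(name, data) for name, data in images
    ]
```

The returned list can be passed as the "content" value of a user message. Because the model sees images in array order, the order of the images argument is the order you can refer to in the prompt.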
What it's good at
- OCR + structured extraction: invoices, KTP, receipts, screenshots.
- Layout-aware understanding: charts, tables, infographics, maps.
- Indonesian text on images: handles ID-language text in images natively; no separate OCR step needed.
- Multi-image reasoning: compare two photos, find differences.
- Document review: extract key claims from a screenshot of a contract.
What it's less good at
- Pixel-perfect counting: counting many small items (>10) in a complex scene.
- Very fine handwriting: print is fine; cursive Indonesian handwriting is hit-or-miss.
- Spatial measurements: precise distances, sizes in pixels.
Pre-processing
Heavy compression or low resolution will hurt OCR-style tasks. We recommend:
- Original or near-original quality JPEG/PNG.
- Aspect ratio doesn't matter; the model handles wide / tall.
- If the image is huge (>5 MB), down-sample to ~1500 px on the longest side for faster inference without noticeable quality loss.
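The down-sampling rule can be sketched as a small size calculation. downsampled_size and downsample_file are hypothetical helper names. The second function assumes the third-party Pillow library is installed and imports it lazily, so the size calculation itself has no dependencies.

```python
def downsampled_size(width: int, height: int, max_side: int = 1500) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side.

    Aspect ratio is preserved and images are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downsample_file(src: str, dst: str, max_side: int = 1500) -> None:
    """Resize an image file on disk. Assumption: Pillow is installed."""
    from PIL import Image  # third-party; imported only when this is called

    with Image.open(src) as img:
        new_size = downsampled_size(img.width, img.height, max_side)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst)
```

For example, a 4000x3000 photo maps to 1500x1125, while an 800x600 screenshot is left untouched.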
2. Image generation and editing
epithre-iris is our image generation model. Three operations:
Text-to-image (generation)
import base64

resp = client.images.generate(
    model="epithre-iris",
    # Prompt mixes Indonesian and English: "roadside coffee stall in Jakarta at night, ..."
    prompt="warung kopi pinggir jalan Jakarta malam, cinematic, golden hour, wide shot",
    size="768x768",
    num_steps=20,  # 4-step default for preview; 20-30 for final
    seed=42,       # reproducibility
)
with open("warkop.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))
Parameters:
- size: "WxH" up to 960x960. Default 768x768. Rounded down to multiple of 16.
- num_steps: 1-50. Quality plateaus around 20-25 for most prompts.
- seed: -1 for random; integer for reproducibility.
- guidance_scale: 1.0 default; higher = more literal prompt adherence.
- lora: "none" (default), "dark" (moody/cinematic), "anime".
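If you want to see the effective size before sending a request, the documented limits can be mirrored client-side. This is only a sketch: normalize_size and check_num_steps are hypothetical helpers, and rejecting oversize requests (rather than clamping them) is an assumption, since the guide does not say what the server does above 960x960.

```python
MAX_SIDE = 960   # documented upper bound per side
MULTIPLE = 16    # documented rounding step


def normalize_size(size: str) -> str:
    """Round a "WxH" string down to multiples of 16, rejecting out-of-range sides."""
    w_str, h_str = size.lower().split("x")
    w, h = int(w_str), int(h_str)
    if not (MULTIPLE <= w <= MAX_SIDE and MULTIPLE <= h <= MAX_SIDE):
        raise ValueError(f"each side must be between {MULTIPLE} and {MAX_SIDE}: {size!r}")
    return f"{w // MULTIPLE * MULTIPLE}x{h // MULTIPLE * MULTIPLE}"


def check_num_steps(num_steps: int) -> int:
    """Validate the documented 1-50 range for num_steps."""
    if not 1 <= num_steps <= 50:
        raise ValueError(f"num_steps must be in 1..50, got {num_steps}")
    return num_steps
```

With these rules, a request for "770x500" would effectively be rendered at "768x496".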
Single-image edit
resp = client.images.edit(
    model="epithre-iris",
    prompt="change sky to dramatic stormy clouds with lightning",
    image=open("original.png", "rb"),
    size="512x512",
    strength=0.7,  # 0-1; higher = bigger change from source
)
Multi-reference composition
Pass up to 5 reference images; the model composes them.
import base64
import httpx

# EPITHRE_KEY is your API key, defined elsewhere.
refs = [base64.b64encode(open(f"ref_{i}.png", "rb").read()).decode()
        for i in range(3)]
r = httpx.post(
    "https://api.epithre.com/v1/images/edits",
    headers={"Authorization": f"Bearer {EPITHRE_KEY}"},
    json={
        "model": "epithre-iris",
        "prompt": "the product from image 1, displayed in the studio setting from image 2, in the photography style of image 3",
        "images": refs,
        "size": "640x640",
    },
).json()
Common pattern: product-in-context shots from real product photo + lifestyle scene + style reference.
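To keep the 5-image limit from surfacing as a server-side error, the request body can be assembled by a small helper. composition_payload is a hypothetical name; the field names are the ones used in the request above. The sketch assumes that "image 1", "image 2", and so on in the prompt follow the order of the images array.

```python
import base64

MAX_REFERENCES = 5  # documented limit on reference images


def composition_payload(prompt: str, references: list[bytes], size: str = "640x640") -> dict:
    """Build the JSON body for a multi-reference composition request."""
    if not 1 <= len(references) <= MAX_REFERENCES:
        raise ValueError(f"expected 1 to {MAX_REFERENCES} reference images, got {len(references)}")
    return {
        "model": "epithre-iris",
        "prompt": prompt,
        # Assumption: array order is what "image 1", "image 2", ... refer to.
        "images": [base64.b64encode(ref).decode("ascii") for ref in references],
        "size": size,
    }
```

The result can be passed directly as the json= argument of the httpx.post call shown earlier.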
Iris quirks
- Prompts work best in English but Indonesian is also supported.
- Best at: photorealistic, cinematic, illustration with lora=dark or lora=anime.
- Limited at: text rendering inside images (don't ask for words on signs), exact-faithful logos, precise color reproduction.
3. Cross-modal embeddings
epithre-embed is the key differentiator. Text and images embed into the same 4000-dim vector space, so cosine similarity between a text vector and an image vector is meaningful.
import base64
import numpy as np

# Embed text and image in one call
with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
resp = client.embeddings.create(
    model="epithre-embed",
    input=[
        "kucing oren tidur di kasur",  # Indonesian: "orange cat sleeping on a bed"
        {"type": "image", "image": img_b64},
    ],
)
text_vec = resp.data[0].embedding  # 4000-dim
img_vec = resp.data[1].embedding   # 4000-dim

sim = np.dot(text_vec, img_vec)  # cosine sim, both L2-normalized
print(f"Text-image similarity: {sim:.3f}")
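The plain dot product equals cosine similarity only because both vectors are L2-normalized. If vectors have been averaged, truncated, or otherwise altered after retrieval, that shortcut no longer holds. A dependency-free sketch of the general formula, with hypothetical helper names l2_normalize and cosine_similarity:

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity that does not assume unit-length inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For unit-length inputs the denominator is 1, which is why the single np.dot call is enough for vectors returned directly by epithre-embed.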
Use cases
Pattern A: text query -> image search
Build an index of product photos, search by description.
import base64
import glob
import numpy as np

# Build index once
catalog_vecs = []
for path in glob.glob("products/*.jpg"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    r = client.embeddings.create(model="epithre-embed",
                                 input=[{"type": "image", "image": b64}])
    catalog_vecs.append((path, np.array(r.data[0].embedding)))

# Search by text
def search(query, top_k=5):
    qv = np.array(client.embeddings.create(
        model="epithre-embed", input=[query]).data[0].embedding)
    scored = [(p, float(qv @ v)) for p, v in catalog_vecs]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
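The search function above scores every catalog entry and sorts the full list. For a larger in-memory catalog, keeping only the best k while scanning avoids the full sort. The sketch below uses the standard library only; top_matches is a hypothetical name, and it assumes unit-length vectors so that the dot product is the cosine similarity.

```python
import heapq


def top_matches(query_vec, index, k=5):
    """Return the k (key, score) pairs with the highest dot-product score.

    index is an iterable of (key, vector) pairs; vectors are assumed L2-normalized.
    """
    scored = (
        (key, sum(q * v for q, v in zip(query_vec, vec)))
        for key, vec in index
    )
    # nlargest keeps at most k candidates in memory and returns them best-first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Beyond what fits comfortably in memory, an approximate index such as the pgvector setup in Pattern C is the more usual choice.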
Pattern B: image query -> text search
User uploads a product photo, find matching text descriptions.
# Pre-indexed text vectors
descriptions = [...]
desc_vecs = [client.embeddings.create(model="epithre-embed", input=[d]).data[0].embedding
             for d in descriptions]

def search_by_image(img_path, top_k=3):
    with open(img_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    qv = np.array(client.embeddings.create(
        model="epithre-embed",
        input=[{"type": "image", "image": b64}]).data[0].embedding)
    scored = [(d, float(qv @ np.array(v))) for d, v in zip(descriptions, desc_vecs)]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
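The indexing step above issues one request per description. Since the embeddings endpoint accepts a list of inputs, descriptions can be sent in chunks instead. The sketch below only does the chunking. batched is a hypothetical helper, the batch size of 32 in the usage comment is an arbitrary assumption (the guide states no per-request input limit), and the usage relies on resp.data coming back in input order, as in the earlier text-plus-image example.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage sketch (not executed here; `client` and `descriptions` as defined earlier):
#
# desc_vecs = []
# for chunk in batched(descriptions, 32):
#     resp = client.embeddings.create(model="epithre-embed", input=list(chunk))
#     desc_vecs.extend(item.embedding for item in resp.data)
```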
Pattern C: unified hybrid index
Store text passages and images in a single pgvector table. A query from either modality retrieves both.
CREATE TABLE assets (
    id BIGSERIAL PRIMARY KEY,
    kind TEXT NOT NULL,       -- 'text' or 'image'
    payload TEXT,             -- text content or image path
    embedding halfvec(4000)   -- L2-normalized
);
CREATE INDEX ON assets USING hnsw (embedding halfvec_cosine_ops);

-- Search across both modalities at once:
SELECT kind, payload, 1 - (embedding <=> $1::halfvec) AS sim
FROM assets ORDER BY embedding <=> $1::halfvec LIMIT 20;
This works because all vectors live in the same space. No special routing or index segregation needed.
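On the application side, the $1 parameter in the query can be bound as pgvector's text form, a bracketed comma-separated list such as [0.1,0.2,0.3]. A minimal sketch of that conversion follows. to_pgvector_literal and distance_to_similarity are hypothetical helpers, and which database driver you bind the string with is left open.

```python
EMBED_DIM = 4000  # dimension of epithre-embed vectors, matching halfvec(4000)


def to_pgvector_literal(vec, dim: int = EMBED_DIM) -> str:
    """Format a vector as pgvector text input, e.g. "[0.5,1.0,-0.25]"."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def distance_to_similarity(cosine_distance: float) -> float:
    """The <=> operator returns cosine distance; similarity is 1 minus that."""
    return 1.0 - cosine_distance
```

Checking the dimension before the query runs gives a clearer error than a failed cast inside the database.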
Combining the three
A realistic workflow that uses all three:
# Placeholders defined elsewhere: `uploaded` (file-like upload), `vector_search`
# (your index lookup), `extracted_description` (text taken from the diagnosis).

# 1. User uploads a damaged-product photo to support
img_b64 = base64.b64encode(uploaded.read()).decode()

# 2. Search past support tickets / KB for similar issues (cross-modal embed)
qv = client.embeddings.create(model="epithre-embed",
                              input=[{"type": "image", "image": img_b64}]).data[0].embedding
matches = vector_search(qv, top_k=5)

# 3. Have the model look at the photo + retrieved context
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"User reported this issue. Similar past cases: {matches}. Diagnose and suggest next steps."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
)

# 4. If the resolution requires a replacement product render, generate it
preview = client.images.generate(
    model="epithre-iris",
    prompt=f"product replacement preview: {extracted_description}",
    size="768x768",
)
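Step 3 interpolates the raw matches object into the prompt, which produces whatever its Python repr happens to be. Rendering the matches as a short numbered list is usually easier for the model to read. The sketch below assumes each match is a (text, score) pair. That shape is hypothetical, since vector_search is not defined in this guide, and format_matches is likewise an illustrative name.

```python
def format_matches(matches, max_chars: int = 300) -> str:
    """Render (text, score) pairs as a numbered list for use in a prompt."""
    lines = []
    for rank, (text, score) in enumerate(matches, start=1):
        snippet = " ".join(text.split())  # collapse newlines and repeated spaces
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"{rank}. (similarity {score:.2f}) {snippet}")
    return "\n".join(lines)
```

The result would replace {matches} in the step 3 prompt, for example "Similar past cases:" followed by a newline and the formatted list.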
Related
- Cookbook: vision QA on documents - invoice/KTP/receipt extraction patterns.
- Cookbook: cross-modal RAG - end-to-end text<->image retrieval.
- Cookbook: image generation app - Iris production patterns.
- Embeddings reference - full body params.
- Image reference - generation + edit params.