`epithre-embed`

Multimodal embedding model. Indonesian-optimized, with text and image vectors in a shared 4000-dim space.

Capabilities

| Capability | Notes |
| --- | --- |
| Tier | Embedding |
| Max input tokens | 4,096 (per text item; ~10K chars effective) |
| Output dim | 4,000 (native), Matryoshka-truncatable to 1-4,000 |
| Modalities | Text, image |
| Cross-modal | Yes; text and image vectors share the same space |
| Instruction-aware | Yes; `instruction` field for task-specific prompting |
| Auto-truncate | Yes; 10K-char cap per text item, with `END` / `START` / `NONE` modes |

Text and image embeddings live in the same 4000-dim space. Cosine similarity between any text vector and any image vector is meaningful, so you can:

Search an image catalog by text description.
Find text passages matching an uploaded photo.
Maintain a unified pgvector index for mixed-modality corpora (one table, one index, one similarity function).
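
As a rough illustration of the use cases above, the sketch below ranks a mixed text + image index against a single query vector. It makes no API calls: the toy 3-dim vectors stand in for real 4000-dim embeddings, and the names `Item`, `cosine`, and `search` are hypothetical helpers, not part of `epithre-embed`.

```python
import math
from typing import NamedTuple


class Item(NamedTuple):
    item_id: str
    modality: str        # "text" or "image"
    vector: list[float]  # embedding from the shared space


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: list[float], index: list[Item], top_k: int = 3):
    """Rank every item, regardless of modality, by similarity to the query."""
    scored = [(it.item_id, it.modality, cosine(query, it.vector)) for it in index]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_k]


# Toy 3-dim vectors standing in for real embeddings (assumed values).
index = [
    Item("caption-1", "text", [1.0, 0.0, 0.0]),
    Item("photo-7", "image", [0.8, 0.6, 0.0]),
    Item("photo-9", "image", [0.0, 0.0, 1.0]),
]
query_vec = [0.7, 0.7, 0.0]  # pretend this came from embedding a text query
results = search(query_vec, index, top_k=2)
for item_id, modality, score in results:
    print(f"{item_id} ({modality}): {score:.3f}")
```

Because one similarity function covers both modalities, the same ranking logic would apply whether the index lives in memory, as here, or in a single vector-store table.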

Vectors are L2-normalized, so cosine similarity equals the dot product.
4000 native dims, Matryoshka-truncatable: `dimensions=1024` returns the first 1024 dims of the native vector, re-normalized (the prefix values are otherwise unchanged).
`halfvec`-compatible for pgvector storage.
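
The prefix-and-renormalize behavior described above can be sketched client-side in a few lines. This is a minimal illustration, assuming that truncating a stored full-size vector yourself gives the same result as requesting a smaller `dimensions` value; the helper names are hypothetical, and the 4-dim toy vector stands in for a 4000-dim embedding.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def truncate_embedding(vec: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components, then re-normalize."""
    if not 1 <= dimensions <= len(vec):
        raise ValueError("dimensions must be between 1 and len(vec)")
    return l2_normalize(vec[:dimensions])


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dim unit vector (3, 4, 12, 84 has norm 85).
full = l2_normalize([3.0, 4.0, 12.0, 84.0])
short = truncate_embedding(full, 2)

# After re-normalization the truncated vector is unit length again,
# so a plain dot product still behaves as cosine similarity.
print(short, dot(short, short))
```

Vectors truncated to different lengths are not comparable with each other, so a given index should use one `dimensions` value throughout.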

Max 64 text items per request
Max 8 image items per request
Max 25 MB request body
Each text item is auto-truncated to 10K chars (a cap chosen to fit the 4,096-token context window)
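
A client can split its workload to respect the per-request item limits above before sending anything. The sketch below is one way to do that, under stated assumptions: text and image limits are treated as independent, "chars" is taken to mean Python string length, and the 25 MB body cap is not checked because the request encoding is not described here. `chunk` and `plan_batches` are hypothetical helper names.

```python
from typing import Iterator, Sequence

MAX_TEXT_ITEMS = 64      # per request
MAX_IMAGE_ITEMS = 8      # per request
MAX_TEXT_CHARS = 10_000  # per text item, before auto-truncation applies


def chunk(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def plan_batches(texts: Sequence[str], images: Sequence[bytes]) -> dict:
    """Group inputs into request-sized batches and flag over-long texts."""
    return {
        "text_batches": list(chunk(texts, MAX_TEXT_ITEMS)),
        "image_batches": list(chunk(images, MAX_IMAGE_ITEMS)),
        # Indices of texts longer than the cap (assumed to be truncated).
        "will_truncate": [i for i, t in enumerate(texts) if len(t) > MAX_TEXT_CHARS],
    }


plan = plan_batches(["short"] * 129 + ["x" * 10_001], [b"img"] * 9)
print([len(b) for b in plan["text_batches"]])
print([len(b) for b in plan["image_batches"]])
print(plan["will_truncate"])
```

Flagging over-long texts up front lets the caller decide whether to split them into passages instead of relying on truncation.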

The `instruction` field is text-only. Requests with mixed text + image input reject `instruction`.
Image embeddings are MRL-truncated to 4000 dims and L2-renormalized in the gateway so that they share the same space as text vectors. The result is effectively identical to native 4000-dim output.
Truncation is character-based, not token-based. For dense legal or code text (roughly 3 chars/token), the 10K-char limit corresponds to ~3,000-3,500 tokens.
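
To make the character-based truncation concrete, here is a local sketch of the three modes. The semantics are an assumption, since they are not spelled out here: `END` is taken to drop characters from the end, `START` to drop them from the start, and `NONE` to reject over-long input. `truncate_chars` and `estimated_tokens` are hypothetical helpers, and the token estimate is plain division rather than a real tokenizer.

```python
MAX_TEXT_CHARS = 10_000


def truncate_chars(text: str, mode: str = "END", limit: int = MAX_TEXT_CHARS) -> str:
    """Apply a character cap locally (assumed semantics, see lead-in)."""
    if len(text) <= limit:
        return text
    if mode == "END":
        return text[:limit]              # keep the head, drop the tail
    if mode == "START":
        return text[len(text) - limit:]  # keep the tail, drop the head
    if mode == "NONE":
        raise ValueError(f"text has {len(text)} chars, limit is {limit}")
    raise ValueError(f"unknown truncation mode: {mode!r}")


def estimated_tokens(num_chars: int, chars_per_token: float) -> float:
    """Rough token count for a given character density."""
    return num_chars / chars_per_token


print(truncate_chars("abcdef", "END", limit=4))
print(truncate_chars("abcdef", "START", limit=4))
print(round(estimated_tokens(MAX_TEXT_CHARS, 3.0)))
```

Running the estimate with your own corpus's chars-per-token ratio shows how much of the context window a 10K-char item is likely to occupy.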