epithre-embed
Multimodal embedding model. Indonesian-optimized, with text and image vectors in a shared 4000-dim space.
Capabilities
| Capability | Notes |
|---|---|
| Tier | Embedding |
| Max input tokens | 4,096 (per text item; ~10K chars effective) |
| Output dim | 4,000 (native), Matryoshka-truncatable to 1-4000 |
| Modalities | Text, image |
| Cross-modal | Yes; text and image vectors share the same space |
| Instruction-aware | Yes; instruction field for task-specific prompting |
| Auto-truncate | Yes; 10K-char cap per text with END / START / NONE modes |
When to use
- RAG retrieval (text + image)
- Semantic search
- Clustering / deduplication
- Classification via centroid distance
- Cross-modal search (find images by text query, or vice versa)
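As an illustration of the centroid-distance use case above, here is a minimal sketch in plain Python. The toy 3-dim vectors, the labels, and the helper names (`centroid`, `classify`) are hypothetical stand-ins; real inputs would be L2-normalized vectors returned by the model.

```python
import math

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def centroid(vectors):
    """Mean of the vectors, re-normalized so dot products stay comparable."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: dot(vector, centroids[label]))

# Toy 3-dim vectors stand in for real embeddings (assumption for illustration).
labeled = {
    "invoice": [normalize([1.0, 0.1, 0.0]), normalize([0.9, 0.2, 0.1])],
    "receipt": [normalize([0.0, 1.0, 0.2]), normalize([0.1, 0.9, 0.0])],
}
centroids = {label: centroid(vecs) for label, vecs in labeled.items()}
predicted = classify(normalize([0.8, 0.3, 0.0]), centroids)
print(predicted)
```

Re-normalizing each centroid is a design choice here: it keeps the comparison a pure cosine similarity rather than favoring classes with tightly clustered examples.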
Key feature: cross-modal in one space
Text and image embeddings live in the same 4000-dim space. Cosine similarity between any text vector and any image vector is meaningful, so you can:
- Search an image catalog by text description.
- Find text passages matching an uploaded photo.
- Maintain a unified pgvector index for mixed-modality corpora (one table, one index, one similarity function).
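A minimal sketch of text-to-image search over a shared space, assuming the vectors are already L2-normalized so the dot product equals cosine similarity. The catalog IDs, the toy 3-dim vectors, and the `top_k` helper are hypothetical; in production the ranking would typically be done by the vector index rather than in application code.

```python
def top_k(query_vec, catalog, k=3):
    """Rank catalog items by dot product with the query vector."""
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), item_id)
        for item_id, vec in catalog.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:k]]

# Toy unit vectors standing in for image embeddings (assumption).
image_catalog = {
    "img_batik": [1.0, 0.0, 0.0],
    "img_beach": [0.0, 1.0, 0.0],
    "img_market": [0.6, 0.8, 0.0],
}

# Toy unit vector standing in for the embedding of a text query (assumption).
text_query_vec = [0.8, 0.6, 0.0]

results = top_k(text_query_vec, image_catalog, k=2)
for item_id, score in results:
    print(item_id, round(score, 2))
```

The same function works in the other direction (image query against text passages) because both modalities share one space.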
Output details
- Vectors are L2-normalized (cosine sim = dot product)
- 4000 native dims, Matryoshka-truncatable: `dimensions=1024` returns the first 1024 dims, re-normalized (lossless prefix)
- halfvec-compatible for pgvector storage
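The truncation described above can also be reproduced client-side on a stored full-length vector. This is a sketch of the general Matryoshka prefix-and-renormalize step, not the service's actual implementation; `truncate_and_renormalize` is a hypothetical helper name and the 4-dim vector is a toy stand-in.

```python
import math

def truncate_and_renormalize(vector, dimensions):
    """Keep the first `dimensions` values and rescale to unit L2 length."""
    if not 1 <= dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    prefix = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm and cannot be normalized")
    return [x / norm for x in prefix]

# Toy unit-length vector standing in for a native embedding (assumption).
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_renormalize(full, 2)
print(short)
```

After re-normalization the dot product of two truncated vectors is again their cosine similarity, so the same similarity function applies at every truncation length.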
Pricing
- Text: Rp1,500 / 1M input tokens
- Image: Rp25 / image (flat)
- Batch: 0.5x both rates
- Storage: free (1 GB quota for uploaded knowledge files)
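A rough cost estimate can be derived from the rates listed above. The sketch below assumes the batch multiplier applies uniformly to the combined text and image charge; `estimate_cost_rp` is a hypothetical helper, not part of any SDK.

```python
TEXT_RP_PER_MILLION_TOKENS = 1500  # from the pricing list
IMAGE_RP_EACH = 25                 # flat per-image rate
BATCH_MULTIPLIER = 0.5             # batch rate applies to both

def estimate_cost_rp(text_tokens, images, batch=False):
    """Estimate the charge in rupiah for a given volume of input."""
    text_cost = text_tokens / 1_000_000 * TEXT_RP_PER_MILLION_TOKENS
    image_cost = images * IMAGE_RP_EACH
    total = text_cost + image_cost
    return total * BATCH_MULTIPLIER if batch else total

# Example workload: 2M input tokens plus 100 images.
realtime_cost = estimate_cost_rp(2_000_000, 100)
batch_cost = estimate_cost_rp(2_000_000, 100, batch=True)
print(realtime_cost, batch_cost)
```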
Performance
- Throughput: ~25-40 req/s sustained on a single backend GPU.
- Latency: ~0.1-0.5s for a batch of 64 short texts; ~0.5-1s for image input.
- Quality on Indonesian retrieval: ~92% recall@10 on our 25-question gold set.
Limits
- Max 64 text items per request
- Max 8 image items per request
- Max 25 MB request body
- Each text item auto-truncated to 10K chars (matched against the 4096-token context window)
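To stay within the per-request item limits above, a client can split its inputs into chunks before sending. This sketch only handles the item counts; it does not check the 25 MB body cap, and the `chunked` helper and sample inputs are hypothetical.

```python
MAX_TEXTS_PER_REQUEST = 64   # from the limits list
MAX_IMAGES_PER_REQUEST = 8   # from the limits list

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Placeholder inputs (assumption): 150 texts and 20 image paths.
texts = [f"doc {i}" for i in range(150)]
images = [f"img_{i}.jpg" for i in range(20)]

text_batches = chunked(texts, MAX_TEXTS_PER_REQUEST)
image_batches = chunked(images, MAX_IMAGES_PER_REQUEST)
print([len(b) for b in text_batches], [len(b) for b in image_batches])
```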
Caveats
- `instruction` field is text-only. Mixed text + image input rejects `instruction`.
- Image embeddings are MRL-truncated to 4000 dims and L2-renormalized in the gateway to share the same space as text vectors. Effectively identical to native 4000-dim output.
- Truncation is char-based, not token-based. For dense legal/code text (2-3 chars/token), the 10K-char limit corresponds to ~3000-3500 tokens.
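To preview what char-based truncation does to a long input, the sketch below mimics the three modes client-side. The mode semantics are an assumption, since the page does not define them: here `END` drops the tail, `START` drops the head, and `NONE` rejects over-length input. The `truncate_text` and `approx_tokens` helpers are hypothetical, and the chars-per-token ratio is a rough estimate rather than a tokenizer result.

```python
MAX_CHARS = 10_000  # per-text cap from the limits list

def truncate_text(text, mode="END", max_chars=MAX_CHARS):
    """Apply a char-based cap. Mode semantics are assumed, not documented."""
    if len(text) <= max_chars:
        return text
    if mode == "END":
        return text[:max_chars]    # assumed: discard the end of the text
    if mode == "START":
        return text[-max_chars:]   # assumed: discard the start of the text
    if mode == "NONE":
        raise ValueError("input exceeds the character cap")
    raise ValueError(f"unknown truncation mode: {mode}")

def approx_tokens(num_chars, chars_per_token):
    """Rough token estimate from a character count."""
    return round(num_chars / chars_per_token)

long_text = "a" * 6000 + "b" * 6000  # 12,000 chars, over the cap
kept_head = truncate_text(long_text, "END")
kept_tail = truncate_text(long_text, "START")
print(len(kept_head), len(kept_tail), approx_tokens(MAX_CHARS, 3.0))
```

Counting tokens with the model's real tokenizer would be more accurate; the ratio-based estimate is only useful for quick sizing.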