RAG: embed + rerank + chat
The canonical retrieval-augmented generation pattern:
- Index your corpus once with epithre-embed.
- At query time, embed the question, find top-K nearest by cosine.
- Rerank to narrow to the most-relevant 3-5.
- Pass the reranked context to epithre-omni (or epithre-lyt) to generate a grounded answer.
In our benchmarks, this pattern reaches about 95% or higher retrieval quality on Indonesian content, well above embed-only retrieval.
Full pipeline
import os

import httpx
import numpy as np
from openai import OpenAI

EK = os.environ["EPITHRE_KEY"]
client = OpenAI(api_key=EK, base_url="https://api.epithre.com/v1")
# ============================================================
# 1) ONE-TIME: embed your corpus
# ============================================================
corpus = [
    "UU 41/1999 pasal 50 - barangsiapa merusak hutan akan dipidana penjara paling lama 10 tahun.",
    "Permen LHK No. 92/2018 tentang pengelolaan satwa liar dilindungi.",
    "PP No. 23/2021 tentang penyelenggaraan kehutanan dan rehabilitasi lahan.",
    # ... thousands of docs in production
]
e = client.embeddings.create(
    model="epithre-embed",
    input=corpus,
    extra_body={"instruction": "Represent this document for retrieval:"},
)
corpus_vecs = np.array([row.embedding for row in e.data])  # shape (N, 4000)
# Store corpus_vecs + corpus in your vector DB (pgvector, Qdrant, etc.)
# ============================================================
# 2) AT QUERY TIME: embed the question, find top-K
# ============================================================
# In English: "What is the penalty for destroying protected forest?"
question = "Apa hukuman untuk perusakan hutan lindung?"
qe = client.embeddings.create(
    model="epithre-embed",
    input=[question],
    extra_body={"instruction": "Represent this query for retrieving relevant documents:"},
)
qv = np.array(qe.data[0].embedding)
# Cosine sim = dot product (vectors are L2-normalized)
scores = corpus_vecs @ qv
top_k_idx = np.argsort(-scores)[:10]
candidates = [corpus[i] for i in top_k_idx]
# ============================================================
# 3) Rerank to narrow to top 3
# ============================================================
r = httpx.post(
    "https://api.epithre.com/v1/rerank",
    headers={"Authorization": f"Bearer {EK}"},
    json={
        "model": "epithre-rerank",
        "query": question,
        "documents": candidates,
        "top_n": 3,
        "return_documents": True,
    },
).json()
context_blocks = [item["document"]["text"] for item in r["results"]]
context = "\n\n".join(f"[{i+1}] {t}" for i, t in enumerate(context_blocks))
# ============================================================
# 4) Generate answer
# ============================================================
# System prompt, in English: "You are an Indonesian legal assistant. Answer based
# on the given context. Always cite the relevant article/regulation number. If
# the context is insufficient, say so plainly."
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[
        {"role": "system", "content": [
            {"type": "text",
             "text": ("Kamu asisten hukum Indonesia. Jawab berdasarkan konteks "
                      "yang diberikan. Selalu sebutkan nomor pasal/peraturan yang "
                      "relevan. Kalau konteks tidak cukup, bilang terus terang."),
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": f"Konteks:\n{context}\n\nPertanyaan: {question}"},
    ],
)
print(resp.choices[0].message.content)
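The top-K step in the pipeline treats a dot product as cosine similarity, which holds only while every vector is L2-normalized. If you ever mix in vectors from another source, or simply want to guard against that assumption, an explicit cosine top-K could be sketched as below. This is a minimal illustration using only the standard library so it runs on its own; `cosine_top_k` is a hypothetical helper name, not part of any SDK.

```python
import math


def cosine_top_k(query_vec, doc_vecs, k=10):
    """Return [(index, cosine_similarity)] for the k most similar documents.

    Normalizes explicitly, so it does not assume L2-normalized inputs.
    """
    q_norm = math.sqrt(sum(x * x for x in query_vec))
    if q_norm == 0.0:
        raise ValueError("query vector has zero norm")

    scored = []
    for idx, vec in enumerate(doc_vecs):
        d_norm = math.sqrt(sum(x * x for x in vec))
        if d_norm == 0.0:
            continue  # a zero vector has no direction; skip it
        dot = sum(a * b for a, b in zip(query_vec, vec))
        scored.append((idx, dot / (q_norm * d_norm)))

    # Highest similarity first; ties keep corpus order because sort is stable
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors with different lengths, to show that scale does not matter
docs = [[3.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
top = cosine_top_k([2.0, 0.0], docs, k=2)
print([i for i, _ in top])
```

With real embeddings you would keep the NumPy form from the pipeline and only add a normalization step; the pure-Python loop here is for readability, not speed.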
Pgvector storage pattern
For production scale, use Postgres + pgvector to store embeddings.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding halfvec(4000) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding halfvec_cosine_ops);
Insert:
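import psycopg2

# Assumes a Postgres DSN in the PG_DSN environment variable (hypothetical name).
conn = psycopg2.connect(os.environ["PG_DSN"])
cur = conn.cursor()

for text, vec in zip(corpus, corpus_vecs):
    vec_str = "[" + ",".join(f"{v:.7f}" for v in vec) + "]"
    cur.execute(
        "INSERT INTO documents (text, embedding) VALUES (%s, %s::halfvec(4000))",
        (text, vec_str),
    )
conn.commit()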
Query (cosine search):
qvec_str = "[" + ",".join(f"{v:.7f}" for v in qv) + "]"
cur.execute("""
SELECT id, text, 1 - (embedding <=> %s::halfvec(4000)) AS score
FROM documents
ORDER BY embedding <=> %s::halfvec(4000)
LIMIT 10
""", (qvec_str, qvec_str))
candidates = [row[1] for row in cur.fetchall()]
Then rerank + chat as above.
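Both the insert and the query snippets build the pgvector text literal by hand. One way to keep that in a single place, and to catch a wrong-sized or non-finite vector before it reaches Postgres, is a small helper along these lines. It is a sketch only: `to_halfvec_literal` is a hypothetical name, and the default of 4000 dimensions simply mirrors the `halfvec(4000)` column above.

```python
import math


def to_halfvec_literal(vec, dim=4000):
    """Format a sequence of floats as a pgvector text literal such as "[0.1,0.2]"."""
    values = [float(v) for v in vec]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinity")
    # Same 7-decimal formatting as the insert and query snippets above
    return "[" + ",".join(f"{v:.7f}" for v in values) + "]"


print(to_halfvec_literal([0.5, -1.0], dim=2))  # prints [0.5000000,-1.0000000]
```

The returned string can be passed as the `%s::halfvec(4000)` parameter in either statement.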
Using /v1/retrieval instead
If your corpus consists of document files (PDF/TXT/MD), upload them as knowledge files and skip the manual embed+store step. See the retrieval reference.
# Upload once (the with-block closes the file handle after the upload)
with open("regulasi.pdf", "rb") as f:
    client.files.create(file=f, purpose="knowledge")

# Query later (handles embed + cosine search server-side)
hits = httpx.post(
    ".../v1/retrieval",
    headers={"Authorization": f"Bearer {EK}"},
    json={"query": question, "top_k": 10},
).json()["results"]

# Then rerank (top 10 -> top 3) and chat as before
When to rerank vs not
- Skip reranking if you have fewer than 20 candidates and they are already specific to the query. Embedding alone gives about 85% precision.
- Always rerank for production-quality factual questions, for legal, medical, or financial domains, and whenever first-page-Google-level quality matters.
Rerank cost: Rp5 per document, so reranking 10 documents costs Rp50. That is negligible next to embed and chat costs.
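To see how the per-document price scales with traffic, the arithmetic can be wrapped in a one-line function. This is just a sketch: the Rp5 figure is the one quoted above, while the query volume in the second call is a made-up number for illustration.

```python
RERANK_PRICE_RP_PER_DOC = 5  # price per reranked document, as quoted above


def rerank_cost_rp(candidates_per_query, queries=1):
    """Total rerank cost in rupiah: price x candidates per query x number of queries."""
    if candidates_per_query < 0 or queries < 0:
        raise ValueError("counts must be non-negative")
    return RERANK_PRICE_RP_PER_DOC * candidates_per_query * queries


print(rerank_cost_rp(10))           # one query, 10 candidates -> 50
print(rerank_cost_rp(10, 100_000))  # hypothetical 100,000 queries -> 5000000
```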
Indonesian-specific tips
- Use instruction on embed with explicit task descriptions (see code above). On dense legal/finance corpora this can improve recall@10 by 5-10%.
- Don't expect high absolute scores from rerank on Indonesian. A true match might score 0.10-0.30, not 0.7-0.9. Trust the rank order, not the absolute number.
- Chunk documents at paragraph boundaries, not arbitrary char counts. Indonesian legal text especially has structural cohesion at the pasal/ayat level.
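The chunking tip can be sketched as follows. The function starts a new chunk at every line that consists solely of a "Pasal N" heading, so each pasal stays together with its ayat, and it falls back to blank-line paragraph splitting when no such heading is found. The heading pattern and the name `chunk_legal_text` are assumptions for illustration; real regulations vary in layout, and the sample text is placeholder wording, not statute text.

```python
import re

# Assumed heading format: a line containing only "Pasal 50", "Pasal 50A", etc.
PASAL_HEADING = re.compile(r"Pasal\s+\d+[A-Z]?", re.IGNORECASE)


def chunk_legal_text(text):
    """Split text per pasal when headings exist, otherwise per paragraph."""
    lines = [line.strip() for line in text.splitlines()]
    has_pasal = any(PASAL_HEADING.fullmatch(line) for line in lines)

    chunks, current = [], []
    for line in lines:
        if has_pasal:
            starts_new = PASAL_HEADING.fullmatch(line) is not None
        else:
            starts_new = line == ""  # a blank line separates paragraphs
        if starts_new and current:
            chunks.append("\n".join(current))
            current = []
        if line:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = """Pasal 50
(1) Teks ayat pertama.
(2) Teks ayat kedua.

Pasal 51
Teks pasal berikutnya."""

for chunk in chunk_legal_text(sample):
    print(repr(chunk))
```

Each resulting chunk can then be embedded as one corpus entry. Very long pasal may still need a secondary split at the ayat level.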
See also
- Cross-modal RAG - same pattern with text + image vectors.
- Legal doc analysis - end-to-end with full Indonesian regulations.
- Embed reference, Rerank reference, Retrieval reference.