RAG: embed + rerank + chat

The canonical retrieval-augmented generation pattern:

Index your corpus once with epithre-embed.
At query time, embed the question and find the top-K nearest documents by cosine similarity.
Rerank the candidates to narrow them to the 3-5 most relevant.
Pass the reranked context to epithre-omni (or epithre-lyt) to generate a grounded answer.

In our benchmarks this pattern reaches 95% or higher retrieval quality on Indonesian content, well above embed-only retrieval.

Full pipeline

import os

import httpx
import numpy as np
from openai import OpenAI

EK = os.environ["EPITHRE_KEY"]
client = OpenAI(api_key=EK, base_url="https://api.epithre.com/v1")

# ============================================================
# 1) ONE-TIME: embed your corpus
# ============================================================
corpus = [
    "UU 41/1999 pasal 50 - barangsiapa merusak hutan akan dipidana penjara paling lama 10 tahun.",
    "Permen LHK No. 92/2018 tentang pengelolaan satwa liar dilindungi.",
    "PP No. 23/2021 tentang penyelenggaraan kehutanan dan rehabilitasi lahan.",
    # ... thousands of docs in production
]

e = client.embeddings.create(
    model="epithre-embed",
    input=corpus,
    extra_body={"instruction": "Represent this document for retrieval:"},
)
corpus_vecs = np.array([row.embedding for row in e.data])   # (N, 4000)
# Store corpus_vecs + corpus in your vector DB (pgvector, Qdrant, etc.)

# ============================================================
# 2) AT QUERY TIME: embed the question, find top-K
# ============================================================
question = "Apa hukuman untuk perusakan hutan lindung?"

qe = client.embeddings.create(
    model="epithre-embed",
    input=[question],
    extra_body={"instruction": "Represent this query for retrieving relevant documents:"},
)
qv = np.array(qe.data[0].embedding)

# Cosine sim = dot product (vectors are L2-normalized)
scores = corpus_vecs @ qv
top_k_idx = np.argsort(-scores)[:10]
candidates = [corpus[i] for i in top_k_idx]

# ============================================================
# 3) Rerank to narrow to top 3
# ============================================================
r = httpx.post(
    "https://api.epithre.com/v1/rerank",
    headers={"Authorization": f"Bearer {EK}"},
    json={
        "model": "epithre-rerank",
        "query": question,
        "documents": candidates,
        "top_n": 3,
        "return_documents": True,
    },
    timeout=30.0,
)
r.raise_for_status()  # surface HTTP errors here instead of a KeyError below
results = r.json()["results"]
context_blocks = [item["document"]["text"] for item in results]
context = "\n\n".join(f"[{i+1}] {t}" for i, t in enumerate(context_blocks))

# ============================================================
# 4) Generate answer
# ============================================================
resp = client.chat.completions.create(
    model="epithre-omni",
    messages=[
        {"role": "system", "content": [
            {"type": "text",
             "text": ("Kamu asisten hukum Indonesia. Jawab berdasarkan konteks "
                      "yang diberikan. Selalu sebutkan nomor pasal/peraturan yang "
                      "relevan. Kalau konteks tidak cukup, bilang terus terang."),
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content": f"Konteks:\n{context}\n\nPertanyaan: {question}"},
    ],
)
print(resp.choices[0].message.content)
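
Step 2 of the pipeline above treats the dot product as cosine similarity, which only holds when every vector is L2-normalized. If some vectors come from another store and you are not sure they are normalized, a defensive version could look like the sketch below. It is pure Python for clarity, and the helper names `cosine` and `top_k` are illustrative, not part of any Epithre SDK.

```python
import math

def cosine(a, b):
    """Cosine similarity that does not assume the inputs are L2-normalized."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=10):
    """Return the indices of the k most similar documents, best first."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy 2-D vectors standing in for real embeddings (deliberately not normalized)
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(top_k([2.0, 1.0], docs, k=2))  # -> [2, 0]
```

For real corpora the NumPy matrix product in the pipeline is much faster; this version is mainly useful as a correctness check on a small sample.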

Pgvector storage pattern

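For production scale, use Postgres + pgvector to store embeddings. The 4000-dimensional vectors are stored as halfvec because pgvector's HNSW index supports at most 2,000 dimensions for the vector type but up to 4,000 for halfvec (available from pgvector 0.7.0).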

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding halfvec(4000) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding halfvec_cosine_ops);

Insert:

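import os
import psycopg2

# Assumes a connection string in DATABASE_URL (hypothetical variable name)
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

for text, vec in zip(corpus, corpus_vecs):
    # pgvector accepts a text literal of the form "[0.1,0.2,...]"
    vec_str = "[" + ",".join(f"{v:.7f}" for v in vec) + "]"
    cur.execute(
        "INSERT INTO documents (text, embedding) VALUES (%s, %s::halfvec(4000))",
        (text, vec_str),
    )
conn.commit()  # psycopg2 opens a transaction implicitly; commit to persist the rows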

Query (cosine search):

qvec_str = "[" + ",".join(f"{v:.7f}" for v in qv) + "]"
cur.execute("""
    SELECT id, text, 1 - (embedding <=> %s::halfvec(4000)) AS score
    FROM documents
    ORDER BY embedding <=> %s::halfvec(4000)
    LIMIT 10
""", (qvec_str, qvec_str))
candidates = [row[1] for row in cur.fetchall()]

Then rerank + chat as above.
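
Both the insert and the query snippets build the same "[v1,v2,...]" text literal by hand. A small helper keeps the two in sync. This is only a sketch: the name `to_vector_literal` and its `precision` parameter are made up for illustration and are not part of psycopg2 or pgvector.

```python
def to_vector_literal(vec, precision=7):
    """Format a sequence of numbers as a pgvector text literal, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(f"{float(v):.{precision}f}" for v in vec) + "]"

print(to_vector_literal([0.5, -0.25, 1]))
# -> [0.5000000,-0.2500000,1.0000000]
```

Since halfvec stores 16-bit floats with only about three significant decimal digits, seven decimal places is more than the column can retain, so a lower `precision` would shrink the SQL payload without changing what is stored.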

Using `/v1/retrieval` instead

If your corpus consists of document files (PDF/TXT/MD), upload them as knowledge files and skip the manual embed-and-store step. See the retrieval reference.

# Upload once
with open("regulasi.pdf", "rb") as f:  # close the file handle after upload
    client.files.create(file=f, purpose="knowledge")

# Query later (handles embed + cosine search server-side)
hits = httpx.post(".../v1/retrieval", json={
    "query": question, "top_k": 10,
}, headers={"Authorization": f"Bearer {EK}"}).json()["results"]

# Then rerank (top 10 -> top 3) and chat as before

When to rerank vs not

Skip reranking if you have fewer than 20 candidates and they are already specific to the query. Embedding alone gives ~85% precision.
Always rerank for production-quality factual questions, for legal, medical, or financial domains, and whenever you need first-page-of-Google quality in the top results.

Rerank cost: Rp5 per document, so 10 docs reranked = Rp50. Negligible vs. embed/chat costs.
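
To see how this scales with traffic, the arithmetic can be wrapped in a tiny helper. This is a sketch that uses only the Rp5 per document price quoted above; the function name and the `queries` parameter are illustrative.

```python
RERANK_PRICE_PER_DOC_RP = 5  # price per reranked document, as stated above

def rerank_cost_rp(num_candidates, queries=1):
    """Total rerank cost in rupiah for `queries` requests of `num_candidates` documents each."""
    return RERANK_PRICE_PER_DOC_RP * num_candidates * queries

print(rerank_cost_rp(10))           # -> 50
print(rerank_cost_rp(10, 100_000))  # -> 5000000
```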

Indonesian-specific tips

Pass an explicit task description via instruction when embedding (see the code above). On dense legal/finance corpora this can improve recall@10 by 5-10%.
Don't expect high absolute rerank scores on Indonesian text. A true match might score 0.10-0.30, not 0.7-0.9. Trust the rank order, not the absolute number.
Chunk documents at paragraph boundaries, not at arbitrary character counts. Indonesian legal text in particular is structurally cohesive at the pasal/ayat (article/subsection) level.
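
The chunking tip can be sketched as follows. It assumes paragraphs are separated by blank lines, which may not match your source files, and the function name `chunk_by_paragraph` and the `max_chars` default are arbitrary choices for illustration. Whole paragraphs are packed into a chunk until the size budget would be exceeded; a paragraph is never split.

```python
import re

def chunk_by_paragraph(text, max_chars=1200):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Toy text: three short "pasal" blocks separated by blank lines
sample = "Pasal 1\nisi pasal satu\n\nPasal 2\nisi pasal dua\n\nPasal 3\nisi pasal tiga"
for chunk in chunk_by_paragraph(sample, max_chars=50):
    print(repr(chunk))
```

With `max_chars=50` the first two blocks fit into one chunk and the third starts a new one. For legal text, splitting on a pasal heading pattern instead of blank lines may follow the document structure more closely.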