Legal document analysis (Bahasa Indonesia)

This example walks through a retrieval-augmented workflow for Indonesian legal text on Epithre. It uses three pieces:

Knowledge upload to ingest statute PDFs into a per-customer index.
Retrieval to find relevant pasal for a query.
Chat with epithre-omni (or epithre-prme for long context) to synthesize a cited answer.

Step 1: ingest your statute corpus

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["EPITHRE_KEY"], base_url="https://api.epithre.com/v1")

# Upload PDFs of regulations
statute_files = [
    "regulations/UU_41_1999_kehutanan.pdf",
    "regulations/UU_32_2009_lingkungan_hidup.pdf",
    "regulations/PP_23_2021_penyelenggaraan_kehutanan.pdf",
    "regulations/Permen_LHK_92_2018_satwa.pdf",
]

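import time

file_ids = []
for path in statute_files:
    # Open in a context manager so the file handle is closed after the upload
    with open(path, "rb") as fh:
        r = client.files.create(file=fh, purpose="knowledge")
    file_ids.append(r.id)
    print(f"Uploaded {path} -> {r.id}")

# Poll until every file is processed, giving up after 10 minutes per file
POLL_TIMEOUT_S = 600
for fid in file_ids:
    deadline = time.monotonic() + POLL_TIMEOUT_S
    while True:
        f = client.files.retrieve(fid)
        if f.status == "processed":
            break
        if f.status == "error":
            raise RuntimeError(f"Processing failed for {fid}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"Timed out waiting for {fid} (status: {f.status})")
        time.sleep(5)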

Step 2: query

import httpx

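# English: "What is the criminal penalty for encroaching on protected forest without a permit?"
question = "Apa hukuman pidana untuk perambahan hutan lindung tanpa izin?"

retrieval_resp = httpx.post(
    "https://api.epithre.com/v1/retrieval",
    headers={"Authorization": f"Bearer {os.environ['EPITHRE_KEY']}"},
    json={
        "query": question,
        "top_k": 10,
        "file_ids": file_ids,
    },
    timeout=30.0,  # httpx defaults to 5 seconds, which can be short for retrieval
)
retrieval_resp.raise_for_status()  # Surface 4xx/5xx errors instead of a KeyError below
retrieved = retrieval_resp.json()["results"]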

context = "\n\n".join(
    f"[{h['file_id']} chunk {h['chunk_index']}] {h['text']}"
    for h in retrieved
)
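The context labels above use opaque file IDs, while the system prompt in Step 3 asks the model for statute name, number, and year. One option is to label each chunk with a readable statute name derived from the filename. The following is a minimal sketch, assuming filenames follow the `TYPE_NUMBER_YEAR_topic.pdf` pattern seen in `statute_files`. `statute_label`, `build_context`, and the `id_to_path` mapping (file ID to upload path, which you would fill during Step 1) are hypothetical names, not part of the Epithre API.

```python
import re
from pathlib import PurePosixPath

# Assumed filename layout: <TYPE>_<NUMBER>_<YEAR>_<topic>, e.g. UU_41_1999_kehutanan
_NAME_RE = re.compile(
    r"^(?P<kind>[A-Za-z_]+?)_(?P<num>\d+)_(?P<year>\d{4})_(?P<topic>.+)$"
)


def statute_label(path: str) -> str:
    """Turn a statute filename into a readable label; fall back to the bare stem."""
    stem = PurePosixPath(path).stem
    m = _NAME_RE.match(stem)
    if m is None:
        return stem
    kind = m["kind"].replace("_", " ")
    topic = m["topic"].replace("_", " ")
    return f"{kind} No. {m['num']} Tahun {m['year']} ({topic})"


def build_context(hits, id_to_path):
    """Join retrieved chunks, labelling each with its statute instead of a file ID."""
    parts = []
    for h in hits:
        # Unknown file IDs are passed through unchanged as the label
        label = statute_label(id_to_path.get(h["file_id"], h["file_id"]))
        parts.append(f"[{label}, chunk {h['chunk_index']}] {h['text']}")
    return "\n\n".join(parts)
```

With this in place, `context = build_context(retrieved, id_to_path)` would replace the join above. Whether readable labels improve citation accuracy is worth checking on your own queries.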

Step 3: synthesize a cited answer

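# The system prompt (in Indonesian) requires a complete legal basis (statute name,
# number, year, Pasal, ayat), verbatim quotes, an honest reply when the context is
# insufficient, and no invented citations, in a fixed three-part output format.
resp = client.chat.completions.create(
    model="epithre-omni",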
    messages=[
        {"role": "system", "content": [
            {"type": "text", "text":
                ("Kamu asisten hukum Indonesia. Jawab berdasarkan konteks "
                 "peraturan yang diberikan. WAJIB:\n"
                 "1. Sebutkan dasar hukum lengkap: nama UU/PP/Permen + nomor + tahun + Pasal + ayat.\n"
                 "2. Kutip persis isi pasal yang relevan.\n"
                 "3. Kalau konteks tidak cukup untuk menjawab, bilang dengan jujur.\n"
                 "4. Jangan ngarang. Jangan tambah dasar hukum yang tidak ada di konteks.\n\n"
                 "Format jawaban:\n"
                 "DASAR HUKUM: <kutipan pasal lengkap>\n"
                 "JAWABAN: <penjelasan dengan bahasa awam>\n"
                 "PERINGATAN: <kalau ada nuansa atau pengecualian>"),
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "user", "content":
            f"KONTEKS PERATURAN:\n{context}\n\n"
            f"PERTANYAAN: {question}"},
    ],
)
print(resp.choices[0].message.content)

Example output (illustrative; check any quoted pasal text against the statute itself):

DASAR HUKUM:
- UU No. 41 Tahun 1999 tentang Kehutanan, Pasal 50 ayat (3) huruf e:
  "Setiap orang dilarang melakukan kegiatan pertambangan tanpa izin Menteri di kawasan hutan."
- Pasal 78 ayat (5): "Barangsiapa dengan sengaja melanggar... diancam dengan
  pidana penjara paling lama 10 (sepuluh) tahun dan denda paling banyak
  Rp 5.000.000.000,00."

JAWABAN:
Perambahan hutan lindung tanpa izin dapat dipidana penjara hingga 10 tahun
dan denda hingga Rp 5 miliar berdasarkan UU Kehutanan...

PERINGATAN: Hukuman dapat lebih berat jika perbuatan dilakukan secara
terorganisir atau menyebabkan kerusakan lingkungan yang signifikan,
yang diatur dalam UU 32/2009 tentang PPLH.
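Because the system prompt fixes a three-part format, the reply can be split into sections before display or further checks. The following is a minimal sketch, assuming the model emits the `DASAR HUKUM:`, `JAWABAN:`, and `PERINGATAN:` headers at the start of a line as in the example output. `split_sections` and `missing_sections` are hypothetical helper names.

```python
import re

SECTION_NAMES = ("DASAR HUKUM", "JAWABAN", "PERINGATAN")

# Match any of the section headers at the start of a line, followed by a colon
_HEADER_RE = re.compile(
    r"^(" + "|".join(re.escape(n) for n in SECTION_NAMES) + r"):[ \t]*",
    re.MULTILINE,
)


def split_sections(answer: str) -> dict:
    """Return {header: body} for each section header found in the answer."""
    matches = list(_HEADER_RE.finditer(answer))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next header, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        sections[m.group(1)] = answer[m.end():end].strip()
    return sections


def missing_sections(answer: str) -> list:
    """List the expected sections that are absent or empty."""
    found = split_sections(answer)
    return [name for name in SECTION_NAMES if not found.get(name)]
```

A non-empty result from `missing_sections(resp.choices[0].message.content)` is a cheap signal that the model drifted from the requested format and the request may be worth retrying.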

Why this works well

epithre-embed is tuned on Indonesian legal text, so retrieval finds the right pasal even when the query is phrased differently from the statute.
The system prompt sets firm constraints: the model must cite its legal basis, must say so when the context is too thin to answer, and must not invent citations.
The cache marker on the system prompt means that follow-up queries in the same session are billed at the 10% cached-read rate for those prompt tokens.
epithre-omni reasons well over the legal hierarchy (UU > PP > Permen) and over ayat structure.
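To make the caching point concrete, the arithmetic can be sketched as below. This assumes the first request in a session pays the full input rate for the system prompt and each later request pays 10% of it. Any cache-write surcharge or cache expiry is ignored, and the token count in the usage note is a made-up number, so treat the result as a rough estimate rather than a billing formula.

```python
def system_prompt_token_cost(prompt_tokens, n_requests, cached_rate=0.10):
    """Billable token-equivalents for the system prompt across one session.

    Assumption: request 1 is billed in full, requests 2..n at `cached_rate`.
    """
    if n_requests <= 0:
        return 0.0
    return prompt_tokens * (1 + (n_requests - 1) * cached_rate)


def savings_fraction(n_requests, cached_rate=0.10):
    """Fraction of system-prompt cost saved compared with no caching."""
    if n_requests <= 0:
        return 0.0
    cached = 1 + (n_requests - 1) * cached_rate
    return 1 - cached / n_requests
```

For a hypothetical 300-token system prompt and 10 queries, this gives about 570 token-equivalents instead of 3,000, a saving of roughly 81% on the system prompt. The retrieved context in the user message changes per query and is not covered by this marker.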

For long-document follow-ups

If the user asks a follow-up that needs additional context (e.g. "kalau pelakunya korporasi, gimana?", roughly "what if the offender is a corporation?"), you have two options:

Re-retrieve with the new query to bring in broader context.
Pass the full statute text in one shot to epithre-prme (180K context) for deep analysis.
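For the re-retrieval option, one approach is to run the retrieval call from Step 2 again with the follow-up question and merge the new hits with the earlier ones, so the model still sees the original context. The following is a minimal sketch. `merge_hits` is a hypothetical helper, and it assumes each hit carries the `file_id` and `chunk_index` fields used in Step 2.

```python
def merge_hits(*hit_lists, limit=None):
    """Concatenate retrieval results, dropping chunks that were already seen.

    Earlier lists take priority, so original-question hits stay at the front.
    """
    seen = set()
    merged = []
    for hits in hit_lists:
        for h in hits:
            key = (h["file_id"], h["chunk_index"])
            if key in seen:
                continue  # Same chunk returned by an earlier query
            seen.add(key)
            merged.append(h)
    return merged if limit is None else merged[:limit]
```

For example, `merge_hits(retrieved, followup_hits, limit=15)` keeps the first-round chunks and appends only the chunks that are new for the follow-up, where `followup_hits` is the result of the second retrieval call.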

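# Full-document deep dive
# full_text (the extracted statute text) and complex_question are assumed to be
# defined by your application; they are not created earlier in this example.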
resp = client.chat.completions.create(
    model="epithre-prme",
    messages=[
        {"role": "system", "content": "Analisa hukum mendalam."},
        {"role": "user", "content":
            f"Berikut UU 41/1999 lengkap (50 halaman):\n\n{full_text}\n\n"
            f"Pertanyaan: {complex_question}"},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

With thinking enabled, epithre-prme can produce dense multi-pasal cross-references comparable in quality to a junior lawyer's analysis.

Validating citations

Critical: the model occasionally hallucinates pasal numbers, especially for niche statutes. Always verify programmatically that every cited pasal exists in your corpus.

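import logging
import re

log = logging.getLogger(__name__)
model_output = resp.choices[0].message.content

# Match "Pasal N ... ayat (M)"; the gap may not run into a following "Pasal"
citations = re.findall(r"Pasal\s+(\d+)(?:(?!Pasal)[\w\s])*?ayat\s*\((\d+)\)", model_output)
for pasal, ayat in citations:
    # statute_index is a lookup you build from your own corpus (not shown here)
    if not statute_index.has(pasal, ayat):
        log.warning(f"Citation Pasal {pasal} ayat ({ayat}) not in corpus, possible hallucination")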
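The check above relies on a `statute_index` object with a `has(pasal, ayat)` method, which this example does not define. One way to build it is sketched below. It assumes the extracted statute text puts each "Pasal N" heading on its own line and starts each ayat with "(n)" at the beginning of a line. `StatuteIndex` is a hypothetical class, not an Epithre feature, and it does not handle lettered articles such as "Pasal 50A" or multiple statutes sharing pasal numbers (keep one index per statute for that).

```python
import re

_PASAL_HEADING = re.compile(r"^Pasal\s+(\d+)\s*$")
_AYAT_START = re.compile(r"^\((\d+)\)")


class StatuteIndex:
    """Records which pasal and ayat numbers appear in one statute's text."""

    def __init__(self):
        self._ayat_by_pasal = {}  # pasal number (str) -> set of ayat numbers (str)

    def add_text(self, text):
        current = None
        for raw in text.splitlines():
            line = raw.strip()
            heading = _PASAL_HEADING.match(line)
            if heading:
                current = heading.group(1)
                self._ayat_by_pasal.setdefault(current, set())
                continue
            ayat = _AYAT_START.match(line)
            if ayat and current is not None:
                self._ayat_by_pasal[current].add(ayat.group(1))

    def has(self, pasal, ayat=None):
        """True if the pasal exists and, when given, the ayat exists under it."""
        pasal = str(pasal)
        if pasal not in self._ayat_by_pasal:
            return False
        return ayat is None or str(ayat) in self._ayat_by_pasal[pasal]
```

Existence checks only catch invented numbers. They do not confirm that the quoted wording matches the pasal, so comparing the quoted string against the stored text is a sensible next step.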