Evaluating model output

If you're shipping LLM-powered features to production, you need a way to evaluate them. There are two main approaches: rule-based metrics (string matching, precision/recall, BLEU) and LLM-as-judge.

Build a gold set

Start with 50-200 examples per task. Each example has:

Input
Expected output (your hand-graded correct answer)
Optional: metadata (difficulty, category) for slicing

Store as JSONL:

{"id": "qa-001", "input": "Apa ibu kota Jepang?", "expected": "Tokyo", "category": "factual"}
{"id": "qa-002", "input": "Berapa hukuman penebangan liar?", "expected": "Pasal 50 UU 41/1999", "category": "legal"}
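A gold set is only useful if every line parses and every id is unique, so it can help to validate the file when loading it. The following is a minimal sketch, not a prescribed loader: the function name `load_gold` and the choice of required fields are assumptions based on the two example records above.

```python
import json

# Assumption: "category" and other metadata are optional; these three are required.
REQUIRED_FIELDS = ("id", "input", "expected")

def load_gold(lines):
    """Parse JSONL lines into a list of dicts, rejecting malformed or duplicate examples."""
    gold, seen = [], set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        ex = json.loads(line)
        missing = [k for k in REQUIRED_FIELDS if k not in ex]
        if missing:
            raise ValueError(f"line {n}: missing fields {missing}")
        if ex["id"] in seen:
            raise ValueError(f"line {n}: duplicate id {ex['id']!r}")
        seen.add(ex["id"])
        gold.append(ex)
    return gold
```

Because it accepts any iterable of lines, it works on an open file handle, for example `with open("gold.jsonl", encoding="utf-8") as f: gold = load_gold(f)` (the filename is a placeholder).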

Rule-based metrics

For tasks with well-defined correct outputs:

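def evaluate(predictions, gold):
    """Share of predictions (strings) that contain the expected answer, ignoring case."""
    # zip() silently truncates, so fail loudly on a length mismatch.
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0  # avoid ZeroDivisionError on an empty gold set
    correct = 0
    for p, g in zip(predictions, gold):
        if g["expected"].lower() in p.lower():
            correct += 1
    return correct / len(gold)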

Useful for: extraction (did we find the NIK, the national ID number?), classification (did we pick the right enum?), simple factual QA.

Limits: substring matching doesn't handle paraphrasing, partial matches, or register differences, and it can pass a wrong answer that merely mentions the expected string.
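One common way to give credit for partial matches is token-level F1, which scores the word overlap between prediction and expected answer instead of requiring a substring hit. The sketch below is one possible scorer, not the method this guide prescribes; `tokenize` and `token_f1` are illustrative names. It does not address paraphrasing or register, which need a judge model.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def token_f1(prediction, expected):
    """Harmonic mean of token precision and recall, in the range 0.0 to 1.0."""
    pred, gold = tokenize(prediction), tokenize(expected)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A verbose but correct answer scores below 1.0 here because precision drops, so choose a pass threshold on your own gold set rather than assuming one.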

LLM-as-judge

For open-ended tasks (summarization, explanation, creative output), use an LLM as the grader:

import json

# `client` is assumed to be an already-configured, OpenAI-compatible SDK client.
def judge(input_text, expected, predicted, criteria="overall correctness and completeness"):
    resp = client.chat.completions.create(
        model="epithre-omni",
        temperature=0,  # reduce run-to-run variation in scores
        messages=[
            {"role": "system", "content":
                f"Evaluate a prediction against the expected answer. Score 1-5 on {criteria}. "
                "Output JSON: {score: int, reasoning: string}."},
            {"role": "user", "content":
                f"Question: {input_text}\nExpected: {expected}\nPrediction: {predicted}"},
        ],
        response_format={"type": "json_object"},
    )
    # Parsed result looks like {"score": 4, "reasoning": "..."}
    return json.loads(resp.choices[0].message.content)

For consistency across runs:

- Use temperature=0 to make judging as repeatable as possible (scores may still vary slightly between runs).
- Use a stronger model than the one being evaluated (e.g., judge epithre-lyt outputs with epithre-omni).
- Run the judge 3 times and average the scores to reduce variance.
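Averaging repeated judge calls could look like the sketch below. It takes the judge as a parameter (`judge_fn`, a hypothetical name) so it can wrap the `judge` function above or any other callable that returns a dict with a `score` key; the range check assumes the 1-5 scale used in the prompt.

```python
from statistics import mean

def judge_averaged(judge_fn, input_text, expected, predicted, runs=3):
    """Call judge_fn `runs` times and return the mean score plus the raw scores."""
    scores = []
    for _ in range(runs):
        result = judge_fn(input_text, expected, predicted)
        score = int(result["score"])
        # Assumption: the judge prompt asks for a 1-5 score, so anything else is an error.
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score: {score}")
        scores.append(score)
    return {"mean": mean(scores), "scores": scores}
```

Keeping the raw scores alongside the mean makes it easy to spot examples where the judge disagrees with itself, which are often worth a manual look.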

A/B comparison

When you want to know if epithre-omni or epithre-prme is better for your task:

def compare(input_text):
    a = client.chat.completions.create(model="epithre-omni",
                                       messages=[{"role": "user", "content": input_text}])
    b = client.chat.completions.create(model="epithre-prme",
                                       messages=[{"role": "user", "content": input_text}])

    # Named `verdict` so it does not shadow the judge() function defined earlier.
    verdict = client.chat.completions.create(
        model="epithre-omni",
        temperature=0,  # reduce run-to-run variation in the verdict
        messages=[
            {"role": "system", "content":
                "Two assistants answered the same question. Which is better? "
                "Output JSON: {winner: 'A' | 'B' | 'tie', reasoning: string}."},
            {"role": "user", "content":
                f"Q: {input_text}\n\nA: {a.choices[0].message.content}\n\nB: {b.choices[0].message.content}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(verdict.choices[0].message.content)

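Bias: a model judging its own output tends to favor it, and in the example above epithre-omni is both candidate A and the judge. Where possible, judge with the strongest available model that is not one of the candidates.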
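A pairwise judge can also lean toward whichever answer it is shown first. One common mitigation is to ask twice with the order swapped and only accept a winner when both verdicts agree. The sketch below assumes a hypothetical callable `judge_pair(question, first, second)` that returns 'A', 'B', or 'tie', where 'A' means the answer shown first; it is an illustration, not part of the `compare` function above.

```python
def compare_both_orders(judge_pair, question, answer_a, answer_b):
    """Judge in both presentation orders; return 'A', 'B', or 'tie' in the original labels."""
    first = judge_pair(question, answer_a, answer_b)   # answer_a shown as "A"
    second = judge_pair(question, answer_b, answer_a)  # answer_b shown as "A"
    # Map the swapped verdict back to the original labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    # Disagreement between the two orders is treated as a tie.
    return first if first == unswapped else "tie"
```

This doubles the judge cost per comparison. An unexpected verdict string raises a KeyError, which surfaces malformed judge output instead of silently counting it.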

Per-category slicing

Always slice your eval results by category. A model might score 90% overall but 60% on legal queries, masking a real problem.

from collections import defaultdict

# run_model() and evaluate_one() stand in for your own inference and scoring functions.
scores = defaultdict(list)
for ex in gold:
    pred = run_model(ex["input"])
    s = evaluate_one(pred, ex["expected"])
    scores[ex["category"]].append(s)

for cat, ss in scores.items():
    print(f"{cat}: {sum(ss)/len(ss):.2%}  (n={len(ss)})")

Eval at scale via Batch API

For 10K+ evals, use the Batch API:

import json

with open("eval_inputs.jsonl", "w") as f:
    for ex in gold:
        f.write(json.dumps({
            "custom_id": ex["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "epithre-omni",
                     "messages": [{"role": "user", "content": ex["input"]}]}
        }) + "\n")

# Submit, wait, download. Match by custom_id, run judge separately.

Batch requests are 50% off, which lets you afford much larger eval sets.
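Once the batch output file is downloaded, results need to be joined back to the gold set by `custom_id`, because output order is not necessarily input order. The sketch below assumes each output line has the shape `{"custom_id": ..., "response": {"status_code": 200, "body": {...chat completion...}}, "error": null}`. That shape is an assumption modeled on OpenAI-style batch output, so check it against your provider's documentation before relying on it. `join_batch_results` is an illustrative name.

```python
import json

def join_batch_results(gold, output_lines):
    """Return (pairs, missing_ids): gold examples joined with predictions by custom_id."""
    predictions = {}
    for line in output_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if row.get("error") or response.get("status_code") != 200:
            continue  # failed request; its id ends up in missing_ids
        # ASSUMPTION: the chat completion is nested under response.body.
        content = response["body"]["choices"][0]["message"]["content"]
        predictions[row["custom_id"]] = content
    pairs = [(ex, predictions[ex["id"]]) for ex in gold if ex["id"] in predictions]
    missing = [ex["id"] for ex in gold if ex["id"] not in predictions]
    return pairs, missing
```

Returning the missing ids separately keeps failed requests visible, so they can be retried instead of silently shrinking the eval set.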

Indonesian-specific considerations

Register match: if input is casual, expected output should be casual. Don't penalize the model for matching register.
Spelling variations: "Yogya", "Yogyakarta", and "Jogja" all refer to the same city. Treat them as equivalent in your scorer.
Code-switching: in Indonesian tech contexts, mixing English is natural. Don't penalize.
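Spelling variations can be handled in a rule-based scorer by mapping known variants to one canonical form before comparing. This is a minimal sketch: the alias table contains only the variants named above and is meant to be extended from your own data, and the names `ALIASES`, `canonicalize`, and `equivalent` are illustrative.

```python
import re

# Hypothetical alias table; extend it with the variants that appear in your own data.
ALIASES = {
    "yogya": "yogyakarta",
    "jogja": "yogyakarta",
}

def canonicalize(text):
    """Lowercase, drop punctuation, and replace known variants token by token."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(ALIASES.get(tok, tok) for tok in tokens)

def equivalent(prediction, expected):
    """True when both strings are identical after canonicalization."""
    return canonicalize(prediction) == canonicalize(expected)
```

Register and code-switching are harder to capture with rules; for those, it is usually simpler to state in the judge criteria that casual register and mixed English are acceptable.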