Evaluating model output
If you ship LLM-powered features to production, you need a way to evaluate their output. There are two main approaches: rule-based metrics (precision, recall, BLEU) and LLM-as-judge.
Build a gold set
Start with 50-200 examples per task. Each example has:
- Input
- Expected output (your hand-graded correct answer)
- Optional: metadata (difficulty, category) for slicing
Store as JSONL:
{"id": "qa-001", "input": "Apa ibu kota Jepang?", "expected": "Tokyo", "category": "factual"}
{"id": "qa-002", "input": "Berapa hukuman penebangan liar?", "expected": "Pasal 50 UU 41/1999", "category": "legal"}
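A minimal sketch of loading and checking such a file, assuming the field names shown in the examples above. The function name `load_gold` and the set of required fields are illustrative choices, not part of any API.

```python
import json

# Assumed from the examples above; "category" stays optional.
REQUIRED_FIELDS = ("id", "input", "expected")

def load_gold(path):
    """Read a JSONL gold set, skipping blank lines and checking required fields."""
    examples = []
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            ex = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in ex]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            if ex["id"] in seen_ids:
                raise ValueError(f"line {line_no}: duplicate id {ex['id']!r}")
            seen_ids.add(ex["id"])
            examples.append(ex)
    return examples
```

Failing fast on duplicate ids matters later, because results are matched back to examples by id.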
Rule-based metrics
For tasks with well-defined correct outputs:
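def evaluate(predictions, gold):
    """Fraction of predictions that contain the expected answer (case-insensitive)."""
    if not gold:
        return 0.0  # avoid dividing by zero on an empty gold set
    correct = 0
    for p, g in zip(predictions, gold):
        # Substring match: tolerates extra words around the expected answer.
        if g["expected"].lower() in p.lower():
            correct += 1
    return correct / len(gold)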
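Useful for: extraction (did we find the NIK, the national ID number?), classification (did we pick the right enum?), simple factual QA.
Limits: it doesn't handle paraphrasing, partial matches, or register differences, and a substring match can pass an answer that merely mentions the expected string.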
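One common way to give credit for partial matches is token-level F1 instead of an all-or-nothing substring check. The sketch below is an illustration under simple assumptions (whitespace tokenization, case-insensitive comparison), and `token_f1` is a name chosen here, not something the cookbook defines.

```python
from collections import Counter

def token_f1(prediction, expected):
    """Token-overlap F1 between prediction and expected, case-insensitive."""
    pred_tokens = prediction.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Whitespace splitting leaves punctuation attached to words, so "Tokyo." and "Tokyo" count as different tokens unless you strip punctuation first.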
LLM-as-judge
For open-ended tasks (summarization, explanation, creative output), use an LLM to score the output:
import json

# `client` is assumed to be an API client configured elsewhere.
def judge(input_text, expected, predicted, criteria="overall correctness and completeness"):
    resp = client.chat.completions.create(
        model="epithre-omni",
        temperature=0,  # keep judging as repeatable as possible
        messages=[
            {"role": "system", "content":
                f"Evaluate a prediction against the expected answer. Score 1-5 on {criteria}. "
                "Output JSON: {score: int, reasoning: string}."},
            {"role": "user", "content":
                f"Question: {input_text}\nExpected: {expected}\nPrediction: {predicted}"},
        ],
        response_format={"type": "json_object"},
    )
    # Returns a dict with "score" and "reasoning" keys if the judge followed the prompt.
    return json.loads(resp.choices[0].message.content)
For consistency across runs:
- Use temperature=0 to make judging as repeatable as possible.
- Use a stronger model than the one being evaluated (e.g., judge epithre-lyt outputs with epithre-omni).
- Run the judge 3 times and average the scores to reduce any remaining variance.
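The averaging step can be sketched as a small wrapper. It assumes a judge callable with the same signature and return shape as the `judge` function above (a dict with an integer `score`). The name `average_judge_score` is hypothetical.

```python
from statistics import mean

def average_judge_score(judge_fn, input_text, expected, predicted, runs=3):
    """Call judge_fn several times and return the mean score plus the raw scores."""
    scores = []
    for _ in range(runs):
        result = judge_fn(input_text, expected, predicted)
        score = int(result["score"])
        # Guard against a judge that ignores the 1-5 scale in the prompt.
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score: {score}")
        scores.append(score)
    return {"mean": mean(scores), "scores": scores}
```

Keeping the raw scores alongside the mean makes it easy to spot examples where the judge disagrees with itself, which are often the ones worth reading by hand.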
A/B comparison
When you want to know if epithre-omni or epithre-prme is better for your task:
def compare(input_text):
    # `client` is assumed to be configured elsewhere; `json` must be imported.
    a = client.chat.completions.create(model="epithre-omni",
        messages=[{"role": "user", "content": input_text}])
    b = client.chat.completions.create(model="epithre-prme",
        messages=[{"role": "user", "content": input_text}])
    # Named `verdict` so it does not shadow the judge() function defined earlier.
    verdict = client.chat.completions.create(
        model="epithre-omni",
        messages=[
            {"role": "system", "content":
                "Two assistants answered the same question. Which is better? "
                "Output JSON: {winner: 'A' | 'B' | 'tie', reasoning: string}."},
            {"role": "user", "content":
                f"Q: {input_text}\n\nA: {a.choices[0].message.content}\n\nB: {b.choices[0].message.content}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(verdict.choices[0].message.content)
Bias: a model judging its own output tends to favor it. Use the strongest available model as judge, ideally one that is not among the models being compared.
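Judges can also be sensitive to the order in which the two answers appear. One common mitigation, sketched here as an assumption rather than something this cookbook prescribes, is to judge both orderings and only count a win when the verdicts agree. `judge_pair` is a hypothetical callable that wraps the judging call and returns 'A', 'B', or 'tie'.

```python
def debiased_winner(judge_pair, question, answer_1, answer_2):
    """Judge both orderings; return 'first', 'second', or 'tie'.

    judge_pair(question, a, b) is a hypothetical callable that returns
    'A', 'B', or 'tie' for the answers shown in positions A and B.
    """
    forward = judge_pair(question, answer_1, answer_2)   # answer_1 shown as A
    backward = judge_pair(question, answer_2, answer_1)  # answer_1 shown as B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # Disagreement between orderings (or an explicit tie) is treated as a tie.
    return "tie"
```

This doubles the number of judge calls, so it is a trade-off between cost and confidence in the verdict.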
Per-category slicing
Always slice your eval results by category. A model might score 90% overall but 60% on legal queries, masking a real problem.
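from collections import defaultdict

# run_model() and evaluate_one() are placeholders for your own inference call
# and per-example scorer (returning a number such as 0 or 1).
scores = defaultdict(list)
for ex in gold:
    pred = run_model(ex["input"])
    s = evaluate_one(pred, ex["expected"])
    # Category is optional metadata, so fall back to a catch-all bucket.
    scores[ex.get("category", "uncategorized")].append(s)

for cat, ss in scores.items():
    print(f"{cat}: {sum(ss)/len(ss):.2%} (n={len(ss)})")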
Eval at scale via Batch API
For 10K+ evals, use Batch:
import json

# Write one request per line; custom_id ties each result back to its gold example.
with open("eval_inputs.jsonl", "w") as f:
    for ex in gold:
        f.write(json.dumps({
            "custom_id": ex["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "epithre-omni",
                     "messages": [{"role": "user", "content": ex["input"]}]}
        }) + "\n")
# Submit, wait, download. Match by custom_id, run judge separately.
Batch requests are 50% off, which lets you afford much larger eval sets.
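The "match by custom_id" step might look like the sketch below. The output layout is an assumption: it supposes each line of the downloaded file is a JSON object with `custom_id` and a `response` holding `status_code` and a chat-completion `body`, mirroring the request format above. Check the Batch API reference for the actual schema before relying on it. `match_batch_results` is a hypothetical helper name.

```python
import json

def match_batch_results(output_path, gold):
    """Pair each gold example with its batch prediction, keyed by custom_id."""
    predictions = {}
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            response = row.get("response") or {}
            if response.get("status_code") != 200:
                continue  # failed request; it will be reported as missing
            # Assumed layout: response.body is a chat completion object.
            content = response["body"]["choices"][0]["message"]["content"]
            predictions[row["custom_id"]] = content
    matched = [(ex, predictions[ex["id"]]) for ex in gold if ex["id"] in predictions]
    missing = [ex["id"] for ex in gold if ex["id"] not in predictions]
    return matched, missing
```

Returning the missing ids separately keeps failed requests from silently shrinking the denominator of your scores.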
Indonesian-specific considerations
- Register match: if input is casual, expected output should be casual. Don't penalize the model for matching register.
- Spelling variations: "Yogya" vs "Yogyakarta" vs "Jogja" all reference the same city. Treat as equivalent in your scorer.
- Code-switching: in Indonesian tech contexts, mixing English is natural. Don't penalize.
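The spelling-variation point can be handled by normalizing both sides before comparing. This is a minimal sketch: the alias table is hypothetical and only covers the example above, and `normalize` and `matches` are names chosen for illustration.

```python
import re

# Hypothetical alias table; extend it with the variants seen in your own data.
ALIASES = {
    "yogya": "yogyakarta",
    "jogja": "yogyakarta",
    "jogjakarta": "yogyakarta",
}

def normalize(text):
    """Lowercase, strip punctuation, and map known aliases to a canonical form."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(ALIASES.get(tok, tok) for tok in tokens)

def matches(prediction, expected):
    """True if the normalized expected answer appears as whole words in the prediction."""
    # Padding with spaces prevents matching inside a longer word.
    pred = f" {normalize(prediction)} "
    gold = f" {normalize(expected)} "
    return gold in pred
```

For example, `matches("Kotanya Jogja.", "Yogyakarta")` is true because both sides normalize to the same canonical token, while a single-token alias table like this one will not catch multi-word variants.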
See also
- Cookbook: classification - for classifier evals.
- Cookbook: translation - BLEU/ROUGE patterns.
- Best practices