Engineering · for technical evaluators
I build production AI for a business where a wrong answer carries financial or legal weight — the work sits under board review and external audit. This page is how a change earns its way into production.
Evals first
The company’s question-answering system started as the usual prototype: embed the documents, search by vector, hope. Against a hand-labelled gold set it scored roughly 0.05 recall@5 — the right source passage reached the top five results about once in twenty queries. It demoed well. It was useless.
The fix was unglamorous: hybrid retrieval — keyword search alongside vectors — then a reranking pass, tuned question by question against that gold set. Recall@5 now sits around 0.85. That figure is retrieval recall, not answer accuracy: it measures whether the right evidence reaches the model at all, which caps everything downstream.
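As an illustration of how keyword and vector results can be merged before reranking, here is a minimal sketch using reciprocal rank fusion. The page does not say which fusion method the system uses, so RRF is an assumption, as are the function name, the document IDs, and the constant `k=60` (a common default, not a value from this system).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents found by both retrievers rise.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for a single query (IDs invented for illustration).
keyword_hits = ["pricing.md", "callout.md", "faq.md"]
vector_hits = ["callout.md", "terms.md", "pricing.md"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
candidates = fused[:5]  # the shortlist a reranking pass would then reorder
print(candidates)
```

The appeal of a rank-based merge is that it never compares a keyword score with a cosine similarity directly; only positions matter, so the two retrievers need no score calibration.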
What keeps the number honest is the gate: every retrieval change re-runs the gold set in CI, and a change that scores below the current baseline does not ship. The same logic runs in the simulation below, and in the public harness.
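The gate described above can be sketched in a few lines. This is not the public harness: the function names, question IDs, and passage IDs below are hypothetical, and the sketch assumes each gold question maps to exactly one source passage.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold questions whose source passage is in the top k results.

    retrieved: question ID -> ranked list of passage IDs
    gold:      question ID -> the one passage ID labelled as correct
    """
    hits = sum(
        1 for question, passage in gold.items()
        if passage in retrieved.get(question, [])[:k]
    )
    return hits / len(gold)


def gate(candidate_score, baseline):
    """A change may ship only if it does not score below the baseline."""
    return candidate_score >= baseline


def run_gate(retrieved, gold, baseline):
    """Return a process exit code: 0 passes the CI job, 1 fails it."""
    score = recall_at_k(retrieved, gold)
    print(f"recall@5 = {score:.3f} (baseline {baseline:.3f})")
    return 0 if gate(score, baseline) else 1


# Hypothetical gold set and candidate results (IDs invented for illustration).
gold = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c", "q4": "doc-d"}
retrieved = {
    "q1": ["doc-a", "doc-x"],
    "q2": ["doc-y", "doc-b"],
    "q3": ["doc-z"],  # miss: the labelled passage never surfaces
    "q4": ["doc-d"],
}

exit_code = run_gate(retrieved, gold, baseline=0.85)
```

In a CI job the script would pass `exit_code` to `sys.exit`, so a regression fails the build instead of relying on someone reading a report.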
A working simulation — go on, ship something bad
Eight gold questions, three candidate changes, one gate. The threshold is 0.80.
| Gold question | Top-5 result |
|---|---|
| What’s the weekend call-out fee? | ✓ in top 5 — rank 1 |
| Which jobs need a permit to work near a watercourse? | ✓ in top 5 — rank 2 |
| What’s the cancellation policy inside 24 hours? | ✓ in top 5 — rank 1 |
| Who signs off a discount above 10%? | ✓ in top 5 — rank 1 |
| What PPE does confined-space entry require? | ✓ in top 5 — rank 3 |
| How do we invoice a job that spans two sites? | ✗ not in top 5 (the evidence is split across two documents) |
| What does error E-17 on the jetting unit mean? | ✓ in top 5 — rank 1 |
| When was the fleet insurance last renewed? | ✓ in top 5 — rank 2 |
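Read as a single run, the table above works out as follows. This assumes the eight rows are the complete gold set for the run being gated; the page does not say which of the three candidate changes the table shows.

```latex
\text{recall@5}
  = \frac{\text{questions with the source passage in the top 5}}{\text{gold questions}}
  = \frac{7}{8}
  = 0.875 \;\geq\; 0.80
```

Seven hits out of eight clears the 0.80 threshold. One more miss would give 6/8 = 0.75 and fail the gate, so on a gold set this small a single question decides the outcome.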
Facts a code review would confirm.
The stack, in one sentence: TypeScript and Python on Postgres, serverless functions plus one small VPS, n8n for the plumbing between SaaS systems, LLM providers behind a thin abstraction so a model can be swapped the day the eval says a better one exists, and GitHub Actions for CI.
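A thin provider abstraction of the kind mentioned above might look like the sketch below. Every name in it (`LLMProvider`, `CannedProvider`, `answer`, the model labels) is hypothetical, and the stand-in class returns a fixed string where a real implementation would call a vendor SDK.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """The only surface the rest of the system is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class CannedProvider:
    """Stand-in provider that returns a fixed reply, tagged with its name."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {self.reply}"


def answer(question: str, evidence: list[str], provider: LLMProvider) -> str:
    """Build the prompt from retrieved evidence and delegate to the provider."""
    prompt = (
        "Answer using only this evidence:\n"
        + "\n".join(evidence)
        + f"\n\nQuestion: {question}"
    )
    return provider.complete(prompt)


# Swapping models is a change to one lookup, not to the calling code.
providers: dict[str, LLMProvider] = {
    "model-a": CannedProvider("model-a", "See the evidence."),
    "model-b": CannedProvider("model-b", "See the evidence."),
}

reply = answer("Example question?", ["Example passage."], providers["model-b"])
print(reply)
```

Because callers only see `complete`, the same eval suite can run against each entry in `providers`, and the comparison stays between models instead of between integrations.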