Engineering · for technical evaluators

How I build.

I build production AI for a business where a wrong answer carries financial or legal weight — the work sits under board review and external audit. This page is how a change earns its way into production.

Evals first

The retrieval rebuild.

The company’s question-answering system started as the usual prototype: embed the documents, search by vector, hope. Against a hand-labelled gold set it scored roughly 0.05 recall@5 — the right source passage reached the top five results about once in twenty queries. It demoed well. It was useless.
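For readers who want the metric pinned down, here is a minimal sketch of recall@5 as described above. It assumes one labelled gold passage per question; the function name, the identifiers, and the twenty-question data are made up for illustration and are not taken from the production harness.

```python
from typing import Mapping, Sequence


def recall_at_k(
    retrieved: Mapping[str, Sequence[str]],
    gold: Mapping[str, str],
    k: int = 5,
) -> float:
    """Fraction of gold questions whose labelled passage is in the top-k results."""
    if not gold:
        raise ValueError("gold set is empty")
    hits = sum(
        1
        for question, passage_id in gold.items()
        if passage_id in retrieved.get(question, ())[:k]
    )
    return hits / len(gold)


# Twenty hypothetical questions; only q0 has its gold passage in the top five.
gold = {f"q{i}": f"doc{i}" for i in range(20)}
retrieved = {q: ["other-a", "other-b", "other-c", "other-d", "other-e"] for q in gold}
retrieved["q0"] = ["other-a", "doc0", "other-b", "other-c", "other-d"]

print(recall_at_k(retrieved, gold))  # 0.05
```

One hit in twenty questions is the "once in twenty queries" figure: 1 / 20 = 0.05.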

The fix was unglamorous: hybrid retrieval — keyword search alongside vectors — then a reranking pass, tuned question by question against that gold set. Recall@5 now sits around 0.85. That figure is retrieval recall, not answer accuracy: it measures whether the right evidence reaches the model at all, which caps everything downstream.
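The text above does not say how the keyword and vector results are combined, so the sketch below swaps in one common option, reciprocal rank fusion, purely as an assumption. The reranker is passed in as a plain callable; the passage ids, the constant `c = 60`, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], c: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (c + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    keyword_hits: Sequence[str],
    vector_hits: Sequence[str],
    rerank: Callable[[list[str]], list[str]],
    k: int = 5,
) -> list[str]:
    """Fuse both candidate lists, then let the reranker order them and keep the top k."""
    candidates = rrf_fuse([keyword_hits, vector_hits])
    return rerank(candidates)[:k]


# Hypothetical passage ids returned by each retriever for one question.
keyword_hits = ["fees", "permits", "ppe"]
vector_hits = ["cancellation", "fees", "insurance"]

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused[0])  # fees (the only passage both retrievers returned)
```

The point of fusing before reranking is that a passage found by either retriever stays in the candidate pool, so the reranker gets to judge it.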

What keeps the number honest is the gate: every retrieval change re-runs the gold set in CI, and a change that scores below the current baseline does not ship. The same logic runs in the simulation below, and in the public harness.
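A minimal sketch of such a gate, assuming the CI job has already computed the candidate's recall@5 and can read the stored baseline. The function name and the scores are hypothetical; the real harness may differ in detail.

```python
def gate(candidate_score: float, baseline_score: float) -> int:
    """Return a process exit code: 0 lets the change ship, 1 fails the CI job."""
    if candidate_score < baseline_score:
        print(f"FAIL: recall@5 {candidate_score:.3f} is below baseline {baseline_score:.3f}")
        return 1
    print(f"PASS: recall@5 {candidate_score:.3f} meets baseline {baseline_score:.3f}")
    return 0


# Hypothetical scores: the candidate beats the stored baseline, so the gate passes.
exit_code = gate(candidate_score=0.875, baseline_score=0.85)

# In a CI step the script would end with sys.exit(exit_code); a non-zero exit
# code fails the step, which is what blocks the merge.
```

Returning an exit code rather than raising keeps the decision visible in the job log and lets the CI runner do the blocking.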

A working simulation — go on, ship something bad

Break the deploy.

Eight gold questions, three candidate changes, one gate. The threshold is recall@5 of 0.80, so a change needs at least seven of the eight questions in the top five to pass.

Simulation: the per-question results are hand-authored, not computed live, but the gate logic matches the public harness. To run the real thing, clone the repository and run the single command the README shows.

Gold-set results for the selected change
| Gold question | Top-5 result |
| --- | --- |
| What’s the weekend call-out fee? | ✓ in top 5, rank 1 |
| Which jobs need a permit to work near a watercourse? | ✓ in top 5, rank 2 |
| What’s the cancellation policy inside 24 hours? | ✓ in top 5, rank 1 |
| Who signs off a discount above 10%? | ✓ in top 5, rank 1 |
| What PPE does confined-space entry require? | ✓ in top 5, rank 3 |
| How do we invoice a job that spans two sites? | ✗ not in top 5: the evidence is split across two documents |
| What does error E-17 on the jetting unit mean? | ✓ in top 5, rank 1 |
| When was the fleet insurance last renewed? | ✓ in top 5, rank 2 |
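The arithmetic behind these results, as a short sketch. The ranks are copied from the results listed above; treating "pass" as a score at or above 0.80 is an assumption about the gate's comparison, and the second list is a hypothetical regression, not a real result.

```python
THRESHOLD = 0.80

# Rank of the gold passage for each of the eight questions; None = not in top 5.
ranks = [1, 2, 1, 1, 3, None, 1, 2]


def recall_at_5(ranks: list) -> float:
    """Share of questions whose gold passage ranked fifth or better."""
    return sum(1 for r in ranks if r is not None and r <= 5) / len(ranks)


score = recall_at_5(ranks)
print(score, score >= THRESHOLD)  # 0.875 True

# Hypothetical regression: one more question drops out of the top five.
worse = [1, 2, 1, 1, None, None, 1, 2]
print(recall_at_5(worse), recall_at_5(worse) >= THRESHOLD)  # 0.75 False
```

Seven hits out of eight is 0.875, which clears 0.80. A second miss gives 0.75, which does not, so with a gold set this small a single question decides the outcome.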

How the production systems are built.

Facts a code review would confirm.

The stack, in one sentence: TypeScript and Python on Postgres, serverless functions plus one small VPS, n8n for the plumbing between SaaS systems, LLM providers behind a thin abstraction so a model can be swapped the day the eval says a better one exists, and GitHub Actions for CI.
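To illustrate what "a thin abstraction" over LLM providers can look like, here is a minimal sketch. Every name in it (`ChatModel`, `EchoModel`, `PROVIDERS`, `answer`, the provider keys) is hypothetical, and the stand-in model only echoes its prompt so the example runs without network access or any vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the rest of the code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in provider; a real adapter would wrap a vendor client here."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Swapping models is a one-line change to this registry (or to a config value).
PROVIDERS: dict[str, ChatModel] = {
    "provider-a": EchoModel("provider-a"),
    "provider-b": EchoModel("provider-b"),
}


def answer(question: str, evidence: list[str], model_key: str = "provider-a") -> str:
    """Build the prompt once, then hand it to whichever provider is selected."""
    model = PROVIDERS[model_key]
    prompt = "Answer from the evidence only.\n" + "\n".join(evidence) + f"\nQ: {question}"
    return model.complete(prompt)


print(answer("What's the weekend call-out fee?", ["(evidence passage)"], model_key="provider-b"))
```

Because callers only see `complete`, the eval can run the same gold set against each registered provider and the winner is selected by key.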

This site is hand-built the same way, one file per page — the colophon explains how. Evaluating me for a role or an engagement? Email me.