Engineering · Thoughtcoast — code, evals, architecture

Evals first

The retrieval rebuild.

The company’s question-answering system started as the usual prototype: embed the documents, search by vector, hope. Against a hand-labelled gold set it scored roughly 0.05 recall@5 — the right source passage reached the top five results about once in twenty queries. It demoed well. It was useless.

The fix was unglamorous: hybrid retrieval — keyword search alongside vectors — then a reranking pass, tuned question by question against that gold set. Recall@5 now sits around 0.85. That figure is retrieval recall, not answer accuracy: it measures whether the right evidence reaches the model at all, which caps everything downstream.

What keeps the number honest is the gate: every retrieval change re-runs the gold set in CI, and a change that scores below the current baseline does not ship. The same logic runs in the simulation below, and in the public harness.

The case study → How to evaluate a RAG system — the method, written up → rag-eval-harness — the public repo →

A working simulation — go on, ship something bad

Break the deploy.

Eight gold questions, three candidate changes, one gate. The threshold is 0.80.

Gold-set results for the selected change
Gold question	Top-5 result
What’s the weekend call-out fee?	✓ in top 5 — rank 1
Which jobs need a permit to work near a watercourse?	✓ in top 5 — rank 2
What’s the cancellation policy inside 24 hours?	✓ in top 5 — rank 1
Who signs off a discount above 10%?	✓ in top 5 — rank 1
What PPE does confined-space entry require?	✓ in top 5 — rank 3
How do we invoice a job that spans two sites?	✗ not in top 5the evidence is split across two documents
What does error E-17 on the jetting unit mean?	✓ in top 5 — rank 1
When was the fleet insurance last renewed?	✓ in top 5 — rank 2

How the production systems are built.

Facts a code review would confirm.

Webhook handlers are idempotent — a retried delivery can’t create a second job or a second invoice.
Outbound calls respect provider rate limits; retries are capped and dead-lettered, never infinite.
API tokens are least-privilege — the quoting service can read the CRM, not delete from it.
Writes with real consequence stop for human sign-off — money out, records with legal weight. Automation drafts; a person commits.
Every automated change lands in an audit log: what changed, which system triggered it, when.
Every deploy carries a quality number, so a regression shows on a graph before it shows as a complaint.
Secrets live in the environment, never the repository; each environment has its own credentials.
If an automation stops, work falls back to a human queue — the failure mode is manual, not silent.

The stack, in one sentence: TypeScript and Python on Postgres, serverless functions plus one small VPS, n8n for the plumbing between SaaS systems, LLM providers behind a thin abstraction so a model can be swapped the day the eval says a better one exists, and GitHub Actions for CI.