Spotify's agent playbook, at 30 people

Spotify merged 2.5 million AI-written maintenance pull requests behind a playbook of standardisation, queryable context and verification gates. I run the miniature version inside a thirty-person services group, and it holds up.

Spotify's chief architect gave a talk this year called "How Spotify runs agents across 20M+ lines of code", and the companion post on their engineering blog is titled "Coding Is No Longer the Constraint". The numbers are worth sitting with. Ninety-nine per cent of their engineers use the AI tooling in a given week. Pull request volume is up 76 per cent. They have merged 2.5 million automated maintenance PRs, many with no human in the loop at all.

The easy response is that this is a Spotify story: thousands of engineers, a platform organisation, an internal developer portal so good they open-sourced it. Nothing to do with a thirty-person business. I think that reading is wrong. I run the technology for a multi-brand UK services group of about that size, and a shrunk-to-fit version of the same playbook is the most useful thing I have taken from a big-company engineering blog in years.

What Spotify actually did

Strip the talk to its bones and there are three moves.

First, they standardised. Agents perform better on consistent code, so Spotify deliberately narrowed its technology choices: fewer languages, fewer frameworks, the same patterns repeated everywhere. Consistency used to be a nice-to-have for humans. For agents it is a performance feature.

Second, they made context queryable. Backstage, their internal catalog, means an agent can look up who owns a service, where its docs live and how it deploys, instead of guessing from file names. The agent does not need to be told about the codebase. It can ask.

Third, they refused to trust output without verification. A background agent they call Honk runs builds on agent-written changes, CI verifies everything before a PR lands, and auto-merge is reserved for classes of change that have proven safe. The 2.5 million merged PRs are downstream of that last move. You only merge at fleet scale once the verification is boring.

The thirty-person translation

Each move has a miniature version. I run all three.

Backstage becomes one knowledge API. Mine is a small service that agents query for business facts by slug: how pricing works, how commission is calculated, which system is the source of truth for a booking. Before it existed, every agent session started from zero, and roughly once a fortnight one of them would confidently re-derive the commission rules and get them wrong. Now the rule is simple. If a fact matters, it has a slug, and the agent fetches it instead of inventing it.
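As a minimal sketch of that rule in Python, assuming an in-memory store: the class names, slugs and placeholder text below are all hypothetical, not taken from the real service. The point it illustrates is that an unknown slug fails loudly, so the agent has nothing to improvise from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    slug: str
    body: str
    source: str  # which system is the source of truth for this fact


class UnknownFact(KeyError):
    """Raised for a slug that is not in the store."""


class FactStore:
    def __init__(self, facts):
        self._facts = {fact.slug: fact for fact in facts}

    def get(self, slug: str) -> Fact:
        # Fail loudly instead of returning a default the agent could build on.
        try:
            return self._facts[slug]
        except KeyError:
            raise UnknownFact(slug) from None


# Placeholder entries (hypothetical): the real facts are not given in this post.
store = FactStore([
    Fact("commission-calculation", "<commission rule text goes here>", "finance-system"),
    Fact("booking-source-of-truth", "<name of the booking system goes here>", "ops-handbook"),
])

fact = store.get("commission-calculation")
print(fact.source)
```

Wrapping something like this in an HTTP endpoint, or backing it with a folder of files, is a detail. The contract is what matters: known slug in, one authoritative answer out.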

Honk's build farm becomes eval gates in CI. The gate I lean on most guards retrieval: a change that scores worse than the last shipped version against a golden set of questions does not ship, however good it looks in a demo. That gate is a scoring script and around fifty questions with known answers. An afternoon of work, not a farm.
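A gate of that shape can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual script: the substring-match scoring, the function names and the two toy questions are all invented here, and a real golden set would hold around fifty entries.

```python
def score(answer_fn, golden_set):
    """Fraction of golden questions whose expected answer appears in the reply."""
    if not golden_set:
        raise ValueError("golden set is empty")
    hits = sum(
        1
        for question, expected in golden_set
        if expected.lower() in answer_fn(question).lower()
    )
    return hits / len(golden_set)


def gate(candidate_score, shipped_score):
    """Pass only if the candidate is not worse than the last shipped version."""
    return candidate_score >= shipped_score


def main(answer_fn, golden_set, shipped_score):
    candidate = score(answer_fn, golden_set)
    print(f"candidate={candidate:.3f} shipped={shipped_score:.3f}")
    return 0 if gate(candidate, shipped_score) else 1  # non-zero fails the CI job


# Toy data (hypothetical): two questions and a canned "retrieval" function.
GOLDEN = [
    ("Which system owns bookings?", "booking-db"),
    ("How often is commission paid?", "monthly"),
]
CANNED = {
    "Which system owns bookings?": "Bookings live in booking-db.",
    "How often is commission paid?": "Commission is paid weekly.",
}

exit_code = main(CANNED.get, GOLDEN, shipped_score=1.0)
```

In CI the last step would hand `exit_code` to `sys.exit`, so a regression blocks the merge rather than merely logging a number.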

Fleet migrations become multi-agent fan-outs with adversarial verifiers. When the output matters (a financial report going to a director, a pricing change touching live quotes) I spawn two or three further agents whose only job is to attack the first one's work: check every figure against its source, hunt for the claim that does not survive contact. It is the same principle as Spotify's auto-merge. Trust is a property of the verification, not of the generator.
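The fan-out can be sketched like this. It is a simplification: in practice each verifier would be a separate agent, whereas the two below are deterministic stand-in functions, and the report layout, function names and figures are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def check_figures(report, sources):
    """Verifier: every sourced figure must match the value in its source."""
    return [
        f"{label}: report says {value}, source says {sources[label]}"
        for label, value in report["figures"].items()
        if label in sources and sources[label] != value
    ]


def check_unsourced(report, sources):
    """Verifier: flag any figure that has no source at all."""
    return [
        f"{label}: no source"
        for label in report["figures"]
        if label not in sources
    ]


def review(report, sources, verifiers):
    """Run every verifier in parallel and collect all objections, in verifier order."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda verify: verify(report, sources), verifiers)
        return [objection for result in results for objection in result]


# Hypothetical report and source data.
report = {"figures": {"revenue": 1200, "refunds": 90, "margin": 0.31}}
sources = {"revenue": 1200, "refunds": 85}

objections = review(report, sources, [check_figures, check_unsourced])
ship = not objections  # the report goes out only when no verifier objects
print(objections)
```

The generator never gets a vote. Whether the report ships depends only on what the verifiers return.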

Where the analogy breaks

I want to be honest about the differences, because they cut both ways.

At Spotify's scale the enemy is coordination: keeping thousands of engineers moving in compatible directions. At mine the enemy is discipline. There is no platform team to stop me adopting a shiny new framework on a tired Friday. I am the platform team. One person can break consistency faster than any agent can exploit it, so the standardisation rule is really a rule about my own behaviour, and it is the hardest one to keep.

The counterweight is that the payoff curve is steeper. Spotify's investment amortises across thousands of engineers, slowly. A thirty-person firm has no headcount to amortise anything across, so each piece of this has to pay back on the next job. It does. The knowledge API took a weekend and made every agent session better from the Monday after.

And some of it simply does not transfer. Two and a half million maintenance PRs presuppose a fleet of services. I have a dozen repos. The equivalent at my size is unglamorous hygiene: dependency bumps and dead-config removal on a schedule. Worth doing, never the headline.

Copy these three first

If you run technology for a small firm and want the compounding version of this, start here.

  1. Standardise ruthlessly. Pick one way to do each kind of thing and hold the line, especially against yourself. Agents repay consistency every single day.
  2. Make your facts queryable. One place, an API or even a folder of files, where agents fetch business truths instead of guessing them. It is the highest-value weekend project I know.
  3. Gate agent output with a score. Nothing agent-written ships on confidence alone. A golden set and a threshold in CI turn "seems fine" into "measurably not worse".

None of that needs Spotify's headcount. It needs a weekend, about fifty test questions, and the discipline not to break your own rules.
