Al Karakas
← All essays

AI Building · Product

The Spec Was Never Updated

15 min read

I wrote eight AI agents into the product spec before a line of code existed. I shipped two LLM calls. The pipeline ran end-to-end for the first time only after a session that found eleven separate production bugs. The spec is still open on my laptop. It still describes the eight-agent system. The code disagrees, but the spec was never updated.

8 → 2
AI agents in the product spec vs LLM calls in the shipped system. The rest is deterministic code with no model involved
11
production bugs found in a single session. Column names in code that had never matched the database schema. One session, zero ambiguity.
0
times the pipeline ran end-to-end in production before those eleven bugs were fixed. Not slow. Not partial. Simply never.

Aplio is my personal job search tool. Not a SaaS product, not a demo. A production system that runs daily, finds jobs, evaluates them against a verified evidence bank, and generates tailored CVs and cover letters grounded only in documented outcomes. No fabrication. No inference beyond what is written down. When a job requirement has no matching evidence, it flags the gap. It does not paper over it.

I wrote the product requirements document before I wrote any code. It runs to over 180 pages. It names eight agents, each with a canonical name, model, timeout, and full system prompt versioned under semver. It describes a seven-stage pipeline: JD Extractor passes to Research Agent, Research Agent passes to Evidence Agent, Evidence Agent passes to Composition Planner, Composition Planner passes to Plan Critique Agent, Plan Critique Agent passes contested outputs to a Sonnet pass and clean ones straight through, Document Writer produces a differential CV against a template, Document Critique Agent runs two automated quality passes. Five to seven LLM calls per application. Two critique loops. Three human approval gates.

None of that shipped.

What shipped instead

The production system has five components. Two of them make LLM calls.

The Analysis Agent is a single Sonnet call. It receives the job description, the evidence bank, and a structured prompt. It produces a composition plan: which evidence items to use, in which order, with which framing. In the original spec, that work was divided between the Evidence Agent (Haiku pre-pass, then Sonnet for ambiguous cases), the Composition Planner (Sonnet), and the Plan Critique Agent (Haiku first, Sonnet if contested). The Analysis Agent does all of that in one pass, with the human reviewing the output before anything continues.

The Brief Review is not an agent. It is a UI screen. A person reads the composition plan, amends it if necessary, and approves it. In the original spec, that review was the Plan Critique Agent: an automated Haiku pass against a set of golden files, with contested outputs escalated to Sonnet. The hypothesis was that the model could catch quality problems faster than a human. The production system uses the human instead. The human is faster, requires no golden files, and brings context the model does not have.

The Clerk is entirely deterministic code. It takes the approved composition plan and looks up the exact approved phrasings from the evidence bank by ID. No model. No inference. A lookup table and a renderer. In the spec, this was the Document Writer: a Sonnet call that would produce differential CV output against a structured template, guided by a 180-page style guide. The actual work of translating a composition plan into structured CV output turned out to be a code problem, not a language model problem.

The Counsel Agent is the second LLM call. Sonnet. 1,500 token maximum. It writes the professional summary and the role introduction for the cover letter. That is all it does. In the spec, the Document Writer handled this too, alongside every bullet point and every transition. In production, every bullet point comes from the evidence bank via the Clerk, and only the prose that genuinely requires language generation goes to the model.

The Compiler assembles the final output. Deterministic code.

Somewhere between the spec and the build, a seven-stage pipeline with two critique loops and five to seven LLM calls became two LLM calls with a human checkpoint in the middle. There is a comment in the routes file: "Legacy agent imports removed in Phase B1 when the seven-stage pipeline was retired." There is no record of when Phase B1 was, why the pipeline was retired, or who made the decision. Four unused prompt files still exist in the repository: evidence_agent/v1.0.0.txt, composition_planner/v1.0.0.txt, and two more. Three database tables exist in the schema that have never had a row inserted: composition_plans, plan_critique_outputs, document_critique_outputs. The spec still describes them as active.

The pipeline that had never run

On the first full production run, after the codebase was declared ready, the session found eleven bugs in a single pass.

Not edge cases. Not configuration drift. Structural failures. Column names in the code that did not match the database schema, across four separate source files independently. The code that read salary data from Reed had been writing to salary_extracted_min and salary_extracted_max. The schema uses salary_value and salary_type. The code for Adzuna had the same mismatch. The RSS parser had the same mismatch. The LinkedIn importer had the same mismatch. Four files, same error, never caught, because the pipeline had never actually run end-to-end in production before that session.

The deduplication module was calling jaro_winkler on an import that exported it as jaroWinkler. One character of case difference. The dedup function had never fired because the upstream data had never reached it. The status flags service was checking created_at; the schema uses discovered_at. The weekly review agent was writing to agent_name and severity; the correct column names are flag_detail and revision_outcome. The health check endpoint listed resend as a required service and was returning 503 when Resend was not configured, which was always.

Before those eleven bugs were fixed, the pipeline had never run in production. Not in testing, not in staging, not once. The code was written, reviewed, committed, and declared production-ready. The pipeline did not exist in any operational sense until the bugs were found and corrected.

The pipeline was fiction that had the correct shape. It described what the system would do, in the right sequence, with the right components. It had never actually done it.

After the bugs were fixed, Reed returned 200 jobs fetched, 157 inserted. Adzuna returned 115 fetched, 107 inserted. Deduplication found 37 duplicates and flagged 92 borderline cases. The pipeline worked correctly on the first clean run. The code was right. The schema was right. The connection between them had never been tested.

Why eight agents became two

There is no decision register entry for the pivot. I can reconstruct a plausible rationale.

Every LLM call in a pipeline is a failure surface. It is latency, cost, and a point where the output can be wrong in ways that are hard to detect automatically. The Plan Critique Agent was designed to catch quality problems in the composition plan before they reached the Document Writer. That is a real problem. Automated critique loops are a real solution to it. They are also expensive to maintain: you need golden files, you need to tune the critique prompt, you need to validate that the Haiku pass agrees with the Sonnet pass at the rate the spec requires before you can cut over.

A human reviewing the composition plan before the Writer runs is faster. It catches the same quality problems. It catches additional problems the model cannot catch: strategic framing issues, positioning choices, gaps the evidence bank does not cover. It requires no golden files. It requires no tuning. It does not degrade as the job market shifts or as prompts drift.

The Evidence Agent's Haiku pre-pass was designed to mark obvious strong matches quickly and pass only ambiguous cases to Sonnet. In production, the Analysis Agent does both in one Sonnet call. Whether that trades some cost efficiency for simplicity is a legitimate question. What is not a question is which system is easier to debug, maintain, and explain.

The same reasoning applies to the Composition Planner and the Document Writer. In the spec, these are separate agents with separate prompts, separate versioning, and separate failure modes. Separating them creates clean boundaries. It also creates two prompts to maintain, two timeouts to configure, and two points where the handoff can fail. In production, one Analysis Agent produces the plan and one Counsel Agent writes the prose. The boundary between planning and writing is held by a human, not by an agent-to-agent handoff.

The differential CV rendering from the original spec did survive. The Clerk does not regenerate static sections. It applies a structured diff against a template, updating only what the composition plan calls out. That is roughly sixty percent token savings on every CV generation compared to full regeneration. The cost logic held. The agent logic did not.

What the evidence bank taught me

The evidence bank has gone through three structural migrations since the spec was written. It started as a flat list of evidence items: thirty items, manually sourced, each with a claim, a metric, and a verification state. The spec described four layers: automated parse, user confirmation, in-pipeline enrichment, structured interview. That architecture was right.

What the spec did not anticipate was the engagement hierarchy. Evidence items in the spec were standalone: a claim, a context field, a source. In production, evidence items belong to engagements, which belong to roles. A bullet about budget management means something different depending on whether it comes from a £350k AI engagement recovery or a £1.2m voice programme stabilisation. The same claim, in different contexts, is a different piece of evidence. The flat list could not represent that distinction.

Three migrations added engagement_id, bullet_themes, is_anchor, and bullet_legacy_id to the evidence items table. A new table, engagement_blocks, introduced the role-to-engagement hierarchy. None of these are in the schema document. Four columns and one table exist in the production database that the authoritative schema document does not know about.

The spec was never updated.

What this means for AI programme management

The interesting thing about Aplio is not that it diverged from the spec. Every product diverges from the spec. The interesting thing is where it diverged, and what the divergences reveal about the specific failure modes of AI system design.

The spec overestimated what agent-to-agent handoffs could reliably do. A pipeline where Agent A's output becomes Agent B's input assumes that Agent A's output is structured, consistent, and legible to Agent B under all conditions. In practice, LLM outputs are none of those things unconditionally. They are structured and consistent most of the time, and wrong in surprising ways some of the time. The more agents you chain, the more opportunities for that variability to compound. Two agents give you two failure surfaces. Eight agents give you eight.

The spec also underestimated what humans in the loop could do. The Brief Review is not a compromise forced by the failure of automated critique. It is architecturally superior to the Plan Critique Agent for this specific task, because the person reviewing the plan has context the model cannot have: they know what this particular employer actually cares about, what gaps in the evidence bank are worth flagging, what framing is going to land with this hiring manager. That judgment is not extractable into a prompt. The spec treated human review as a fallback. The production system treats it as the primary quality gate.

The eleven-bug session teaches a different lesson. The pipeline existed on paper. The column names were wrong. The tests had not run. None of that was visible until the first production run. The lesson is not that you should write better specs or test more thoroughly, though both are true. The lesson is that a system that has never run in production does not exist in any meaningful sense, regardless of what the spec says. The spec is not the system. The running code is the system.

Diagnostic questions before your next AI agent

Is this replacing a human judgment?

Every agent in a pipeline is a bet that a model call is better than a human decision at this step. Before adding an agent, name the human judgment it replaces. If you cannot name it, the agent does not have a clear job.

Has this pipeline ever run end-to-end?

Not in testing. Not with mock data. Against the real schema, with real inputs, producing real outputs. A pipeline that has not done this does not exist operationally. The column names might be wrong. The export names might be wrong. You do not know until it runs.

What happens when this agent's output is wrong?

Every LLM call will occasionally produce wrong output. In a chained pipeline, wrong output from Agent A becomes the input for Agent B. Name the failure mode, and name where a human would catch it before it damages the output the user sees.

Is your spec describing what you built or what you planned?

They diverge faster than you expect. The divergence is not the problem. The problem is when the spec is still being treated as the authority after the code has moved on. Someone will eventually ask what system they are building on. Make sure the answer is the current one.

What is deterministic in this pipeline?

Not everything in an AI pipeline needs to be an AI call. The Clerk is code. The Compiler is code. Anything with a clear, testable specification should be code. LLM calls should be reserved for the tasks where natural language understanding is the thing you are buying. Validation, structuring, lookup, and formatting are not those tasks.

The spec for Aplio is still accurate in many places. The evidence constraint held: the Clerk only uses evidence items named in the approved plan. The approved phrasing requirement held: bullets come from verified text, not inference. The cost gate held: every Research Agent call is checked against the weekly spend before it fires. The differential rendering held. These were right at spec time and they are right now.

What did not hold was the agent architecture. Eight agents became two. A system I had not designed, with a pivot I had not documented, running on a pipeline that had never run before the day I found eleven bugs in it.

The spec was never updated. The system works.