The Tests That Passed Without Running
Six end-to-end test files. Committed, reviewed, declared done. Five of six had wrong route paths, transcribed from acceptance criteria text rather than from the codebase. All six skipped cleanly under their declared pre-conditions. The CI run said "all tests pass." Fourteen days later, when the tests actually ran against a real application for the first time, five of six failed immediately. The failures were not edge cases. They were foundational. The route paths had never existed.
Pmly is a professional governance system for PM contractors. It tracks engagements, surfaces RAID entries, calculates earned value, generates narrative reports, and has seven AI agents: an Assistant, a Challenger, a Coach, a Report Drafter, a Budget Narrator, a Session Closer, and an Engagement Intelligence Agent that primes all the others with domain knowledge about the sector and client type before any of them speak.
Before the system was built, there was a 122-task plan and a 40-table database schema. The plan ran to 186 estimated sessions. Eight phases. Most of it was built in a compressed window. The session log for T-001 through T-075 is reconstructed from git history with minimal detail. By the time the first session log entry was written with enough detail to be useful, the foundation, data layer, authentication, all core entities, all governance entities, and the start of agent infrastructure were already committed.
The end-to-end tests came in at T-108, near the end of the build.
What the session log says
The T-108 session log entry includes this note:
"Without seeded fixtures or a running app, the specs cannot be executed end-to-end in this session. The acceptance criterion ('All E2E tests pass') is satisfied in the trivial sense: every test currently skips per its declared pre-conditions, none fail."
That sentence is the whole story. The acceptance criterion was "all E2E tests pass." All E2E tests passed. The criterion was satisfied. The fact that they passed by skipping rather than by running was noted in the session log and not treated as a blocking issue.
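One way to stop a skip-only run from satisfying "all tests pass" is to gate CI on how many tests actually executed. The sketch below assumes a run summary with passed, failed, and skipped counts. The RunSummary shape, the evaluateRun name, and the minExecuted threshold are all hypothetical, loosely modelled on the counters a test reporter emits, and are not taken from Pmly's code.

```typescript
// Hypothetical run summary; real reporters expose similar counters
// under their own field names.
interface RunSummary {
  passed: number;
  failed: number;
  skipped: number;
}

type Verdict = { ok: true } | { ok: false; reason: string };

// Treats a run as green only if something executed and nothing failed.
// minExecuted is an assumed policy knob, not a test-runner option.
function evaluateRun(summary: RunSummary, minExecuted = 1): Verdict {
  const executed = summary.passed + summary.failed;
  if (summary.failed > 0) {
    return { ok: false, reason: `${summary.failed} test(s) failed` };
  }
  if (executed < minExecuted) {
    return {
      ok: false,
      reason: `only ${executed} test(s) executed, ${summary.skipped} skipped`,
    };
  }
  return { ok: true };
}

// A run shaped like T-108: nothing ran, nothing failed (counts are illustrative).
const skipOnlyRun = evaluateRun({ passed: 0, failed: 0, skipped: 6 });
console.log(skipOnlyRun);
```

With a gate like this, the T-108 run would have been reported as not yet verified rather than as passing.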
The six Playwright spec files were written by reading the PRD acceptance criteria and transcribing what the system was supposed to do into test assertions. Not by running grep on the codebase. Not by checking what URL the route actually lived at. Not by reading the route file. By reading the PRD, which describes what the system does, not where it lives.
The paths that were wrong:
GET /api/health: the canonical path is /api/v1/health. The v1 namespace is the standard for every route in the system.
GET /api/v1/raid?engagement_id=: the canonical path is /api/v1/raid-entries?projectId=. Wrong resource name, wrong query key.
POST /api/v1/engagements/[id]/intelligence/regenerate: the canonical path is POST .../intelligence/gaps. The word "regenerate" appears in the PRD description of the capability. The route file uses "gaps".
GET /api/v1/portal/projects/[id]/reports: the canonical path is /api/v1/portal/reports. The project ID is a query parameter, not a path segment.
GET /api/v1/portal/engagements/[id]: this route does not exist in V1 scope at all. The PRD describes the capability; the V1 scope document excludes it.
POST /api/v1/agents/orchestrator: there is no chat orchestrator in the V1 architecture. The test was asserting a path for an architectural component the system does not have.
All six skipped. None failed. The test runner reported success.
The infrastructure session
Two weeks later, T-118 built the actual E2E infrastructure: a seeded test database, CI workflow, Playwright configuration, fixture factories. When the tests ran for the first time against a real application, the following required immediate attention:
The pg_cron extension uses bigint for job IDs. The unschedule call in the test teardown was passing text. Type mismatch, teardown fails, test isolation breaks.
The pg_cron extension was not enabled in the local Supabase Docker image by default. Tests that depended on scheduled job infrastructure were running against a database without the scheduler.
The CI workflow included supabase db reset before each test run. This left PostgREST with a stale schema cache. API calls made immediately after reset were returning schema errors from the previous state.
The JWT signing key was hardcoded in the test configuration. The actual key was derived dynamically from supabase status. The hardcoded key was wrong, which meant every test that required authentication was failing with invalid token errors.
Counting those four, the session needed six infrastructure fixes in all. Each one was a separate problem, and each one was discovered only when the tests ran for the first time. The E2E infrastructure session required six follow-up pull requests before CI was stable.
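The hardcoded JWT key is the kind of drift that goes away if test setup reads the value from the running stack on every run. A minimal sketch, assuming the CLI output has already been captured as text in KEY="value" form: the output format, the JWT_SECRET variable name, and both function names are assumptions for illustration, not confirmed details of Pmly's setup.

```typescript
// Parses KEY="value" lines (the shape assumed here for the captured
// `supabase status` output) into a plain lookup object.
function parseEnvOutput(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const line of text.split(/\r?\n/)) {
    const match = /^([A-Z0-9_]+)=(.*)$/.exec(line.trim());
    if (!match) continue; // skip banners, blank lines, anything not KEY=value
    const key = match[1];
    const raw = match[2];
    // Strip one pair of surrounding double quotes if present.
    const quoted = /^"(.*)"$/.exec(raw);
    vars[key] = quoted ? quoted[1] : raw;
  }
  return vars;
}

// Fails loudly if the variable is absent, so a stale fallback is never used.
function requireVar(vars: Record<string, string>, name: string): string {
  const value = vars[name];
  if (value === undefined || value === "") {
    throw new Error(`${name} not found in CLI output`);
  }
  return value;
}

// Illustrative captured output; the values are placeholders.
const captured = 'API_URL="http://127.0.0.1:54321"\nJWT_SECRET="placeholder-secret"\n';
const jwtSecret = requireVar(parseEnvOutput(captured), "JWT_SECRET");
```

Throwing is appropriate here, unlike in the Redis client discussed in the next section, because test setup cannot proceed without the key.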
The eighteen tests that Redis broke
There was a third category of failure, separate from wrong paths and infrastructure issues. Eighteen tests were failing with 500 errors. Not because the routes were wrong. Not because the fixtures were missing. Because the Redis client was throwing an exception on startup when Redis credentials were not configured.
The Redis client was written to throw on missing credentials. The rationale at the time was explicit: fail loudly rather than silently. The four callers of the Redis client each had null-handling code, ready to degrade gracefully on a null return. The client never returned null. It threw instead.
In the test environment, Redis credentials were not configured. Every request that touched any code path that initialised the Redis client was generating a 500 before it reached the route handler. The test was testing a Redis initialisation error, not the route it was supposed to test. The test had no way to know that.
The fix was one line: return null instead of throwing when credentials are missing. The four callers already handled null correctly. The throwing behaviour had been the bug all along. It had been present since the Redis client was written, invisible because tests had not run against an unconfigured environment, and it surfaced only when CI executed for the first time.
One line. Eighteen tests unblocked.
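The shape of that fix can be sketched as follows. Everything named here is hypothetical: RedisLike stands in for whichever Redis SDK is in use, the REDIS_URL and REDIS_TOKEN variable names are assumed, and get is synchronous only to keep the sketch short.

```typescript
// Hypothetical minimal client surface; a real client would be asynchronous.
interface RedisLike {
  get(key: string): string | null;
}

// Assumed credential names, for illustration only.
interface RedisEnv {
  REDIS_URL?: string;
  REDIS_TOKEN?: string;
}

// Injected factory so the sketch does not depend on a real SDK.
type Connect = (url: string, token: string) => RedisLike;

function getRedis(env: RedisEnv, connect: Connect): RedisLike | null {
  if (!env.REDIS_URL || !env.REDIS_TOKEN) {
    // Previously: throw new Error("Redis credentials missing");
    return null;
  }
  return connect(env.REDIS_URL, env.REDIS_TOKEN);
}

// A caller that already handles null: "no cache available" reads as a cache miss.
function readCached(redis: RedisLike | null, key: string): string | null {
  return redis ? redis.get(key) : null;
}
```

With no credentials set, getRedis returns null, readCached returns null, and the request carries on to the route handler instead of failing with a 500.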
What this pattern looks like in enterprise programmes
The E2E test story is a contained, clean version of something that happens at much larger scale in enterprise AI programmes.
The enterprise equivalent is the UAT sign-off based on demonstrations rather than actual user testing. A programme runs a series of structured demonstrations. The system performs the scenarios correctly. UAT is signed off. Go-live is approved. The first week of live operation reveals that the scenarios the demonstrations covered were not the scenarios actual users encounter. The paths the users take through the system were not the paths that were demonstrated. The system works for the demonstration paths. It fails for the user paths.
The mechanism is identical to the test story: acceptance was granted based on evidence of the form "the system passed the test" without verifying that the test was testing the right thing. The test was right in the trivial sense. It was wrong in the operational sense.
The second enterprise equivalent is the integration test that passes in an isolated environment and fails in production because the production environment has different configuration, different data shapes, or different dependency versions than the test environment. The Redis story is the same mechanism seen from the other side: a client that behaves correctly when configured and incorrectly when not, where "configured" is the only state anyone had exercised and "not configured" is the state the test environment was actually in.
The third equivalent is the security scan that passes on the code repository and misses the vulnerabilities in the deployed configuration. The code is clean. The deployment is not. The scan tested the wrong thing.
In all three cases, the mechanism is the same: evidence of the form "the check passed" is taken as evidence of the form "the thing works," and the question of whether the check was checking the right thing is not asked.
The Inngest dispatch that wasn't
The test story has a companion in Pmly's build history that teaches the same lesson from a different angle.
The intelligence regeneration route had existed since T-085b. It received a request to regenerate the engagement intelligence package, returned { queued: true, reason }, and fired a PostHog analytics event for intelligence_regenerated. The analytics event was real. The response was real. The actual Inngest dispatch, the thing that would have triggered the Engagement Intelligence Agent to run, was not there. The route had a comment: "Inngest job (engagement_intelligence) handles the actual re-run." No dispatch call.
This was not caught by any test. It was found during an E2E route audit. The audit was inspecting the route body as part of a separate investigation into a wrong path and noticed the mismatch: the comment says Inngest handles this, and there is no Inngest call.
The route returned 200. The PostHog event fired. The system appeared to work. The intelligence package did not regenerate. Nobody who called the route knew this.
The fix required three separate pull requests. First: the spec had to be updated to document the manual regeneration input contract (it listed four trigger paths, had input contracts for three of them, and assumed the fourth without documenting it). Second: the agent code needed a new input variant to handle a manual trigger without the change_request_summary field the other variants required. Third: the route needed an actual Inngest dispatch call and a real 502 on failure.
Two existing tests broke when the route started actually dispatching. Their mocks had been written for the no-op behaviour: they expected { queued: true } and never expected an Inngest call. The mocks were updated. The behaviour was now correct and tested.
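A rough sketch of the corrected route logic, under stated assumptions: the sender and the analytics hook are injected as plain functions so the example runs without either SDK, and the event name, the payload fields, and the regenerateIntelligence name are hypothetical. With the Inngest SDK, the sender role would be played by the client's send({ name, data }) method.

```typescript
// Hypothetical event sender, injected so the sketch needs no SDK.
type SendEvent = (event: {
  name: string;
  data: Record<string, unknown>;
}) => Promise<void>;

// Hypothetical analytics hook standing in for the PostHog capture call.
type Track = (eventName: string, props: Record<string, unknown>) => void;

interface RouteResult {
  status: number;
  body: Record<string, unknown>;
}

async function regenerateIntelligence(
  engagementId: string,
  send: SendEvent,
  track: Track,
): Promise<RouteResult> {
  try {
    // The step the original route described in a comment but never performed.
    await send({
      name: "engagement_intelligence", // assumed name, borrowed from the comment
      data: { engagementId, trigger: "manual" },
    });
  } catch (err) {
    // Dispatch failed: report it, and do not record a regeneration.
    return { status: 502, body: { queued: false, reason: "dispatch_failed" } };
  }
  // Track only after the dispatch was accepted.
  track("intelligence_regenerated", { engagementId });
  return { status: 200, body: { queued: true, reason: "manual" } };
}
```

Ordering the analytics call after the dispatch means the event can no longer fire for a regeneration that was never queued.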
What to check before declaring tests complete
Not "do the tests pass." Have they executed against a real application with real data? A test that skips is not a test. A test that mocks everything is testing the mocks. The question is whether the integration between your code and the real dependencies has been exercised.
Acceptance criteria describe what a system does. They do not specify URL shapes. Before asserting a path in a test, run grep -r "route" src/app/api/ and confirm the path exists. PRD text is not a reliable source for route paths.
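The same check can be automated. The sketch below assumes a file-system routing layout in which a file such as src/app/api/v1/health/route.ts serves /api/v1/health and a [name] directory is a dynamic segment. That layout, the file names, and the function names are assumptions for illustration. It compares paths only, not HTTP methods or query keys.

```typescript
// Converts a route file path to the URL pattern it is assumed to serve,
// e.g. "src/app/api/v1/health/route.ts" -> "/api/v1/health".
function routeFileToPattern(file: string, appRoot = "src/app"): string {
  if (file.indexOf(appRoot) !== 0) {
    throw new Error(`${file} is not under ${appRoot}`);
  }
  return file.slice(appRoot.length).replace(/\/route\.[a-z]+$/, "");
}

// True if a concrete path fits a pattern; [name] segments match any
// non-empty segment. The query string is ignored.
function matchesPattern(pattern: string, path: string): boolean {
  const want = pattern.split("/");
  const got = path.split("?")[0].split("/");
  if (want.length !== got.length) return false;
  return want.every((seg, i) =>
    /^\[.+\]$/.test(seg) ? got[i].length > 0 : seg === got[i],
  );
}

// Returns the asserted paths that no route file accounts for.
function findMissingPaths(asserted: string[], routeFiles: string[]): string[] {
  const patterns = routeFiles.map((f) => routeFileToPattern(f));
  return asserted.filter(
    (path) => !patterns.some((pattern) => matchesPattern(pattern, path)),
  );
}

// Hypothetical file listing; the asserted paths include two from the T-108 specs.
const routeFiles = [
  "src/app/api/v1/health/route.ts",
  "src/app/api/v1/raid-entries/route.ts",
];
const missing = findMissingPaths(
  ["/api/health", "/api/v1/raid?engagement_id=1", "/api/v1/health"],
  routeFiles,
);
// missing is ["/api/health", "/api/v1/raid?engagement_id=1"]
```

Run before any assertion is written, a check like this surfaces the gap between the PRD's wording and the route files while it is still cheap to fix.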
Production environments and test environments are configured differently. Test against the unconfigured state. A client that throws when a credential is missing is a client that will cause unexpected 500s in any environment where that credential is absent. Fail-open with null handling is usually the right default for optional infrastructure dependencies.
Comments describe intent. Code describes behaviour. When a comment says "Inngest job handles the actual re-run," verify that there is an Inngest dispatch call in the route. Do not assume the comment is current. Read the code.
A PostHog event that fires on a no-op is worse than no event: it creates confidence in an operation that is not happening. If you are tracking an event called intelligence_regenerated, confirm that intelligence is being regenerated when that event fires. Instrument the outcome, not the intent.
The session log entry for T-108 is honest about what happened. "The acceptance criterion is satisfied in the trivial sense." That honesty is more useful than the false confidence that would have come from deleting that sentence. The tests passed. They had not run. The entry said so.
The rule that was added to CLAUDE.md after the route audit is blunt: when authoring tests that assert against API paths, verify each path against the actual route file before writing the assertion. Acceptance criteria describe behaviour, not URL shapes. If the route does not exist at the asserted path, surface the gap. Do not invent the path.
That rule now applies at the start of every test session. It exists because five of six test files got the paths wrong in the same way, in the same session, for the same reason. The rule is short. The path to needing it was fourteen days and a CI run that found five wrong routes, six infrastructure failures, and eighteen tests blocked by one line of Redis client code.
The tests that passed without running eventually ran. When they did, they found real failures. That is the function of tests.