Al Karakas
← All essays

Building · AI

25 Things I Learned Building Production Apps as a Vibe Coder

19 min read

A retry loop that drained a day's AI budget in minutes. A missing connection pooler that caused a production outage at 11pm. 25 specific things I learned building two production systems with AI that nobody told me before I started.

80%
of AI projects fail to reach production or deliver business value. The failure modes in vibe-coded apps are the same as in enterprise AI: not the model, the delivery structure around it (RAND Corporation)
$4.44M
average cost of a data breach globally. Vibe-coded apps are not exempt. The AI writes the code you asked for. It does not write the security you forgot to ask for (IBM Security 2025)
25
things I learned from two production systems, eight AI agents, a Chrome extension, and more production incidents than I had planned for. The expensive lessons, so you do not have to pay for them

I have shipped two production systems using Claude and Cursor. Aplio is my personal job search intelligence platform: eight AI agents, a Chrome extension, a Supabase backend, and a differential CV renderer. pmly is a PM contractor operating system: engagement tracking, earned value calculations, AI-assisted risk logs, and a report generation pipeline. Both are in production. Both have real data in them. Both went through multiple cycles of "I thought this worked" followed by "ah, so that's why it broke."

This is what I learned. Not a framework. Just the things that bit me, the things I had to figure out the hard way, and the things I wish someone had told me before I started.

Architecture and backend

The AI gives you working code. What it does not give you, unless you ask very specifically, is architecture that holds up when things go wrong.

Never let the API do the heavy lifting synchronously. This was lesson one from building Aplio's research agent. I had the endpoint do everything inline: receive the request, run the research, format the output, return the result. Works fine in development. In production, with real network latency and Anthropic API variability, the request times out. The user sees an error. They click again.

The right pattern: when a user triggers something expensive, create a job record with a unique ID, return immediately with that job ID, and fire the actual work asynchronously. Show the user a status screen they can poll. Notify them when it is done. This is basic product hygiene for anything that takes more than a second.

If a user clicks twice, make sure the system treats it as one action. The first time I saw Aplio create duplicate research records I spent an hour debugging before I realised the issue was double-submission on slow requests. The answer is idempotency keys: a unique identifier for each request that your job creation logic checks before creating a new record. If a request arrives with the same idempotency key as an existing job, return the existing job. No duplicate. No error. Idempotency on job creation costs you one index lookup. Debugging duplicate state costs you an afternoon.

Links inside your app should open inside your app. This sounds obvious until you ship it and discover that half your users are being sent to a browser tab they did not ask for, losing context, and not coming back. Handle your deep links in the frontend. Define what each link type opens. Add a fallback for when the deep link cannot be resolved: show something useful, not a blank screen or a raw error.

Feature flags on everything. Every feature should be killable in seconds without a deployment. I added a flag system in Aplio after shipping something with a bug I could not reproduce locally. Without a flag, your options are: redeploy, revert, or leave it broken. With a flag, you flip it off in thirty seconds while you investigate. A database table with feature name and enabled boolean, checked at the handler entry point, is sufficient to start.

Gradual deployment when you have real users. When I had ten users on pmly and pushed a breaking change to the report pipeline, all ten hit the bug. With a 10% rollout, one would have. Push to a small percentage, set an error rate threshold, and automate promotion to 100% if the rate stays below threshold for a defined window. Vercel makes this easy on the frontend. The principle scales to the backend.

Shipping is not a single moment. It is a process with checkpoints. Define what "healthy enough to promote" looks like in numbers, and let the system make that call.

Put an API gateway in front of all your AI endpoints. Request validation, per-user spend tracking, rate limiting, and authentication belong at the gateway layer, not scattered across individual handlers. In Aplio, every Anthropic API call goes through a cost gate: a synchronous check against a current-week-cost view that blocks the call if spend exceeds threshold. That logic lives in one place. When I needed to change the threshold, I changed one value.

API keys and secrets go in a dedicated secrets manager. Not in your code. Not in a committed .env file. Not hardcoded "for now." Set up dual key rotation: always have an active and a standby, and automate the rotation so a compromised key is replaced without a manual deployment.

Database and connections

Everything works in development. Development has one connection, low latency, and no concurrent users. Production is different.

Enable your connection pooler. On Supabase this is Supavisor. In a serverless environment, every function invocation opens a new database connection. Under any real traffic you will hit your connection limit and everything fails at once. Supavisor pools connections so serverless functions share a manageable set. It is a configuration change, not a code change. Do it before you have traffic. Use transaction mode, not session mode: in transaction mode connections return to the pool after each transaction, so you need far fewer than you think.

Know your connection limit and set a ceiling below it. Find out the maximum concurrent connections for your database tier. Set your pool max comfortably below it. In session mode, each function invocation holds a connection for its entire lifetime. That is the wrong choice for serverless and will exhaust your pool at any real load.

Performance and caching

Browser caching, CDN caching, application caching, and prompt caching are four different things. Browser caching reduces repeated asset downloads. CDN caching reduces origin server load. Application caching reduces repeated database queries for data that changes rarely. Prompt caching reduces Anthropic API costs by reusing the computation for stable parts of your system prompts. All four are worth implementing. The one vibe coders most often skip is prompt caching. Cache hit rate above 85% on most Aplio agents. The cost difference is real.

Prompt caching mistake to avoid

Rotating content such as timestamps or request IDs buried in the system prompt breaks caching on every call. Stable prefix first, as long as possible. Variable context at the end, in the user message. Measure your hit rate.

Route non-urgent AI workloads to batch endpoints. Anthropic's batch API runs at significantly lower cost than the synchronous API. The Insight Agent in Aplio runs weekly pattern analysis. That is a batch job. Ask yourself, for each AI call: does this need to be real-time? If not, batch it.

Rate limiting on all endpoints, especially AI-backed ones. Without it, one retry loop can consume your daily AI budget in minutes. A client-side retry bug in Aplio triggered a cascade of research agent calls before I caught it. AI-backed endpoints now have tighter per-user-per-minute limits than standard ones.

Security

Security in a vibe-coded app is exactly as strong as the questions you thought to ask. The AI does not volunteer security concerns. It completes the request you made.

Audit always-on resources regularly. I ran a resource audit on Aplio six months in and found three Apify scrapers I had configured during testing and never disabled. Still running. Still logging. Costing money every week and representing a surface area I had forgotten entirely. Set a calendar reminder to audit running infrastructure monthly.

Set spend alerts before you have traffic. Know your expected daily AI cost at normal usage. Set an alert at 150% of that. Set a hard stop at 300%. One misconfigured retry loop or unprotected endpoint can drain a week's budget in an hour. The alerts cost nothing to set.

Run OWASP ZAP and automate it in CI. ZAP is free, open source, and will find things you did not know were there. I ran it against Aplio six months in: fourteen medium findings, three highs. Headers I had not set, endpoints I had not protected, a rate limiting gap I had missed. GitHub has built-in security scanning you can enable in repository settings.

ZAP quick start

Download from zaproxy.org. Run your app locally. Open ZAP, select Automated Scan, enter your local URL, click Attack. When it finishes, open the Alerts tab. Fix the Highs before your next deployment. Address the Mediums in your next sprint.

Test your API in Burp Suite after ZAP. ZAP is for automated breadth. Burp Suite is for manual probing. What ZAP misses, Burp usually finds. Community edition is free.

PR reviewers should check architecture, not just logic. Does this endpoint bypass the auth middleware? Is this reading from the right table? Does this response expose fields it should not? Logic bugs break features. Architecture bugs break security.

AI-specific

The AI writes the code you asked for. It does not write the security you forgot to ask for, the architecture you assumed was standard, or the failure modes you did not think to mention.

Prompt caching is real money. Structure your system prompts so the stable prefix comes first and is as long as possible. Variable context goes at the end, in the user message. Common mistake: rotating content such as timestamps buried in the system prompt, which breaks caching on every call. Measure your hit rate.

Never use AI for deterministic tasks. Validation, calculation, formatting, structural checks: these are code, not LLM calls. In Aplio I have six hard-fail validators on the professional summary output. They run as application code. Fast, free, reproducible, testable. Deterministic tasks belong in deterministic code.

UX and product

Show the user something immediately. If their request is processing asynchronously, tell them. Show a job ID. Let them see status. The worst UX in an AI-powered app is: user clicks, nothing visible happens, they sit wondering whether to click again. That uncertainty drives double-submissions, support messages, and churn.

Use a proper component library for UI quality. The gap between "a developer built this" and "this looks like a product" is usually a design system and some motion. Radix/shadcn components are accessible, composable, and Tailwind-compatible. Framer adds motion. Visual quality signals trustworthiness.

One more thing

There is a version of vibe coding where you ship fast, things mostly work, and you deal with each problem when it arrives. That version is fine for prototypes and experiments. It is not fine for production systems with real data and real users.

The twenty-five things above are not rules. They are the residue of building in the real world. The asynchronous job pattern came from a timeout I could not debug. The idempotency keys came from duplicate records I spent an afternoon cleaning up. The spend alerts came from a retry loop I caught after it had already cost me real money. The connection pooling came from a production outage at 11pm on a Thursday.

None of these lessons required a software engineering degree to implement. They required shipping something, watching it break in a specific way, and understanding what the fix was. That is the vibe coder's learning path. The only difference is whether you learn from your own mistakes or someone else's.

Prefer someone else's.