Can Claude Really Build an App with No Engineer Required?

Since the rise of LLMs, "vibe coded" applications have gone from a novelty to a genuine talking point in developer circles. The promise is seductive: describe what you want, and an AI builds it for you. No engineering degree required.

Claude, Anthropic's flagship LLM, has been one of the loudest names in this space, frequently praised as a stronger coder than its competitors. But does it actually deliver? And more importantly, can it do so safely?

I've been sceptical of AI-assisted development for a while. Here's how I went from sceptic to cautious convert — and what I learned building a real application along the way.

Why I Resisted AI Coding Tools

My adoption of AI tooling has been slow and deliberate. For a long time, my only concession was GitHub Copilot for code completion — and even then, I thought of it less as "AI writing my code" and more as an extremely over-engineered autocomplete. Sometimes it would suggest an entire class, but its real value for me was eliminating boilerplate. I was still driving.

The idea of letting an AI write substantive code felt risky. What if it introduced a security flaw I hadn't noticed? What if it abstracted things in ways that were idiomatic to a model but counterintuitive to a human reading the code later? I'd still need to audit every line, at which point I may as well have written it myself — at least then I'd have confidence in what was there.

There was another pattern I kept noticing too. Every time someone showed me an AI-built project, it was always Node.js and React — with Tailwind, naturally. Regardless of the requirements, if it was web-based, those were the choices. That struck me as a bias baked into the tools rather than a considered technical decision. Why always that stack? Why not Laravel, which has a rich ecosystem and scales remarkably well for developers without a strong DevOps background?

Using AI as a Rubber Duck

My resistance softened when I started using ChatGPT not as a code generator, but as a sounding board. I'd describe a problem, ask it to poke holes in my approach, or flip the dynamic entirely and ask it for ideas I could challenge. This helped me catch footguns early — before a line of code was written.

But even when ChatGPT suggested a schema or an architectural pattern I liked, I still wrote the implementation myself. The reasoning was simple: if I've written it, I own it. I've had to think through every decision, which means I can understand it, reason about it, and spot when something is wrong.

Testing the Waters with Codex

My workload has grown considerably as I've taken on more Voice AI deployments. A common need that comes with this work is building supporting tools — typically lightweight web applications that expose APIs an LLM can call to retrieve or push data. Think middleware for CRM integrations, bespoke data surfaces, or apps that control the flow of AI interactions.

These tools take time to build, and in some cases they're demoware, built purely to illustrate what a bot can do when talking to another system. We use n8n for common, repeatable workflows, but when something genuinely bespoke is needed, I typically reach for a small Laravel application.

As the volume of these mini-apps crept up, I decided to try Codex for something low-stakes: small, well-scoped changes to existing apps that weren't hard, just time-consuming. I was using PhpStorm at the time, which has a built-in AI chat facility that lets you pick your LLM, give instructions, and review diffs before accepting changes. One OAuth connection to OpenAI later, I was ready.

The results were genuinely impressive. The changes were accurate, fast, and the model even added unit tests to cover the new behaviour. It saved me real time on tasks I'd otherwise have ground through manually.

My First "Vibe Coded" Application

Encouraged, I decided to use AI to build something I'd been wanting to build for months but never found time for: a lightweight project management tool. I'd already started it — enough to get excited about using Nuxt for the first time — but had stalled.

I scaffolded a fresh Laravel application and started prompting. I came in with a detailed brief:

The intended tech stack (Vue frontend, Laravel backend)
Branding guidelines and a reference website for styling
A pre-built database schema with seed data to illustrate expected data shapes
A feature-by-feature breakdown of the application's purpose
An end-to-end description of how users would interact with it
RBAC requirements and how permissions should surface in the UI
The multi-tenancy approach, including global scopes to prevent data leaking between tenants
How policies should be structured and how new roles should be composed from the permission set
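
To make the multi-tenancy requirement concrete, here is a minimal sketch of the idea behind a global scope, written in Python for brevity rather than in Laravel. It is not the code from the project: the Project record, the TenantScopedProjects class, and the sample rows are all hypothetical. The point is only that every read goes through a filter bound to one tenant, so a caller cannot forget to apply it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Project:
    # Hypothetical record shape, for illustration only.
    id: int
    tenant_id: int
    name: str


class TenantScopedProjects:
    """Read access to projects that is always filtered to one tenant."""

    def __init__(self, rows: List[Project], tenant_id: int) -> None:
        self._rows = rows
        self._tenant_id = tenant_id

    def all(self) -> List[Project]:
        # The tenant filter lives here, not in each caller.
        return [row for row in self._rows if row.tenant_id == self._tenant_id]

    def find(self, project_id: int) -> Optional[Project]:
        # Lookups by id also go through the scoped list, so guessing
        # another tenant's id returns nothing.
        for row in self.all():
            if row.id == project_id:
                return row
        return None


# Hypothetical sample data: two tenants, one project each.
ROWS = [
    Project(1, tenant_id=10, name="Website relaunch"),
    Project(2, tenant_id=20, name="Internal wiki"),
]
```

With TenantScopedProjects(ROWS, tenant_id=10), calling find(2) returns None even though project 2 exists, because it belongs to tenant 20.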

With that context established, I asked the AI to start with the front end — just scaffold the look and feel with placeholder content. No real data, not yet.

The result arrived quickly, and it looked good. The styling followed the branding I'd described, and placeholders were in place for every feature I'd specified. For boilerplate, it was hard to fault.

From there, connecting the front end to the backend was a matter of further prompting. Tables, charts, and forms all populated with real data from the API — and the forms even worked. The visible surface was promising.

The backend was a different story. The code was over-abstracted in ways that felt more generated than considered: controllers riddled with private methods used exactly once, logic spread across layers with no clear rationale. Everything worked, but the code would have been painful to maintain without continued AI assistance, and that dependence made it brittle.

More prompting fixed most of it. I pushed back on the patterns, described how I'd normally structure things, and the AI adjusted. The final output was much closer to what I'd have written myself — but it had taken significant effort to get there.

Was it worth it? The application was ultimately a proof of concept, and it proved the concept. But the time spent reviewing, correcting, and re-prompting made the productivity gains less obvious than I'd hoped.

Enter Claude: The Alleged Superior Coding Tool

Several colleagues and friends had independently landed on Claude as their preferred LLM for development work, each describing it as a step above ChatGPT for coding tasks. That was enough to prompt a second experiment — this time with a project I actually intended to use.

The project was SimpleSockets: a tool to make it easy to spin up a Reverb WebSocket server for smaller projects without the usual configuration overhead. It was something I genuinely needed, something with real production requirements, and — crucially — something I could evaluate objectively because I knew exactly what it should do.

The approach this time was different. Rather than starting with a scaffolded application and a comprehensive brief, I began with an empty folder and a much shorter prompt. I specified the stack I was comfortable with and the features I needed. Beyond that, I stepped back and let Claude make its own decisions.

This smaller prompt was deliberate. I wanted to emulate the kind of prompt a non-engineer might write — someone with a clear idea but no strong opinions on implementation. My real question was about safety and guardrails: without being explicitly told to, would the AI consider cross-tenant data isolation? Would it handle subscription billing edge cases — like blocking usage when a plan limit is hit, or preventing a downgrade when a user is still over their new tier's limits? I wanted to see how far it would go unprompted.

Unlike Codex, which integrates into PhpStorm's AI chat panel, Claude Code is command-line driven. Fortunately, PhpStorm's inline terminal made the experience comparable: a conversational interface with output from the model, and the ability to interrupt and redirect mid-task. You can watch what it's doing and nudge it if you don't like where it's heading. I was keen to hold back on that as long as possible, though.

Much like my Codex experience, the UI Claude generated was immediately impressive. Functionality worked out of the gate, dependencies were pulled in without prompting, and very quickly I had the makings of something that looked production-ready. The velocity was striking.

But I was about to learn that looks can be deceiving.

Looks Can Be Deceiving

To get an objective read on the code, I asked ChatGPT — which had access to the GitHub repository — to audit the project and open ten issues covering areas that needed work.

The findings were illuminating. The early issues were relatively minor: README gaps, missing documentation. But they quickly escalated into substantive problems. WebSocket connections lacked proper security controls. Subscription billing limits had been implemented in the UI — the controls were visible — but the backend never actually enforced them. Perhaps most tellingly, there were no automated tests anywhere in the codebase. I hadn't asked for them, which is fair, but a non-engineer almost certainly wouldn't ask either. The illusion of completeness had masked a significant amount of missing work.
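
The billing finding is easiest to see in code. The sketch below, in Python rather than the project's own stack, shows what enforcing a limit on the server might look like. The plan names, the limits, and the open_connection function are assumptions made up for illustration, not anything taken from SimpleSockets. Whatever the UI chooses to show or hide, the check runs at the point where the resource is actually consumed.

```python
from dataclasses import dataclass

# Hypothetical plan names and limits, purely for illustration.
PLAN_CONNECTION_LIMITS = {"free": 2, "pro": 100}


class PlanLimitExceeded(Exception):
    """Raised when an account tries to exceed its plan's connection limit."""


@dataclass
class Account:
    plan: str
    open_connections: int = 0


def open_connection(account: Account) -> int:
    """Register a new connection, enforcing the plan limit server-side."""
    limit = PLAN_CONNECTION_LIMITS[account.plan]
    if account.open_connections >= limit:
        # Refuse here, regardless of what the front end allowed.
        raise PlanLimitExceeded(
            f"{account.plan} plan allows {limit} concurrent connections"
        )
    account.open_connections += 1
    return account.open_connections
```

A disabled button in the UI is a courtesy to the user; a check like this is the actual control.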

It became clear that to take this project further, I'd need to provide a lot more direction — and I'd need to go looking for the gaps myself, because the AI wasn't going to surface them unprompted.

Providing the Direction AI Can't

I started methodically testing edge cases — the kinds of scenarios I'd expect to break things — and many of them did, either because the code was buggy and untested, or because the functionality simply didn't exist yet. The pattern was consistent: UI came first, correctness came second.

The issues ranged from minor to genuinely problematic. Downgrading a subscription didn't check whether the user was still within the lower tier's limits — so a user could consume a higher tier's quota and then downgrade for a refund. Cancellation blocked access immediately rather than at the end of the billing period. Each of these required me to describe the exact scenario, explain the expected behaviour, and prompt Claude to fix it.
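
Both rules are simple once they are written down, which is part of what made their absence frustrating. Here is a hedged sketch in Python of the behaviour I was describing to Claude. The tier names, limits, and field names are hypothetical. It also assumes that period_ends_at marks the end of the period already paid for, and that renewals for active subscriptions are handled elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical tiers: maximum number of apps allowed on each.
TIER_LIMITS = {"starter": 5, "team": 50}


@dataclass
class Subscription:
    tier: str
    apps_in_use: int
    period_ends_at: datetime  # end of the period already paid for
    cancelled_at: Optional[datetime] = None


def can_downgrade(sub: Subscription, new_tier: str) -> bool:
    """Allow a downgrade only if current usage fits the new tier."""
    return sub.apps_in_use <= TIER_LIMITS[new_tier]


def has_access(sub: Subscription, now: datetime) -> bool:
    """Cancelled subscriptions keep access until the paid period ends."""
    if sub.cancelled_at is None:
        return True  # active; renewal is assumed to be handled elsewhere
    return now < sub.period_ends_at


# Example: a subscription cancelled mid-period.
example = Subscription(
    tier="team",
    apps_in_use=12,
    period_ends_at=datetime(2025, 1, 31, tzinfo=timezone.utc),
    cancelled_at=datetime(2025, 1, 10, tzinfo=timezone.utc),
)
```

In this example the account has 12 apps, so can_downgrade(example, "starter") is False, and has_access stays True for any time before 31 January despite the cancellation on the 10th.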

As we worked through the backlog, some bugs kept resurfacing in different forms. Others refused to be resolved at all, despite multiple rounds of prompting. The larger the project grew, the more the cracks showed — and the harder it became to close them.

Project Size Is Everything

I've seen plenty of adverts from people claiming to have built their million-dollar app idea with AI, no engineering background required. I have no doubt some of those stories are genuine — but I suspect the apps in question are very small, very tightly scoped, or both.

My honest assessment: unless an application is genuinely simple, a non-engineer cannot safely build a production-ready product with these tools alone. The bigger the scope, the larger the surface area for failure, and the less likely the AI is to cover all of it unprompted. My own app had authentication built in from the start — it was in the original brief — yet at no point did Claude add a password reset flow, or the ability to update a name or email address. Two-factor authentication defeated it entirely; it never got that working reliably. In several cases, I had to diagnose the root cause of a bug myself and tell the AI what was wrong before it could fix it.

Without technical oversight, projects like this will hit a ceiling — and bringing in an engineer to untangle AI-generated code after the fact can be difficult and expensive.

Using One AI to Audit Another

There's a more interesting question lurking here, though. I've already described using ChatGPT to audit Claude's output — and it worked. Claude was able to read the issues raised and resolve them. So rather than relying on a human engineer for oversight, what if you used a dedicated auditing AI instead?

In an ideal setup, you might deploy several specialised agents, each with a narrow focus:

Security auditor — scanning for vulnerabilities, authentication gaps, and data exposure risks
UI auditor — checking for accessibility issues, visual inconsistencies, and broken layouts
UX/functionality auditor — walking through user flows and identifying missing or broken behaviour
QA auditor — running edge cases and regression checks after each round of changes

Each agent would keep the project accountable in its domain, continuously surfacing issues and driving the work forward. This isn't entirely theoretical — my friend Jason over at bayton.org has done exactly this with his flash MDM proof of concept, using multiple AI models to build, review, and improve a mobile device management system. It's a compelling glimpse at what collaborative AI development might look like.

Conclusion

AI coding tools have come a long way, and I'll keep using them. For smaller proofs of concept, laborious one-off tasks, and scaffolding UIs quickly, they're genuinely valuable. Claude in particular produces polished front ends at impressive speed.

But will I trust the backend code they generate for production use? Not without thorough review. The gap between "it looks like it works" and "it is correct, secure, and maintainable" remains wide — and closing that gap still requires an engineer who knows where to look.

The companies that replace their engineering teams with AI tools entirely are, I suspect, building up a debt they don't yet know they owe. It will be paid eventually — in bugs, in security incidents, or in the cost of unpicking code that nobody fully understands. The technology is impressive. It's just not there yet.