Who's Watching Your Code?

There is a version of this story I have now encountered more than once. An organisation — often a small one, often without a technical person in the ranks — commissions software from an external agency. The software gets built. It works, in the sense that it loads and does something. The invoices get paid. The software goes live, customers start using it, and with their feedback come feature requests. The requests get scoped, quoted, delivered — and gradually, the invoices get bigger. Requests that sound simple — a new field here, a small workflow change there — come back with price tags that are hard to square with the apparent complexity of the work. The explanations are plausible enough, and because nobody on the client side has the technical fluency to push back meaningfully, they get accepted. The relationship continues. The costs keep climbing.

The cracks rarely appear from the inside. They appear when something forces an outside perspective — often the possibility of changing supplier, because the current one can no longer deliver simple asks, or because other agencies have looked at the codebase and refused to touch it, offering only a full rebuild at a price to match.

This is a story about what tends to be found when that outside perspective finally arrives.

What Gets Found

What a proper technical review tends to reveal is not incompetence in the dramatic, immediately-obvious sense. The application works. Users can log in. Data appears on screen. To anyone without a technical background, it looks like functional software.

Underneath the surface, things are usually different.

API endpoints with no authentication at all: not weakly authenticated, but unauthenticated entirely. A moderately curious person with a REST client and ten minutes to spare can pull personal data directly from the server. The exposure is not theoretical. It is trivially exploitable.

Secrets — database credentials, API keys, tokens granting access to third-party services — committed directly to version control, sitting in the git history permanently, retrievable by anyone who has ever been granted repository access. Rotating them is the first priority when this is found.

Infrastructure provisioned at a scale that bears no relationship to actual traffic. Services running that serve no current purpose. Instances sized for workloads that don't exist. Static assets — images, CSS, JavaScript — being served from EC2 instances with a load balancer in front of them, the CPU utilisation history barely ever crossing 5%. These assets belong on a CDN, where they would cost a fraction of the price and load faster for users into the bargain.

The infrastructure conversation is worth dwelling on, because it comes in two flavours and they are easy to conflate. Sometimes an application genuinely needs more capacity. More often, the capacity is fine — the queries are the problem. Unindexed columns, N+1 patterns that fire a database call for every row in a result set, repeated work that could be cached, operations that block a web request when a queue worker would handle them asynchronously. Recommending a larger instance or a move to managed RDS is a legitimate answer to a legitimate scaling problem. It is a very expensive non-answer when the real problem is queries that could be made orders of magnitude more efficient without touching the infrastructure at all.
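To make the N+1 pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the function names are all hypothetical, invented purely for illustration. Both functions return the same totals; the difference is how many round trips to the database each one makes.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            total REAL
        );
        -- Without an index, each lookup by customer_id scans the whole table.
        CREATE INDEX idx_orders_customer ON orders (customer_id);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Linus")],
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 20.0)],
    )
    return conn


def totals_n_plus_one(conn):
    """The N+1 shape: one query for the list, then one more per row."""
    queries = 1
    totals = {}
    for cid, name in conn.execute("SELECT id, name FROM customers").fetchall():
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
            (cid,),
        ).fetchone()
        queries += 1
        totals[name] = row[0]
    return totals, queries


def totals_single_query(conn):
    """Same result from a single joined, grouped query."""
    rows = conn.execute(
        """
        SELECT c.name, COALESCE(SUM(o.total), 0)
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        """
    ).fetchall()
    return dict(rows), 1
```

With three customers the first version issues four queries and the second issues one. The gap grows linearly with the size of the result set, which is why this kind of fix is worth ruling out before anyone pays for a larger instance.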

The work of addressing all of this, once found, is rarely quick. Architectural debt compounds. What took years to accumulate does not unwind in a sprint.

The Common Thread

When stepping back from the specifics, what is striking is how structurally predictable all of this is.

The agency is incentivised to ship features. The client is incentivised to pay for features shipped. There is no one in the middle asking whether the features being shipped are built well, whether the data is secure, whether the infrastructure makes sense, whether the tests exist. The structural conditions for a bad outcome are present from day one.

But I find myself wanting to push past the structural explanation, because it risks being too generous. When I look at unauthenticated endpoints sitting in front of personal data, credentials committed to a repository, or static assets being served from EC2 instances that exist for no defensible reason, I am genuinely unsure whether what I am looking at is incompetence, laziness, or something more deliberate. All three lead to the same place. The distinction matters only if you are trying to decide whether to feel sympathy.

What I can say is this: if I shipped something in that state to a client, I would not be able to sleep. Not because of the professional embarrassment, but because real people's data was at risk and I would have known it. That is a floor of professional responsibility I take for granted. It is apparently not universal.

The accountability problem is sharpest with offshore development houses, where cost is the primary selling point and the legal reality is rarely discussed until it matters. When something goes wrong — when the data breach happens, when the application turns out to be a liability, when the contract has not been honoured — the question becomes: what recourse do you actually have? How do you take a company in India or the United States to the small claims court when you are based in the UK? The honest answer is that you largely cannot, at least not without legal costs that dwarf whatever you saved by going offshore in the first place. The geographic distance that made the arrangement attractive on a spreadsheet is the same distance that makes accountability practically unenforceable when things go wrong. Offshore development can be excellent and cost-effective — that is not the point. The point is that without careful contractual arrangements and legal advice, you may be trading away your legal protection at the same time as you are trading away the budget for a proper technical review.

The defence that tends to get offered, when these conversations happen, is that the customer did UAT and accepted the product. There is something almost darkly funny about this argument when you unpack it. User acceptance testing is, by definition, testing performed by the people who commissioned the software — the same people who lacked the technical resource to build it themselves. They can test that the application does what the spec describes. They can click through the flows and confirm the buttons do the right things. What they cannot do is audit the authentication model, review the secrets management, evaluate the infrastructure decisions, or assess whether the code is structured in a way that will cost them dearly in three years. Pointing to UAT sign-off as evidence that the product was fit for production is pointing to the one form of review that was structurally guaranteed to miss the most important things.

The security review, the code review, the infrastructure audit — these should have happened long before UAT. They are not the customer's job. They are the developer's.

There is an old phrase for this: buy cheap, buy twice. In software, I would revise it. Buy cheap, buy twice — and spend the intervening period hoping you do not collect a GDPR fine, a liability claim, or a data breach that ends up in the news.

This is not an unusual story. It is an extremely common one. The organisations most at risk are the ones that need software but lack the technical fluency to evaluate it themselves — and so they assume that delegation is the same thing as oversight. It is not. Delegation gets the work done. Oversight is what ensures the work was worth doing.

The Value of a Second Set of Eyes

In any engineering team worth the name, code does not ship without being reviewed by someone who did not write it. This is not bureaucracy — it is the mechanism by which problems get caught before they compound.

A good code review does several things. It checks that the implementation actually does what the issue or ticket described — that the developer's interpretation of the requirement matches the intended behaviour. It checks that the approach follows established patterns: if the rest of the codebase uses a repository layer, a feature that bypasses it is introducing friction for every future developer who reads the code. It looks for security vulnerabilities — not because the original developer is careless, but because the person who wrote the code is also the person least likely to see its blind spots. And it holds the work to a standard of craft: is this clean? Is it maintainable? Does it follow the SOLID principles where they apply?

The SOLID principles are worth naming because they are often dismissed as abstract theory, but they have very practical consequences. Code that cannot be extended without being rewritten becomes a problem the moment requirements change — and requirements always change. Components that are tightly coupled to specific implementations resist testing, refactoring, and handover. These are not aesthetic preferences. They are the difference between a codebase that can be built upon and one that has to be endured.

The problem with taking the shortest path to a working solution is that working and right are not the same thing. An endpoint that returns the correct response in the happy path is not the same as an endpoint that handles edge cases, validates input properly, and does not expose data it shouldn't. A feature that passes a quick manual test is not the same as a feature with automated test coverage that will fail loudly when a future change breaks it. The feature ships, the invoice goes out, and the shortcut becomes a permanent fixture — one that every subsequent piece of work has to navigate around.

Technical debt accumulates exactly this way. Each shortcut is a small addition to a balance that accrues interest. A single poorly structured feature is a minor inconvenience. A codebase built feature-by-feature on a foundation of shortest-path decisions becomes genuinely difficult to change. And the harder it is to change, the more expensive every future change becomes. The invoices that seemed unreasonably high for simple requests often are not the agency overcharging — they are the true cost of touching code that should not have been written the way it was.

When there is no review process of any kind, the shortcuts accumulate unchecked until someone from outside finally looks. That does not mean every project needs a team. A lone developer can still apply genuine rigour — reasoning carefully through their own code, using AI tooling to pressure-test decisions, treating self-review as a discipline rather than a formality. But "ship it and see" is never an acceptable substitute, and the absence of a second human reviewer makes automated testing non-negotiable rather than merely advisable. Unit tests and CI pipelines are not bureaucracy — they are the circuit breaker. They catch regressions before users do, they force the developer to reason about expected behaviour before writing the implementation, and they provide a baseline of confidence that no amount of informal manual testing can replicate. A codebase with meaningful test coverage and a passing pipeline is making a verifiable claim about its own correctness. A codebase without either is asking everyone downstream to take its word for it.

Enter the Vibe Coder

PHP has a reputation. I was told once, by someone who clearly had strong feelings on the subject, that it was the worst language in the world — not because it was incapable, but because you could write the absolute worst PHP imaginable and it would still run. The early versions had almost no type safety, no enforcement of structure, no guardrails of any kind. A developer who did not know what they were doing could produce something genuinely terrible, and the user would be none the wiser, because the page still loaded. The language has improved enormously — modern PHP with strict typing, named arguments, union types, and the tooling the Laravel ecosystem provides is a very different proposition — but the point still holds. Low barrier to entry means the floor is also low. The application running is not evidence that it was built correctly.

I have written previously about vibe coded applications and the gap between what they appear to be and what they actually are. The dynamic is the same, and in some respects the problem is more acute.

When a non-engineer uses an AI tool to build an application, the structure is identical to an agency with no oversight and no peer review. The AI produces something that looks correct: code that compiles, endpoints that respond, screens that render. It does not, unprompted, stop to ask whether an endpoint should require authentication. It does not flag that a credential has been hardcoded into a config file. It does not question whether an S3 bucket is publicly accessible, or whether a database query could be manipulated by a malicious input. It builds what it is asked to build, to the standard it is capable of, without any awareness of what it doesn't know.

But there is a subtler failure mode than missing security issues, and it is the one I think gets talked about least. An AI will not challenge the spec itself.

I had a conversation recently with a customer building an access control system. Their requirement was straightforward: a partner account type that could see accounts belonging to customers in the UK, while super admins could see everything. It was clearly stated and internally consistent. An LLM would have built it. A vibe coder would likely have built it too. I did not build it, because I asked why.

What they actually needed was for a partner to see the accounts scoped to them — regardless of region. The UK constraint was not a business rule; it was an assumption baked in because their first partner happened to be based in the UK. Scoping access by region would have produced a system that could support exactly one kind of regional partner, with UK logic hardcoded throughout. The moment a second partner appeared — in Germany, in Australia, anywhere — it would not have been an expansion. It would have been a rewrite, accompanied by a data migration that carried real risk.

Scoping by partner instead of region gave them everything they originally asked for, still restricted partners to only the accounts relevant to them, and meant the system could support any number of partners in any number of regions without touching the access control logic. The requirement was met. The system was not brittle.
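The difference between the two designs is small in code and large in consequence. The following is a simplified sketch rather than the system that was actually built: the Account and User shapes, the role names, and the visible_accounts function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    id: int
    region: str
    partner_id: Optional[int]  # hypothetical: the partner this account is scoped to


@dataclass(frozen=True)
class User:
    id: int
    role: str  # hypothetical roles: "super_admin" or "partner"
    partner_id: Optional[int] = None


def visible_accounts(user: User, accounts: list) -> list:
    """Return the accounts this user may see, scoped by partner, not region."""
    if user.role == "super_admin":
        return list(accounts)
    if user.role == "partner" and user.partner_id is not None:
        # Region never appears in the rule, so a partner in Germany or
        # Australia needs new data, not new access control logic.
        return [a for a in accounts if a.partner_id == user.partner_id]
    return []  # deny by default for anything unrecognised
```

A region-based version would have compared a.region against a hardcoded "UK" in that filter. Here a second partner is just another partner_id value, and the region field remains available for reporting without taking part in the access decision.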

That solution came from someone willing to say: no, that is not quite right, and here is why. It came from experience with what these requirements tend to look like six months later, when the first assumption turns out to have been optimistic. A good solution architect's job is not to execute instructions — it is to understand the problem well enough to know when the instructions would solve the wrong version of it. That is not a capability that scales with token count.

In my own experiments building with Claude, this was exactly what I found. The front end was impressive. The application loaded, data appeared, forms submitted. But billing limits existed only in the UI — the backend never enforced them. Subscription downgrade logic had edge cases that would let users exploit it for a refund they weren't entitled to. There were no automated tests anywhere. The illusion of a finished product concealed a significant amount of missing work, and the AI had no mechanism to surface it.
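For the billing limits in particular, the missing piece is a check on the server that runs regardless of what the UI shows. The sketch below assumes a hypothetical limit on projects per plan; the plan names, the numbers, and the function name are invented for illustration and are not taken from the application described above.

```python
class PlanLimitExceeded(Exception):
    """Raised when a request would exceed the caller's plan allowance."""


# Hypothetical plan limits; a real system would load these from the
# billing configuration rather than hardcode them.
PLAN_LIMITS = {"free": 3, "pro": 50}


def create_project(plan: str, existing_count: int) -> int:
    """Enforce the plan limit on the backend and return the new project count.

    The UI may hide the "New project" button at the limit, but that is a
    convenience for the user. This check is the actual enforcement, because
    anyone can call the API directly.
    """
    limit = PLAN_LIMITS.get(plan)
    if limit is None:
        raise ValueError(f"unknown plan: {plan!r}")
    if existing_count >= limit:
        raise PlanLimitExceeded(f"plan {plan!r} allows at most {limit} projects")
    return existing_count + 1
```

A test that calls this function at the limit and expects PlanLimitExceeded is exactly the kind of automated check that was missing.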

In all of these cases — the outsourced agency, the vibe-coded project, the solo developer working without review — the failure mode is the same: the builder and the auditor are the same entity, and the auditor does not exist.

What concerns me is where this leads. The software houses of the near future may not be staffed by incompetent developers — they may be staffed by optimistic vibe coders, which is a different problem and in some ways a worse one. An incompetent developer at least has some intuition that they might be missing something. An optimistic vibe coder has a working application and the confidence to match. The outcome for the client — the insecure endpoints, the brittle architecture, the data exposure waiting to happen — is identical. And in applications handling sensitive data, identical can mean dangerous. Worse still: when the vibe coder hits a problem they cannot solve and asks the LLM to fix it, the LLM may simply not be able to identify what is wrong. At that point, there is no fallback. Nobody in the chain has the expertise to go further.

I want to be fair here: I know people who are not engineers, are building with AI tools, and are producing things that are genuinely decent. But I do not think it is prior engineering experience that makes the difference. It is thoroughness. The ones doing it well know exactly what they want the software to do, they test the rough edges deliberately, they push back on outputs that do not feel right, and they prompt hard until the cracks are closed. They treat the AI as something to be interrogated rather than trusted. That discipline is not common, and it is not something a software house billing by the feature has any incentive to apply.

Why Personal Data Changes Everything

Most software that goes wrong costs money to fix. Security incidents involving personal data cost something else entirely.

GDPR exists for a reason. When a data breach occurs — when personal information is exposed through an unauthenticated endpoint, a leaked credential, or a misconfigured storage bucket — the consequences are not limited to a difficult conversation. There are regulatory obligations to report, fines that scale with the severity of the breach and the size of the organisation, and reputational damage that is nearly impossible to quantify. For a small organisation, a serious data incident can be existential.

Looking secure and being secure are not the same thing, and the difference only becomes apparent when someone is actively looking for it. Login screens and SSL certificates are table stakes — the minimum a user can observe on the surface. They say nothing about what is happening underneath.

That person actively looking is the engineer the project never had.

What Good Oversight Actually Looks Like

I am not arguing that every small organisation needs a full-time engineering team. The economics simply do not work for most of them, and a capable agency can deliver good software. But there is a meaningful difference between "we hired an agency to build this" and "we have no visibility into what they built or how."

At minimum, an independent technical review before any application handling personal data goes into production is not a luxury — it is, I would argue, a duty of care. That review does not need to be exhaustive. An experienced engineer spending a day looking at authentication patterns, data exposure, secrets management, and infrastructure configuration will find the issues that matter. The cost of that day is trivially small compared to the cost of a breach.

For AI-assisted or vibe-coded projects, the bar should be the same, and ideally higher. The velocity that makes these tools attractive — the ability to go from idea to running application in an afternoon — is precisely the thing that makes independent review important. Speed is great. Speed without scrutiny is how production credentials end up committed to a public repository.
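A first pass at catching committed credentials can be automated cheaply. The sketch below is illustrative only: the three patterns are a small, hand-picked assumption, not a complete rule set, and dedicated secret scanners cover far more cases. It searches the full history via git log -p --all rather than only the current files, because a credential deleted in a later commit is still retrievable.

```python
import re
import subprocess

# Illustrative patterns only; real scanners ship hundreds of rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "hardcoded_credential": re.compile(
        r"(?:password|passwd|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def scan_text(text: str) -> list:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


def scan_git_history(repo_path: str = ".") -> list:
    """Scan every patch on every branch, not just the current working tree."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # binary diffs should not crash the scan
        check=True,
    )
    return scan_text(result.stdout)
```

Anything this finds should be treated as compromised and rotated, not merely deleted from the latest commit. An empty result is weak evidence at best, since the patterns here are deliberately narrow.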

Some things worth examining, regardless of how the software was built:

Authentication and authorisation. Can unauthenticated requests reach data they shouldn't? Does the application check not just that a user is logged in, but that they have permission to access the specific resource they are requesting?

Data exposure. What does the API actually return, and is all of it necessary? APIs that return entire database rows when a single field was needed are a routine source of accidental data leakage.

Secrets management. Are credentials stored in environment variables, or somewhere they could end up in version control? Has the repository history been checked — not just the current state of the files?

Infrastructure. Is the provisioned capacity proportionate to actual usage? Are storage buckets and databases accessible from the internet when they shouldn't be? Are there obvious optimisation opportunities — caching, queuing, query tuning — that have been deferred in favour of scaling up?

Code review process. Does one exist? Is there evidence of it — pull request history, review comments, a documented process? Code that has never been read by a second set of eyes is code that has never been audited.

Tests. Do they exist? If not, how does anyone know the application does what it claims — and how will anyone know when a future change breaks something?
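The first two checks in that list can be illustrated together. This is a minimal sketch under stated assumptions: the Customer record, its fields, the exception names, and the ownership rule are all hypothetical, and a real application would do this inside its web framework's authentication and serialisation layers.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Customer:
    id: int
    owner_id: int  # hypothetical: the user allowed to read this record
    name: str
    email: str
    password_hash: str
    internal_notes: str


# Explicit allowlist of fields the API may return. New columns added to
# the record later stay private until someone deliberately lists them here.
PUBLIC_FIELDS = ("id", "name", "email")


class Unauthenticated(Exception):
    """No valid identity on the request (HTTP 401 in a web API)."""


class Forbidden(Exception):
    """Identity is known but not permitted for this resource (HTTP 403)."""


def get_customer(requesting_user_id: Optional[int], customer: Customer) -> dict:
    """Check authentication, then authorisation, then trim the response."""
    if requesting_user_id is None:
        raise Unauthenticated("login required")
    if requesting_user_id != customer.owner_id:
        # Being logged in is not enough; the user must own this record.
        raise Forbidden("not permitted to view this customer")
    row = asdict(customer)
    return {field: row[field] for field in PUBLIC_FIELDS}
```

The reviewer's questions map directly onto the three steps: is there an identity, is that identity allowed this specific resource, and does the response contain only what the caller needs.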

Own the Code, the Pipeline, and the Data

There is a related problem that security reviews alone cannot fix: the organisation does not actually own what it paid to have built.

In a healthy arrangement, the client owns the repository. The code lives in a GitHub or GitLab organisation under the client's account — not the agency's — and the agency are contributors to that repository, not custodians of the only copy. This matters more than it might seem. If a relationship with an agency turns sour, if they go out of business, if you simply want a second opinion on the work, you need to be able to hand a new developer access to the codebase without asking permission from the people who built it. That should be a given. It frequently isn't.

The same principle applies to deployment. A deployment process that lives in one developer's head, or in a proprietary dashboard only the agency can access, is not a deployment process — it is a hostage situation. Any serious project should have a documented, version-controlled deployment procedure: a CI/CD pipeline, a set of runbooks, an infrastructure-as-code configuration that describes what exists and how it is provisioned. The goal is not just reproducibility — it is vendor-neutrality. A new contributor, from a different agency or hired directly, should be able to read the documentation and follow the procedure without needing to ask the previous team how it works. If they cannot, the organisation is dependent on that team in a way that has no upside.

Cloud infrastructure deserves the same treatment. The AWS account, the database, the storage buckets, the deployed services — these should exist under accounts that the client controls. The agency can be granted the access they need to operate and deploy, but the underlying accounts should be the client's. This is partly about leverage — an organisation that does not own its own infrastructure cannot easily switch providers — but it is also about visibility. If the infrastructure bill is in someone else's account, you are trusting their interpretation of what you are being charged for and why.

Data ownership follows naturally from this. Databases and backups should be accessible to the client independently of whoever currently manages the application. The moment that changes — the moment data can only be retrieved by asking a third party — you have lost control of something you cannot afford to lose.

This is not about distrust. A good agency relationship is collaborative, and there is no reason a vendor should object to any of this. If a vendor does object (if they resist putting the code in the client's repository, insist on managing the cloud accounts themselves, or cannot provide deployment documentation), that resistance is itself important information. An agency confident in the quality of their work has no reason to be the only party who can see it.

The Cost of Finding Out Later

The cruel irony of security debt and technical debt is that both are far cheaper to avoid than to address. A thorough review during build costs a fraction of what a breach response costs. Infrastructure right-sized from the start saves money every month it runs. Code written with maintainability in mind costs less to change than code that has to be unpicked before it can be touched.

The industry — and the growing class of AI-powered builders entering it — would benefit from understanding this before they find out the hard way. The builder and the auditor can be the same person, but only if that person is actively choosing to wear both hats — documenting their decisions, explaining the reasoning behind their approach, writing test coverage that forces them to reason about correctness, and inviting scrutiny rather than avoiding it. That is a discipline, not a default. What is not acceptable is a builder who is also, by omission, the auditor — because nobody else ever looked, because the process never invited it, because shipping was the only goal. Whether the builder is an agency, a solo developer, or a language model, the question is the same: is anyone asking the uncomfortable questions? And if the answer is no, the risk is the same regardless of how the code was written.

Alongside that: own the repository. Own the deployment pipeline. Own the cloud accounts. Own the data. Vendors and tools should be contributors to your project, not gatekeepers of it. The moment any single party becomes the only one who can deploy, the only one who knows the architecture, or the only one with access to the production environment, you have created a dependency that will eventually cost you, whether in money, in time, or in something worse.

Because if nobody is watching, the problems will wait patiently until they matter most.