All Articles
ReflectionAILLMai toolstool usemental models

Your LLM Isn't Quite As Smart As You Think

6 May 2026· 18 min read· Andrew Arscott

Ask Claude or ChatGPT what the time is. It will tell you. Ask it whether it's drizzly outside in Newport right now, and it might give you an answer — possibly a confident one. Ask it to summarise this morning's headlines, and it will happily oblige.

You'd be forgiven for thinking these are all the LLM doing what LLMs do.

None of them are.

Strip away the chat interface, the integrations and all the bells and whistles, and prompt the underlying LLM directly. Ask it the time. Ask it the weather. Ask it about the news. The honest, modern models will tell you straight — they cannot. They do not know the time. They cannot see the weather. They have no idea what is in the news today.

If you do not want to take my word for it, you can try this yourself. Cloudflare host a free LLM playground where you can prompt a small, raw model — Meta's Llama 3.2 3B Instruct — with no chat-app wiring around it. Ask it what time it is. Ask it about today's news. The smaller, non-reasoning models tend to be blunt about it — they will tell you they cannot, that they do not have access to the current time, that they cannot see the web. Older models, before this kind of self-awareness was trained in, would do something more revealing — telling you confidently that the year was whatever year their training data ended, sometimes two or three years out of date. They were not lying. They genuinely did not know.

The newer reasoning models do something more telling still. Faced with the same question, they will think out loud about how to answer it — and you can watch them reach for a tool that isn't there. They reason their way to "I should look up the current time" or "I should search the web for this", before coming to terms with the fact that no such facility exists in the conversation. Even when the tools are missing, the model knows it is supposed to have them.

A reasoning model's chain of thought, looking for a tool to answer a question it cannot answer alone

The model itself does exactly one thing — it predicts the next word in a sequence. It does that astonishingly well, drawing on a knowledge base that spans most of what humans have ever written down. But on its own, in isolation, it cannot tell you the time, look at a website, run a calculation, or generate an image. None of those things are language. They are actions. And the model isn't an actor. It's a writer.

To understand what AI assistants actually do — and why some feel so much smarter than others despite often being built on the same models — it helps to step out of the software entirely for a moment, and into a workshop.

The Carpenter With No Tools

Imagine someone who has read every book ever written about carpentry. They know which timbers are best for which jobs. They can describe how to join two pieces of wood properly in their sleep. They can talk you through the right way to cut, the right way to shape, the right way to finish, and the difference between every wood from balsa to oak. By any reasonable measure, they are a master of the theory.

Now put them in an empty room. No saw, no hammer, no sandpaper. Not even a piece of wood. Ask them to build you a chair.

They can't. All that knowledge is real, but it's locked behind something they don't have — tools. Without a saw they cannot cut. Without a hammer they cannot join. Without sandpaper they cannot finish. The expertise exists, but the expression of that expertise — the actual building of things — depends entirely on what's in front of them.

LLMs are exactly like that carpenter.

The model is the knowledge — vast, fluent, surprising in its breadth. But the things you've come to expect from an "AI assistant" — drawing pictures, looking things up, running code, sending emails, even just knowing what day it is — those are tools. And without them, the model is a very well-read carpenter in an empty room.

What The Model Actually Does

Strictly speaking, an LLM does one thing. You give it a sequence of text, and it produces the next chunk of text that statistically follows. Its weights — the numbers that encode what it has "learned" — are frozen at the moment it was trained. They do not change while you are talking to it.

It is worth being a little more precise though, because the chunks the model deals in are not always whole words. They are tokens — small pieces of text that might be an entire word, part of a word, a piece of punctuation, or even just a space. A short, common word might be a single token. A longer or rarer one might be split into two or three pieces. When the model generates a reply, it does so one token at a time, picking the next most likely token, then the next, then the next, until it decides it has nothing more to say.

Tokens being chunks of text rather than individual letters has some amusing consequences. ChatGPT once became briefly famous for being unable to count the number of times the letter R appears in the word "strawberry" — it would confidently tell you the answer was two. The reason is that the model never actually sees the letters S-T-R-A-W-B-E-R-R-Y. It sees something more like ["straw", "berry"] — two tokens — and from inside that representation, the individual letters are simply not visible. The information is not there to work with. Give the model a tool — a code interpreter, say — and it will quietly write a one-line script to count the Rs and return the right answer. Without one, it is left guessing.

This matters because everything else the model appears to do is built on top of those tokens. When you give an LLM access to a tool — a calculator, a web search, a memory store — what you are really doing is teaching it that producing a particular pattern of tokens means "call this tool". The application wrapping the model watches the stream of tokens being generated. The moment that pattern appears, the application intercepts it, runs the tool in the real world, and feeds the result back into the conversation as more tokens. The model never "uses" anything. It writes a structured request, and a piece of software around it does the rest.

The model itself has no memory of you. No awareness of the world today. No internal sense of time, place, or context beyond what is in the conversation in front of it. It is, quite literally, a writer who reads the conversation so far and decides what to say next.

That's it.

Everything else — and there is a lot of "everything else" — is built around it.

When you talk to ChatGPT, Claude, Gemini or any other AI assistant, you are not talking to the model directly. You are talking to an application that wraps the model, and that application is doing a great deal of work on your behalf, often invisibly. Before your message even reaches the model, the application is likely to be injecting a system prompt — instructions that set the model's persona, its guardrails, what it should and shouldn't talk about, the tone it should take. It might pull in your relevant chat history, surface any saved memories, attach the contents of a file you've uploaded, check the current date, or fire off a web search if it suspects the question needs fresh information.

All of this — the system prompt, the chat history, the retrieved memories, the contents of the file you uploaded, the results of any tool calls — has to fit somewhere. That somewhere is called the context window, and it is the model's working memory for a given conversation. Every token in the conversation, from any source, counts towards it. Modern models offer increasingly generous context windows — hundreds of thousands of tokens, sometimes more — but the limit is real. Fill it up and something has to give. Older parts of the conversation might be summarised or trimmed. Tool results might be truncated. The application is constantly making decisions about what to keep, what to compress, and what to leave out. Every tool you give the model is, in effect, another voice trying to be heard inside that window — and part of the craft of building a useful AI product is deciding which voices get the floor, and for how long.

Memory is a particularly clean example of this. ChatGPT's memory feature, which feels so seamless from a user's perspective, is essentially a key-value store sitting outside the model — a fancy notes file, with one tool the model can call to write things to it and another to read them back. The model has no enduring memory of its own. When it appears to "remember" you, what is really happening is that the application has loaded the relevant entries into the prompt before the conversation begins, so the model can read them like any other piece of text.

The reply, once produced, often triggers further actions — generating an image, running code, fetching a webpage, calling an API — before the final result is returned to you. Each of those is, again, mediated by tokens. The model emits them. The application acts on them. It really is tools all the way down.

The LLM is the brain. The application around it is the body, the senses, the hands — and increasingly, the memory.

The Tools That Do The Heavy Lifting

A few of the most common things people assume are "the LLM" but really aren't:

Knowing the time and date. This is one of the most surprising, because it feels so trivial. The model itself genuinely doesn't know what day it is — its weights were last updated months or even years ago. To give you a sensible answer, the application has to either include the current date in the prompt before the model sees it, or expose a tool the model can call to fetch it. If you've ever had an AI confidently tell you the year is something it isn't, that's why. The model was working from what it remembered, which is whenever its training data was last refreshed.

Searching the web. The model has no internet access. It cannot click a link. It cannot type into a search bar. When you ask it about today's news, what's actually happening is that the application invokes a search API on its behalf, retrieves the results, and feeds the relevant text back into the model's context — at which point the model can write about it as though it had just read it. Which, in fairness, it just had.

Generating images. The LLM does not draw the image. It writes a description, hands that description to a separate image generation model — a completely different kind of system entirely — and surfaces the result back to you. The LLM and the image model are two different brains, communicating in text. If the resulting image is wrong, it is often because the LLM described the wrong thing to the image model, not because either of them misbehaved.

Running code. Many AI assistants will happily run snippets of Python or JavaScript and return the output. None of that execution happens inside the model. The model writes the code, a sandboxed runtime — completely separate from the LLM — executes it, and the output is then fed back into the model's context as if it were just another piece of input. The LLM never runs anything. It just reads the result.

Reading the files you upload. Same story. The model cannot open a PDF or parse an image of a table. The application processes the file, extracts the text or relevant data, and includes that extract in the prompt. The model then "sees" what was pulled out — but it never touched the file itself.

Memory between conversations. When an AI assistant seems to remember things you told it last week, that is almost always a memory system stitched on top of the model. The weights have not changed. Somewhere there is a database, a retrieval system, or a notes file that gets loaded into the prompt at the start of each conversation. Take that system away and the model has no recollection of you whatsoever.

Calculations that need precision. The model can do arithmetic, but not reliably. For anything where you need a guaranteed correct answer, sensible AI applications hand the calculation off to an actual calculator or code interpreter. The model decides what needs computing, the tool does the computing, and the model takes the result back to weave into its reply.

The pattern is consistent. The model is always the one talking. But the tools are doing most of the heavy lifting.

Where The Cracks Appear

It is tempting to assume that once the right context is in the prompt, the model has everything it needs. The application injected the current date. It loaded the user's chat history. It surfaced the relevant memories. Surely that is enough?

Sometimes. But not always — and the failure modes are subtle enough to bite hard in production.

Take time and date. I mentioned earlier that the application can simply include the current date in the prompt before the model sees it. That works for one-shot questions. The trouble is, conversations are not always one-shot. If the application stamps the time at the start of the conversation and the user is still chatting half an hour later, the model still thinks it is that earlier time. From the inside, nothing has changed. Unless something later in the conversation provides a fresh reference point, the model will happily tell you it is 14:32 long after it has stopped being 14:32. A proper tool — one the model can call to fetch the current time the moment the question is asked — is far more reliable than a snapshot frozen into context at the start of the conversation.

Timezone conversion is a particularly telling case. Ask a model to convert 14:00 UTC into Pacific Time and it will give you an answer. It will sound confident. It might even be correct. But the model is not converting anything. It is generating the next most likely token, drawing on every example of timezone conversion in its training data. Sometimes that lands. Sometimes it does not. The model has no concept of time arithmetic, no awareness of daylight saving, no internal clock. It is pattern-matching its way to an answer, and a pattern-matched answer is not the same as a calculated one.

This kind of thing matters more than it sounds. In my work with Voice AI, business hours and timezone logic come up constantly. Is the office open right now? Is it a public holiday in the customer's country? Has the on-call rota rolled over? Without a reliable tool to answer those questions deterministically, things go "Pete Tong" very quickly. The bot will tell the customer the office is open when it is closed. It will offer a callback at a time that does not exist. It will get the clocks wrong twice a year. None of that is the model misbehaving — it is the model doing exactly what it was designed to do, which is generate plausible text. Plausible is not the same as correct.

This is also where most hallucinations come from. A hallucination is not the model lying or going rogue — it is the model generating text the way it always does, in the absence of a reliable source of truth. It will produce a confident answer because that is what it has been optimised to do. Without a tool to ground the answer in something real, the only thing the model has to draw on is its training data and its own generated context — and from that, it will produce something that sounds right. Sometimes it is. Often, in narrow demos, it is. But the moment you put the model in front of real users, with real questions, in a real production environment, the cracks appear.

The art of building a useful AI product is, in large part, knowing when to stop trusting the model and start handing the work off to a tool. Anything that needs to be exact — the time, a calculation, a database lookup, a timezone conversion, a check against business rules — should not be left to next-token prediction. Anything that needs to be current should not be frozen into context. Anything that needs to be auditable should be done by code, not by a writer guessing what comes next.

The model is a brilliant writer. But there are things writers should not be asked to do unaided.

Why This Mental Model Matters

Why does any of this matter? Because it changes how you set expectations.

I get asked, often, why one AI tool seems "smarter" than another. Most of the time the underlying model is comparable — sometimes identical. What differs is the workshop. One product gives the model better tools, more context, sharper retrieval and cleaner integrations. Another wraps the same model in a far more limited harness. The model isn't smarter. The toolkit is.

I also see businesses make decisions on the back of demos and then look puzzled when their internal deployment doesn't behave the same way. They watched an AI assistant write SQL against a live database, summarise yesterday's support tickets, or draft personalised emails to specific customers. What they actually saw, in every case, was the model plus a stack of tools and integrations. Replicate only the model in your own product and you will get a much narrower experience. The intelligence on display in those demos is rarely the model alone — it's the model plus the wiring.

This matters even more when people talk about "training" an AI on their data. I'll dig into that properly in a future article, but the short version is this — most of what you think of as an AI "knowing about your business" is in fact a tool fetching that data and feeding it into the prompt, not the model having absorbed your knowledge into its weights. Pull the tool away and the magic vanishes with it.

For those of us building these things, the practical implication is more empowering. Useful AI products turn out to be far more about the tools you give the model than about the model itself. A well-tooled mid-tier model often outperforms a frontier model with a poor harness. It's a craft question, not just a model question.

Choosing Your Carpenter

Once you accept that the workshop matters as much as — often more than — the carpenter, an interesting question opens up. Do you actually need the most expensive carpenter on the market?

For most people approaching AI for the first time, the default answer is yes. Of course you want the best. ChatGPT is the household name, GPT is the model everyone has heard of, and the assumption is that anything less is settling. So budgets get pointed at OpenAI, contracts get signed with the frontier providers, and tokens get charged at frontier prices.

It is worth slowing down here, because the question is not "which is the smartest model?" — it is "which model is smart enough to do the job, with the tools I am giving it?" Those are very different questions, and the answer is rarely the one the marketing team would prefer.

A frontier carpenter — one who has read every book on carpentry, knows every joinery technique, can speak fluently about every wood — is genuinely impressive. But they also charge a frontier rate, and they get through a frontier amount of tea and biscuits while they work. (For the rest of this analogy, tea and biscuits is compute.) At frontier scale — every token billed, every conversation running — that bill adds up faster than people expect. I have seen companies pour startling sums of money through their LLM's mouth simply because nobody asked whether a cheaper model could do the same job. Uber are perhaps the most public example: they blew through their entire annual AI budget in the first quarter. I suspect quieter versions of that story are playing out across the industry.

There is a middle ground that gets surprisingly little airtime. Open weight models like Meta's Llama 3 70B are competent carpenters. They have read a lot of books — perhaps not every book, but enough. They can use the tools you give them. They can hold a conversation, follow instructions, call APIs, and answer questions reliably for huge swathes of real-world use cases. And they cost a fraction of what the frontier providers charge — sometimes a tenth, sometimes less, particularly if you are running them yourself on sensible hardware.

The mental model here is the same as the rest of this article. The intelligence you experience in an AI product is mostly the tools, not the model. So if the tools are well chosen and the job is well scoped, a smaller, cheaper model is often more than enough. There are tasks where the frontier models genuinely earn their keep — complex multi-step reasoning, novel research-style problems, code generation in unfamiliar territory. And there are plenty of tasks where they are quietly overkill, and the costs add up faster than anyone notices.

I will go deeper on this in a future article — there is much to say about model tiers, the trade-offs they involve, and how to make grown-up decisions about which one you actually need. For now, the takeaway is this: do not let the carpenter's CV be the only thing you look at. Look at the workshop. Look at the job. Then pick the carpenter that fits both.

The Workshop, Not The Carpenter

The carpenter, in the end, is only as capable as the workshop you put them in. The same is true of LLMs.

This is partly why AI hype feels so polarised at times. The models are extraordinary — and they really are. But strip away the tooling that surrounds them in modern products, and you would find the raw output far less impressive than the experience suggests. Not because the models are weak, but because language alone is a small slice of what a useful assistant needs to do.

Once you see this clearly, a lot of confusion melts away. You stop expecting the model to know things it cannot know. You stop being surprised when one product can do something another cannot, despite using the same underlying model. You start asking better questions — not "is this AI good enough?", but "what tools does it have, and are they the right ones for the job?"

It also explains why MCP — the Model Context Protocol, originally introduced by Anthropic — has taken off the way it has. The genius of it, from Anthropic's perspective, is that it inverts the integration burden. Rather than Anthropic having to go and build first-party integrations into every product anyone might care about, the products themselves come to Anthropic. And because MCP is an open standard, those same integrations work with anyone else who adopts it. Anthropic gets the network effect; the wider industry gets a shared connector layer; everyone benefits from a thousand integrations they did not have to build themselves.

That is why every Tom, Dick and Harry shipping software in 2026 seems to have their own MCP server attached. If an LLM can pull context out of your product as easily as it pulls a date or a search result, the barrier between "I want to do this thing" and "the AI can help me do this thing" falls away. Your product stops being a separate destination the user has to detour through, and starts being another superpower the LLM can leverage on their behalf. That is a powerful place for a product to sit — and the industry has very clearly noticed.

The cleverness, more often than not, isn't in the carpenter. It's in the workshop.

Have thoughts on this? Get in touch.