
In Christopher Nolan’s film Memento from the early 00s, the protagonist has lost his short-term memory and must try to solve a mystery by leaving himself notes — because each time he sleeps, his memory is wiped. But the more notes he accumulates, the harder it becomes to tell which are important, which are misleading, and what they all mean. He doesn’t lack information — he lacks context.
Andrej Karpathy mentioned this film in a recent Startup School talk, using it to highlight the inability of generative AI’s Large Language Models (LLMs) — and hence agents that depend on them — to sustain autonomy over multiple turns. Like the protagonist in Memento, an agent only knows what we have shared with it during our last interaction — and when an interaction finishes, its memory is also wiped. And while we can re-share previously important outputs or scraps of data, the model — much like the unfortunate man in Memento — doesn’t know which pieces of data are important and why.
As a result, we must recreate context each time we interact with an agent — refilling its short-term memory with the information it has forgotten. Do that badly, however — by giving it too much data, too little data, or failing to instruct it properly — and agents quickly become risky and unreliable. Blindly using the wrong information, in ways we didn’t intend, to do things we don’t want.
So how do we do it well?
Enter context engineering.
It started with a prompt
During the early stages of our relationship with generative AI, we focused on prompt engineering — trying to create clever ways of conjuring the behavior we want out of our flighty and unreliable AI models.
Everything the model needed to know was inserted into its short-term memory using the prompt — and we continually tinkered and experimented with ways of getting the behavior we wanted by using the prompt to share carefully crafted instructions, data, and examples.
Over time, the amount of information we are able to share with models has grown rapidly — through external data connections, ever-expanding context windows, and standards such as the Model Context Protocol (MCP). Together these new techniques have made it easier than ever to give information to models — but also made their world more confusing by filling their short-term memory with ever more data.
Stack overflow
All models essentially have what is known as a ‘context window’: a scratchpad of short-term memory that acts as a temporary shared space for exchanging information between the model and its external users or systems.
Because models don’t learn outside of their training, this context window represents everything that the model will ever know about us or the tasks we set it — and just like the memory of the protagonist in ‘Memento’, all of this information is wiped between calls. Which leaves the model reset to its baseline knowledge without any memory of the discussion.
To hold a multi-turn interaction, therefore, all information needs to be re-shared with the model each time. We might not be aware of this because chatbot clients do it ‘under the hood’ — but when we use models directly, there is no such memory management. Everything needs to be continually re-inserted into the context window to help the model ‘remember’ who we are, what we want, and what it has done so far.
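As a rough sketch of what that re-insertion looks like (with a placeholder call_model function standing in for whichever chat API or SDK you actually use), every call simply carries the full message history, because the model itself retains nothing between requests:

```python
# Minimal sketch of a stateless chat loop. call_model is a placeholder for
# whichever LLM API or SDK you actually use: the key point is that the full
# message history must be re-sent on every single call.
from typing import Dict, List

def call_model(messages: List[Dict[str, str]]) -> str:
    # Placeholder: swap in a real chat-completion call here.
    return f"(model reply, given {len(messages)} messages of context)"

def chat() -> None:
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user_input = input("> ")
        history.append({"role": "user", "content": user_input})
        # The model remembers nothing from previous turns: everything it will
        # 'remember' is whatever we pack into this list right now.
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        print(reply)

if __name__ == "__main__":
    chat()
```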
But as more and more data accumulates in the context window across interactions — ie, as the model is expected to make sense of an ever-growing chain of information — things can start to drift.
Because just like the person waking up daily to an increasingly messy pile of notes, pictures, and other fragments with no idea of why they are important, models also ‘wake up’ on each subsequent invocation to find data fragments stuffed into their short-term memory.
Lacking any sense of which fragments matter, the model can start to over-focus on words or phrases that appear frequently, or fixate on earlier parts of the conversation which are no longer relevant to the current task.
As a concrete example, when working with a model I frequently reach a point where it no longer seems able to focus on what I am asking it to do. Hallucinations increase, the model stops listening to instructions, and it fixates on something that happened earlier in the conversation that is no longer relevant.
At this point I simply have to copy whatever is relevant and start a new conversation — effectively wiping the model’s memory, curating the key parts of the conversation to date, and creating a new context window which only includes the information I want the model to focus on.
Pause and rewind
A recent paper by vector database vendor Chroma analysed this phenomenon and dubbed it “context rot” — ie, the tendency for models to become less reliable as the volume of data stored in their short-term memory grows.
Like the protagonist in Memento, the longer a multi-turn interaction continues, the greater the volume of confusing, contradictory, or irrelevant information a model might have to process each time it wakes up.
And because models are influenced by all of the data contained in their context window — and not just what’s relevant to the current task — their performance can start to decline. This is why agents struggle to sustain autonomy over more than a few turns — because the volume and diversity of data within the context window starts to create statistical gravity wells which pull their focus away from the current task and cause them to drift.
And while this issue is mostly just irritating in the context of a turn-based discussion between a human and a chatbot, it’s potentially disastrous for the kinds of autonomous AI agents organisations are hoping will help them automate complex, sensitive tasks.
To solve this issue, we need to become far more intentional about the information we use to shape a model’s perception during each interaction — managing its context window so that it can focus on what’s important without distraction.
In short, we need to explicitly engineer context for models so that we can use them successfully within industrial-strength applications.
Context engineering
The term “context engineering” was recently popularised by Shopify CEO Tobi Lütke in a tweet:
“I really like the term ‘context engineering’ over prompt engineering.
It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”
Importantly this positions context engineering as a broader and more proactive discipline than prompt engineering — focused not only on the initial instruction, but also on the continuous pruning and tuning of the model’s context over time in order to elicit more focused behavior at each step. This makes context engineering a superset of prompt engineering — ie, using intentional design to explicitly shape the behavior of a model via the things we put into the context window.
But it also acknowledges that prompts are now only one of the sources of context that drive a model’s behavior.
As more and more ways to share data with models have emerged — and context windows have grown to enable more information to be stored in their short-term memory — so the behavior of models has become increasingly dependent on the full scope of information we share. This means that it is no longer enough to prepare clever prompts — instead we must prepare clever context.
It would be like someone coming into the room while the man from Memento sleeps, removing anything that isn’t relevant to solving his mystery, organising the remaining information into timelines or other useful structures, and highlighting the most critical things to focus on — removing distractions and enabling him to immediately see his next best actions after waking.
Because when models — or film protagonists — lack the ability to absorb new knowledge and build lasting context by themselves, their context must be shaped by an external entity with the capacity to understand what’s important and why.
Context engineering therefore involves explicitly managing the information present in a model’s short-term memory: continually summarising previous steps, eliminating information that is no longer relevant, and intentionally shaping what remains so that the model stays focused on the things we want it to work on.
All while balancing external RAG data, tool results, prior outputs, and freshly provided information.
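To make that concrete, here is a minimal sketch of what such curation can look like in code. All of the names (build_context, summarise, retrieve_relevant, AgentState) are hypothetical stand-ins rather than any particular framework’s API; the point is that each new context window is deliberately composed from a running summary, the most recent turns, and only the retrieved data and tool results that matter right now:

```python
# Illustrative sketch of context assembly before each model call. All names
# (build_context, summarise, retrieve_relevant) are hypothetical stand-ins;
# the point is that the context window is composed deliberately, not simply
# accumulated.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentState:
    goal: str
    history: List[str] = field(default_factory=list)       # full transcript so far
    tool_results: List[str] = field(default_factory=list)  # raw tool outputs

def summarise(turns: List[str]) -> str:
    """Stand-in for a summarisation step (often itself an LLM call)."""
    return f"Summary of {len(turns)} earlier turns."

def retrieve_relevant(goal: str, k: int = 3) -> List[str]:
    """Stand-in for RAG retrieval against an external knowledge store."""
    return []

def build_context(state: AgentState, keep_recent: int = 4) -> str:
    older, recent = state.history[:-keep_recent], state.history[-keep_recent:]
    parts = [
        f"Goal: {state.goal}",
        summarise(older) if older else "",   # compact the past into a summary
        *retrieve_relevant(state.goal),      # fresh, relevant external data
        *state.tool_results[-2:],            # only the latest tool outputs
        *recent,                             # verbatim recent turns
    ]
    return "\n\n".join(p for p in parts if p)
```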
Andrej Karpathy, in agreeing with Tobi Lütke, added additional useful context:
“People associate prompts with short task descriptions you’d give an LLM in your day-to-day use.
When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.
Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting… Too little or of the wrong form and the LLM doesn’t have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down.
Doing this well is highly non-trivial.”
Engineering agents with context
This need to explicitly manage context to control the behavior of models has some significant implications for how AI agents should be scoped, built, and managed.
As I noted in my recent agentic taxonomy, the greater the autonomy given to an AI agent, the greater the danger that it will drift and — eventually — spiral out of control as a result of compounding errors.
Context engineering provides a clear lens on why this happens — and illustrates the dangers of ‘context rot’ for agentic systems. By becoming increasingly ‘polluted’ with historical or unnecessary information as they work — information which nonetheless influences their behavior — agents can become distracted. And this pollution can compound quickly, as a small deviation leads to the generation of additional erroneous tokens, pulling the agent further in the wrong direction simply by virtue of its statistical nature.
Agents therefore need to be kept on a tight leash — either through frequent validation by humans in tight feedback loops, or through deterministic orchestration with rules that can verify outputs at each step.
In practice, we must reflect this reality by taking steps to explicitly manage the context window — through limited autonomy, external context management, and explicit control flows.
By limiting the autonomy of agents — essentially confining them to discrete, narrowly scoped tasks — we can ensure that they don’t run through enough unmanaged turns to experience context rot and begin to drift.
By carefully managing the context window from outside the agent itself — deciding what is relevant and ensuring it has only the necessary information for its next step — we can keep agents focused on what we want them to do.
And by externalizing control flows — to ensure the right level of determinism over the wider process — we can combine the flexibility offered by agents for uncertain tasks with the deterministic infrastructure needed to keep them firmly under control.
Which is why the ‘Instruction’ and ‘Orchestration’ quadrants were highlighted as the most productive and reliable within the taxonomy — because the instruction of discrete agents, or the coordination of many discrete agents via external control flows, is the only way to ensure strong control over their context windows.
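For illustration, a deterministic outer loop along the following lines (a sketch with hypothetical run_step and validate functions, not any specific framework) keeps the control flow outside the agent: it decides which narrowly scoped step runs next, builds a fresh, curated context for that step, and verifies the output before moving on:

```python
# Sketch of externalised orchestration: step order, context curation, and
# validation all live outside the agent. run_step and validate are
# hypothetical placeholders, not a specific framework's API.
from typing import List

def run_step(instruction: str, context: str) -> str:
    """Placeholder for a single, narrowly scoped agent invocation."""
    return f"result of: {instruction}"

def validate(output: str) -> bool:
    """Deterministic check: a schema, a business rule, or a human approval."""
    return bool(output.strip())

def orchestrate(goal: str, steps: List[str]) -> List[str]:
    results: List[str] = []
    for step in steps:
        # Each step gets a freshly curated context window: the goal, the
        # validated results so far, and nothing else.
        context = "\n".join([f"Goal: {goal}", *results])
        output = run_step(step, context)
        if not validate(output):
            raise RuntimeError(f"Step failed validation: {step}")
        results.append(output)
    return results
```

Because each step starts from a curated context rather than an ever-growing transcript, there is far less opportunity for context rot to set in.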
One good resource for exploring this topic in more detail is “12 factor agents” — a resource created by Dexter Horthy, founder of agent-focused startup HumanLayer, which outlines a set of principles for implementing strong control over agents so that they can be safely used for the kinds of industrial-strength applications mentioned by Karpathy.
My take
The shift in the discussion from prompt engineering to context engineering represents something profound: a move from experimentation to engineering. We’re starting to see the jagged edges of these models and learn how best to wrangle them.
Looking backwards, the idea of context engineering makes intuitive sense. When you have a statistical machine that processes tokens, the tokens that you give it will have a significant influence on its behavior — and the more irrelevant tokens you leave in its short-term memory, the more irrelevant its behavior is likely to be.
Just like a man who wakes to find himself rich in information but lacking understanding of its meaning.
Enterprises should therefore invest in understanding context engineering — the task of curating information to provide clear meaning to AI. And while Karpathy calls it an art and a science, the art lies in focusing as much on philosophy as on science.
Because it requires you to think deeply about what you are trying to achieve, how to align it with your values, how far you can trust a model with autonomy, and how you should control its perception in order to elicit the behavior you want. This means ignoring fancy demos and doing the hard work of building a coherent philosophy of value, breaking down problems, mapping out control flows, and explicitly designing agent interactions for continual context management.
And — perhaps most importantly of all — it means finally treating agents as data infrastructure to be engineered and not digital ‘people’ to be managed.
As my colleague Phil Wainewright dryly observed, agents are not digital workers — they lack people’s ability to assimilate new tacit knowledge, navigate culture, and work within unspoken systems. They are entirely dependent on people to shape their world, provide their goals, and guide their actions — through the continuous management of their context.
And so the quicker people realise that and start treating agents as simple operational data infrastructure — with all of the attendant advantages and downsides this brings — the quicker we’ll get some practical value out of this stuff.