In this article, you will learn why a large context window is not the same thing as agent memory, and how techniques like retrieval, compression, and summarization fit together in an agent’s cognitive stack.
Topics we will cover include:
- Why a context window behaves like a stateless scratchpad rather than persistent memory.
- How retrieval-augmented generation, compression, and summarization each play a distinct role in managing what enters that scratchpad.
- How agents can achieve genuine memory persistence by acting as a database administrator rather than as the database itself.

Introduction
Context windows are a key aspect of modern AI models, particularly language models, whereby these models can attend to and utilize a limited amount of input and prior conversation — typically measured as a number of tokens — at once when producing a response.
When an AI lab releases a model with a 2-million token context window, it is no surprise some developers instinctively think like this: “Let’s shove the whole codebase into the prompt! Memory issues sorted!” However, there is a caveat. Deeming a huge context window as “memory” is, in architectural terms, similar to buying a 25-foot-wide office desk because you are reluctant to acquire a filing cabinet. Sure, you can have all your documents laid in front of you, but as soon as the working session ends, the entire desk’s documents are wiped out (by cleaning staff!).
To clarify this distinction and demystify other related concepts, this article offers a conceptual breakdown of multiple layers in AI agents’ cognitive stack. We will use several, mostly office-related metaphors to facilitate a better understanding of these concepts.
Context Window
A context window in an AI model, particularly agent-based ones with underlying language models, is like a desk surface or a stateless scratchpad. It is important to note that models are inherently fully stateless. No matter what, every API call to a model starts at “step zero”.
When passing an agent a conversation history spanning over 200K tokens (large context window), it isn’t remembering what happened at a previous step in time. Instead, it is quickly re-reading “its universe” from scratch in a matter of milliseconds. In the long-run, relying on this strategy in agent-based environments may introduce several dangerous (if not fatal) traps:
- AI models act like a lazy student, who pays close attention to the initial and final parts of a massive prompt (text), but utterly glosses over ideas and facts buried deep in the middle parts.
- There is a snowballing effect: as the conversation grows, the agent must re-send and re-read the entire history at every single step, including the earliest, often irrelevant turns.
- In terms of latency, there is a “brain freeze” effect, so that against a huge wall of text, the model will take some time until starting to generate the very first word in its response.
To make this concrete, consider what a single API call actually looks like under the hood. Because the model holds no memory between calls, every prior turn must be resent in full just to ask one new question:
|
model.generate( messages=[ {“role”: “user”, “content”: “Step 1: Let’s call this variable `session_id`.”}, {“role”: “assistant”, “content”: “Got it, I’ll use `session_id` going forward.”}, # … every intervening turn must be resent, every single time … {“role”: “user”, “content”: “Step 47: What variable name did we agree on back in step 1?”} ] ) |
Step 47 alone forces the entire desk — all 46 prior turns — back onto the table, just to answer a question about step 1. That is the snowballing effect described above, made concrete.
Retrieval
Retrieval-augmented generation (RAG) systems are like a big bookshelf across the office room, that helps fetch static, existing data relevant to the current step in a “Just-In-Time” fashion. RAG systems pull the top-K relevant document chunks into the scratchpad (the context window) as the user asks a certain question: the retrieved documents are, of course, the ones determined as most semantically relevant to the user’s question or prompt.
When agents are in the loop, things are not that easy, however, as vector similarity (the type of similarity measure and data representation used in RAG systems) is not necessarily equivalent to semantic truth in certain cases. For example, suppose a user tells their scheduling agent to move a meeting to Friday, and later says “cancel Thursday, Alice is sick.” A vector search engine may retrieve both statements from a document base, even though they contradict each other. The agent and its associated language model must be able to act as accountants capable of determining which statement better reflects the current reality.
A naive RAG pipeline simply concatenates whatever it retrieves and leaves the model to guess which instruction still holds. A more reliable pattern resolves the conflict before generation ever happens, for example by favoring the most recently recorded statement:
|
retrieved_chunks = [ {“text”: “Move meeting to Friday”, “timestamp”: “2025-01-10T09:00:00”}, {“text”: “Cancel Thursday, Alice is sick”, “timestamp”: “2025-01-12T14:30:00”} ]
# Reconcile contradictory chunks before they ever reach the prompt latest_relevant = max(retrieved_chunks, key=lambda chunk: chunk[“timestamp”]) |
That one line of reconciliation logic is the difference between an agent that confidently restates a stale instruction, and one that correctly knows the meeting was cancelled.
Compression
This is an easy one to understand if you are familiar with compressing into ZIP files. In the context of agents and language models, this entails some algorithmic token reduction: keeping the key underlying data intact, while its physical footprint inside a prompt at a certain step is shrunk. There are techniques like stripping stop-words, passing raw text to a specific compression model like LLMLingua, or Prompt Caching, to do this. This is, in essence, a bandwidth optimization play to be used in situations like squeezing a 15K-token JSON payload down to 5K, thus leaving enough scratchpad space in the model to do its main job.
In practice, this might look as simple as routing a large payload through a compression model before it ever reaches the main prompt:
|
raw_payload = json.dumps(large_api_response) # roughly 15,000 tokens
compressed_payload = compress_with_llmlingua( raw_payload, target_token_count=5000 )
prompt = f“Given this data: {compressed_payload}\n\nAnswer the user’s question.” |
The underlying facts survive the trip intact; only their footprint on the desk shrinks.
Summarization
Unlike compression, summarization removes the original data and replaces it with an abstraction. It must be treated as what it is: a one-way trip that is inherently irreversible. A good, nearly imperative practice when applying context summarization, therefore, is to use forked storage: dumping raw transcripts into cheap storage like S3 buckets or basic SQL tables, then passing just the synthesized summary into the active prompt.
That forked-storage pattern can be expressed simply as a two-step write, one to cold storage and one to the active prompt:
|
def summarize_turn(raw_transcript, session_id, turn_id): # 1. Persist the raw, unabridged transcript to cold storage s3_client.put_object( Bucket=“agent-transcripts”, Key=f“{session_id}/turn_{turn_id}.json”, Body=raw_transcript )
# 2. Generate a compact summary for the active prompt summary = summarizer_model.generate(raw_transcript)
# 3. Only the summary re-enters the context window return summary |
If a later step needs the original detail, it can always be retrieved from S3. Summarization, unlike compression, never needs to be reconstructed from inside the active prompt itself.
Memory Persistence as a State Machine
Memory persistence in agents is taken for granted more often than not, particularly by junior developers. But to give an agent genuine memory, it must not act as the database, but rather as the database administrator. Suppose a user says, “My dog’s name is Goofy, but we might rename him Pluto”. Then the agent should be able to explicitly trigger a tool-call like this:
|
{ “tool”: “update_entity_graph”, “params”: { “subject”: “User_Dog”, “attribute”: “Name”, “value”: “Goofy”, “notes”: “Considering Pluto” } } |
It is irrelevant whether it is backed by a standard SQL table, a knowledge graph, or Redis: either way, the agent should be taught to query the state machine at the start of every turn, and commit to it at the end of that turn. As a loop, this query-then-commit discipline looks like:
|
def agent_turn(user_message, entity_graph): # Query existing state at the START of every turn current_state = entity_graph.query(subject=“User_Dog”)
response = model.generate( messages=[{“role”: “user”, “content”: user_message}], context=current_state )
# Commit any updates at the END of every turn for call in response.tool_calls: entity_graph.update(**call.params)
return response |
Wrapping Up
Through these concepts, you should now have a clearer picture of the elements that play a role in context management for agents built on language models. The lesson is a simple one: stop trying to buy a huge, 10-million-token desk. Instead, just get a normal desk, give your agent a sharp pencil, and teach it how to open the filing cabinet and optimally leverage its contents to do its job.
💸 Earn Instantly With This Task
No fees, no waiting — your earnings could be 1 click away.
Start Earning