Context Engineering: The Complete Guide (2026)
Agents burn 100 input tokens per output token. The 4 management strategies, 4 failure modes, and the KV-cache rules that cut input costs 10x.
Course outline · AI Agents (3.7)
Manus measured their production agents: 100 input tokens for every 1 output token. Read that again. The model's "intelligence" in an agent system is mostly a function of those 100 tokens - what got included, what got cut, how it's arranged. The Karpathy framing: the LLM is a CPU, the context window is RAM, and you are the operating system deciding what loads into memory.
100 : 1
input-to-output token ratio in production agents - context quality IS agent quality
Prompt engineering asks "how do I phrase the ask?" Context engineering asks "what does the model see at this exact moment?" The second question is where production agents live or die. Here's the field in one pass: why context degrades, the four management strategies, the four failure modes, and the cache economics underneath it all.
Why Long Context Degrades: Context Rot
Bigger windows didn't fix this. Two structural reasons:
Attention is quadratic and finite. Every token attends to every other token - n² relationships. More tokens means each one gets a thinner slice of attention. Anthropic's framing: context is a resource with diminishing marginal returns. Each appended token spends attention budget.
Performance decays before the window fills. The measured phenomenon is called context rot. Galileo's benchmark: GPT-4o at 98.1% accuracy on short contexts dropped to 64.1% at scale - same model, same task, more tokens. This affects every transformer. It's a gradient, not a cliff, but a 34-point slide will absolutely ruin your agent's afternoon.
And tool definitions compound it. Berkeley's function-calling leaderboard found a quantized Llama 3.1 8B that worked fine with 19 tools and fell apart at 46. The window had room; the model's discrimination didn't. Complexity, not length, was the killer.
What "Context" Actually Includes
Wider than most people draw it - four layers, each failing differently:
| Layer | Contains | Failure mode |
|---|---|---|
| Instructions | System prompt, rules, few-shot examples | Bloated rules dilute everything downstream |
| Knowledge | Retrieved docs, user prefs, facts | Too much = lost model; too little = confabulation |
| Tools | Definitions, call results, errors | Tool overload (the 46-tool collapse) |
| History | Prior messages, decisions, trajectory | Stale conclusions steering current work |
Context engineering is budget allocation across these four. Spend where the task needs it; cut everywhere else.
The Four Strategies

1. Offloading - the filesystem is free memory
Don't carry what you can fetch. Manus's pattern: scrape a webpage, keep a 500-token summary plus the URL, drop the 50,000-token body. 100:1 compression, fully reversible - the agent re-fetches if it actually needs the details.
The other offloading move: the agent maintains a todo.md and re-reads it before each step. External memory doubles as an attention anchor - rewriting current goals into the recent end of context keeps long tasks from drifting. Same reason you write a checklist instead of memorizing it.
Free AI Builder Newsletter
Weekly guides on AI tools & builder strategies.
2. Retrieval - just-in-time beats preloading
Classic RAG retrieves once, up front. Agentic retrieval hands the model lightweight pointers - file paths, query templates, URLs - and lets it pull what each step needs. Claude Code is the reference implementation: it never loads your database into context; it writes a query, stores results, reads slices with head. The pointer is cheap; the data stays on disk until summoned.
Hybrid is the practical answer: preload the stable essentials (CLAUDE.md), let the agent explore the rest. Where you draw the line is the craft.
3. Isolation - don't share everything with everyone
The trap in multi-agent systems: a giant shared context everyone reads. Now every agent wades through everyone else's noise and bills for the privilege.
Treat inter-agent communication as interface design: a research sub-agent returns verified findings and decision points - not its browsing history, dead ends, or internal monologue. Anthropic's multi-agent research system proved the economics: separate contexts per sub-agent cost 15x the tokens but delivered 90.2% better task performance. Isolation is what you're buying. (This is the entire thesis of sub-agents.)
4. Compression - shrink without losing the constraints
When the window fills, summarize and restart - Claude Code triggers this automatically at 95% utilization. The subtle art is what survives compression. Importance is lagged: the detail you drop at step 5 becomes load-bearing at step 50. Priority goes to facts that constrain future action - what failed, what got created, what's been ruled out, what's still unknown.
Cheapest variant: clear old tool results. A file dump from 40 turns ago is almost never needed verbatim - it's the lowest-risk deletion in the window.
The Four Failure Modes
Name them and you'll start seeing them everywhere:
- Poisoning - one hallucination enters context and compounds. DeepMind's Gemini-plays-Pokémon logged the canonical case: a false game-state claim landed in the goal section, got re-read every turn, and the agent chased an impossible objective for dozens of cycles. Errors that enter context self-confirm.
- Distraction - past ~100K tokens, models start pattern-matching their own history instead of reasoning fresh. Gemini 2.5's report describes the agent repeating past actions rather than planning. Databricks found worse: distracted models default to summarizing the context - ignoring your instruction entirely.
- Confusion - irrelevant material degrades output even when it's ignorable. The 46-tool collapse, again. Models don't ignore noise; they pay attention tax on it.
- Clash - contradictions across turns derail reasoning. Microsoft/Salesforce sharded benchmark tasks across multi-turn conversations: average performance fell 39%. Their line: "when LLMs take a wrong turn in a conversation, they get lost and do not recover."
The Money Layer: KV-Cache
Agent contexts share a shape - stable prefix, growing tail - and inference engines cache the prefix computation. Claude Sonnet pricing makes the stakes plain: cached input $0.30/M tokens, uncached $3.00/M. At a 100:1 input ratio, cache hit rate basically is your unit cost. Three rules:
- Freeze the prefix. One changed token invalidates everything after it.
WARNING
The classic self-inflicted wound: a timestamp in the system prompt. The model can now answer "what time is it" - and your cache hit rate is zero. That one line is a 10x cost increase.
Two Counterintuitive Practices
Leave failures in. The instinct after a failed tool call is to clean up and retry on a fresh slate. Wrong - the error trace is evidence. Models that see their own failure update away from repeating it; models with scrubbed history walk into the same wall. Manus treats error recovery as one of the clearest signals of real agentic behavior, and notes benchmarks systematically under-measure it.
Inject structured variety. A context full of similar action-observation pairs becomes accidental few-shot pressure - the model mimics the pattern even when the situation changed. Varying serialization templates and phrasing breaks the rhythm.
When to Bother
If your agent is single-turn Q&A: don't. Strategy overhead beats no strategy. The escalation triggers: context regularly past ~30K tokens, more than ~20 tools, or 1,000+ sessions/day. Then add strategies in order of pain - usually compression first, retrieval second, isolation when you go multi-agent.
One more trendline worth holding: stronger models make context engineering more valuable, not less - capability unlocks longer tasks, longer tasks mean more context pressure. The window is RAM. The model is the CPU. You're the OS - and the OS is where agents are won.
Continue Learning
Mastering AI Agents
The builder's deep dive into agent loops, tools, context engineering & memory — from using AI to building it.
AI Agent 101
Build autonomous research agents with tool use, API access, web scraping, and deep search.
Build Production LLM Apps
Build production-ready LLM apps with real-time APIs, multimodal models, and agentic workflows.
AI Builder Club
Courses, workshops, and a builder community for shipping with AI agents, Claude Code, and more.
Get the free newsletter
Weekly deep-dives on AI tools, automation workflows, and builder strategies. Join 5,000+ readers.
No spam. Unsubscribe anytime.