#ai-agents #harness-engineering #context-engineering #prompt-engineering

Prompt vs Context vs Harness Engineering: The 3 Shifts

AI engineering moved from phrasing prompts to supplying information to controlling execution. What each layer solves and why each hit a ceiling.

Shirley · 5 min read

Three terms in two years: Prompt Engineering, Context Engineering, Harness Engineering. It looks like fashion. It's actually a ratchet. Each term took over when task complexity broke the previous one - and each corresponds to a progressively harder question:

  1. Does the model understand what you're asking?
  2. Does the model have the right information?
  3. Does the model keep doing the right thing across a long, real execution?

Trace the ratchet and you understand how AI systems went from "can chat" to "can ship." Miss it and you're optimizing the wrong layer - polishing prompts when your problem is information supply, or tuning retrieval when your problem is nobody's checking the output.


[Figure: The three shifts in AI engineering as nested layers: prompt engineering inside context engineering inside harness engineering, mapped to 2022, 2024, and 2026.]

Shift 1: Prompt Engineering - Say It Better

The founding observation, circa GPT-3: same model, different phrasing, wildly different output. "Summarize this article" gets you mush; "As a senior tech editor, summarize in three paragraphs - core claim, evidence, limitations, max 150 words each" gets you something publishable. The toolkit crystallized fast: role assignment, few-shot examples, step-by-step decomposition, output format contracts, refusal boundaries.
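The toolkit above can be sketched as a small prompt builder. This is a minimal illustration, not a standard API: `PromptSpec` and `build_prompt` are hypothetical names, and the section labels ("Task:", "Output format:") are one arbitrary convention among many.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the toolkit: role, examples, steps, format, refusal."""
    role: str
    task: str
    examples: list[tuple[str, str]] = field(default_factory=list)  # (input, output) few-shot pairs
    steps: list[str] = field(default_factory=list)                 # step-by-step decomposition
    output_format: str = ""                                        # output format contract
    refusal_rule: str = ""                                         # refusal boundary


def build_prompt(spec: PromptSpec, user_input: str) -> str:
    """Assemble one prompt string from the spec; empty sections are skipped."""
    parts = [f"You are {spec.role}.", f"Task: {spec.task}"]
    if spec.steps:
        numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(spec.steps, 1))
        parts.append("Work through these steps:\n" + numbered)
    for example_in, example_out in spec.examples:
        parts.append(f"Example input:\n{example_in}\nExample output:\n{example_out}")
    if spec.output_format:
        parts.append(f"Output format: {spec.output_format}")
    if spec.refusal_rule:
        parts.append(f"If you cannot comply: {spec.refusal_rule}")
    parts.append(f"Input:\n{user_input}")
    return "\n\n".join(parts)


spec = PromptSpec(
    role="a senior tech editor",
    task="Summarize the article.",
    steps=["State the core claim.", "Summarize the evidence.", "List the limitations."],
    output_format="Three paragraphs, max 150 words each.",
    refusal_rule="say which part of the article is missing instead of guessing.",
)
prompt = build_prompt(spec, "<article text here>")
print(prompt)
```

Nothing here adds information the model lacks. It only changes how the ask is expressed, which is the whole scope of this layer.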

Why it works is worth being precise about: an LLM is a context-sensitive probability machine. A role shifts the sampling distribution toward text associated with that persona in the training data. Examples establish a pattern to continue. Constraints make compliant outputs more likely. Prompting isn't commanding - it's shaping the probability space the answer gets drawn from. The skill that mattered was language design: knowing the model's temperament.

The ceiling: prompts can't conjure facts. "Analyze our internal architecture doc" fails no matter how perfect the phrasing if the doc isn't in front of the model. Prompt engineering solves the expression problem. It cannot solve the information problem - and most real tasks are information problems wearing expression-problem costumes. The moment work shifted from open-ended Q&A to "do something with my data," the center of gravity moved.



Shift 2: Context Engineering - Feed It Better

New default assumption: the model probably doesn't know - the system must deliver the right information at call time. And the question set changed shape entirely: What does the model currently see? What's missing? What should be summarized versus quoted versus excluded? What does this module need to see that that one shouldn't?

What forced the shift was agents. A chat turn is one prompt; an agent run can be ~50 tool calls, each spraying results, errors, and state into a finite window. Context became a managed resource with real failure modes - rot, poisoning, distraction, clash - and a real discipline grew around managing it.
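Treating the window as a managed resource can be as simple as a budget. The sketch below is one possible policy, not a recommended one: pinned items always stay, and the newest unpinned items are kept until the budget runs out. `Item`, `rough_tokens`, and `fit_window` are hypothetical names, and counting words as tokens is a loud simplification.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One object in the window: a prompt, a tool result, an error, and so on."""
    kind: str
    text: str
    pinned: bool = False  # pinned items (e.g. the task prompt) are never dropped


def rough_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. Real tokenizers differ.
    return len(text.split())


def fit_window(items: list[Item], budget: int) -> list[Item]:
    """Keep pinned items plus the newest run of unpinned items that fits the budget."""
    used = sum(rough_tokens(i.text) for i in items if i.pinned)
    keep = {id(i) for i in items if i.pinned}
    for item in reversed(items):  # walk from newest to oldest
        if item.pinned:
            continue
        cost = rough_tokens(item.text)
        if used + cost > budget:
            break  # everything older than this is dropped too
        keep.add(id(item))
        used += cost
    return [i for i in items if id(i) in keep]  # original order preserved


history = [
    Item("prompt", "Fix the failing test in utils", pinned=True),
    Item("tool_result", "old listing " * 10),
    Item("error", "ImportError no module named foo"),
    Item("tool_result", "test output two failures remain"),
]
window = fit_window(history, budget=20)
print([i.kind for i in window])
```

A real system would more likely summarize the dropped items than discard them, but the point stands: something has to decide what the model sees on each call.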

Two landmark practices define the era. RAG answered "how do facts the model never trained on get in?" - retrieve, rank, inject, with all the craft living in chunking and reranking. Agent Skills answered the subtler capability-overload problem with progressive disclosure: a ~50-token metadata layer always loaded, ~500-token instructions loaded on trigger, scripts and references loaded only at execution. Need-to-know, applied to machine attention.
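Progressive disclosure is easier to see in code than in prose. The sketch below models the three tiers described above with a hypothetical `Skill` class and `visible_context` function; it is not the actual Agent Skills file format or loader, just the loading rule in miniature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    summary: str                       # tier 1: always in the window
    instructions: str                  # tier 2: loaded when the skill is triggered
    load_reference: Callable[[], str]  # tier 3: fetched only at execution time


def visible_context(skills, triggered=frozenset(), executing=frozenset()) -> str:
    """Return what the model would see, given which skills are triggered or executing."""
    lines = []
    for s in skills:
        lines.append(f"[skill] {s.name}: {s.summary}")
        if s.name in triggered or s.name in executing:
            lines.append(s.instructions)
        if s.name in executing:
            lines.append(s.load_reference())  # the expensive material loads last
    return "\n".join(lines)


calls = []


def load_pdf_reference() -> str:
    calls.append("pdf")  # record the load so we can see when it happens
    return "reference: field-name table"


skills = [
    Skill("pdf-forms", "Fill PDF forms.",
          "Step 1: list fields. Step 2: fill values.", load_pdf_reference),
    Skill("changelog", "Write release notes.",
          "Group commits by type.", lambda: "reference: commit type list"),
]

idle = visible_context(skills)
on_trigger = visible_context(skills, triggered={"pdf-forms"})
running = visible_context(skills, triggered={"pdf-forms"}, executing={"pdf-forms"})
print(idle)
```

With nothing triggered, the window holds two one-line summaries and nothing else. The cost of having a capability available is decoupled from the cost of using it.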

Worth saying plainly: context engineering contains prompt engineering - the prompt is just one (curated) object in the window. The layers nest; they don't compete.

The ceiling: perfect inputs, unsupervised execution. The model has every fact and still: plans well then drifts at step 7, misreads a tool result and builds on the misreading, errors at step 3 and compounds it through step 30, reports confident completion on work that doesn't run. Input quality was never the whole game - because nobody was watching the work happen.


Shift 3: Harness Engineering - Control the Run

The word is literal: a harness is the rigging that turns an animal's raw power into directed, recoverable work. LangChain's definition is the cleanest in circulation:

Agent = Model + Harness. Harness = Agent − Model.

Everything around the weights: what the model sees (context), what it can touch (tools), how steps sequence, what persists (state and memory), who checks the output (evaluation), and what happens at failure (constraints, recovery). It's the answer to questions the first two layers never asked: who supervises? who verifies? who pulls it back on course?

The hiring analogy lands best. You brief a new hire on an important client visit (prompt). You hand them the account history and pricing sheet (context). But if the meeting matters, you also send a checklist, require a call at key milestones, review the recording after, and check results against criteria. That last bundle - nobody skips it for high-stakes work with humans. Harness engineering is refusing to skip it for agents.

And the receipts arrived fast. The detailed case studies live in the production practices guide, but the headline: OpenAI ran a near-million-line production app where agents wrote 100% of the code and the humans engineered the environment; Anthropic got Claude running unattended for hours via fresh-context resets and independent evaluator agents.

Top 30 → Top 5: LangChain's Terminal Bench jump - same model, only the harness changed. Same weights, different rigging, different league.


The Ratchet, In One Table

                   | Prompt Eng.         | Context Eng.                | Harness Eng.
-------------------|---------------------|-----------------------------|---------------------------------------
Object             | The instruction     | The input environment       | The execution system
Core question      | Did I say it clearly? | Does it see the right info? | Does it keep doing it right?
Failure it fights  | Misunderstanding    | Missing/noisy knowledge     | Drift, error compounding, false "done"
Era trigger        | GPT-3 chat          | RAG + early agents          | Long-running autonomous work
Skill that matters | Language design     | Information architecture    | Systems + verification design

Each layer contains the previous: the prompt is an object inside the context; the context pipeline is a subsystem inside the harness. Which is why nothing here is obsolete - a sloppy prompt still hurts inside the best harness ever built. The layers are floors of one building, and complexity is the elevator.

Diagnostic, for daily use: output misunderstands the ask → prompt problem. Output is fluent but wrong or stale → context problem. Output starts right and degrades across steps, or claims success falsely → harness problem. Most "the model is bad" complaints are a mislabeled floor.
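The diagnostic reads naturally as a triage function. This is only a restatement of the rule above in code, with hypothetical names (`Layer`, `diagnose`); deciding which symptoms are actually present is still a human judgment.

```python
from enum import Enum


class Layer(Enum):
    PROMPT = "prompt"
    CONTEXT = "context"
    HARNESS = "harness"


def diagnose(
    misreads_the_ask: bool = False,
    fluent_but_wrong_or_stale: bool = False,
    degrades_across_steps: bool = False,
    claims_false_success: bool = False,
) -> list[Layer]:
    """Map observed symptoms to the layer(s) worth fixing, innermost first."""
    layers = []
    if misreads_the_ask:
        layers.append(Layer.PROMPT)
    if fluent_but_wrong_or_stale:
        layers.append(Layer.CONTEXT)
    if degrades_across_steps or claims_false_success:
        layers.append(Layer.HARNESS)
    return layers


print(diagnose(fluent_but_wrong_or_stale=True))
print(diagnose(degrades_across_steps=True, claims_false_success=True))
```

An empty result means none of the three symptoms showed up, which is the one case where blaming the model might be fair.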

The arc of all three shifts compresses to one sentence: the engineering moved from talking to the model, to informing the model, to building the machine around the model - and the next article breaks that machine into its six load-bearing components. The model is the engine. Engines don't win races. Cars do.
