added report

2026-04-20 00:50:49 +05:00
parent f00f32d15f
commit 73da69cb0c
3 changed files with 167 additions and 0 deletions
@@ -0,0 +1,162 @@
+# Why Harness is Important
+
+Why Claude Code's features are not "just prompts in disguise" — and why the harness is what actually separates toy output from production-grade engineering work.
+
+<table width="100%">
+<tr>
+<td><a href="../">← Back to Claude Code Best Practice</a></td>
+<td align="right"><img src="../!/claude-jumping.svg" alt="Claude" width="60" /></td>
+</tr>
+</table>
+
+---
+
+## Executive Summary
+
+A common reduction among experienced Claude Code users is: *"skills, commands, subagents, hooks — they all eventually become prompts to the model, so a strong prompt alone is equivalent."*
+
+At the layer of the model's final inference call, this is technically true. The model only ever sees tokens.
+
+At every other layer — the one where real software engineering happens — **this reduction collapses.** The harness is not a prompt delivery system. It is a **prompt construction system, a deterministic execution system, and a context architecture system** — and those capabilities cannot be replaced by stronger wording.
+
+This report explains where the reduction is right, where it fails, and why confusing "what the model sees" with "what the system does" leads practitioners away from the features that give Claude Code its actual leverage.
+
+---
+
+## The Reduction That Sounds Right
+
+For a **single-shot atomic task** — "write me a recursive Fibonacci function" — the harness contributes nothing to output quality. Hand the same tokens to the same model and you get the same distribution of outputs whether they arrived via a skill, a command, or a raw prompt.
+
+In this narrow regime, the reduction holds:
+
+> Output quality ≈ Prompt quality
+
+This is the regime where Claude Code offers little value over a plain chatbot. It is also the regime the reduction implicitly assumes — and precisely the regime real engineering work is not in.
+
+---
+
+## Where the Reduction Breaks Down
+
+Ten architectural capabilities of the harness operate at layers where prompts have no access.
+
+| # | Capability | What it does | Why a prompt can't replicate |
+|---|------------|--------------|-------------------------------|
+| 1 | **Context isolation** | Subagents run in separate context windows | A prompt fills one window. N parallel subagents give ~N× effective context. |
+| 2 | **Harness-enforced tool restrictions** | `allowed-tools` / `disallowedTools` block tools before the model can use them | Prompt instructions are advisory; the model can ignore them. Deny rules cannot be ignored. |
+| 3 | **Lazy-loaded rules & memory** | `paths:` frontmatter and descendant `CLAUDE.md` files load only when Claude touches matching paths | A prompt is static — it cannot conditionally load based on which files are being read at runtime. |
+| 4 | **Hooks: deterministic code execution** | Shell commands run at lifecycle events (PreToolUse, PostToolUse, Stop, etc.) and can **block** tool calls | A prompt cannot intercept its own tool calls. Hooks execute even if the model doesn't "want" them to. |
+| 5 | **Model routing** | `model: haiku` or `model: opus` routes a call to a different model endpoint | No token in the prompt can change which model answers. |
+| 6 | **Parallelism** | Multiple subagents execute concurrently | A prompt is sequential. The harness schedules and collects results from parallel processes. |
+| 7 | **Cross-session persistence** | Memory system and settings hierarchy persist across conversations | A prompt dies when the session ends. |
+| 8 | **Modular system prompt** | The CLI loads 110+ system prompt fragments conditionally based on features activated | A user cannot hand-author or swap in the CLI's internal prompt fragments. |
+| 9 | **Skill preloading** | `skills:` field injects a skill's full content into a subagent's starting context | The user cannot pre-stuff another agent's context — only the harness loader can. |
+| 10 | **Permission classification** | `auto` permission mode uses a background classifier to pre-approve or block tool calls | A prompt cannot add a pre-execution safety layer to itself. |
+
+Each row is a dimension where "strong wording" is categorically not a substitute.
+
+---
+
+## The Two Uses of the Word "Prompt"
+
+The reduction trades on an equivocation. The word *prompt* is used to mean two very different things:
+
+| Meaning | Who controls it | Size |
+|---------|-----------------|------|
+| (a) What the user typed | The user | ~6–60 tokens |
+| (b) What the model sees at inference | The harness | ~5,000–50,000+ tokens |
+
+In a chatbot, (a) and (b) are the same thing.
+In Claude Code, they are radically different.
+
+The harness's job is precisely to make (b) much richer than (a):
+
+```
+User types: "write a recursive flatten function"   ← (a) ~6 tokens
+
+What the model actually sees at inference:         ← (b) ~15,000 tokens
+  ├── CLAUDE.md (project conventions)
+  ├── Matching .claude/rules/*.md (loaded via paths: frontmatter)
+  ├── Modular system prompt fragments
+  ├── Tool definitions
+  ├── Environment context (cwd, git status, platform)
+  ├── Prior turn history
+  ├── Files read by the model via Read/Grep tools
+  └── User's 6-token request
+```
+
+**Output quality is a function of (b), not (a).** The harness constructs (b). A "strong prompt alone" cannot reproduce (b) because most of it isn't written by the user.
+
+---
+
+## Even for Output Quality, the Harness Is Doing Work
+
+Consider the same prompt — "write a recursive flatten function" — in three environments:
+
+| Environment | What the model sees | Typical result |
+|-------------|---------------------|----------------|
+| Chatbot, no tools | The sentence | Textbook recursion, generic style |
+| Claude Code, no reading | Sentence + CLAUDE.md | Matches declared project conventions |
+| Claude Code, agentic loop | Sentence + CLAUDE.md + read adjacent files + run tests | Matches actual codebase patterns, passes tests, handles edge cases the existing code handles |
+
+Same model. Same user prompt. **Three different output qualities.** The difference is the harness — specifically, the effective context it assembles and the iteration loop it enables.
+
+For non-trivial tasks, output quality is a function of:
+
+```
+Output quality = f(effective context, model capability, iteration loop)
+```
+
+The user controls a sliver of *effective context* (their typed prompt). The harness controls the rest — and the iteration loop entirely.
+
+---
+
+## What the Reduction Gets Right (And What It Gets Wrong)
+
+| Claim | Verdict |
+|-------|---------|
+| "At inference, the model only sees tokens." | ✅ True |
+| "Skills, commands, and subagent prompts all contribute tokens to some context." | ✅ True |
+| "For an atomic task in a vacuum, prompt quality dominates output quality." | ✅ True |
+| "Therefore a strong prompt is equivalent to using features." | ❌ False |
+| "Therefore the harness doesn't matter for output quality." | ❌ False on real engineering tasks |
+
+The first three statements are accurate observations. The leap to the fourth is where the reasoning fails: it conflates the model with the system that wraps it, and conflates atomic tasks with real engineering work.
+
+---
+
+## The Correct Mental Model
+
+> **Prompts control what the model is asked to do.**
+> **The harness controls what the system does at layers the model cannot reach** — before tokens arrive, after tokens are produced, across sessions, across contexts, and across processes.
+
+Features are not prompts with extra steps. They are **harness-level primitives** — deterministic execution, context architecture, and infrastructure routing — that operate at layers where the model has no voice.
+
+A useful analogy:
+
+| Layer | Chatbot | Claude Code |
+|-------|---------|-------------|
+| Recipe | The user's message | The user's message + harness-assembled context |
+| Kitchen | None — just a student | Tools, hooks, memory, parallel workers, lifecycle events |
+
+You can write the world's best recipe. Without a kitchen, you cannot cook at scale.
+
+---
+
+## Takeaways for Practitioners
+
+1. **For atomic questions, prompt quality is ~everything.** The harness is irrelevant. Use a chatbot if that's all you need.
+2. **For real codebase work, the harness is doing silent heavy lifting.** The effective prompt at inference is mostly harness-constructed, not user-typed.
+3. **Use features for what prompts categorically cannot do:** determinism (hooks), isolation (subagents), lazy loading (rules with `paths:`), persistence (memory), routing (per-agent `model:`), and parallelism.
+4. **A strong prompt is necessary but not sufficient.** Features give you determinism, isolation, and composition that prompts cannot. The two are complementary, not substitutes.
+
+---
+
+## Sources
+
+- [Agents vs Commands vs Skills](claude-agent-command-skill.md) — shows context isolation, model override, and tool restrictions per feature
+- [Claude Agent SDK vs CLI System Prompts](claude-agent-sdk-vs-cli-system-prompts.md) — documents the 110+ modular system prompt fragments
+- [Claude Agent Memory](claude-agent-memory.md) — cross-session persistence via `memory:` scopes
+- [Claude Memory Best Practice](../best-practice/claude-memory.md) — lazy-loaded descendant `CLAUDE.md` files
+- [Claude Subagents Best Practice](../best-practice/claude-subagents.md) — frontmatter reference for harness-enforced capabilities
+- [Claude Settings Best Practice](../best-practice/claude-settings.md) — permission rule evaluation and `auto` mode classifier
+- [Orchestration Workflow](../orchestration-workflow/orchestration-workflow.md) — concrete demonstration that the reduction fails