June 28, 2026
Loop / Harness Engineering
Loop engineering is a shift from treating agents as single-prompt responders to treating them as production systems. The important work is not only writing instructions for the model. It is designing the loop that plans, acts, verifies, repairs, resumes, and stops.
Prompts still matter, but they sit inside a broader harness: repository instructions, skills, tool permissions, durable state, evaluators, hooks, and stop conditions. In that model, AGENTS.md, CLAUDE.md, SKILL.md, CI checks, traces, and approval rules become part of the product surface, not incidental setup.
For software engineers, the useful mental model is a loop with six parts: a trigger, a workspace, durable state, tooling and context surfaces, verification, and exit controls. A typical run looks like: plan → act with tools → run checks → inspect results → repair failures → persist state → repeat until a verifiable condition holds.
The practical implication is straightforward: teams with strong tests, CI, observability, approval boundaries, and repo-native guidance get more reliable agent behavior. Teams that treat agents like chatbots with one clever prompt will struggle with quality, cost control, and operational trust.
Timeline
Loop engineering timeline
A scrollable pass through the releases, guides, posts, and essays that turned loop engineering from an implementation pattern into shared vocabulary.
What production agent loops actually look like
The most representative production loop is simpler than a lot of “autonomous agents” marketing suggests. The canonical form is:
load durable instructions and state → ask the model what to do next → let it use tools → run verification → persist results → decide whether to continue. OpenAI describes the runner lifecycle as repeatedly calling the model, executing tools or handoffs, and rerunning until final_output; Codex describes long-horizon work as plan → edit → run tools → observe → repair → update docs/status → repeat; and Claude Code’s /goal keeps working across turns until a separate evaluator says the completion condition holds.
In practice, teams are now externalizing the “what good looks like” part of that loop into repository files instead of ad hoc prompts. OpenAI’s guidance is to keep prompts in code; Codex reads AGENTS.md before starting work and merges global + project-specific guidance; Claude Code reads CLAUDE.md, skills, subagents, hooks, and settings from the filesystem; and both ecosystems use skill files to package reusable workflows.
1# AGENTS.md
2
3## Done when
4- npm test passes
5- npm run lint passes
6- npm run typecheck passes
7- update docs if behavior changed
8
9## Guardrails
10- do not add dependencies without approval
11- never edit deployment configs without explicit confirmation
12- use a fresh worktree for parallel feature work
13
14## Review policy
15- run a PR-style review against main
16- summarize risky diffs before closing the task
17That pattern maps directly to current Codex and Claude Code advice. Codex says reliability improves when the agent knows what “good” looks like, including tests, lint, type checks, review expectations, and repository-specific rules; Claude Code’s best-practices guide says to give Claude a check it can run and to use subagents, hooks, skills, and repo guidance to automate and scale work.
Frameworks and tools accelerating the shift
The clearest production signal is the way Claude Code and Codex are turning loop control into product primitives. The prompt still matters, but the work now continues through goals, workflows, repo guidance, subagents, and isolated workspaces.
Claude Code
Claude Code’s /goal turns a prompt into a persistent completion condition. After each turn, an evaluator decides whether the goal has been met; if not, Claude keeps going. Auto mode reduces approval interruptions inside a turn, while /goal reduces the human’s role as the person who starts the next turn.
Dynamic workflows push this further by moving orchestration into scripts. Instead of coordinating a large migration or audit through one long chat, a workflow defines phases, branching, fan-out, and intermediate state. Agents still do the work; the workflow coordinates them.
ultracode is the shortcut for asking Claude Code to use that machinery. In a prompt, ultracode: audit every API endpoint for missing auth checks asks Claude to create and run a workflow for the task. With /effort ultracode, Claude uses xhigh reasoning and can decide when a substantive request should become one or more workflows. It is best reserved for large, ambiguous work because it spends more tokens, agents, and time.
Codex
Codex approaches the same shift through the repository. Its /goal gives the agent a durable objective and a validation loop, but AGENTS.md is the real compounding surface: build commands, lint commands, test commands, review expectations, repo conventions, and “done when” rules live there instead of being pasted into every prompt.
Subagents and worktrees provide the execution shape. Subagents split read-heavy or review-heavy work into isolated agent threads; worktrees let background changes happen in separate Git checkouts. The result is less like a single long chat and more like a small batch system wrapped around a codebase.
| Tool | Loop center | Best current use |
|---|---|---|
| Claude Code | /goal, dynamic workflows, ultracode, subagents, hooks, skills | Large tasks where Claude should create or run an explicit orchestration plan across many agents. |
| Codex | /goal, AGENTS.md, subagents, worktrees, review and validation commands | Long-running repo work where the agent should keep checking against project-defined commands and constraints. |
The shared lesson is that the prompt is no longer the only steering surface. The loop now lives in completion conditions, workflow scripts, repo instruction files, subagent boundaries, worktree isolation, permission policies, and verification commands.
How engineering workflows are changing
The most important workflow change is that engineers are moving from telling the model what to write to telling the system how to decide whether work is done. Codex’s best-practices page says not to stop at asking for a code change: ask the agent to create tests, run checks, confirm the result, and review the work; Claude Code’s best-practices guide says to give Claude a check it can run, to explore then plan then code, and to scale with subagents, hooks, multiple sessions, and adversarial review; and OpenAI’s agent-improvement cookbook explicitly closes the loop from traces to feedback to evals to proposed harness changes.
That means debugging is becoming trace debugging, not just output debugging. Teams now need to inspect which tool was called, what it saw, what permission state it inherited, how much context had accumulated, which verifier failed, and whether the agent retried intelligently or merely thrashed. LangChain’s 2026 survey shows that 89% of organizations have implemented observability for agents and 62% have detailed tracing down to individual steps and tool calls.
CI/CD is also moving inward. Instead of waiting for CI to catch errors after a human commits, teams increasingly place test/build/lint/review loops inside the agent run itself. Codex says this directly. Claude Code packages /code-review, /debug, and /loop as skills. Anthropic’s security-guidance plugin adds fast checks on each edit, deeper model review at end-of-turn, and a deeper review on commit or push. Braintrust’s production guidance goes one step further by turning failing production traces into permanent regression evals for future releases.
Pain points and risks
| Risk | How loops make it worse | What teams are doing |
|---|---|---|
| Safety and approval fatigue | If the loop can act for hours, repeated approvals become both a bottleneck and a liability. Anthropic reports users approved 93% of Claude Code permission prompts, which is exactly the condition under which humans stop reading carefully. | Auto mode, sandboxes, approval policies, pre-approval guardrails, and HITL pause/resume flows. Anthropic and OpenAI both now present approval surfaces as core runtime features. |
| Prompt injection and overeager action | A tool-using loop can read hostile files, web pages, logs, or docs, then turn them into actions with real side effects. Anthropic’s incident examples include deleting remote branches, exploring credentials, and bypassing safety checks. | Input-layer probes, transcript classifiers, containment boundaries, tight sandbox rules, and review steps before destructive actions. |
| Context bloat, token cost, and compaction failures | Long loops accumulate stale tool outputs and superseded state. Claude’s docs say context fills fast and performance degrades; Mastra calls naïve append-only history a production failure mode; a Pydantic dogfooding agent died after 26 requests and 517,132 cumulative input tokens while ~90% complete. | Externalized state, worktrees, summary checkpoints, progress files, request-size limits, prompt caching, and selective compaction. |
| Quality drift in unattended runs | An unattended loop can keep going while misunderstanding the task, silently compounding errors. LangChain’s survey says quality is the top blocker at 32%. Braintrust’s stateful-evals piece argues single-prompt evals fail once agents can create tickets, alter configs, and send messages over long periods. | Trace-based evals, LLM-as-judge plus human review, adversarial reviewers, and explicit “done when” conditions. |
What this means for engineering teams
In the short term, “agent loops replacing prompting” should be read as a shift from artisan prompt work to systems engineering. Teams are not being asked to invent magic prompts anymore. They are being asked to design workflows with persistent guidance, explicit goal conditions, work isolation, verifiers, approvals, traces, and feedback loops into CI. That is why the current discourse feels so production-oriented: it is about how software teams operate uncertain systems, not about how individuals coax prettier completions from a chat box.
The near-term org implication is that agent-loop adoption will increasingly reward platform-minded engineering teams. The teams that win will look a little more like SRE/DevTools teams than like prompt designers. They will own repo guidance, skill libraries, worktree policies, trace dashboards, eval datasets, security boundaries, and retry budgets. That framing is now visible across LangSmith’s “state of agent engineering,” Anthropic’s containment and best-practices guidance, OpenAI’s prompt-in-code deprecation, and Mastra’s recent harness/runtime releases.
Recommended next steps for engineers:
- Start with one narrow loop that has a deterministic verifier, such as “triage flaky tests,” “fix lint/type errors,” or “open small PRs where tests already exist.” Codex, Claude Code, and OpenAI all emphasize that verification is what makes autonomy useful.
- Move any recurring instructions into repo-native files like
AGENTS.md,CLAUDE.md, andSKILL.mdinstead of pasting them into chat. - Put tests, lint, typecheck, and review inside the run loop, not only after the fact in CI.
- Instrument every loop with tracing and evals, and promote production failures into reusable regression tests.
- Set hard bounds on steps, context size, permissions, and runtime, because the failure modes of loops are usually “slowly wrong,” not “instantly broken.”