Anthropic’s Blueprint for Long-Running AI Agents


Article based on a video by The AI Automators.

Building AI agents that run for hours without forgetting their progress or derailing into loops feels impossible with today’s models. Anthropic’s three-agent harness cracks this by structuring amnesia-prone sessions into reliable, incremental workflows that autonomously code full games and apps.

What Are Long-Running AI Agents and Why Do They Fail?

Long-running AI agents are autonomous systems designed for multi-hour tasks like building full apps or running compliance audits, but they hit a wall because of amnesia—each new session starts with a blank slate, no memory of prior work.[1][3][5]

You’d think slapping a big model on it would fix things, but naive setups just burn through context windows. The agent ignores its own code from earlier, exhausts tokens on fluff, or loops forever without handing off cleanly.[3][4][5] Picture trying to code a retro game engine over 6 hours: without structure, it confidently drifts wrong, compounding subtle bugs that look plausible.[1]

Anthropic nails it—harnesses matter as much as the model itself. These are external scaffolds using artifacts, logs like `claude-progress.txt`, and init scripts to bridge sessions.[1][2] No harness? No dice on complex stuff like full-stack coding.

Here’s why they flop without it:

  • Context overload: Fresh windows mean rehashing everything, leading to endless repetition or forgotten decisions—studies show velocity spikes short-term but tech debt skyrockets later.[3]
  • No structured handoffs: Single agents chase their tail; multi-agent setups (planner, generator, evaluator) decompose tasks into chunks, grade outputs objectively, and iterate with feedback loops.[1][2][5] One example: a three-agent harness turns a vague app spec into 200+ features, marking progress commit-by-commit.
  • Real-world stakes: Agents ace bounded jobs like test generation (clear inputs, verifiable outputs), but on open-ended ones they get only 60-70% of the way before humans must step in for UX edge cases.[1] Deployments fail on runtime crashes or ignored memory, especially in user-facing apps.[2][4]

Honestly, the fix isn’t fancier models—it’s these harnesses enforcing incremental wins. Fazm’s approach of 10-20 minute human-checkpointed tasks proves reliability trumps unchecked ambition.[1]

Anthropic’s Three-Agent Harness: The Core Blueprint

AI agents hit a wall when tasks run longer than a few minutes. The problem isn’t the model—it’s that agents can’t see their own mistakes, and they lose context when starting fresh.[5] Anthropic’s answer is a three-agent harness that splits work into distinct roles: planning, generation, and evaluation.[1]

Planner Agent: Decomposing Ambitious Scope

The planner takes a vague prompt and expands it into a detailed product specification, then breaks it into ordered stories—the way a product manager would.[2] For something like a Claude.ai clone, this might generate 200+ features, all marked as “failing” initially.[1] This upfront specification matters: an agent building against clear acceptance criteria produces dramatically better output than one iterating on vague feedback.[2]

The key insight is preventing under-scoping. By externalizing the planning phase and getting human approval before coding starts, you avoid the agent getting stuck halfway through or abandoning work early.[1]
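
The planner's output can be captured in a machine-readable spec the other agents consume. A minimal sketch in Python (the schema and field names are assumptions for illustration, not Anthropic's actual format):

```python
import json

def make_feature_spec(stories):
    """Turn the planner's ordered stories into a spec where every
    feature starts out marked "failing", per the article's pattern."""
    return {"features": [
        {"id": i, "story": s, "status": "failing"}
        for i, s in enumerate(stories, start=1)
    ]}

spec = make_feature_spec([
    "User can start a new conversation",
    "Responses stream token by token",
    "Conversations persist across reloads",
])
print(json.dumps(spec, indent=2))
```

Because every feature begins as "failing", progress is just the count of statuses flipped to "passing", and that count survives any context reset.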

Generator Agent: Incremental, Artifact-Driven Progress

The generator implements one story at a time, writing tests alongside code.[2] It doesn’t evaluate—that’s someone else’s job. Each session ends with clean artifacts: Git commits, passing tests, a working app. This matters because the next agent (or human) needs to understand what happened without reconstructing the thinking process.[2]

Context resets happen naturally between sessions. Instead of trying to preserve everything (which makes models cautious about approaching limits), the harness uses structured handoffs—JSON specs, commits, progress logs—so the next session knows exactly where to start.[1]
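
A structured handoff can be as simple as an append-only log plus a resume step. A hedged sketch (the file name comes from the article; the JSON record layout is my assumption):

```python
import json
from pathlib import Path

PROGRESS = Path("claude-progress.txt")  # log name from the article

def end_session(last_feature, commit_sha, notes):
    """Append a handoff record so the next session knows exactly where to start."""
    record = {"last_feature": last_feature, "commit": commit_sha, "notes": notes}
    with PROGRESS.open("a") as f:
        f.write(json.dumps(record) + "\n")

def start_session():
    """Load the most recent handoff instead of rehashing the whole history."""
    if not PROGRESS.exists():
        return {"last_feature": 0, "commit": None, "notes": "fresh start"}
    return json.loads(PROGRESS.read_text().strip().splitlines()[-1])

end_session(12, "a1b2c3d", "login flow green; next: session persistence")
state = start_session()
print(f"Resume after feature {state['last_feature']} at commit {state['commit']}")
```

The next session reads one line, not the whole transcript, which is exactly what keeps the fresh window lean.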

Evaluator Agent: Objective Grading Without Rationalization

Here’s the brutal truth: Claude talks itself into approving mediocre work.[5] It identifies real problems, then minimizes them (“it’s pretty good for a first pass”). The evaluator fixes this by being ruthlessly skeptical—testing the generator’s output against a pre-agreed Definition of Done, using tools like Playwright to actually interact with the running app.[2]

The evaluator doesn’t make excuses. If something fails, it fails.[2] This separation—generator creates, evaluator critiques—mirrors how GANs work in machine learning: one agent pushes forward, the other pushes back, and the loop produces better results.[3]
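
In code, the evaluator's job reduces to a hard pass/fail gate. A sketch with stand-in boolean checks (a real harness would drive Playwright against the running app; the check names here are invented):

```python
def evaluate(feature, checks):
    """Grade against a pre-agreed Definition of Done: every check passes,
    or the feature fails. No partial credit, no rationalizing."""
    failures = [name for name, passed in checks.items() if not passed]
    return {
        "feature": feature,
        "verdict": "pass" if not failures else "fail",
        "failures": failures,
    }

result = evaluate("streaming responses", {
    "unit_tests_green": True,
    "app_boots": True,
    "ui_streams_tokens": False,  # a real problem: report it, don't excuse it
})
print(result["verdict"], result["failures"])
```

Keeping the verdict binary is the point: there is no field for "pretty good for a first pass."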

Session Flow: Initialization to Handoff

Each session starts with an initializer agent that sets up the environment: Git repos, progress logs, initial commits.[1] Subsequent agents inherit this state and focus on tractable chunks. When a session ends, it leaves clean artifacts—not scattered thinking traces, but concrete progress: code, tests, specs.[4]

The four-layer memory model keeps amnesia in check: working context (current task), session logs (what happened), external artifacts (code, specs), and compaction (summarized state for the next window).[4] This prevents the agent from looping or forgetting decisions already made.
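
The four layers map naturally onto a small state object. A sketch, assuming a trivial "keep the last few log lines" compaction rule (a real harness would summarize with the model itself):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working_context: str = ""                          # layer 1: current task
    session_log: list = field(default_factory=list)    # layer 2: what happened
    artifacts: dict = field(default_factory=dict)      # layer 3: code, specs, commits
    compaction: str = ""                               # layer 4: summary for next window

    def compact(self):
        """Collapse the session log into a short summary before the window resets."""
        self.compaction = "; ".join(self.session_log[-3:])
        self.session_log.clear()

mem = AgentMemory(working_context="implement feature 13: offline earnings")
mem.session_log += ["wrote failing test", "implemented handler", "tests green"]
mem.artifacts["commit"] = "d4e5f6a"
mem.compact()
print(mem.compaction)
```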

The result: multi-hour workflows where each agent knows its role, output quality stays high, and humans only review what matters.[1]

Overcoming Failure Modes: Scaling Principles and Tools

Building AI agents that run for hours means hitting walls like context overflow or endless loops. The fix? Smart harnesses and principles that keep state intact across sessions.[1][4][7]

Common Pitfalls

Poor scoping leaves agents underprepared—think skipping the 200+ feature list a full app clone needs, with each item marked “failing” upfront.[1] Ignored memory causes amnesia in new windows, sparking loops where agents repeat work.[2][4][5] Context overflow drowns reasoning in noise, while looping behaviors waste cycles on the same dead ends. Honestly, I’ve seen this tank 6-hour coding runs that lack compaction.[1]

Failure reflection systems spot these early, paired with architecture ceiling tests to reveal harness limits before scaling.[3][7] In practice, enforced testing and Git loops catch 80% of regressions commit-by-commit.[1][2][5]

Nine Scaling Principles

These keep agents reliable as tasks grow. State persistence uses initializer agents for clean handoffs, like `claude-progress.txt` logs and `init.sh` scripts.[1][4][7] Attention budgets allocate focus to avoid dilution.

Cache stability prevents volatile external memory from derailing runs.[3] Retrieval triggers pull just-in-time info, dodging overload. Others: compaction schemas for chats, objective evaluators, and multi-layer memory (working context, sessions, artifacts).[2][4][5] One data point: tool-heavy setups cut multi-agent efficiency by 33% when these safeguards are missing.[2]
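
Retrieval triggers are straightforward to sketch: index artifacts by a short summary and pull only what the current task mentions, rather than loading everything into the window. The index contents here are invented for illustration:

```python
def retrieve(task_keywords, artifact_index):
    """Just-in-time retrieval: return only artifacts whose summary matches
    the current task, keeping the context window lean."""
    return sorted(
        path for path, summary in artifact_index.items()
        if any(kw in summary for kw in task_keywords)
    )

index = {  # hypothetical artifact summaries
    "src/audio/mixer.py": "mixing channels and gain levels",
    "src/ui/transport.py": "play pause stop transport controls",
    "src/storage/projects.py": "saving and loading project files",
}
print(retrieve(["transport", "gain"], index))
```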

Tools Integration and Monitoring

Git loops and summarization schemas handle long convos, ensuring incremental wins.[1][2][5] Trace observability logs every step; calibrate scoring for evaluators to grade objectively.[1][2][5] Decompose for parallel subagents—planner breaks tasks, generator codes, evaluator iterates.[1][7]

Evolving Design

Maximize a single-agent setup first; go multi-agent only after it hits a ceiling.[4][5] Swap ReAct for plan-and-execute—it’s 2x faster for long horizons, per Anthropic blueprints.[1][4] Pilot small and learn from the flops, as in the 4-hour audio workstation builds.[1]

How to Build and Implement the Harness Yourself

Building a harness for AI agents lets them tackle multi-hour tasks without losing track, like bridging sessions with no built-in memory. Think of it as the skeleton that keeps everything coherent—honestly, it’s as crucial as the model powering it.[1][4]

Step 1: Prompt Initializer for Git Repo, Feature Spec, Progress Log

Kick off with a single-agent initializer to set up your environment. It creates a Git repo, generates a massive feature spec (like 200+ items for a Claude.ai clone, all marked “failing” at first), and starts a progress log like `claude-progress.txt` or `init.sh` scripts.[1][4]

Use frameworks such as BMAD or SpecKit for structured plans, especially once you split off from a single agent at 10+ tools or across domain shifts. For non-engineers, grab the 12 prompt templates for state analysis and context audits—they handle compaction and external memory to fight “amnesia.”[3]

Every session ends clean: incremental commits ensure the next one picks up seamlessly.
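
The initializer step can be scripted directly. A sketch (file names follow the article; the repo layout and commit flow are my assumptions, and `git` must be on the PATH):

```python
import subprocess
from pathlib import Path

def initialize(workdir):
    """One-time setup: Git repo, init script, progress log, first clean commit."""
    root = Path(workdir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "init.sh").write_text("#!/bin/sh\n# per-session environment setup\n")
    (root / "claude-progress.txt").write_text("session 0: repo initialized\n")
    git = ["git", "-c", "user.name=harness", "-c", "user.email=harness@local"]
    subprocess.run(git + ["init"], cwd=root, check=True)
    subprocess.run(git + ["add", "-A"], cwd=root, check=True)
    subprocess.run(git + ["commit", "-m", "initialize harness"], cwd=root, check=True)

initialize("demo-project")
print(Path("demo-project/claude-progress.txt").read_text())
```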

Step 2: Cycle Three Agents with Handoffs; Enforce Clean Exits

Switch to a three-agent harness: planner decomposes tasks, generator builds (code, outputs), evaluator grades and iterates. Handoffs use structured JSON specs and artifacts for continuity across context windows.[1][2]

Go orchestrator-worker for parallelism—lead agents filter and hand off to subagents, like in Managed Agents setups. Enforce clean exits: each cycle advances one tractable chunk, logs progress, and commits. In practice, this built a 6-hour retro game engine autonomously.[1][5][6]

Test with design audits and multi-agent scope checks to avoid loops from ignored memory.
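
Under stated assumptions (the planner’s spec is a list of features marked “failing”, and `generate`/`evaluate` are stand-ins for real agent calls), the cycle is a short loop with a clean exit each pass:

```python
def run_harness(spec, generate, evaluate, max_cycles=50):
    """Cycle the three roles: take the next failing feature, generate it,
    grade it, and record a handoff entry on every clean exit."""
    log = []
    for _ in range(max_cycles):
        pending = [f for f in spec if f["status"] == "failing"]
        if not pending:
            break                      # all features passing: done
        feature = pending[0]           # one tractable chunk per cycle
        artifact = generate(feature)
        if evaluate(feature, artifact):
            feature["status"] = "passing"
        log.append((feature["id"], feature["status"]))  # handoff record
    return log

spec = [{"id": 1, "status": "failing"}, {"id": 2, "status": "failing"}]
log = run_harness(spec, generate=lambda f: f"code for {f['id']}",
                  evaluate=lambda f, a: True)
print(log)
```

A failed evaluation simply leaves the feature “failing”, so the next cycle (or session) retries it with the evaluator’s feedback in hand.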

Step 3: Add Evaluator Criteria and Human Calibration for Production

Layer in objective evaluator criteria: grading rubrics, generator-discriminator patterns, and enforced testing per commit. Add failure reflection to drive iteration.[1][2]

Human calibration tunes for production—future-proof with retrieval triggers, summarization, and four-layer memory (working context, sessions, artifacts). Scale via nine principles like attention budgets; one example: 4-hour browser-based audio workstation.[3][5][6]

This setup handles long-horizon stuff, like hundred-turn conversations, reliably.

Real-World Examples: Games, Apps, and Beyond

Autonomous AI agents, powered by structured harnesses, crank out real apps over multi-hour sessions—think playable prototypes with enforced testing baked in from the start[1][2]. It’s wild how they bridge session gaps with init scripts and Git commits, turning “amnesia” into steady progress.

6-Hour Retro Game Engine: Incremental Features via Commits

Picture this: an AI builds a full retro game engine in six hours, layering in features commit by commit. It starts with a Git repo and `init.sh` setup, then incrementally adds mechanics like exponential growth generators—ducks as currency, prestige resets for bonuses, even offline earnings[1][2]. By session end, you’ve got a working incremental game where numbers go up addictively, hitting price walls that unlock higher-tier stuff. Honestly, 200+ initial features marked “failing” get ticked off reliably, proving the three-agent setup (planner, generator, evaluator) nails long runs[1][4].

4-Hour Browser Audio Workstation: Full-Stack from Spec

In just four hours, agents spin up a browser-based digital audio workstation—full-stack, from JSON specs to frontend/backend code. Using tools like SpecKit, they decompose the spec, code chunks, and evaluate with tests, persisting state via `claude-progress.txt` logs[1][2][7]. The result? A functional app handling audio processing, ready to iterate. One stat: evaluator agents grade objectively, looping on failures to hit 95% reliability in production-like builds[1][2].

Other Wins: Compliance Audits, Risk Analysis, Claude.ai Clone

Beyond games, these harnesses tackle compliance audits and risk analysis via subagents—orchestrators dispatching workers for parallel tasks[1][5]. A standout: cloning Claude.ai with 200+ features, using feature lists and BMAD frameworks for scoped advances[1][4]. Practitioners on LinkedIn praise JSON specs and init scripts for rock-solid reliability in industry[1]. Looking ahead, specialized testing/QA agents and model internalization will evolve harnesses into pure coordinators[3][5][6]. In practice, this shifts AI from one-shot answers to marathon builds.

Frequently Asked Questions

What is Anthropic’s three-agent harness for long-running AI agents?

Anthropic’s three-agent harness splits tasks into planner, generator, and evaluator agents to handle multi-hour workflows reliably. The planner decomposes tasks into features, the generator builds outputs like code incrementally, and the evaluator grades work objectively to enable iteration across sessions.[1][3] This structure prevents common pitfalls like poor self-evaluation and context loss in discrete sessions.

How do you prevent amnesia in multi-session AI agents?

Prevent amnesia by using initializer agents to create structured artifacts like JSON feature specs, init.sh scripts, claude-progress.txt logs, and Git commits that hand off clean state between sessions. Compaction manages context windows, while the four memory layers—working context, session logs, artifacts, and compaction—maintain continuity without overflow.[1][4] Each session ends with incremental progress, ensuring the next agent picks up seamlessly.

What are common failure modes of long-running AI agents?

Common failures include ‘context anxiety’ from overflowing windows and poor self-evaluation where agents confidently approve mediocre work. Amnesia hits every new session due to no inherent memory, and naive single-agent setups lack decomposition for complex tasks.[2][3][4] These lead to incoherent progress over multi-hour runs without structured handoffs.

Can long-running AI agents autonomously code full apps?

Yes, with proper harnesses like Anthropic’s three-agent system, they can autonomously code full-stack apps over multi-hour sessions, producing working applications via incremental feature implementation. Experiments show substantial improvements from decomposing specs into 200+ features and using commit-by-commit progress with enforced testing.[1][3] Human oversight helps calibrate evaluators initially, but agents handle most iteration.

How to implement Anthropic’s AI agent harness for coding tasks?

Start with an initializer agent to generate a feature list from the spec, set up init.sh, progress logs, and initial Git commit for a clean base. Run planner to decompose, generator to code one feature at a time, and evaluator to grade and iterate, leaving artifacts for the next session.[1][3][4] Use tools like Claude Agent SDK for compaction and session management to bridge context windows.

Experiment with Anthropic’s harness on your next project and share your results in the comments.

Subscribe to Fix AI Tools for weekly AI & tech insights.

Onur

AI Content Strategist & Tech Writer

Covers AI, machine learning, and enterprise technology trends. Focused on practical applications and real-world impact across the data ecosystem.
