Introduction: The Depth of Agency
Series Overview: This is Part 9 of our 20-part AI Application Development Mastery series. We now explore the most advanced agent architectures — systems that plan multi-step strategies, reflect on their own performance, search through solution spaces, and operate with increasing autonomy while maintaining safety bounds.
1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning (You Are Here)
10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
14. MCP in Production: Building servers, integrations, scaling, agent systems
15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
16. Production AI Systems: APIs, queues, caching, streaming, scaling
17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack
20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS
Parts 7 and 8 gave us the building blocks: tool-calling agents and LangGraph's stateful workflows. But the agents we built so far are reactive — they respond to user input, call tools, and return results. Deep agents go further. They plan before acting, reflect on their results, self-correct when things go wrong, and can operate with increasing levels of autonomy.
This part explores the architectures that power the most sophisticated AI systems being built today — the patterns behind Devin, Copilot Workspace, Claude Code, and research prototypes pushing the boundary of what autonomous AI can do.
Key Insight: The difference between a simple agent and a deep agent is the difference between a junior developer who writes code when told (reactive) and a senior architect who designs the solution, implements it, tests it, reviews their own work, and iterates until it meets quality standards (proactive, reflective, autonomous).
1. Planner-Executor-Critic
The Planner-Executor-Critic pattern separates the agent into three distinct roles, each handled by a different LLM call (or potentially a different model). This separation of concerns dramatically improves reliability and enables self-correction.
1.1 The Planning Phase
The planning phase is where a deep agent decomposes a complex, open-ended goal into a sequence of concrete steps. The planner receives the user’s objective and generates a structured plan — typically a numbered list of actions — that the executor can follow. This separation of planning from execution lets the agent reason about strategy before committing to actions, much like how a human outlines an approach before diving into implementation.
# pip install langchain-openai langgraph
import os
from langchain_openai import ChatOpenAI
from typing import TypedDict

# Requires OPENAI_API_KEY environment variable
# export OPENAI_API_KEY="sk-..."
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# State schema for the Planner-Executor-Critic agent
class DeepAgentState(TypedDict):
    task: str
    plan: list[str]
    current_step: int
    step_results: list[str]  # plain list (no reducer) so the critic can reset it between revisions
    critique: str
    final_result: str
    revision_count: int
    max_revisions: int
    status: str

def planner(state: DeepAgentState) -> dict:
    """Generate a step-by-step plan for the task."""
    response = llm.invoke(
        f"You are a planning expert. Break this task into 3-7 concrete, "
        f"actionable steps. Each step should be independently executable.\n\n"
        f"Task: {state['task']}\n\n"
        f"Return a numbered list of steps, one per line."
    )
    steps = [
        line.strip().lstrip("0123456789.)- ")
        for line in response.content.strip().split("\n")
        if line.strip() and any(c.isalpha() for c in line)
    ]
    return {"plan": steps, "current_step": 0, "status": "executing"}
1.2 The Execution Phase
The executor takes the plan generated in the planning phase and works through it step by step, using tools and LLM reasoning to complete each action. After each step, a routing function checks whether all steps are complete or whether to continue execution. This incremental approach allows the agent to accumulate results progressively and adapt its behavior based on intermediate outcomes.
# Executor and routing logic (uses DeepAgentState and llm from above)
def executor(state: DeepAgentState) -> dict:
    """Execute the current step of the plan."""
    step_idx = state["current_step"]
    step = state["plan"][step_idx]
    # Include results from previous steps for context
    context = ""
    if state["step_results"]:
        context = "Previous results:\n" + "\n".join(
            f"Step {i+1}: {r}" for i, r in enumerate(state["step_results"])
        )
    response = llm.invoke(
        f"Execute this step of the plan. Be thorough and detailed.\n\n"
        f"Overall task: {state['task']}\n"
        f"Current step ({step_idx + 1}/{len(state['plan'])}): {step}\n"
        f"{context}\n\n"
        f"Provide the result of executing this step."
    )
    return {
        # Carry previous results forward and append this step's output
        "step_results": state["step_results"] + [f"[Step {step_idx + 1}] {response.content}"],
        "current_step": step_idx + 1
    }

def should_continue_executing(state: DeepAgentState) -> str:
    """Check if there are more steps to execute."""
    if state["current_step"] < len(state["plan"]):
        return "execute_next"
    return "critique"
1.3 The Critic Phase
The critic closes the loop in a deep agent architecture by evaluating the executor’s combined results against the original objective. If the output is satisfactory, the agent terminates. If not, the critic provides feedback that triggers re-planning — creating a self-improving cycle where each iteration refines the approach. This evaluate-and-revise pattern is what distinguishes deep agents from simple sequential pipelines: they can recover from partial failures and iteratively converge on better solutions.
# Critic, routing, and full graph assembly (uses DeepAgentState, llm, planner, executor from above)
def critic(state: DeepAgentState) -> dict:
    """Evaluate the quality of the execution and decide if revision is needed."""
    all_results = "\n\n".join(state["step_results"])
    response = llm.invoke(
        f"You are a quality critic. Evaluate if this task was completed well.\n\n"
        f"Original task: {state['task']}\n\n"
        f"Plan:\n" + "\n".join(f"  {i+1}. {s}" for i, s in enumerate(state["plan"])) +
        f"\n\nExecution results:\n{all_results}\n\n"
        f"Respond with:\n"
        f"VERDICT: PASS or FAIL\n"
        f"FEEDBACK: Specific issues or improvements needed (if FAIL)\n"
        f"SUMMARY: One-paragraph summary of the overall result"
    )
    content = response.content
    passed = "VERDICT: PASS" in content.upper() or "PASS" in content.split("\n")[0].upper()
    if passed or state["revision_count"] >= state["max_revisions"]:
        # Extract summary for final result
        summary_start = content.find("SUMMARY:")
        summary = content[summary_start + 8:].strip() if summary_start != -1 else content
        return {
            "critique": content,
            "final_result": summary,
            "status": "complete"
        }
    return {
        "critique": content,
        "revision_count": state["revision_count"] + 1,
        "status": "revising",
        "step_results": [],  # Clear for re-execution
        "current_step": 0    # Start plan from beginning
    }

def after_critic(state: DeepAgentState) -> str:
    """Route based on critic's verdict."""
    if state["status"] == "complete":
        return "done"
    return "re_plan"

# Build the full Planner-Executor-Critic graph
from langgraph.graph import StateGraph, START, END

graph = StateGraph(DeepAgentState)
graph.add_node("plan", planner)
graph.add_node("execute", executor)
graph.add_node("critique", critic)
graph.add_edge(START, "plan")
graph.add_edge("plan", "execute")
graph.add_conditional_edges("execute", should_continue_executing, {
    "execute_next": "execute",
    "critique": "critique"
})
graph.add_conditional_edges("critique", after_critic, {
    "re_plan": "plan",
    "done": END
})
deep_agent = graph.compile()

# Run the deep agent
result = deep_agent.invoke({
    "task": "Write a comprehensive comparison of Python and Rust for web development",
    "plan": [],
    "current_step": 0,
    "step_results": [],
    "critique": "",
    "final_result": "",
    "revision_count": 0,
    "max_revisions": 2,
    "status": "planning"
})
print(f"Status: {result['status']}")
print(f"Result: {result['final_result'][:200]}...")
Key Insight: The Planner-Executor-Critic pattern mirrors how humans tackle complex tasks. A project manager creates a plan, team members execute it, and a reviewer evaluates the output. By separating these roles, each LLM call has a focused, well-defined job, which produces much better results than asking a single agent to do everything at once.
2. Plan-and-Execute Agents
Plan-and-Execute agents formalize the deep agent pattern into a two-stage pipeline: first generate a complete plan, then execute each step with a tool-equipped agent. This architecture is especially effective for multi-step tasks where the solution requires coordinating several tools in sequence. The key advantage over reactive agents is that the planner can reason about the entire task before any action is taken, leading to more coherent and efficient execution paths.
2.1 LangChain Implementation
LangChain provides a built-in Plan-and-Execute pipeline that separates a planner LLM (which generates the step-by-step plan) from an executor agent (which carries out each step using available tools). The planner typically uses a more capable model for strategic reasoning, while the executor can use a faster, cheaper model for tool invocation. This division of labor optimizes both quality and cost.
# pip install langchain-classic langchain-openai
# Plan-and-Execute: Plan first, then execute each step with tools
# This separates "what to do" from "how to do it"
import os
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_classic.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

@tool
def search(query: str) -> str:
    """Search the web for current information."""
    return f"Search results for: {query}"

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    import math
    safe = {k: v for k, v in math.__dict__.items() if not k.startswith("__")}
    return str(eval(expression, {"__builtins__": {}}, safe))

# The Planner: creates a high-level plan
def create_plan(task: str) -> list[str]:
    planner_llm = ChatOpenAI(model="gpt-4o", temperature=0)
    response = planner_llm.invoke(
        f"Break this task into 3-5 concrete steps. "
        f"Each step should be a clear action.\n\nTask: {task}\n\n"
        f"Return ONLY a numbered list."
    )
    return [
        line.strip().lstrip("0123456789.)- ")
        for line in response.content.strip().split("\n")
        if line.strip() and any(c.isalpha() for c in line)
    ]

# The Executor: has tools, executes one step at a time
executor_prompt = ChatPromptTemplate.from_messages([
    ("system", "Execute the given step using available tools. "
               "Be precise and return concrete results."),
    ("human", "Overall task: {task}\n\nCurrent step: {step}\n\n"
              "Previous results: {context}"),
    MessagesPlaceholder("agent_scratchpad")
])
executor_llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [search, calculate]
executor_agent = create_tool_calling_agent(executor_llm, tools, executor_prompt)
executor = AgentExecutor(agent=executor_agent, tools=tools, max_iterations=5)

# Run the full Plan-and-Execute pipeline
def plan_and_execute(task: str) -> str:
    # Phase 1: Plan
    plan = create_plan(task)
    print(f"Plan ({len(plan)} steps):")
    for i, step in enumerate(plan, 1):
        print(f"  {i}. {step}")
    # Phase 2: Execute each step
    results = []
    for i, step in enumerate(plan):
        context = "\n".join(results) if results else "None yet"
        result = executor.invoke({
            "task": task,
            "step": step,
            "context": context
        })
        results.append(f"Step {i+1}: {result['output']}")
        print(f"\nCompleted step {i+1}: {result['output'][:100]}...")
    return "\n\n".join(results)

# Run the agent
output = plan_and_execute("What is the GDP of France and how does it compare to the UK?")
print(output)
2.2 Adaptive Re-Planning
Static plans break down when execution reveals unexpected information or when a step fails. Adaptive re-planning addresses this by checking after each execution step whether the remaining plan still makes sense given the results so far. If the plan needs updating — perhaps because a search returned unexpected data or a tool failed — the planner generates a revised plan that incorporates what was learned. This makes the agent resilient to the unpredictable nature of real-world tool interactions.
# Adaptive re-planning: update the plan based on execution results
# Uses create_plan and executor from the block above
def adaptive_plan_and_execute(task: str, max_replans: int = 3) -> str:
    plan = create_plan(task)
    results = []
    replan_count = 0
    # Use an explicit index instead of iterating the list directly:
    # the plan is rebound mid-loop, and a plain for-loop would keep
    # walking the original list, never reaching the revised steps
    i = 0
    while i < len(plan):
        step = plan[i]
        context = "\n".join(results) if results else "None"
        result = executor.invoke({
            "task": task, "step": step, "context": context
        })
        results.append(f"Step {i+1}: {result['output']}")
        # After each step, check if the plan needs updating
        if i < len(plan) - 1 and replan_count < max_replans:
            replanner_llm = ChatOpenAI(model="gpt-4o", temperature=0)
            replan_response = replanner_llm.invoke(
                f"Original task: {task}\n\n"
                f"Original plan:\n" + "\n".join(f"  {j+1}. {s}" for j, s in enumerate(plan)) +
                f"\n\nCompleted so far:\n" + "\n".join(results) +
                f"\n\nShould the remaining steps be modified? "
                f"If yes, return the updated remaining steps. "
                f"If no, return 'NO CHANGES NEEDED'."
            )
            if "NO CHANGES NEEDED" not in replan_response.content.upper():
                new_steps = [
                    line.strip().lstrip("0123456789.)- ")
                    for line in replan_response.content.strip().split("\n")
                    if line.strip() and any(c.isalpha() for c in line)
                ]
                if new_steps:
                    plan = plan[:i+1] + new_steps
                    replan_count += 1
                    print(f"  [Re-planned! {len(new_steps)} new remaining steps]")
        i += 1
    return "\n\n".join(results)
3. Reflexion
Reflexion (Shinn et al., 2023) introduces verbal self-reflection as a form of learning. Instead of updating model weights, the agent reflects on its failures and stores the reflections in memory, which improves future attempts.
3.1 Reflexion Architecture
Reflexion is a self-improvement architecture where an agent attempts a task, evaluates its own output, and — if unsuccessful — writes a natural-language reflection analyzing what went wrong. These reflections are accumulated and included in subsequent attempts, giving the agent an explicit memory of past mistakes. Research shows that Reflexion agents can match or exceed few-shot performance on coding and reasoning tasks by learning from their own trial history, without any weight updates to the underlying model.
# pip install langgraph langchain-openai
# Reflexion: Learn from failures through verbal self-reflection
#
# Trial 1: Attempt -> Fail -> Reflect ("I failed because I didn't check X")
# Trial 2: Attempt (with reflection memory) -> Fail -> Reflect ("Also need to handle Y")
# Trial 3: Attempt (with all reflections) -> Succeed!
#
# Key insight: The model doesn't retrain. It just REMEMBERS what went wrong.
from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI

# Requires OPENAI_API_KEY environment variable
llm = ChatOpenAI(model="gpt-4o", temperature=0)

class ReflexionState(TypedDict):
    task: str
    attempts: Annotated[list, add]
    reflections: Annotated[list, add]
    current_attempt: str
    evaluation: str
    success: bool
    trial_number: int
    max_trials: int

def attempt(state: ReflexionState) -> dict:
    """Make an attempt at the task, informed by past reflections."""
    reflections_context = ""
    if state["reflections"]:
        reflections_context = (
            "\n\nLessons from previous attempts:\n" +
            "\n".join(f"- {r}" for r in state["reflections"])
        )
    response = llm.invoke(
        f"Complete this task. Be thorough and precise.\n\n"
        f"Task: {state['task']}"
        f"{reflections_context}\n\n"
        f"This is attempt #{state['trial_number']}."
    )
    return {
        "current_attempt": response.content,
        "attempts": [response.content]
    }

def evaluate(state: ReflexionState) -> dict:
    """Evaluate whether the attempt succeeded."""
    response = llm.invoke(
        f"Evaluate if this attempt successfully completes the task.\n\n"
        f"Task: {state['task']}\n\n"
        f"Attempt:\n{state['current_attempt']}\n\n"
        f"Respond with SUCCESS or FAILURE, followed by a brief explanation."
    )
    success = "SUCCESS" in response.content.upper().split("\n")[0]
    return {"evaluation": response.content, "success": success}

def reflect(state: ReflexionState) -> dict:
    """Reflect on why the attempt failed and extract lessons."""
    response = llm.invoke(
        f"The following attempt FAILED. Analyze why and extract specific, "
        f"actionable lessons for the next attempt.\n\n"
        f"Task: {state['task']}\n"
        f"Attempt:\n{state['current_attempt']}\n"
        f"Evaluation:\n{state['evaluation']}\n"
        f"Previous reflections: {state['reflections']}\n\n"
        f"New reflection (be specific and actionable):"
    )
    return {
        "reflections": [response.content],
        "trial_number": state["trial_number"] + 1
    }

def after_evaluation(state: ReflexionState) -> str:
    if state["success"]:
        return "done"
    if state["trial_number"] >= state["max_trials"]:
        return "done"  # Give up after max trials
    return "reflect"

# Build Reflexion graph
graph = StateGraph(ReflexionState)
graph.add_node("attempt", attempt)
graph.add_node("evaluate", evaluate)
graph.add_node("reflect", reflect)
graph.add_edge(START, "attempt")
graph.add_edge("attempt", "evaluate")
graph.add_conditional_edges("evaluate", after_evaluation, {
    "reflect": "reflect",
    "done": END
})
graph.add_edge("reflect", "attempt")  # Try again with new reflection
reflexion_agent = graph.compile()
3.2 Self-Reflection Loop
The power of Reflexion becomes visible when you run the graph multiple times on the same problem. Each trial produces a reflection that feeds into the next attempt, creating a compounding learning effect. The agent doesn’t just retry — it reasons about why the previous attempt failed and adjusts its strategy accordingly. The following invocation demonstrates this iterative improvement, where accumulated reflections guide increasingly sophisticated solutions.
# The power of Reflexion: each trial gets BETTER because of accumulated reflections
# Uses reflexion_agent compiled in the block above
result = reflexion_agent.invoke({
    "task": "Write a Python function that correctly handles all edge cases for "
            "parsing dates in formats: YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YYYY",
    "attempts": [],
    "reflections": [],
    "current_attempt": "",
    "evaluation": "",
    "success": False,
    "trial_number": 1,
    "max_trials": 4
})
# Trial 1: Might miss timezone handling
# Reflection: "I forgot to handle timezone-aware dates and invalid dates like Feb 30"
# Trial 2: Handles timezones but misses single-digit months
# Reflection: "Need to handle MM with or without leading zeros"
# Trial 3: Correct!
print(f"Success: {result['success']}")
print(f"Trials needed: {result['trial_number']}")
print(f"Reflections accumulated: {len(result['reflections'])}")
4. LATS — Language Agent Tree Search
LATS (Zhou et al., 2023) combines LLM reasoning with Monte Carlo Tree Search (MCTS). Instead of following a single path, the agent explores a tree of possible action sequences, evaluates promising branches, and backtracks from dead ends.
4.1 Tree Search for Agents
Language Agent Tree Search (LATS) applies Monte Carlo Tree Search — the same algorithm behind AlphaGo — to LLM-powered agents. Instead of committing to a single plan, LATS explores multiple solution paths simultaneously, evaluating and expanding the most promising branches. Each node in the search tree represents an agent state, and UCB1 (Upper Confidence Bound) scoring balances exploitation (pursuing high-scoring paths) with exploration (trying under-explored alternatives).
# LATS (Language Agent Tree Search): Explore multiple solution paths simultaneously
#
#                      [Start]
#                     /       \
#            [Search web]   [Query DB]
#             /        \          |
#      [Found it]  [Not found]  [Found partial]
#          |            |             |
#       [Done]   [Try different  [Combine with
#                    search]      web search]
#                                     |
#                                  [Done]
#
# Key operations:
#   1. SELECT    — Choose the most promising node to expand (UCB1)
#   2. EXPAND    — Generate possible next actions from that node
#   3. EVALUATE  — Score each new action (using LLM as evaluator)
#   4. BACKPROP  — Propagate scores back up the tree
#   5. REPEAT    — Until a satisfactory solution is found
from dataclasses import dataclass, field
from typing import Optional
import math

@dataclass
class TreeNode:
    """A node in the LATS search tree."""
    state: dict
    action: str = ""
    parent: Optional['TreeNode'] = None
    children: list['TreeNode'] = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0
    depth: int = 0

    @property
    def average_score(self) -> float:
        return self.total_score / max(self.visits, 1)

    def ucb1(self, exploration_weight: float = 1.41) -> float:
        """Upper Confidence Bound for tree search."""
        if self.visits == 0:
            return float("inf")
        exploitation = self.average_score
        exploration = exploration_weight * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration
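To make the UCB1 trade-off concrete, here is a standalone calculation, independent of the TreeNode class, comparing a well-explored high-scoring node against a barely-visited lower-scoring one. The specific numbers are illustrative only:

```python
import math

def ucb1(total_score: float, visits: int, parent_visits: int,
         exploration_weight: float = 1.41) -> float:
    """UCB1 score: average reward plus an exploration bonus
    that shrinks as a node accumulates visits."""
    if visits == 0:
        return float("inf")
    exploitation = total_score / visits
    exploration = exploration_weight * math.sqrt(
        math.log(parent_visits) / visits
    )
    return exploitation + exploration

# Node A: strong average (0.8) but already visited 10 times
score_a = ucb1(total_score=8.0, visits=10, parent_visits=12)
# Node B: weaker average (0.5) but only visited twice
score_b = ucb1(total_score=1.0, visits=2, parent_visits=12)

print(f"A: {score_a:.3f}, B: {score_b:.3f}")
# B scores higher despite the lower average: the exploration bonus
# rewards under-visited nodes, which is how MCTS avoids tunneling
# on early winners.
```

The exploration weight (1.41, roughly the square root of 2) controls how aggressively the search revisits under-explored branches; lower values make the search greedier.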
4.2 LATS Implementation
The full LATS agent wraps the tree search data structure with an LLM-powered expand-evaluate-backpropagate loop. At each iteration, the agent selects the most promising node (via UCB1), generates candidate next steps (expansion), scores them with the LLM (evaluation), and propagates scores back up the tree (backpropagation). After a configurable number of iterations, the best complete solution path is returned. This approach dramatically outperforms single-path agents on complex reasoning tasks.
# Uses TreeNode class from the block above
class LATSAgent:
    """Language Agent Tree Search — explores multiple solution paths."""
    def __init__(self, llm, tools, max_depth=5, num_candidates=3):
        self.llm = llm
        self.tools = tools
        self.max_depth = max_depth
        self.num_candidates = num_candidates

    def expand(self, node: TreeNode) -> list[TreeNode]:
        """Generate candidate next actions from current state."""
        response = self.llm.invoke(
            f"Given this state, suggest {self.num_candidates} different "
            f"next actions to take. Each should be a distinct approach.\n\n"
            f"Task: {node.state['task']}\n"
            f"Progress so far: {node.state.get('progress', 'None')}\n"
            f"Current depth: {node.depth}/{self.max_depth}\n\n"
            f"Return {self.num_candidates} actions, one per line."
        )
        actions = [
            line.strip().lstrip("0123456789.)- ")
            for line in response.content.strip().split("\n")
            if line.strip() and any(c.isalpha() for c in line)
        ][:self.num_candidates]
        children = []
        for action in actions:
            child_state = {**node.state, "progress": f"{node.state.get('progress', '')} → {action}"}
            child = TreeNode(
                state=child_state,
                action=action,
                parent=node,
                depth=node.depth + 1
            )
            children.append(child)
            node.children.append(child)
        return children

    def evaluate(self, node: TreeNode) -> float:
        """Score a node's state (0.0 to 1.0)."""
        response = self.llm.invoke(
            f"Rate how close this progress is to completing the task.\n\n"
            f"Task: {node.state['task']}\n"
            f"Progress: {node.state.get('progress', 'None')}\n\n"
            f"Return a score from 0.0 (no progress) to 1.0 (task complete)."
        )
        try:
            score = float(response.content.strip().split()[0])
            return max(0.0, min(1.0, score))
        except (ValueError, IndexError):
            return 0.5

    def search(self, task: str, num_iterations: int = 20) -> TreeNode:
        """Run LATS to find the best solution path."""
        root = TreeNode(state={"task": task, "progress": ""})
        for _ in range(num_iterations):
            # SELECT: Find best leaf to expand (UCB1)
            leaf = self._select(root)
            if leaf.depth >= self.max_depth:
                continue
            # EXPAND: Generate candidate actions
            children = self.expand(leaf)
            # EVALUATE + BACKPROPAGATE: score each candidate, then let
            # backpropagation record the child's own visit and update
            # ancestors (setting visits/total_score manually beforehand
            # would double-count the child's first visit)
            for child in children:
                score = self.evaluate(child)
                self._backpropagate(child, score)
        # Return the best path
        return self._best_path(root)

    def _select(self, node: TreeNode) -> TreeNode:
        while node.children:
            node = max(node.children, key=lambda c: c.ucb1())
        return node

    def _backpropagate(self, node: TreeNode, score: float):
        while node:
            node.visits += 1
            node.total_score += score
            node = node.parent

    def _best_path(self, root: TreeNode) -> TreeNode:
        node = root
        while node.children:
            node = max(node.children, key=lambda c: c.average_score)
        return node

# Usage example
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
# agent = LATSAgent(llm=llm, tools=[], max_depth=4, num_candidates=3)
# best = agent.search("Design a REST API for a todo app", num_iterations=15)
# print(f"Best path score: {best.average_score:.2f}")
# print(f"Progress: {best.state['progress']}")
4.3 LATS vs ReAct vs Reflexion
When to Use Each Architecture
- ReAct: Simple tasks where one path through the solution space is likely sufficient. Fast, low cost.
- Reflexion: Tasks where the agent might fail on the first try but can learn from mistakes. Good for coding, writing, and tasks with clear pass/fail criteria.
- LATS: Complex tasks where multiple solution strategies exist and the best path is unclear. Higher cost but finds better solutions for hard problems.
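These rules of thumb can be encoded as a simple dispatcher. The function below is an illustrative sketch, not part of any framework; the input signals (strategy count, verifiability, budget sensitivity) and the routing thresholds are assumptions chosen to mirror the guidance above:

```python
def choose_architecture(num_strategies: int, verifiable: bool,
                        budget_sensitive: bool) -> str:
    """Heuristic router over the three architectures discussed above.
    The thresholds are illustrative assumptions, not fixed rules."""
    if num_strategies > 1 and not budget_sensitive:
        # Multiple plausible paths and budget to explore them: search the tree
        return "LATS"
    if verifiable:
        # A clear pass/fail signal lets verbal self-reflection compound
        return "Reflexion"
    # Default: single-path reasoning with tools, fast and cheap
    return "ReAct"

print(choose_architecture(num_strategies=1, verifiable=False, budget_sensitive=True))   # ReAct
print(choose_architecture(num_strategies=1, verifiable=True, budget_sensitive=True))    # Reflexion
print(choose_architecture(num_strategies=4, verifiable=True, budget_sensitive=False))   # LATS
```

In practice this decision is often made per task type at design time rather than at runtime, but making the criteria explicit keeps the trade-off visible in code review.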
5. Tool Orchestration & Memory
As agents grow more capable, they need sophisticated strategies for selecting the right tool from a large toolbox and maintaining memory across long-running tasks. Simple agents hard-code tool lists, but deep agents dynamically select tools based on the current task, route complex queries through tool chains, and remember what worked (and what didn’t) across sessions. This section covers the orchestration and memory patterns that enable truly autonomous agent behavior.
5.1 Dynamic Tool Selection
When an agent has access to dozens or hundreds of tools, sending all tool descriptions to the LLM wastes context and confuses tool selection. A ToolOrchestrator solves this by categorizing tools, selecting only the relevant subset for each query, and implementing fallback execution when the primary tool fails. This pattern also enables tool chaining — piping the output of one tool into another — and graceful degradation when tools are unavailable.
# Deep agents need sophisticated tool orchestration:
# selecting tools dynamically, chaining tool outputs, and handling failures
# Requires an LLM instance (e.g., ChatOpenAI) and a list of LangChain tools
class ToolOrchestrator:
    """Manage a large set of tools with dynamic selection."""
    def __init__(self, tools: list, llm):
        self.tools = {t.name: t for t in tools}
        self.llm = llm
        self.tool_descriptions = "\n".join(
            f"- {t.name}: {t.description}" for t in tools
        )

    def select_tools(self, task: str, max_tools: int = 5) -> list:
        """Dynamically select the most relevant tools for a task."""
        response = self.llm.invoke(
            f"Given this task, select the {max_tools} most relevant tools.\n\n"
            f"Task: {task}\n\n"
            f"Available tools:\n{self.tool_descriptions}\n\n"
            f"Return ONLY the tool names, one per line."
        )
        selected_names = [
            line.strip() for line in response.content.strip().split("\n")
            if line.strip() in self.tools
        ]
        return [self.tools[name] for name in selected_names[:max_tools]]

    def execute_with_fallback(self, tool_name: str, args: dict,
                              fallback_tools: list = None) -> str:
        """Execute a tool with automatic fallback on failure."""
        try:
            result = self.tools[tool_name].invoke(args)
            if result and not str(result).startswith("Error"):
                return result
        except Exception:
            pass
        # Try fallback tools
        for fallback_name in (fallback_tools or []):
            try:
                result = self.tools[fallback_name].invoke(args)
                if result and not str(result).startswith("Error"):
                    return f"[via {fallback_name}] {result}"
            except Exception:
                continue
        return f"All tools failed for: {args}"
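The fallback logic can be exercised in isolation with stub callables. The sketch below (the stub names flaky_search and cached_search are invented for illustration) shows the same try-primary-then-fallbacks shape without any LangChain dependency:

```python
def run_with_fallback(primary, fallbacks, args):
    """Try the primary callable, then each (name, callable) fallback
    in order. Returns the first successful result, tagging any result
    that came from a fallback."""
    try:
        return primary(args)
    except Exception:
        pass
    for name, tool in fallbacks:
        try:
            return f"[via {name}] {tool(args)}"
        except Exception:
            continue
    return f"All tools failed for: {args}"

# Stub tools for illustration: the primary always raises
def flaky_search(query):
    raise TimeoutError("primary search timed out")

def cached_search(query):
    return f"cached results for '{query}'"

result = run_with_fallback(flaky_search,
                           [("cached_search", cached_search)],
                           "rust web frameworks")
print(result)  # [via cached_search] cached results for 'rust web frameworks'
```

Tagging fallback results (the `[via ...]` prefix) matters downstream: the agent's critic or the human reviewer can see that the answer came from a degraded path.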
5.2 Memory Integration
Deep agents benefit from multi-layered memory that mirrors cognitive science: working memory (the current task context), episodic memory (past experiences stored as vector embeddings), semantic memory (structured facts and preferences), and procedural memory (learned strategies for recurring tasks). The implementation below integrates all four layers into a unified DeepAgentMemory class that agents can query and update throughout their execution.
# pip install langchain-community langchain-openai chromadb
# Deep agents need multiple memory systems working together
from langchain_community.vectorstores import Chroma

class DeepAgentMemory:
    """Multi-layered memory system for deep agents."""
    def __init__(self, embeddings):
        # Working memory: current task context (short-term)
        self.working_memory = []
        # Episodic memory: past task experiences (medium-term)
        self.episodic_store = Chroma(
            collection_name="episodic",
            embedding_function=embeddings
        )
        # Semantic memory: general knowledge/facts (long-term)
        self.semantic_store = Chroma(
            collection_name="semantic",
            embedding_function=embeddings
        )
        # Procedural memory: learned strategies (long-term)
        self.procedures = {}

    def store_episode(self, task: str, plan: list, result: str, success: bool):
        """Store a completed task episode for future reference."""
        episode = f"Task: {task}\nPlan: {plan}\nResult: {result}\nSuccess: {success}"
        self.episodic_store.add_texts(
            texts=[episode],
            metadatas=[{"task": task, "success": success}]
        )

    def recall_similar_episodes(self, task: str, k: int = 3) -> list:
        """Retrieve similar past experiences to inform planning."""
        docs = self.episodic_store.similarity_search(task, k=k)
        return [doc.page_content for doc in docs]

    def store_procedure(self, name: str, steps: list):
        """Store a learned procedure for reuse."""
        self.procedures[name] = steps

    def get_relevant_context(self, query: str) -> dict:
        """Get all relevant context for a given query."""
        return {
            "working": self.working_memory[-10:],
            "episodes": self.recall_similar_episodes(query),
            "procedures": [
                name for name in self.procedures
                if any(word in name.lower() for word in query.lower().split())
            ]
        }
6. Autonomy Levels
Not all agent tasks require the same level of independence. Autonomy levels (L1 through L4) provide a framework for calibrating how much freedom an agent has to act without human approval. Lower levels require confirmation for every action, while higher levels let the agent plan and execute entire workflows independently. Choosing the right autonomy level for each use case is critical for balancing efficiency with safety — especially when agents interact with production systems, external APIs, or sensitive data.
6.1 L1 through L4
| Level | Name | Description | Human Role | Example |
| --- | --- | --- | --- | --- |
| L1 | Assisted | Agent suggests actions; human approves every step | Approve each action | Copilot suggestions (accept/reject each line) |
| L2 | Semi-Autonomous | Agent executes routine actions; asks for approval on high-risk actions | Approve risky actions only | Email draft agent (auto-drafts, human approves send) |
| L3 | Supervised Autonomous | Agent executes full tasks independently; human reviews results | Review final output | Code review agent (writes PR, human reviews) |
| L4 | Fully Autonomous | Agent operates independently within defined bounds; human monitors | Monitor and intervene if needed | Devin-style coding agent (plan to deploy) |
# Implementing autonomy levels (the flags map onto LangGraph's interrupt mechanism)
# Configuration class for controlling how much human oversight an agent needs
class AutonomyConfig:
    """Configure autonomy level for a deep agent."""

    LEVELS = {
        "L1": {
            "description": "Assisted — approve every action",
            "interrupt_before_tools": True,
            "interrupt_before_execution": True,
            "auto_approve_read_only": False,
            "require_final_approval": True,
        },
        "L2": {
            "description": "Semi-Autonomous — approve risky actions",
            "interrupt_before_tools": False,
            "interrupt_before_execution": True,  # Only for destructive actions
            "auto_approve_read_only": True,
            "require_final_approval": True,
        },
        "L3": {
            "description": "Supervised — review final output",
            "interrupt_before_tools": False,
            "interrupt_before_execution": False,
            "auto_approve_read_only": True,
            "require_final_approval": True,
        },
        "L4": {
            "description": "Fully Autonomous — monitor only",
            "interrupt_before_tools": False,
            "interrupt_before_execution": False,
            "auto_approve_read_only": True,
            "require_final_approval": False,
        },
    }

    def __init__(self, level: str = "L2"):
        self.level = level
        self.config = self.LEVELS[level]

    def should_interrupt(self, action: str, is_destructive: bool) -> bool:
        """Determine if the agent should pause for human approval."""
        if self.config["interrupt_before_tools"]:
            return True
        if self.config["interrupt_before_execution"] and is_destructive:
            return True
        return False

# Usage example
config = AutonomyConfig(level="L2")
print(f"Level: {config.level} - {config.config['description']}")
print(f"Interrupt for read-only search? {config.should_interrupt('search', is_destructive=False)}")
# Output: False (L2 auto-approves non-destructive actions)
print(f"Interrupt for delete? {config.should_interrupt('delete_file', is_destructive=True)}")
# Output: True (L2 requires approval for destructive actions)
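To show how this configuration drives an execution loop, here is a hedged sketch of a gate that consults `should_interrupt` before each tool call. The tool names and the `DESTRUCTIVE` set are illustrative, and a compact copy of `AutonomyConfig` is included so the snippet runs standalone.

```python
# Sketch: gate each pending tool call through the autonomy config.
# Compact AutonomyConfig copy (L2/L4 only) so the example runs standalone.
class AutonomyConfig:
    LEVELS = {
        "L2": {"interrupt_before_tools": False, "interrupt_before_execution": True},
        "L4": {"interrupt_before_tools": False, "interrupt_before_execution": False},
    }

    def __init__(self, level="L2"):
        self.config = self.LEVELS[level]

    def should_interrupt(self, action, is_destructive):
        if self.config["interrupt_before_tools"]:
            return True
        return self.config["interrupt_before_execution"] and is_destructive

DESTRUCTIVE = {"delete_file", "deploy"}  # illustrative classification

def run_actions(actions, config, approve):
    """Execute each action, pausing for the `approve` callback when required."""
    executed = []
    for action in actions:
        if config.should_interrupt(action, action in DESTRUCTIVE):
            if not approve(action):
                continue  # human rejected; skip this action
        executed.append(action)
    return executed

plan = ["search", "delete_file", "summarize"]
done = run_actions(plan, AutonomyConfig("L2"), approve=lambda a: False)
print(done)  # ['search', 'summarize'] — the destructive step was rejected
```

In a real LangGraph deployment the same decision would be expressed via `interrupt_before` at graph compile time rather than an inline callback.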
6.2 Safety Bounds
As agents gain autonomy, they need explicit safety constraints that prevent dangerous or unauthorized actions. A SafetyBounds class defines what an agent can and cannot do: action allow/deny lists, budget limits (API costs, compute time), rate limiting, and forbidden operations. Every action the agent attempts passes through this safety layer before execution, providing a hard boundary that the LLM’s reasoning cannot override.
# Safety bounds constrain what an autonomous agent can do
# Use this class to enforce cost limits, action restrictions, and domain allow-lists
class SafetyBounds:
    """Define safety constraints for deep agents."""

    def __init__(self):
        self.max_cost_per_task = 5.00    # Max $ per task
        self.max_api_calls = 50          # Max LLM calls per task
        self.max_tool_calls = 20         # Max tool invocations
        self.max_execution_time = 300    # Max seconds
        self.forbidden_actions = [
            "delete_database",
            "send_email_to_all",
            "modify_production",
            "transfer_funds",
            "access_personal_data",
        ]
        self.allowed_domains = [
            "api.openai.com",
            "api.github.com",
            "www.google.com",
        ]
        self.require_confirmation_for = [
            "send_email",
            "create_pull_request",
            "deploy",
            "modify_config",
        ]

    def check_action(self, action: str, args: dict) -> tuple[bool, str]:
        """Check if an action is within safety bounds."""
        if action in self.forbidden_actions:
            return False, f"BLOCKED: '{action}' is a forbidden action"
        if action in self.require_confirmation_for:
            return False, f"REQUIRES_CONFIRMATION: '{action}' needs human approval"
        return True, "OK"

    def check_budget(self, current_cost: float, current_calls: int) -> tuple[bool, str]:
        """Check if the agent is within budget."""
        if current_cost > self.max_cost_per_task:
            return False, f"BUDGET_EXCEEDED: ${current_cost:.2f} > ${self.max_cost_per_task}"
        if current_calls > self.max_api_calls:
            return False, f"CALL_LIMIT: {current_calls} > {self.max_api_calls}"
        return True, "OK"

# Usage example
bounds = SafetyBounds()
allowed, msg = bounds.check_action("send_email", {"to": "user@example.com"})
print(f"Action allowed: {allowed}, Message: {msg}")
# Output: Action allowed: False, Message: REQUIRES_CONFIRMATION: 'send_email' needs human approval
allowed, msg = bounds.check_budget(current_cost=3.50, current_calls=25)
print(f"Budget OK: {allowed}, Message: {msg}")
# Output: Budget OK: True, Message: OK
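One way to wire these checks in front of execution, as the section above describes, is a small guarded executor that consults the bounds before every tool invocation. This is a sketch: the tool names are illustrative and a trimmed-down `SafetyBounds` is included so it runs standalone.

```python
# Sketch: a guarded executor that consults safety bounds before every tool call.
# Trimmed-down SafetyBounds so the example runs standalone.
class SafetyBounds:
    forbidden_actions = {"delete_database"}
    require_confirmation_for = {"send_email"}
    max_tool_calls = 20

    def check_action(self, action):
        if action in self.forbidden_actions:
            return False, f"BLOCKED: '{action}'"
        if action in self.require_confirmation_for:
            return False, f"REQUIRES_CONFIRMATION: '{action}'"
        return True, "OK"

class GuardedExecutor:
    def __init__(self, bounds, tools):
        self.bounds = bounds
        self.tools = tools   # name -> callable
        self.calls = 0

    def run(self, action, **kwargs):
        if self.calls >= self.bounds.max_tool_calls:
            return "HALTED: tool-call limit reached"
        ok, msg = self.bounds.check_action(action)
        if not ok:
            return msg       # blocked actions never reach the underlying tool
        self.calls += 1
        return self.tools[action](**kwargs)

executor = GuardedExecutor(SafetyBounds(), {"search": lambda q: f"results for {q}"})
print(executor.run("search", q="GDP of Japan"))   # results for GDP of Japan
print(executor.run("delete_database"))            # BLOCKED: 'delete_database'
```

The key design point is that the guard lives outside the LLM loop: the model can only request actions, and the wrapper, not the model, decides whether they execute.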
Safety Warning: Never deploy an L4 (fully autonomous) agent without comprehensive safety bounds, logging, and monitoring. An unconstrained agent with access to production databases, email systems, or financial APIs can cause irreversible damage. Start at L1 or L2 and gradually increase autonomy as you build confidence through testing and monitoring.
7. Research Frontier
The field of autonomous AI agents is evolving rapidly, with new architectures and frameworks emerging from both academia and industry. This section surveys the most promising research directions — from open-source AGI platforms to standardized benchmarks — that are shaping the next generation of deep agent capabilities. Understanding these frontiers helps developers anticipate which patterns will mature into production-ready tools.
7.1 OpenAGI
OpenAGI — Open-Source AGI Research Platform
OpenAGI (Ge et al., 2023) provides a benchmark for evaluating LLM-based agents on complex, multi-step tasks that require combining multiple domain-specific models. The agent must select and compose models (vision, NLP, audio) as tools to solve tasks like "describe the objects in this image and translate the description to French."
Key contribution: Demonstrates that LLMs can serve as a controller that orchestrates specialized models, functioning as a "brain" that coordinates "hands" (tools/models). This is the theoretical foundation for the tool-calling agents we built in Parts 7-8.
Tags: Model Composition · Task Planning · Multi-Modal
7.2 Voyager
Voyager — Lifelong Learning Agent in Minecraft
Voyager (Wang et al., 2023) is an LLM-powered agent that explores the Minecraft world, learns new skills, and builds a reusable skill library — all without any human intervention or gradient-based training.
Architecture:
- Automatic Curriculum: Agent proposes its own exploration goals based on current capabilities
- Skill Library: Stores verified code as reusable skills (procedural memory)
- Self-Verification: Tests each new skill and only stores it if it works (Reflexion-like)
- Iterative Prompting: Refines code based on execution errors (self-correction)
Key insight: Voyager demonstrates that agents can learn composable skills — simple skills become building blocks for complex behaviors, just like functions compose into programs.
Tags: Lifelong Learning · Skill Library · Self-Curriculum
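Voyager's verify-then-store loop can be sketched in a few lines. Names here are illustrative, and "verification" is reduced to running a candidate skill against test cases; in the real system, skills are generated code verified by execution in the Minecraft environment.

```python
# Sketch of Voyager-style skill acquisition: a candidate skill is stored
# in the library (procedural memory) only after it passes verification.
class SkillLibrary:
    def __init__(self):
        self.skills = {}   # name -> callable

    def try_add(self, name, fn, tests):
        """Verify the skill on (args, expected) pairs; store only if all pass."""
        for args, expected in tests:
            try:
                if fn(*args) != expected:
                    return False
            except Exception:
                return False
        self.skills[name] = fn
        return True

library = SkillLibrary()
ok = library.try_add("double", lambda x: x * 2, tests=[((2,), 4), ((5,), 10)])
bad = library.try_add("broken", lambda x: x / 0, tests=[((1,), 1)])
print(ok, bad, list(library.skills))  # True False ['double']
```

Because only verified skills enter the library, later plans can compose them without re-checking, which is what makes the skills usable as building blocks.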
7.3 WebArena
WebArena — Realistic Web Agent Benchmark
WebArena (Zhou et al., 2023) provides a realistic benchmark for web-browsing agents. It deploys full web applications (e-commerce, Reddit-like forum, GitLab, content management) and tests agents on 812 tasks like "Find the cheapest hotel in NYC for next weekend on the booking site."
Findings:
- GPT-4 agents achieve only ~14% task success rate on realistic web tasks (as of 2024)
- Humans achieve ~78% success rate on the same tasks
- Main failure modes: incorrect element identification, wrong action sequences, failure to recover from errors
Key insight: Real-world agent tasks are much harder than benchmarks suggest. The gap between demo-quality and production-quality agents is enormous. This is why safety bounds and human-in-the-loop (Part 8) are so critical.
Tags: Web Agents · Benchmarking · Reality Check
Exercises & Self-Assessment
Exercise 1
Build a Planner-Executor-Critic Agent
Implement the full Planner-Executor-Critic pattern in LangGraph with tools (search + calculator). Test with: "Research the GDP of the top 5 economies and calculate their combined share of world GDP." Allow up to 2 revision cycles.
Exercise 2
Reflexion for Code Generation
Build a Reflexion agent that generates Python functions. The evaluate step should actually run the code with test cases. If tests fail, the agent reflects on the error and tries again. Test with: "Write a function that validates email addresses using regex."
Exercise 3
Autonomy Level Comparison
Build the same task-execution agent at all 4 autonomy levels (L1-L4). Run the same 5 tasks through each level and measure: (a) time to completion, (b) number of human interventions, (c) quality of output, (d) safety incidents (if any). Document the trade-offs.
Exercise 4
Safety Bounds Stress Test
Create a deep agent with safety bounds and intentionally try to break them: (a) Send prompts that try to trick the agent into calling forbidden tools, (b) Create tasks that would exceed the budget, (c) Test whether the agent respects domain restrictions. Document all bypass attempts and fix any vulnerabilities.
Exercise 5
Reflective Questions
- Why does the Planner-Executor-Critic pattern produce better results than a single agent doing everything? What cognitive science principle does this mirror?
- Compare Reflexion to human learning. How is "verbal self-reflection" similar to and different from how humans learn from mistakes?
- At what point does increasing agent autonomy become net negative for productivity? Consider the WebArena results (14% success rate) in your answer.
- Design safety bounds for an agent that manages a company's social media accounts. What actions should require approval? What should be forbidden entirely?
- Voyager builds a skill library in Minecraft. How would you adapt this approach for a real-world coding agent that learns reusable patterns from past projects?
Conclusion & Next Steps
You now understand the most advanced agent architectures in use today and at the research frontier. Key takeaways:
- Planner-Executor-Critic — Separating planning, execution, and evaluation into distinct roles dramatically improves agent reliability and enables self-correction cycles
- Plan-and-Execute — Planning before acting reduces wasted tool calls and enables adaptive re-planning when circumstances change
- Reflexion — Verbal self-reflection stores lessons from failures in memory, improving performance on subsequent attempts without retraining
- LATS — Tree search explores multiple solution paths simultaneously, finding better solutions for complex, ambiguous tasks
- Tool Orchestration — Deep agents need dynamic tool selection, fallback chains, and multi-layered memory (working, episodic, semantic, procedural)
- Autonomy Levels (L1-L4) — From fully human-controlled to fully autonomous, with clear safety bounds at each level
- Research Frontier — OpenAGI, Voyager, and WebArena show both the potential and the current limitations of autonomous agents
Next in the Series
In Part 10: Multi-Agent Systems, we build on deep agent patterns to create systems where multiple specialized agents collaborate. Learn supervisor architectures, swarm intelligence, debate patterns, role-based teams, and how to orchestrate multi-agent workflows in LangGraph.
Continue the Series
Part 10: Multi-Agent Systems
Supervisor, swarm, debate, and role-based multi-agent collaboration architectures.
Read Article
Part 8: LangGraph — Stateful Agent Workflows
StateGraph fundamentals, nodes, edges, conditional routing, persistence, subgraphs, and human-in-the-loop.
Read Article
Part 7: Agents — Core of Modern AI Apps
Agent fundamentals — ReAct, tool-calling, AgentExecutor, memory, error handling, debugging.
Read Article