
AI Agents & Agentic Workflows

March 30, 2026 · Wasil Zafar · 33 min read

From simple tool-calling to fully autonomous multi-agent systems — learn how modern AI agents reason, plan, remember, and act. Master LangChain, AutoGen, and production agentic patterns that power the next generation of AI applications.

Table of Contents

  1. Introduction to AI Agents
  2. Tool Use & Function Calling
  3. Planning & ReAct
  4. Agent Memory Systems
  5. Multi-Agent Systems
  6. Agent Frameworks Compared
  7. Production Considerations
  8. Exercises

Introduction: The Age of AI Agents

Series Context: This is Part 13 of 24 in the AI in the Wild series. Parts 1–12 covered foundations, training, RAG, fine-tuning, safety, and MLOps. Now we turn to agents — AI systems that don't just respond but actively plan and act.

AI in the Wild

Your 24-step learning path • Currently on Step 13

  1. Series Introduction: Why AI in the wild matters
  2. LLM Foundations: Transformers, tokenization, prompting
  3. Prompt Engineering: Few-shot, chain-of-thought, templates
  4. RAG Systems: Retrieval-augmented generation
  5. Fine-Tuning LLMs: LoRA, QLoRA, PEFT
  6. Embeddings & Vector DBs: Semantic search, FAISS, Pinecone
  7. Evaluation & Testing: RAGAS, benchmarks, red-teaming
  8. AI Safety & Alignment: RLHF, Constitutional AI, guardrails
  9. MLOps for LLMs: CI/CD, monitoring, drift detection
  10. Multimodal AI: Vision-language, audio, video
  11. AI Infrastructure: GPU clusters, serving, quantization
  12. Production LLM APIs: OpenAI, Anthropic, Gemini at scale
  13. AI Agents & Agentic Workflows: Tool use, planning, multi-agent systems (You Are Here)
  14. AI in Healthcare: Medical imaging, clinical NLP, drug discovery
  15. AI in Finance: Fraud detection, credit scoring, trading
  16. AI in Legal & Compliance: Contract analysis, regulatory AI
  17. AI in Education: Personalized learning, tutors
  18. AI in Manufacturing: Predictive maintenance, quality control
  19. AI Ethics & Fairness: Bias, explainability, governance
  20. Generative AI & Creativity: DALL-E, Sora, creative workflows
  21. AI & Edge Computing: On-device inference, TinyML
  22. Future of AI: AGI timelines, frontier models
  23. Building AI Products: PM for AI, user research, iteration
  24. AI Career Paths: Roles, skills, interview prep

An AI agent is an AI system that perceives its environment, reasons about what to do, takes actions using tools, and iterates toward a goal — without requiring a human in the loop for every step. The shift from LLMs-as-chatbots to LLMs-as-agents represents one of the most consequential developments in applied AI.

The Key Difference: A chatbot answers a question in one shot. An agent observes, plans, acts, observes again, and continues until the task is done. Agents are loops; chatbots are single turns.

What Makes Something an Agent?

Agents have four capabilities that distinguish them from simple LLM applications:

  • Tool Use: The ability to call external functions — search engines, calculators, APIs, code interpreters, databases.
  • Planning: Breaking complex goals into subtasks and sequencing them logically.
  • Memory: Retaining information across steps — either in context (short-term) or in a vector store (long-term).
  • Autonomy: Deciding which actions to take based on observations, without explicit per-step human instruction.
Real-World Example

Devin: The AI Software Engineer

Cognition's Devin agent can read a GitHub issue, plan an implementation, write code across multiple files, run tests, fix failures, and open a pull request — all without human intervention. It uses a code editor, terminal, browser, and memory as tools, and loops until the CI passes.

This is qualitatively different from GitHub Copilot completing a single line. Devin is an autonomous agent; Copilot is an AI-assisted autocomplete. Both are valuable, but they operate at different levels of the stack.

The Agent Loop

Every agent — from the simplest tool-caller to the most sophisticated multi-agent system — runs a variation of the same loop:

OBSERVE → THINK → ACT → OBSERVE → THINK → ACT → ... → DONE

In practice:

  1. Observe: Receive the task + any new information from the environment (tool outputs, user messages).
  2. Think: The LLM reasons about what to do next. This may include explicit chain-of-thought, planning, or self-critique.
  3. Act: Call a tool, generate a response, update memory, or hand off to another agent.
  4. Repeat: Feed the tool output back to the LLM and continue until the stopping condition is met.
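The loop above can be sketched in a few lines of Python. The `llm_think` and `run_tool` functions below are stubs standing in for a real LLM call and tool executor; the loop structure is what matters:

```python
def llm_think(observations: list[str]) -> dict:
    """Stub for a real LLM call: returns either a tool call or a final answer."""
    if any("Observation:" in o for o in observations):
        return {"type": "final_answer", "content": observations[-1]}
    return {"type": "tool_call", "tool": "echo", "args": {"text": "hello"}}

def run_tool(name: str, args: dict) -> str:
    """Stub tool executor."""
    return f"{name} returned {args}"

def run_agent(task: str, max_steps: int = 10) -> str:
    """Minimal agent loop: OBSERVE -> THINK -> ACT until done or budget exhausted."""
    observations = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm_think(observations)                     # THINK
        if decision["type"] == "final_answer":
            return decision["content"]                         # DONE
        result = run_tool(decision["tool"], decision["args"])  # ACT
        observations.append(f"Observation: {result}")          # OBSERVE
    return "Stopped: max_steps reached."

print(run_agent("demo task"))
```

Note the `max_steps` guard: every production agent loop needs a hard stopping condition, a point revisited in the Production Considerations section.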

Tool Use & Function Calling

Tool use is the foundation of agentic behavior. Without tools, an LLM can only reason over information in its context window — it cannot browse the web, run code, query a database, or call an API. Tools are what connect the LLM's reasoning to the real world.

How Tool Calling Works

Modern LLMs (GPT-4o, Claude 3.5, Gemini 1.5) natively support structured tool calling:

  1. You describe available tools as JSON schemas (name, description, parameters).
  2. The LLM decides whether to call a tool, and if so, which one and with what arguments.
  3. Your code executes the tool and returns results.
  4. The LLM reads the result and decides whether to call another tool or produce a final answer.
Critical Insight — Tool Descriptions Matter: The LLM chooses which tool to call based entirely on your descriptions. A poorly described tool will be misused or ignored. Write tool descriptions as if you're explaining the function to a smart but literal engineer who has never seen your codebase.
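Concretely, step 1's tool description is just a JSON schema. In the OpenAI-style format (field names vary slightly across providers), a stock-price tool might be declared as:

```python
get_stock_price_schema = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": (
            "Get the current stock price for a ticker symbol. "
            "Use when the user asks about a specific stock's price."
        ),
        "parameters": {
            # Parameters are described with standard JSON Schema
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol, uppercase, e.g. 'NVDA'.",
                }
            },
            "required": ["ticker"],
        },
    },
}
```

Frameworks like LangChain generate this schema for you from a decorated function's signature and docstring, which is why docstring quality directly affects tool selection.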

LangChain Tool-Using Agent (Code Example 1)

LangChain's create_tool_calling_agent makes it straightforward to build agents that pick from a toolkit based on the task at hand. The example below builds a financial research agent that can search the web, do arithmetic, and look up stock prices.

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

# Define tools the agent can use
@tool
def search_web(query: str) -> str:
    """Search the web for current information about a topic."""
    # In production: integrate Tavily, Serper, or Brave Search API
    return f"Web search results for '{query}': [simulated results]"

@tool
def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression, e.g. '189.50 / 6.25'."""
    try:
        # Restricting globals blocks most builtins, but eval is never fully safe
        # on untrusted input -- prefer a dedicated math parser in production
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

@tool
def get_stock_price(ticker: str) -> str:
    """Get the current stock price for a given ticker symbol."""
    # In production: integrate Alpha Vantage, Yahoo Finance, etc.
    prices = {"AAPL": 189.50, "MSFT": 415.20, "GOOGL": 175.30, "NVDA": 875.00}
    price = prices.get(ticker.upper())
    if price is None:
        return f"Error: ticker '{ticker}' not found"
    return f"{ticker.upper()}: ${price}"

# Create the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [search_web, calculate, get_stock_price]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful financial research assistant. Use tools to gather data before answering."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")  # tool call/result history
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=5)

result = executor.invoke({"input": "Compare the stock prices of Apple and NVIDIA. Which has a higher P/E ratio if Apple's EPS is $6.25 and NVIDIA's is $1.95?"})
print(result["output"])
# Agent plan: get_stock_price(AAPL) -> get_stock_price(NVDA) -> calculate P/E ratios -> compare

Anatomy of a Tool Call Trace

With verbose=True, you can see the agent's reasoning. A typical trace looks like:

> Entering new AgentExecutor chain...
Thought: I need to get the stock prices first.
Action: get_stock_price
Action Input: {"ticker": "AAPL"}
Observation: AAPL: $189.50

Action: get_stock_price
Action Input: {"ticker": "NVDA"}
Observation: NVDA: $875.00

Thought: Now I can calculate P/E ratios.
Action: calculate
Action Input: {"expression": "189.50 / 6.25"}
Observation: 30.32

Action: calculate
Action Input: {"expression": "875.00 / 1.95"}
Observation: 448.72

Final Answer: Apple's P/E is 30.32x vs NVIDIA's 448.72x. NVIDIA trades at a dramatically higher multiple, reflecting expectations of explosive AI-driven earnings growth.
Best Practice

Tool Design Principles

  • One responsibility: Each tool does exactly one thing. Don't combine search + summarize into one tool.
  • Rich descriptions: Include what the tool does, when to use it, and what it returns.
  • Typed parameters: Use Pydantic models for complex inputs — the LLM follows schemas more reliably.
  • Idempotent where possible: Avoid tools with side effects unless necessary (e.g., write-to-database).
  • Error messages matter: Return descriptive errors — the agent will try to recover based on what you return, so "Error: ticker 'APPL' not found, did you mean 'AAPL'?" is far more recoverable than a bare stack trace.
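To illustrate the "typed parameters" principle, a Pydantic input schema might look like the sketch below. The `args_schema` usage in the comment follows LangChain's `@tool` decorator; adapt it to whatever framework you use:

```python
from pydantic import BaseModel, Field

class StockQuery(BaseModel):
    """Typed input schema for a stock-lookup tool."""
    ticker: str = Field(description="Stock ticker symbol, e.g. 'AAPL'")
    currency: str = Field(default="USD", description="ISO currency code for the price")

# LangChain-style usage -- @tool accepts a Pydantic class via args_schema:
# @tool(args_schema=StockQuery)
# def get_stock_price(ticker: str, currency: str = "USD") -> str: ...

q = StockQuery(ticker="NVDA")   # defaults fill in; missing fields raise a ValidationError
print(q.ticker, q.currency)
```

The framework converts the model into the JSON schema the LLM sees, so field descriptions double as instructions to the model.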

Planning & Reasoning Patterns

Raw tool calling is reactive. For complex tasks, agents need to plan — to decompose goals into steps before executing them. Several reasoning patterns have emerged as reliable approaches.

ReAct: Reason + Act

The ReAct pattern (Yao et al., 2022) interleaves reasoning traces with action calls. Before each tool call, the agent explicitly writes its reasoning in natural language. This improves performance on complex tasks and makes agent behavior auditable.

Pattern: ReAct

ReAct vs. Direct Action

Direct Action (brittle): Task → Tool Call → Answer. No explicit reasoning, hard to debug, misses multi-step dependencies.

ReAct (robust): Task → Thought → Tool Call → Observation → Thought → Tool Call → ... → Final Answer. Each step is justified and the agent can course-correct based on observations.

Models with native tool calling, such as GPT-4o and Claude 3.5 Sonnet, fit the ReAct pattern naturally: prompt the model to reason before acting, and the "thought" text appears in the response ahead of each tool call.
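When a model lacks native tool calling, ReAct can still be driven with a plain-text prompt plus a parser. A minimal sketch, reusing the tool names from the earlier example (the regex parser is a deliberate simplification):

```python
import re

REACT_PROMPT = """Answer the question using the available tools. Use this format exactly:

Thought: reason about what to do next
Action: one of [search_web, calculate, get_stock_price]
Action Input: JSON arguments for the action
Observation: (the tool result is inserted here by your code)
... (Thought/Action/Observation repeat as needed)
Thought: I now know the final answer
Final Answer: the answer to the question

Question: {question}
{scratchpad}"""

def parse_action(llm_output: str):
    """Extract (action, action_input) from a ReAct-formatted completion, or None."""
    m = re.search(r"Action:\s*([^\n]+)\n\s*Action Input:\s*([^\n]+)", llm_output)
    return (m.group(1).strip(), m.group(2).strip()) if m else None
```

Your loop calls the model with the prompt, parses out the action, runs the tool, appends `Observation: ...` to the scratchpad, and repeats until `Final Answer:` appears.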

Plan-and-Execute

For tasks requiring many steps, Plan-and-Execute separates the planning phase from execution. A "planner" LLM creates a task list upfront; an "executor" works through each step, potentially replanning if a step fails.

from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_planner

planner = load_chat_planner(llm)
executor = load_agent_executor(llm, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)

# The agent will:
# 1. Create a numbered plan: ["Step 1: Search for...", "Step 2: Calculate...", ...]
# 2. Execute each step, passing outputs to subsequent steps
# 3. Synthesize a final answer from all step outputs
agent.run("Research the top 3 AI chip manufacturers, compare their 2025 revenue, and predict market share in 2027")

Reflection & Self-Critique

Reflection agents evaluate their own outputs and iteratively improve them. This pattern is especially powerful for creative tasks, code generation, and research summaries.

REFLECTION_PROMPT = """
You wrote the following response:
<response>{response}</response>

Critique this response. What is missing? What could be more accurate or clearer?
Then rewrite an improved version addressing your critique.
"""
# Typical improvement: 2-3 reflection cycles yield significantly better outputs
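A reflection cycle is just a loop around a critique prompt. The sketch below uses a shortened stand-in prompt and a fake LLM callable so it runs standalone; swap in a real model client in practice:

```python
REFLECTION_PROMPT = ("You wrote: <response>{response}</response>. "
                     "Critique it, then rewrite an improved version.")

def reflect(llm, draft: str, cycles: int = 2) -> str:
    """Run critique-and-rewrite cycles. `llm` is any prompt -> text callable."""
    response = draft
    for _ in range(cycles):
        response = llm(REFLECTION_PROMPT.format(response=response))
    return response

# Demo with a fake LLM that appends a marker on each rewrite:
fake_llm = lambda p: p.split("<response>")[1].split("</response>")[0] + " [revised]"
print(reflect(fake_llm, "First draft", cycles=2))  # First draft [revised] [revised]
```

Cap the cycle count: beyond two or three iterations, quality gains typically flatten while cost keeps climbing.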

Agentic Design Patterns Comparison

Different patterns suit different task types. The table below maps patterns to their optimal use cases and associated risks:

| Pattern | Description | Use Case | Risk | Example System |
|---|---|---|---|---|
| ReAct | Interleave reasoning traces with tool calls | Multi-hop QA, research tasks | Verbose prompts, higher latency | Perplexity AI, Bing Copilot |
| Plan-and-Execute | Create full plan upfront, execute step-by-step | Complex workflows, project automation | Brittle if plan is wrong early | Devin, OpenDevin |
| Reflection | Evaluate and refine own outputs iteratively | Code generation, writing, analysis | Higher cost, may over-iterate | GPT-4 with self-critique |
| Multi-Agent Debate | Multiple agents argue different positions, synthesize | Fact-checking, complex decisions | Expensive, may amplify hallucinations | MetaGPT, Society of Mind |
| Supervisor-Worker | Orchestrator delegates to specialist sub-agents | Large-scale task decomposition | Supervisor bottleneck, coordination overhead | CrewAI, LangGraph |
| Tool-Augmented RAG | Agent decides whether to retrieve or call tools | Customer support, knowledge Q&A | Routing errors between retrieval and action | Salesforce Einstein, ServiceNow |

Agent Memory Systems

Memory is what transforms a stateless LLM into an agent that learns and adapts. Without memory, every conversation starts from zero — the agent cannot recall user preferences, past decisions, or previously gathered facts.

Types of Agent Memory

Memory Taxonomy

Four Memory Types

  • In-Context Memory (Working): The current conversation history within the context window. Fast but bounded — typically 8K–200K tokens.
  • External Memory (Episodic): Vector store of past conversations and experiences, retrieved via semantic search. Effectively unlimited.
  • Semantic Memory (Knowledge): A structured knowledge base or RAG corpus of facts about the world. Retrieved when needed.
  • Procedural Memory (Skills): Encoded in the model's weights via fine-tuning or in reusable tool/prompt templates. Always available but not updatable at runtime.

Vector-Based Episodic Memory (Code Example 2)

The following class implements a FAISS-backed episodic memory store that lets agents remember past conversations and retrieve relevant context using semantic similarity.

from sentence_transformers import SentenceTransformer
import faiss
from datetime import datetime

class AgentMemory:
    """Vector-based episodic memory for AI agents."""
    def __init__(self, dim: int = 384):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # Inner product on normalized embeddings == cosine similarity
        self.index = faiss.IndexFlatIP(dim)
        self.memories = []

    def store(self, content: str, metadata: dict | None = None):
        embedding = self.embedder.encode([content], normalize_embeddings=True)
        self.index.add(embedding.astype('float32'))
        self.memories.append({"content": content,
                              "timestamp": datetime.now().isoformat(),
                              **(metadata or {})})

    def retrieve(self, query: str, k: int = 3) -> list[dict]:
        if self.index.ntotal == 0:
            return []
        q_emb = self.embedder.encode([query], normalize_embeddings=True).astype('float32')
        scores, indices = self.index.search(q_emb, min(k, self.index.ntotal))
        return [{"memory": self.memories[i], "relevance": float(scores[0][j])}
                for j, i in enumerate(indices[0]) if i != -1]

# Usage: agent remembers past conversations
memory = AgentMemory()
memory.store("User prefers Python over JavaScript for data science tasks", {"type": "preference"})
memory.store("Previous analysis showed NVIDIA stock outperforming AMD by 40% in 2024", {"type": "fact"})

relevant = memory.retrieve("What programming language does the user prefer?")
print(relevant[0]["memory"]["content"])  # -> "User prefers Python..."

Memory Management Strategies

As agents accumulate memory, several challenges emerge:

  • Recency Bias: Always injecting the most recent context may miss important older facts. Use hybrid scoring: relevance + recency.
  • Memory Summarization: Periodically compress older memories. GPT-4 can summarize 20 past conversations into a 500-word profile.
  • Memory Isolation: In multi-user systems, each user's memories must be namespaced — use metadata filtering on the vector index.
  • Forgetting: Not all memories are worth keeping. Implement TTL-based expiry for time-sensitive facts.
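The hybrid relevance-plus-recency scoring mentioned above can be an exponential time decay blended with the similarity score. A sketch, assuming the naive ISO timestamps stored by the `AgentMemory` class above (the decay rate and 0.7 weighting are illustrative defaults to tune):

```python
from datetime import datetime

def hybrid_score(relevance: float, timestamp: str,
                 decay_per_hour: float = 0.995, weight: float = 0.7) -> float:
    """Blend semantic relevance with an exponential recency decay."""
    age_hours = (datetime.now() - datetime.fromisoformat(timestamp)).total_seconds() / 3600
    recency = decay_per_hour ** age_hours  # 1.0 for a fresh memory, ~0.89 after 24h
    return weight * relevance + (1 - weight) * recency

# Re-rank retrieved memories by the blended score instead of raw relevance:
hits = [{"relevance": 0.9, "timestamp": datetime.now().isoformat()}]
ranked = sorted(hits, key=lambda h: hybrid_score(h["relevance"], h["timestamp"]),
                reverse=True)
```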
Production Memory Stacks: Mem0, Zep, and Letta are purpose-built memory layers for AI agents in production. They handle persistence, search, summarization, and user-level isolation so you don't have to build it yourself.

Multi-Agent Systems

Single agents have limits. They can lose context on very long tasks, lack specialization, and may hallucinate without a second opinion. Multi-agent systems address these limitations by having several specialized agents collaborate — each doing what it does best.

Why Multi-Agent?

Design Motivation

When Single Agents Break Down

  • Context limits: A task requiring 200+ pages of analysis exceeds any single context window. Split across agents.
  • Specialization: A "researcher" agent trained on browsing is better at search than a "coder" agent. Specialize roles.
  • Verification: A single agent checking its own work catches ~60% of errors. A separate critic agent catches ~85% (empirical estimate from AutoGen paper).
  • Parallelism: Tasks with independent subtasks run faster in parallel agents than sequentially in one agent.

AutoGen Multi-Agent System (Code Example 3)

Microsoft's AutoGen framework makes it straightforward to build group chats where agents with distinct personas collaborate on complex tasks.

import autogen

# Multi-agent research team: planner + researcher + critic + executor
config_list = [{"model": "gpt-4o", "api_key": "..."}]
llm_config = {"config_list": config_list, "temperature": 0.1}

planner = autogen.AssistantAgent(
    name="Planner",
    system_message="""Break complex tasks into steps. Assign each step to the right specialist.
    Always create a plan before work begins. Format: PLAN: step1, step2, step3""",
    llm_config=llm_config
)

researcher = autogen.AssistantAgent(
    name="Researcher",
    system_message="Gather information and data. Cite sources. Focus on facts, not opinions.",
    llm_config=llm_config
)

critic = autogen.AssistantAgent(
    name="Critic",
    system_message="Review outputs for accuracy, completeness, and logical consistency. Point out flaws.",
    llm_config=llm_config
)

executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",  # fully automated
    code_execution_config={"work_dir": "workspace", "use_docker": False},
    max_consecutive_auto_reply=5
)

# Group chat: agents collaborate to solve the task
groupchat = autogen.GroupChat(
    agents=[planner, researcher, critic, executor],
    messages=[], max_round=12
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
executor.initiate_chat(manager, message="Analyze the competitive landscape of LLM providers in 2025. Include market share estimates and key differentiators.")

Multi-Agent Orchestration Patterns

Architecture Pattern

Common Topologies

  • Pipeline: Agent A → Agent B → Agent C. Each agent processes the output of the previous. Simple but no parallelism.
  • Supervisor-Worker: A central orchestrator delegates subtasks to worker agents and aggregates results. Most common in production.
  • Peer-to-Peer: Agents communicate directly. Flexible but harder to reason about and debug.
  • Debate: Multiple agents argue for different conclusions; a judge synthesizes. High quality, high cost.
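Stripped of framework machinery, the supervisor-worker topology is a routing step plus aggregation. A minimal sketch, where the keyword-based `route` function stands in for what would be an LLM routing call in production:

```python
def research_worker(subtask: str) -> str:
    return f"[research] findings for: {subtask}"

def coding_worker(subtask: str) -> str:
    return f"[code] implementation for: {subtask}"

WORKERS = {"research": research_worker, "code": coding_worker}

def route(subtask: str) -> str:
    """Stub router; in production this is an LLM call choosing a specialist."""
    verbs = ("implement", "write", "fix")
    return "code" if any(v in subtask.lower() for v in verbs) else "research"

def supervisor(task: str, subtasks: list[str]) -> str:
    """Delegate each subtask to a specialist worker and aggregate the results."""
    results = [WORKERS[route(s)](s) for s in subtasks]
    return f"Task: {task}\n" + "\n".join(results)

print(supervisor("Ship a market report tool",
                 ["Gather 2025 LLM market data", "Implement the report generator"]))
```

Because each subtask call is independent, the list comprehension can be swapped for a thread pool or async gather to get the parallelism benefit described above.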
Prompt Injection in Multi-Agent Systems: When agents read content from the web or user-uploaded files, malicious content can hijack the agent's behavior by embedding instructions. Always sanitize inputs and use bounded tool permissions. This is one of the most serious security challenges in production agentic systems.

Agent Frameworks Compared

The agent framework ecosystem has grown rapidly since 2023. Choosing the right framework depends on your use case, team's Python expertise, and whether you need multi-agent support out of the box.

| Framework | Creator | Agent Type | Tool Integration | Multi-Agent | Python/JS | Best For |
|---|---|---|---|---|---|---|
| LangChain | LangChain Inc. | ReAct, tool-calling | Excellent (200+ integrations) | LangGraph | Both | General-purpose RAG + agents |
| AutoGen | Microsoft Research | Conversational, group chat | Good (custom tools) | Native (group chat) | Python | Research, code generation teams |
| CrewAI | CrewAI Inc. | Role-based crews | Good (LangChain tools) | Native (crews) | Python | Business process automation |
| Semantic Kernel | Microsoft | Planner, function calling | Excellent (plugins) | Agent framework | Both + C# | Enterprise .NET + Python apps |
| LlamaIndex | LlamaIndex Inc. | ReAct, structured | Excellent (data-focused) | Multi-agent beta | Both | Document Q&A, data analysis |
Decision Guide

Framework Selection Heuristics

  • Building a customer-facing product with RAG + agents? → LangChain + LangGraph
  • Academic/research multi-agent experiments? → AutoGen
  • Business workflow automation with clear roles? → CrewAI
  • Enterprise .NET shop extending existing apps? → Semantic Kernel
  • Heavy document analysis, financial data? → LlamaIndex
  • Full control, production hardening, no abstraction overhead? → Build on raw API + your own orchestration

Production Considerations

Deploying agents in production is fundamentally different from running demos. Agents that work perfectly in testing can fail in unpredictable ways in production — looping indefinitely, calling expensive tools unnecessarily, or executing harmful actions when given malicious inputs.

Safety & Guardrails

Production Safety

Essential Guardrails

  • Max iterations: Always set max_iterations. A buggy tool can cause infinite loops that burn through your API budget in minutes.
  • Confirm before irreversible actions: Email sending, database writes, API calls with side effects should require explicit confirmation or a human-in-the-loop checkpoint.
  • Tool permissions: Apply principle of least privilege. A customer support agent should never have access to billing or admin tools.
  • Input sanitization: Treat all externally sourced content (web pages, user files) as potentially adversarial. Strip HTML, limit length, validate formats.
  • Output validation: Validate agent outputs before acting on them — especially for structured outputs like JSON or SQL.
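For the output-validation guardrail, even a small checker that refuses to act on malformed structured output goes a long way. A sketch:

```python
import json

def validate_agent_json(raw: str, required_keys: set):
    """Return the parsed dict only if it is valid JSON containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the agent wrapped its answer in prose, or emitted garbage
    if not isinstance(data, dict) or not required_keys <= data.keys():
        return None  # wrong shape or missing fields
    return data

ok = validate_agent_json('{"action": "refund", "order_id": "A123"}',
                         {"action", "order_id"})
chatty = validate_agent_json('Sure! {"action": "refund"}', {"action", "order_id"})
print(ok is not None, chatty is None)  # True True
```

On a `None` result, re-prompt the agent with the validation error rather than proceeding; for richer schemas, swap the manual checks for a Pydantic model or JSON Schema validator.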

Observability

Agents are harder to debug than standard software because their behavior is non-deterministic. Investing in observability infrastructure is essential:

  • Trace every run: Log the full reasoning trace — every thought, tool call, and observation. LangSmith, Weights & Biases Weave, and Langfuse are purpose-built for this.
  • Token usage: Track token consumption per run and per tool to identify cost hotspots.
  • Latency breakdown: Separate LLM inference time from tool execution time. Tool latency is usually the bigger issue.
  • Success/failure rates: Define task success criteria and track them. A 90% task completion rate might be acceptable; 60% is not.

Cost Management

Agents can be expensive. A single complex research task might make 10–20 LLM calls and dozens of tool calls. Strategies to control costs:

  • Use smaller models for simpler steps: Route tool selection and formatting steps to GPT-4o Mini; reserve GPT-4o for complex reasoning.
  • Cache tool results: Web search results, API responses, and database queries can be cached. Many tasks re-use the same data.
  • Prompt compression: Use LLMLingua or similar tools to compress conversation history before injecting into context.
  • Budget limits: Implement per-user and per-task token budgets with hard stops.
Real Cost Benchmark: A well-optimized GPT-4o agent handling a moderately complex research task (5–8 LLM calls) costs approximately $0.05–$0.20 per task. Without optimization, the same task can cost $1–$3. At scale, this difference is enormous.

Exercises & Practice

Building agents is a hands-on skill. Work through these exercises in order — each one reinforces a different layer of the agentic stack.

Beginner

Exercise 1: Two-Tool Agent

Create a simple tool-using agent with exactly 2 tools: a calculator and a dictionary lookup (use the Free Dictionary API or a hardcoded dict). Write 10 test questions that require using one or both tools to answer. For each question, log which tools were called and whether the agent arrived at the correct answer.

Goals: Understand tool description quality, observe the tool selection process, practice prompt engineering for tool-using agents.

Stretch: Add a third tool (currency converter) and test questions that require all three tools in sequence.

Intermediate

Exercise 2: Research Agent with Memory

Build a research agent that can: (1) search the web using the Tavily API, (2) read the full content of URLs using a browser tool, and (3) summarize findings into a structured report. Add vector-based episodic memory so the agent remembers past research sessions. Test by researching a topic in multiple sessions — the agent should reference previous findings.

Goals: Implement multi-step tool chaining, build and use vector memory, produce structured output from unstructured tool results.

Evaluation: Compare report quality (factual accuracy, completeness, citation quality) between session 1 (no prior memory) and session 3 (2 prior sessions in memory).

Advanced

Exercise 3: Multi-Agent Code Review Pipeline

Design and implement a 4-agent code review workflow:

  • Agent A (Generator): Given a spec, writes Python code to solve a programming problem.
  • Agent B (Reviewer): Reviews Agent A's code for bugs, style issues, and edge cases.
  • Agent C (Tester): Generates and runs unit tests against Agent A's code, reports failures.
  • Agent D (Improver): Reads B's review and C's test failures, produces an improved version of the code.

Run the pipeline on 10 programming challenges (LeetCode easy/medium). Track: number of iterations per problem, test pass rate after Agent D's revision vs Agent A's original, and qualitative review quality from Agent B.

Key Question: Does the multi-agent pipeline produce meaningfully better code than a single agent with self-reflection? What is the cost difference?
