Anthropic SDK Track Part 15: Agent Environments & Cloud

                        
What You’ll Learn: Production agents need different configurations for different environments: development (verbose logging, cheap models), staging (production-like but isolated), and production (optimized, monitored, secured). This article covers environment management: feature flags for agent behavior, configuration-driven deployment, and safe testing practices that don’t accidentally charge production API keys.
                    

1. Container Sandboxing

Agents that execute code (via Bash tool) must be sandboxed. Containers provide process isolation, filesystem scoping, and network control:

# Dockerfile for agent sandbox environment
FROM python:3.12-slim

# Install only what the agent needs
RUN pip install --no-cache-dir anthropic mcp pydantic

# Create a non-root user and a workspace directory owned by that user
RUN useradd -m -s /bin/bash agent \
    && mkdir -p /workspace \
    && chown agent:agent /workspace

# Drop root privileges, then set the workspace directory
USER agent
WORKDIR /workspace

# Agent can only write to /workspace (and its own home directory)
# No access to host filesystem beyond what is mounted at run time, no root privileges
# Network access controlled by Docker network settings
import os
import subprocess
import uuid

def run_agent_in_sandbox(task: str, workspace_path: str, timeout: int = 300) -> dict:
    """Execute an agent task inside a restricted Docker container."""

    # Name the container so it can be killed if the timeout fires
    name = f"agent-sandbox-{uuid.uuid4().hex[:12]}"

    cmd = [
        "docker", "run", "--rm", "--name", name,
        "--network", "none",                    # No network access
        "--memory", "2g",                       # Memory cap
        "--cpus", "0.5",                        # 50% of one CPU
        "--security-opt", "no-new-privileges",  # Cannot escalate
        # Bind mounts need an absolute host path; /workspace is the only writable mount
        "-v", f"{os.path.abspath(workspace_path)}:/workspace",
        "-e", f"TASK={task}",
        # With --network none the container cannot reach api.anthropic.com.
        # Add "-e", "ANTHROPIC_API_KEY" only if you attach a restricted
        # network or an egress proxy instead.
        "agent-sandbox:latest",
        "python", "run_agent.py",               # resolved relative to /workspace
    ]

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Killing the docker CLI does not stop the container, so stop it explicitly
        subprocess.run(["docker", "kill", name], capture_output=True)
        return {"exit_code": None, "output": "", "errors": f"timed out after {timeout}s"}

    return {
        "exit_code": result.returncode,
        "output": result.stdout,
        "errors": result.stderr
    }

2. Cloud Deployment Patterns

Production Agent Infrastructure

flowchart TD
    U["User Request"] --> LB["Load Balancer"]
    LB --> API["API Gateway"]
    API --> Q["Task Queue"]
    Q --> W1["Agent Worker 1<br/>(Container)"]
    Q --> W2["Agent Worker 2<br/>(Container)"]
    Q --> W3["Agent Worker 3<br/>(Container)"]
    W1 --> CL["Claude API"]
    W2 --> CL
    W3 --> CL
    W1 --> MCP["MCP Servers"]
    W2 --> MCP
    W3 --> MCP

# Production agent deployment considerations

# 1. Stateless workers (scale horizontally)
# - Each request gets a fresh container
# - No shared state between requests
# - Session state in external store (Redis, DB)

# 2. Rate limiting
# - Anthropic API has per-org rate limits
# - Implement client-side rate limiting
# - Queue requests during rate limit windows

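# 3. Retry logic with exponential backoff
# (the anthropic client also retries some errors itself; see its max_retries option)
import random
import time

import anthropic

def api_call_with_retry(func, max_retries=3):
    """Retry rate-limited API calls with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except anthropic.RateLimitError:
            # Match on the exception type rather than on the error message text
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            wait = (2 ** attempt) + random.uniform(0, 1)  # 1s, 2s, 4s, ... plus jitter
            time.sleep(wait)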

# 4. Observability
# - Log every tool call (name, input hash, duration, success/fail)
# - Track token usage per request
# - Alert on repeated failures (circuit breaker pattern)
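The client-side rate limiting in item 2 can be sketched with a token bucket. This is a minimal illustration, not a tuned implementation: the class name `TokenBucket`, its parameters, and the injectable `clock`/`sleep` hooks are all assumptions made for this example, and the right `rate` depends on your organization's actual limits.

```python
import threading
import time


class TokenBucket:
    """Client-side limiter: about `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)   # start with a full bucket
        self._clock = clock              # injectable so tests can fake time
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, capped at capacity
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if one is available; never blocks."""
        with self._lock:
            self._refill()
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

    def acquire(self) -> None:
        """Block until a token is available (requests wait out the window)."""
        while not self.try_acquire():
            self._sleep(1.0 / self.rate)
```

A worker would call `bucket.acquire()` immediately before each API request, so bursts are smoothed on the client instead of being rejected by the API and retried.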

Real-World Application

Zero-Downtime Agent Migration

A fintech company migrated their agent from Claude 3 to Claude 3.5 without downtime using environment-based canary deployment: 5% of traffic to the new model for 24 hours, automated quality checks comparing output distributions, then gradual rollout to 100%. One regression was caught at 10% and rolled back automatically.
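The traffic split in a canary rollout like this one can be sketched with deterministic hash bucketing. This is an illustration only, not the company's actual mechanism: the function names and the model labels passed in are hypothetical placeholders.

```python
import hashlib


def canary_bucket(request_id: str) -> int:
    """Map a request (or user) ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(request_id: str, canary_percent: int,
               stable: str, canary: str) -> str:
    """Route `canary_percent` percent of IDs to the canary model."""
    if canary_bucket(request_id) < canary_percent:
        return canary
    return stable
```

Because the bucket depends only on the ID, raising `canary_percent` from 5 to 10 keeps the original 5% on the new model and adds more traffic, and a rollback is a matter of setting the percentage back to 0.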


3. Security Boundaries

Production Agent Security Boundaries

flowchart TD
    subgraph L1["Layer 1: Network Isolation"]
        subgraph L2["Layer 2: Filesystem Scoping"]
            subgraph L3["Layer 3: Tool Permissions"]
                subgraph L4["Layer 4: Secret Management"]
                    subgraph L5["Layer 5: Audit Logging"]
                        AGENT["Agent<br/>Execution"]
                    end
                end
            end
        end
    end

    N1["No internet access<br/>Allowed-list only"] -.-> L1
    N2["Read/write within workspace only<br/>Ephemeral storage"] -.-> L2
    N3["Explicit allowlist per role<br/>No wildcard access"] -.-> L3
    N4["Vault-injected credentials<br/>Never in context window"] -.-> L4
    N5["Every tool call logged<br/>Immutable audit trail"] -.-> L5

    style L1 fill:#f8f9fa,stroke:#3B9797
    style L2 fill:#f0f8f8,stroke:#16476A
    style L3 fill:#e8f4f4,stroke:#132440
    style L4 fill:#fff5f5,stroke:#BF092F
    style L5 fill:#fafafa,stroke:#666

security_config = {
    "network": {
        "mode": "restricted",
        "allowed_hosts": ["api.anthropic.com"],
        "blocked_ports": [22, 3306, 5432]  # No SSH, no direct DB
    },
    "filesystem": {
        "writable_paths": ["/workspace"],
        "readable_paths": ["/workspace", "/shared/docs"],
        "blocked_paths": ["/etc", "/root", "/var/secrets"]
    },
    "execution": {
        "max_duration_seconds": 300,
        "max_memory_mb": 2048,
        "allow_network_tools": False,
        "allow_subprocess": True  # For Bash tool
    }
}
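A configuration like this only helps if something enforces it. As a rough sketch, the filesystem section could be checked in application code as shown below. The names `FILESYSTEM_POLICY` and `is_allowed` are hypothetical, and the check is purely lexical (it does not follow symlinks), so it is a complement to container mounts rather than a replacement for them.

```python
import posixpath
from pathlib import PurePosixPath

# Mirrors the "filesystem" section of the configuration above
FILESYSTEM_POLICY = {
    "writable_paths": ["/workspace"],
    "readable_paths": ["/workspace", "/shared/docs"],
    "blocked_paths": ["/etc", "/root", "/var/secrets"],
}


def _under(path: PurePosixPath, roots: list) -> bool:
    """True if `path` is one of `roots` or lies inside one of them."""
    return any(path == PurePosixPath(r) or PurePosixPath(r) in path.parents
               for r in roots)


def is_allowed(path: str, mode: str, policy: dict = FILESYSTEM_POLICY) -> bool:
    """Default-deny check for a 'read' or 'write' on an absolute path."""
    if mode not in ("read", "write"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not path.startswith("/"):
        return False  # relative paths are rejected outright
    # Collapse '.' and '..' so '/workspace/../etc' is seen as '/etc'
    normalized = PurePosixPath(posixpath.normpath(path))
    if _under(normalized, policy["blocked_paths"]):
        return False
    key = "writable_paths" if mode == "write" else "readable_paths"
    return _under(normalized, policy[key])
```

The blocklist is checked first, but the decision that matters is the final allowlist test: anything not explicitly listed is denied.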

4. Production Checklist

CCA Exam Checklist

Production Agent Deployment Checklist

✅ Container isolation (no host filesystem access)
✅ Network restrictions (allowlist, not blocklist)
✅ Rate limiting (client-side + API-side)
✅ Timeout enforcement (per-tool and per-session)
✅ Secret management (never in code or env vars)
✅ Audit logging (immutable, every tool call)
✅ Error budget monitoring (alert on failure rate)
✅ Graceful degradation (fallback when API unavailable)
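The audit-logging item can be sketched as a hash-chained log, where each entry records the tool name, a hash of the input, the duration, and the outcome. The class `AuditLog` and its field names are hypothetical. A hash chain only makes tampering detectable; a genuinely immutable trail would also need append-only storage outside the agent's reach.

```python
import hashlib
import json
import time


class AuditLog:
    """In-memory, tamper-evident log of tool calls."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev = self.GENESIS

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the stored hash itself
        body = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, tool: str, tool_input: dict,
               duration_s: float, ok: bool) -> dict:
        raw = json.dumps(tool_input, sort_keys=True, default=str)
        entry = {
            "tool": tool,
            # Store a hash, not the input, so secrets never land in the log
            "input_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
            "duration_s": round(duration_s, 4),
            "ok": ok,
            "ts": time.time(),
            "prev": self._prev,  # link to the previous entry
        }
        entry["hash"] = self._digest(entry)
        self._prev = entry["hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered or reordered."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != self._digest(entry):
                return False
            prev = entry["hash"]
        return True
```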


                        
CCA Tasks 5.5 & 5.6: The exam focuses on: (1) principle of least privilege for agent environments, (2) containers provide isolation but aren’t perfect security (need additional layers), (3) network isolation is the most important security control for code-executing agents, (4) ephemeral environments (destroy after use) reduce attack surface.
                    

                        
Try It Yourself: Create an environment configuration system for a Claude agent with 3 tiers: dev (uses haiku, verbose logging, mock tools), staging (uses sonnet, real tools, test data), and production (uses sonnet, real tools, monitoring enabled). Implement a config loader that switches behavior based on an ENV environment variable.
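One possible starting point for this exercise is sketched below. Everything in it is an assumption made for illustration: the `AgentConfig` fields, the log levels, and the choice to fall back to dev when ENV is unset. The `model` values are the tier labels from the exercise, not real model IDs.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model: str            # tier label; map to a real model ID in your own code
    log_level: str
    use_mock_tools: bool
    use_test_data: bool
    monitoring: bool


CONFIGS = {
    "dev": AgentConfig("haiku", "DEBUG", use_mock_tools=True,
                       use_test_data=True, monitoring=False),
    "staging": AgentConfig("sonnet", "INFO", use_mock_tools=False,
                           use_test_data=True, monitoring=False),
    "production": AgentConfig("sonnet", "WARNING", use_mock_tools=False,
                              use_test_data=False, monitoring=True),
}


def load_config(environ=None) -> AgentConfig:
    """Pick a config from the ENV variable; unknown values are an error."""
    environ = os.environ if environ is None else environ
    # Default to dev so a missing variable never selects production by accident
    name = environ.get("ENV", "dev").strip().lower()
    if name not in CONFIGS:
        raise ValueError(f"unknown ENV {name!r}; expected one of {sorted(CONFIGS)}")
    return CONFIGS[name]
```

Failing loudly on an unrecognized value (for example `ENV=prod`) is a deliberate choice here: a typo should stop the deployment rather than quietly run with the wrong tools or API key.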
                    

5. Hosting Patterns & Isolation (CCA 6.1)

The current Agent SDK does not expose an environments CRUD API. Deployment happens in your infrastructure: how you host the claude subprocess, what filesystem and network it can reach, and how you persist transcripts when containers move across hosts.

5.1 Hosting Patterns

# Hosting patterns for the Agent SDK
# Docs: https://code.claude.com/docs/en/agent-sdk/hosting

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage


# EPHEMERAL: one container / process per task
async def ephemeral_review():
    async for msg in query(
        prompt="Review this pull request for security issues.",
        options=ClaudeAgentOptions(
            cwd="/work/review-42",  # isolate filesystem state per session
            max_turns=20,
            allowed_tools=["Read", "Glob", "Grep"],
            permission_mode="plan",
        ),
    ):
        if isinstance(msg, ResultMessage) and msg.subtype == "success":
            print(msg.result)


# LONG-RUNNING: keep a session alive across multiple turns in one process
# In Python, use ClaudeSDKClient when the same process owns the conversation.

# HYBRID: resume a captured session ID in a fresh container and mirror
# transcripts via SessionStore when the session must survive host restarts.

print("Hosting patterns: ephemeral | long-running | hybrid | multi-agent container")
print("Each active session maps to its own claude subprocess and transcript file")

5.2 Isolation Technology Comparison

Technology	Security Strength	Operational Complexity	Best For
Sandbox runtime	Good secure defaults	Low	Single-developer and CI runs needing fast isolation
Containers	Setup-dependent	Medium	Most self-hosted agent deployments
gVisor	Excellent	Medium	Multi-tenant or semi-trusted code execution
VMs / microVMs	Excellent	High	Highest-isolation workloads and strong security boundaries

                        
CCA Exam Pattern (6.1): Questions test: (1) The SDK hosting model is subprocess-based, not a managed environments API. (2) Each active session owns local disk state unless you mirror transcripts with SessionStore. (3) Choose isolation based on threat model: sandbox runtime, containers, gVisor, or VMs. (4) Per-session cwd is a core isolation boundary. (5) Network controls and credential injection live in your deployment architecture, not a Claude-managed environment object.
                    

6. Self-Hosted Environments (CCA 6.2)

6.0 Permission Modes (Agent SDK)

The permission_mode option controls whether an agent asks for approval before using tools. This is critical for choosing the right autonomy level for each deployment environment:

# Permission Modes — Control Agent Autonomy
# Docs: https://code.claude.com/docs/en/agent-sdk/agent-loop#permission-mode

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage


# --- Mode 1: "default" (Interactive — prompts for unapproved tools) ---
# Tools in allowed_tools run freely; everything else triggers approval callback
async def interactive_agent():
    async for msg in query(
        prompt="Fix the failing tests",
        options=ClaudeAgentOptions(
            permission_mode="default",
            allowed_tools=["Read", "Grep", "Glob"],  # These run without asking
            # Edit, Write, Bash will trigger approval prompt
        ),
    ):
        pass


# --- Mode 2: "acceptEdits" (Developer mode — auto-approves file edits) ---
# Auto-approves file edits + common filesystem commands (mkdir, touch, mv, cp)
# Other Bash commands still follow allow rules
async def developer_agent():
    async for msg in query(
        prompt="Refactor auth module to use JWT",
        options=ClaudeAgentOptions(
            permission_mode="acceptEdits",
            allowed_tools=["Read", "Write", "Edit", "Glob", "Grep", "Bash"],
        ),
    ):
        pass


# --- Mode 3: "plan" (Read-only exploration — never edits) ---
# Only read-only tools run. Claude explores and produces a plan.
async def planning_agent():
    async for msg in query(
        prompt="Analyze the codebase and propose a microservices split",
        options=ClaudeAgentOptions(
            permission_mode="plan",
            allowed_tools=["Read", "Grep", "Glob"],
        ),
    ):
        pass


# --- Mode 4: "bypassPermissions" (CI/CD — no prompts, full autonomy) ---
# Auto-approves all tool uses that reach the mode check. Use ONLY in isolated environments.
# Cannot be used when running as root on Unix.
async def ci_agent():
    async for msg in query(
        prompt="Run full test suite and generate coverage report",
        options=ClaudeAgentOptions(
            permission_mode="bypassPermissions",
            allowed_tools=["Read", "Bash", "Grep", "Glob"],
            disallowed_tools=["Write", "Edit"],  # Still enforce deny rules
        ),
    ):
        pass


# --- Mode 5: "dontAsk" (Strict — never prompts, denies unknowns) ---
# Pre-approved tools run; everything else is silently denied (no prompt).
async def strict_agent():
    async for msg in query(
        prompt="Search for TODOs in the codebase",
        options=ClaudeAgentOptions(
            permission_mode="dontAsk",
            allowed_tools=["Read", "Grep", "Glob"],  # Only these can run
        ),
    ):
        pass


print("Permission modes: default | acceptEdits | plan | dontAsk | bypassPermissions")
print("Choose based on deployment: interactive → dev → CI → production")

Mode	Behavior	Best For
`default`	Prompts for unapproved tools	Interactive apps, user-facing agents
`acceptEdits`	Auto-approves file edits + mkdir/touch/mv/cp	Developer tools, code agents on dev machines
`plan`	Read-only; produces plan without editing	Architecture analysis, pre-review exploration
`dontAsk`	Never prompts; denies unlisted tools	Strict production agents with known tool sets
`bypassPermissions`	Runs all tools without asking unless blocked earlier by deny rules or hooks	CI/CD, containers, isolated sandboxes

6.1 Hardened Container Pattern

For self-hosted deployments, the docs recommend isolating the SDK subprocess inside a sandboxed container and routing outbound traffic through a proxy. The hardening controls live in your container runtime and network layer, not in a Claude-managed polling API.

# Hardened self-hosted container
# /workspace is a read-only bind mount; the only writable space is tmpfs
docker run \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=100m \
  --tmpfs /home/agent:rw,noexec,nosuid,size=500m \
  --network none \
  --memory 2g \
  --cpus 2 \
  --user 1000:1000 \
  -v /path/to/code:/workspace:ro \
  -v /var/run/proxy.sock:/var/run/proxy.sock:ro \
  agent-image
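When containers are launched from application code, the same flags can be assembled programmatically. The sketch below is one way to do that; `SandboxSpec` and `docker_argv` are hypothetical names, and the function only builds the argument list. It also rejects a configuration in which a tmpfs mount and a bind mount target the same path, since Docker cannot mount both at one location.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxSpec:
    image: str
    memory: str = "2g"
    cpus: str = "2"
    user: str = "1000:1000"
    tmpfs: dict = field(default_factory=dict)             # target -> options
    read_only_mounts: dict = field(default_factory=dict)  # host path -> target


def docker_argv(spec: SandboxSpec) -> list:
    """Build a hardened `docker run` argument list from a spec."""
    clash = set(spec.tmpfs) & set(spec.read_only_mounts.values())
    if clash:
        raise ValueError(f"mount target used twice: {sorted(clash)}")

    argv = [
        "docker", "run",
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--read-only",                          # read-only root filesystem
        "--network", "none",                    # no network interfaces
        "--memory", spec.memory,
        "--cpus", spec.cpus,
        "--user", spec.user,
    ]
    for target, options in spec.tmpfs.items():
        argv += ["--tmpfs", f"{target}:{options}"]
    for host_path, target in spec.read_only_mounts.items():
        argv += ["-v", f"{host_path}:{target}:ro"]
    argv.append(spec.image)
    return argv
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems and keeps the security flags in one reviewed place instead of scattered across scripts.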

Self-Hosted Subprocess Model

flowchart TD
    Client[User or API caller] --> App[Your hosting app]
    App --> SDK[Agent SDK query]
    SDK --> Claude[claude subprocess]
    Claude --> Disk[Local transcript + working directory]
    Claude --> Proxy[Egress proxy]
    Proxy --> External[Allowed external services]

# Give each hosted session its own working directory
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions


async def hosted_session(prompt: str, session_id: str):
    async for _ in query(
        prompt=prompt,
        options=ClaudeAgentOptions(
            cwd=f"/workspace/{session_id}",
            max_turns=20,
        ),
    ):
        pass


print("Each active session should get its own cwd and local transcript path")
print("Use SessionStore when transcripts must survive host restarts or re-scheduling")
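Because the session ID ends up in a filesystem path, it is worth validating before use. The helper below is a sketch of that step using only the standard library. The name `session_cwd`, the allowed character set, and the 64-character limit are assumptions for this example, not SDK requirements.

```python
import re
from pathlib import Path

# Hypothetical rule: letters, digits, '_' and '-', at most 64 characters
_SESSION_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def session_cwd(base: Path, session_id: str) -> Path:
    """Create and return a private working directory for one session."""
    if not _SESSION_ID.fullmatch(session_id):
        # Rejects '', '..', and anything containing '/' before it reaches the path
        raise ValueError(f"invalid session id: {session_id!r}")
    path = base / session_id
    # 0o700: only the agent's own user can enter the directory
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    return path
```

The returned path can then be passed as `cwd` when starting the session, so two sessions never share working files.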

6.2 Claude on Cloud Platforms (CCA 6.3)

import json

# Claude on AWS (Bedrock) — access via AWS IAM, no Anthropic API key needed
# Claude on GCP (Vertex AI) — access via Google Cloud IAM

# AWS Bedrock Integration
aws_config = {
    "provider": "aws_bedrock",
    "region": "us-east-1",
    "model_id": "anthropic.claude-sonnet-4-20250514-v1:0",  # example ID; check the Bedrock model catalog for current IDs
    "auth": "IAM role (no API key needed)",
    "cross_account": "Use resource-based policies or assume-role",
    "pricing": "Same per-token, billed through AWS"
}

# GCP Vertex AI Integration
gcp_config = {
    "provider": "vertex_ai",
    "region": "us-central1",
    "model_id": "claude-sonnet-4@20250514",  # example ID; check the Vertex AI model list for current IDs
    "auth": "Service account or Workload Identity",
    "pricing": "Same per-token, billed through GCP"
}

# When to use cloud provider vs direct API:
# Direct API: fastest new model access, all features immediately
# AWS Bedrock: existing AWS infra, IAM governance, data stays in AWS
# GCP Vertex: existing GCP infra, Google IAM, data stays in GCP

# Cross-account access (AWS):
# 1. Create IAM role in account with Bedrock access
# 2. Allow assume-role from agent's execution account
# 3. Agent assumes role to call Bedrock (no API keys stored)

# Key differences:
# - Feature availability: new features usually reach the direct API first; cloud providers may lag
# - Data residency: Cloud providers keep data in your chosen region
# - Governance: Cloud IAM integrates with existing org policies
# - Billing: consolidated with other cloud services

print("Direct API: All features, fastest updates, Anthropic billing")
print("AWS Bedrock: IAM auth, AWS billing, regional data residency")
print("GCP Vertex AI: Google IAM, GCP billing, regional data residency")
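For the subprocess-based hosting model described earlier, switching providers is typically a matter of the environment the `claude` process starts with. The sketch below builds that environment. The function name `provider_env` is hypothetical, and the variable names (`CLAUDE_CODE_USE_BEDROCK`, `CLAUDE_CODE_USE_VERTEX`, `AWS_REGION`, `CLOUD_ML_REGION`, `ANTHROPIC_VERTEX_PROJECT_ID`) are taken from my reading of the Claude Code provider documentation; treat them as assumptions and verify them against the current docs before relying on them.

```python
def provider_env(provider: str, region: str = "", project_id: str = "") -> dict:
    """Return extra environment variables for the chosen provider."""
    if provider == "direct":
        # Direct API: the key is injected separately (for example by a vault)
        return {}
    if provider == "bedrock":
        # Credentials come from the IAM role, so no API key is set here
        env = {"CLAUDE_CODE_USE_BEDROCK": "1"}
        if region:
            env["AWS_REGION"] = region
        return env
    if provider == "vertex":
        if not project_id:
            raise ValueError("vertex requires a GCP project_id")
        env = {
            "CLAUDE_CODE_USE_VERTEX": "1",
            "ANTHROPIC_VERTEX_PROJECT_ID": project_id,
        }
        if region:
            env["CLOUD_ML_REGION"] = region
        return env
    raise ValueError(f"unknown provider: {provider!r}")
```

Keeping this choice in one function means dev can use the direct API while production uses a cloud provider's IAM, without any change to the agent code itself.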

                        
CCA Exam Pattern (6.2, 6.3): Questions test: (1) Each hosted session maps to a subprocess with local disk state. (2) Per-session cwd is important for isolation and correct session resume behavior. (3) Use SessionStore to mirror transcripts across hosts. (4) AWS Bedrock uses IAM roles instead of an Anthropic API key. (5) Harden self-hosted deployments with container isolation, network controls, and an egress proxy.