AI-Native Systems Architecture — Systems Thinking & Architecture Mastery Part 19

Module 41: AI Infrastructure Systems

AI-native systems demand fundamentally different infrastructure than traditional applications. Where conventional systems optimize for CPU cycles, network I/O, and disk throughput, AI systems optimize for GPU compute density, memory bandwidth, and batching efficiency. The cost profile is also radically different — a single H100 GPU costs ~$30,000, making utilization a first-class architectural concern.

GPU Scheduling Strategies

Modern GPU scheduling addresses a core tension: GPUs are expensive and must be shared across workloads, but AI workloads have diverse requirements (training needs sustained throughput, inference needs low latency, fine-tuning needs burst capacity). Three primary strategies have emerged:

                            
GPU Scheduling Strategies Compared: Time-slicing offers simplicity but no memory isolation. MPS provides concurrent execution with shared memory. MIG delivers hardware-level isolation with dedicated memory, the gold standard for multi-tenant production clusters.

1. Time-Slicing: The simplest approach. The GPU rapidly switches between workloads (like CPU time-sharing). Each workload gets exclusive access during its time slice, but context switching adds overhead (typically a 5-15% loss in GPU utilization). There is no memory isolation: if one workload allocates too much VRAM, the others can fail with out-of-memory errors.

2. NVIDIA MPS (Multi-Process Service): Lets kernels from multiple processes execute concurrently on the same GPU, so workloads share compute resources simultaneously rather than sequentially. Ideal for inference where individual requests use <10% of GPU capacity. Fault isolation is limited: a fatal fault in one client can take down the others sharing the GPU.

3. NVIDIA MIG (Multi-Instance GPU): Hardware-level partitioning available on A100/H100 GPUs. Divides a single GPU into up to 7 independent instances, each with dedicated compute units, memory, and memory bandwidth. Isolation is enforced in hardware, so a fault in one instance cannot affect the others. The best option for multi-tenant production environments.

# Kubernetes GPU scheduling with MIG and time-slicing
# Illustrative layout only: the exact keys depend on the GPU Operator and device-plugin version
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      mig:
        strategy: mixed  # Allow both MIG and non-MIG GPUs
        devices:
          - name: A100-SXM4-80GB
            migEnabled: true
            migDevices:
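              # 1 large instance for inference serving
              # (a 3g.40gb takes 4 of the GPU's 8 memory slices, so two of them leave no room for a 1g.10gb)
              - profile: 3g.40gb
                count: 1
              # 1 medium instance
              - profile: 2g.20gb
                count: 1
              # 2 small instances for monitoring/metrics
              - profile: 1g.10gb
                count: 2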
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each physical GPU appears as 4 virtual GPUs
            overcommit: true
---
# Pod requesting a MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
    - name: vllm-server
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1  # Request specific MIG profile
      # The device plugin exposes the allocated MIG instance to the container,
      # so CUDA_VISIBLE_DEVICES is not hard-coded here.
      args:
        - --gpu-memory-utilization=0.90  # Use 90% of allocated MIG memory
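Whether a set of MIG profiles fits on one GPU can be checked with simple slice arithmetic. The sketch below is a rough sanity check, not NVIDIA's placement algorithm: the slice costs are assumptions read off the A100-80GB profile names (7 compute slices, 8 memory slices of 10GB each), and real GPUs add placement rules that this ignores.

```python
# Assumed slice costs per A100-80GB MIG profile: (compute slices, memory slices).
PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def check_layout(layout: dict[str, int]) -> tuple[bool, int, int]:
    """Return (fits, compute slices used, memory slices used) for {profile: count}."""
    compute = sum(PROFILES[name][0] * count for name, count in layout.items())
    memory = sum(PROFILES[name][1] * count for name, count in layout.items())
    fits = compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES
    return fits, compute, memory


if __name__ == "__main__":
    # Two 3g.40gb instances already use all 8 memory slices.
    print(check_layout({"3g.40gb": 2, "1g.10gb": 1}))
    # A balanced layout that uses every compute and memory slice.
    print(check_layout({"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}))
```

A layout that passes this check can still be rejected by the driver, so treat a passing result as necessary rather than sufficient.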

GPU Memory Management

GPU memory (VRAM) is the most constrained resource in AI systems. A 70B parameter model requires ~140GB in FP16, exceeding even the H100's 80GB. Memory management strategies include:

Quantization: Reduce precision from FP16 (2 bytes per parameter) to INT8 (1 byte) or INT4 (0.5 bytes). A 70B model's weights drop from ~140GB to ~35GB at INT4 with ~3% quality loss on benchmarks.
KV Cache Management: Attention KV caches grow linearly with sequence length. vLLM's PagedAttention allocates KV cache in non-contiguous pages (like OS virtual memory), which removes most of the 60-80% of KV-cache memory that contiguous pre-allocation loses to fragmentation and over-reservation.
Tensor Parallelism: Split each layer's weight matrices across multiple GPUs (placing whole layers on different GPUs is pipeline parallelism). A 70B model in FP16 on 4×H100 GPUs holds ~35GB of weights per GPU, with fast NVLink interconnect carrying the per-layer communication.
Offloading: Move inactive layers to CPU RAM or NVMe SSD. Enables serving models larger than GPU memory at the cost of latency (CPU offload adds ~50ms per layer swap).
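The memory figures above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 1GB = 10^9 bytes and counting weights only (KV cache, activations, and quantization metadata are left out); the function name is illustrative:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str, tensor_parallel: int = 1) -> float:
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / tensor_parallel / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"70B @ {precision}: {weight_memory_gb(70, precision):.0f} GB")
    print(f"70B @ fp16 over 4 GPUs: {weight_memory_gb(70, 'fp16', 4):.0f} GB per GPU")
```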

Vector Databases for Semantic Search

Vector databases are purpose-built for storing, indexing, and querying high-dimensional embedding vectors. The fundamental workflow is: embed → index → query. Text (or images, audio) is converted to dense vectors via embedding models, stored in specialized indexes, and queried via cosine similarity or dot product.
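The embed → index → query workflow can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model (so the vectors are sparse, not dense), and `query` is an exact brute-force scan rather than an approximate index.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Index step: store each document next to its vector."""
    return [(doc, embed(doc)) for doc in docs]


def query(index: list[tuple[str, Counter]], text: str, k: int = 2) -> list[str]:
    """Query step: rank every stored vector by cosine similarity (exact search)."""
    q = embed(text)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


if __name__ == "__main__":
    index = build_index([
        "gpu memory bandwidth limits inference",
        "postgres index tuning guide",
        "kv cache grows with sequence length",
    ])
    print(query(index, "gpu inference memory", k=1))
```

A vector database replaces the linear scan in `query` with an approximate index so that lookups stay fast as the collection grows.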

AI Inference Pipeline — From Request to Response

flowchart LR
    subgraph Ingestion["Data Ingestion"]
        DOC[Documents] --> CHUNK[Chunking<br/>512-1024 tokens]
        CHUNK --> EMBED[Embedding Model<br/>text-embedding-3-large]
        EMBED --> VEC[3072-dim vectors]
    end

    subgraph Index["Vector Index"]
        VEC --> HNSW[HNSW Index<br/>Approximate NN]
        VEC --> META[Metadata Store<br/>Filters, tags]
    end

    subgraph Query["Query Pipeline"]
        Q[User Query] --> QEMB[Query Embedding]
        QEMB --> SEARCH[ANN Search<br/>top-k neighbors]
        META --> FILTER[Pre/Post Filter]
        SEARCH --> FILTER
        FILTER --> RANK[Re-ranking<br/>Cross-encoder]
    end

    subgraph Serve["LLM Generation"]
        RANK --> CTX[Context Assembly<br/>Retrieved chunks]
        CTX --> LLM[LLM Inference<br/>GPT-4, Claude]
        LLM --> RESP[Response]
    end

Key vector databases compared:

Database	Architecture	Best For	Scale
Pinecone	Fully managed, serverless	Production RAG with zero ops	Billions of vectors
Weaviate	Open-source, hybrid search	Combined vector + keyword search	Hundreds of millions
pgvector	PostgreSQL extension	Existing Postgres stack, moderate scale	Millions of vectors
Qdrant	Rust-based, high-performance	Low-latency filtering + search	Billions of vectors
ChromaDB	Embedded, developer-friendly	Prototyping, local development	Millions of vectors

HNSW Algorithm & Hybrid Search

HNSW (Hierarchical Navigable Small World) is the dominant algorithm for approximate nearest neighbor (ANN) search in production. It builds a multi-layer graph where higher layers have fewer nodes (enabling fast coarse navigation) and lower layers have more nodes (enabling precise local search). Expected query cost grows roughly as O(log n), and recall above 95% is achievable at typical configurations.

Key tuning parameters:

M (connections per node): Higher M → better recall, more memory. Typical: 16-64.
ef_construction (build-time search width): Higher → better index quality, slower build. Typical: 200-500.
ef_search (query-time search width): Higher → better recall, higher latency. Typical: 50-200.

Hybrid Search combines vector similarity with keyword (BM25) matching. Critical for cases where exact terms matter (product IDs, error codes, proper nouns). A common fusion formula is score = α × vector_score + (1-α) × bm25_score, where both scores are first normalized to a common range (raw BM25 scores are unbounded) and α is tuned per use case (0.7 for semantic-heavy, 0.3 for keyword-heavy).
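The weighted fusion can be sketched as follows. Min-max normalization is one assumption among several reasonable choices (rank-based fusion is another), and the document IDs and scores are made-up illustration data.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(vector_scores: dict[str, float],
                  bm25_scores: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    """score = alpha * vector_score + (1 - alpha) * bm25_score, on normalized inputs."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    docs = v.keys() | b.keys()
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}


if __name__ == "__main__":
    vector = {"doc_a": 0.9, "doc_b": 0.6, "doc_c": 0.3}   # cosine similarities
    bm25 = {"doc_a": 1.0, "doc_b": 9.0, "doc_c": 5.0}     # raw BM25 scores
    for alpha in (0.7, 0.3):
        fused = hybrid_scores(vector, bm25, alpha)
        print(alpha, max(fused, key=fused.get))
```

With these inputs the top result flips between the semantic favourite and the keyword favourite as α moves from 0.7 to 0.3, which is the behaviour the tuning knob is meant to expose.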

AI Inference Platforms

Inference platforms bridge the gap between trained models and production serving. The three dominant platforms serve different niches:

NVIDIA TensorRT: Maximum throughput on NVIDIA hardware. Compiles models to optimized GPU kernels with operator fusion, precision calibration, and kernel auto-tuning. Best for fixed-shape workloads (vision models, speech).
vLLM: Purpose-built for LLM serving. PagedAttention for efficient KV cache management, continuous batching for throughput, speculative decoding for latency. The standard for production LLM APIs.
NVIDIA Triton: Model orchestration platform. Serves multiple models/frameworks simultaneously (PyTorch, TensorFlow, TensorRT, ONNX) with dynamic batching, model ensembles, and A/B testing. Best for complex multi-model pipelines.

Inference Optimization Techniques

Modern inference optimization reduces latency and cost through several complementary techniques:

Continuous Batching: Instead of holding a batch together until its slowest request finishes (static batching), the scheduler re-forms the batch at every decoding iteration: finished sequences leave and queued requests take their slots immediately. Increases throughput 2-5× over naive batching.
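The difference between the two schedulers shows up in a small simulation. This sketch counts decode iterations only and assumes each request is fully described by its output length; prefill cost and memory limits are ignored, and the function names are illustrative.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch occupies the GPU until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A slot freed by a finished sequence is refilled at the next iteration."""
    queue = list(lengths)
    active: list[int] = []   # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1           # one decode iteration produces one token per sequence
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [10, 1, 1, 1, 10, 1, 1, 1]   # output tokens per request
    print("static:    ", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

The gap widens as output lengths become more uneven, because static batching leaves slots idle while the longest request in each batch finishes.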

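Speculative Decoding: Use a small "draft" model to generate N candidate tokens quickly, then verify them in a single parallel pass of the full model. The full model keeps the longest prefix of draft tokens it agrees with (per-token acceptance is often 60-80%) and contributes one token of its own, so each full-model pass yields between 1 and N+1 tokens instead of exactly 1. Typical speedup: 2-3×.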
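The verification step can be sketched for the greedy case. Two assumptions to flag: `verify_draft` compares tokens exactly (sampling-based variants use a probabilistic acceptance rule instead), and `expected_tokens_per_pass` treats each draft token as accepted independently with the same probability, which is a simplification.

```python
def verify_draft(draft: list[str], target: list[str]) -> list[str]:
    """Greedy verification of N draft tokens.

    target holds N + 1 tokens: the full model's own choice at each draft
    position, plus one position past the end, all from one parallel pass.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain one more token than draft")
    accepted = []
    for proposed, expected in zip(draft, target):
        if proposed != expected:
            accepted.append(expected)   # replace the first mismatch, discard the rest
            return accepted
        accepted.append(proposed)
    accepted.append(target[-1])         # every draft token accepted: keep the bonus token
    return accepted


def expected_tokens_per_pass(acceptance_rate: float, n: int) -> float:
    """Expected tokens per full-model pass with N draft tokens."""
    a = acceptance_rate
    return float(n + 1) if a == 1.0 else (1 - a ** (n + 1)) / (1 - a)


if __name__ == "__main__":
    print(verify_draft(["the", "cat", "sat"], ["the", "cat", "slept", "on"]))
    print(round(expected_tokens_per_pass(0.7, 5), 2))
```

Under these assumptions a 70% acceptance rate with 5 draft tokens yields just under 3 tokens per full-model pass; the wall-clock speedup is lower once the draft model's own cost is included.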

KV Cache Optimization: PagedAttention (vLLM) manages KV cache like OS virtual memory. Non-contiguous, on-demand allocation removes most of the 60-80% memory waste caused by pre-allocated contiguous buffers.
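A back-of-the-envelope calculation shows why KV cache allocation matters. The model shape below (80 layers, 8 KV heads, head dimension 128, FP16 values) is an assumption matching a Llama-3.1-70B-style architecture, and the 16-token block size is likewise an assumed page size.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def reserved_tokens_contiguous(max_len: int) -> int:
    """Contiguous pre-allocation reserves the full context window up front."""
    return max_len


def reserved_tokens_paged(seq_len: int, block_size: int = 16) -> int:
    """Paged allocation reserves whole blocks only as the sequence grows."""
    return math.ceil(seq_len / block_size) * block_size


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{per_token * 8192 / 2**30:.1f} GiB for one 8192-token sequence")
    used = 1000  # tokens actually generated so far
    print("contiguous reserves", reserved_tokens_contiguous(8192), "tokens")
    print("paged reserves     ", reserved_tokens_paged(used), "tokens")
```

For a sequence that has used 1,000 of a possible 8,192 tokens, the contiguous buffer sits almost 90% empty, while paging over-reserves by at most one block.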

"""
vLLM Serving Configuration: Production LLM API
Demonstrates continuous batching, tensor parallelism, and speculative decoding.
Argument names follow the vLLM release this was written against; later releases
may have renamed or removed some of them (notably the speculative-decoding options).
"""
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Production vLLM configuration for a 70B model on 4×H100
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,           # Split across 4 GPUs (~35GB of FP16 weights each)
    dtype="float16",                   # FP16 precision
    quantization=None,                 # Set to "awq" only when `model` points at an AWQ-quantized checkpoint
    max_model_len=8192,                # Maximum context window
    gpu_memory_utilization=0.90,       # Use 90% of GPU memory
    max_num_batched_tokens=32768,      # Max tokens in a batch
    max_num_seqs=256,                  # Max concurrent sequences
    enable_prefix_caching=True,        # Cache common prefixes (system prompts)
    enable_chunked_prefill=True,       # Overlap prefill with decode
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # Draft model
    num_speculative_tokens=5,          # Generate 5 draft tokens
    use_v2_block_manager=True,         # Improved memory management (older releases only)
)

# Initialize async engine for production serving
engine = AsyncLLMEngine.from_engine_args(engine_args)

# Sampling parameters for different use cases
creative_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=2048,
    frequency_penalty=0.1,
)

deterministic_params = SamplingParams(
    temperature=0.0,           # Greedy decoding
    max_tokens=1024,
    stop=["```", "\n\n\n"],    # Stop sequences
)

print("vLLM engine configured:")
print(f"  Model: {engine_args.model}")
print(f"  GPUs: {engine_args.tensor_parallel_size}")
print(f"  Quantization: {engine_args.quantization}")
print(f"  Max sequences: {engine_args.max_num_seqs}")
print(f"  Speculative decoding: {engine_args.num_speculative_tokens} tokens")

Module 42: Agentic Systems Architecture

Agentic systems represent a paradigm shift from "AI as a function" (input → output) to "AI as an actor" (perceive → reason → plan → act → reflect). These systems maintain state, use tools, coordinate with other agents, and operate autonomously over extended time periods. The architectural challenges are fundamentally different from stateless inference.

Model Context Protocol (MCP) Architecture

MCP (Model Context Protocol), introduced by Anthropic, establishes a standard interface between AI models and external tools/data sources. It solves the "N×M integration problem" — instead of every AI application implementing custom integrations with every tool, both sides implement MCP once.

MCP Architecture — Hosts, Clients, and Servers

flowchart TB
    subgraph Host["MCP Host (IDE, Chat App)"]
        APP[Application Logic]
        subgraph Client1["MCP Client 1"]
            C1[Protocol Handler]
        end
        subgraph Client2["MCP Client 2"]
            C2[Protocol Handler]
        end
        subgraph Client3["MCP Client 3"]
            C3[Protocol Handler]
        end
    end

    subgraph Server1["MCP Server: Database"]
        T1[Tools: query, insert]
        R1[Resources: schema, tables]
        P1[Prompts: SQL templates]
    end

    subgraph Server2["MCP Server: GitHub"]
        T2[Tools: create_issue, PR]
        R2[Resources: repos, files]
        P2[Prompts: review template]
    end

    subgraph Server3["MCP Server: Monitoring"]
        T3[Tools: get_metrics, alert]
        R3[Resources: dashboards]
        S3[Sampling: anomaly check]
    end

    APP --> Client1
    APP --> Client2
    APP --> Client3
    Client1 -->|"JSON-RPC 2.0<br/>stdio/SSE"| Server1
    Client2 -->|"JSON-RPC 2.0"| Server2
    Client3 -->|"JSON-RPC 2.0"| Server3

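MCP's four primitives (tools, resources, and prompts are exposed by servers; sampling is a capability the client offers back to servers):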

Tools: Functions the model can invoke (with parameters and return values). Server-exposed, model-controlled execution. Example: query_database(sql: string) → ResultSet
Resources: Data the model can read (URI-addressable). Application-controlled access to context. Example: file:///project/src/main.py
Prompts: Reusable prompt templates with parameters. User-triggered template expansion. Example: code_review(language: "python", diff: "...")
Sampling: Server-initiated model invocations. Allows MCP servers to request completions from the host model. Example: monitoring server asks model to analyze an anomaly.
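On the wire, each primitive is a JSON-RPC 2.0 exchange. The sketch below builds a tools/call request and a matching result by hand to show the message shape; the helper names are hypothetical, the tool name and arguments reuse the query_database example above, and the result text is made up.

```python
import json
from itertools import count

_request_ids = count(1)


def make_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request with a fresh id."""
    return {"jsonrpc": "2.0", "id": next(_request_ids), "method": method, "params": params}


def make_result(request: dict, result: dict) -> dict:
    """Build the response; the id must echo the request's id."""
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


# Client -> server: invoke a tool by name with its arguments.
call = make_request("tools/call", {
    "name": "query_database",
    "arguments": {"sql": "SELECT 1", "database": "users"},
})

# Server -> client: content blocks plus an error flag.
reply = make_result(call, {
    "content": [{"type": "text", "text": "[[1]]"}],
    "isError": False,
})

if __name__ == "__main__":
    print(json.dumps(call, indent=2))
    print(json.dumps(reply, indent=2))
```

Resources and prompts follow the same request/response pattern with different method names, which is why one protocol handler per connection is enough.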

Building MCP Servers

"""
MCP Server Implementation — Database Query Tool
Demonstrates tools, resources, and prompts primitives
"""
import asyncio
import json
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import (
    Tool, Resource, Prompt,
    TextContent, PromptMessage,
    GetPromptResult, CallToolResult
)

# Initialize MCP server
server = Server("database-mcp-server")

# --- TOOLS: Functions the model can invoke ---
@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_database",
            description="Execute a read-only SQL query against the production database. "
                        "Returns results as JSON. Maximum 100 rows.",
            inputSchema={
                "type": "object",
                "properties": {
                    "sql": {
                        "type": "string",
                        "description": "SQL SELECT query to execute"
                    },
                    "database": {
                        "type": "string",
                        "enum": ["users", "orders", "analytics"],
                        "description": "Target database"
                    }
                },
                "required": ["sql", "database"]
            }
        ),
        Tool(
            name="explain_query",
            description="Get the execution plan for a SQL query without running it.",
            inputSchema={
                "type": "object",
                "properties": {
                    "sql": {"type": "string", "description": "SQL query to explain"}
                },
                "required": ["sql"]
            }
        )
    ]

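async def execute_query(sql: str, database: str) -> list[dict]:
    """Placeholder for the real database call; returns no rows in this sketch."""
    return []

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    # The low-level SDK wraps the returned content list in a CallToolResult
    # and converts a raised exception into a result with isError=True.
    if name == "query_database":
        sql = arguments["sql"].strip()
        # Cheap first filter only: a SELECT prefix check does not make a query safe,
        # so reject stacked statements here and connect with a read-only database role.
        if not sql.upper().startswith("SELECT") or ";" in sql.rstrip(";"):
            raise ValueError("Only single SELECT queries are allowed.")
        results = await execute_query(sql, arguments["database"])
        return [TextContent(type="text", text=json.dumps(results, indent=2))]
    # explain_query is advertised above but left unimplemented in this sketch
    raise ValueError(f"Tool not implemented: {name}")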

# --- RESOURCES: Data the model can read ---
@server.list_resources()
async def list_resources():
    return [
        Resource(
            uri="db://schema/users",
            name="Users Table Schema",
            description="Column definitions for the users table",
            mimeType="application/json"
        ),
        Resource(
            uri="db://schema/orders",
            name="Orders Table Schema",
            description="Column definitions for the orders table",
            mimeType="application/json"
        )
    ]

# --- PROMPTS: Reusable templates ---
@server.list_prompts()
async def list_prompts():
    return [
        Prompt(
            name="analyze_slow_queries",
            description="Analyze slow database queries and suggest optimizations",
            arguments=[
                {"name": "threshold_ms", "description": "Latency threshold in ms", "required": True}
            ]
        )
    ]

@server.get_prompt()
async def get_prompt(name: str, arguments: dict) -> GetPromptResult:
    if name == "analyze_slow_queries":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Analyze queries slower than {arguments['threshold_ms']}ms. "
                             f"For each: explain why it's slow, suggest index changes, "
                             f"and provide the optimized query."
                    )
                )
            ]
        )

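# Run server
async def main():
    async with stdio_server() as (read_stream, write_stream):
        # run() also needs the initialization options advertised to the client
        await server.run(read_stream, write_stream, server.create_initialization_options())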

if __name__ == "__main__":
    asyncio.run(main())

Multi-Agent Patterns

Multi-agent systems use multiple AI agents that collaborate to solve complex tasks. Each agent specializes in a subdomain, and an orchestration layer coordinates their interactions. Three dominant patterns have emerged:

Multi-Agent Supervisor Pattern

flowchart TD
    USER[User Request] --> SUP[Supervisor Agent<br/>Plans, delegates, synthesizes]

    SUP -->|"Research task"| RA[Research Agent<br/>Web search, docs]
    SUP -->|"Code task"| CA[Code Agent<br/>Write, test, debug]
    SUP -->|"Review task"| REV[Review Agent<br/>Quality, security]

    RA -->|"Findings"| SUP
    CA -->|"Code output"| SUP
    REV -->|"Feedback"| SUP

    SUP -->|"Revision needed"| CA
    SUP -->|"Final answer"| USER

    subgraph SharedState["Shared State"]
        MEM[Working Memory]
        ART[Artifacts Store]
    end

    RA --> SharedState
    CA --> SharedState
    REV --> SharedState

Pattern 1: Supervisor — A central orchestrator agent decomposes tasks, delegates to specialist agents, and synthesizes results. Simple to reason about, single point of control, but the supervisor can become a bottleneck. Used by: AutoGen, CrewAI.

Pattern 2: Hierarchical — Multiple layers of supervisors. A top-level planner delegates to mid-level coordinators, which delegate to worker agents. Scales to complex tasks but adds latency and coordination overhead. Used by: enterprise workflow systems.

Pattern 3: Collaborative (Peer-to-Peer) — Agents communicate directly without a central coordinator. Each agent has a role and can message others. Flexible and resilient to individual failures, but harder to reason about and debug. Emergent behavior can be unpredictable.
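The supervisor pattern reduces to a small control loop. In this sketch the specialist agents are plain functions standing in for LLM-backed agents, and the plan is hard-coded where a real supervisor would generate it; every name is illustrative rather than taken from any framework.

```python
from typing import Callable

Agent = Callable[[str], str]


def research_agent(task: str) -> str:
    return f"findings for '{task}'"


def code_agent(task: str) -> str:
    return f"code for '{task}'"


def review_agent(task: str) -> str:
    return f"review of '{task}'"


class Supervisor:
    """Central orchestrator: plans, delegates, then synthesizes."""

    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents
        self.shared_state: list[tuple[str, str]] = []   # (agent name, output)

    def plan(self, request: str) -> list[tuple[str, str]]:
        # Fixed plan standing in for an LLM-generated task decomposition.
        return [("research", request), ("code", request), ("review", request)]

    def run(self, request: str) -> str:
        for agent_name, task in self.plan(request):
            output = self.agents[agent_name](task)
            self.shared_state.append((agent_name, output))
        return " | ".join(output for _, output in self.shared_state)


if __name__ == "__main__":
    supervisor = Supervisor({
        "research": research_agent,
        "code": code_agent,
        "review": review_agent,
    })
    print(supervisor.run("rate limiter"))
```

Every delegation passes through `run`, which is what makes the pattern easy to trace and also why the supervisor becomes the bottleneck under load.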

Workflow Orchestration with State Machines

Agentic workflows need explicit state management. Unlike simple request-response chains, agents loop, branch, retry, and maintain context across many steps. LangGraph models agent workflows as directed graphs with typed state:

Nodes: Functions that process state (LLM calls, tool invocations, human-in-the-loop checkpoints)
Edges: Conditional transitions based on state (if research complete → code, if tests fail → debug)
State: Typed, persistent context passed between nodes (messages, artifacts, iteration count)
Checkpoints: Durable state persistence for long-running workflows (resume after interruption)
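The four concepts map onto a few lines of plain Python. This is a stand-in written without LangGraph, not its API: nodes are functions, `next_node` plays the role of conditional edges, `WorkflowState` is the typed state, and the checkpoint list is an in-memory substitute for durable persistence.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkflowState:
    task: str
    attempts: int = 0
    tests_pass: bool = False
    history: list = field(default_factory=list)


def research(state: WorkflowState) -> WorkflowState:
    state.history.append("research")
    return state


def write_code(state: WorkflowState) -> WorkflowState:
    state.attempts += 1
    state.history.append("code")
    return state


def run_tests(state: WorkflowState) -> WorkflowState:
    # Stand-in for a real test run: passes on the second attempt.
    state.tests_pass = state.attempts >= 2
    state.history.append("test")
    return state


NODES = {"research": research, "code": write_code, "test": run_tests}
MAX_ATTEMPTS = 3


def next_node(current: str, state: WorkflowState) -> Optional[str]:
    """Conditional edges: loop back to coding until tests pass or attempts run out."""
    if current == "research":
        return "code"
    if current == "code":
        return "test"
    if state.tests_pass or state.attempts >= MAX_ATTEMPTS:
        return None
    return "code"


def run_workflow(state: WorkflowState, start: str = "research"):
    checkpoints = []
    node: Optional[str] = start
    while node is not None:
        state = NODES[node](state)
        checkpoints.append((node, state.attempts))   # snapshot after every node
        node = next_node(node, state)
    return state, checkpoints


if __name__ == "__main__":
    final, checkpoints = run_workflow(WorkflowState(task="add retry logic"))
    print(final.history)
    print(checkpoints)
```

The attempt cap in `next_node` is the important design choice: an agent loop without an explicit exit condition in its state can retry indefinitely.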

                            
Orchestration Frameworks Compared: LangGraph (explicit graph, maximum control, Python/JS), CrewAI (role-based agents with natural language task definitions, Python), AutoGen (conversational multi-agent with code execution, Python/.NET). Choose LangGraph for complex workflows needing precise control; CrewAI for rapid prototyping of role-based systems; AutoGen for research and experimentation.

Module 43: AI Reliability & Safety

AI systems introduce a novel class of failures: they don't crash — they confidently produce wrong answers. Traditional reliability engineering (circuit breakers, retries, health checks) is necessary but insufficient. AI reliability requires additional layers: guardrails for input/output validation, grounding for factual accuracy, and observability for quality monitoring.

Guardrails Architecture

Guardrails are validation layers that wrap AI model calls, ensuring inputs are safe and outputs are appropriate. They form a pipeline: pre-processing → model → post-processing, with each stage capable of blocking, modifying, or flagging content.

Guardrails Processing Flow

flowchart LR
    subgraph Pre["Pre-Processing Guards"]
        I[User Input] --> PII[PII Detection<br/>Redact SSN, emails]
        PII --> INJ[Injection Detection<br/>Prompt injection scan]
        INJ --> TOX[Toxicity Filter<br/>Hate speech, violence]
        TOX --> LEN[Length/Rate Limit<br/>Token budget check]
    end

    subgraph Model["Model Layer"]
        LEN -->|"Clean input"| LLM[LLM Inference]
        LLM --> RAW[Raw Output]
    end

    subgraph Post["Post-Processing Guards"]
        RAW --> FACT[Fact Verification<br/>RAG grounding check]
        FACT --> SAFE[Safety Classifier<br/>Harmful content scan]
        SAFE --> SCHEMA[Schema Validation<br/>JSON structure check]
        SCHEMA --> OUT[Validated Output]
    end

    subgraph Fallback["Fallback Paths"]
        INJ -->|"Blocked"| BLOCK[Block Response<br/>Log alert]
        FACT -->|"Ungrounded"| CAVEAT[Add Caveat<br/>Mark uncertain]
        SAFE -->|"Unsafe"| REFUSE[Refuse Response<br/>Escalate to human]
    end

"""
AI Guardrails Chain — Input/Output Validation Pipeline
Demonstrates pre-processing, model call, and post-processing guards
"""
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardResult:
    passed: bool
    message: str
    modified_content: Optional[str] = None
    severity: str = "info"  # info, warning, critical

class GuardrailsChain:
    """Pipeline of input/output validators wrapping an LLM call."""

    def __init__(self):
        self.input_guards = []
        self.output_guards = []

    def add_input_guard(self, guard_fn):
        self.input_guards.append(guard_fn)
        return self

    def add_output_guard(self, guard_fn):
        self.output_guards.append(guard_fn)
        return self

    async def run(self, user_input: str, model_fn) -> dict:
        # Phase 1: Pre-processing guards
        current_input = user_input
        for guard in self.input_guards:
            result = guard(current_input)
            if not result.passed:
                return {
                    "blocked": True,
                    "stage": "input",
                    "reason": result.message,
                    "severity": result.severity
                }
            if result.modified_content:
                current_input = result.modified_content

        # Phase 2: Model inference
        model_output = await model_fn(current_input)

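        # Phase 3: Post-processing guards
        current_output = model_output
        caveats = []
        for guard in self.output_guards:
            result = guard(current_output)
            if not result.passed and result.severity == "critical":
                return {
                    "blocked": True,
                    "stage": "output",
                    "reason": result.message
                }
            # Surface failed checks and passing-with-warning checks alike,
            # otherwise warnings from guards that return passed=True are lost
            if not result.passed or result.severity == "warning":
                caveats.append(result.message)
            if result.modified_content is not None:
                current_output = result.modified_content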

        return {
            "blocked": False,
            "output": current_output,
            "caveats": caveats,
            "input_modified": current_input != user_input
        }

# --- Guard implementations ---
def pii_detection_guard(text: str) -> GuardResult:
    """Detect and redact PII (SSN, email, phone)."""
    patterns = {
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        # [A-Za-z] for the TLD: a "|" inside a character class matches a literal pipe
        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    }
    redacted = text
    found_pii = []
    for pii_type, pattern in patterns.items():
        if re.search(pattern, text):
            found_pii.append(pii_type)
            redacted = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", redacted)

    if found_pii:
        return GuardResult(
            passed=True,  # Don't block — redact and continue
            message=f"PII detected and redacted: {found_pii}",
            modified_content=redacted,
            severity="warning"
        )
    return GuardResult(passed=True, message="No PII detected")

def prompt_injection_guard(text: str) -> GuardResult:
    """Detect common prompt injection patterns."""
    injection_patterns = [
        # One or more qualifiers, so "ignore all previous instructions" also matches
        r"ignore (?:(?:all|previous|above) )+instructions",
        r"you are now",
        r"system:\s*",
        r"</?(system|user|assistant)>",
        r"pretend you",
    ]
    for pattern in injection_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return GuardResult(
                passed=False,
                message="Potential prompt injection detected",
                severity="critical"
            )
    return GuardResult(passed=True, message="No injection detected")

def factuality_guard(output: str) -> GuardResult:
    """Heuristic stand-in for fact verification: flag outputs that contain hedging language."""
    uncertain_phrases = [
        "I'm not sure", "I think", "possibly",
        "I don't have information", "as of my knowledge"
    ]
    for phrase in uncertain_phrases:
        if phrase.lower() in output.lower():
            return GuardResult(
                passed=True,
                message="Output contains uncertainty markers — flag for review",
                severity="warning"
            )
    return GuardResult(passed=True, message="Output appears confident")

# --- Usage ---
chain = GuardrailsChain()
chain.add_input_guard(pii_detection_guard)
chain.add_input_guard(prompt_injection_guard)
chain.add_output_guard(factuality_guard)

print("Guardrails chain configured:")
print(f"  Input guards: {len(chain.input_guards)}")
print(f"  Output guards: {len(chain.output_guards)}")
print(f"  Pipeline: PII redaction → Injection detection → [Model] → Factuality check")

Hallucination Containment

Hallucination — the confident generation of incorrect information — is the defining reliability challenge of AI systems. Containment strategies operate at multiple levels:

1. Grounding via RAG (Retrieval-Augmented Generation): Constrain the model to answer only from retrieved context. The system prompt instructs: "Answer ONLY using the provided context. If the answer isn't in the context, say 'I don't have information about that.'"

2. Citation Verification: Require the model to cite specific passages from retrieved documents. Post-processing verifies that cited content actually exists in the source documents (string matching or semantic similarity check).

3. Confidence Calibration: Train models or add post-processing layers that estimate answer confidence. Low-confidence answers are flagged for human review or returned with explicit uncertainty markers.

4. Multi-Model Consensus: Send the same query to multiple models and compare outputs. If models disagree, flag the answer for human review. Expensive but highly effective for critical decisions (medical, legal, financial).
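Citation verification by string matching (strategy 2) can be sketched as below. It assumes the model was instructed to put cited passages in double quotes, and it checks only that the quoted text appears in a source after case and whitespace normalization; whether the source actually supports the claim is a separate, harder check. Function names are illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def extract_quotes(answer: str) -> list[str]:
    """Return the passages the model placed in double quotes."""
    return re.findall(r'"([^"]+)"', answer)


def verify_citations(answer: str, sources: list[str]) -> dict[str, bool]:
    """Map each quoted passage to whether it appears in any source document."""
    normalized_sources = [normalize(source) for source in sources]
    return {
        quote: any(normalize(quote) in source for source in normalized_sources)
        for quote in extract_quotes(answer)
    }


if __name__ == "__main__":
    sources = ["MIG divides a single GPU into up to 7 independent instances."]
    answer = 'The docs say "up to 7   independent instances" and also "up to 16 instances".'
    for quote, found in verify_citations(answer, sources).items():
        print(f"{'OK     ' if found else 'MISSING'} {quote!r}")
```

Any quote that comes back missing is a candidate for the caveat or refusal paths described earlier.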

                            
Hallucination is not a bug to fix; it is a property to manage. Language models will always have some probability of generating incorrect information. The architectural goal is containment: catch hallucinations before they reach users, and limit the harm of any that slip through. Design for graceful degradation: an "I don't know" is always safer than a confident wrong answer.

AI Observability

Traditional observability (metrics, logs, traces) is necessary but insufficient for AI systems. AI observability adds new dimensions: quality, cost, and behavioral drift. You need to know not just "is the system up?" but "is the system producing good outputs?"

Core AI observability signals:

Signal	What It Measures	Alert Threshold
Token usage	Input/output tokens per request	P95 > 2× baseline
Latency (TTFT)	Time to first token	P99 > 5s
Latency (TPS)	Tokens per second throughput	<30 TPS sustained
Quality score	LLM-as-judge evaluation (1-5)	Rolling avg < 3.5
Groundedness	% answers supported by retrieved context	<85% grounded
Guardrail triggers	% requests blocked by safety filters	>5% block rate
User feedback	Thumbs up/down ratio	<70% positive
Cost per query	$ per request (tokens × price)	>$0.10/query avg

A/B Testing for AI: AI A/B testing requires different methodology than traditional software. You can't just compare conversion rates — you must compare output quality. Approaches include: LLM-as-judge (have a powerful model rate outputs), human evaluation panels, and automated metric suites (BLEU, ROUGE for summarization; exact match for factual QA).

AI Observability Production Pattern

The "Quality Score Dashboard" Pattern

Run a lightweight LLM-as-judge evaluation on a random 5% sample of production queries. The judge model (GPT-4 or Claude) scores each response on a 1-5 scale across dimensions: relevance, accuracy, completeness, harmlessness. Dashboard shows rolling 7-day average with alerting on statistical drops. This provides continuous quality monitoring without slowing production requests. Cost: ~$0.01 per evaluation (cheap relative to the value of catching quality regressions).
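The sampling and alerting halves of this pattern can be sketched without the judge itself. Assumptions to flag: sampling is done by hashing a request ID (so the same request is always in or out of the sample), the rolling window is a fixed count of scores rather than 7 days, and the alert is a plain threshold on the mean instead of a statistical test. The 3.5 threshold matches the table above.

```python
import hashlib
from collections import deque


def in_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for judge evaluation."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


class RollingQuality:
    """Rolling average of judge scores (1-5) with a simple threshold alert."""

    def __init__(self, window: int, threshold: float = 3.5):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score: float) -> bool:
        """Record a score; return True once a full window averages below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.threshold


if __name__ == "__main__":
    sampled = sum(in_sample(f"req-{i}") for i in range(10_000))
    print(f"sampled {sampled} of 10000 requests")

    monitor = RollingQuality(window=4)
    for score in [5, 5, 4, 4, 2, 2]:
        print(score, "ALERT" if monitor.add(score) else "ok")
```

Evaluation runs outside the request path: the serving code only decides whether to enqueue a response for judging, so production latency is unaffected.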


Case Studies

OpenAI: Inference Scaling at Unprecedented Scale

Case Study Infrastructure

Serving GPT-4 to 200M+ Weekly Users

OpenAI's inference infrastructure serves hundreds of millions of users with sub-second latency. Key architectural decisions: (1) Custom inference kernels optimized for their specific model architectures (not generic frameworks). (2) Aggressive KV cache management — sharing cached prefixes across users with the same system prompt (saving 30-50% compute for repeated system prompts). (3) Speculative decoding with custom draft models trained specifically to predict the next tokens of their larger models (80%+ acceptance rate). (4) Geographic routing to colocate requests with the closest GPU cluster, with overflow routing to other regions during peaks. (5) Progressive rollout of model updates — new model versions serve 1% → 10% → 50% → 100% of traffic with quality regression checks at each stage.


Anthropic: Building the MCP Ecosystem

Case Study Protocol Design

MCP as the "USB for AI" — Universal Tool Integration

Anthropic designed MCP to solve the fragmentation problem: every AI application reinventing tool integrations. Key design decisions: (1) Transport agnostic: the protocol defines stdio (local processes) and HTTP-based streaming (SSE) transports, and custom transports such as WebSocket can be layered on. This allows MCP servers to run as local processes (fast, secure), remote services (shared), or serverless functions (scalable). (2) Capability negotiation: client and server declare capabilities at connection time, enabling graceful feature discovery. (3) Stateful sessions: unlike stateless REST APIs, MCP maintains session state for multi-turn tool interactions. (4) Security by default: servers declare their capabilities explicitly (no ambient authority), and hosts control which servers can access which resources. The ecosystem grew from 0 to 1000+ community servers in 6 months because the protocol is simple to implement (a basic server is ~50 lines of Python) and immediately useful (connects any AI model to any tool).


Conclusion & Key Takeaways

AI-native systems architecture represents the frontier of systems thinking. The fundamental challenge is designing for non-deterministic components — systems where the same input can produce different outputs, where failures manifest as quality degradation rather than crashes, and where the system's behavior evolves as models update.

GPU infrastructure is expensive — utilization is king. MIG partitioning, continuous batching, and speculative decoding exist because GPUs cost $30K+ each. Every optimization technique serves the same goal: maximize useful computation per GPU-second.
Vector databases enable semantic understanding at scale. The embed → index → query pipeline transforms AI from "pattern matching on text" to "understanding meaning." HNSW provides O(log n) similarity search across billions of vectors.
MCP standardizes the AI-tool interface. Like HTTP standardized web communication, MCP standardizes how AI models interact with external systems. Build MCP servers once, every AI application can use them.
Multi-agent systems need explicit orchestration. Autonomous agents sound appealing but require careful state management, failure handling, and human oversight. Start with supervisor patterns before attempting peer-to-peer.
Guardrails are architectural, not afterthoughts. Design the guardrails pipeline first, then build features within it. Input validation, output safety, and factuality checking are as critical as the model itself.
AI observability goes beyond uptime. You must continuously measure output quality, not just system health. LLM-as-judge evaluation, user feedback loops, and drift detection are essential for production AI systems.

Next in the Series

In Part 20: Labs & Intellectual Foundations, we'll bring everything together with hands-on labs from beginner to expert level, plus the intellectual foundations that underpin systems thinking — control theory, network theory, economics, and organizational design.
