
Part 18: Team Topologies & Governance

May 15, 2026 · Wasil Zafar · 26 min read

"The goal is not to build teams that follow a structure, but to evolve team structures that minimize cognitive load while maximizing flow." — Matthew Skelton & Manuel Pais. Team Topologies provides a vocabulary and framework for organizing teams to achieve fast flow of change.

Table of Contents

  1. Module 39: Team Topologies
  2. Module 40: Governance & Standardization
  3. Case Studies
  4. Conclusion & Next Steps

Module 39: Team Topologies

Cognitive Load as the Fundamental Constraint

Team Topologies (Skelton & Pais, 2019) starts from a key insight: cognitive load is the fundamental constraint on team effectiveness. A team can only handle so much complexity before productivity collapses. Borrowing from cognitive load theory, the framework distinguishes three types of cognitive load:

  • Intrinsic cognitive load: The inherent complexity of the problem domain itself. A payment processing team inherently deals with complex financial regulations, currency conversions, and fraud detection — this can't be eliminated.
  • Extraneous cognitive load: Complexity imposed by poor tooling, unclear processes, or unnecessary organizational friction. Deploying requires 15 manual steps, debugging requires access to 3 different logging systems, onboarding takes 6 weeks. This should be minimized.
  • Germane cognitive load: The "productive" complexity of learning new skills and building domain expertise. Reading a new codebase, understanding a new pattern, mastering a new tool. This should be managed — present but not overwhelming.

The Team Topologies Goal: Minimize extraneous load (via platforms, golden paths, good tooling), manage germane load (via enabling teams, learning time), and ensure intrinsic load fits within the cognitive capacity of a single team (~7-9 people). If a domain is too complex for one team, split it rather than adding people (Brooks's Law: adding people to a late project makes it later).
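
To make the decision rule above concrete, here is a toy sketch. The 0-10 scores, the `capacity` threshold of 15, and the names `TeamLoad` and `recommend` are all assumptions made for illustration; Team Topologies does not prescribe a numeric model.

```python
from dataclasses import dataclass


@dataclass
class TeamLoad:
    """Self-assessed load scores for one team (hypothetical 0-10 scale)."""
    name: str
    intrinsic: int   # complexity of the problem domain itself
    extraneous: int  # friction from tooling, process, organization
    germane: int     # effort spent learning and building expertise


def recommend(team: TeamLoad, capacity: int = 15) -> str:
    """Suggest the first lever to pull when a team is overloaded."""
    total = team.intrinsic + team.extraneous + team.germane
    if total <= capacity:
        return "ok"
    # Prefer removing friction when that alone brings the team under capacity.
    if team.extraneous > 0 and total - team.extraneous <= capacity:
        return "reduce extraneous load"
    # Otherwise the domain itself is too large for one team.
    return "split the domain"
```

The ordering mirrors the text: removing extraneous load is tried first, and splitting the domain is the fallback when the remaining load still does not fit.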

Four Fundamental Team Types

Team Topologies defines exactly four team types. Every team in your organization should be one of these — if a team doesn't fit, it likely has unclear responsibilities.

Four Team Types and Interaction Modes
flowchart TD
    subgraph StreamAligned["Stream-Aligned Teams"]
        SA1[Search Squad]
        SA2[Checkout Squad]
        SA3[Recommendations Squad]
    end

    subgraph Platform["Platform Team"]
        PT["Internal Developer Platform<br/>Self-service capabilities"]
    end

    subgraph Enabling["Enabling Team"]
        ET["Observability Enablers<br/>Help teams adopt new capabilities"]
    end

    subgraph Complicated["Complicated-Subsystem Team"]
        CS["ML Ranking Engine<br/>Deep specialist expertise"]
    end

    SA1 -->|"X-as-a-Service"| PT
    SA2 -->|"X-as-a-Service"| PT
    SA3 -->|"X-as-a-Service"| PT
    ET -->|"Facilitating"| SA1
    ET -->|"Facilitating"| SA2
    CS -->|"X-as-a-Service"| SA3
    SA1 -.->|"Collaboration<br/>(time-limited)"| SA3

Stream-Aligned Teams

The primary team type — aligned to a single stream of work (a product, a service, a user journey, or a business domain). Stream-aligned teams deliver value directly to users or customers.

Responsibilities:

  • Own a portion of the business domain end-to-end (design, build, test, deploy, operate)
  • Have full ownership of their services in production (you build it, you run it)
  • Minimize handoffs and dependencies on other teams
  • Deliver value independently on their own cadence

Anti-patterns:

  • "Component teams" that own a layer (frontend team, backend team) rather than a business capability
  • Teams that can't deploy without coordinating with 3+ other teams
  • Teams that own 50+ microservices (cognitive overload — split them)

Platform Teams

Platform teams provide self-service capabilities that reduce the cognitive load on stream-aligned teams. They build the "paved road" — the easy, fast, well-supported path for common tasks.

Responsibilities:

  • Provide self-service infrastructure (deployment pipelines, observability, databases, messaging)
  • Treat stream-aligned teams as customers (product mindset, not project mindset)
  • Reduce extraneous cognitive load (teams shouldn't need to understand Kubernetes internals to deploy)
  • Maintain and evolve the platform based on customer feedback

Anti-patterns:

  • Platform team as gatekeeper ("submit a ticket and wait 2 weeks")
  • Platform that requires deep expertise to use (defeats the purpose)
  • Platform that mandates specific implementation choices rather than providing abstractions

Enabling Teams

Enabling teams help stream-aligned teams acquire new capabilities. They are temporary consultants — they pair with a team, transfer knowledge, then move on. The goal is to make themselves unnecessary to that team.

Responsibilities:

  • Research and curate emerging practices (observability patterns, testing strategies, security practices)
  • Pair with stream-aligned teams to transfer knowledge (not do the work for them)
  • Create learning materials, workshops, and reference implementations
  • Detect and address capability gaps across the organization

Anti-patterns:

  • Enabling team becomes a permanent dependency ("we can't do X without them")
  • Enabling team does the work instead of teaching (creates bottleneck)
  • Enabling team that never moves on (should engage for weeks/months, not years)

Complicated-Subsystem Teams

Complicated-subsystem teams own components that require deep specialist expertise that most stream-aligned teams shouldn't need to develop. The subsystem's complexity would overwhelm a generalist team.

Examples:

  • ML/AI model training and serving infrastructure
  • Video codec and transcoding pipeline
  • Real-time bidding engine with sub-millisecond latency requirements
  • Cryptography and security protocol implementation

Key distinction: A complicated-subsystem team exposes a simplified API to stream-aligned teams. The stream-aligned team calls rankResults(query, candidates) — they don't need to understand gradient descent, feature engineering, or model serving infrastructure.
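
A minimal sketch of what such a simplified API could look like, assuming a Python client. The `overlap_score` function is a toy stand-in for whatever model the specialist team actually serves, and the `scorer` parameter is a hypothetical seam added for illustration; only the `rank_results(query, candidates)` shape comes from the text.

```python
from typing import Callable, Sequence

# Any callable that scores a (query, candidate) pair. In a real system this
# would wrap the specialist team's model; here it is an assumption.
Scorer = Callable[[str, str], float]


def overlap_score(query: str, candidate: str) -> float:
    """Toy scorer: fraction of query words that appear in the candidate."""
    words = query.lower().split()
    if not words:
        return 0.0
    text = candidate.lower().split()
    return sum(w in text for w in words) / len(words)


def rank_results(
    query: str,
    candidates: Sequence[str],
    scorer: Scorer = overlap_score,
) -> list[str]:
    """Public API: return candidates ordered from most to least relevant."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)
```

The stream-aligned team only sees `rank_results`. The subsystem team can replace the scorer with a trained model without changing any caller.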

Interaction Modes

Team Topologies defines exactly three interaction modes between teams:

1. Collaboration — Two teams work closely together on a shared goal. High bandwidth, high coordination cost.

  • When: Exploring new territory, discovering boundaries, building something novel together
  • Duration: Time-limited (weeks to a few months). Permanent collaboration indicates unclear ownership.
  • Example: Search squad + ML team collaborating to build a new ranking system before the ML team packages it as a service

2. X-as-a-Service — One team provides a capability as a service. Clear API, low coordination needed.

  • When: The providing team has a stable, well-understood capability
  • Duration: Ongoing (this is the target steady-state for most interactions)
  • Example: Platform team provides CI/CD as a service — stream-aligned teams push code and it deploys. No coordination needed.

3. Facilitating — One team helps another acquire new capabilities. Coaching relationship.

  • When: A stream-aligned team needs to adopt a new practice (observability, testing, security)
  • Duration: Temporary (until the coached team is self-sufficient)
  • Example: Enabling team facilitates adoption of distributed tracing — pairs for 3 sprints, then moves on

Interaction Mode Red Flags: If collaboration mode lasts more than a few months, boundaries are unclear, so split or merge the teams. If X-as-a-Service requires constant tickets and waiting, the platform isn't self-service enough. If facilitating never ends, the enabling team is doing the work instead of teaching it.
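
The red flags above could be checked mechanically if interactions are recorded somewhere. The sketch below assumes a hypothetical `Interaction` record; the 90-day time box and the ticket threshold of 5 are illustrative defaults, not values from Team Topologies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Interaction:
    teams: tuple[str, str]
    mode: str              # "collaboration", "x-as-a-service", or "facilitating"
    started: date
    open_tickets: int = 0  # requests the consuming team is waiting on


def red_flags(
    interaction: Interaction,
    today: date,
    max_days: int = 90,      # assumed time box for temporary modes
    max_tickets: int = 5,    # assumed tolerance for ticket-driven service
) -> list[str]:
    """Return short codes for each red flag the interaction shows."""
    age_days = (today - interaction.started).days
    flags = []
    if interaction.mode == "collaboration" and age_days > max_days:
        flags.append("collaboration-too-long")
    if interaction.mode == "facilitating" and age_days > max_days:
        flags.append("facilitating-never-ends")
    if interaction.mode == "x-as-a-service" and interaction.open_tickets > max_tickets:
        flags.append("service-not-self-service")
    return flags
```

X-as-a-Service has no age limit here because the text treats it as the ongoing steady state; only its ticket backlog is checked.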

Module 40: Governance & Standardization

Platform as a Product

The internal developer platform (IDP) should be treated as a product with real users (developers), a product roadmap, user research, and success metrics. Tools like Backstage (Spotify), Port, and Cortex provide service catalogs, golden paths, and developer portals.

Platform as a Product — Layers
flowchart TD
    subgraph DX["Developer Experience Layer"]
        Portal["Service Catalog / Portal<br/>Backstage, Port, Cortex"]
        Templates["Golden Path Templates<br/>Scaffolding, cookiecutters"]
        Docs["Self-Service Documentation<br/>How-tos, tutorials, examples"]
    end

    subgraph Capabilities["Platform Capabilities"]
        CICD["CI/CD Pipelines<br/>Build, test, deploy"]
        Observ["Observability Stack<br/>Metrics, traces, logs, dashboards"]
        Infra["Infrastructure<br/>Kubernetes, databases, messaging"]
        Security["Security & Compliance<br/>Scanning, secrets, policies"]
    end

    subgraph Foundation["Foundation"]
        Cloud["Cloud Infrastructure<br/>AWS, Azure, GCP"]
        Network["Network & Connectivity<br/>VPCs, service mesh, DNS"]
    end

    DX --> Capabilities --> Foundation

    style DX fill:#f0f8ff
    style Capabilities fill:#f0fff0
    style Foundation fill:#fff8f0

Golden paths (also called "paved roads") are the pre-built, well-supported default paths for common tasks. They don't restrict — they accelerate:

# golden-path-template.yaml — Service scaffolding template
# Used by: `platform create-service --template=rest-api`
apiVersion: platform.internal/v1
kind: GoldenPathTemplate
metadata:
  name: rest-api-service
  description: "Production-ready REST API with observability, CI/CD, and security"
  category: backend
  owner: platform-team
  tags: ["rest", "api", "production-ready"]

parameters:
  - name: service_name
    description: "Service name (lowercase, hyphens allowed)"
    type: string
    pattern: "^[a-z][a-z0-9-]{2,30}$"
  - name: team
    description: "Owning team name"
    type: string
  - name: language
    description: "Programming language"
    type: enum
    values: ["java", "python", "typescript", "go"]
    default: "go"
  - name: database
    description: "Database type (or 'none')"
    type: enum
    values: ["postgres", "mongodb", "redis", "none"]
    default: "none"

includes:
  # What you get out of the box:
  - ci_cd_pipeline: true           # GitHub Actions / ArgoCD
  - dockerfile: true               # Multi-stage optimized build
  - kubernetes_manifests: true     # Deployment, Service, HPA
  - observability:
      metrics: true                # Prometheus metrics endpoint
      tracing: true                # OpenTelemetry auto-instrumentation
      logging: true                # Structured JSON logging
      dashboards: true             # Grafana dashboard template
  - security:
      sast_scanning: true          # Static analysis in CI
      dependency_scanning: true    # Dependabot / Snyk
      secrets_management: true     # Vault integration
      network_policies: true       # K8s NetworkPolicy
  - testing:
      unit_test_framework: true
      integration_test_setup: true
      load_test_template: true
  - documentation:
      openapi_spec: true           # Swagger/OpenAPI 3.0
      runbook_template: true       # Incident response guide
      adr_template: true           # Architecture Decision Record

success_metrics:
  - "Time from 'platform create-service' to first production deploy < 1 hour"
  - "New engineer can make first commit within 2 hours of onboarding"
  - "> 80% of new services use golden path template"
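
A scaffolding command would need to validate the template's parameters before generating anything. The sketch below shows one way to do that in Python, using the `pattern` and `values` fields from the template above. The function name `validate_params` and the error messages are assumptions; the `platform create-service` CLI itself is not shown in the source.

```python
import re

# Constraints copied from the template's parameters section.
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]{2,30}$")
LANGUAGES = {"java", "python", "typescript", "go"}
DATABASES = {"postgres", "mongodb", "redis", "none"}


def validate_params(
    service_name: str,
    team: str,
    language: str = "go",     # template default
    database: str = "none",   # template default
) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    if not SERVICE_NAME.fullmatch(service_name):
        errors.append(f"invalid service_name: {service_name!r}")
    if not team.strip():
        errors.append("team must not be empty")
    if language not in LANGUAGES:
        errors.append(f"unsupported language: {language!r}")
    if database not in DATABASES:
        errors.append(f"unsupported database: {database!r}")
    return errors
```

Returning every error at once, rather than failing on the first, keeps the self-service experience fast: the developer fixes all inputs in one pass.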

Technology Radar

The Technology Radar (popularized by ThoughtWorks) is a governance tool that communicates technology recommendations across the organization. It uses four rings to classify technologies:

Technology Radar Rings
flowchart TD
    subgraph Adopt["ADOPT — Use by Default"]
        A1[Go for new services]
        A2[PostgreSQL for relational data]
        A3[Kafka for event streaming]
        A4[OpenTelemetry for observability]
    end

    subgraph Trial["TRIAL — Use with Caution"]
        T1[Rust for performance-critical paths]
        T2[CockroachDB for global distribution]
        T3[Temporal for workflow orchestration]
    end

    subgraph Assess["ASSESS — Explore & Evaluate"]
        AS1[WebAssembly for edge compute]
        AS2[eBPF for kernel-level observability]
        AS3[Spin for serverless WASM]
    end

    subgraph Hold["HOLD — Do Not Start New Work"]
        H1[Ruby on Rails - legacy only]
        H2[MongoDB - migrate to PostgreSQL]
        H3[Jenkins - migrate to GitHub Actions]
    end

    Adopt --> Trial --> Assess --> Hold
                            

Ring definitions:

  • Adopt: Proven technologies that should be the default choice. Teams don't need to justify using these — they're the "golden path."
  • Trial: Technologies that have shown promise and are ready for production use in limited contexts. Teams can use these but should share learnings.
  • Assess: Technologies worth exploring. Teams can prototype and evaluate, but not for production workloads yet.
  • Hold: Technologies that should not be used for new projects. Existing usage is maintained but not expanded. Migration paths should exist.

{
    "radar": {
        "name": "Engineering Technology Radar Q2 2026",
        "version": "2026.2",
        "last_updated": "2026-04-01",
        "review_cadence": "quarterly",
        "governance_team": "architecture-guild"
    },
    "entries": [
        {
            "name": "Go",
            "ring": "adopt",
            "quadrant": "languages-frameworks",
            "moved": "none",
            "description": "Default language for new backend services. Strong concurrency model, fast compilation, excellent observability tooling.",
            "rationale": "3 years in production, 40+ services, strong team expertise. Performance characteristics match our scale requirements.",
            "first_appeared": "2023-Q3",
            "adopted_date": "2024-Q2",
            "teams_using": ["search", "checkout", "recommendations", "platform"],
            "related_adrs": ["ADR-028", "ADR-035"]
        },
        {
            "name": "Temporal",
            "ring": "trial",
            "quadrant": "platforms",
            "moved": "up",
            "previous_ring": "assess",
            "description": "Workflow orchestration platform for long-running, reliable business processes.",
            "rationale": "Successful POC in order fulfillment (ADR-041). Reduced saga complexity by 60%. Moving to trial for 2 more use cases before adopt.",
            "first_appeared": "2025-Q1",
            "trial_criteria": {
                "success_metrics": ["< 5ms overhead per workflow step", "99.99% workflow completion rate"],
                "evaluation_period": "6 months",
                "pilot_teams": ["order-fulfillment", "payment-processing"]
            }
        },
        {
            "name": "Jenkins",
            "ring": "hold",
            "quadrant": "platforms",
            "moved": "none",
            "description": "Legacy CI/CD platform. All new pipelines must use GitHub Actions.",
            "rationale": "High maintenance burden, security vulnerabilities, poor developer experience. Migration to GitHub Actions reduces CI cost by 40%.",
            "hold_date": "2025-Q2",
            "migration_plan": {
                "target": "GitHub Actions",
                "deadline": "2026-Q4",
                "teams_remaining": 4,
                "migration_guide": "https://internal.docs/migrate-jenkins-to-gha"
            }
        }
    ]
}
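
A radar file like the one above is easier to keep honest with a small consistency check in CI. The rules in this sketch are inferred from the example entries (trial entries carry `trial_criteria`, hold entries carry `migration_plan`, moved entries carry `previous_ring`); they are assumptions rather than part of any radar standard, as is the `"down"` value for `moved`.

```python
import json

RINGS = {"adopt", "trial", "assess", "hold"}


def validate_entry(entry: dict) -> list[str]:
    """Check one radar entry against the assumed house rules."""
    problems = []
    ring = entry.get("ring")
    if ring not in RINGS:
        problems.append(f"unknown ring: {ring!r}")
    if ring == "trial" and "trial_criteria" not in entry:
        problems.append("trial entry needs trial_criteria")
    if ring == "hold" and "migration_plan" not in entry:
        problems.append("hold entry needs migration_plan")
    if entry.get("moved") in {"up", "down"} and "previous_ring" not in entry:
        problems.append("moved entry needs previous_ring")
    return problems


def validate_radar(text: str) -> dict[str, list[str]]:
    """Map entry name to its problems; entries without problems are omitted."""
    radar = json.loads(text)
    report = {}
    for entry in radar.get("entries", []):
        problems = validate_entry(entry)
        if problems:
            report[entry.get("name", "?")] = problems
    return report
```

An empty report means the radar is internally consistent. It says nothing about whether the ring placements themselves are right, which remains a human decision.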

Architecture Decision Records

Architecture Decision Records (ADRs) document significant architectural decisions, their context, and consequences. They serve as institutional memory — explaining why decisions were made, not just what was decided.

Lightweight ADRs vs Heavyweight Review Boards:

| Aspect | Lightweight ADR | Architecture Review Board |
|---|---|---|
| When | Most decisions (default) | Cross-cutting, high-impact, irreversible decisions |
| Process | Write ADR → PR review → merge | RFC → review meeting → formal approval |
| Turnaround | 1-3 days | 1-4 weeks |
| Scope | Within one team/service | Across multiple teams/org-wide |
| Examples | Choose database, add caching layer, adopt library | Change data platform, adopt new language org-wide, major API redesign |

ADR Best Practices: Store ADRs in the repository they affect (not a central wiki, which becomes stale). Number them sequentially (ADR-001, ADR-002). Never delete ADRs; mark superseded ones with a link to the replacement. Keep context rich so that a future reader can understand WHY without asking anyone.
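
Two of these practices, sequential numbering and superseding instead of deleting, are simple to automate. The sketch below assumes ADR files are named like `ADR-001-title.md` and contain a line starting with `Status:`; both conventions are illustrative, not mandated by any ADR format.

```python
import re

# Matches the numeric part of names such as "ADR-001-use-postgres.md".
ADR_FILE = re.compile(r"^ADR-(\d{3,})")


def next_adr_id(existing: list[str]) -> str:
    """Return the next sequential ADR id given existing file names."""
    numbers = []
    for name in existing:
        match = ADR_FILE.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return f"ADR-{max(numbers, default=0) + 1:03d}"


def supersede(adr_text: str, replacement_id: str) -> str:
    """Mark an ADR as superseded, keeping the rest of its content intact."""
    lines = adr_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("Status:"):
            lines[index] = f"Status: Superseded by {replacement_id}"
            break
    else:
        # No status line found: append one rather than losing the link.
        lines.append(f"Status: Superseded by {replacement_id}")
    return "\n".join(lines)
```

Because `supersede` only rewrites the status line, the original context and consequences stay readable for anyone tracing why the decision changed.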

Technology Lifecycle Management

Every technology in your stack has a lifecycle. Managing that lifecycle proactively prevents the "legacy crisis" where critical systems run on unsupported frameworks.

Introduction criteria (gates before a new technology enters the radar):

  • Does it solve a problem that existing "Adopt" technologies can't?
  • Is there a team willing to champion and maintain expertise?
  • Does it integrate with our observability and security stack?
  • What's the vendor/community health (funding, contributors, release cadence)?
  • What's the exit strategy if it doesn't work out?

Sunset criteria (triggers for moving technology to "Hold"):

  • Vendor end-of-life announcement or community stagnation
  • Security vulnerabilities without timely patches
  • Better alternative reaches "Adopt" status
  • Hiring difficulty (can't find engineers who know/want to use it)
  • Increasing operational burden relative to alternatives

Paved Roads vs Guardrails

Two philosophies for governance — effective organizations use both:

Paved Roads (encourage good patterns):

  • Make the right thing the easy thing (golden path templates)
  • Provide excellent defaults that teams adopt voluntarily
  • Invest in developer experience so the platform path is faster than DIY
  • Measure adoption rate as a product metric (if <70% use it, improve UX)

Guardrails (block dangerous patterns):

  • Automated policy enforcement in CI/CD (no hardcoded secrets, no public S3 buckets)
  • Network policies that prevent unauthorized service-to-service communication
  • Dependency scanning that blocks known vulnerable versions
  • Cost guardrails that alert on unexpected spend increases
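
As an illustration of the first guardrail, a CI step could scan changed files for hardcoded secrets. The three patterns below are a deliberately small, assumed rule set for demonstration; production scanners ship far more rules, and the function name `scan` is hypothetical.

```python
import re

# Illustrative patterns only, not a complete rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "password_assignment": re.compile(
        r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every suspected secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A pipeline step would fail the build when `scan` returns any findings, which is what makes this a guardrail rather than a recommendation.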

#!/bin/bash
# platform-health-dashboard.sh — Platform adoption and health metrics

echo "=== Platform Health Dashboard ==="
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# Golden path adoption
TOTAL_SERVICES=$(kubectl get deployments -A --no-headers 2>/dev/null | wc -l)
GOLDEN_PATH_SERVICES=$(kubectl get deployments -A \
    -l "platform.internal/golden-path=true" --no-headers 2>/dev/null | wc -l)

if [ "$TOTAL_SERVICES" -gt 0 ]; then
    ADOPTION_RATE=$(echo "scale=1; $GOLDEN_PATH_SERVICES * 100 / $TOTAL_SERVICES" | bc)
    echo "Golden Path Adoption: ${ADOPTION_RATE}% ($GOLDEN_PATH_SERVICES / $TOTAL_SERVICES services)"
else
    echo "Golden Path Adoption: N/A (no services found)"
fi

echo ""
echo "=== Service Catalog Coverage ==="
# Services with required metadata
CATALOGED=$(kubectl get deployments -A \
    -l "platform.internal/team,platform.internal/tier" --no-headers 2>/dev/null | wc -l)
echo "Cataloged Services: $CATALOGED / $TOTAL_SERVICES"

echo ""
echo "=== Observability Coverage ==="
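# Services labeled for Prometheus scraping. Note: prometheus.io/scrape is
# conventionally an annotation, which label selectors (-l) cannot match;
# this count assumes the platform also sets it as a label.
METRICS_ENABLED=$(kubectl get services -A \
    -l "prometheus.io/scrape=true" --no-headers 2>/dev/null | wc -l)
echo "Metrics Enabled: $METRICS_ENABLED services (of $TOTAL_SERVICES deployments)"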

echo ""
echo "=== Technology Radar Compliance ==="
# Check for "Hold" technologies in use
echo "Services using HOLD technologies:"
echo "  - Jenkins pipelines remaining: $(find . -name "Jenkinsfile" 2>/dev/null | wc -l)"
echo "  - MongoDB instances: $(kubectl get statefulsets -A \
    -l "app.kubernetes.io/name=mongodb" --no-headers 2>/dev/null | wc -l)"

echo ""
echo "=== Developer Experience ==="
echo "  - Avg time to first deploy (new service): query from platform metrics"
echo "  - Platform support ticket backlog: query from ticketing system"
echo "  - Self-service success rate: query from platform API logs"

# golden-path-config.yaml: template configuration for team scaffolding
apiVersion: platform.internal/v1
kind: TeamOnboarding
metadata:
  name: new-stream-aligned-team
  description: "Everything a new stream-aligned team gets on day one"

team_setup:
  # Communication
  slack_channels:
    - "{team-name}"
    - "{team-name}-alerts"
    - "{team-name}-deployments"
  
  # Access & Permissions
  github:
    org: "company"
    team_slug: "{team-name}"
    repositories:
      - name: "{team-name}-service"
        template: "rest-api-service"
        visibility: "internal"
  
  # Infrastructure
  kubernetes:
    namespace: "{team-name}"
    resource_quotas:
      cpu: "8"
      memory: "16Gi"
      pods: "50"
  
  # Observability
  grafana:
    folder: "{team-name}"
    dashboards:
      - "service-overview"
      - "slo-dashboard"
      - "on-call-dashboard"
  
  # On-call
  pagerduty:
    service: "{team-name}"
    escalation_policy: "standard-engineering"
    schedule: "follow-the-sun"

  # Documentation
  backstage:
    component:
      type: "service"
      owner: "{team-name}"
      lifecycle: "production"
      system: "{business-domain}"

Case Studies

Spotify: Platform Team Evolution

Tags: Platform, Developer Experience, Backstage

From Fragmentation to Platform

Context: By 2016, Spotify's "full autonomy" squad model had created fragmentation. Over 200 autonomous squads had chosen different tech stacks, built duplicate infrastructure, and created inconsistent developer experiences. Onboarding a new engineer took weeks because every team did things differently.

Problem signals:

  • 30+ different ways to deploy a service
  • No standard way to discover what services existed or who owned them
  • Duplicated effort across squads (every team built their own CI/CD)
  • New engineers spent 60% of first month learning tooling, not domain

Solution — Backstage (open-sourced 2020):

  • Service catalog: Single registry of all services, their owners, documentation, and dependencies
  • Software templates: Golden paths for creating new services with all infrastructure pre-configured
  • TechDocs: Documentation lives alongside code, rendered in the portal
  • Plugin ecosystem: Teams can extend the portal for their specific needs

Results:

  • Time to first deploy for new services: from 2 weeks to <1 hour
  • Time to productivity for new engineers: from 60 days to ~20 days
  • Service discoverability: from "ask on Slack" to instant catalog search
  • Golden path adoption: 87% of new services use templates

Key lesson: Autonomy without a platform creates fragmentation. A platform without autonomy creates bottlenecks. The balance: the platform team provides golden paths (self-service, optional but compelling), and stream-aligned teams can deviate when justified (documented in an ADR).

ThoughtWorks Technology Radar Process

Tags: Governance, Technology Radar, Decision Framework

Quarterly Technology Governance at Scale

Context: ThoughtWorks (a global consultancy with ~10,000 technologists) has published its Technology Radar publicly since 2010. Internally, it uses the same framework to govern technology choices across projects.

Process:

  1. Collection (Month 1-2): Technologists across the organization submit "blips" — technology observations from real project experience. Each blip needs: name, proposed ring, rationale from production experience.
  2. Review (Month 2, 2-day session): ~20 senior technologists meet to debate each blip. Every ring placement requires consensus. Disagreements are discussed until resolved or the blip is deferred.
  3. Publication (Month 3): Radar is published with detailed write-ups explaining each movement. "Moved up" and "moved down" are highlighted for visibility.
  4. Application: Project teams use the radar when making technology choices. "Adopt" = default. "Trial" = needs tech lead approval. "Hold" = needs architect approval + migration plan.

Why it works:

  • Evidence-based: Every recommendation comes from real project experience, not theoretical evaluation
  • Living document: Quarterly updates keep it current (unlike annual policies that become stale)
  • Visible governance: Everyone can see what's recommended and why — no "shadow decisions"
  • Balanced autonomy: Teams can still choose "Trial" or "Assess" technologies — they just need to be explicit about the risk

Adaptation for your organization: Start with 2-3 quadrants (Languages, Platforms, Tools). Review biannually if quarterly is too frequent. Require at least 2 teams' production experience before moving anything to "Adopt." Create a public channel where anyone can propose blips — governance should be participatory, not top-down.

Conclusion & Next Steps

The key takeaways from this part:

  • Cognitive load is the constraint. Teams have finite capacity for complexity. Team Topologies minimizes extraneous load (via platforms), manages germane load (via enabling teams), and ensures intrinsic load fits within team capacity. If a team is struggling, the first question is "what load can we remove?" not "how do we add more people?"
  • Four team types, three interaction modes. Stream-aligned (deliver value), platform (reduce load), enabling (transfer capability), complicated-subsystem (encapsulate expertise). Interactions are collaboration (time-limited), X-as-a-Service (steady-state goal), or facilitating (temporary coaching). If your org has other team types, the boundaries are likely unclear.
  • Platform is a product. Treat internal developers as customers. Measure adoption rate, time-to-first-deploy, and developer satisfaction. If your platform requires training to use, it's not a platform — it's another system to maintain.
  • Technology radar enables disciplined innovation. Four rings (Adopt, Trial, Assess, Hold) communicate recommendations clearly. Evidence-based governance (from production experience) trumps theoretical evaluation. Review quarterly to stay current.
  • ADRs are institutional memory. Store decisions alongside code. Explain context and consequences, not just the decision itself. Never delete — supersede. A future engineer should understand why without asking anyone currently employed.
  • Paved roads AND guardrails. Make the right thing easy (golden paths, templates, self-service) AND make the wrong thing hard (automated policy enforcement, network policies, dependency scanning). Neither alone is sufficient — together they create a system where the fast path is also the safe path.

Next in the Series

In Part 19: AI-Native Systems, we'll explore the unique architectural challenges of systems with AI at their core — ML pipelines, model serving patterns, feedback loops, probabilistic behavior, and how traditional architecture principles adapt when your components are non-deterministic.