Module 39: Team Topologies
Cognitive Load as the Fundamental Constraint
Team Topologies (Skelton & Pais, 2019) starts from a key insight: cognitive load is the fundamental constraint on team effectiveness. A team can only handle so much complexity before productivity collapses. Borrowing from John Sweller's cognitive load theory, the framework distinguishes three types of cognitive load:
- Intrinsic cognitive load: The inherent complexity of the problem domain itself. A payment processing team inherently deals with complex financial regulations, currency conversions, and fraud detection — this can't be eliminated.
- Extraneous cognitive load: Complexity imposed by poor tooling, unclear processes, or unnecessary organizational friction. Typical symptoms: deploying requires 15 manual steps, debugging requires access to 3 different logging systems, onboarding takes 6 weeks. This should be minimized.
- Germane cognitive load: The "productive" complexity of learning new skills and building domain expertise, such as reading a new codebase, understanding a new pattern, or mastering a new tool. This should be managed: present but not overwhelming.
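One way to make the three load types concrete is to score them per team and ask which load to reduce first. The sketch below is illustrative only: the `TeamLoad` class, the 0-10 scale, and the capacity threshold of 20 are assumptions made for this example, not part of the Team Topologies framework.

```python
from dataclasses import dataclass


@dataclass
class TeamLoad:
    """Self-assessed load scores on a hypothetical 0-10 scale."""
    intrinsic: int    # inherent domain complexity (cannot be eliminated)
    extraneous: int   # friction from tooling and process (minimize)
    germane: int      # learning and expertise building (manage)

    def total(self) -> int:
        return self.intrinsic + self.extraneous + self.germane


def recommend(load: TeamLoad, capacity: int = 20) -> str:
    """Suggest which load to reduce first when a team exceeds capacity."""
    if load.total() <= capacity:
        return "within capacity"
    # Extraneous load is pure friction, so it is the first candidate for removal.
    if load.extraneous > 0:
        return "reduce extraneous load"
    # Nothing left to strip away: the domain itself is too large for one team.
    return "split the domain"


payments = TeamLoad(intrinsic=9, extraneous=8, germane=5)
print(recommend(payments))
```

The ordering of the checks encodes the priority from the list above: remove friction before touching the domain boundaries.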
Four Fundamental Team Types
Team Topologies defines exactly four team types. Every team in your organization should be one of these — if a team doesn't fit, it likely has unclear responsibilities.
flowchart TD
subgraph StreamAligned["Stream-Aligned Teams"]
SA1[Search Squad]
SA2[Checkout Squad]
SA3[Recommendations Squad]
end
subgraph Platform["Platform Team"]
PT["Internal Developer Platform<br/>Self-service capabilities"]
end
subgraph Enabling["Enabling Team"]
ET["Observability Enablers<br/>Help teams adopt new capabilities"]
end
subgraph Complicated["Complicated-Subsystem Team"]
CS["ML Ranking Engine<br/>Deep specialist expertise"]
end
SA1 -->|"X-as-a-Service"| PT
SA2 -->|"X-as-a-Service"| PT
SA3 -->|"X-as-a-Service"| PT
ET -->|"Facilitating"| SA1
ET -->|"Facilitating"| SA2
CS -->|"X-as-a-Service"| SA3
SA1 -.->|"Collaboration<br/>(time-limited)"| SA3
Stream-Aligned Teams
The primary team type: each stream-aligned team is aligned to a single stream of work (a product, a service, a user journey, or a business domain) and delivers value directly to users or customers.
Responsibilities:
- Own a portion of the business domain end-to-end (design, build, test, deploy, operate)
- Have full ownership of their services in production (you build it, you run it)
- Minimize handoffs and dependencies on other teams
- Deliver value independently on their own cadence
Anti-patterns:
- "Component teams" that own a layer (frontend team, backend team) rather than a business capability
- Teams that can't deploy without coordinating with 3+ other teams
- Teams that own 50+ microservices (cognitive overload — split them)
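The anti-patterns above can be turned into a simple self-check. This is a minimal sketch: the `Team` fields and the `anti_patterns` helper are hypothetical names, and the thresholds (3+ coordinating teams, 50+ microservices) are taken directly from the list.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    owns_layer_only: bool        # e.g. a "frontend team" or "backend team"
    deploy_dependencies: int     # teams that must coordinate for a deploy
    microservices_owned: int


def anti_patterns(team: Team) -> list[str]:
    """Return the stream-aligned anti-patterns a team currently exhibits."""
    findings = []
    if team.owns_layer_only:
        findings.append("component team: owns a layer, not a business capability")
    if team.deploy_dependencies >= 3:
        findings.append("cannot deploy without coordinating with 3+ teams")
    if team.microservices_owned >= 50:
        findings.append("owns 50+ microservices: consider splitting")
    return findings


print(anti_patterns(Team("backend", True, 4, 60)))
```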
Platform Teams
Platform teams provide self-service capabilities that reduce the cognitive load on stream-aligned teams. They build the "paved road" — the easy, fast, well-supported path for common tasks.
Responsibilities:
- Provide self-service infrastructure (deployment pipelines, observability, databases, messaging)
- Treat stream-aligned teams as customers (product mindset, not project mindset)
- Reduce extraneous cognitive load (teams shouldn't need to understand Kubernetes internals to deploy)
- Maintain and evolve the platform based on customer feedback
Anti-patterns:
- Platform team as gatekeeper ("submit a ticket and wait 2 weeks")
- Platform that requires deep expertise to use (defeats the purpose)
- Platform that mandates specific implementation choices rather than providing abstractions
Enabling Teams
Enabling teams help stream-aligned teams acquire new capabilities. They are temporary consultants — they pair with a team, transfer knowledge, then move on. The goal is to make themselves unnecessary to that team.
Responsibilities:
- Research and curate emerging practices (observability patterns, testing strategies, security practices)
- Pair with stream-aligned teams to transfer knowledge (not do the work for them)
- Create learning materials, workshops, and reference implementations
- Detect and address capability gaps across the organization
Anti-patterns:
- Enabling team becomes a permanent dependency ("we can't do X without them")
- Enabling team does the work instead of teaching (creates bottleneck)
- Enabling team that never moves on (should engage for weeks/months, not years)
Complicated-Subsystem Teams
Complicated-subsystem teams own components requiring deep specialist expertise that most stream-aligned teams shouldn't need to develop. Without a dedicated team, the subsystem's complexity would overwhelm a generalist team.
Examples:
- ML/AI model training and serving infrastructure
- Video codec and transcoding pipeline
- Real-time bidding engine with sub-millisecond latency requirements
- Cryptography and security protocol implementation
Key distinction: A complicated-subsystem team exposes a simplified API to stream-aligned teams. The stream-aligned team calls rankResults(query, candidates) — they don't need to understand gradient descent, feature engineering, or model serving infrastructure.
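To illustrate the simplified-API idea, here is a minimal sketch of a facade in the spirit of `rankResults(query, candidates)`. The `Ranker` protocol and the `TermOverlapRanker` stand-in are assumptions for this example; a real ranking engine would hide a trained model behind the same single method.

```python
from typing import Protocol, Sequence


class Ranker(Protocol):
    """The only surface a stream-aligned team needs to know about."""
    def rank_results(self, query: str, candidates: Sequence[str]) -> list[str]: ...


class TermOverlapRanker:
    """Stand-in for the ML engine: ranks candidates by shared query terms."""
    def rank_results(self, query: str, candidates: Sequence[str]) -> list[str]:
        terms = set(query.lower().split())

        def score(doc: str) -> int:
            return len(terms & set(doc.lower().split()))

        # sorted() is stable, so ties keep their original order.
        return sorted(candidates, key=score, reverse=True)


def search(ranker: Ranker, query: str, candidates: Sequence[str]) -> list[str]:
    # Stream-aligned code depends only on the interface, not on the model.
    return ranker.rank_results(query, candidates)


print(search(TermOverlapRanker(), "red shoes",
             ["blue hat", "red running shoes", "red scarf"]))
```

Because `search` depends only on the protocol, the subsystem team can replace the implementation without any change in the calling team's code.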
Interaction Modes
Team Topologies defines exactly three interaction modes between teams:
1. Collaboration — Two teams work closely together on a shared goal. High bandwidth, high coordination cost.
- When: Exploring new territory, discovering boundaries, building something novel together
- Duration: Time-limited (weeks to a few months). Permanent collaboration indicates unclear ownership.
- Example: Search squad + ML team collaborating to build a new ranking system before the ML team packages it as a service
2. X-as-a-Service — One team provides a capability as a service. Clear API, low coordination needed.
- When: The providing team has a stable, well-understood capability
- Duration: Ongoing (this is the target steady-state for most interactions)
- Example: Platform team provides CI/CD as a service — stream-aligned teams push code and it deploys. No coordination needed.
3. Facilitating — One team helps another acquire new capabilities. Coaching relationship.
- When: A stream-aligned team needs to adopt a new practice (observability, testing, security)
- Duration: Temporary (until the coached team is self-sufficient)
- Example: Enabling team facilitates adoption of distributed tracing — pairs for 3 sprints, then moves on
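The duration guidance for each mode can be tracked explicitly. The sketch below models the three modes and flags interactions that have outlived their expected lifetime; the day limits in `MAX_DAYS` are hypothetical values chosen to approximate "weeks to a few months", not numbers from the framework.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Mode(Enum):
    COLLABORATION = "collaboration"
    X_AS_A_SERVICE = "x-as-a-service"
    FACILITATING = "facilitating"


@dataclass
class Interaction:
    team_a: str
    team_b: str
    mode: Mode
    started: date


# Hypothetical review thresholds in days. X-as-a-Service is ongoing, so it has none.
MAX_DAYS = {Mode.COLLABORATION: 90, Mode.FACILITATING: 120}


def needs_review(interaction: Interaction, today: date) -> bool:
    """True when a time-limited interaction has run past its threshold."""
    limit = MAX_DAYS.get(interaction.mode)
    if limit is None:
        return False
    return (today - interaction.started).days > limit


collab = Interaction("search", "ml", Mode.COLLABORATION, date(2025, 1, 1))
print(needs_review(collab, date(2025, 6, 1)))
```

A flagged collaboration is a prompt to ask whether ownership is unclear or whether the capability is ready to be packaged as a service.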
Module 40: Governance & Standardization
Platform as a Product
The internal developer platform (IDP) should be treated as a product with real users (developers), a product roadmap, user research, and success metrics. Tools like Backstage (Spotify), Port, and Cortex provide service catalogs, golden paths, and developer portals.
flowchart TD
subgraph DX["Developer Experience Layer"]
Portal["Service Catalog / Portal<br/>Backstage, Port, Cortex"]
Templates["Golden Path Templates<br/>Scaffolding, cookiecutters"]
Docs["Self-Service Documentation<br/>How-tos, tutorials, examples"]
end
subgraph Capabilities["Platform Capabilities"]
CICD["CI/CD Pipelines<br/>Build, test, deploy"]
Observ["Observability Stack<br/>Metrics, traces, logs, dashboards"]
Infra["Infrastructure<br/>Kubernetes, databases, messaging"]
Security["Security & Compliance<br/>Scanning, secrets, policies"]
end
subgraph Foundation["Foundation"]
Cloud["Cloud Infrastructure<br/>AWS, Azure, GCP"]
Network["Network & Connectivity<br/>VPCs, service mesh, DNS"]
end
DX --> Capabilities --> Foundation
style DX fill:#f0f8ff
style Capabilities fill:#f0fff0
style Foundation fill:#fff8f0
Golden paths (also called "paved roads") are the pre-built, well-supported default paths for common tasks. They don't restrict — they accelerate:
# golden-path-template.yaml: Service scaffolding template
# Used by: `platform create-service --template=rest-api-service`
apiVersion: platform.internal/v1
kind: GoldenPathTemplate
metadata:
  name: rest-api-service
  description: "Production-ready REST API with observability, CI/CD, and security"
  category: backend
  owner: platform-team
  tags: ["rest", "api", "production-ready"]
parameters:
  - name: service_name
    description: "Service name (lowercase, hyphens allowed)"
    type: string
    pattern: "^[a-z][a-z0-9-]{2,30}$"
  - name: team
    description: "Owning team name"
    type: string
  - name: language
    description: "Programming language"
    type: enum
    values: ["java", "python", "typescript", "go"]
    default: "go"
  - name: database
    description: "Database type (or 'none')"
    type: enum
    values: ["postgres", "mongodb", "redis", "none"]
    default: "none"
includes:
  # What you get out of the box:
  - ci_cd_pipeline: true          # GitHub Actions / ArgoCD
  - dockerfile: true              # Multi-stage optimized build
  - kubernetes_manifests: true    # Deployment, Service, HPA
  - observability:
      metrics: true               # Prometheus metrics endpoint
      tracing: true               # OpenTelemetry auto-instrumentation
      logging: true               # Structured JSON logging
      dashboards: true            # Grafana dashboard template
  - security:
      sast_scanning: true         # Static analysis in CI
      dependency_scanning: true   # Dependabot / Snyk
      secrets_management: true    # Vault integration
      network_policies: true      # K8s NetworkPolicy
  - testing:
      unit_test_framework: true
      integration_test_setup: true
      load_test_template: true
  - documentation:
      openapi_spec: true          # Swagger/OpenAPI 3.0
      runbook_template: true      # Incident response guide
      adr_template: true          # Architecture Decision Record
success_metrics:
  - "Time from 'platform create-service' to first production deploy < 1 hour"
  - "New engineer can make first commit within 2 hours of onboarding"
  - "> 80% of new services use golden path template"
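A scaffolding tool would need to validate user input against the `parameters` section before generating anything. The following is a minimal sketch of that check, assuming a hypothetical `validate_params` helper; the pattern and the allowed enum values are copied from the template above, while the error messages are invented for illustration.

```python
import re

SERVICE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,30}$")
LANGUAGES = {"java", "python", "typescript", "go"}
DATABASES = {"postgres", "mongodb", "redis", "none"}


def validate_params(service_name: str, team: str,
                    language: str = "go", database: str = "none") -> list[str]:
    """Return a list of validation errors; an empty list means the input is valid."""
    errors = []
    if not SERVICE_NAME_PATTERN.fullmatch(service_name):
        errors.append("service_name must match ^[a-z][a-z0-9-]{2,30}$")
    if not team:
        errors.append("team is required")
    if language not in LANGUAGES:
        errors.append(f"language must be one of {sorted(LANGUAGES)}")
    if database not in DATABASES:
        errors.append(f"database must be one of {sorted(DATABASES)}")
    return errors


print(validate_params("checkout-api", "checkout"))
print(validate_params("Checkout", "checkout", language="rust"))
```

Returning all errors at once, rather than failing on the first, keeps the self-service experience fast: the developer fixes everything in one pass.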
Technology Radar
The Technology Radar (popularized by ThoughtWorks) is a governance tool that communicates technology recommendations across the organization. It uses four rings to classify technologies:
flowchart TD
subgraph Adopt["ADOPT — Use by Default"]
A1[Go for new services]
A2[PostgreSQL for relational data]
A3[Kafka for event streaming]
A4[OpenTelemetry for observability]
end
subgraph Trial["TRIAL — Use with Caution"]
T1[Rust for performance-critical paths]
T2[CockroachDB for global distribution]
T3[Temporal for workflow orchestration]
end
subgraph Assess["ASSESS — Explore & Evaluate"]
AS1[WebAssembly for edge compute]
AS2[eBPF for kernel-level observability]
AS3[Spin for serverless WASM]
end
subgraph Hold["HOLD — Do Not Start New Work"]
H1[Ruby on Rails - legacy only]
H2[MongoDB - migrate to PostgreSQL]
H3[Jenkins - migrate to GitHub Actions]
end
Adopt --> Trial --> Assess --> Hold
Ring definitions:
- Adopt: Proven technologies that should be the default choice. Teams don't need to justify using these — they're the "golden path."
- Trial: Technologies that have shown promise and are ready for production use in limited contexts. Teams can use these but should share learnings.
- Assess: Technologies worth exploring. Teams can prototype and evaluate them, but should not yet run production workloads on them.
- Hold: Technologies that should not be used for new projects. Existing usage is maintained but not expanded. Migration paths should exist.
{
"radar": {
"name": "Engineering Technology Radar Q2 2026",
"version": "2026.2",
"last_updated": "2026-04-01",
"review_cadence": "quarterly",
"governance_team": "architecture-guild"
},
"entries": [
{
"name": "Go",
"ring": "adopt",
"quadrant": "languages-frameworks",
"moved": "none",
"description": "Default language for new backend services. Strong concurrency model, fast compilation, excellent observability tooling.",
"rationale": "3 years in production, 40+ services, strong team expertise. Performance characteristics match our scale requirements.",
"first_appeared": "2023-Q3",
"adopted_date": "2024-Q2",
"teams_using": ["search", "checkout", "recommendations", "platform"],
"related_adrs": ["ADR-028", "ADR-035"]
},
{
"name": "Temporal",
"ring": "trial",
"quadrant": "platforms",
"moved": "up",
"previous_ring": "assess",
"description": "Workflow orchestration platform for long-running, reliable business processes.",
"rationale": "Successful POC in order fulfillment (ADR-041). Reduced saga complexity by 60%. Moving to trial for 2 more use cases before adopt.",
"first_appeared": "2025-Q1",
"trial_criteria": {
"success_metrics": ["< 5ms overhead per workflow step", "99.99% workflow completion rate"],
"evaluation_period": "6 months",
"pilot_teams": ["order-fulfillment", "payment-processing"]
}
},
{
"name": "Jenkins",
"ring": "hold",
"quadrant": "platforms",
"moved": "none",
"description": "Legacy CI/CD platform. All new pipelines must use GitHub Actions.",
"rationale": "High maintenance burden, security vulnerabilities, poor developer experience. Migration to GitHub Actions reduces CI cost by 40%.",
"hold_date": "2025-Q2",
"migration_plan": {
"target": "GitHub Actions",
"deadline": "2026-Q4",
"teams_remaining": 4,
"migration_guide": "https://internal.docs/migrate-jenkins-to-gha"
}
}
]
}
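A radar file like this is easy to lint before publication. The sketch below checks each entry against a few consistency rules; the rules themselves (trial entries carry `trial_criteria`, hold entries carry `migration_plan`, moved entries carry `previous_ring`) are assumptions inferred from the sample data, not a published schema.

```python
RINGS = {"adopt", "trial", "assess", "hold"}
# Hypothetical house rule: some rings must carry extra planning fields.
REQUIRED_BY_RING = {"trial": "trial_criteria", "hold": "migration_plan"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one radar entry."""
    problems = []
    ring = entry.get("ring")
    if ring not in RINGS:
        problems.append(f"unknown ring: {ring!r}")
    required = REQUIRED_BY_RING.get(ring)
    if required and required not in entry:
        problems.append(f"{ring} entries need '{required}'")
    if entry.get("moved") in {"up", "down"} and "previous_ring" not in entry:
        problems.append("moved entries need 'previous_ring'")
    return problems


jenkins = {"name": "Jenkins", "ring": "hold", "moved": "none"}
print(validate_entry(jenkins))
```

Running such a check in the pull request that updates the radar keeps the governance data itself under the same review discipline as code.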
Architecture Decision Records
Architecture Decision Records (ADRs) document significant architectural decisions, their context, and consequences. They serve as institutional memory — explaining why decisions were made, not just what was decided.
Lightweight ADRs vs Heavyweight Review Boards:
| Aspect | Lightweight ADR | Architecture Review Board |
|---|---|---|
| When | Most decisions (default) | Cross-cutting, high-impact, irreversible decisions |
| Process | Write ADR → PR review → merge | RFC → review meeting → formal approval |
| Turnaround | 1-3 days | 1-4 weeks |
| Scope | Within one team/service | Across multiple teams/org-wide |
| Examples | Choose database, add caching layer, adopt library | Change data platform, adopt new language org-wide, major API redesign |
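The table's routing logic can be written down as a small decision function. This is a sketch under stated assumptions: the `Decision` fields and the rule "more than one team affected, or irreversible, goes to the board" are a simplification of the "Scope" and "When" rows, not a formal policy.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    title: str
    teams_affected: int
    reversible: bool


def review_path(decision: Decision) -> str:
    """Default to a lightweight ADR; escalate cross-team or irreversible decisions."""
    if decision.teams_affected > 1 or not decision.reversible:
        return "architecture review board"
    return "lightweight ADR"


print(review_path(Decision("Add caching layer", teams_affected=1, reversible=True)))
print(review_path(Decision("Change data platform", teams_affected=12, reversible=False)))
```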
Technology Lifecycle Management
Every technology in your stack has a lifecycle. Managing that lifecycle proactively prevents the "legacy crisis" where critical systems run on unsupported frameworks.
Introduction criteria (gates before a new technology enters the radar):
- Does it solve a problem that existing "Adopt" technologies can't?
- Is there a team willing to champion and maintain expertise?
- Does it integrate with our observability and security stack?
- What's the vendor/community health (funding, contributors, release cadence)?
- What's the exit strategy if it doesn't work out?
Sunset criteria (triggers for moving technology to "Hold"):
- Vendor end-of-life announcement or community stagnation
- Security vulnerabilities without timely patches
- Better alternative reaches "Adopt" status
- Hiring difficulty (can't find engineers who know/want to use it)
- Increasing operational burden relative to alternatives
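The sunset criteria lend themselves to a periodic checklist. In this minimal sketch, each criterion becomes a boolean field on a hypothetical `TechHealth` record; the rule that a single trigger is enough to open a "move to Hold" discussion is an assumption, and an organization might well require several.

```python
from dataclasses import dataclass


@dataclass
class TechHealth:
    name: str
    vendor_eol_announced: bool = False
    unpatched_vulnerabilities: bool = False
    better_alternative_adopted: bool = False
    hard_to_hire_for: bool = False
    rising_operational_burden: bool = False


def sunset_triggers(tech: TechHealth) -> list[str]:
    """Return the names of all sunset criteria the technology currently meets."""
    return [field for field, value in vars(tech).items() if value is True]


def should_hold(tech: TechHealth) -> bool:
    # Hypothetical rule: any single trigger opens a "move to Hold" discussion.
    return bool(sunset_triggers(tech))


jenkins = TechHealth("Jenkins", unpatched_vulnerabilities=True,
                     rising_operational_burden=True)
print(sunset_triggers(jenkins))
```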
Paved Roads vs Guardrails
Two philosophies for governance — effective organizations use both:
Paved Roads (encourage good patterns):
- Make the right thing the easy thing (golden path templates)
- Provide excellent defaults that teams adopt voluntarily
- Invest in developer experience so the platform path is faster than DIY
- Measure adoption rate as a product metric (if <70% use it, improve UX)
Guardrails (block dangerous patterns):
- Automated policy enforcement in CI/CD (no hardcoded secrets, no public S3 buckets)
- Network policies that prevent unauthorized service-to-service communication
- Dependency scanning that blocks known vulnerable versions
- Cost guardrails that alert on unexpected spend increases
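As an illustration of an automated guardrail, the sketch below scans text for hardcoded secrets, the first example in the list. The three patterns are deliberately small and illustrative; the function names are hypothetical, and a production pipeline would rely on a dedicated scanner with a much larger rule set.

```python
import re

# Deliberately small, illustrative patterns; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "password_literal": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every suspicious line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, rule))
    return hits


def guardrail_passes(text: str) -> bool:
    # A CI step would fail the build when this returns False.
    return not find_secrets(text)


print(find_secrets('user = "bob"\npassword = "hunter2"\n'))
```

Unlike a paved road, this check is not optional: it blocks the merge, which is what distinguishes a guardrail from a recommendation.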
#!/bin/bash
# platform-health-dashboard.sh — Platform adoption and health metrics
echo "=== Platform Health Dashboard ==="
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""
# Golden path adoption
TOTAL_SERVICES=$(kubectl get deployments -A --no-headers 2>/dev/null | wc -l)
GOLDEN_PATH_SERVICES=$(kubectl get deployments -A \
-l "platform.internal/golden-path=true" --no-headers 2>/dev/null | wc -l)
if [ "$TOTAL_SERVICES" -gt 0 ]; then
ADOPTION_RATE=$(echo "scale=1; $GOLDEN_PATH_SERVICES * 100 / $TOTAL_SERVICES" | bc)
echo "Golden Path Adoption: ${ADOPTION_RATE}% ($GOLDEN_PATH_SERVICES / $TOTAL_SERVICES services)"
else
echo "Golden Path Adoption: N/A (no services found)"
fi
echo ""
echo "=== Service Catalog Coverage ==="
# Services with required metadata
CATALOGED=$(kubectl get deployments -A \
-l "platform.internal/team,platform.internal/tier" --no-headers 2>/dev/null | wc -l)
echo "Cataloged Services: $CATALOGED / $TOTAL_SERVICES"
echo ""
echo "=== Observability Coverage ==="
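# Services annotated for Prometheus scraping. prometheus.io/scrape is
# conventionally an annotation, which label selectors (-l) cannot match,
# so filter the JSON output instead (requires jq).
METRICS_ENABLED=$(kubectl get services -A -o json 2>/dev/null \
  | jq '[.items[] | select(.metadata.annotations["prometheus.io/scrape"] == "true")] | length')
echo "Metrics Enabled: ${METRICS_ENABLED:-0} / $TOTAL_SERVICES"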
echo ""
echo "=== Technology Radar Compliance ==="
# Check for "Hold" technologies in use
echo "Services using HOLD technologies:"
echo " - Jenkins pipelines remaining: $(find . -name "Jenkinsfile" 2>/dev/null | wc -l)"
echo " - MongoDB instances: $(kubectl get statefulsets -A \
-l "app.kubernetes.io/name=mongodb" --no-headers 2>/dev/null | wc -l)"
echo ""
echo "=== Developer Experience ==="
echo " - Avg time to first deploy (new service): query from platform metrics"
echo " - Platform support ticket backlog: query from ticketing system"
echo " - Self-service success rate: query from platform API logs"
# golden-path-config.yaml: Template configuration for team scaffolding
apiVersion: platform.internal/v1
kind: TeamOnboarding
metadata:
  name: new-stream-aligned-team
  description: "Everything a new stream-aligned team gets on day one"
team_setup:
  # Communication
  slack_channels:
    - "{team-name}"
    - "{team-name}-alerts"
    - "{team-name}-deployments"
  # Access & Permissions
  github:
    org: "company"
    team_slug: "{team-name}"
    repositories:
      - name: "{team-name}-service"
        template: "rest-api-service"
        visibility: "internal"
  # Infrastructure
  kubernetes:
    namespace: "{team-name}"
    resource_quotas:
      cpu: "8"
      memory: "16Gi"
      pods: "50"
  # Observability
  grafana:
    folder: "{team-name}"
    dashboards:
      - "service-overview"
      - "slo-dashboard"
      - "on-call-dashboard"
  # On-call
  pagerduty:
    service: "{team-name}"
    escalation_policy: "standard-engineering"
    schedule: "follow-the-sun"
  # Documentation
  backstage:
    component:
      type: "service"
      owner: "{team-name}"
      lifecycle: "production"
      system: "{business-domain}"
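The `{team-name}` and `{business-domain}` placeholders have to be filled in at onboarding time. Here is a minimal sketch of that substitution step, assuming the YAML has already been parsed into nested dicts and lists; the `render` helper is a hypothetical name, and the sample `template` reproduces only a fragment of the configuration above.

```python
def render(node, values: dict[str, str]):
    """Recursively replace {placeholder} tokens in strings, lists, and dicts."""
    if isinstance(node, str):
        for key, value in values.items():
            node = node.replace("{" + key + "}", value)
        return node
    if isinstance(node, list):
        return [render(item, values) for item in node]
    if isinstance(node, dict):
        return {key: render(value, values) for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


template = {
    "slack_channels": ["{team-name}", "{team-name}-alerts"],
    "kubernetes": {"namespace": "{team-name}", "resource_quotas": {"cpu": "8"}},
    "backstage": {"component": {"owner": "{team-name}", "system": "{business-domain}"}},
}
config = render(template, {"team-name": "checkout", "business-domain": "commerce"})
print(config["slack_channels"])
```

Plain string replacement is used instead of `str.format` because the placeholder names contain hyphens, which are not valid format field names.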
Case Studies
Spotify: Platform Team Evolution
From Fragmentation to Platform
Context: By 2016, Spotify's "full autonomy" squad model had created fragmentation. Over 200 autonomous squads had chosen different tech stacks, built duplicate infrastructure, and created inconsistent developer experiences. Onboarding a new engineer took weeks because every team did things differently.
Problem signals:
- 30+ different ways to deploy a service
- No standard way to discover what services existed or who owned them
- Duplicated effort across squads (every team built their own CI/CD)
- New engineers spent 60% of their first month learning tooling, not the domain
Solution — Backstage (open-sourced 2020):
- Service catalog: Single registry of all services, their owners, documentation, and dependencies
- Software templates: Golden paths for creating new services with all infrastructure pre-configured
- TechDocs: Documentation lives alongside code, rendered in the portal
- Plugin ecosystem: Teams can extend the portal for their specific needs
Results:
- Time to first deploy for new services: from 2 weeks to <1 hour
- Time to productivity for new engineers: from 60 days to ~20 days
- Service discoverability: from "ask on Slack" to instant catalog search
- Golden path adoption: 87% of new services use templates
Key lesson: Autonomy without platform creates fragmentation. Platform without autonomy creates bottlenecks. The balance: the platform team provides golden paths (self-service, optional but compelling), and stream-aligned teams can deviate when justified, documenting the deviation in an ADR.
ThoughtWorks Technology Radar Process
Quarterly Technology Governance at Scale
Context: ThoughtWorks (global consultancy, ~10,000 technologists) has published its Technology Radar publicly since 2010. Internally, it uses the same framework to govern technology choices across projects.
Process:
- Collection (Month 1-2): Technologists across the organization submit "blips" — technology observations from real project experience. Each blip needs: name, proposed ring, rationale from production experience.
- Review (Month 2, 2-day session): ~20 senior technologists meet to debate each blip. Every ring placement requires consensus. Disagreements are discussed until resolved or the blip is deferred.
- Publication (Month 3): Radar is published with detailed write-ups explaining each movement. "Moved up" and "moved down" are highlighted for visibility.
- Application: Project teams use the radar when making technology choices. "Adopt" = default. "Trial" = needs tech lead approval. "Hold" = needs architect approval + migration plan.
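The application rules in the last step can be encoded as a pre-flight check for a technology choice. This sketch uses a hypothetical `TechChoice` record; the Trial and Hold requirements come from the text above, while Adopt requires nothing and Assess is left unchecked because the rules do not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class TechChoice:
    technology: str
    ring: str
    approvals: set[str] = field(default_factory=set)
    has_migration_plan: bool = False


def missing_requirements(choice: TechChoice) -> list[str]:
    """List what is still needed before a team may proceed with this choice."""
    missing = []
    if choice.ring == "trial" and "tech lead" not in choice.approvals:
        missing.append("tech lead approval")
    if choice.ring == "hold":
        if "architect" not in choice.approvals:
            missing.append("architect approval")
        if not choice.has_migration_plan:
            missing.append("migration plan")
    # "adopt" needs nothing; "assess" is not covered by the rules above.
    return missing


print(missing_requirements(TechChoice("Jenkins", "hold", approvals={"architect"})))
```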
Why it works:
- Evidence-based: Every recommendation comes from real project experience, not theoretical evaluation
- Living document: Quarterly updates keep it current (unlike annual policies that become stale)
- Visible governance: Everyone can see what's recommended and why — no "shadow decisions"
- Balanced autonomy: Teams can still choose "Trial" or "Assess" technologies — they just need to be explicit about the risk
Adaptation for your organization: Start with 2-3 quadrants (Languages, Platforms, Tools). Review biannually if quarterly is too frequent. Require at least 2 teams' production experience before moving anything to "Adopt." Create a public channel where anyone can propose blips — governance should be participatory, not top-down.
Conclusion & Next Steps
The key takeaways from these two modules:
- Cognitive load is the constraint. Teams have finite capacity for complexity. Team Topologies minimizes extraneous load (via platforms), manages germane load (via enabling teams), and ensures intrinsic load fits within team capacity. If a team is struggling, the first question is "what load can we remove?" not "how do we add more people?"
- Four team types, three interaction modes. Stream-aligned (deliver value), platform (reduce load), enabling (transfer capability), complicated-subsystem (encapsulate expertise). Interactions are collaboration (time-limited), X-as-a-Service (steady-state goal), or facilitating (temporary coaching). If your org has other team types, the boundaries are likely unclear.
- Platform is a product. Treat internal developers as customers. Measure adoption rate, time-to-first-deploy, and developer satisfaction. If your platform requires training to use, it's not a platform — it's another system to maintain.
- Technology radar enables disciplined innovation. Four rings (Adopt, Trial, Assess, Hold) communicate recommendations clearly. Evidence-based governance (from production experience) trumps theoretical evaluation. Review quarterly to stay current.
- ADRs are institutional memory. Store decisions alongside code. Explain context and consequences, not just the decision itself. Never delete — supersede. A future engineer should understand why without asking anyone currently employed.
- Paved roads AND guardrails. Make the right thing easy (golden paths, templates, self-service) AND make the wrong thing hard (automated policy enforcement, network policies, dependency scanning). Neither alone is sufficient — together they create a system where the fast path is also the safe path.
Next in the Series
In Part 19: AI-Native Systems, we'll explore the unique architectural challenges of systems with AI at their core — ML pipelines, model serving patterns, feedback loops, probabilistic behavior, and how traditional architecture principles adapt when your components are non-deterministic.