Module 37: Architectural Evolution
Why Architecture Must Evolve
No architecture survives contact with reality unchanged. Three forces drive architectural evolution:
- Business changes: New markets, acquisitions, regulatory requirements, and business model pivots demand architectural adaptation. The e-commerce platform that served 1,000 orders/day has different constraints at 1,000,000 orders/day.
- Scale changes: What works at 10x doesn't work at 100x. Synchronous calls become bottlenecks, single databases become limitations, and monolithic deployments become coordination nightmares.
- Team changes: Organizations grow, restructure, and distribute. A system built by one team of 8 can't be maintained the same way once ownership is split among 4 teams of 12 in 3 time zones.
Technical Debt Quadrant
Martin Fowler's Technical Debt Quadrant (2009) categorizes debt along two axes: reckless vs prudent (quality of decision-making) and deliberate vs inadvertent (awareness of consequences).
quadrantChart
title Technical Debt Quadrant
x-axis Inadvertent --> Deliberate
y-axis Reckless --> Prudent
quadrant-1 "Prudent & Deliberate"
quadrant-2 "Prudent & Inadvertent"
quadrant-3 "Reckless & Inadvertent"
quadrant-4 "Reckless & Deliberate"
"Ship now, refactor later": [0.75, 0.80]
"Now we know how we should have done it": [0.25, 0.75]
"What's layering?": [0.25, 0.20]
"No time for design": [0.75, 0.20]
Prudent-Deliberate: "We must ship now and deal with consequences." You know you're taking a shortcut but have a plan to address it. Example: hardcoding configuration to hit a launch deadline, with a ticket to externalize it next sprint.
Prudent-Inadvertent: "Now we know how we should have done it." You made the best decision with available information, but learned better approaches after delivery. Example: choosing REST when event-driven would have been more suitable — discovered only after observing production traffic patterns.
Reckless-Deliberate: "We don't have time for design." Consciously ignoring known best practices due to pressure. Example: skipping input validation because "we'll add it later" — knowing the risk but accepting it without a plan to remediate.
Reckless-Inadvertent: "What's layering?" The team doesn't know what good looks like. Example: junior team creating a tightly coupled monolith because they've never seen well-structured code — they don't know they're creating debt.
Measuring Technical Debt
You can't manage what you can't measure. Track these metrics to quantify the impact of debt. The last three are DORA metrics; code complexity is a complementary code-level signal:
- Code complexity: Cyclomatic complexity per module, coupling between modules, depth of inheritance
- Deployment frequency: How often can you safely deploy? Lower frequency signals accumulated coupling
- Change failure rate: What percentage of deployments cause incidents? High rate signals brittle code
- Lead time for changes: How long from commit to production? Increasing time signals growing friction
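#!/bin/bash
# deployment-metrics.sh: track deployment health indicators from git history.
# Heuristic: assumes deploys and fixes can be recognized from commit messages.
REPO_DIR="${1:-.}"
DAYS="${2:-30}"

echo "=== Deployment Metrics (Last ${DAYS} days) ==="

# Deployment frequency (commits whose message mentions "deploy" or "release")
DEPLOYS=$(git -C "$REPO_DIR" log --since="${DAYS} days ago" \
  --grep="deploy\|release" --oneline | wc -l)
echo "Deployments: $DEPLOYS"
echo "Frequency: $(echo "scale=1; $DEPLOYS * 7 / $DAYS" | bc) per week"

# Change failure rate (hotfix/revert/rollback commits relative to deploys)
FIXES=$(git -C "$REPO_DIR" log --since="${DAYS} days ago" \
  --grep="hotfix\|revert\|rollback" --oneline | wc -l)
if [ "$DEPLOYS" -gt 0 ]; then
  FAILURE_RATE=$(echo "scale=1; $FIXES * 100 / $DEPLOYS" | bc)
  echo "Change Failure Rate: ${FAILURE_RATE}%"
else
  echo "Change Failure Rate: N/A (no deploys)"
fi

# Lead time (approximation: avg days from a branch's first commit to its merge
# into the checked-out branch; true lead time needs production deploy timestamps)
TOTAL_SECS=0
MERGES=0
while read -r HASH MERGED_AT; do
  # Oldest commit reachable from the merged branch but not from the mainline
  FIRST_AT=$(git -C "$REPO_DIR" log --format=%ct "${HASH}^1..${HASH}^2" | tail -1)
  [ -n "$FIRST_AT" ] || continue
  TOTAL_SECS=$((TOTAL_SECS + MERGED_AT - FIRST_AT))
  MERGES=$((MERGES + 1))
done < <(git -C "$REPO_DIR" log --merges --first-parent \
  --since="${DAYS} days ago" --format="%H %ct")
if [ "$MERGES" -gt 0 ]; then
  echo "Lead Time: $(echo "scale=1; $TOTAL_SECS / $MERGES / 86400" | bc) days (avg over $MERGES merges)"
else
  echo "Lead Time: N/A (no merge commits)"
fi

echo ""
echo "=== Complexity Indicators ==="
# Files changed most frequently (hotspots)
echo "Top 10 Hotspot Files (most changes):"
git -C "$REPO_DIR" log --since="${DAYS} days ago" \
  --pretty=format: --name-only | sort | uniq -c | sort -rn | head -10

echo ""
echo "=== Debt Indicators ==="
# TODO/FIXME/HACK/XXX marker count under src/
TODO_COUNT=$(grep -r "TODO\|FIXME\|HACK\|XXX" "$REPO_DIR/src" 2>/dev/null | wc -l)
echo "Technical Debt Markers (TODO/FIXME/HACK/XXX): $TODO_COUNT"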
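Commit-message heuristics are only a rough proxy. If your pipeline records deployments explicitly, the same three metrics reduce to simple arithmetic. The sketch below assumes a hypothetical `Deployment` record (the field names are illustrative, not taken from any particular tool) and uses median lead time in hours:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime   # first commit of the change
    deployed_at: datetime    # when the change reached production
    caused_incident: bool    # did it trigger an incident, rollback, or hotfix?


def deployment_metrics(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute frequency, change failure rate, and median lead time."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = sum(1 for d in deploys if d.caused_incident)
    lead_times_h = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    return {
        "deploys_per_week": len(deploys) * 7 / window_days,
        "change_failure_rate_pct": 100 * failures / len(deploys),
        "median_lead_time_hours": median(lead_times_h),
    }


# Made-up sample data: four deployments in a 14-day window, one of which failed.
start = datetime(2026, 5, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=4), False),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=10), True),
    Deployment(start + timedelta(days=7), start + timedelta(days=7, hours=6), False),
    Deployment(start + timedelta(days=10), start + timedelta(days=10, hours=8), False),
]
metrics = deployment_metrics(sample, window_days=14)
print(metrics)
# {'deploys_per_week': 2.0, 'change_failure_rate_pct': 25.0, 'median_lead_time_hours': 7.0}
```

Watching how these numbers trend over successive windows is usually more informative than any single reading.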
Strangler Fig Pattern
The pattern is named after the strangler fig, a tree that grows around a host tree and eventually replaces it. It incrementally migrates functionality from a legacy system to a new one, routing traffic feature by feature until the legacy system can be decommissioned.
flowchart LR
subgraph Phase1["Phase 1: Intercept"]
C1[Client] --> P1[Proxy/Facade]
P1 --> L1[Legacy System<br/>100% traffic]
end
subgraph Phase2["Phase 2: Parallel"]
C2[Client] --> P2[Proxy/Facade]
P2 -->|Feature A| N2[New System]
P2 -->|Features B,C,D| L2[Legacy System]
end
subgraph Phase3["Phase 3: Complete"]
C3[Client] --> P3[Proxy/Facade]
P3 --> N3[New System<br/>100% traffic]
L3[Legacy System<br/>Decommissioned] -.->|removed| P3
end
Phase1 --> Phase2 --> Phase3
# feature-flags.yaml: Migration routing configuration
# Controls traffic routing between legacy and new systems
migration:
  proxy:
    type: envoy
    listen_port: 8080
  routes:
    # Phase 1: User authentication migrated to new system
    - feature: "user-auth"
      status: "migrated"
      target: "new-system"
      rollback_target: "legacy"
      canary_percentage: 100
      migrated_date: "2026-01-15"
    # Phase 2: Order processing in canary rollout
    - feature: "order-processing"
      status: "canary"
      target: "new-system"
      rollback_target: "legacy"
      canary_percentage: 25
      canary_start: "2026-05-01"
      success_criteria:
        error_rate_max: 0.1
        p99_latency_max_ms: 500
        comparison_window: "7d"
    # Phase 3: Inventory still on legacy
    - feature: "inventory-management"
      status: "legacy"
      target: "legacy"
      planned_migration: "2026-Q3"
      blockers:
        - "Requires event sourcing implementation"
        - "Dependent on order-processing migration"
  rollback:
    automatic: true
    trigger:
      error_rate_above: 0.5
      latency_p99_above_ms: 2000
    cooldown_minutes: 30
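To make the routing decision concrete, here is a minimal sketch of how a facade might act on `canary_percentage`. It is an illustration under stated assumptions, not Envoy's mechanism: the `ROUTES` dict is a hand-written mirror of the YAML above, and `bucket` and `route` are hypothetical helper names. Hashing the user ID keeps each user on the same side of the split across requests.

```python
import hashlib

# Hypothetical in-memory mirror of the routes in feature-flags.yaml.
ROUTES = {
    "user-auth": {"target": "new-system", "rollback_target": "legacy", "canary_percentage": 100},
    "order-processing": {"target": "new-system", "rollback_target": "legacy", "canary_percentage": 25},
    "inventory-management": {"target": "legacy"},  # not yet migrated
}


def bucket(feature: str, user_id: str) -> int:
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(feature: str, user_id: str) -> str:
    """Return the backend that should serve this request."""
    cfg = ROUTES.get(feature)
    if cfg is None:
        return "legacy"  # unknown features stay on the legacy system
    if bucket(feature, user_id) < cfg.get("canary_percentage", 0):
        return cfg["target"]
    return cfg.get("rollback_target", "legacy")


print(route("user-auth", "user-1"))             # new-system (100% migrated)
print(route("inventory-management", "user-1"))  # legacy (no canary configured)
print(route("order-processing", "user-1"))      # stable per user, new-system for roughly 25% of users
```

Including the feature name in the hash input means a user who lands in the canary for one feature is not automatically in the canary for every other feature.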
Parallel Run
Run old and new systems simultaneously, sending the same input to both and comparing outputs. The legacy system remains the "source of truth" while you validate the new system's correctness. This is essential for high-risk migrations where incorrect behavior has severe consequences (financial calculations, healthcare records).
Key principles:
- Shadow traffic: Fork incoming requests to both systems. Only return the legacy response to users.
- Output comparison: Log differences between legacy and new system responses. Categorize as critical (wrong amount), cosmetic (different formatting), or timing (eventual consistency delay).
- Graduated confidence: Once the rate of critical differences drops below a threshold (e.g., <0.01% over 14 days), switch traffic to the new system.
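The three principles can be sketched in a few lines. This is a simplified, synchronous illustration: `ParallelRun`, `classify`, and the `amount_cents` field are hypothetical names, the timing category is omitted, and a real deployment would typically call the candidate asynchronously so it cannot add latency to the user's request.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Handler = Callable[[dict], dict]


def classify(legacy_resp: dict, new_resp: dict) -> Optional[str]:
    """Return None when responses match, else a severity label."""
    if legacy_resp == new_resp:
        return None
    if legacy_resp.get("amount_cents") != new_resp.get("amount_cents"):
        return "critical"   # wrong amount
    return "cosmetic"       # e.g., different formatting


@dataclass
class ParallelRun:
    legacy: Handler
    candidate: Handler
    total: int = 0
    diffs: list = field(default_factory=list)

    def handle(self, request: dict) -> dict:
        """Serve from legacy; run the candidate in the shadow and log diffs."""
        self.total += 1
        legacy_resp = self.legacy(request)
        try:
            new_resp = self.candidate(request)
        except Exception as exc:  # a candidate failure must never reach the user
            self.diffs.append({"request": request, "severity": "critical", "error": repr(exc)})
            return legacy_resp
        severity = classify(legacy_resp, new_resp)
        if severity is not None:
            self.diffs.append({"request": request, "severity": severity})
        return legacy_resp  # legacy stays the source of truth

    def critical_rate(self) -> float:
        critical = sum(1 for d in self.diffs if d["severity"] == "critical")
        return critical / self.total if self.total else 0.0


def legacy_total(req: dict) -> dict:
    return {"amount_cents": req["qty"] * 1999, "currency": "USD"}


def new_total(req: dict) -> dict:
    # Deliberate discrepancy for the demo: a bulk discount the legacy system lacks.
    cents = req["qty"] * 1999
    if req["qty"] >= 10:
        cents = cents * 90 // 100
    return {"amount_cents": cents, "currency": "USD"}


runner = ParallelRun(legacy_total, new_total)
responses = [runner.handle({"qty": q}) for q in (1, 2, 10, 3)]
print(runner.critical_rate())  # 0.25: one of four requests differed in amount
```

Users only ever see the legacy response; the recorded differences are what you review before deciding to switch traffic.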
Branch by Abstraction
Introduce an abstraction layer (interface) over the code you want to replace. Write a new implementation behind the same interface. Toggle between implementations using feature flags, so both versions live on the main branch and no long-lived source-control branch is needed.
Steps:
- Create abstraction: Extract interface from existing implementation
- Implement new version: Build new implementation behind the same interface
- Toggle: Use feature flags to route to old or new implementation
- Verify: Run both in production, compare results
- Remove: Delete old implementation and feature flag once confident
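The steps above can be sketched as follows. The tax calculator domain, the class names, and the `FLAGS` dict are all hypothetical; a real system would read the flag from its feature-flag service rather than a module-level dict.

```python
from typing import Protocol


class TaxCalculator(Protocol):
    """Step 1: the abstraction extracted from the existing implementation."""
    def tax_cents(self, subtotal_cents: int) -> int: ...


class LegacyTaxCalculator:
    """Existing behavior, moved behind the interface unchanged (truncates)."""
    def tax_cents(self, subtotal_cents: int) -> int:
        return subtotal_cents * 8 // 100


class NewTaxCalculator:
    """Step 2: the replacement behind the same interface (rounds half up)."""
    def tax_cents(self, subtotal_cents: int) -> int:
        return (subtotal_cents * 8 + 50) // 100


# Step 3: a feature flag selects the implementation at runtime.
FLAGS = {"use_new_tax_calculator": False}


def get_tax_calculator() -> TaxCalculator:
    if FLAGS["use_new_tax_calculator"]:
        return NewTaxCalculator()
    return LegacyTaxCalculator()


def checkout_total(subtotal_cents: int) -> int:
    """Calling code depends only on the abstraction, never on a concrete class."""
    return subtotal_cents + get_tax_calculator().tax_cents(subtotal_cents)


print(checkout_total(1999))            # 2158 with the legacy implementation
FLAGS["use_new_tax_calculator"] = True
print(checkout_total(1999))            # 2159 with the new implementation
```

Once the new implementation has been verified, the final step is to delete `LegacyTaxCalculator` and the flag; the interface can stay or be inlined.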
Expand-Contract for APIs
When evolving APIs with existing consumers, use the expand-contract pattern (also called "parallel change"):
- Expand: Add new fields/endpoints alongside existing ones. Both old and new consumers work.
- Migrate: Update consumers to use new fields/endpoints. Track adoption metrics.
- Contract: Remove old fields/endpoints once all consumers have migrated (verified by traffic logs showing zero usage).
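As a small illustration of the three phases, assume a hypothetical user endpoint whose v1 response has a single `full_name` field and whose v2 response splits it in two. The handler names, the phase labels, and the `(consumer, version)` log format are assumptions made for this sketch.

```python
from collections import Counter


def get_user_v1(user: dict) -> dict:
    """Old contract: one combined name field."""
    return {"full_name": f"{user['given_name']} {user['family_name']}"}


def get_user_v2(user: dict) -> dict:
    """New contract: separate name fields."""
    return {"given_name": user["given_name"], "family_name": user["family_name"]}


def handlers(phase: str) -> dict:
    """Endpoint versions served in each phase of the parallel change."""
    if phase in ("expand", "migrate"):
        return {"v1": get_user_v1, "v2": get_user_v2}  # both contracts available
    if phase == "contract":
        return {"v2": get_user_v2}                     # old contract removed
    raise ValueError(f"unknown phase: {phase}")


def remaining_v1_consumers(traffic_log: list) -> Counter:
    """Count v1 calls per consumer; an empty result means contracting is safe."""
    return Counter(consumer for consumer, version in traffic_log if version == "v1")


user = {"given_name": "Ada", "family_name": "Lovelace"}
print(handlers("expand")["v1"](user))  # {'full_name': 'Ada Lovelace'}
print(handlers("expand")["v2"](user))  # {'given_name': 'Ada', 'family_name': 'Lovelace'}

log = [("web", "v2"), ("ios", "v1"), ("ios", "v1"), ("web", "v2")]
print(remaining_v1_consumers(log))     # Counter({'ios': 2}): not yet safe to contract
```

The per-consumer count doubles as the adoption metric from the migrate step: it tells you who still needs to move, not just that someone does.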
Module 38: Conway's Law
The Original Paper
In 1967, Melvin Conway submitted a paper titled "How Do Committees Invent?" (rejected by Harvard Business Review, later published in Datamation). His core observation:
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
Despite the name, this is an empirical observation rather than a formal law, but practitioners see it confirmed repeatedly. The communication pathways between teams become the interfaces between system components. A team can't effectively design a component that requires coordination patterns their organization doesn't support.
Modern Interpretation
Conway's Law manifests everywhere in modern software:
flowchart LR
subgraph OrgA["Organization A: 3 Teams"]
FE[Frontend Team]
BE[Backend Team]
DB[Database Team]
end
subgraph SysA["System A: 3-Tier"]
UI[UI Layer]
API[API Layer]
Data[Data Layer]
end
FE --> UI
BE --> API
DB --> Data
subgraph OrgB["Organization B: Product Squads"]
S1[Search Squad]
S2[Checkout Squad]
S3[Recommendations Squad]
end
subgraph SysB["System B: Microservices"]
MS1[Search Service]
MS2[Checkout Service]
MS3[Recommendations Service]
end
S1 --> MS1
S2 --> MS2
S3 --> MS3
Concrete examples:
- 3-tier architecture ↔ 3 teams: Frontend team, backend team, database team → produces layered architecture with clear horizontal boundaries but poor vertical feature delivery
- Microservices ↔ many small teams: Product-aligned squads → produces independently deployable services aligned to business capabilities
- Monolith ↔ single team: One team owns everything → produces tightly coupled system because intra-team communication is cheap
- Distributed monolith ↔ teams forced to use microservices: Teams without clear boundaries trying to do microservices → produces services that can't be deployed independently (worst of both worlds)
Inverse Conway Maneuver
If Conway's Law says "org structure shapes architecture," then the Inverse Conway Maneuver says: "design your org structure to get the architecture you want." Restructure teams to match your target architecture — the system will follow.
flowchart TD
subgraph Step1["Step 1: Define Target Architecture"]
TA[Desired System Design<br/>Domain-driven boundaries]
end
subgraph Step2["Step 2: Design Team Topology"]
TT[Organize teams to match<br/>architectural boundaries]
end
subgraph Step3["Step 3: Natural Emergence"]
NA[Architecture emerges from<br/>team communication patterns]
end
Step1 -->|"What structure do we want?"| Step2
Step2 -->|"Conway's Law works FOR us"| Step3
style Step1 fill:#f0f8ff
style Step2 fill:#f0fff0
style Step3 fill:#fff8f0
Architecture as Team API
Treat each team's outputs as an API contract — not just code APIs, but the complete interface between teams. A "Team API" includes:
- Code: Service endpoints, libraries, SDKs provided to other teams
- Documentation: API docs, runbooks, architecture decision records
- Communication: How to contact the team (Slack channel, office hours, on-call rotation)
- Delivery: Release cadence, SLAs, deprecation policies
# team-api.yaml: Team API definition (modeled on the Team Topologies "Team API" template)
team:
  name: "Search Platform"
  type: "platform"
  mission: "Enable any product team to add search to their domain in < 1 day"
  cognitive_load: "medium"

provided_services:
  - name: "Search Indexing API"
    type: "X-as-a-Service"
    sla:
      availability: "99.95%"
      index_latency_p99: "2s"
      query_latency_p99: "100ms"
    documentation: "https://internal.docs/search/api"
    onboarding: "Self-service via portal, no tickets needed"
  - name: "Search SDK"
    type: "library"
    languages: ["Java", "Python", "TypeScript"]
    versioning: "semver"
    deprecation_policy: "12 months notice, migration guide provided"

communication:
  slack: "#search-platform"
  office_hours: "Tuesdays 2-3pm UTC"
  on_call: "PagerDuty rotation: search-platform-oncall"
  rfc_process: "Submit RFC to #search-rfcs, 1-week review period"

dependencies:
  consumed:
    - team: "Infrastructure"
      service: "Kubernetes Platform"
      interaction_mode: "X-as-a-Service"
    - team: "Data Engineering"
      service: "Event Stream"
      interaction_mode: "X-as-a-Service"

boundaries:
  owns:
    - "Search index infrastructure"
    - "Query parsing and ranking algorithms"
    - "Search relevance and A/B testing"
  does_not_own:
    - "Domain-specific data schemas (owned by product teams)"
    - "UI search components (owned by frontend platform)"
Fitness Functions for Evolutionary Architecture
A fitness function is an automated, objective measure of how well an architecture meets a specific quality attribute. The term comes from genetic algorithms, where a fitness function scores candidate solutions; here it scores proposed changes against architectural constraints and so guides how the system evolves.
Fitness functions run in CI/CD pipelines and block deployments that violate architectural constraints:
"""
fitness_functions.py: Automated architecture governance.
Run as part of a CI/CD pipeline to enforce architectural constraints.
"""
import json
import re
import sys
from pathlib import Path


def check_cyclic_dependencies(src_dir: str) -> bool:
    """Fitness function: No circular dependencies between modules."""
    # Build the import graph: dotted module name -> imported module names
    modules = {}
    src_path = Path(src_dir)
    for py_file in src_path.rglob("*.py"):
        # with_suffix("") drops ".py" safely; str.rstrip(".py") strips any
        # trailing ".", "p", or "y" characters and would mangle "happy.py"
        module_name = ".".join(py_file.relative_to(src_path).with_suffix("").parts)
        imports = []
        with open(py_file) as f:
            for line in f:
                parts = line.split()
                # Handles absolute "import x" / "from x import y" statements;
                # the full dotted name is kept so it can match a module key
                if len(parts) >= 2 and parts[0] in ("import", "from"):
                    imports.append(parts[1].rstrip(","))
        modules[module_name] = imports

    # DFS cycle detection
    visited = set()
    rec_stack = set()

    def has_cycle(node):
        visited.add(node)
        rec_stack.add(node)
        for neighbor in modules.get(node, []):
            if neighbor in modules:  # ignore third-party and stdlib imports
                if neighbor not in visited:
                    if has_cycle(neighbor):
                        return True
                elif neighbor in rec_stack:
                    print(f"CYCLE DETECTED: {node} -> {neighbor}")
                    return True
        rec_stack.discard(node)
        return False

    for module in modules:
        if module not in visited:
            if has_cycle(module):
                return False
    return True


def check_service_coupling(max_dependencies: int = 5) -> bool:
    """Fitness function: No service depends on more than N other services."""
    # Read service dependency manifest: {"service": ["dep1", "dep2", ...]}
    manifest_path = Path("architecture/dependencies.json")
    if not manifest_path.exists():
        print("WARNING: No dependency manifest found")
        return True
    with open(manifest_path) as f:
        deps = json.load(f)
    violations = []
    for service, dependencies in deps.items():
        if len(dependencies) > max_dependencies:
            violations.append(
                f"{service} has {len(dependencies)} dependencies (max: {max_dependencies})"
            )
    if violations:
        print("COUPLING VIOLATIONS:")
        for v in violations:
            print(f"  - {v}")
        return False
    return True


def check_api_versioning() -> bool:
    """Fitness function: All public APIs must be versioned."""
    openapi_files = list(Path("api").rglob("openapi.yaml"))
    for api_file in openapi_files:
        with open(api_file) as f:
            content = f.read()
        # Accept any /vN/ path segment rather than only /v1/ and /v2/
        if not re.search(r"/v\d+/", content):
            print(f"UNVERSIONED API: {api_file}")
            return False
    return True


if __name__ == "__main__":
    results = {
        "No Cyclic Dependencies": check_cyclic_dependencies("src"),
        "Service Coupling Limit": check_service_coupling(max_dependencies=5),
        "API Versioning": check_api_versioning(),
    }
    print("\n=== Architecture Fitness Report ===")
    all_pass = True
    for name, passed in results.items():
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {name}")
        if not passed:
            all_pass = False
    # A non-zero exit code fails the pipeline stage
    sys.exit(0 if all_pass else 1)
{
  "title": "ADR-042: Adopt Event Sourcing for Order Domain",
  "status": "accepted",
  "date": "2026-05-10",
  "deciders": ["@arch-lead", "@order-team-lead", "@platform-lead"],
  "context": {
    "problem": "Order state changes are lost after mutation. We cannot reconstruct order history, audit changes, or replay events for debugging. Change failure rate for order service is 12% (target: < 5%).",
    "constraints": [
      "Must maintain backward compatibility with existing order API consumers",
      "Migration must be zero-downtime",
      "Team has limited event sourcing experience"
    ]
  },
  "decision": "Adopt event sourcing for the Order aggregate. Use strangler fig pattern to migrate incrementally. Start with new order creation, then migrate existing order mutations over 3 sprints.",
  "consequences": {
    "positive": [
      "Full audit trail of all order state changes",
      "Ability to replay events for debugging and analytics",
      "Natural fit for event-driven architecture (already using Kafka)"
    ],
    "negative": [
      "Increased complexity for developers unfamiliar with event sourcing",
      "Eventually consistent read models require CQRS",
      "Event schema evolution requires careful versioning"
    ],
    "risks": [
      "Team ramp-up time may be longer than estimated (mitigation: enabling team pairing for first sprint)"
    ]
  },
  "alternatives_considered": [
    {
      "option": "CDC (Change Data Capture) from PostgreSQL WAL",
      "rejected_because": "Captures row-level changes, not domain events. Doesn't solve the semantic audit trail problem."
    },
    {
      "option": "Audit table with triggers",
      "rejected_because": "Tight coupling to relational schema. Doesn't support event replay or projections."
    }
  ]
}
Case Studies
Amazon's Monolith-to-Services Migration (2002–2006)
The Bezos API Mandate (2002)
Context: In 2002, Amazon's codebase was a massive C++ monolith called "Obidos." Teams couldn't deploy independently, coordination overhead was enormous, and feature delivery slowed to a crawl as the company scaled from hundreds to thousands of engineers.
The mandate (paraphrased):
- All teams will expose their data and functionality through service interfaces
- Teams must communicate with each other through these interfaces
- There will be no other form of inter-process communication
- All service interfaces must be designed to be externalizable (could be exposed to external developers)
- Anyone who doesn't do this will be fired
Migration approach: Classic strangler fig. Amazon didn't rewrite Obidos — they extracted services one at a time. First the catalog service, then product search, then recommendations. Each extraction created a team that owned that service end-to-end (Inverse Conway). Over 4 years, the monolith shrank as services grew around it.
Result: By 2006, Amazon had hundreds of services. This architecture enabled AWS — the same service interfaces that powered Amazon.com could be offered as cloud services (S3, SQS, etc.). The "Two Pizza Teams" model emerged naturally: if a service is owned by one team, and that team must be small enough to feed with two pizzas (~6-8 people), then services stay small and focused.
Conway's Law in action: Amazon restructured teams FIRST (around services), and the architecture followed. They applied the Inverse Conway Maneuver before it had a name.
Spotify Squad Model as Inverse Conway
Designing Teams to Shape Architecture
Context: Spotify (2012-2015) needed to scale from 30 to 250+ engineers while maintaining fast feature delivery and technical autonomy. They deliberately designed team structure to produce the architecture they wanted.
Structure:
- Squads: Autonomous teams (6-12 people) aligned to a business feature area. Each squad owns its services end-to-end. Squads decide their own tech stack, architecture, and deployment cadence.
- Tribes: Collections of squads in related areas, sized to stay below roughly 100 people. Dependencies between tribes are minimized; squads that must depend on each other sit in the same tribe.
- Chapters: Groups of people with the same skill set across the squads of a tribe (e.g., all backend engineers). Share knowledge without coupling code.
- Guilds: Cross-cutting interest groups (optional, voluntary). Spread innovation across organizational boundaries.
Architecture result: Because squads owned independent feature areas with minimal inter-squad communication, the system naturally decomposed into independent services. The architecture mirrored the team boundaries — Conway's Law working for them rather than against them.
Caveats (2024 retrospective): Spotify has since acknowledged the model had challenges. Full autonomy led to fragmentation (30+ data stores, inconsistent APIs). They've since introduced more platform teams and standardization — balancing autonomy with coherence. The lesson: Inverse Conway works, but pure autonomy without platform teams creates a different kind of debt.
Conclusion & Next Steps
The key takeaways from these modules:
- Architecture must evolve or it dies. Business changes, scale changes, and team changes all force architectural adaptation. The question isn't "should we evolve?" but "how do we evolve safely?"
- Technical debt is a spectrum of choices. Prudent-deliberate debt (shipping fast with a plan) is a valid business decision. Reckless-inadvertent debt (not knowing better) is an education problem. Measure debt impact primarily through deployment metrics rather than through code smell counts alone.
- Strangler fig over big bang. Incremental migration (strangler fig, parallel run, branch by abstraction) succeeds where rewrites fail. You get value early, can stop at any point, and validate with real traffic.
- Conway's Law is inescapable. Your system architecture WILL mirror your organizational communication. Accept this and use the Inverse Conway Maneuver — design teams to produce the architecture you want.
- Fitness functions automate governance. Instead of architecture review boards (slow, subjective, inconsistent), encode constraints as automated checks in CI/CD. If coupling exceeds threshold → pipeline fails. If cyclic dependencies appear → pipeline fails. Architecture governance at the speed of deployment.
- Team APIs make boundaries explicit. When teams publish their interfaces (code + docs + communication + delivery), boundaries are clear and enforceable. Implicit boundaries always erode under deadline pressure.
Next in the Series
In Part 18: Team Topologies & Governance, we'll dive deep into the four fundamental team types (platform, stream-aligned, enabling, complicated-subsystem), interaction modes, technology radar governance, and architecture decision frameworks that scale across organizations.