Introduction
Every deployment strategy, every CI/CD pipeline, every progressive delivery system we have discussed in this series ultimately relies on one thing: tests that tell us whether the software is safe to ship. Without tests, continuous integration is impossible — you cannot continuously integrate what you cannot continuously verify.
Yet testing is one of the most misunderstood disciplines in software engineering. Teams test too little (shipping bugs), test the wrong things (false confidence), or write too many tests of the wrong kind (slow pipelines with brittle tests).
Testing as Confidence
Think of tests as a confidence score. Each test you write increases your confidence that a specific behaviour works correctly. The question is not "do we have enough tests?" but rather "do we have enough confidence to deploy?"
Different organisations need different confidence levels (the percentages below are illustrative orders of magnitude, not measured values):
- Medical device software — 99.999% confidence required (lives at stake)
- Financial trading systems — 99.99% confidence (money at stake)
- E-commerce checkout — 99.9% confidence (revenue at stake)
- Internal admin tool — 95% confidence (inconvenience at stake)
- Prototype/MVP — 80% confidence (learning at stake)
Your testing strategy should be calibrated to the cost of failure in your context.
Verification vs Validation
The IEEE Standard 1012 defines two fundamental quality activities that sound similar but answer very different questions:
| Activity | Question | Focus | Methods |
|---|---|---|---|
| Verification | "Are we building the product right?" | Conformance to specification | Code reviews, unit tests, static analysis, formal proofs |
| Validation | "Are we building the right product?" | Fitness for user needs | User acceptance testing, beta testing, usability studies |
V&V in Practice
Consider a login form that requires passwords to be at least 8 characters:
- Verification: Does the code correctly reject passwords shorter than 8 characters? (Testing against the specification)
- Validation: Is an 8-character minimum the right security choice for our users? Should we require 12 characters? Should we use passkeys instead? (Testing whether the specification itself is correct)
You can build a system that passes all verification checks (perfectly implements the spec) but fails validation (the spec itself was wrong, and users hate the product). Conversely, you can build something users love (validates well) but is riddled with implementation bugs (fails verification).
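The verification half of the login example can be written down as a test. The sketch below assumes a hypothetical `isValidPassword` function and a `MIN_PASSWORD_LENGTH` constant; neither name comes from a real codebase. Validation (whether 8 is the right number at all) cannot be answered by code like this.

```javascript
// Hypothetical implementation of the spec "passwords must be at least 8 characters".
const MIN_PASSWORD_LENGTH = 8; // the value comes from the spec, not from the code

function isValidPassword(password) {
  return typeof password === "string" && password.length >= MIN_PASSWORD_LENGTH;
}

// Verification: does the code conform to the spec?
console.assert(isValidPassword("1234567") === false, "7 characters must be rejected");
console.assert(isValidPassword("12345678") === true, "8 characters must be accepted");
```

Both assertions can pass while the product still fails validation, for example if user research shows that passkeys would serve users better than any password rule.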
Black-Box vs White-Box Testing
These two approaches differ in what the tester knows about the internal implementation:
| Aspect | Black-Box | White-Box |
|---|---|---|
| Knowledge | No access to source code; test only via inputs/outputs | Full access to source code; test internal paths |
| Focus | What the system does | How the system does it |
| Derived from | Requirements and specifications | Code structure and logic |
| Finds | Missing features, incorrect behaviour | Unreachable code, logic errors, edge cases |
Black-Box Techniques
Equivalence Partitioning
Divide the input domain into groups (partitions) where all values in a partition should produce the same behaviour. Test one value from each partition.
// Function: calculateDiscount(age)
// Spec: children (0-12) get 50%, teens (13-17) get 25%, adults (18-64) get 0%, seniors (65+) get 30%
// Equivalence partitions:
// Partition 1: age 0-12 (child) → test with age=6
// Partition 2: age 13-17 (teen) → test with age=15
// Partition 3: age 18-64 (adult) → test with age=30
// Partition 4: age 65+ (senior) → test with age=70
// Partition 5: invalid (negative) → test with age=-1
// Partition 6: invalid (non-integer) → test with age="abc"
function testCalculateDiscount() {
  console.assert(calculateDiscount(6) === 0.50, "Child discount");
  console.assert(calculateDiscount(15) === 0.25, "Teen discount");
  console.assert(calculateDiscount(30) === 0.00, "Adult no discount");
  console.assert(calculateDiscount(70) === 0.30, "Senior discount");
  // Partitions 5 and 6 (invalid input) need one test each as well; the expected
  // result depends on how the spec says invalid ages are handled.
}
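To make the partitions concrete, here is a minimal sketch of a `calculateDiscount` that satisfies the spec above, together with tests for the two invalid partitions. The spec does not say what happens on invalid input, so throwing `RangeError` and `TypeError` is an assumption made for illustration, as is the `throwsError` helper.

```javascript
// Hypothetical implementation; the throw-on-invalid behaviour is an assumption.
function calculateDiscount(age) {
  if (!Number.isInteger(age)) {
    throw new TypeError("age must be an integer");
  }
  if (age < 0) {
    throw new RangeError("age must not be negative");
  }
  if (age <= 12) return 0.5;  // child
  if (age <= 17) return 0.25; // teen
  if (age <= 64) return 0;    // adult
  return 0.3;                 // senior
}

// Helper: returns true if calling fn throws an error of the given type.
function throwsError(fn, ErrorType) {
  try {
    fn();
  } catch (err) {
    return err instanceof ErrorType;
  }
  return false;
}

// Partitions 5 and 6: one representative value each.
console.assert(throwsError(() => calculateDiscount(-1), RangeError), "Negative age rejected");
console.assert(throwsError(() => calculateDiscount("abc"), TypeError), "Non-integer age rejected");
```

If the real spec calls for a different reaction (returning 0, returning an error object), only the two invalid-partition assertions change; the partitions themselves stay the same.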
Boundary Value Analysis
Bugs cluster at the boundaries between partitions. Test values at, just below, and just above each boundary:
// Boundaries for calculateDiscount(age):
// Child/Teen boundary: 12, 13
// Teen/Adult boundary: 17, 18
// Adult/Senior boundary: 64, 65
// Lower bound: 0, -1
// Upper bound: depends on max (e.g., 150, 151)
function testBoundaryValues() {
  // Child/Teen boundary
  console.assert(calculateDiscount(12) === 0.50, "Age 12 = child");
  console.assert(calculateDiscount(13) === 0.25, "Age 13 = teen");
  // Teen/Adult boundary
  console.assert(calculateDiscount(17) === 0.25, "Age 17 = teen");
  console.assert(calculateDiscount(18) === 0.00, "Age 18 = adult");
  // Adult/Senior boundary
  console.assert(calculateDiscount(64) === 0.00, "Age 64 = adult");
  console.assert(calculateDiscount(65) === 0.30, "Age 65 = senior");
  // Edge cases
  console.assert(calculateDiscount(0) === 0.50, "Age 0 = child");
}
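Boundary values can be derived mechanically from the partition limits rather than picked by hand. The sketch below assumes a hypothetical helper, `boundaryValues`, that takes the first value of each partition and returns that value plus its predecessor. For integer inputs, "at, just below, and just above" collapses to these two values per boundary.

```javascript
// Hypothetical helper: for the first value of each partition, return that value
// and the one just below it (the last value of the neighbouring partition).
function boundaryValues(partitionStarts) {
  const values = new Set();
  for (const start of partitionStarts) {
    values.add(start - 1); // last value of the partition below
    values.add(start);     // first value of this partition
  }
  return [...values].sort((a, b) => a - b);
}

// Partitions start at 0 (child), 13 (teen), 18 (adult), 65 (senior).
const ages = boundaryValues([0, 13, 18, 65]);
console.log(ages); // [-1, 0, 12, 13, 17, 18, 64, 65]
```

The resulting list matches the boundaries enumerated in the comments above, including the invalid value -1 just below the lower bound.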
White-Box Techniques
Statement Coverage
Every line of code is executed at least once. The weakest coverage criterion — it only proves that code can execute, not that it produces correct results for all inputs.
Branch Coverage
Every branch (if/else, switch case) is taken at least once in both directions. Stronger than statement coverage because it exercises decision paths.
Path Coverage
Every possible execution path through the code is tested. The strongest criterion, but often impractical: a function with 10 independent, sequential if/else statements has 2^10 = 1,024 possible paths, and loops can make the number of paths unbounded.
// Example: testing all branches
function processOrder(order) {
  if (order.total > 100) {        // Branch 1
    order.discount = 0.1;
  }
  if (order.isPremiumMember) {    // Branch 2
    order.shipping = 'free';
  } else {
    order.shipping = 'standard';
  }
  if (order.items.length > 5) {   // Branch 3
    order.giftWrap = true;
  }
  return order;
}
// Statement coverage: 2 tests (one call cannot run both the if and the else of Branch 2)
// Branch coverage: 2 tests (e.g., all conditions true, then all false) take every branch both ways
// Path coverage: 2^3 = 8 tests for all combinations
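The difference between the criteria becomes clearer when the tests are written out. The sketch below repeats `processOrder` so that it runs on its own, and uses a hypothetical `makeOrder` helper (not part of the original example) to build an order from three yes/no decisions.

```javascript
// processOrder repeated from the example above so the sketch is self-contained.
function processOrder(order) {
  if (order.total > 100) { order.discount = 0.1; }
  if (order.isPremiumMember) { order.shipping = "free"; } else { order.shipping = "standard"; }
  if (order.items.length > 5) { order.giftWrap = true; }
  return order;
}

// Hypothetical helper: builds an order from three yes/no decisions.
function makeOrder(bigTotal, premium, manyItems) {
  return {
    total: bigTotal ? 150 : 50,
    isPremiumMember: premium,
    items: new Array(manyItems ? 6 : 1).fill("item"),
  };
}

// Branch coverage: two tests take every decision in both directions.
const allTrue = processOrder(makeOrder(true, true, true));
console.assert(allTrue.discount === 0.1 && allTrue.shipping === "free" && allTrue.giftWrap === true);
const allFalse = processOrder(makeOrder(false, false, false));
console.assert(allFalse.discount === undefined && allFalse.shipping === "standard");

// Path coverage: enumerate all 2^3 = 8 combinations of the three decisions.
const paths = [];
for (const bigTotal of [true, false]) {
  for (const premium of [true, false]) {
    for (const manyItems of [true, false]) {
      paths.push(processOrder(makeOrder(bigTotal, premium, manyItems)));
    }
  }
}
console.log(paths.length); // 8
```

The two branch-coverage tests together also execute every statement. The six extra path tests only pay off if the decisions interact, which they do not in this function.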
The Testing Pyramid
Mike Cohn introduced the testing pyramid in his 2009 book Succeeding with Agile. It remains one of the most influential models for structuring a test suite:
flowchart TD
E2E["E2E Tests\n(Few, Slow, Expensive)"]
INT["Integration Tests\n(Some, Medium Speed)"]
UNIT["Unit Tests\n(Many, Fast, Cheap)"]
E2E --- INT
INT --- UNIT
| Layer | Count | Speed | Cost to Write | Cost to Maintain | Confidence |
|---|---|---|---|---|---|
| Unit Tests (base) | Thousands | Milliseconds each | Low | Low | High for individual units |
| Integration Tests (middle) | Hundreds | Seconds each | Medium | Medium | High for component interactions |
| E2E Tests (top) | Dozens | Minutes each | High | Very high | High for user journeys |
Why the Pyramid Shape Matters
The pyramid shape reflects the economics of testing:
- Unit tests are cheap and fast — run thousands in seconds, easy to write, easy to maintain, provide precise failure messages
- Integration tests are moderate — verify that components work together, slower because they involve real databases/APIs/queues
- E2E tests are expensive and slow — simulate real user interactions, require full environment, are brittle (break when UI changes), slow to execute
If you invert the pyramid (many E2E tests, few unit tests), you get:
- Slow CI pipelines (30+ minutes instead of 2 minutes)
- Flaky tests that fail randomly (network timeouts, race conditions)
- Vague failure messages ("E2E test failed" vs "calculateDiscount returned 0.3 instead of 0.25 for age=15")
- High maintenance cost (every UI change breaks multiple tests)
The Testing Trophy
Kent C. Dodds proposed an alternative model called the Testing Trophy (2018), which argues that for modern web applications, integration tests should form the bulk of your test suite — not unit tests.
flowchart TD
E2E2["E2E Tests (Few)"]
INT2["Integration Tests\n(MOST — biggest section)"]
UNIT2["Unit Tests (Some)"]
STATIC["Static Analysis (Base)"]
E2E2 --- INT2
INT2 --- UNIT2
UNIT2 --- STATIC
Trophy vs Pyramid — When Each Applies
| Model | Best For | Rationale |
|---|---|---|
| Pyramid | Complex business logic, algorithms, libraries, microservices | Many isolated units with complex internal logic benefit from fast unit tests |
| Trophy | Web applications, API services, UI-heavy applications | Value comes from components working together; mocking everything defeats the purpose |
The trophy model adds static analysis at the base — TypeScript, ESLint, and other tools that catch bugs without running code. This "free" layer catches an entire class of errors (typos, type mismatches, unused variables) that unit tests would otherwise need to cover.
Test Levels
| Level | Scope | What It Validates | Who Writes It | Example |
|---|---|---|---|---|
| Unit | Single function/class/module | Individual component logic | Developer | Testing a sorting function |
| Integration | Multiple components together | Component interactions, contracts | Developer / QA | API endpoint + database |
| System | Entire application | End-to-end functional requirements | QA / Test engineer | Complete checkout flow |
| Acceptance | Business requirements | User/stakeholder satisfaction | Product owner / User | UAT sign-off on new feature |
Test Types
While test levels describe the scope of testing, test types describe the quality attribute being tested:
| Type | Quality Attribute | Key Question | Tools |
|---|---|---|---|
| Functional | Correctness | Does it produce the right output? | Jest, pytest, JUnit |
| Performance | Speed & throughput | Is it fast enough under load? | k6, JMeter, Gatling |
| Security | Confidentiality, integrity | Can it be exploited? | OWASP ZAP, Snyk, Burp Suite |
| Usability | User experience | Can users accomplish their goals? | User testing sessions, heuristic evaluation |
| Accessibility | Inclusivity | Can all users (including users with disabilities) use it? | axe-core, Lighthouse, screen readers |
| Compatibility | Cross-platform support | Does it work on all target environments? | BrowserStack, Sauce Labs |
| Regression | Stability | Did new changes break existing features? | Automated test suite (any framework) |
| Smoke | Basic viability | Does the application start and respond? | Health check endpoints, basic scripts |
| Sanity | Targeted verification | Does the specific fix/feature work? | Focused subset of test suite |
Google's Testing Philosophy
Google categorises tests by size rather than by traditional type labels. Small tests run in a single process with no I/O (equivalent to unit tests). Medium tests can use multiple processes and localhost networking (integration). Large tests can access external systems (E2E). This size-based taxonomy maps directly to resource constraints and execution time, making it easier to enforce pipeline speed targets. Google's published guidance suggests a split of roughly 70% small, 20% medium, and 10% large tests as a starting point rather than a mandate, closely mirroring the testing pyramid's proportions. Their build system (Blaze internally, open-sourced as Bazel) attaches a default timeout to each size: small tests must complete in 60 seconds, medium tests in 300 seconds.
Test Design Techniques
Beyond equivalence partitioning and boundary value analysis (covered above), several additional techniques help you design effective tests:
Decision Tables
When a function has multiple input conditions that interact, a decision table systematically covers all combinations:
| Condition | Rule 1 | Rule 2 | Rule 3 | Rule 4 |
|---|---|---|---|---|
| Order total > $100 | T | T | F | F |
| Premium member | T | F | T | F |
| Action: Discount | 20% | 10% | 15% | 0% |
| Action: Free shipping | Yes | Yes | Yes | No |
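A decision table translates naturally into a table-driven test: one row of data per rule, one loop that checks them all. The sketch below assumes a hypothetical function `applyOrderRules` that implements the four rules in the table; the name and return shape are illustrative.

```javascript
// Hypothetical implementation of the four rules in the decision table.
function applyOrderRules(total, isPremiumMember) {
  const over100 = total > 100;
  if (over100 && isPremiumMember) return { discount: 0.2, freeShipping: true };  // Rule 1
  if (over100) return { discount: 0.1, freeShipping: true };                     // Rule 2
  if (isPremiumMember) return { discount: 0.15, freeShipping: true };            // Rule 3
  return { discount: 0, freeShipping: false };                                   // Rule 4
}

// One row per column (rule) of the decision table.
const decisionTable = [
  { rule: 1, total: 150, premium: true,  discount: 0.2,  freeShipping: true },
  { rule: 2, total: 150, premium: false, discount: 0.1,  freeShipping: true },
  { rule: 3, total: 50,  premium: true,  discount: 0.15, freeShipping: true },
  { rule: 4, total: 50,  premium: false, discount: 0,    freeShipping: false },
];

for (const row of decisionTable) {
  const result = applyOrderRules(row.total, row.premium);
  console.assert(result.discount === row.discount, `Rule ${row.rule}: discount`);
  console.assert(result.freeShipping === row.freeShipping, `Rule ${row.rule}: shipping`);
}
```

Keeping the expected values in a data table means a new business rule becomes a new row, not a new test function.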
State Transition Testing
For systems with states (e.g., order status: pending → confirmed → shipped → delivered), test all valid transitions and verify that invalid transitions are rejected.
stateDiagram-v2
[*] --> Pending: Order placed
Pending --> Confirmed: Payment received
Pending --> Cancelled: User cancels
Confirmed --> Shipped: Items dispatched
Confirmed --> Cancelled: Admin cancels
Shipped --> Delivered: Delivery confirmed
Shipped --> Returned: Return initiated
Delivered --> [*]
Cancelled --> [*]
Returned --> [*]
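One way to test the diagram exhaustively is to try every ordered pair of states and check that exactly the drawn transitions are accepted. The sketch below assumes a hypothetical `transition` function backed by an `ALLOWED` map; the expected set `VALID` is transcribed separately from the diagram so that the test does not simply mirror the implementation.

```javascript
// Hypothetical implementation under test: allowed next states per current state.
const ALLOWED = {
  Pending: ["Confirmed", "Cancelled"],
  Confirmed: ["Shipped", "Cancelled"],
  Shipped: ["Delivered", "Returned"],
  Delivered: [],
  Cancelled: [],
  Returned: [],
};

function transition(current, next) {
  if (!(ALLOWED[current] || []).includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}

// Expected behaviour, transcribed from the diagram: the six valid transitions.
const VALID = new Set([
  "Pending>Confirmed", "Pending>Cancelled",
  "Confirmed>Shipped", "Confirmed>Cancelled",
  "Shipped>Delivered", "Shipped>Returned",
]);

const states = Object.keys(ALLOWED);
let validCount = 0;
let rejectedCount = 0;
for (const from of states) {
  for (const to of states) {
    let accepted = true;
    try { transition(from, to); } catch (err) { accepted = false; }
    console.assert(accepted === VALID.has(`${from}>${to}`), `${from} -> ${to}`);
    if (accepted) validCount++; else rejectedCount++;
  }
}
console.log(validCount, rejectedCount); // 6 30
```

With six states there are 36 ordered pairs, so the loop also exercises the 30 transitions that must be rejected, such as Delivered to Pending.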
Pairwise Testing
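When testing all combinations of inputs is impractical (e.g., 5 browsers × 4 OS × 3 screen sizes × 2 languages = 120 combinations), pairwise testing guarantees that every pair of parameter values is tested together at least once. This dramatically reduces the number of test cases while still catching most combinatorial bugs. In this example at least 5 × 4 = 20 cases are needed, because every browser/OS pair must appear in some test, so a good pairwise suite has roughly 20 tests instead of 120.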
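A pairwise suite can be generated with a simple greedy search. The sketch below is one possible approach, not the algorithm of any particular tool: it repeatedly picks the combination that covers the most not-yet-covered pairs. The parameter values are hypothetical placeholders chosen to match the sizes in the text.

```javascript
// Hypothetical parameter values matching the sizes in the text (5 x 4 x 3 x 2).
const parameters = {
  browser: ["Chrome", "Firefox", "Safari", "Edge", "Opera"],
  os: ["Windows", "macOS", "Linux", "Android"],
  screen: ["small", "medium", "large"],
  language: ["en", "de"],
};

// Build the full cartesian product (120 combinations here).
function allCombinations(params) {
  let combos = [{}];
  for (const [name, values] of Object.entries(params)) {
    combos = combos.flatMap((combo) => values.map((v) => ({ ...combo, [name]: v })));
  }
  return combos;
}

// List every pair of parameter values that a single combination covers.
function pairsOf(combo) {
  const names = Object.keys(combo);
  const pairs = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      pairs.push(`${names[i]}=${combo[names[i]]}|${names[j]}=${combo[names[j]]}`);
    }
  }
  return pairs;
}

// Greedy selection: take the combination covering the most uncovered pairs, repeat.
function pairwiseSuite(params) {
  const candidates = allCombinations(params);
  const uncovered = new Set(candidates.flatMap(pairsOf));
  const suite = [];
  while (uncovered.size > 0) {
    let best = null;
    let bestGain = 0;
    for (const combo of candidates) {
      const gain = pairsOf(combo).filter((p) => uncovered.has(p)).length;
      if (gain > bestGain) { best = combo; bestGain = gain; }
    }
    suite.push(best);
    pairsOf(best).forEach((p) => uncovered.delete(p));
  }
  return suite;
}

const suite = pairwiseSuite(parameters);
console.log(`${allCombinations(parameters).length} combinations reduced to ${suite.length} tests`);
```

The exact suite size depends on the algorithm. It can never drop below 5 × 4 = 20 here, because each browser/OS pair needs its own test. Dedicated pairwise tools use more refined heuristics than this brute-force sketch, which enumerates the full product and therefore only suits small parameter spaces.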
Test Independence
Who should test the software? The traditional model assumes "developers write code, QA tests it." This creates a dangerous bottleneck and a false sense of security.
Levels of Test Independence
| Level | Who Tests | Advantage | Risk |
|---|---|---|---|
| Self-testing | Developer tests their own code | Fast, deep knowledge of implementation | Author bias — hard to see own mistakes |
| Peer testing | Another developer on the team | Fresh eyes, shared context | Team groupthink |
| Dedicated QA | Independent QA engineer | Objective, specialised testing skills | Communication overhead, slower feedback |
| External testing | Third-party testing organisation | Complete independence, fresh perspective | Expensive, no domain context |
Economics of Testing
Testing is not free. Writing tests takes time. Maintaining tests takes time. Running tests takes compute resources. The question is not "should we test?" but "how much testing provides the best return on investment?"
The Cost of Bugs by Stage
| Detection Stage | Relative Cost | Time to Fix | Who Finds It |
|---|---|---|---|
| Requirements review | 1× | Minutes | Analyst/PM |
| Design review | 5× | Hours | Architect |
| Unit test | 10× | Minutes | Developer |
| Integration test | 20× | Hours | CI pipeline |
| System/E2E test | 50× | Hours to days | QA team |
| Production (customer report) | 100–1000× | Days to weeks | Customer/monitoring |
When to Stop Testing
You cannot test everything. The risk-based approach says: test the most critical and most likely-to-break areas first. Stop adding tests when the cost of writing the next test exceeds the expected cost of the bug it would catch.
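The stopping rule can be made explicit as a small calculation: the expected cost of a bug is its probability multiplied by its impact. The sketch below is a deliberately simplified model with a hypothetical function name and made-up figures; real estimates are rough and should be treated as orders of magnitude.

```javascript
// Hypothetical decision rule: write the test only if it costs less than
// the expected cost of the bug it would catch (probability x impact).
function isTestWorthWriting({ testCost, bugProbability, bugCost }) {
  const expectedBugCost = bugProbability * bugCost;
  return testCost < expectedBugCost;
}

// Checkout bug: 5% likely, $50,000 impact, so the expected cost is about $2,500.
console.log(isTestWorthWriting({ testCost: 400, bugProbability: 0.05, bugCost: 50000 })); // true

// Admin-tool glitch: 10% likely, $200 impact, so the expected cost is about $20.
console.log(isTestWorthWriting({ testCost: 400, bugProbability: 0.1, bugCost: 200 })); // false
```

The model ignores maintenance and run-time costs, which would be added to `testCost` in a fuller treatment.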
Coverage Is Not Quality
Code coverage (e.g., "we have 90% line coverage") measures how much code is exercised by tests — not how well the tests verify behaviour. You can achieve 100% coverage with tests that assert nothing:
// 100% coverage but 0% value: the test exercises code but verifies nothing
function testProcessPaymentWithoutAssertions() {
  processPayment({ amount: 100, card: '4111111111111111' });
  // No assertion! The test "passes" even if processPayment is broken
}
// Better: assert specific outcomes
function testProcessPaymentWithAssertions() {
  const result = processPayment({ amount: 100, card: '4111111111111111' });
  console.assert(result.status === 'approved', 'Payment should be approved');
  console.assert(result.chargedAmount === 100, 'Should charge exact amount');
}
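To see the difference in action, plant a bug and check which test notices. The sketch below assumes a hypothetical, deliberately broken `processPayment` that overcharges by one unit, and a small `passes` helper. The assertions throw instead of using `console.assert`, because `console.assert` only logs a message and would not fail a test runner.

```javascript
// Hypothetical, deliberately broken implementation: it charges 1 unit too much.
function processPayment({ amount, card }) {
  return { status: "approved", chargedAmount: amount + 1, cardLast4: card.slice(-4) };
}

// Runs a test function and reports whether it passed (did not throw).
function passes(testFn) {
  try {
    testFn();
    return true;
  } catch (err) {
    return false;
  }
}

// Test without assertions: executes every line of processPayment.
function testWithoutAssertions() {
  processPayment({ amount: 100, card: "4111111111111111" });
}

// Test with assertions: checks the observable outcome.
function testWithAssertions() {
  const result = processPayment({ amount: 100, card: "4111111111111111" });
  if (result.status !== "approved") throw new Error("Payment should be approved");
  if (result.chargedAmount !== 100) throw new Error("Should charge exact amount");
}

console.log(passes(testWithoutAssertions)); // true: the bug goes unnoticed
console.log(passes(testWithAssertions));    // false: the bug is caught
```

Both tests produce identical line coverage of `processPayment`, yet only one of them detects the overcharge.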
Testing Anti-Patterns
| Anti-Pattern | Symptom | Solution |
|---|---|---|
| Ice Cream Cone | Mostly manual/E2E tests, few unit tests | Invest in unit tests; automate bottom of pyramid first |
| No tests at all | "We'll add tests later" (never happens) | Write tests alongside code; enforce in PR reviews |
| Testing implementation details | Tests break when refactoring (even though behaviour unchanged) | Test behaviour (inputs→outputs), not internals (method calls) |
| 100% coverage religion | Pointless tests on trivial code to hit metrics | Focus on critical paths; use coverage to find gaps, not as a target |
| Flaky tests ignored | "That test always fails, just re-run" becomes team culture | Fix or delete flaky tests immediately; quarantine if needed |
| Manual regression suites | QA team manually runs 200-step test plan every sprint | Automate regression; humans should do exploratory testing |
| Testing only happy paths | All tests pass but production fails on edge cases | Test error cases, boundaries, and failure modes explicitly |
The Flaky Test Problem at Scale
Google reported in 2016 that almost 16% of its tests showed some level of flakiness: tests that sometimes pass and sometimes fail without any code change. At Google's scale (4.2 million tests), this meant roughly 672,000 flaky tests consuming engineering time. The team found that each flaky test wasted an average of 3.7 developer-hours per quarter in investigation and re-runs. Their solution: a dedicated infrastructure that monitors test reliability, automatically quarantines consistently flaky tests, and assigns them to owners for fixing. Tests that remain flaky for more than two weeks are automatically disabled and flagged as technical debt.
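The core idea behind flakiness monitoring, re-running a test against unchanged code and looking for mixed results, fits in a few lines. The sketch below is an illustration under simple assumptions and not a description of Google's infrastructure; `classifyTest` and the simulated test are hypothetical.

```javascript
// Hypothetical classifier: run a test several times against unchanged code.
function classifyTest(testFn, runs = 20) {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      testFn();
      passed++;
    } catch (err) {
      // A thrown error counts as a failed run.
    }
  }
  if (passed === runs) return "stable-pass";
  if (passed === 0) return "stable-fail";
  return "flaky"; // mixed results without any code change
}

// Simulated flaky test: fails on every third call (deterministic for the demo).
let calls = 0;
function sometimesFails() {
  calls++;
  if (calls % 3 === 0) throw new Error("simulated timeout");
}

console.log(classifyTest(() => {}));       // "stable-pass"
console.log(classifyTest(sometimesFails)); // "flaky"
```

A rarely failing test can still pass all 20 runs, so a "stable-pass" result is evidence rather than proof. Production systems typically rely on the history of real CI runs instead of dedicated re-runs.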
Exercises
Apply boundary value analysis to a function calculateShipping(weight, distance) with these rules: free shipping for orders over 5kg, a $5 flat rate for distances under 50km, $10 for 50-200km, and $20 for over 200km. Identify all boundary values and write the test assertions.
Conclusion & Next Steps
Testing fundamentals form the vocabulary and mental models you need for every testing-related decision in your career. The testing pyramid (or trophy) guides your test distribution. Verification vs validation ensures you ask both "does it work?" and "does it matter?" Black-box and white-box techniques give you systematic approaches to test design.
The critical takeaway: testing is economics. Every test has a cost (writing + maintaining + running) and a benefit (bugs caught, confidence gained, regression prevented). The art of testing is maximising the benefit-to-cost ratio, not maximising coverage numbers.
Next in the Series
In Part 19: Unit Testing & Test-Driven Development, we dive deep into the base of the pyramid — writing effective unit tests, the Red-Green-Refactor cycle, mocking strategies, and when TDD is the right approach versus when it creates overhead.