Table of Contents

  1. Understanding Build Bottlenecks
  2. Precompiled Headers
  3. Unity/Jumbo Builds
  4. Compiler Cache Integration
  5. Parallel Compilation
  6. Link-Time Optimization
  7. Reducing Header Dependencies
  8. Object Libraries
  9. Ninja as Preferred Generator
  10. Distributed Compilation
  11. Build Profiling
  12. C++20 Modules Impact
  13. Conclusion & Next Steps

Part 30: Optimizing Build Performance

June 4, 2026 Wasil Zafar 35 min read

Dramatically reduce build times using precompiled headers, unity builds, compiler caches, Ninja generator, link-time optimization, distributed compilation, and C++20 modules — with real-world benchmarks and CMake integration patterns.

Understanding Build Bottlenecks

Before optimizing, you must understand where your build spends time. A typical C++ build pipeline has three major phases — compilation, linking, and dependency resolution — each with distinct characteristics and optimization strategies. The CMake build system documentation describes how targets flow through these phases.

Build Pipeline Bottleneck Visualization
        flowchart LR
            A[Source Files] --> B[Preprocessing]
            B --> C[Compilation]
            C --> D[Assembly]
            D --> E[Object Files]
            E --> F[Linking]
            F --> G[Binary]

            style C fill:#BF092F,color:#fff
            style F fill:#16476A,color:#fff
            style B fill:#3B9797,color:#fff
    

In most projects, compilation (parsing headers, template instantiation, code generation) consumes 70–90% of total build time. Linking dominates for large monolithic binaries with heavy template use or LTO enabled. Dependency resolution (CMake configure step, package downloads) matters most on cold CI environments.

Measuring Where Time Goes

Before optimizing, measure your baseline. CMake 3.18+ supports profiling output:

# Generate profiling data during configure
cmake -B build -S . --profiling-output=cmake-profile.json --profiling-format=google-trace

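# Build with Ninja and print its internal timing counters
# (-d stats reports time spent in Ninja's own operations, not per-target compile times)
cmake --build build --parallel -- -d stats

# Time a full build with Make
time cmake --build build -- -j$(nproc)

# Ninja records per-step start/end times in build/.ninja_log after every build.
# Export a compilation database for other analysis tools:
ninja -C build -t compdb > compile_commands.json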
Key Insight: Always measure before and after each optimization. Build performance is highly project-specific — what helps a header-heavy project may not help a template-heavy one. Keep a build-time baseline in your CI pipeline to detect regressions.
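To turn that baseline into something concrete, here is a minimal shell sketch. It assumes a Ninja build whose log (build/.ninja_log) has the usual tab-separated columns: start time in ms, end time in ms, mtime, output path, command hash. The helper name slowest_steps is hypothetical, not part of Ninja or CMake.

```shell
# slowest_steps LOG [N]: print the N longest build steps recorded in a Ninja log.
# Output format: "<duration_ms><TAB><output path>", longest first.
slowest_steps() {
    log_file="$1"
    count="${2:-10}"
    # Skip the "# ninja log vN" header line; duration is end_ms minus start_ms.
    awk -F '\t' '!/^#/ { printf "%d\t%s\n", $2 - $1, $4 }' "$log_file" |
        sort -rn | head -n "$count"
}
```

A typical call would be slowest_steps build/.ninja_log 20. Ninja appends to the log across builds, so an output can appear more than once; running this right after a clean full build gives the most readable ranking.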

Precompiled Headers (PCH)

Precompiled headers serialize the compiler's internal representation of frequently-included headers into a binary file, eliminating redundant parsing across translation units. CMake 3.16+ provides native PCH support via target_precompile_headers().

PCH Dependency Graph — Without vs With PCH
        flowchart TD
            subgraph Without PCH
                A1[main.cpp] --> H1[vector]
                A1 --> H2[string]
                A1 --> H3[map]
                B1[utils.cpp] --> H1
                B1 --> H2
                B1 --> H4[algorithm]
                C1[engine.cpp] --> H1
                C1 --> H2
                C1 --> H3
            end

            subgraph With PCH
                PCH[pch.h.gch] --> H5[vector]
                PCH --> H6[string]
                PCH --> H7[map]
                PCH --> H8[algorithm]
                A2[main.cpp] --> PCH
                B2[utils.cpp] --> PCH
                C2[engine.cpp] --> PCH
            end
    
cmake_minimum_required(VERSION 3.16)
project(MyApp LANGUAGES CXX)

add_library(core
    src/engine.cpp
    src/utils.cpp
    src/renderer.cpp
    src/physics.cpp
)

# Precompile commonly-used standard library headers
target_precompile_headers(core PRIVATE
    <vector>
    <string>
    <unordered_map>
    <memory>
    <algorithm>
    <functional>
    <optional>
    <filesystem>
)

# PUBLIC headers propagate to dependents
target_precompile_headers(core PUBLIC
    <nlohmann/json.hpp>
)

Reusing PCH Across Targets

Multiple targets with similar header sets can share a single PCH using REUSE_FROM, avoiding redundant PCH compilation:

cmake_minimum_required(VERSION 3.16)
project(GameEngine LANGUAGES CXX)

# Primary library builds the PCH
add_library(engine src/engine.cpp src/renderer.cpp)
target_precompile_headers(engine PRIVATE
    <vector>
    <string>
    <memory>
    <unordered_map>
    <glm/glm.hpp>
)

# Secondary targets reuse the same PCH binary
add_library(physics src/physics.cpp src/collision.cpp)
target_precompile_headers(physics REUSE_FROM engine)

add_library(audio src/audio.cpp src/mixer.cpp)
target_precompile_headers(audio REUSE_FROM engine)

# Executable also reuses
add_executable(game src/main.cpp)
target_precompile_headers(game REUSE_FROM engine)
target_link_libraries(game PRIVATE engine physics audio)
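Warning: REUSE_FROM requires the reusing target to use the same compiler options, flags, and definitions as the target that owns the PCH. When they differ, the compiler may reject the PCH or quietly fall back to parsing the headers normally (GCC can report this with -Winvalid-pch), and the expected speedup disappears. Also keep project-specific headers out of the PCH: include only stable, rarely changed headers (STL, third-party libraries), because any change to a precompiled header forces every file that uses it to rebuild. Individual sources can opt out of the PCH with SKIP_PRECOMPILE_HEADERS: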
cmake_minimum_required(VERSION 3.16)
project(Selective LANGUAGES CXX)

add_library(mylib src/a.cpp src/b.cpp src/c.cpp)
target_precompile_headers(mylib PRIVATE <vector> <string>)

# Exclude specific files from using PCH
set_source_files_properties(src/c.cpp PROPERTIES
    SKIP_PRECOMPILE_HEADERS ON
)

Unity/Jumbo Builds

Unity builds concatenate multiple source files into a single translation unit, reducing repeated header parsing and enabling better cross-file optimization. CMake 3.16+ supports this natively via the UNITY_BUILD target property.

cmake_minimum_required(VERSION 3.16)
project(LargeProject LANGUAGES CXX)

add_library(core
    src/module_a.cpp
    src/module_b.cpp
    src/module_c.cpp
    src/module_d.cpp
    src/module_e.cpp
    src/module_f.cpp
    src/module_g.cpp
    src/module_h.cpp
)

# Enable unity build for this target
set_target_properties(core PROPERTIES
    UNITY_BUILD ON
    UNITY_BUILD_BATCH_SIZE 8  # Files per unity source (default: 8)
)

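# Or enable globally. These variables only initialize the properties of targets
# created after this point, so set them before add_library()/add_executable()
# (or pass -DCMAKE_UNITY_BUILD=ON at configure time).
set(CMAKE_UNITY_BUILD ON)
set(CMAKE_UNITY_BUILD_BATCH_SIZE 6)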

Unity Build Tradeoffs and Exclusions

Unity builds can cause issues with static variables, anonymous namespaces, and identically-named symbols across files. Exclude problematic sources:

cmake_minimum_required(VERSION 3.18)  # UNITY_BUILD_MODE and UNITY_GROUP need 3.18+
project(UnityExample LANGUAGES CXX)

add_library(renderer
    src/opengl_backend.cpp
    src/vulkan_backend.cpp
    src/shader_compiler.cpp
    src/mesh_loader.cpp
    src/texture_manager.cpp
)

set_target_properties(renderer PROPERTIES UNITY_BUILD ON)

# Exclude files with conflicting static symbols
set_source_files_properties(
    src/opengl_backend.cpp
    src/vulkan_backend.cpp
    PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON
)

# Group related files into the same unity source.
# UNITY_GROUP only takes effect when UNITY_BUILD_MODE is GROUP; in that mode,
# sources without a group (here texture_manager.cpp) are compiled on their own.
set_target_properties(renderer PROPERTIES UNITY_BUILD_MODE GROUP)
set_source_files_properties(
    src/shader_compiler.cpp
    src/mesh_loader.cpp
    PROPERTIES UNITY_GROUP "assets"
)
Key Insight: Unity builds provide 30–70% speedup on full rebuilds but slow down incremental builds (changing one file recompiles the entire batch). Use them in CI for clean builds and disable for local development: cmake -B build -DCMAKE_UNITY_BUILD=OFF.
Case Study Chromium Project
Unity Builds in Chromium

Chromium builds with GN rather than CMake, but its unity ("jumbo") build mode illustrates the same tradeoff. For its 35,000+ source files, the project reported 30–40% faster full builds on CI, using a batch size of 8 and an exclusion list of roughly 200 files with naming conflicts. Incremental developer builds stayed non-unity, switched by a GN variable; in CMake ports the equivalent switch is UNITY_BUILD.

Large-scale CI Optimization Clean Builds

Compiler Cache Integration

Compiler caches store compilation results keyed by preprocessed source content plus compiler flags. Cache hits skip compilation entirely, turning minutes into milliseconds. CMake integrates with caches via CMAKE_<LANG>_COMPILER_LAUNCHER.

ccache Hit/Miss Decision Flow
        flowchart TD
            A[Compile Request] --> B{Hash preprocessed source + flags}
            B --> C{Cache lookup}
            C -->|Hit| D[Return cached .o]
            C -->|Miss| E[Run compiler]
            E --> F[Store result in cache]
            F --> G[Return .o]
            D --> H[Done — milliseconds]
            G --> I[Done — full compile time]

            style D fill:#3B9797,color:#fff
            style H fill:#3B9797,color:#fff
            style E fill:#BF092F,color:#fff
            style I fill:#BF092F,color:#fff
    
cmake_minimum_required(VERSION 3.16)
project(CachedBuild LANGUAGES C CXX)

# Auto-detect ccache or sccache
find_program(CCACHE_PROGRAM ccache)
find_program(SCCACHE_PROGRAM sccache)

if(SCCACHE_PROGRAM)
    set(CMAKE_C_COMPILER_LAUNCHER "${SCCACHE_PROGRAM}")
    set(CMAKE_CXX_COMPILER_LAUNCHER "${SCCACHE_PROGRAM}")
    message(STATUS "Using sccache: ${SCCACHE_PROGRAM}")
elseif(CCACHE_PROGRAM)
    set(CMAKE_C_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
    set(CMAKE_CXX_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
    message(STATUS "Using ccache: ${CCACHE_PROGRAM}")
endif()

add_executable(app src/main.cpp src/engine.cpp)

sccache — Shared Compilation Cache

Mozilla's sccache supports cloud-backed storage (S3, GCS, Azure Blob), making it ideal for distributed teams sharing cache hits across CI and developer machines:

# Install sccache
cargo install sccache
# Or via package managers
brew install sccache       # macOS
choco install sccache      # Windows

# Configure S3 backend for team-wide sharing
export SCCACHE_BUCKET="my-team-build-cache"
export SCCACHE_REGION="us-east-1"
export SCCACHE_S3_USE_SSL=true

# Start the sccache server
sccache --start-server

# Configure CMake to use sccache
cmake -B build -S . \
    -DCMAKE_C_COMPILER_LAUNCHER=sccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=sccache

# Check cache statistics after a build
sccache --show-stats
# Optimize ccache hit rates
# Set ccache to ignore __DATE__ and __TIME__ macros
export CCACHE_SLOPPINESS="time_macros,include_file_mtime,file_stat_matches"

# Increase cache size for large projects
ccache --max-size=20G

# Enable compression to fit more entries
export CCACHE_COMPRESS=1
export CCACHE_COMPRESSLEVEL=6

# Share hits across checkouts in different directories: absolute paths under
# CCACHE_BASEDIR are rewritten to relative paths before hashing
export CCACHE_BASEDIR="${HOME}/projects"

# View hit/miss statistics
ccache --show-stats
ccache --zero-stats  # Reset counters
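Key Insight: The CCACHE_SLOPPINESS setting can make a large difference to hit rates. Without time_macros, ccache treats sources that mention __DATE__ or __TIME__ conservatively: a file that expands __TIME__ preprocesses differently on every compile, so it effectively never hits, and one that expands __DATE__ stops hitting when the date changes. include_file_mtime turns off the check that refuses to cache a result when an included header was modified too recently for its timestamp to be trusted. Each option trades a little safety for more hits, so enable only what your code needs. Hit rates of 85–95% on incremental developer builds are a reasonable target.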

Parallel Compilation

Modern hardware has many cores — using them all dramatically reduces wall-clock build time. CMake provides multiple mechanisms for parallel compilation depending on the generator and platform.

cmake_minimum_required(VERSION 3.12)
project(ParallelBuild LANGUAGES CXX)

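# Detect available processors at configure time
include(ProcessorCount)
ProcessorCount(NPROC)  # NPROC is 0 when the count cannot be determined
if(NOT NPROC EQUAL 0)
    message(STATUS "Detected ${NPROC} processors")
endif()

add_executable(app src/main.cpp src/module_a.cpp src/module_b.cpp)

# MSVC parallel compilation within a single target
# (most useful with Visual Studio generators; Ninja already runs one compiler per source)
if(MSVC)
    if(NPROC EQUAL 0)
        target_compile_options(app PRIVATE /MP)          # let cl.exe choose the process count
    else()
        target_compile_options(app PRIVATE /MP${NPROC})
    endif()
endif()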
# Method 1: CMAKE_BUILD_PARALLEL_LEVEL environment variable
export CMAKE_BUILD_PARALLEL_LEVEL=16
cmake --build build

# Method 2: --parallel flag (CMake 3.12+)
cmake --build build --parallel 16

# Method 3: Pass directly to underlying build tool
cmake --build build -- -j16          # Make/Ninja
cmake --build build -- /maxcpucount:16  # MSBuild

# Method 4: Ninja automatic detection (uses all cores by default)
cmake -G Ninja -B build -S .
cmake --build build  # Ninja auto-detects core count

# Keep one core free for system responsiveness
cmake --build build --parallel $(($(nproc) - 1))
Warning: Excessive parallelism can cause out-of-memory kills, especially with template-heavy C++ code where each compiler instance uses 1–4 GB RAM. Ninja's -l option (for example, ninja -j$(nproc) -l$(nproc)) holds back new jobs while the system load average is above the limit, but it does not track memory, so watch memory usage and cap -j as well. For 16 GB RAM machines, -j4 to -j8 is often optimal despite having 16 cores.
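The memory rule of thumb can be written down as a small calculation. The sketch below assumes you supply the core count, total RAM in GB, and an estimated per-compiler footprint in GB yourself (all positive integers); safe_jobs is a hypothetical helper name, not a CMake or Ninja feature.

```shell
# safe_jobs CORES MEM_GB PER_JOB_GB
# Prints the smaller of the core count and the number of compiler
# processes that fit in memory, never less than 1.
safe_jobs() {
    cores="$1"
    mem_gb="$2"
    per_job_gb="$3"
    by_mem=$(( mem_gb / per_job_gb ))   # integer division rounds down
    if [ "$by_mem" -lt "$cores" ]; then
        n="$by_mem"
    else
        n="$cores"
    fi
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}
```

For a 16-core machine with 16 GB RAM and roughly 2 GB per compiler process, safe_jobs 16 16 2 prints 8, which could then be passed along as cmake --build build --parallel "$(safe_jobs 16 16 2)".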

Link-Time Optimization (LTO)

LTO allows the compiler to optimize across translation unit boundaries during linking, producing smaller and faster binaries. However, it significantly increases link time and memory usage. CMake provides native LTO support via the INTERPROCEDURAL_OPTIMIZATION property.

CMake LTO Integration

cmake_minimum_required(VERSION 3.9)
project(LTOProject LANGUAGES CXX)

# Check if LTO is supported by the compiler
include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_supported OUTPUT ipo_error)

add_executable(app src/main.cpp src/engine.cpp src/utils.cpp)

if(ipo_supported)
    # Enable LTO for Release builds only
    set_target_properties(app PROPERTIES
        INTERPROCEDURAL_OPTIMIZATION_RELEASE ON
    )
    message(STATUS "LTO enabled for Release builds")
else()
    message(WARNING "LTO not supported: ${ipo_error}")
endif()

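# Or enable globally for Release. This variable only initializes the property of
# targets created afterwards, so set it before add_executable()/add_library().
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE ON)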
cmake_minimum_required(VERSION 3.9)
project(ThinLTO LANGUAGES CXX)

add_executable(app src/main.cpp src/module_a.cpp src/module_b.cpp)

# ThinLTO (Clang): parallel, lower memory, nearly the same optimization as full LTO.
# GCC has no ThinLTO; -flto=auto parallelizes its LTO step across available cores instead.
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    target_compile_options(app PRIVATE -flto=thin)
    target_link_options(app PRIVATE -flto=thin)
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    target_compile_options(app PRIVATE -flto=auto -ffat-lto-objects)
    target_link_options(app PRIVATE -flto=auto)
endif()

# Parallel ThinLTO linking (Clang). --thinlto-jobs is an LLD option, so this
# assumes the link step uses LLD (for example via -fuse-ld=lld).
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    include(ProcessorCount)
    ProcessorCount(NPROC)  # 0 when the count cannot be determined
    if(NOT NPROC EQUAL 0)
        target_link_options(app PRIVATE
            "LINKER:--thinlto-jobs=${NPROC}"
        )
    endif()
endif()
Case Study LTO Performance Tradeoffs
Thin LTO vs Full LTO — Firefox Build

Mozilla's Firefox project benchmarks show: Full LTO produces 10–15% faster runtime code but increases link time from 2 minutes to 45 minutes with 32 GB peak memory. Thin LTO achieves 8–12% improvement with only 8 minutes link time and 12 GB peak memory. The compromise: use Thin LTO in CI, Full LTO for release builds only.

Thin LTO Link Time Memory Usage

Reducing Header Dependencies

The single most impactful long-term strategy for build performance is reducing the transitive header inclusion graph. Every unnecessary #include multiplies parsing work across all translation units that include that header.

// BAD: widget.h pulls in entire engine dependency tree
// widget.h
#include "engine.h"      // 50,000 lines of transitive includes
#include "renderer.h"    // 30,000 lines
#include <vector>
#include <string>

class Widget {
    Engine* engine_;
    Renderer* renderer_;
    std::vector<std::string> labels_;
public:
    void render();
};
// GOOD: Forward declarations minimize header dependencies
// widget.h
#include <vector>
#include <string>

// Forward declarations — no #include needed for pointer/reference types
class Engine;
class Renderer;

class Widget {
    Engine* engine_;
    Renderer* renderer_;
    std::vector<std::string> labels_;
public:
    void render();
};

// widget.cpp — includes only here, in the translation unit
#include "widget.h"
#include "engine.h"
#include "renderer.h"

void Widget::render() {
    engine_->beginFrame();
    renderer_->draw(labels_);
}
// PIMPL (Pointer to Implementation) — complete header isolation
// database.h — stable ABI, minimal includes
#include <memory>
#include <string>

class Database {
public:
    Database(const std::string& connection_string);
    ~Database();
    Database(Database&&) noexcept;
    Database& operator=(Database&&) noexcept;

    bool execute(const std::string& query);
    int rowCount() const;

private:
    struct Impl;
    std::unique_ptr<Impl> impl_;
};

// database.cpp — heavy includes only here
#include "database.h"
#include <pqxx/pqxx>          // PostgreSQL — only compiled once
#include <spdlog/spdlog.h>    // Logging library
#include <nlohmann/json.hpp>  // JSON parsing

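struct Database::Impl {
    pqxx::connection conn;
    spdlog::logger logger;
    int last_row_count = 0;

    // Members are initialized explicitly: the connection opens from the
    // connection string, and spdlog::logger has no default constructor.
    explicit Impl(const std::string& cs) : conn(cs.c_str()), logger("database") {}
};

Database::Database(const std::string& cs) : impl_(std::make_unique<Impl>(cs)) {}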
Database::~Database() = default;
Database::Database(Database&&) noexcept = default;
Database& Database::operator=(Database&&) noexcept = default;
cmake_minimum_required(VERSION 3.16)
project(IWYU_Integration LANGUAGES CXX)

# Include-What-You-Use integration
find_program(IWYU_PROGRAM NAMES include-what-you-use iwyu)

add_executable(app src/main.cpp src/widget.cpp src/database.cpp)

if(IWYU_PROGRAM)
    set_target_properties(app PROPERTIES
        CXX_INCLUDE_WHAT_YOU_USE "${IWYU_PROGRAM};-Xiwyu;--mapping_file=${CMAKE_SOURCE_DIR}/iwyu.imp"
    )
    message(STATUS "IWYU enabled: ${IWYU_PROGRAM}")
endif()
Key Insight: A single heavy header included transitively in 200 translation units means the compiler parses it 200 times (without PCH). Moving from #include "heavy.h" to a forward declaration in a widely-included header can save minutes on full rebuilds. Run include-what-you-use periodically to identify unnecessary includes.

Object Libraries for Faster Iteration

CMake OBJECT libraries produce compiled object files without creating an archive or shared library. This skips the archive or link step for intermediate build products, which speeds up incremental development cycles.

cmake_minimum_required(VERSION 3.12)
project(ObjectLibs LANGUAGES CXX)

# Object library — compiles but doesn't link
add_library(core_objects OBJECT
    src/engine.cpp
    src/renderer.cpp
    src/physics.cpp
    src/audio.cpp
)
target_include_directories(core_objects PUBLIC include)
target_compile_features(core_objects PUBLIC cxx_std_17)

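# Multiple final targets share the same object files
# No re-compilation, just different linking
# Note: $<TARGET_OBJECTS:...> adds only the object files; usage requirements such as
# the PUBLIC include directory are not propagated (target_link_libraries below does both)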
add_executable(game
    src/main.cpp
    $<TARGET_OBJECTS:core_objects>
)

add_executable(editor
    src/editor_main.cpp
    $<TARGET_OBJECTS:core_objects>
)

# Test executable reuses objects without rebuilding
add_executable(tests
    tests/test_engine.cpp
    tests/test_physics.cpp
    $<TARGET_OBJECTS:core_objects>
)

# Modern CMake: link against OBJECT library directly (3.12+)
add_executable(benchmark src/benchmark.cpp)
target_link_libraries(benchmark PRIVATE core_objects)
Key Insight: Object libraries shine when multiple executables share the same source files. Without them, each executable either re-compiles the sources (slow) or links against a static library (archiving overhead). Object libraries eliminate both costs — object files are compiled once and directly consumed by multiple link steps.

Ninja as the Preferred Generator

Ninja consistently outperforms Make for C++ projects due to its minimal overhead, superior dependency tracking, and optimized scheduling. It was specifically designed for large codebases where Make's startup time and serialized recipe evaluation become bottlenecks.

# Generate Ninja build files (single-config)
cmake -G Ninja -B build -S . -DCMAKE_BUILD_TYPE=Release

# Generate Ninja Multi-Config (build all configurations from one tree)
cmake -G "Ninja Multi-Config" -B build -S .
cmake --build build --config Release
cmake --build build --config Debug

# Install Ninja
pip install ninja          # Python/pip (cross-platform)
brew install ninja         # macOS
apt install ninja-build    # Ubuntu/Debian
choco install ninja        # Windows
cmake_minimum_required(VERSION 3.17)
project(NinjaMultiConfig LANGUAGES CXX)

# Ninja Multi-Config: specify all desired configurations
set(CMAKE_CONFIGURATION_TYPES "Debug;Release;RelWithDebInfo" CACHE STRING "" FORCE)

# Per-config output directories (avoid collisions)
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_DEBUG ${CMAKE_BINARY_DIR}/Debug)
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_RELEASE ${CMAKE_BINARY_DIR}/Release)

add_executable(app src/main.cpp src/engine.cpp)

# $<TARGET_FILE:app> resolves to the app binary of whichever configuration is being built
add_custom_target(check
    COMMAND $<TARGET_FILE:app> --self-test
    DEPENDS app
)
Case Study Ninja vs Make Benchmark
LLVM Build — Ninja vs Unix Makefiles

Building LLVM (~3,500 targets) on a 32-core machine: Ninja completes no-op builds in 0.4 seconds vs Make's 12 seconds. Full parallel builds complete 5–8% faster with Ninja due to better job scheduling and reduced fork overhead. The difference is more dramatic on Windows where process creation is expensive — Ninja avoids spawning shells for each recipe.

LLVM No-op Speed Scheduling

Distributed Compilation

For very large projects, distributing compilation across multiple machines can approach linear speedup of the compile phase when the build has many independent translation units. Tools like distcc, icecream, and Incredibuild integrate transparently with CMake via the compiler launcher mechanism.

cmake_minimum_required(VERSION 3.16)
project(Distributed LANGUAGES CXX)

# Use one distributed launcher only; a second set() would overwrite the first.
# distcc: distribute compilation across network hosts
find_program(DISTCC_PROGRAM distcc)
# icecream (icecc): automatic load balancing
find_program(ICECC_PROGRAM icecc)

if(DISTCC_PROGRAM)
    set(CMAKE_C_COMPILER_LAUNCHER "${DISTCC_PROGRAM}")
    set(CMAKE_CXX_COMPILER_LAUNCHER "${DISTCC_PROGRAM}")
    message(STATUS "Using distcc: ${DISTCC_PROGRAM}")
elseif(ICECC_PROGRAM)
    set(CMAKE_C_COMPILER_LAUNCHER "${ICECC_PROGRAM}")
    set(CMAKE_CXX_COMPILER_LAUNCHER "${ICECC_PROGRAM}")
    message(STATUS "Using icecc: ${ICECC_PROGRAM}")
endif()

add_executable(app src/main.cpp src/engine.cpp src/renderer.cpp)
# distcc setup — configure available build hosts
export DISTCC_HOSTS="localhost/8 build-server-1/16 build-server-2/16"

# Verify host connectivity
distcc --show-hosts

# Set parallelism to match total distributed core count
# Local(8) + Server1(16) + Server2(16) = 40 jobs
cmake --build build --parallel 40

# icecream setup — automatic scheduler-based distribution
# Start the scheduler on one machine
icecc-scheduler &

# Start the daemon on each build node (-m/--max-processes caps concurrent jobs)
iceccd --nice 5 --max-processes 16 &

# Monitor distributed builds
icemon  # GUI monitor showing job distribution

# Combine with ccache: ccache wraps distcc
export CCACHE_PREFIX="distcc"
cmake -B build -DCMAKE_CXX_COMPILER_LAUNCHER="ccache"
Warning: Distributed compilation only accelerates the compilation phase — linking is always local. Projects with few large translation units benefit less than projects with many small ones. Also, PCH files cannot be distributed (they're compiler-specific binary blobs), so consider the interaction between PCH and distcc carefully.
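The job count used with distcc (40 in the setup above) can be derived from the host list instead of added up by hand. This is a sketch that assumes every entry in the list carries an explicit /LIMIT suffix, optionally followed by comma-separated options; entries without a limit are ignored. distcc_slots is a hypothetical helper name, not a distcc command.

```shell
# distcc_slots "HOST/LIMIT[,OPTS] HOST/LIMIT ..."
# Sums the per-host job limits from a DISTCC_HOSTS-style string.
distcc_slots() {
    total=0
    for host in $1; do                 # unquoted on purpose: split on whitespace
        case "$host" in
            */*)
                limit="${host#*/}"     # drop the "HOST/" part
                limit="${limit%%,*}"   # drop trailing ",option" parts
                total=$(( total + limit ))
                ;;
        esac
    done
    echo "$total"
}
```

With the host list from the example, distcc_slots "$DISTCC_HOSTS" prints 40, so the build line could become cmake --build build --parallel "$(distcc_slots "$DISTCC_HOSTS")".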

Build Profiling

To identify the specific files and operations consuming the most build time, use compiler-specific profiling features and CMake's own profiling output.

cmake_minimum_required(VERSION 3.18)
project(BuildProfiling LANGUAGES CXX)

add_executable(app src/main.cpp src/engine.cpp src/heavy_templates.cpp)

# Clang: -ftime-trace generates per-file Chrome trace JSON
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    target_compile_options(app PRIVATE -ftime-trace)
endif()

# MSVC: /d1reportTime shows per-header parsing time
if(MSVC)
    target_compile_options(app PRIVATE /d1reportTime)
endif()

# GCC: -ftime-report prints per-pass timing
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    target_compile_options(app PRIVATE -ftime-report)
endif()
# CMake configure-step profiling (3.18+)
cmake -B build -S . \
    --profiling-output=cmake-profile.json \
    --profiling-format=google-trace

# Open cmake-profile.json in Chrome's trace viewer:
# chrome://tracing or https://ui.perfetto.dev

# Clang -ftime-trace: generates .json next to each .o file
# View in Chrome tracing or ClangBuildAnalyzer
cmake --build build --parallel

# Analyze all trace files with ClangBuildAnalyzer
ClangBuildAnalyzer --all build/ capture.bin
ClangBuildAnalyzer --analyze capture.bin

# Ninja build graph visualization
ninja -C build -t graph | dot -Tsvg > build-graph.svg

# Ninja: list targets up to three levels deep in the dependency graph
# (this is a listing, not a critical-path analysis)
ninja -C build -t targets depth 3
Key Insight: Clang's -ftime-trace is the most actionable profiling tool — it shows exactly which headers, template instantiations, and code generation phases consume time per file. Run ClangBuildAnalyzer to aggregate across all translation units and identify the "most expensive headers" and "slowest template instantiations" project-wide.
Case Study Build Profiling in Practice
Identifying a 40-Second Header

Using -ftime-trace on a game engine project revealed that boost/spirit.hpp was transitively included in 180 translation units, adding 40 seconds per file of template instantiation. Moving the Spirit-dependent parser into a single .cpp file and exposing only a simple interface reduced total build time from 22 minutes to 8 minutes — a 63% improvement from one header refactoring.

-ftime-trace Template Bloat Header Isolation

C++20 Modules Impact on Build Times

C++20 modules fundamentally change the compilation model by replacing textual inclusion with pre-compiled binary module interfaces (BMIs). This eliminates redundant parsing — each module is compiled once into a BMI that dependents consume directly, similar to PCH but with proper encapsulation and dependency tracking.

// math_utils.cppm: module interface unit
module;  // global module fragment: textual #includes go here

// Standard headers are included rather than imported as header units
// (import <vector>;), because CMake's module support does not build header units.
#include <algorithm>
#include <cmath>
#include <vector>

export module math_utils;

export namespace math {
    double magnitude(const std::vector<double>& vec) {
        double sum = 0.0;
        for (auto v : vec) sum += v * v;
        return std::sqrt(sum);
    }

    // Assumes a non-zero vector; a zero magnitude would divide by zero.
    std::vector<double> normalize(std::vector<double> vec) {
        double mag = magnitude(vec);
        std::for_each(vec.begin(), vec.end(), [mag](double& v){ v /= mag; });
        return vec;
    }
}
// main.cpp: consumer imports the module interface
#include <iostream>
#include <vector>

import math_utils;   // Binary import: the module's implementation headers (cmath, algorithm) are not re-parsed here

int main() {
    std::vector<double> v = {3.0, 4.0};
    auto n = math::normalize(v);
    std::cout << "Magnitude: " << math::magnitude(v) << "\n";
    return 0;
}
cmake_minimum_required(VERSION 3.28)
project(ModulesExample LANGUAGES CXX)

# C++20 named modules are supported natively from CMake 3.28 (Ninja and
# Visual Studio 2022 generators); no CMAKE_EXPERIMENTAL_* settings are needed
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_library(math_module)
target_sources(math_module
    PUBLIC
        FILE_SET CXX_MODULES FILES
            src/math_utils.cppm
)

add_executable(app src/main.cpp)
target_link_libraries(app PRIVATE math_module)
Key Insight: C++20 modules provide the build-time benefits of PCH with proper modularity. Unlike headers parsed textually N times, a module interface is compiled to a BMI once and imported as binary data. Early benchmarks have reported 40–80% reductions in full build time for header-heavy projects, though results vary widely by codebase. Module support in CMake is still maturing: production adoption requires CMake 3.28+ with Ninja 1.11+ (or the Visual Studio 2022 generator) and a recent compiler (Clang 16+, MSVC 14.34+, or GCC 14+).
Warning: C++20 modules introduce build-order dependencies: a module interface must be compiled before its importers. This can reduce parallelism compared to traditional headers, where all translation units can compile simultaneously. Ninja handles the ordering via dynamic dependencies (dyndep); Makefile generators do not support CMake's module scanning at all. Use Ninja (or Visual Studio 2022) when building with modules.

Conclusion & Next Steps

Build performance optimization is not a single switch — it's a layered strategy combining multiple techniques. Here's a recommended adoption order based on effort-to-impact ratio:

  1. Switch to Ninja — Zero-effort, immediate 5–15% improvement on incremental builds
  2. Enable ccache/sccache — One-line CMake change, massive improvement on repeated builds
  3. Add precompiled headers — Moderate effort, 20–50% faster compilation
  4. Enable unity builds for CI — Low effort, 30–70% faster clean builds
  5. Reduce header dependencies — High effort, highest long-term payoff (architectural change)
  6. Profile with -ftime-trace — Identifies specific bottlenecks unique to your project
  7. Consider C++20 modules — Future-facing, best benefit for new code
  8. Distributed compilation — Infrastructure investment, beneficial for 100k+ LOC projects
Key Insight: The combination of Ninja + ccache + PCH delivers the best cost-to-benefit ratio for most projects. Together they can reduce a 10-minute build to 1–2 minutes with minimal code changes. Add unity builds for CI and header refactoring for long-term gains. Profile first, optimize second.
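To keep those gains from eroding, a CI job can compare each build time against a stored baseline. The sketch below assumes both times are whole seconds and the tolerance is a whole percentage; build_regressed is a hypothetical helper name, and how the baseline is stored is left to your CI system.

```shell
# build_regressed BASELINE_S CURRENT_S TOLERANCE_PCT
# Exit status 0 (true) when CURRENT_S exceeds BASELINE_S by more than
# TOLERANCE_PCT percent, non-zero otherwise.
build_regressed() {
    baseline="$1"
    current="$2"
    tolerance="$3"
    limit=$(( baseline + baseline * tolerance / 100 ))
    [ "$current" -gt "$limit" ]
}
```

With a 600-second baseline and 10% tolerance the limit is 660 seconds, so build_regressed 600 700 10 succeeds (a regression) while build_regressed 600 650 10 does not. In a pipeline, the current time could come from wrapping cmake --build build between two date +%s calls.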

Next in the Series

In Part 31: Apple Platform Development, we'll explore building for macOS, iOS, tvOS, and watchOS with CMake — including Universal Binaries, code signing, framework bundles, and Xcode project generation.