Table of Contents

  1. C/C++ Extension Overview
  2. pybind11 Integration
  3. scikit-build-core
  4. Python CFFI with CMake
  5. Cython Integration
  6. Boost.Python Legacy Projects
  7. PyPI Distribution
  8. Conda Package Distribution
  9. Hybrid Project Structure
  10. Testing Native Extensions
  11. Versioning and Metadata
  12. Performance Profiling
  13. Conclusion & Next Steps

Part 32: Distributing Python Extensions

June 4, 2026 Wasil Zafar 35 min read

Master the complete pipeline for building, packaging, and distributing C/C++ Python extensions using CMake — from pybind11 and scikit-build-core to cibuildwheel and conda-forge, covering every step from source code to installable package.

C/C++ Python Extension Overview

Python's popularity in scientific computing, machine learning, and data analysis owes much to its ability to seamlessly integrate high-performance C and C++ code. Native extensions allow developers to write performance-critical algorithms in compiled languages while exposing them through Python's familiar interface. CMake serves as the ideal build system for these extensions because it handles cross-platform compilation, dependency management, and integration with Python's packaging ecosystem.

There are compelling reasons to build native Python extensions: numerical kernels that can run 100–1000× faster than equivalent pure-Python loops, wrapping existing C/C++ libraries for Python consumption, interfacing with hardware or system APIs not accessible from Python, and implementing algorithms that require precise memory control. The CPython interpreter provides a C API that extension modules use to create Python objects, manage reference counts, and interact with the garbage collector.

// example_module.c — Minimal CPython extension module
#define PY_SSIZE_T_CLEAN
#include <Python.h>

// The actual function implementation
static PyObject* example_add(PyObject* self, PyObject* args) {
    double a, b;
    if (!PyArg_ParseTuple(args, "dd", &a, &b))
        return NULL;
    return PyFloat_FromDouble(a + b);
}

// Method table
static PyMethodDef ExampleMethods[] = {
    {"add", example_add, METH_VARARGS, "Add two numbers."},
    {NULL, NULL, 0, NULL}  // Sentinel
};

// Module definition
static struct PyModuleDef examplemodule = {
    PyModuleDef_HEAD_INIT,
    "example",      // module name
    NULL,           // docstring
    -1,             // per-interpreter state size (-1 = global)
    ExampleMethods
};

// Module initialization function
PyMODINIT_FUNC PyInit_example(void) {
    return PyModule_Create(&examplemodule);
}

CPython API and Extension Module Structure

Every CPython extension module follows a standard pattern: define functions that accept and return PyObject* pointers, register them in a method table, and create a module definition structure. The module initialization function — named PyInit_<modulename> — is called by the interpreter when import modulename executes. Building this with CMake requires finding the Python development headers and libraries:

cmake_minimum_required(VERSION 3.25)
project(example_extension LANGUAGES C)

# Find Python interpreter and development files
find_package(Python3 REQUIRED COMPONENTS Interpreter Development.Module)

# Build the extension module. Python3_add_library() already drops the "lib"
# prefix and picks the platform suffix (.so or .pyd), so no manual
# set_target_properties() call is needed. WITH_SOABI adds the ABI tag.
Python3_add_library(example MODULE WITH_SOABI example_module.c)

# Install to the Python site-packages
install(TARGETS example
    LIBRARY DESTINATION ${Python3_SITEARCH}
)
Key Insight: Python3_add_library(), provided by FindPython3 since CMake 3.12, removes the lib prefix, applies the platform's module suffix (.so or .pyd), and links the correct Python target for extension modules. Pass WITH_SOABI (CMake 3.17+) to get the ABI-tagged name, for example .cpython-312-x86_64-linux-gnu.so. Prefer it over a manual add_library(MODULE) when targeting Python specifically.
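
The ABI-tagged suffix mentioned above is defined by the interpreter, not by CMake. As a minimal sketch using only the standard library, the snippet below asks the running interpreter which file name it expects for a module called example; the helper name extension_filename is an assumption made for illustration.

```python
import importlib.machinery
import sysconfig


def extension_filename(module_name: str) -> str:
    """Return the preferred file name for a compiled module on this interpreter."""
    # EXT_SUFFIX looks like ".cpython-312-x86_64-linux-gnu.so" or ".cp312-win_amd64.pyd".
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    if not suffix:
        # Fall back to the first suffix the import system accepts.
        suffix = importlib.machinery.EXTENSION_SUFFIXES[0]
    return module_name + suffix


if __name__ == "__main__":
    print(extension_filename("example"))
    # Every suffix the import machinery accepts for extension modules.
    print(importlib.machinery.EXTENSION_SUFFIXES)
```

Comparing this output with the file CMake produced is a quick way to diagnose an "ImportError: No module named example" after an otherwise successful build.
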
Extension Module Build Pipeline
        flowchart LR
            A[C/C++ Source] --> B[CMake Configure]
            B --> C[Compile .o/.obj]
            C --> D[Link .so/.pyd]
            D --> E[Package into Wheel]
            E --> F[Upload to PyPI]
            F --> G[pip install]
            
            B --> H[Find Python Headers]
            B --> I[Find pybind11/CFFI]
            
            style A fill:#132440,color:#fff
            style D fill:#3B9797,color:#fff
            style E fill:#16476A,color:#fff
            style F fill:#BF092F,color:#fff
            style G fill:#3B9797,color:#fff
    

pybind11 Integration with CMake

pybind11 is the modern standard for creating Python bindings for C++ code. It provides a header-only library that uses C++ template metaprogramming to automatically generate the boilerplate required by the CPython API. pybind11 integrates deeply with CMake, providing the pybind11_add_module() function that handles all the platform-specific details of building a Python extension module.

cmake_minimum_required(VERSION 3.25)
project(mylib LANGUAGES CXX)

# Option 1: FetchContent (recommended for reproducibility)
include(FetchContent)
FetchContent_Declare(
    pybind11
    GIT_REPOSITORY https://github.com/pybind/pybind11.git
    GIT_TAG        v2.13.6
)
FetchContent_MakeAvailable(pybind11)

# Option 2: find_package (if installed via pip install pybind11)
# find_package(pybind11 REQUIRED)

# Create the extension module
pybind11_add_module(mylib
    src/bindings.cpp
    src/core_algorithm.cpp
)

# Set C++ standard
target_compile_features(mylib PRIVATE cxx_std_17)

# Add include directories for the C++ library
target_include_directories(mylib PRIVATE include)

# Install the module
install(TARGETS mylib DESTINATION .)

The corresponding C++ binding code uses pybind11's expressive API to wrap classes, functions, and even NumPy arrays:

// src/bindings.cpp — pybind11 bindings for a matrix library
#include <algorithm>   // std::copy
#include <stdexcept>   // std::runtime_error
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>
#include "core_algorithm.h"

namespace py = pybind11;

// Wrap a C++ class
class Matrix {
public:
    Matrix(int rows, int cols) : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}
    
    double get(int r, int c) const { return data_[r * cols_ + c]; }
    void set(int r, int c, double v) { data_[r * cols_ + c] = v; }
    int rows() const { return rows_; }
    int cols() const { return cols_; }
    
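    // Accept a NumPy array directly; c_style | forcecast makes pybind11 hand us
    // a C-contiguous double buffer (converting or copying the input if needed),
    // which the flat std::copy below relies on
    static Matrix from_numpy(
        py::array_t<double, py::array::c_style | py::array::forcecast> arr) {
        auto buf = arr.request();
        if (buf.ndim != 2) throw std::runtime_error("Expected 2D array");
        Matrix m(static_cast<int>(buf.shape[0]), static_cast<int>(buf.shape[1]));
        auto ptr = static_cast<const double*>(buf.ptr);
        std::copy(ptr, ptr + m.data_.size(), m.data_.begin());
        return m;
    }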

private:
    int rows_, cols_;
    std::vector<double> data_;
};

PYBIND11_MODULE(mylib, m) {
    m.doc() = "High-performance matrix library";
    
    py::class_<Matrix>(m, "Matrix")
        .def(py::init<int, int>(), py::arg("rows"), py::arg("cols"))
        .def("get", &Matrix::get)
        .def("set", &Matrix::set)
        .def_property_readonly("rows", &Matrix::rows)
        .def_property_readonly("cols", &Matrix::cols)
        .def_static("from_numpy", &Matrix::from_numpy,
                    py::arg("array"), "Create from NumPy array");
    
    // Wrap a free function
    m.def("fast_multiply", &core::fast_multiply,
          py::arg("a"), py::arg("b"),
          "Multiply two matrices using SIMD-optimized kernel");
}
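
The Matrix class above stores its elements in one flat vector and addresses element (r, c) at r * cols + c. As a rough illustration of that row-major layout, here is a pure-Python model; the class name RowMajorMatrix and the from_rows helper are assumptions for this sketch, and unlike the C++ class it adds bounds checking.

```python
from typing import List, Sequence


class RowMajorMatrix:
    """Pure-Python model of a dense matrix stored as one flat, row-major list."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self.data: List[float] = [0.0] * (rows * cols)

    def _offset(self, r: int, c: int) -> int:
        # Same formula as the C++ class: element (r, c) lives at r * cols + c.
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            raise IndexError(f"index ({r}, {c}) out of range")
        return r * self.cols + c

    def get(self, r: int, c: int) -> float:
        return self.data[self._offset(r, c)]

    def set(self, r: int, c: int, value: float) -> None:
        self.data[self._offset(r, c)] = float(value)

    @classmethod
    def from_rows(cls, rows: Sequence[Sequence[float]]) -> "RowMajorMatrix":
        # Flattening row by row mirrors copying a C-contiguous 2D buffer.
        if not rows or any(len(row) != len(rows[0]) for row in rows):
            raise ValueError("Expected a non-empty rectangular 2D sequence")
        m = cls(len(rows), len(rows[0]))
        m.data = [float(v) for row in rows for v in row]
        return m


m = RowMajorMatrix.from_rows([[1, 2, 3], [4, 5, 6]])
print(m.get(1, 0))  # 4.0, stored at flat offset 1 * 3 + 0 = 3
```

A model like this is also handy as an oracle in tests: the compiled Matrix and the Python model should agree on every get and set.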

FetchContent vs find_package Strategies

The choice between FetchContent and find_package for pybind11 depends on your distribution strategy. FetchContent pins an exact version and downloads it during configuration — ideal for reproducible builds and CI. Using find_package is faster for development since it reuses a pre-installed pybind11, but introduces version uncertainty.

cmake_minimum_required(VERSION 3.25)
project(mylib LANGUAGES CXX)

# Hybrid approach: try find_package first, fall back to FetchContent
find_package(pybind11 2.13 QUIET)

if(NOT pybind11_FOUND)
    message(STATUS "pybind11 not found locally, fetching from GitHub...")
    include(FetchContent)
    FetchContent_Declare(
        pybind11
        GIT_REPOSITORY https://github.com/pybind/pybind11.git
        GIT_TAG        v2.13.6
        GIT_SHALLOW    TRUE
    )
    FetchContent_MakeAvailable(pybind11)
endif()

# Smart holder support (shared_ptr/unique_ptr interoperability) is not in the
# v2.13.x releases pinned above; it arrived in pybind11 v3.0 as
# py::smart_holder / py::classh, so no compile definition is set here
pybind11_add_module(mylib src/bindings.cpp)
target_compile_features(mylib PRIVATE cxx_std_17)
Key Insight: When using FetchContent with pybind11, always set GIT_SHALLOW TRUE to avoid downloading the full git history. This reduces configure time from ~30 seconds to ~3 seconds on typical CI runners. For even faster builds, use URL mode pointing to a release tarball.

Building with scikit-build-core

scikit-build-core is the modern Python build backend that bridges CMake and Python's packaging standards (PEP 517/518). It replaces the older scikit-build (which wrapped setuptools) with a clean, standards-compliant implementation. It reads your pyproject.toml, invokes CMake to build the extension, and produces a wheel — all while integrating with pip, build, and other standard tools.

scikit-build-core Integration Architecture
        flowchart TD
            A[pyproject.toml] --> B[scikit-build-core Backend]
            B --> C[CMake Configure]
            C --> D[CMake Build]
            D --> E[Install to Wheel Staging]
            E --> F[.whl Package]
            
            G[CMakeLists.txt] --> C
            H[Python Source] --> E
            I[Compiled Extensions] --> E
            
            B --> J[Version Detection]
            B --> K[Metadata Generation]
            
            J --> F
            K --> F
            
            style A fill:#3B9797,color:#fff
            style B fill:#132440,color:#fff
            style F fill:#BF092F,color:#fff
            style G fill:#16476A,color:#fff
    
# pyproject.toml — scikit-build-core configuration
[build-system]
requires = ["scikit-build-core>=0.10", "pybind11>=2.13"]
build-backend = "scikit_build_core.build"

[project]
name = "mylib"
version = "1.2.0"
description = "High-performance matrix operations"
readme = "README.md"
license = "MIT"
requires-python = ">=3.9"
authors = [
    { name = "Developer Name", email = "dev@example.com" }
]
dependencies = ["numpy>=1.24"]

[project.optional-dependencies]
test = ["pytest>=7.0", "numpy"]
dev = ["pytest", "numpy", "sphinx"]

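[tool.scikit-build]
# Minimum CMake version required (cmake.minimum-version is the deprecated spelling)
cmake.version = ">=3.25"

# Build type
cmake.build-type = "Release"

# CMake arguments passed to configure step
cmake.args = ["-DBUILD_TESTING=OFF"]

# Wheel settings: ship the pure-Python package; the compiled module is placed
# next to it by install(TARGETS _core DESTINATION mylib) in CMakeLists.txt.
# Do not also set wheel.install-dir = "mylib", or the module lands in mylib/mylib/.
wheel.packages = ["src/mylib"]

# Only build specific targets (must match the CMake target name)
build.targets = ["_core"]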

# Source file matching for sdist
sdist.include = ["src/*", "CMakeLists.txt", "cmake/*"]
sdist.exclude = ["tests/*", "docs/*"]

CMake Integration with scikit-build-core

The CMakeLists.txt for a scikit-build-core project uses standard CMake with one key addition: installing targets to the directory scikit-build-core expects. The build backend sets CMAKE_INSTALL_PREFIX to a staging directory that becomes the wheel contents.

cmake_minimum_required(VERSION 3.25)
project(mylib VERSION 1.2.0 LANGUAGES CXX)

# scikit-build-core sets this automatically
if(SKBUILD)
    message(STATUS "Building via scikit-build-core (SKBUILD=${SKBUILD})")
endif()

# Find pybind11's CMake package (available because it's in build-system.requires)
find_package(pybind11 CONFIG REQUIRED)

# Build the extension
pybind11_add_module(_core src/bindings.cpp src/algorithm.cpp)
target_compile_features(_core PRIVATE cxx_std_17)
target_include_directories(_core PRIVATE include)

# Optional: Link external libraries
find_package(Eigen3 QUIET)
if(Eigen3_FOUND)
    target_link_libraries(_core PRIVATE Eigen3::Eigen)
    target_compile_definitions(_core PRIVATE HAS_EIGEN=1)
endif()

# Install the extension module into the package directory
install(TARGETS _core DESTINATION mylib)

The corresponding directory structure combines pure Python with compiled extensions:

# Project structure for scikit-build-core
mylib-project/
├── pyproject.toml
├── CMakeLists.txt
├── include/
│   └── algorithm.h
├── src/
│   ├── mylib/
│   │   ├── __init__.py       # Pure Python package
│   │   ├── utils.py          # Pure Python utilities
│   │   └── _core.pyi         # Type stubs for compiled extension
│   ├── bindings.cpp           # pybind11 bindings
│   └── algorithm.cpp          # C++ implementation
└── tests/
    ├── test_core.py
    └── conftest.py
Real-World Example
Building and Installing Locally

With scikit-build-core configured, the standard Python tooling handles everything:

# Install in development mode (editable)
pip install -e . --no-build-isolation

# Build a wheel
pip wheel . --no-deps -w dist/

# Build with verbose build output (build.verbose replaces the deprecated cmake.verbose)
pip install . -v --config-settings=build.verbose=true

# Override CMake build type
pip install . --config-settings=cmake.build-type=Debug

# Pass extra CMake arguments
pip install . --config-settings=cmake.args="-DWITH_OPENMP=ON"
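
After building a wheel it is worth confirming that the compiled module landed inside the package directory rather than at the top level or one level too deep. The following is a small sketch that relies only on the fact that a wheel is a zip archive; the function name list_extension_modules is an assumption for illustration.

```python
import zipfile
from pathlib import Path
from typing import List

# File suffixes used by compiled extension modules on Linux/macOS and Windows.
EXTENSION_SUFFIXES = (".so", ".pyd")


def list_extension_modules(wheel_path: Path) -> List[str]:
    """Return archive paths of compiled modules found inside a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name for name in wheel.namelist() if name.endswith(EXTENSION_SUFFIXES)
        )


if __name__ == "__main__":
    import sys

    # Usage: python check_wheel.py dist/*.whl
    for wheel_file in sys.argv[1:]:
        print(wheel_file)
        for member in list_extension_modules(Path(wheel_file)):
            print("   ", member)
```

For the layout used in this section, the expected entry would start with mylib/_core followed by the platform's extension suffix.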

Python CFFI with CMake

CFFI (C Foreign Function Interface) provides a different approach to Python extensions: instead of writing C++ bindings, you build a plain C shared library and use CFFI to generate Python wrappers. This approach works well for wrapping existing C libraries, is compatible with PyPy (unlike CPython-only extensions), and keeps the binding logic in Python rather than C++.

cmake_minimum_required(VERSION 3.25)
project(fastmath LANGUAGES C)

# Build the C library as a shared library
add_library(fastmath SHARED
    src/fastmath.c
    src/linalg.c
    src/statistics.c
)

target_include_directories(fastmath PUBLIC
    $<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/include>
    $<INSTALL_INTERFACE:include>
)

# Export symbols on Windows
include(GenerateExportHeader)
generate_export_header(fastmath)

# Install the shared library where Python can find it
install(TARGETS fastmath
    LIBRARY DESTINATION fastmath_py
    RUNTIME DESTINATION fastmath_py  # .dll on Windows
)

# Install headers for CFFI to parse
install(FILES include/fastmath.h DESTINATION fastmath_py)

The CFFI build script parses the C header file and generates a Python extension that wraps the shared library:

# build_cffi.py — CFFI out-of-line API mode builder
from cffi import FFI
from pathlib import Path

ffi = FFI()

# Read the C header file that CFFI will parse
header_path = Path(__file__).parent / "include" / "fastmath.h"
header_content = header_path.read_text()

# Remove #include directives and preprocessor guards
# CFFI only needs function declarations and type definitions
cleaned = "\n".join(
    line for line in header_content.splitlines()
    if not line.strip().startswith("#")
)

ffi.cdef(cleaned)

# Set source for the wrapper module
ffi.set_source(
    "fastmath_py._cffi_backend",  # module name
    '#include "fastmath.h"',       # C source for verification
    include_dirs=["include"],
    libraries=["fastmath"],
    library_dirs=["build/lib"],
)

if __name__ == "__main__":
    ffi.compile(verbose=True)

C Wrapper Generation and CMake Integration

For a complete build pipeline, integrate the CFFI compilation step into CMake using a custom command that runs after the C library is built:

# Add CFFI wrapper generation as a post-build step
find_package(Python3 REQUIRED COMPONENTS Interpreter)

# build_cffi.py runs in the source directory, so ffi.compile() writes the module
# into fastmath_py/ there. Python3_SOABI has no leading dot, so one is added;
# the ".so" ending assumes Linux or macOS.
set(CFFI_MODULE ${CMAKE_SOURCE_DIR}/fastmath_py/_cffi_backend.${Python3_SOABI}.so)

add_custom_command(
    OUTPUT ${CFFI_MODULE}
    COMMAND ${Python3_EXECUTABLE} ${CMAKE_SOURCE_DIR}/build_cffi.py
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    DEPENDS fastmath ${CMAKE_SOURCE_DIR}/build_cffi.py
    COMMENT "Generating CFFI wrapper module"
)

add_custom_target(cffi_wrapper ALL DEPENDS ${CFFI_MODULE})
Common Pitfall: CFFI's cdef() only accepts a subset of C syntax. Comments are fine, but #include, conditional compilation (#ifdef/#endif), function-like macros, and compiler-specific attributes are not; only simple constant #define lines are understood. Pre-process headers to strip these before passing them to ffi.cdef(), or maintain a separate "CFFI-clean" header file that contains only the public API declarations.

Cython Integration

Cython compiles Python-like .pyx files into C/C++ extension modules, offering a gradual path from pure Python to native performance. CMake can orchestrate the full Cython compilation pipeline: finding the Cython compiler, transpiling .pyx to C, then compiling and linking the resulting C code into a Python extension module.

cmake_minimum_required(VERSION 3.25)
project(cymath LANGUAGES C CXX)

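# NumPy is requested here because Python3_NumPy_INCLUDE_DIRS is used below
find_package(Python3 REQUIRED COMPONENTS Interpreter Development.Module NumPy)
# CMake ships no FindCython module; this assumes one is on CMAKE_MODULE_PATH
# and that it defines the Cython::cython executable target used below
find_package(Cython REQUIRED)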

# Method 1: Manual Cython compilation pipeline
add_custom_command(
    OUTPUT ${CMAKE_BINARY_DIR}/fast_ops.c
    COMMAND Cython::cython
        --output-file ${CMAKE_BINARY_DIR}/fast_ops.c
        -3    # Python 3 mode
        ${CMAKE_SOURCE_DIR}/src/fast_ops.pyx
    DEPENDS ${CMAKE_SOURCE_DIR}/src/fast_ops.pyx
    COMMENT "Cythonizing fast_ops.pyx"
)

Python3_add_library(fast_ops MODULE ${CMAKE_BINARY_DIR}/fast_ops.c)
target_include_directories(fast_ops PRIVATE ${Python3_NumPy_INCLUDE_DIRS})

# Method 2: Using scikit-build-core's Cython support
# (configured via pyproject.toml, no manual commands needed)

The .pyx Compilation Pipeline

A Cython source file mixes Python syntax with C type annotations for optimal performance:

# src/fast_ops.pyx — Cython extension with typed memoryviews
# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt, exp

cnp.import_array()

def euclidean_distance(double[:] a, double[:] b):
    """Compute Euclidean distance between two vectors."""
    cdef Py_ssize_t n = a.shape[0]
    cdef double total = 0.0
    cdef Py_ssize_t i
    
    if a.shape[0] != b.shape[0]:
        raise ValueError("Arrays must have same length")
    
    for i in range(n):
        total += (a[i] - b[i]) ** 2
    
    return sqrt(total)

def pairwise_distances(double[:, :] X):
    """Compute pairwise distance matrix."""
    cdef Py_ssize_t n = X.shape[0]
    cdef Py_ssize_t d = X.shape[1]
    cdef double[:, :] result = np.zeros((n, n), dtype=np.float64)
    cdef Py_ssize_t i, j, k
    cdef double diff, total
    
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                total += diff * diff
            result[i, j] = sqrt(total)
            result[j, i] = result[i, j]
    
    return np.asarray(result)
# Complete Cython project CMakeLists.txt with NumPy support
cmake_minimum_required(VERSION 3.25)
project(cymath LANGUAGES C)

find_package(Python3 REQUIRED COMPONENTS Interpreter Development.Module NumPy)

# Find Cython executable
find_program(CYTHON_EXECUTABLE
    NAMES cython cython3
    HINTS ${Python3_SITELIB}/../../../bin
          ${Python3_SITELIB}/../../Scripts
)

if(NOT CYTHON_EXECUTABLE)
    message(FATAL_ERROR "Cython not found. Install with: pip install cython")
endif()

# Cythonize .pyx → .c
set(PYX_SOURCE ${CMAKE_SOURCE_DIR}/src/fast_ops.pyx)
set(C_OUTPUT ${CMAKE_BINARY_DIR}/fast_ops.c)

add_custom_command(
    OUTPUT ${C_OUTPUT}
    COMMAND ${CYTHON_EXECUTABLE} -3 --fast-fail -o ${C_OUTPUT} ${PYX_SOURCE}
    DEPENDS ${PYX_SOURCE}
    VERBATIM
)

# Build extension module
Python3_add_library(fast_ops MODULE ${C_OUTPUT})
target_include_directories(fast_ops PRIVATE ${Python3_NumPy_INCLUDE_DIRS})

install(TARGETS fast_ops DESTINATION cymath)
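
With boundscheck and wraparound disabled, indexing mistakes in the Cython kernels can silently read or write out of bounds instead of raising an error, so it helps to keep a slow but obviously correct reference around. The sketch below is a pure-Python counterpart of the two functions in fast_ops.pyx; the _ref names are assumptions for illustration, and it returns nested lists rather than a NumPy array.

```python
import math
from typing import List, Sequence


def euclidean_distance_ref(a: Sequence[float], b: Sequence[float]) -> float:
    """Reference implementation: Euclidean distance between two vectors."""
    if len(a) != len(b):
        raise ValueError("Arrays must have same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances_ref(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Reference implementation: symmetric pairwise distance matrix."""
    n = len(X)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the lower one is mirrored.
        for j in range(i + 1, n):
            d = euclidean_distance_ref(X[i], X[j])
            result[i][j] = d
            result[j][i] = d
    return result


print(euclidean_distance_ref([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

In a test suite, the compiled fast_ops functions would be compared against these references on small random inputs with a floating-point tolerance.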

Boost.Python Legacy Projects

Boost.Python predates pybind11 and remains in use across many legacy codebases. While new projects should prefer pybind11 (which is header-only and lighter), understanding Boost.Python is essential for maintaining and migrating existing code. The migration path from Boost.Python to pybind11 is straightforward since pybind11 was designed as a spiritual successor with a similar API.

cmake_minimum_required(VERSION 3.25)
project(legacy_bindings LANGUAGES CXX)

# Find Boost with the Python and NumPy components
# (the bindings include boost/python/numpy.hpp and call np::initialize())
find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
set(_py_tag ${Python3_VERSION_MAJOR}${Python3_VERSION_MINOR})
find_package(Boost REQUIRED COMPONENTS python${_py_tag} numpy${_py_tag})

# Build Boost.Python module
add_library(legacy_module MODULE src/legacy_bindings.cpp)
target_link_libraries(legacy_module PRIVATE
    Boost::python${_py_tag}
    Boost::numpy${_py_tag}
    Python3::Module
)

# Remove lib prefix and set the ABI-tagged suffix
# (Python3_SOABI has no leading dot, so one is added here)
set_target_properties(legacy_module PROPERTIES
    PREFIX ""
    SUFFIX ".${Python3_SOABI}.so"
)

install(TARGETS legacy_module DESTINATION .)
// src/legacy_bindings.cpp — Boost.Python module
#include <boost/python.hpp>
#include <boost/python/numpy.hpp>
#include "legacy_algorithm.h"

namespace bp = boost::python;
namespace np = boost::python::numpy;

// Wrapping a legacy class
struct LegacyProcessor {
    LegacyProcessor(int buffer_size) : buf_size(buffer_size) {}
    
    bp::list process(bp::list input) {
        bp::list result;
        for (int i = 0; i < bp::len(input); ++i) {
            double val = bp::extract<double>(input[i]);
            result.append(val * 2.0);  // simplified processing
        }
        return result;
    }
    
    int buf_size;
};

BOOST_PYTHON_MODULE(legacy_module) {
    np::initialize();
    
    bp::class_<LegacyProcessor>("LegacyProcessor", bp::init<int>())
        .def("process", &LegacyProcessor::process)
        .def_readonly("buffer_size", &LegacyProcessor::buf_size);
}
Migration Strategy
Boost.Python → pybind11 Migration Checklist

Migrating from Boost.Python to pybind11 follows a mechanical transformation:

  • boost::python::class_<T> → py::class_<T>
  • BOOST_PYTHON_MODULE(name) → PYBIND11_MODULE(name, m)
  • bp::extract<T>(obj) → obj.cast<T>()
  • bp::list → py::list
  • bp::init<Args...>() → py::init<Args...>()
  • Remove Boost dependency from CMake (header-only pybind11 replaces it)
  • Build size reduction: typical 50–80% smaller binary
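
Because the renames in the checklist are mostly textual, a first pass can be scripted. The following is a rough sketch, not a complete migration tool: it assumes the source uses the bp alias (or the full boost::python:: prefix) and only handles simple, unnested bp::extract calls. Structural differences, such as pybind11 taking the module object m as the first argument of py::class_ and registering constructors through .def(py::init<...>()), still need manual edits.

```python
import re

# Ordered (pattern, replacement) pairs taken from the checklist.
REWRITES = [
    (re.compile(r"BOOST_PYTHON_MODULE\(\s*(\w+)\s*\)"), r"PYBIND11_MODULE(\1, m)"),
    (re.compile(r"\b(?:boost::python|bp)::class_<"), "py::class_<"),
    (re.compile(r"\b(?:boost::python|bp)::init<"), "py::init<"),
    (re.compile(r"\b(?:boost::python|bp)::list\b"), "py::list"),
    # bp::extract<T>(expr) becomes expr.cast<T>(); nested <> or () are not handled.
    (
        re.compile(r"\b(?:boost::python|bp)::extract<([^<>]+)>\(([^()]+)\)"),
        r"\2.cast<\1>()",
    ),
]


def rewrite_line(line: str) -> str:
    """Apply every textual rename to one line of C++ source."""
    for pattern, replacement in REWRITES:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    import sys

    # Usage: python migrate.py < legacy_bindings.cpp > bindings.cpp
    for source_line in sys.stdin:
        sys.stdout.write(rewrite_line(source_line))
```

Treat the output as a starting point for review rather than compilable code.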

PyPI Distribution

Distributing compiled Python extensions on PyPI requires building platform-specific binary wheels for every combination of operating system, CPU architecture, and Python version your users need. cibuildwheel automates this process by running builds inside standardized containers (manylinux for Linux) and virtual machines (macOS, Windows) to produce wheels that work on any compatible system.

cibuildwheel CI/CD Workflow
        flowchart TD
            A[Push to GitHub] --> B[GitHub Actions Trigger]
            B --> C[Linux Builds]
            B --> D[macOS Builds]
            B --> E[Windows Builds]
            
            C --> C1[manylinux_2_28 x86_64]
            C --> C2[manylinux_2_28 aarch64]
            C --> C3[musllinux_1_2 x86_64]
            
            D --> D1[macOS x86_64]
            D --> D2[macOS arm64]
            D --> D3[macOS universal2]
            
            E --> E1[Windows x86_64]
            E --> E2[Windows ARM64]
            
            C1 --> F[Wheel Artifacts]
            C2 --> F
            C3 --> F
            D1 --> F
            D2 --> F
            D3 --> F
            E1 --> F
            E2 --> F
            
            F --> G[twine upload → PyPI]
            
            style A fill:#132440,color:#fff
            style B fill:#16476A,color:#fff
            style F fill:#3B9797,color:#fff
            style G fill:#BF092F,color:#fff
    
# .github/workflows/build_wheels.yml — cibuildwheel with GitHub Actions
name: Build and Publish Wheels

on:
  push:
    tags: ["v*"]
  pull_request:
    branches: [main]

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-14, windows-latest]

    steps:
      - uses: actions/checkout@v4

      - name: Build wheels
        uses: pypa/cibuildwheel@v2.21
        env:
          # Build for CPython 3.9–3.13
          CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-* cp313-*"
          
          # Skip 32-bit builds (optional); musllinux x86_64 wheels are still built
          CIBW_SKIP: "*-win32 *-manylinux_i686 *-musllinux_i686"
          
          # Linux: use manylinux_2_28 for modern glibc
          CIBW_MANYLINUX_X86_64_IMAGE: manylinux_2_28
          CIBW_MANYLINUX_AARCH64_IMAGE: manylinux_2_28
          
          # macOS: build separate x86_64 and arm64 wheels
          # (add "universal2" to the list for a combined wheel)
          CIBW_ARCHS_MACOS: "x86_64 arm64"
          CIBW_ENVIRONMENT_MACOS: MACOSX_DEPLOYMENT_TARGET=11.0
          
          # Install test dependencies and run tests
          CIBW_TEST_REQUIRES: pytest numpy
          CIBW_TEST_COMMAND: pytest {project}/tests -x
          
          # Before build: install system dependencies
          CIBW_BEFORE_ALL_LINUX: yum install -y openblas-devel || apk add openblas-dev

      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx run build --sdist
      - uses: actions/upload-artifact@v4
        with:
          name: sdist
          path: dist/*.tar.gz

  publish:
    name: Publish to PyPI
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
    permissions:
      id-token: write  # Trusted publishing
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: "*"
          merge-multiple: true
          path: dist/
      - uses: pypa/gh-action-pypi-publish@release/v1

Platform-Specific Wheel Details

Each platform has its own requirements for binary compatibility. Linux wheels may only depend on symbols available in the glibc version named by their manylinux tag (auditwheel verifies this and bundles other shared libraries), macOS wheels need delocate to bundle dylibs, and Windows wheels must include the Visual C++ runtime or link it statically.
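
The platform a wheel targets is encoded in its file name as distribution, version, Python tag, ABI tag, and platform tag. Here is a minimal sketch of pulling those fields apart with plain string handling; the WheelTags type and function name are assumptions for illustration, and the packaging library's packaging.utils.parse_wheel_filename is the more robust choice in real tooling.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its components (an optional build tag is ignored)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name: {filename}")
    # The last three fields are always python tag, ABI tag, platform tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


tags = parse_wheel_filename("mylib-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl")
print(tags.platform_tag)  # manylinux_2_28_x86_64
```

Repaired Linux wheels often carry several platform tags joined by dots, which this sketch returns as a single string.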

# pyproject.toml — cibuildwheel configuration section
[tool.cibuildwheel]
build-verbosity = 1
test-requires = ["pytest>=7.0", "numpy>=1.24"]
test-command = "pytest {project}/tests -v --tb=short"

[tool.cibuildwheel.linux]
# manylinux_2_28 supports glibc >= 2.28 (RHEL 8+, Ubuntu 20.04+)
manylinux-x86_64-image = "manylinux_2_28"
manylinux-aarch64-image = "manylinux_2_28"
# Also build musllinux for Alpine Docker users
musllinux-x86_64-image = "musllinux_1_2"

# Install build dependencies inside the container
# (the manylinux/musllinux images already ship CMake, so only OpenBLAS is added)
before-all = "yum install -y openblas-devel || apk add openblas-dev"

# Repair wheel: bundle shared libs and set RPATH
repair-wheel-command = "auditwheel repair -w {dest_dir} {wheel}"

[tool.cibuildwheel.macos]
archs = ["x86_64", "arm64"]
# delocate bundles dylibs into the wheel
repair-wheel-command = "delocate-wheel --require-archs {delocate_archs} -w {dest_dir} -v {wheel}"

[tool.cibuildwheel.windows]
archs = ["AMD64"]
# delvewheel bundles DLLs into the wheel
before-build = "pip install delvewheel"
repair-wheel-command = "delvewheel repair -w {dest_dir} {wheel}"
Common Pitfall: The most frequent manylinux compatibility failure is linking against a glibc symbol newer than the target policy allows. If your extension uses getentropy(), copy_file_range(), or other modern libc functions, auditwheel will reject the wheel. Fix by either patching your code to use older alternatives or targeting a newer manylinux tag (e.g., manylinux_2_28 instead of manylinux_2_17).
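
The rule behind this pitfall is a simple version comparison: a symbol versioned GLIBC_x.y is acceptable only if x.y does not exceed the version in the manylinux tag. As a sketch of that check for PEP 600 style tags, with hypothetical function names:

```python
import re
from typing import Tuple

_MANYLINUX_TAG = re.compile(r"^manylinux_(\d+)_(\d+)_(\w+)$")


def manylinux_glibc_version(platform_tag: str) -> Tuple[int, int]:
    """Return the (major, minor) glibc version encoded in a PEP 600 tag."""
    match = _MANYLINUX_TAG.match(platform_tag)
    if match is None:
        raise ValueError(f"not a PEP 600 manylinux tag: {platform_tag}")
    return int(match.group(1)), int(match.group(2))


def symbol_fits_policy(symbol_glibc: Tuple[int, int], platform_tag: str) -> bool:
    """True if a symbol versioned GLIBC_x.y is allowed under the given tag."""
    return symbol_glibc <= manylinux_glibc_version(platform_tag)


# getentropy() was added in glibc 2.25, copy_file_range() in glibc 2.27.
print(symbol_fits_policy((2, 25), "manylinux_2_17_x86_64"))  # False
print(symbol_fits_policy((2, 27), "manylinux_2_28_x86_64"))  # True
```

To see which versioned symbols a built module actually references, inspect it with a tool such as objdump -T and look for GLIBC_ entries.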

Conda Package Distribution

Conda provides an alternative distribution channel that excels at managing complex native dependencies. Unlike PyPI wheels (which must bundle all non-Python libraries), Conda packages declare runtime dependencies that the solver resolves, so your extension can link against shared BLAS, HDF5, or CUDA packages from the channel instead of bundling them.

# recipe/meta.yaml — conda-build recipe
{% set version = "1.2.0" %}

package:
  name: mylib
  version: {{ version }}

source:
  url: https://github.com/myorg/mylib/archive/v{{ version }}.tar.gz
  sha256: abc123def456...

build:
  number: 0
  script: {{ PYTHON }} -m pip install . -vv --no-deps --no-build-isolation

requirements:
  build:
    - {{ compiler('c') }}
    - {{ compiler('cxx') }}
    - cmake >=3.25
    - ninja
    - python                                # [build_platform != target_platform]
    - cross-python_{{ target_platform }}     # [build_platform != target_platform]
    - pybind11 >=2.13
  host:
    - python
    - pip
    - scikit-build-core >=0.10
    - pybind11 >=2.13
    - numpy
  run:
    - python
    - numpy >=1.24

test:
  imports:
    - mylib
    - mylib._core
  source_files:
    - tests              # the test suite lives outside the installed package
  requires:
    - pytest
  commands:
    - pytest tests -x

about:
  home: https://github.com/myorg/mylib
  license: MIT
  license_file: LICENSE
  summary: High-performance matrix operations for Python
  description: |
    mylib provides optimized C++ implementations of common
    matrix operations with a Pythonic interface.

conda-forge Submission

conda-forge is the community-maintained channel with automated CI for building packages across Linux, macOS, and Windows. Submitting a package involves creating a feedstock repository with your recipe:

# recipe/conda_build_config.yaml — variant configuration
python:
  - "3.9"
  - "3.10"
  - "3.11"
  - "3.12"
  - "3.13"

numpy:
  - "1.24"
  - "2.0"

pin_run_as_build:
  python:
    min_pin: x.x
    max_pin: x.x
# Submit to conda-forge (one-time setup)
# 1. Fork conda-forge/staged-recipes
# 2. Add your recipe to recipes/mylib/meta.yaml
# 3. Open a PR — CI builds test your recipe on all platforms
# 4. After merge, a feedstock repo is created automatically

# Local testing before submission
conda build recipe/ --python 3.12
conda build recipe/ --python 3.12 --variants "{'numpy': ['2.0']}"

# Install from local build
conda install --use-local mylib
Key Insight: For packages with complex native dependencies (CUDA, MPI, HDF5), Conda is significantly easier than PyPI distribution. Conda's solver handles dependency versions and ABI compatibility automatically, while PyPI would require you to bundle or statically link these libraries into every wheel.

Hybrid Project Structure

Most real-world Python packages combine pure Python code (for high-level APIs, configuration, and glue logic) with compiled extensions (for performance-critical inner loops). Structuring this correctly ensures that editable installs, type checking, and IDE support all work seamlessly alongside the compiled components.

# Recommended hybrid project layout
mylib/
├── pyproject.toml              # Build system configuration
├── CMakeLists.txt              # Top-level CMake file
├── cmake/
│   └── FindNumPy.cmake         # Custom CMake modules
├── src/
│   ├── mylib/                  # Python package source
│   │   ├── __init__.py         # Package init, re-exports from _core
│   │   ├── py.typed            # PEP 561 marker file
│   │   ├── _core.pyi           # Type stubs for compiled extension
│   │   ├── api.py              # High-level Python API
│   │   ├── utils.py            # Pure Python utilities
│   │   └── visualization.py   # Plotting (pure Python, uses matplotlib)
│   └── _core/                  # C++ source for compiled extension
│       ├── module.cpp          # pybind11 module definition
│       ├── linalg.cpp          # Linear algebra implementations
│       ├── linalg.h
│       ├── signal.cpp          # Signal processing
│       └── signal.h
├── tests/
│   ├── conftest.py
│   ├── test_linalg.py
│   ├── test_signal.py
│   └── test_api.py
└── docs/
    └── conf.py
# src/mylib/__init__.py — Package initialization
"""mylib: High-performance numerical library."""

from mylib._core import (
    Matrix,
    fast_multiply,
    svd_decompose,
    fft_forward,
    fft_inverse,
)
from mylib.api import (
    solve_linear_system,
    fit_polynomial,
    bandpass_filter,
)
from mylib.utils import timer, validate_array

# Version from importlib.metadata (single source of truth)
from importlib.metadata import version, PackageNotFoundError

try:
    __version__ = version("mylib")
except PackageNotFoundError:
    __version__ = "0.0.0-dev"

__all__ = [
    "Matrix",
    "fast_multiply",
    "svd_decompose",
    "fft_forward",
    "fft_inverse",
    "solve_linear_system",
    "fit_polynomial",
    "bandpass_filter",
    "timer",
    "validate_array",
    "__version__",
]
# src/mylib/_core.pyi — Type stubs for the compiled extension
import numpy as np
import numpy.typing as npt

class Matrix:
    def __init__(self, rows: int, cols: int) -> None: ...
    @property
    def rows(self) -> int: ...
    @property
    def cols(self) -> int: ...
    def get(self, row: int, col: int) -> float: ...
    def set(self, row: int, col: int, value: float) -> None: ...
    @staticmethod
    def from_numpy(array: npt.NDArray[np.float64]) -> "Matrix": ...
    def to_numpy(self) -> npt.NDArray[np.float64]: ...

def fast_multiply(a: Matrix, b: Matrix) -> Matrix: ...
def svd_decompose(m: Matrix) -> tuple[Matrix, npt.NDArray[np.float64], Matrix]: ...
def fft_forward(signal: npt.NDArray[np.complex128]) -> npt.NDArray[np.complex128]: ...
def fft_inverse(spectrum: npt.NDArray[np.complex128]) -> npt.NDArray[np.complex128]: ...
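
Hand-written stubs can drift from the compiled module as the C++ bindings change. As a hedged sketch, the standard ast module can list the public names a stub declares so that a test can compare them with what the extension actually exports. The helper names stub_public_names and missing_from_module are hypothetical, and the fake module below only stands in for the real mylib._core:

```python
import ast
import types


def stub_public_names(stub_source: str) -> set[str]:
    """Collect top-level public class and function names declared in a stub."""
    names = set()
    for node in ast.parse(stub_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                names.add(node.name)
    return names


def missing_from_module(stub_source: str, module) -> list[str]:
    """Return stub names that the given module does not define."""
    return sorted(stub_public_names(stub_source) - set(dir(module)))


STUB = '''
class Matrix:
    def get(self, row: int, col: int) -> float: ...

def fast_multiply(a: Matrix, b: Matrix) -> Matrix: ...
'''

# Stand-in for the compiled module: it defines Matrix but not fast_multiply
fake_core = types.ModuleType("_core")
fake_core.Matrix = type("Matrix", (), {})

print(missing_from_module(STUB, fake_core))  # prints ['fast_multiply']
```

In a real test you would read the text of _core.pyi from disk and pass the imported mylib._core module instead of the stand-in.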

Testing Native Extensions

Testing compiled Python extensions requires running both C++-level unit tests (via CTest/GoogleTest) and Python-level integration tests (via pytest). The key challenge is ensuring the compiled extension is importable during testing — either by installing it into the test environment or by manipulating sys.path to include the build directory.

# tests/conftest.py — pytest configuration for native extensions
import pytest

# Tests import mylib from the active environment. An editable install
# built by scikit-build-core makes the compiled extension importable,
# so no sys.path manipulation is needed here.

@pytest.fixture
def sample_matrix():
    """Create a sample matrix for testing."""
    import numpy as np
    from mylib import Matrix
    
    data = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0],
                     [7.0, 8.0, 9.0]])
    return Matrix.from_numpy(data)

@pytest.fixture
def random_signal():
    """Generate a deterministic two-tone test signal (50 Hz + 120 Hz)."""
    import numpy as np
    t = np.linspace(0, 1, 1024)
    return np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
# tests/test_linalg.py — Testing the compiled linear algebra extension
import numpy as np
import numpy.testing as npt
import pytest
from mylib import Matrix, fast_multiply, svd_decompose

class TestMatrix:
    def test_creation(self):
        m = Matrix(3, 4)
        assert m.rows == 3
        assert m.cols == 4
    
    def test_from_numpy_roundtrip(self):
        original = np.array([[1.0, 2.0], [3.0, 4.0]])
        m = Matrix.from_numpy(original)
        result = m.to_numpy()
        npt.assert_array_almost_equal(result, original)
    
    def test_multiply_identity(self, sample_matrix):
        identity = Matrix.from_numpy(np.eye(3))
        result = fast_multiply(sample_matrix, identity)
        npt.assert_array_almost_equal(
            result.to_numpy(),
            sample_matrix.to_numpy()
        )
    
    def test_multiply_dimensions_mismatch(self):
        a = Matrix(2, 3)
        b = Matrix(4, 2)
        with pytest.raises(ValueError, match="dimension mismatch"):
            fast_multiply(a, b)

class TestSVD:
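    def test_reconstruction(self):
        rng = np.random.default_rng(0)  # Fixed seed keeps the test reproducible
        data = rng.standard_normal((5, 3))
        m = Matrix.from_numpy(data)
        u, s, vt = svd_decompose(m)
        
        # Reconstruct: U @ diag(S) @ Vt should equal original
        # (assumes a thin SVD, so U is 5x3 and Vt is 3x3)
        reconstructed = u.to_numpy() @ np.diag(s) @ vt.to_numpy()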
        npt.assert_array_almost_equal(reconstructed, data, decimal=10)
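
When the package is not installed, the sys.path approach mentioned at the start of this section can be sketched as a small helper for conftest.py. The function name ensure_importable is an assumption for illustration, and the build directory you pass in depends on where your build places the mylib package:

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(package: str, build_dir: Path) -> bool:
    """Make `package` importable, falling back to a build directory.

    Returns True if the package can be found afterwards.
    """
    # Prefer whatever is already installed (for example an editable install)
    if importlib.util.find_spec(package) is not None:
        return True

    # Fall back to the build tree, if it exists
    if build_dir.is_dir() and str(build_dir) not in sys.path:
        sys.path.insert(0, str(build_dir))
        importlib.invalidate_caches()

    return importlib.util.find_spec(package) is not None
```

A conftest.py could call this once at import time and skip the whole session with pytest.skip(..., allow_module_level=True) when it returns False, which gives a clearer message than a wall of ImportError failures.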

CTest Wrapping Python Tests

Integrate pytest into your CMake test suite so that ctest runs both C++ and Python tests:

# CMakeLists.txt — Integrating pytest with CTest
include(CTest)

if(BUILD_TESTING)
    # C++ unit tests with GoogleTest
    find_package(GTest REQUIRED)
    
    add_executable(test_linalg_cpp tests/cpp/test_linalg.cpp)
    target_link_libraries(test_linalg_cpp PRIVATE GTest::gtest_main mylib_core)
    
    add_test(NAME cpp_linalg COMMAND test_linalg_cpp)
    
    # Python tests via pytest
    find_package(Python3 REQUIRED COMPONENTS Interpreter)
    
    add_test(
        NAME python_tests
        COMMAND ${Python3_EXECUTABLE} -m pytest
            ${CMAKE_SOURCE_DIR}/tests
            -v --tb=short
            --junitxml=${CMAKE_BINARY_DIR}/pytest_results.xml
    )
    
    # Set environment so Python can find the built extension
    set_tests_properties(python_tests PROPERTIES
        ENVIRONMENT "PYTHONPATH=${CMAKE_BINARY_DIR}/src"
        WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    )
    
    # Timeout for slow tests
    set_tests_properties(python_tests PROPERTIES TIMEOUT 120)
endif()
# Run all tests (C++ and Python) via CTest
cmake -B build -DBUILD_TESTING=ON
cmake --build build
cd build && ctest --output-on-failure -j$(nproc)

# Run only Python tests
ctest -R python_tests --verbose

# Run only C++ tests
ctest -R cpp_ --verbose

Versioning and Metadata

Maintaining a single source of truth for the package version across pyproject.toml, CMake, and the Python runtime avoids version drift. scikit-build-core supports several strategies for version synchronization:

# pyproject.toml — Dynamic version from Git tags
[project]
name = "mylib"
dynamic = ["version"]

[tool.scikit-build]
metadata.version.provider = "scikit_build_core.metadata.setuptools_scm"

[tool.setuptools_scm]
write_to = "src/mylib/_version.py"
version_scheme = "guess-next-dev"
local_scheme = "node-and-date"
# CMakeLists.txt — Reading version from Git or pyproject.toml
cmake_minimum_required(VERSION 3.25)

# Strategy 1: Extract version from Git tags
find_package(Git QUIET)
if(Git_FOUND)
    execute_process(
        COMMAND ${GIT_EXECUTABLE} describe --tags --abbrev=0
        WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
        OUTPUT_VARIABLE GIT_TAG
        OUTPUT_STRIP_TRAILING_WHITESPACE
        ERROR_QUIET
    )
    string(REGEX REPLACE "^v" "" PROJECT_VERSION "${GIT_TAG}")
endif()

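# Strategy 2: Parse a static version from pyproject.toml
# (finds nothing when pyproject.toml declares dynamic = ["version"])
if(NOT PROJECT_VERSION)
    file(READ "${CMAKE_SOURCE_DIR}/pyproject.toml" PYPROJECT_CONTENT)
    string(REGEX MATCH "version = \"([0-9]+\\.[0-9]+\\.[0-9]+)\"" _version_match "${PYPROJECT_CONTENT}")
    if(_version_match)
        set(PROJECT_VERSION "${CMAKE_MATCH_1}")
    endif()
endif()

# Fallback so that project() always receives a valid version
if(NOT PROJECT_VERSION)
    set(PROJECT_VERSION "0.0.0")
endif()

project(mylib VERSION ${PROJECT_VERSION} LANGUAGES CXX)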

message(STATUS "Building mylib version: ${PROJECT_VERSION}")

# Pass version to C++ code (the _core target must already exist at this point)
target_compile_definitions(_core PRIVATE
    MYLIB_VERSION="${PROJECT_VERSION}"
    MYLIB_VERSION_MAJOR=${PROJECT_VERSION_MAJOR}
    MYLIB_VERSION_MINOR=${PROJECT_VERSION_MINOR}
    MYLIB_VERSION_PATCH=${PROJECT_VERSION_PATCH}
)
# src/mylib/__init__.py — Runtime version access
import re
from importlib.metadata import version, PackageNotFoundError

try:
    __version__ = version("mylib")
except PackageNotFoundError:
    # Package not installed (development mode without metadata)
    try:
        from mylib._version import version as __version__
    except ImportError:
        __version__ = "0.0.0-dev"

def get_version_info():
    """Return structured version information."""
    # Parse only the leading numeric release so that versions such as
    # "1.3.dev4+g1a2b3c" or "1.2.3rc1" do not raise ValueError.
    release = re.match(r"\d+(?:\.\d+)*", __version__)
    numbers = [int(part) for part in release.group().split(".")] if release else []
    numbers += [0] * (3 - len(numbers))
    return {
        "major": numbers[0],
        "minor": numbers[1],
        "patch": numbers[2],
        "full": __version__,
    }
Key Insight: Using importlib.metadata.version() at runtime is the most robust approach — it reads from the installed package metadata (PKG-INFO/METADATA) and works regardless of how the version was set at build time. Combined with setuptools-scm or scikit-build-core's dynamic versioning, you get automatic version bumps from Git tags with zero manual maintenance.

Performance Profiling

Profiling native Python extensions requires mixed-mode tools that can correlate Python call stacks with C/C++ function calls. Standard Python profilers (cProfile, line_profiler) only see the Python-level function call boundary — they cannot look inside your compiled extension to identify which C++ function is the bottleneck.

# py-spy: Sampling profiler that sees through native extensions
# Install: pip install py-spy

# Launch a script under the profiler (no code changes needed)
py-spy record -o profile.svg --native -- python benchmark.py

# Top-like live view with native call stacks
py-spy top --native --pid $(pgrep -f "python benchmark.py")

# Generate a speedscope-compatible JSON for detailed analysis
py-spy record -o profile.json --format speedscope --native -- python benchmark.py

# Profile with subprocesses (useful for multiprocessing)
py-spy record --subprocesses -o profile.svg -- python parallel_bench.py
# benchmark.py — Profiling script for native extension
import numpy as np
import time
from mylib import Matrix, fast_multiply, fft_forward

def benchmark_multiply(n=1000, iterations=100):
    """Benchmark matrix multiplication."""
    a_np = np.random.randn(n, n)
    b_np = np.random.randn(n, n)
    
    a = Matrix.from_numpy(a_np)
    b = Matrix.from_numpy(b_np)
    
    # Warm up
    fast_multiply(a, b)
    
    # Timed iterations
    start = time.perf_counter()
    for _ in range(iterations):
        result = fast_multiply(a, b)
    elapsed = time.perf_counter() - start
    
    gflops = (2 * n**3 * iterations) / elapsed / 1e9
    print(f"Matrix multiply {n}x{n}: {elapsed/iterations*1000:.2f} ms, {gflops:.1f} GFLOPS")

def benchmark_fft(n=1048576, iterations=50):
    """Benchmark FFT on large signal."""
    signal = np.random.randn(n) + 1j * np.random.randn(n)
    
    start = time.perf_counter()
    for _ in range(iterations):
        spectrum = fft_forward(signal)
    elapsed = time.perf_counter() - start
    
    print(f"FFT n={n}: {elapsed/iterations*1000:.2f} ms")

if __name__ == "__main__":
    benchmark_multiply(500, 200)
    benchmark_multiply(1000, 50)
    benchmark_fft()
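
To see the boundary that standard profilers stop at, the following sketch runs a C-implemented function under cProfile. The built-in sorted() is used here purely as a stand-in for a compiled extension call such as fast_multiply, and the helper name profile_report is hypothetical:

```python
import cProfile
import io
import pstats


def profile_report(func, *args):
    """Run func under cProfile and return the text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats()
    return stream.getvalue()


# sorted() is implemented in C, so it stands in for a compiled extension call
report = profile_report(sorted, list(range(200_000, 0, -1)))
print(report)
```

The report lists the call to sorted as a single entry with its total time, and nothing about the work done inside it. That is the same view cProfile gives of a function exported by your extension, which is why a native-aware sampler such as py-spy is needed to look further down.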

For tighter NumPy integration in your extension, access array data through pybind11's array_t and its unchecked proxies, which read and write the underlying buffer directly and avoid unnecessary copies:

// NumPy C API integration in pybind11
#include <cmath>  // std::sqrt
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Zero-copy access to NumPy array data
py::array_t<double> inplace_normalize(py::array_t<double> input) {
    // Request mutable buffer access (no copy)
    auto buf = input.mutable_unchecked<1>();
    
    // Compute norm
    double sum_sq = 0.0;
    for (py::ssize_t i = 0; i < buf.shape(0); i++) {
        sum_sq += buf(i) * buf(i);
    }
    double norm = std::sqrt(sum_sq);
    
    // Normalize in-place
    if (norm > 1e-10) {
        for (py::ssize_t i = 0; i < buf.shape(0); i++) {
            buf(i) /= norm;
        }
    }
    
    return input;  // Return the same array (modified in-place)
}

// Create a new NumPy array from C++ computation
py::array_t<double> compute_histogram(py::array_t<double> data, int bins) {
    if (bins <= 0) throw py::value_error("bins must be positive");
    auto d = data.unchecked<1>();
    
    // Allocate output array
    auto result = py::array_t<double>(bins);
    auto r = result.mutable_unchecked<1>();
    
    // Initialize to zero
    for (int i = 0; i < bins; i++) r(i) = 0.0;
    
    // Empty input: return all-zero counts instead of reading d(0)
    if (d.shape(0) == 0) return result;
    
    // Find min/max
    double min_val = d(0), max_val = d(0);
    for (py::ssize_t i = 1; i < d.shape(0); i++) {
        if (d(i) < min_val) min_val = d(i);
        if (d(i) > max_val) max_val = d(i);
    }
    
    // Bin the data (bin_width is 0 when all values are equal,
    // in which case everything is counted in the first bin)
    double bin_width = (max_val - min_val) / bins;
    for (py::ssize_t i = 0; i < d.shape(0); i++) {
        int bin = bin_width > 0.0
            ? static_cast<int>((d(i) - min_val) / bin_width)
            : 0;
        if (bin >= bins) bin = bins - 1;  // Edge case: max value
        r(bin) += 1.0;
    }
    
    return result;
}

PYBIND11_MODULE(_core, m) {
    m.def("inplace_normalize", &inplace_normalize,
          py::arg("input").noconvert(),  // Require exact dtype match
          "Normalize array in-place (zero-copy)");
    m.def("compute_histogram", &compute_histogram,
          py::arg("data"), py::arg("bins") = 50);
}
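
A pure-Python mirror of the binning logic is a convenient oracle when testing compute_histogram from pytest. The sketch below is an assumption-laden reference, not part of the extension: the name reference_histogram is hypothetical, and its handling of empty and constant input (all-zero counts, and everything in the first bin) is a choice made here for illustration.

```python
def reference_histogram(data, bins=50):
    """Pure-Python reference for equal-width binning over [min, max]."""
    if bins <= 0:
        raise ValueError("bins must be positive")

    counts = [0.0] * bins
    if not data:
        return counts  # Assumed behavior: empty input gives all-zero counts

    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    for value in data:
        # Constant input gives width == 0; count everything in the first bin
        index = int((value - lo) / width) if width > 0 else 0
        # The maximum value would land one past the end, so clamp it
        counts[min(index, bins - 1)] += 1.0
    return counts


print(reference_histogram([0.0, 1.0, 2.0, 3.0, 4.0], bins=4))
# prints [1.0, 1.0, 1.0, 2.0]
```

A test can then compare list(compute_histogram(np.asarray(data), bins)) against this reference for a handful of inputs, including the maximum-value edge case.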
Performance Tip
Avoiding GIL Bottlenecks in Extensions

For CPU-intensive operations that don't access Python objects, release the GIL to enable true parallelism with Python threads:

// Release GIL during computation
// (expensive_computation is a placeholder for a pure C++ function)
py::array_t<double> parallel_compute(py::array_t<double> input) {
    auto buf = input.unchecked<1>();
    auto result = py::array_t<double>(buf.shape(0));
    auto out = result.mutable_unchecked<1>();
    
    {
        // Release GIL: other Python threads can run inside this scope.
        // No Python objects may be created or destroyed here.
        py::gil_scoped_release release;
        
        #pragma omp parallel for
        for (py::ssize_t i = 0; i < buf.shape(0); i++) {
            out(i) = expensive_computation(buf(i));
        }
    }  // GIL re-acquired here, when 'release' goes out of scope
    
    return result;
}
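
On the Python side, a function that releases the GIL can be driven from ordinary threads. The sketch below uses hashlib.sha256 as a stand-in, because it releases the GIL while hashing large buffers; with your own extension you would submit a call such as parallel_compute instead. The helper names digest and digest_all are hypothetical:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(chunk: bytes) -> str:
    """Stand-in for a GIL-releasing extension call."""
    return hashlib.sha256(chunk).hexdigest()


def digest_all(chunks, workers=4):
    """Process chunks on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, chunks))


chunks = [bytes([i]) * 1_000_000 for i in range(8)]
threaded = digest_all(chunks)
sequential = [digest(chunk) for chunk in chunks]
print(threaded == sequential)
```

The threaded and sequential results are identical; the difference is that the threaded calls can overlap on multiple cores while the GIL is released. If the underlying function held the GIL for its whole duration, the thread pool would add overhead without any speedup.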

Conclusion & Next Steps

Distributing Python extensions with CMake combines the power of native compilation with Python's packaging ecosystem. The modern toolchain — scikit-build-core for build backend integration, pybind11 for C++ bindings, cibuildwheel for cross-platform wheel generation, and conda-forge for complex dependency management — makes it possible to ship high-performance code to millions of Python users with pip install simplicity.

Key takeaways from this article:

  • pybind11 + CMake is the gold standard for new C++ Python extensions — header-only, expressive API, and deep NumPy integration
  • scikit-build-core replaces setuptools for CMake-based projects, providing PEP 517 compliance and editable install support
  • cibuildwheel automates the complex task of building wheels for every platform/Python combination via CI
  • CFFI excels when wrapping existing C libraries without modifying their source code, and works with PyPy
  • Conda is the better distribution channel when your extension depends on heavy native libraries (CUDA, MPI, HDF5)
  • Single-source versioning via Git tags + setuptools-scm eliminates version drift between build system and runtime
  • py-spy with --native provides the mixed-mode profiling needed to optimize extension performance

Next in the Series

In Part 33: Professional CMake Project, we'll bring everything together into a complete, production-grade CMake project demonstrating best practices for structure, testing, packaging, documentation, and CI/CD in a real-world application.