Explainability & Attribution Mathematics

Prerequisites: Part 4: Probability (expectations, conditional probability), Part 7: Linear Algebra (linear regression, matrix inversion), Part 8: Calculus (gradients, integrals, chain rule). This extension is the canonical reference for explainability content in AI in the Wild: XAI.

The Attribution Problem

Given a model $f: \mathbb{R}^d \to \mathbb{R}$ and an input $x = (x_1, \ldots, x_d)$ with prediction $f(x)$, the attribution problem asks: how much did each feature $x_i$ contribute to the output?

We want an attribution vector $\phi = (\phi_1, \ldots, \phi_d)$ satisfying desirable properties. Different methods formalize "contribution" differently — game theory, local approximation, or path integrals — leading to Shapley values, LIME, and integrated gradients respectively.

Shapley Values

Axioms & Uniqueness Theorem

Shapley values come from cooperative game theory. Define a "game" with $d$ players (features) and a value function $v: 2^{\{1,\ldots,d\}} \to \mathbb{R}$ where $v(S)$ is the model's output using only features in subset $S$ (other features marginalized out).

Definition. The Shapley value for feature $i$ is:

$$\boxed{\phi_i = \sum_{S \subseteq \{1,\ldots,d\} \setminus \{i\}} \frac{|S|!\,(d-|S|-1)!}{d!} \left[v(S \cup \{i\}) - v(S)\right]}$$

This averages the marginal contribution of feature $i$ over all possible orderings in which features could be "added" to the coalition.

Uniqueness Theorem (Shapley 1953): The Shapley value is the unique attribution method satisfying four axioms:

Axiom	Statement	Intuition
Efficiency	$\sum_{i=1}^d \phi_i = v(\{1,\ldots,d\}) - v(\emptyset)$	Attributions sum to the total prediction minus baseline
Symmetry	If $v(S \cup \{i\}) = v(S \cup \{j\})\ \forall S$, then $\phi_i = \phi_j$	Equal contributors get equal credit
Null Player	If $v(S \cup \{i\}) = v(S)\ \forall S$, then $\phi_i = 0$	Irrelevant features get zero attribution
Linearity	$\phi_i(v_1 + v_2) = \phi_i(v_1) + \phi_i(v_2)$	Additivity across combined games

Computational Complexity

Exact Shapley values require evaluating $v(S)$ for all $2^d$ subsets — exponential in the number of features. For $d = 20$, that's over 1 million evaluations. For modern models with hundreds of features, exact computation is intractable.

Monte Carlo approximation: Sample $m$ random permutations $\pi$ of features. For each permutation, compute the marginal contribution of each feature when added in that order. Average across permutations:

$$\hat{\phi}_i = \frac{1}{m}\sum_{k=1}^m \left[v(S_i^{\pi_k} \cup \{i\}) - v(S_i^{\pi_k})\right]$$

where $S_i^{\pi_k}$ is the set of features appearing before $i$ in permutation $\pi_k$. By the law of large numbers, $\hat{\phi}_i \to \phi_i$ as $m \to \infty$.

KernelSHAP

KernelSHAP approximates Shapley values using weighted linear regression. The key insight: Shapley values are the solution to a specific weighted least-squares problem.

Setup: Represent feature subsets as binary vectors $z \in \{0,1\}^d$ (1 = feature present, 0 = marginalized). Fit a linear model $g(z) = \phi_0 + \sum_i \phi_i z_i$ minimizing:

$$\min_{\phi} \sum_{z \in \{0,1\}^d} \pi(z) \left[f(h_x(z)) - g(z)\right]^2$$

where $h_x(z)$ maps the binary coalition to an actual input (replacing absent features with baseline/marginal values), and the SHAP kernel is:

$$\pi(z) = \frac{d-1}{\binom{d}{|z|}\,|z|\,(d-|z|)}$$

This kernel assigns infinite weight to $|z| = 0$ and $|z| = d$ (enforcing efficiency), and higher weight to smaller/larger coalitions where marginal contributions are most informative.

import numpy as np
from itertools import combinations

def exact_shapley(model_fn, x, baseline, d):
    """Compute exact Shapley values for d features."""
    shapley_values = np.zeros(d)
    
    for i in range(d):
        # Sum over all subsets S not containing i
        others = [j for j in range(d) if j != i]
        phi_i = 0.0
        
        for size in range(d):  # |S| from 0 to d-1
            for S in combinations(others, size):
                S_set = set(S)
                
                # v(S): prediction with features in S, rest = baseline
                x_S = baseline.copy()
                for j in S_set:
                    x_S[j] = x[j]
                v_S = model_fn(x_S)
                
                # v(S ∪ {i}): add feature i
                x_Si = x_S.copy()
                x_Si[i] = x[i]
                v_Si = model_fn(x_Si)
                
                # Weight: |S|!(d-|S|-1)! / d!
                weight = (np.math.factorial(size) * 
                         np.math.factorial(d - size - 1) / 
                         np.math.factorial(d))
                phi_i += weight * (v_Si - v_S)
        
        shapley_values[i] = phi_i
    
    return shapley_values

# Example: simple linear model f(x) = 2*x1 + 3*x2 + 0*x3
def linear_model(x):
    return 2*x[0] + 3*x[1] + 0*x[2]

x = np.array([1.0, 2.0, 5.0])
baseline = np.array([0.0, 0.0, 0.0])

shapley = exact_shapley(linear_model, x, baseline, d=3)
print(f"Model output: f(x) = {linear_model(x):.1f}")
print(f"Baseline:     f(0) = {linear_model(baseline):.1f}")
print(f"Shapley values: {np.round(shapley, 4)}")
print(f"Sum of Shapley: {shapley.sum():.4f} (should = f(x) - f(0) = {linear_model(x) - linear_model(baseline):.1f})")
print(f"\nFeature 3 (coefficient=0): phi_3 = {shapley[2]:.4f} (null player axiom)")

LIME (Local Interpretable Model-agnostic Explanations)

Local Linear Approximation

LIME explains individual predictions by fitting an interpretable model (linear) in the neighborhood of the input. The explanation model $g \in G$ (class of linear models) minimizes:

$$\boxed{\xi(x) = \arg\min_{g \in G}\; \mathcal{L}(f, g, \pi_x) + \Omega(g)}$$

where $\mathcal{L}$ measures fidelity (how well $g$ approximates $f$ near $x$), $\pi_x$ defines the neighborhood, and $\Omega(g)$ is a complexity penalty (e.g., number of non-zero coefficients).

Concrete formulation: Sample $N$ perturbed instances $z_k$ near $x$, weighted by proximity to $x$. Fit a weighted linear regression:

$$\min_{\phi_0, \phi_1, \ldots, \phi_d} \sum_{k=1}^N \pi_x(z_k)\left[f(z_k) - \phi_0 - \sum_{i=1}^d \phi_i z_{k,i}\right]^2$$

Exponential Kernel

The proximity kernel $\pi_x$ controls which perturbed samples matter most. LIME uses an exponential (RBF) kernel:

$$\pi_x(z) = \exp\left(-\frac{D(x, z)^2}{\sigma^2}\right)$$

where $D(x,z)$ is a distance function (cosine distance for text, L2 for tabular data) and $\sigma$ is the kernel width. Closer perturbations get higher weight, making the linear model locally faithful.

Key difference from SHAP: LIME approximates the model locally with no efficiency guarantee (attributions don't necessarily sum to the prediction gap). It trades theoretical guarantees for computational speed and flexibility.

import numpy as np

def lime_explain(model_fn, x, n_samples=1000, kernel_width=0.75):
    """
    LIME explanation for a 1D output model.
    
    Args:
        model_fn: black-box model f(x) -> scalar
        x: input to explain, shape (d,)
        n_samples: number of perturbations
        kernel_width: sigma for exponential kernel
    
    Returns:
        Linear coefficients (feature importances)
    """
    d = len(x)
    
    # Generate perturbations around x
    perturbations = np.random.normal(0, 1, (n_samples, d))
    perturbed_inputs = x + perturbations * 0.5  # scale perturbations
    
    # Get model predictions for perturbations
    predictions = np.array([model_fn(z) for z in perturbed_inputs])
    
    # Compute proximity weights (exponential kernel)
    distances = np.linalg.norm(perturbations, axis=1)
    weights = np.exp(-(distances**2) / (kernel_width**2))
    
    # Weighted linear regression: solve W*(X^T X) phi = W*(X^T y)
    # Add intercept column
    X = np.column_stack([np.ones(n_samples), perturbed_inputs - x])
    W = np.diag(weights)
    
    # Solve: (X^T W X) beta = X^T W y
    XtWX = X.T @ W @ X
    XtWy = X.T @ W @ predictions
    
    # Add small ridge for stability
    beta = np.linalg.solve(XtWX + 1e-6 * np.eye(d+1), XtWy)
    
    return beta[1:]  # exclude intercept

# Example: nonlinear model
def nonlinear_model(x):
    return 3*x[0]**2 + 2*x[1] - x[2]*x[0] + 0.5

x = np.array([1.0, 2.0, 0.5])
np.random.seed(42)

lime_weights = lime_explain(nonlinear_model, x)
print(f"Model at x: f(x) = {nonlinear_model(x):.4f}")
print(f"LIME feature importances (local linear coefficients):")
for i, w in enumerate(lime_weights):
    print(f"  Feature {i+1}: {w:.4f}")
print(f"\nNote: For f(x) = 3x1^2 + 2x2 - x2*x1 + 0.5 at x=[1,2,0.5]:")
print(f"  df/dx1 = 6x1 - x3 = {6*x[0] - x[2]:.1f}")
print(f"  df/dx2 = 2")
print(f"  df/dx3 = -x1 = {-x[0]:.1f}")
print(f"LIME approximates local gradients via linear regression.")

Integrated Gradients

Axioms & Path Integral

Integrated gradients (IG) uses calculus to attribute the prediction difference between an input $x$ and a baseline $x'$ (typically zeros or a blank image). The attribution for feature $i$ is:

$$\boxed{\text{IG}_i(x) = (x_i - x'_i) \times \int_0^1 \frac{\partial f(x' + \alpha(x - x'))}{\partial x_i}\, d\alpha}$$

Intuition: Walk along a straight line from baseline $x'$ to input $x$. At each point on the path, compute the gradient. The integral accumulates how much the model's output increases due to feature $i$ along this path.

Key axioms satisfied:

Completeness (Efficiency): $\sum_{i=1}^d \text{IG}_i(x) = f(x) - f(x')$ — attributions sum to the prediction difference
Sensitivity: If $x$ and $x'$ differ only in feature $i$ and $f(x) \neq f(x')$, then $\text{IG}_i(x) \neq 0$
Implementation Invariance: Two networks with identical outputs get identical attributions (regardless of architecture)

Proof of completeness (fundamental theorem of calculus):

$$\sum_i \text{IG}_i(x) = \sum_i (x_i - x'_i)\int_0^1 \frac{\partial f}{\partial x_i}\bigg|_{x'+\alpha(x-x')} d\alpha = \int_0^1 \nabla f \cdot (x-x')\, d\alpha = f(x) - f(x')$$

The last step uses $\frac{d}{d\alpha}f(x' + \alpha(x-x')) = \nabla f \cdot (x-x')$ and the fundamental theorem: $\int_0^1 \frac{d}{d\alpha}f\, d\alpha = f(x) - f(x')$.

Riemann Approximation

In practice, the integral is approximated with $m$ steps (typically $m = 50\text{–}300$):

$$\text{IG}_i(x) \approx (x_i - x'_i) \times \frac{1}{m}\sum_{k=1}^m \frac{\partial f}{\partial x_i}\bigg|_{x' + \frac{k}{m}(x - x')}$$

This requires $m$ forward passes through the model to compute gradients at each interpolation point.

import numpy as np

def integrated_gradients(model_fn, grad_fn, x, baseline, m=50):
    """
    Compute integrated gradients attribution.
    
    Args:
        model_fn: model f(x) -> scalar
        grad_fn: gradient function nabla_f(x) -> array of shape (d,)
        x: input to explain, shape (d,)
        baseline: reference input, shape (d,)
        m: number of interpolation steps
    
    Returns:
        Attribution vector, shape (d,)
    """
    # Generate interpolation points: baseline + alpha*(x - baseline)
    alphas = np.linspace(1/m, 1.0, m)  # avoid alpha=0 for stability
    
    # Accumulate gradients along the path
    avg_gradients = np.zeros_like(x)
    for alpha in alphas:
        interpolated = baseline + alpha * (x - baseline)
        avg_gradients += grad_fn(interpolated)
    avg_gradients /= m
    
    # IG = (x - baseline) * average_gradient_along_path
    attributions = (x - baseline) * avg_gradients
    return attributions

# Example: quadratic model f(x) = x1^2 + 2*x1*x2 + x2^2
def quadratic_model(x):
    return x[0]**2 + 2*x[0]*x[1] + x[1]**2

def quadratic_grad(x):
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 2*x[1]])

x = np.array([3.0, 2.0])
baseline = np.array([0.0, 0.0])

ig = integrated_gradients(quadratic_model, quadratic_grad, x, baseline, m=100)

print(f"f(x) = {quadratic_model(x):.1f}")
print(f"f(baseline) = {quadratic_model(baseline):.1f}")
print(f"Prediction gap: {quadratic_model(x) - quadratic_model(baseline):.1f}")
print(f"\nIntegrated Gradients: {np.round(ig, 4)}")
print(f"Sum of IG: {ig.sum():.4f} (should = prediction gap = {quadratic_model(x) - quadratic_model(baseline):.1f})")

# Verify completeness axiom
print(f"\nCompleteness check: sum(IG) == f(x) - f(x')?")
print(f"  {ig.sum():.6f} ≈ {quadratic_model(x) - quadratic_model(baseline):.6f} ✓")

Influence Functions

While Shapley/LIME/IG attribute importance to features, influence functions attribute importance to training data points: "which training example most influenced this prediction?"

Setup: Let $\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$ be the trained parameters. The influence of upweighting training point $z_j$ by $\epsilon$ on a test prediction is:

$$\mathcal{I}(z_j, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^T\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z_j, \hat\theta)$$

where $H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta)$ is the Hessian of the training loss at the optimum.

Interpretation: This measures how much the test loss would change if we infinitesimally upweighted a training point — without retraining. The $H^{-1}$ term accounts for how the parameters would shift.

Practical challenges:

Computing $H^{-1}$ is $O(p^3)$ where $p$ = number of parameters (intractable for large models)
Approximations use Hessian-vector products (HVPs) via conjugate gradient or stochastic estimation
Deep networks are often not strictly convex, making $H$ poorly conditioned — damping term $H + \lambda I$ required

Connections & Limitations

Method	Foundation	Guarantees	Cost	Model Access
Shapley / SHAP	Game theory	Efficiency + uniqueness axioms	Exponential (exact) or O(N·d) approx	Black-box
LIME	Local linear approximation	Local fidelity (no efficiency)	O(N·d) for N samples	Black-box
Integrated Gradients	Path integral (calculus)	Completeness + sensitivity	O(m) forward passes	Gradient access (white-box)
Influence Functions	Implicit differentiation	Leave-one-out approximation	O(p) per HVP	Full model access

Key limitations:
            Baseline dependence: Both SHAP and IG depend on the choice of baseline/reference. Different baselines give different attributions.
Feature correlation: When features are correlated, marginalizing one feature while holding others fixed creates unrealistic inputs, potentially misleading attributions.
Faithfulness vs interpretability: High-fidelity explanations (SHAP) may be complex; interpretable explanations (LIME top-k) may be unfaithful.
Attention ≠ explanation: Attention weights are not reliable attributions — they can be permuted without changing outputs (Jain & Wallace, 2019).

        

ExerciseTheory

Verify Shapley Uniqueness

For a 3-player game with $v(\{1\}) = 5$, $v(\{2\}) = 3$, $v(\{1,2\}) = 12$, $v(\{3\}) = 0$, $v(\{1,3\}) = 5$, $v(\{2,3\}) = 3$, $v(\{1,2,3\}) = 15$, $v(\emptyset) = 0$:

1. Compute exact Shapley values. 2. Verify efficiency: $\phi_1 + \phi_2 + \phi_3 = v(\{1,2,3\})$. 3. Verify player 3 is a null player ($\phi_3 = 0$).

Table of Contents