The Attribution Problem
Given a model $f: \mathbb{R}^d \to \mathbb{R}$ and an input $x = (x_1, \ldots, x_d)$ with prediction $f(x)$, the attribution problem asks: how much did each feature $x_i$ contribute to the output?
We want an attribution vector $\phi = (\phi_1, \ldots, \phi_d)$ satisfying desirable properties. Different methods formalize "contribution" differently — game theory, local approximation, or path integrals — leading to Shapley values, LIME, and integrated gradients respectively.
Shapley Values
Axioms & Uniqueness Theorem
Shapley values come from cooperative game theory. Define a "game" with $d$ players (features) and a value function $v: 2^{\{1,\ldots,d\}} \to \mathbb{R}$ where $v(S)$ is the model's output using only features in subset $S$ (other features marginalized out).
Definition. The Shapley value for feature $i$ is:
$$\boxed{\phi_i = \sum_{S \subseteq \{1,\ldots,d\} \setminus \{i\}} \frac{|S|!\,(d-|S|-1)!}{d!} \left[v(S \cup \{i\}) - v(S)\right]}$$This averages the marginal contribution of feature $i$ over all possible orderings in which features could be "added" to the coalition.
Uniqueness Theorem (Shapley 1953): The Shapley value is the unique attribution method satisfying four axioms:
| Axiom | Statement | Intuition |
|---|---|---|
| Efficiency | $\sum_{i=1}^d \phi_i = v(\{1,\ldots,d\}) - v(\emptyset)$ | Attributions sum to the total prediction minus baseline |
| Symmetry | If $v(S \cup \{i\}) = v(S \cup \{j\})\ \forall S$, then $\phi_i = \phi_j$ | Equal contributors get equal credit |
| Null Player | If $v(S \cup \{i\}) = v(S)\ \forall S$, then $\phi_i = 0$ | Irrelevant features get zero attribution |
| Linearity | $\phi_i(v_1 + v_2) = \phi_i(v_1) + \phi_i(v_2)$ | Additivity across combined games |
Computational Complexity
Exact Shapley values require evaluating $v(S)$ for all $2^d$ subsets — exponential in the number of features. For $d = 20$, that's over 1 million evaluations. For modern models with hundreds of features, exact computation is intractable.
Monte Carlo approximation: Sample $m$ random permutations $\pi$ of features. For each permutation, compute the marginal contribution of each feature when added in that order. Average across permutations:
$$\hat{\phi}_i = \frac{1}{m}\sum_{k=1}^m \left[v(S_i^{\pi_k} \cup \{i\}) - v(S_i^{\pi_k})\right]$$where $S_i^{\pi_k}$ is the set of features appearing before $i$ in permutation $\pi_k$. By the law of large numbers, $\hat{\phi}_i \to \phi_i$ as $m \to \infty$.
KernelSHAP
KernelSHAP approximates Shapley values using weighted linear regression. The key insight: Shapley values are the solution to a specific weighted least-squares problem.
Setup: Represent feature subsets as binary vectors $z \in \{0,1\}^d$ (1 = feature present, 0 = marginalized). Fit a linear model $g(z) = \phi_0 + \sum_i \phi_i z_i$ minimizing:
$$\min_{\phi} \sum_{z \in \{0,1\}^d} \pi(z) \left[f(h_x(z)) - g(z)\right]^2$$where $h_x(z)$ maps the binary coalition to an actual input (replacing absent features with baseline/marginal values), and the SHAP kernel is:
$$\pi(z) = \frac{d-1}{\binom{d}{|z|}\,|z|\,(d-|z|)}$$This kernel assigns infinite weight to $|z| = 0$ and $|z| = d$ (enforcing efficiency), and higher weight to smaller/larger coalitions where marginal contributions are most informative.
import numpy as np
from itertools import combinations
def exact_shapley(model_fn, x, baseline, d):
"""Compute exact Shapley values for d features."""
shapley_values = np.zeros(d)
for i in range(d):
# Sum over all subsets S not containing i
others = [j for j in range(d) if j != i]
phi_i = 0.0
for size in range(d): # |S| from 0 to d-1
for S in combinations(others, size):
S_set = set(S)
# v(S): prediction with features in S, rest = baseline
x_S = baseline.copy()
for j in S_set:
x_S[j] = x[j]
v_S = model_fn(x_S)
# v(S ∪ {i}): add feature i
x_Si = x_S.copy()
x_Si[i] = x[i]
v_Si = model_fn(x_Si)
# Weight: |S|!(d-|S|-1)! / d!
weight = (np.math.factorial(size) *
np.math.factorial(d - size - 1) /
np.math.factorial(d))
phi_i += weight * (v_Si - v_S)
shapley_values[i] = phi_i
return shapley_values
# Example: simple linear model f(x) = 2*x1 + 3*x2 + 0*x3
def linear_model(x):
return 2*x[0] + 3*x[1] + 0*x[2]
x = np.array([1.0, 2.0, 5.0])
baseline = np.array([0.0, 0.0, 0.0])
shapley = exact_shapley(linear_model, x, baseline, d=3)
print(f"Model output: f(x) = {linear_model(x):.1f}")
print(f"Baseline: f(0) = {linear_model(baseline):.1f}")
print(f"Shapley values: {np.round(shapley, 4)}")
print(f"Sum of Shapley: {shapley.sum():.4f} (should = f(x) - f(0) = {linear_model(x) - linear_model(baseline):.1f})")
print(f"\nFeature 3 (coefficient=0): phi_3 = {shapley[2]:.4f} (null player axiom)")
LIME (Local Interpretable Model-agnostic Explanations)
Local Linear Approximation
LIME explains individual predictions by fitting an interpretable model (linear) in the neighborhood of the input. The explanation model $g \in G$ (class of linear models) minimizes:
$$\boxed{\xi(x) = \arg\min_{g \in G}\; \mathcal{L}(f, g, \pi_x) + \Omega(g)}$$where $\mathcal{L}$ measures fidelity (how well $g$ approximates $f$ near $x$), $\pi_x$ defines the neighborhood, and $\Omega(g)$ is a complexity penalty (e.g., number of non-zero coefficients).
Concrete formulation: Sample $N$ perturbed instances $z_k$ near $x$, weighted by proximity to $x$. Fit a weighted linear regression:
$$\min_{\phi_0, \phi_1, \ldots, \phi_d} \sum_{k=1}^N \pi_x(z_k)\left[f(z_k) - \phi_0 - \sum_{i=1}^d \phi_i z_{k,i}\right]^2$$Exponential Kernel
The proximity kernel $\pi_x$ controls which perturbed samples matter most. LIME uses an exponential (RBF) kernel:
$$\pi_x(z) = \exp\left(-\frac{D(x, z)^2}{\sigma^2}\right)$$where $D(x,z)$ is a distance function (cosine distance for text, L2 for tabular data) and $\sigma$ is the kernel width. Closer perturbations get higher weight, making the linear model locally faithful.
Key difference from SHAP: LIME approximates the model locally with no efficiency guarantee (attributions don't necessarily sum to the prediction gap). It trades theoretical guarantees for computational speed and flexibility.
import numpy as np
def lime_explain(model_fn, x, n_samples=1000, kernel_width=0.75):
"""
LIME explanation for a 1D output model.
Args:
model_fn: black-box model f(x) -> scalar
x: input to explain, shape (d,)
n_samples: number of perturbations
kernel_width: sigma for exponential kernel
Returns:
Linear coefficients (feature importances)
"""
d = len(x)
# Generate perturbations around x
perturbations = np.random.normal(0, 1, (n_samples, d))
perturbed_inputs = x + perturbations * 0.5 # scale perturbations
# Get model predictions for perturbations
predictions = np.array([model_fn(z) for z in perturbed_inputs])
# Compute proximity weights (exponential kernel)
distances = np.linalg.norm(perturbations, axis=1)
weights = np.exp(-(distances**2) / (kernel_width**2))
# Weighted linear regression: solve W*(X^T X) phi = W*(X^T y)
# Add intercept column
X = np.column_stack([np.ones(n_samples), perturbed_inputs - x])
W = np.diag(weights)
# Solve: (X^T W X) beta = X^T W y
XtWX = X.T @ W @ X
XtWy = X.T @ W @ predictions
# Add small ridge for stability
beta = np.linalg.solve(XtWX + 1e-6 * np.eye(d+1), XtWy)
return beta[1:] # exclude intercept
# Example: nonlinear model
def nonlinear_model(x):
return 3*x[0]**2 + 2*x[1] - x[2]*x[0] + 0.5
x = np.array([1.0, 2.0, 0.5])
np.random.seed(42)
lime_weights = lime_explain(nonlinear_model, x)
print(f"Model at x: f(x) = {nonlinear_model(x):.4f}")
print(f"LIME feature importances (local linear coefficients):")
for i, w in enumerate(lime_weights):
print(f" Feature {i+1}: {w:.4f}")
print(f"\nNote: For f(x) = 3x1^2 + 2x2 - x2*x1 + 0.5 at x=[1,2,0.5]:")
print(f" df/dx1 = 6x1 - x3 = {6*x[0] - x[2]:.1f}")
print(f" df/dx2 = 2")
print(f" df/dx3 = -x1 = {-x[0]:.1f}")
print(f"LIME approximates local gradients via linear regression.")
Integrated Gradients
Axioms & Path Integral
Integrated gradients (IG) uses calculus to attribute the prediction difference between an input $x$ and a baseline $x'$ (typically zeros or a blank image). The attribution for feature $i$ is:
$$\boxed{\text{IG}_i(x) = (x_i - x'_i) \times \int_0^1 \frac{\partial f(x' + \alpha(x - x'))}{\partial x_i}\, d\alpha}$$Intuition: Walk along a straight line from baseline $x'$ to input $x$. At each point on the path, compute the gradient. The integral accumulates how much the model's output increases due to feature $i$ along this path.
Key axioms satisfied:
- Completeness (Efficiency): $\sum_{i=1}^d \text{IG}_i(x) = f(x) - f(x')$ — attributions sum to the prediction difference
- Sensitivity: If $x$ and $x'$ differ only in feature $i$ and $f(x) \neq f(x')$, then $\text{IG}_i(x) \neq 0$
- Implementation Invariance: Two networks with identical outputs get identical attributions (regardless of architecture)
Proof of completeness (fundamental theorem of calculus):
$$\sum_i \text{IG}_i(x) = \sum_i (x_i - x'_i)\int_0^1 \frac{\partial f}{\partial x_i}\bigg|_{x'+\alpha(x-x')} d\alpha = \int_0^1 \nabla f \cdot (x-x')\, d\alpha = f(x) - f(x')$$The last step uses $\frac{d}{d\alpha}f(x' + \alpha(x-x')) = \nabla f \cdot (x-x')$ and the fundamental theorem: $\int_0^1 \frac{d}{d\alpha}f\, d\alpha = f(x) - f(x')$.
Riemann Approximation
In practice, the integral is approximated with $m$ steps (typically $m = 50\text{–}300$):
$$\text{IG}_i(x) \approx (x_i - x'_i) \times \frac{1}{m}\sum_{k=1}^m \frac{\partial f}{\partial x_i}\bigg|_{x' + \frac{k}{m}(x - x')}$$This requires $m$ forward passes through the model to compute gradients at each interpolation point.
import numpy as np
def integrated_gradients(model_fn, grad_fn, x, baseline, m=50):
"""
Compute integrated gradients attribution.
Args:
model_fn: model f(x) -> scalar
grad_fn: gradient function nabla_f(x) -> array of shape (d,)
x: input to explain, shape (d,)
baseline: reference input, shape (d,)
m: number of interpolation steps
Returns:
Attribution vector, shape (d,)
"""
# Generate interpolation points: baseline + alpha*(x - baseline)
alphas = np.linspace(1/m, 1.0, m) # avoid alpha=0 for stability
# Accumulate gradients along the path
avg_gradients = np.zeros_like(x)
for alpha in alphas:
interpolated = baseline + alpha * (x - baseline)
avg_gradients += grad_fn(interpolated)
avg_gradients /= m
# IG = (x - baseline) * average_gradient_along_path
attributions = (x - baseline) * avg_gradients
return attributions
# Example: quadratic model f(x) = x1^2 + 2*x1*x2 + x2^2
def quadratic_model(x):
return x[0]**2 + 2*x[0]*x[1] + x[1]**2
def quadratic_grad(x):
return np.array([2*x[0] + 2*x[1], 2*x[0] + 2*x[1]])
x = np.array([3.0, 2.0])
baseline = np.array([0.0, 0.0])
ig = integrated_gradients(quadratic_model, quadratic_grad, x, baseline, m=100)
print(f"f(x) = {quadratic_model(x):.1f}")
print(f"f(baseline) = {quadratic_model(baseline):.1f}")
print(f"Prediction gap: {quadratic_model(x) - quadratic_model(baseline):.1f}")
print(f"\nIntegrated Gradients: {np.round(ig, 4)}")
print(f"Sum of IG: {ig.sum():.4f} (should = prediction gap = {quadratic_model(x) - quadratic_model(baseline):.1f})")
# Verify completeness axiom
print(f"\nCompleteness check: sum(IG) == f(x) - f(x')?")
print(f" {ig.sum():.6f} ≈ {quadratic_model(x) - quadratic_model(baseline):.6f} ✓")
Influence Functions
While Shapley/LIME/IG attribute importance to features, influence functions attribute importance to training data points: "which training example most influenced this prediction?"
Setup: Let $\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$ be the trained parameters. The influence of upweighting training point $z_j$ by $\epsilon$ on a test prediction is:
$$\mathcal{I}(z_j, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^T\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z_j, \hat\theta)$$where $H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta)$ is the Hessian of the training loss at the optimum.
Interpretation: This measures how much the test loss would change if we infinitesimally upweighted a training point — without retraining. The $H^{-1}$ term accounts for how the parameters would shift.
Practical challenges:
- Computing $H^{-1}$ is $O(p^3)$ where $p$ = number of parameters (intractable for large models)
- Approximations use Hessian-vector products (HVPs) via conjugate gradient or stochastic estimation
- Deep networks are often not strictly convex, making $H$ poorly conditioned — damping term $H + \lambda I$ required
Connections & Limitations
| Method | Foundation | Guarantees | Cost | Model Access |
|---|---|---|---|---|
| Shapley / SHAP | Game theory | Efficiency + uniqueness axioms | Exponential (exact) or O(N·d) approx | Black-box |
| LIME | Local linear approximation | Local fidelity (no efficiency) | O(N·d) for N samples | Black-box |
| Integrated Gradients | Path integral (calculus) | Completeness + sensitivity | O(m) forward passes | Gradient access (white-box) |
| Influence Functions | Implicit differentiation | Leave-one-out approximation | O(p) per HVP | Full model access |
- Baseline dependence: Both SHAP and IG depend on the choice of baseline/reference. Different baselines give different attributions.
- Feature correlation: When features are correlated, marginalizing one feature while holding others fixed creates unrealistic inputs, potentially misleading attributions.
- Faithfulness vs interpretability: High-fidelity explanations (SHAP) may be complex; interpretable explanations (LIME top-k) may be unfaithful.
- Attention ≠ explanation: Attention weights are not reliable attributions — they can be permuted without changing outputs (Jain & Wallace, 2019).
Verify Shapley Uniqueness
For a 3-player game with $v(\{1\}) = 5$, $v(\{2\}) = 3$, $v(\{1,2\}) = 12$, $v(\{3\}) = 0$, $v(\{1,3\}) = 5$, $v(\{2,3\}) = 3$, $v(\{1,2,3\}) = 15$, $v(\emptyset) = 0$:
1. Compute exact Shapley values. 2. Verify efficiency: $\phi_1 + \phi_2 + \phi_3 = v(\{1,2,3\})$. 3. Verify player 3 is a null player ($\phi_3 = 0$).