About This Series
This is Part 18 of the AI in the Wild: Real-World Applications & Ethics series — a 24-part deep dive covering the complete end-to-end AI journey, from ML foundations through to responsible AI governance.
As AI systems take consequential decisions in credit, healthcare, law, and employment, the ability to explain and interpret model behaviour is no longer optional — it is a regulatory requirement, an engineering discipline, and a prerequisite for responsible deployment.
Explainable AI (XAI) is a collection of techniques and practices that make the behaviour of machine learning models understandable to human stakeholders — whether those stakeholders are data scientists debugging a model, regulators auditing it for compliance, clinicians deciding whether to act on a medical AI recommendation, or customers challenging an adverse decision. The field emerged from a central tension in modern ML: the most accurate models (deep neural networks, gradient boosted trees, large ensembles) are also the most opaque, while the most interpretable models (linear regression, decision trees) tend to be less accurate on complex, high-dimensional tasks.
This tension has intensified as AI moves from research into consequential deployments. The EU General Data Protection Regulation (GDPR) requires that automated decisions with significant effects on individuals must be explainable. The EU AI Act (2024) classifies many AI applications as high-risk and mandates transparency, auditability, and human oversight. ECOA and fair lending laws in the US require that credit decisions be explainable. These regulatory requirements, combined with the engineering need to debug and improve models, have driven XAI from a research curiosity to a production engineering discipline.
The field distinguishes between two fundamentally different approaches: intrinsically interpretable models, which are transparent by construction (linear models, decision trees, GAMs), and post-hoc explanation methods, which approximate the behaviour of an already-trained black-box model (SHAP, LIME, gradient-based attribution).
Critical distinction: "This SHAP value shows that feature X contributed +0.3 to the prediction" is a statement about the SHAP approximation, not a statement about the model's true causal mechanism. Conflating explanation quality with ground truth is the most common misuse of XAI tools.
A second fundamental axis: local explanations explain a specific prediction (why was this loan application rejected?), while global explanations characterize the model's overall behaviour (what features does this model rely on most across all predictions?). Both are necessary but serve different purposes. Local explanations serve individual stakeholders (the loan applicant, the compliance officer reviewing a specific case). Global explanations serve model developers (identify data quality issues, spot biases, understand model behaviour at deployment).
Most practical XAI deployments require both. A credit scoring system typically provides: (1) a global SHAP summary plot showing the five most important features across the population, and (2) a local SHAP waterfall plot showing the specific factors that influenced each individual decision — both to regulators who audit the model and to customers who have a right to understand adverse decisions.
Post-hoc methods are the workhorses of applied XAI. They require no modification to the model architecture or training procedure — they operate on any model that can produce predictions given inputs. This model-agnosticism is both their strength (universally applicable) and a limitation (approximations may be inaccurate for highly nonlinear or discontinuous models).
SHAP (SHapley Additive exPlanations) is the most widely adopted XAI method in production, used by Salesforce, Airbnb, Microsoft, healthcare systems, and financial institutions worldwide. It roots explanations in cooperative game theory: the Shapley value of each feature is the average marginal contribution of that feature across all possible orderings in which features could be added to the prediction. This formulation satisfies four desirable axioms: efficiency (attributions sum to the difference between prediction and expected value), symmetry (identical features get identical attributions), dummy (irrelevant features get zero attribution), and additivity (explanations from sub-models add correctly).
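The Shapley definition can be verified by brute force on a toy additive model — a sketch with invented contribution values, enumerating every feature ordering and checking the efficiency axiom:

```python
from itertools import permutations

# Toy "model": prediction given a subset of known features; absent features
# fall back to the baseline. Contribution values are invented for illustration.
contrib = {"credit_score": -0.35, "previous_defaults": 0.28, "debt_ratio": 0.10}
baseline = 0.05  # expected model output with no features known

def predict(known):
    return baseline + sum(contrib[f] for f in known)

features = list(contrib)

def shapley(target):
    # Average marginal contribution of `target` over all feature orderings
    total = 0.0
    orderings = list(permutations(features))
    for order in orderings:
        before = set(order[:order.index(target)])
        total += predict(before | {target}) - predict(before)
    return total / len(orderings)

phi = {f: shapley(f) for f in features}
print(phi)
# Efficiency axiom: attributions sum to prediction minus expected value
assert abs(sum(phi.values()) - (predict(set(features)) - baseline)) < 1e-9
```

For a purely additive model like this one, each Shapley value collapses to the feature's own contribution; the brute-force enumeration becomes interesting (and expensive) once features interact, which is exactly what TreeSHAP and KernelSHAP approximate efficiently.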
import shap
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# SHAP (SHapley Additive exPlanations) — game theory-based feature attribution
# Used by: Salesforce, Airbnb, healthcare risk models, financial ML

feature_names = ['credit_score', 'income', 'debt_ratio', 'employment_years',
                 'previous_defaults', 'loan_amount', 'loan_term', 'purpose']

# Synthetic stand-in data — replace with a real loan-default dataset
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(1000, 8)), columns=feature_names)
y_train = (rng.random(1000) < 0.2).astype(int)
X_test = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_names)

# Train XGBoost on loan default prediction
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)

# SHAP explainer — TreeSHAP is exact and fast for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # shape: (n_samples, n_features)

# 1. Global explanation: mean |SHAP| across all samples
shap.summary_plot(shap_values, X_test, feature_names=feature_names, plot_type="bar")
# Illustrative output on a real loan dataset:
# credit_score (0.42) > previous_defaults (0.31) > debt_ratio (0.18) > ...

# 2. Local explanation: why was THIS loan rejected?
shap.waterfall_plot(shap.Explanation(
    values=shap_values[42],                # sample 42
    base_values=explainer.expected_value,
    data=X_test.iloc[42].values,
    feature_names=feature_names
))
# Shows e.g.: credit_score → -0.35 (negative: reduces default risk),
#             previous_defaults → +0.28 (positive: increases default risk)

# 3. Dependence plot: how does credit_score affect predictions?
shap.dependence_plot("credit_score", shap_values, X_test, feature_names=feature_names)
# On real data this reveals a non-linear relationship:
# scores < 600 dramatically increase default risk
LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al., 2016) takes a different approach: rather than computing exact feature attributions mathematically, it approximates the model locally around a specific prediction. The algorithm samples from a neighbourhood around the input, queries the black-box model for predictions, weights samples by their proximity to the original input, and fits a simple interpretable model (usually logistic regression or a decision tree) to these weighted samples. The local model's coefficients are then presented as the explanation.
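This sample–query–weight–fit loop can be sketched from scratch in a few lines of NumPy — the quadratic "black box" here is an invented stand-in, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in black-box model: nonlinear in x0, linear in x1
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

x0 = np.array([2.0, 1.0])  # instance to explain

# 1. Sample a neighbourhood around the instance
X_pert = x0 + rng.normal(scale=0.1, size=(5000, 2))
# 2. Query the black box
y = black_box(X_pert)
# 3. Weight samples by proximity (Gaussian kernel)
w = np.exp(-np.sum((X_pert - x0) ** 2, axis=1) / (2 * 0.1 ** 2))
# 4. Fit a weighted linear surrogate (weighted least squares)
A = np.hstack([X_pert, np.ones((len(X_pert), 1))])  # add intercept column
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
print(coef[:2])  # local "explanation": ≈ [4.0, 3.0], the gradient of f at x0
```

The surrogate coefficients recover the local slope of the black box; shrinking or widening the kernel width changes what "local" means, which is exactly the hyperparameter sensitivity discussed below.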
from lime import lime_tabular, lime_text
import numpy as np

# LIME: Local Interpretable Model-agnostic Explanations
# Approximates any black-box model locally with an interpretable model
# Assumes X_train, X_test, feature_names, and the fitted `model`
# from the SHAP example above

# Tabular explanation
explainer_tabular = lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=feature_names,
    class_names=['No Default', 'Default'],
    mode='classification',
    discretize_continuous=True  # converts continuous features to ranges
)

# Explain a single prediction
instance = X_test.iloc[0]
explanation = explainer_tabular.explain_instance(
    instance.values,
    model.predict_proba,
    num_features=5,    # top 5 contributing features
    num_samples=1000   # local neighbourhood samples
)

print("Prediction explanation:")
for feature, weight in explanation.as_list():
    direction = "↑ risk" if weight > 0 else "↓ risk"
    print(f"  {feature}: {weight:.3f} ({direction})")

# Text classification explanation (assumes a fitted text classifier
# `spam_model` exposing predict_proba over raw strings)
text_explainer = lime_text.LimeTextExplainer(class_names=['Not Spam', 'Spam'])
text_exp = text_explainer.explain_instance(
    "Congratulations! You've won $1000!!!",
    spam_model.predict_proba,
    num_features=5
)
# Illustrative highlights: "won" (+0.38), "$1000" (+0.32), "Congratulations" (+0.21)
LIME's key advantages over SHAP: (1) it works natively for text and image modalities with domain-appropriate perturbation strategies (word masking for text, superpixel masking for images), (2) it is computationally lighter than KernelSHAP, and (3) its local linear approximation is highly intuitive. Its limitations: LIME explanations are inherently unstable (different random seeds produce different neighbourhoods and different explanations), and the neighbourhood definition — what counts as "local" — is a hyper-parameter with significant impact on explanation quality.
Transformer models compute attention weights between all pairs of tokens in a sequence. These weights are often visualized as heatmaps to understand which tokens the model "attends to" when making predictions. Attention visualisation is widely used in NLP for debugging models, understanding cross-lingual transfer, and generating rationale-style explanations for document classification. In vision transformers (ViT), attention maps show which image patches are attended to when classifying an image.
from transformers import BertTokenizer, BertModel
import torch
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize attention patterns — which tokens attend to which
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
model.eval()

text = "The bank approved the loan despite the poor credit history."
inputs = tokenizer(text, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)

# Extract attention weights: (batch, heads, seq_len, seq_len)
attention = outputs.attentions[-1][0]  # last layer
# Average across heads for visualization
avg_attention = attention.mean(dim=0).numpy()

plt.figure(figsize=(10, 8))
sns.heatmap(avg_attention, xticklabels=tokens, yticklabels=tokens,
            cmap='Blues', vmin=0, vmax=avg_attention.max())
plt.title("BERT Attention: Last Layer (averaged over heads)")
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=150)

# Note: attention ≠ explanation — high attention to a token doesn't mean it causes the prediction
# Mechanistic interpretability goes deeper: circuits, features, activations
Integrated Gradients (Sundararajan et al., 2017) provides a theoretically grounded gradient-based attribution: it integrates the gradient of the output with respect to each input feature along a straight-line path from a baseline (typically zero or a neutral input) to the actual input. Unlike vanilla gradients (which are locally computed and can be misleading near saturated activations), integrated gradients satisfies the Sensitivity and Implementation Invariance axioms. It is the attribution method of choice for differentiable models in production — used in Google Cloud's Explainable AI service, in Captum (PyTorch's XAI library), and increasingly as a standard attribution method in medical imaging AI.
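The path integral can be approximated with a Riemann sum — a minimal sketch on a hand-differentiable toy function (the function, input, and step count are invented for illustration). The completeness property — attributions summing to f(x) − f(baseline) — falls out directly:

```python
import numpy as np

def f(x):
    # Toy differentiable "model": f(x) = x0^2 + 3*x1
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    # Analytic gradient of the toy model (autodiff in real use)
    return np.array([2.0 * x[0], 3.0])

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum along the straight path baseline -> x
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(x, baseline)
print(attr)  # ≈ [4.0, 3.0]
# Completeness: attributions sum to f(x) - f(baseline)
assert abs(attr.sum() - (f(x) - f(baseline))) < 1e-6
```

In real models the analytic `grad_f` is replaced by a framework's autodiff (e.g. Captum wraps exactly this loop around PyTorch backward passes), and the number of steps trades accuracy against compute.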
GradCAM (Gradient-weighted Class Activation Mapping) is the dominant explanation method for CNNs: it computes the gradient of the target class score with respect to the feature maps of the last convolutional layer, then uses the global average of these gradients to weight the feature maps and produce a coarse heatmap over the input image. GradCAM is widely used in medical imaging to highlight the regions of an X-ray, MRI, or histology slide that most influenced a diagnostic prediction. It is computationally inexpensive and produces visually interpretable spatial maps — but only applicable to CNNs, not to fully connected networks or transformers without modification.
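The core GradCAM computation is only a few lines. The sketch below runs the weighting step on synthetic feature maps and gradients — a real implementation would capture both from the last convolutional layer via framework hooks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the last conv layer's activations A and the
# gradients dScore/dA — shapes (channels, height, width)
K, H, W = 8, 7, 7
A = rng.random((K, H, W))
grads = rng.normal(size=(K, H, W))

# 1. Channel weights: global average pooling of the gradients
weights = grads.mean(axis=(1, 2))                        # shape (K,)
# 2. Weighted combination of feature maps, then ReLU
cam = np.maximum(np.tensordot(weights, A, axes=1), 0.0)  # shape (H, W)
# 3. Normalise to [0, 1] for overlay on the input image
cam = cam / (cam.max() + 1e-8)
print(cam.shape)  # coarse 7x7 heatmap, upsampled to image size in practice
```

The heatmap's resolution is that of the final feature maps (here 7×7), which is why GradCAM overlays look coarse: the map is bilinearly upsampled onto the input image.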
Mechanistic interpretability (MI) is a research frontier that aims at a different, deeper question than post-hoc explanation methods: not "which input features influenced this prediction?" but "what computational algorithm did the network implement to produce this prediction?" MI researchers seek to reverse-engineer the learned algorithms inside neural networks — understanding them at the level of circuits, features, and computations rather than input-output correlations.
The circuits paradigm, developed by Chris Olah and collaborators (Olah et al., 2020, continued at Anthropic), proposes that neural networks are composed of discrete computational subgraphs (circuits) that implement specific algorithms. Landmark discoveries include curve detectors in InceptionV1's early vision layers, induction heads that implement in-context copying in transformers, and the indirect-object-identification circuit in GPT-2.
A major obstacle to mechanistic interpretability is superposition: neural networks represent more features than they have neurons by encoding multiple features as non-orthogonal directions in activation space, relying on the sparsity of real-world data to prevent interference. This means individual neurons are rarely monosemantic (responding to a single concept) — they are polysemantic, activated by many unrelated concepts.
Sparse Autoencoders (SAEs) emerged in 2023–2024 as a tool for decomposing polysemantic neuron activations into monosemantic features. An SAE is trained to reconstruct activation vectors from a sparse combination of learned feature directions, enforcing sparsity through an L1 penalty. Anthropic's 2024 work on Claude applied SAEs at scale and discovered millions of interpretable features (including features corresponding to specific people, places, concepts, and even safety-relevant features like "Assistant" identity). This represents the state of the art in understanding what large language models internally represent.
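A minimal SAE sketch in NumPy, not Anthropic's production setup: synthetic "activations" are built as sparse combinations of more directions than dimensions (mimicking superposition), then reconstructed under an L1 penalty with hand-written gradients. All dimensions and hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: sparse mixtures of 16 ground-truth directions
# embedded in 8 dimensions (superposition: more features than dimensions)
d, m, N = 8, 16, 512
true_dirs = rng.normal(size=(m, d)) / np.sqrt(d)
codes = rng.random((N, m)) * (rng.random((N, m)) < 0.1)  # ~10% active
X = codes @ true_dirs

# Sparse autoencoder: f = relu(x @ We.T + be), x_hat = f @ Wd
We = 0.1 * rng.normal(size=(m, d)); be = np.zeros(m)
Wd = 0.1 * rng.normal(size=(m, d))
lam, lr = 1e-3, 0.1

losses = []
for step in range(300):
    pre = X @ We.T + be
    f = np.maximum(pre, 0.0)
    X_hat = f @ Wd
    err = X_hat - X
    loss = (err ** 2).sum(axis=1).mean() + lam * np.abs(f).sum(axis=1).mean()
    losses.append(loss)
    # Manual backprop of reconstruction + L1 sparsity loss
    dX_hat = 2.0 * err / N
    dWd = f.T @ dX_hat
    df = dX_hat @ Wd.T + lam * np.sign(f) / N
    dpre = df * (pre > 0)
    We -= lr * (dpre.T @ X); be -= lr * dpre.sum(axis=0); Wd -= lr * dWd

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

After training, each row of `Wd` is a candidate feature direction; at scale, the interpretability work lies in inspecting which inputs maximally activate each learned feature.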
Counterfactual explanations answer a different question than SHAP or LIME: not "why was this decision made?" but "what would need to change for a different decision to be made?" This framing is highly intuitive and actionable — especially in adverse decision contexts like loan rejections or insurance denials. "Your application was rejected. If your credit score were above 680 and your debt-to-income ratio below 35%, your application would be approved" is a counterfactual explanation.
Counterfactuals must satisfy several properties to be useful: proximity (the counterfactual should require minimal change from the factual input), feasibility (the changes must be possible for the individual to make — age cannot be changed, while savings rate can), actionability (the explanation should guide action, not just describe the nearest decision boundary), and diversity (provide multiple counterfactual paths, not just the nearest one, since different paths may be feasible for different individuals).
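For a linear scorer, the minimal counterfactual restricted to actionable features has a closed form — a toy sketch with invented weights and feature names (real systems use optimisers such as DiCE over non-linear models):

```python
import numpy as np

# Toy linear credit scorer (hypothetical weights): score > 0 => approve
features = ["credit_score_norm", "debt_ratio", "age_norm"]
w = np.array([2.0, -1.5, 0.3])
b = -0.5
actionable = np.array([True, True, False])  # age is immutable

x = np.array([0.1, 0.6, 0.4])               # rejected applicant
assert w @ x + b < 0

# Minimal-L2 counterfactual restricted to actionable features:
# move along the actionable component of w just past the decision boundary
w_a = w * actionable
margin = 0.01                               # land slightly inside "approve"
delta = -(w @ x + b - margin) / (w_a @ w_a) * w_a
x_cf = x + delta

print(dict(zip(features, np.round(delta, 3))))
assert w @ x_cf + b > 0                     # decision flips
assert x_cf[2] == x[2]                      # immutable feature unchanged
```

This captures proximity (smallest L2 change) and feasibility (the immutable feature never moves); diversity requires generating multiple such paths, which is where dedicated libraries earn their keep.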
The regulatory landscape for AI explainability is evolving rapidly and varies significantly across jurisdictions and sectors. As of 2026, the key frameworks — GDPR Art. 22, the EU AI Act, ECOA/Reg B and fair lending rules in the US, the FDA's AI/ML SaMD guidance, and NYC Local Law 144 — are mapped to industry use cases in the table below.
Model cards (Mitchell et al., 2019) are structured documentation for ML models that cover: intended use, limitations, evaluation metrics across demographic groups, ethical considerations, and recommendations for appropriate use. They are the minimum transparency standard for any publicly deployed model. Major platforms (HuggingFace, Google Vertex AI, AWS SageMaker) now require or strongly encourage model cards for published models. The EU AI Act's technical documentation requirements effectively mandate model card equivalents for high-risk AI.
| Industry | Use Case | Required Explanation Type | Regulatory Driver | Example Tool |
|---|---|---|---|---|
| Finance — Credit | Credit scoring & loan origination | Local: individual adverse action reasons. Global: demographic parity across protected groups. | ECOA / Reg B, EU AI Act (Annex III), GDPR Art. 22 | SHAP waterfall plots; DiCE counterfactuals; FICO Explainable AI |
| Finance — Fraud | Transaction fraud detection | Local: why this transaction was flagged. Audit: feature drift monitoring over time. | Internal compliance, PSD2 (EU), audit requirements | SHAP force plots; rule extraction from gradient boosted models |
| Healthcare | Clinical decision support (radiology, pathology, risk scoring) | Visual: GradCAM/saliency maps showing relevant image regions. Clinical: feature-based rationale clinician can verify. | FDA AI/ML-Based SaMD guidance, EU MDR, IEC 62304 clinical evidence requirements | GradCAM; Integrated Gradients; Captum; PathAI attention maps |
| Legal / Criminal Justice | Recidivism risk scoring (COMPAS-type tools) | Local: factors contributing to risk score. Counterfactual: what would reduce the score. | Due process requirements, judicial review rights, US state-level AI auditing laws | LIME; counterfactual tools; rule-based surrogate models |
| HR / Recruitment | Resume screening, interview scoring | Global: which features predict candidate success. Demographic: parity across protected groups. | NYC Local Law 144, EEOC guidelines, EU AI Act employment provisions | SHAP summary plots; bias audits; adverse impact analysis |
| Insurance | Underwriting & claims assessment | Local: why this premium was set / claim denied. Audit: proxy discrimination analysis. | Insurance regulation, GDPR, EU AI Act financial services provisions | SHAP; PDPs; Integrated Gradients for actuarial models |
Choosing the right XAI method for a given problem requires understanding the trade-offs across several dimensions. The following table provides a practical reference for method selection.
| Method | Type | Model-Agnostic? | Local / Global | Fidelity | Speed | Best For |
|---|---|---|---|---|---|---|
| SHAP (TreeSHAP) | Feature attribution | No — tree models only | Both | Exact | Fast (ms) | Tabular ML with tree models; credit, fraud, churn; real-time explanations |
| SHAP (KernelSHAP) | Feature attribution | Yes | Both | Approximate (sampling) | Slow (seconds–minutes) | Model-agnostic tabular; when TreeSHAP unavailable; batch explanation |
| LIME | Local surrogate | Yes | Local only | Approximate (local linear) | Medium (hundreds of ms) | Text classification; image classification; intuitive explanations for non-technical audiences |
| Attention Visualisation | Internal representation | No — transformers only | Local | Not a faithful explanation (debated) | Fast (single forward pass) | Debugging transformer models; NLP rationale generation; ViT spatial focus |
| Counterfactuals | Contrastive example | Yes (with constraints) | Local | High (boundary-based) | Slow (optimisation per sample) | Adverse decision contexts; actionable user explanations; regulatory recourse requirement |
| ICE / PDP | Partial dependence | Yes | Both | Approximate (assumes feature independence) | Medium | Understanding feature effects globally; non-linear relationship discovery; model debugging |
| Integrated Gradients | Gradient attribution | No — differentiable models only | Local | Satisfies axioms (exact) | Medium (multiple forward/backward passes) | Deep learning on text, images, tabular; when theoretical guarantees matter; medical imaging |
Train a GradientBoostingClassifier on the Titanic dataset (available via seaborn or Kaggle). Use shap.TreeExplainer to compute SHAP values for the test set. Generate: (a) a beeswarm summary plot showing global feature importances, (b) a waterfall plot for the passenger with the highest predicted survival probability, and (c) a dependence plot for the Age feature. Answer: what are the three most important features? Does the directionality (positive/negative SHAP values for each feature) match your intuition about Titanic survival? Where does the model deviate from historical knowledge?
Tools: Python, scikit-learn, SHAP library (pip install shap), Seaborn for data loading.
Train an XGBoost model on the UCI Adult Income dataset (predict income >50K). Generate SHAP values (TreeSHAP) and LIME explanations for 20 random test instances. For each instance, extract the top-5 features from each method and compute rank correlation (Spearman) between SHAP and LIME feature importance rankings. Find at least one instance where the methods disagree substantially (rank correlation <0.6). Investigate why: is it a region of high nonlinearity, a feature with strong interactions, or a sample near the decision boundary? Understanding when methods disagree is essential for responsible XAI deployment.
Tools: Python, XGBoost, SHAP, LIME, SciPy (for rank correlation). Dataset: sklearn.datasets.fetch_openml('adult', version=2).
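The comparison step — rank correlation between the two methods' importance rankings — can be sketched without SciPy; the feature names and attribution scores below are invented for illustration:

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the ranks
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical |attribution| scores for the same 5 features from two methods
features = ["age", "education", "hours", "capital_gain", "occupation"]
shap_imp = np.array([0.42, 0.31, 0.18, 0.15, 0.07])
lime_imp = np.array([0.05, 0.30, 0.22, 0.38, 0.12])

rho = spearman(shap_imp, lime_imp)
print(f"Spearman rho = {rho:.2f}")  # -0.30 for these invented scores
if rho < 0.6:
    print("Substantial disagreement -> inspect this instance")
```

`scipy.stats.spearmanr` computes the same quantity (with tie handling) and is the convenient choice once SciPy is installed.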
Train a binary classifier on the German Credit Dataset (predict loan default). Identify 10 instances predicted as "high default risk" (rejected applications). Use DiCE (pip install dice-ml) to generate diverse counterfactual explanations for each rejection. Constrain: age cannot decrease; employment years cannot decrease (actionable features only). For each rejection, report: the minimum credit score increase required for approval, the minimum employment years required, and whether both changes together are required or either is sufficient. Visualize the counterfactual distribution: what does the "approval boundary" look like in credit-score vs. debt-ratio space? Discuss: are these counterfactuals actionable for a typical applicant? What additional constraints would make them more useful?
Tools: Python, DiCE-ML, scikit-learn. Dataset: UCI German Credit. Estimated time: 3–5 hours.
Generate a structured XAI audit report for your AI system. Document the explanation methods used, findings, bias analysis, and recommendations for your compliance, governance, or internal review process.
The XAI landscape offers a spectrum of tools for different purposes and audiences. Intrinsically interpretable models — linear regression, decision trees, GAMs, EBMs — are the gold standard for high-stakes tabular applications where accuracy permits, offering full transparency without approximation. SHAP is the dominant post-hoc method for tabular and structured models: theoretically grounded in Shapley game theory, consistent across features, and supported by efficient exact algorithms for tree ensembles that make real-time explanation practical. LIME provides a lighter-weight alternative particularly suited to text and image modalities, producing intuitive local approximations at modest computational cost. Gradient-based methods (Integrated Gradients, GradCAM) are the appropriate tool when model gradients are available and the explanation consumer is a technical user. Counterfactual explanations provide actionable recourse for individuals affected by adverse automated decisions — a regulatory requirement in credit, employment, and other high-stakes domains.
Attention visualisation provides useful but imperfect insights into transformer reasoning and should be treated as a diagnostic tool rather than a causal explanation. Mechanistic interpretability is a research frontier with the potential to provide genuine understanding of neural network computations — its discoveries about circuits, superposition, and sparse autoencoders are some of the most important results in AI safety research — but remains impractical for production deployment today. In production, explanation method selection should be driven by three questions: who is the explanation for (model developer, regulator, affected individual), what decision does it support (debugging, compliance, recourse), and what regulatory framework applies? Model cards are the minimum documentation standard for any publicly deployed model.
The next part extends these ideas into the domain of fairness and bias — where interpretability tools are the primary instrument for detecting and correcting discriminatory patterns in AI systems before they cause harm. SHAP feature importance plots that reveal proxy discrimination, LIME explanations that expose differential treatment across demographic groups, and counterfactuals that quantify the actionable gap between protected groups are all tools we will apply in the context of AI fairness assessment.
In Part 19: AI Ethics & Bias Mitigation, we move from explaining model behaviour to evaluating it against fairness criteria — covering fairness metrics (demographic parity, equal opportunity, equalized odds), dataset auditing techniques, and the technical and organisational approaches to debiasing AI systems before and after deployment.