About This Series
This is Part 18 of the AI in the Wild: Real-World Applications & Ethics series — a 24-part deep dive covering the complete end-to-end AI journey, from ML foundations through to responsible AI governance.
As AI systems take consequential decisions in credit, healthcare, law, and employment, the ability to explain and interpret model behaviour is no longer optional — it is a regulatory requirement, an engineering discipline, and a prerequisite for responsible deployment.
Explainable AI (XAI) is a collection of techniques and practices that make the behaviour of machine learning models understandable to human stakeholders — whether those stakeholders are data scientists debugging a model, regulators auditing it for compliance, clinicians deciding whether to act on a medical AI recommendation, or customers challenging an adverse decision. The field emerged from a central tension in modern ML: the most accurate models (deep neural networks, gradient boosted trees, large ensembles) are also the most opaque, while the most interpretable models (linear regression, decision trees) tend to be less accurate on complex, high-dimensional tasks.
This tension has intensified as AI moves from research into consequential deployments. The EU General Data Protection Regulation (GDPR) requires that automated decisions with significant effects on individuals must be explainable. The EU AI Act (2024) classifies many AI applications as high-risk and mandates transparency, auditability, and human oversight. ECOA and fair lending laws in the US require that credit decisions be explainable. These regulatory requirements, combined with the engineering need to debug and improve models, have driven XAI from a research curiosity to a production engineering discipline.
The field distinguishes between two fundamentally different approaches: intrinsically interpretable models, which are transparent by construction (linear models, decision trees, GAMs), and post-hoc explanation methods, which approximate the behaviour of an already-trained black-box model (SHAP, LIME, gradient-based attribution).
Critical distinction: "This SHAP value shows that feature X contributed +0.3 to the prediction" is a statement about the SHAP approximation, not a statement about the model's true causal mechanism. Conflating explanation quality with ground truth is the most common misuse of XAI tools.
A second fundamental axis: local explanations explain a specific prediction (why was this loan application rejected?), while global explanations characterize the model's overall behaviour (what features does this model rely on most across all predictions?). Both are necessary but serve different purposes. Local explanations serve individual stakeholders (the loan applicant, the compliance officer reviewing a specific case). Global explanations serve model developers (identify data quality issues, spot biases, understand model behaviour at deployment).
Most practical XAI deployments require both. A credit scoring system typically provides: (1) a global SHAP summary plot showing the five most important features across the population, and (2) a local SHAP waterfall plot showing the specific factors that influenced each individual decision — both to regulators who audit the model and to customers who have a right to understand adverse decisions.
Post-hoc methods are the workhorses of applied XAI. They require no modification to the model architecture or training procedure — they operate on any model that can produce predictions given inputs. This model-agnosticism is both their strength (universally applicable) and a limitation (approximations may be inaccurate for highly nonlinear or discontinuous models).
SHAP (SHapley Additive exPlanations) is the most widely adopted XAI method in production, used by Salesforce, Airbnb, Microsoft, healthcare systems, and financial institutions worldwide. It roots explanations in cooperative game theory: the Shapley value of each feature is the average marginal contribution of that feature across all possible orderings in which features could be added to the prediction. This formulation satisfies four desirable axioms: efficiency (attributions sum to the difference between prediction and expected value), symmetry (identical features get identical attributions), dummy (irrelevant features get zero attribution), and additivity (explanations from sub-models add correctly).
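The Shapley definition can be verified by brute force on a toy additive model — a sketch with invented contribution values, enumerating every feature ordering and checking the efficiency axiom:

```python
from itertools import permutations

# Toy "model": prediction given a subset of known features; absent features
# fall back to the baseline. Contribution values are invented for illustration.
contrib = {"credit_score": -0.35, "previous_defaults": 0.28, "debt_ratio": 0.10}
baseline = 0.05  # expected model output with no features known

def predict(known):
    return baseline + sum(contrib[f] for f in known)

features = list(contrib)

def shapley(target):
    # Average marginal contribution of `target` over all feature orderings
    total = 0.0
    orderings = list(permutations(features))
    for order in orderings:
        before = set(order[:order.index(target)])
        total += predict(before | {target}) - predict(before)
    return total / len(orderings)

phi = {f: shapley(f) for f in features}
print(phi)
# Efficiency axiom: attributions sum to prediction minus expected value
assert abs(sum(phi.values()) - (predict(set(features)) - baseline)) < 1e-9
```

For a purely additive model like this one, each Shapley value collapses to the feature's own contribution; the brute-force enumeration becomes interesting (and expensive) once features interact, which is exactly what TreeSHAP and KernelSHAP approximate efficiently.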
import shap
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# SHAP (SHapley Additive exPlanations) — game theory-based feature attribution
# Used by: Salesforce, Airbnb, healthcare risk models, financial ML

feature_names = ['credit_score', 'income', 'debt_ratio', 'employment_years',
                 'previous_defaults', 'loan_amount', 'loan_term', 'purpose']

# Synthetic stand-in data — replace with a real loan-default dataset
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(1000, 8)), columns=feature_names)
y_train = (rng.random(1000) < 0.2).astype(int)
X_test = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_names)

# Train XGBoost on loan default prediction
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)

# SHAP explainer — TreeSHAP is exact and fast for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # shape: (n_samples, n_features)

# 1. Global explanation: mean |SHAP| across all samples
shap.summary_plot(shap_values, X_test, feature_names=feature_names, plot_type="bar")
# Illustrative output on a real loan dataset:
# credit_score (0.42) > previous_defaults (0.31) > debt_ratio (0.18) > ...

# 2. Local explanation: why was THIS loan rejected?
shap.waterfall_plot(shap.Explanation(
    values=shap_values[42],                # sample 42
    base_values=explainer.expected_value,
    data=X_test.iloc[42].values,
    feature_names=feature_names
))
# Shows e.g.: credit_score → -0.35 (negative: reduces default risk),
#             previous_defaults → +0.28 (positive: increases default risk)

# 3. Dependence plot: how does credit_score affect predictions?
shap.dependence_plot("credit_score", shap_values, X_test, feature_names=feature_names)
# On real data this reveals a non-linear relationship:
# scores < 600 dramatically increase default risk
LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al., 2016) takes a different approach: rather than computing exact feature attributions mathematically, it approximates the model locally around a specific prediction. The algorithm samples from a neighbourhood around the input, queries the black-box model for predictions, weights samples by their proximity to the original input, and fits a simple interpretable model (usually logistic regression or a decision tree) to these weighted samples. The local model's coefficients are then presented as the explanation.
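This sample–query–weight–fit loop can be sketched from scratch in a few lines of NumPy — the quadratic "black box" here is an invented stand-in, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in black-box model: nonlinear in x0, linear in x1
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

x0 = np.array([2.0, 1.0])  # instance to explain

# 1. Sample a neighbourhood around the instance
X_pert = x0 + rng.normal(scale=0.1, size=(5000, 2))
# 2. Query the black box
y = black_box(X_pert)
# 3. Weight samples by proximity (Gaussian kernel)
w = np.exp(-np.sum((X_pert - x0) ** 2, axis=1) / (2 * 0.1 ** 2))
# 4. Fit a weighted linear surrogate (weighted least squares)
A = np.hstack([X_pert, np.ones((len(X_pert), 1))])  # add intercept column
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
print(coef[:2])  # local "explanation": ≈ [4.0, 3.0], the gradient of f at x0
```

The surrogate coefficients recover the local slope of the black box; shrinking or widening the kernel width changes what "local" means, which is exactly the hyperparameter sensitivity discussed below.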
from lime import lime_tabular, lime_text
import numpy as np

# LIME: Local Interpretable Model-agnostic Explanations
# Approximates any black-box model locally with an interpretable model
# Assumes X_train, X_test, feature_names, and the fitted `model`
# from the SHAP example above

# Tabular explanation
explainer_tabular = lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=feature_names,
    class_names=['No Default', 'Default'],
    mode='classification',
    discretize_continuous=True  # converts continuous features to ranges
)

# Explain a single prediction
instance = X_test.iloc[0]
explanation = explainer_tabular.explain_instance(
    instance.values,
    model.predict_proba,
    num_features=5,    # top 5 contributing features
    num_samples=1000   # local neighbourhood samples
)

print("Prediction explanation:")
for feature, weight in explanation.as_list():
    direction = "↑ risk" if weight > 0 else "↓ risk"
    print(f"  {feature}: {weight:.3f} ({direction})")

# Text classification explanation (assumes a fitted text classifier
# `spam_model` exposing predict_proba over raw strings)
text_explainer = lime_text.LimeTextExplainer(class_names=['Not Spam', 'Spam'])
text_exp = text_explainer.explain_instance(
    "Congratulations! You've won $1000!!!",
    spam_model.predict_proba,
    num_features=5
)
# Illustrative highlights: "won" (+0.38), "$1000" (+0.32), "Congratulations" (+0.21)
LIME's key advantages over SHAP: (1) it works natively for text and image modalities with domain-appropriate perturbation strategies (word masking for text, superpixel masking for images), (2) it is computationally lighter than KernelSHAP, and (3) its local linear approximation is highly intuitive. Its limitations: LIME explanations are inherently unstable (different random seeds produce different neighbourhoods and different explanations), and the neighbourhood definition — what counts as "local" — is a hyper-parameter with significant impact on explanation quality.
Transformer models compute attention weights between all pairs of tokens in a sequence. These weights are often visualized as heatmaps to understand which tokens the model "attends to" when making predictions. Attention visualisation is widely used in NLP for debugging models, understanding cross-lingual transfer, and generating rationale-style explanations for document classification. In vision transformers (ViT), attention maps show which image patches are attended to when classifying an image.
from transformers import BertTokenizer, BertModel
import torch
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize attention patterns — which tokens attend to which
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
model.eval()

text = "The bank approved the loan despite the poor credit history."
inputs = tokenizer(text, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)

# Extract attention weights: (batch, heads, seq_len, seq_len)
attention = outputs.attentions[-1][0]  # last layer
# Average across heads for visualization
avg_attention = attention.mean(dim=0).numpy()

plt.figure(figsize=(10, 8))
sns.heatmap(avg_attention, xticklabels=tokens, yticklabels=tokens,
            cmap='Blues', vmin=0, vmax=avg_attention.max())
plt.title("BERT Attention: Last Layer (averaged over heads)")
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=150)

# Note: attention ≠ explanation — high attention to a token doesn't mean it causes the prediction
# Mechanistic interpretability goes deeper: circuits, features, activations
Integrated Gradients (Sundararajan et al., 2017) provides a theoretically grounded gradient-based attribution: it integrates the gradient of the output with respect to each input feature along a straight-line path from a baseline (typically zero or a neutral input) to the actual input. Unlike vanilla gradients (which are locally computed and can be misleading near saturated activations), integrated gradients satisfies the Sensitivity and Implementation Invariance axioms. It is the attribution method of choice for differentiable models in production — used in Google Cloud's Explainable AI service, in Captum (PyTorch's XAI library), and increasingly as a standard attribution method in medical imaging AI.
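The path integral can be approximated with a Riemann sum — a minimal sketch on a hand-differentiable toy function (the function, input, and step count are invented for illustration). The completeness property — attributions summing to f(x) − f(baseline) — falls out directly:

```python
import numpy as np

def f(x):
    # Toy differentiable "model": f(x) = x0^2 + 3*x1
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    # Analytic gradient of the toy model (autodiff in real use)
    return np.array([2.0 * x[0], 3.0])

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum along the straight path baseline -> x
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(x, baseline)
print(attr)  # ≈ [4.0, 3.0]
# Completeness: attributions sum to f(x) - f(baseline)
assert abs(attr.sum() - (f(x) - f(baseline))) < 1e-6
```

In real models the analytic `grad_f` is replaced by a framework's autodiff (e.g. Captum wraps exactly this loop around PyTorch backward passes), and the number of steps trades accuracy against compute.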
GradCAM (Gradient-weighted Class Activation Mapping) is the dominant explanation method for CNNs: it computes the gradient of the target class score with respect to the feature maps of the last convolutional layer, then uses the global average of these gradients to weight the feature maps and produce a coarse heatmap over the input image. GradCAM is widely used in medical imaging to highlight the regions of an X-ray, MRI, or histology slide that most influenced a diagnostic prediction. It is computationally inexpensive and produces visually interpretable spatial maps — but only applicable to CNNs, not to fully connected networks or transformers without modification.
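The core GradCAM computation is only a few lines. The sketch below runs the weighting step on synthetic feature maps and gradients — a real implementation would capture both from the last convolutional layer via framework hooks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the last conv layer's activations A and the
# gradients dScore/dA — shapes (channels, height, width)
K, H, W = 8, 7, 7
A = rng.random((K, H, W))
grads = rng.normal(size=(K, H, W))

# 1. Channel weights: global average pooling of the gradients
weights = grads.mean(axis=(1, 2))                        # shape (K,)
# 2. Weighted combination of feature maps, then ReLU
cam = np.maximum(np.tensordot(weights, A, axes=1), 0.0)  # shape (H, W)
# 3. Normalise to [0, 1] for overlay on the input image
cam = cam / (cam.max() + 1e-8)
print(cam.shape)  # coarse 7x7 heatmap, upsampled to image size in practice
```

The heatmap's resolution is that of the final feature maps (here 7×7), which is why GradCAM overlays look coarse: the map is bilinearly upsampled onto the input image.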
Mechanistic interpretability (MI) is a research frontier that aims at a different, deeper question than post-hoc explanation methods: not "which input features influenced this prediction?" but "what computational algorithm did the network implement to produce this prediction?" MI researchers seek to reverse-engineer the learned algorithms inside neural networks — understanding them at the level of circuits, features, and computations rather than input-output correlations.
The circuits paradigm, developed by Chris Olah and collaborators (Olah et al., 2020, continued at Anthropic), proposes that neural networks are composed of discrete computational subgraphs (circuits) that implement specific algorithms. Landmark discoveries include curve detectors in InceptionV1's early vision layers, induction heads that implement in-context copying in transformers, and the indirect-object-identification circuit in GPT-2.
A major obstacle to mechanistic interpretability is superposition: neural networks represent more features than they have neurons by encoding multiple features as non-orthogonal directions in activation space, relying on the sparsity of real-world data to prevent interference. This means individual neurons are rarely monosemantic (responding to a single concept) — they are polysemantic, activated by many unrelated concepts.
Sparse Autoencoders (SAEs) emerged in 2023–2024 as a tool for decomposing polysemantic neuron activations into monosemantic features. An SAE is trained to reconstruct activation vectors from a sparse combination of learned feature directions, enforcing sparsity through an L1 penalty. Anthropic's 2024 work on Claude applied SAEs at scale and discovered millions of interpretable features (including features corresponding to specific people, places, concepts, and even safety-relevant features like "Assistant" identity). This represents the state of the art in understanding what large language models internally represent.
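A minimal SAE sketch in NumPy, not Anthropic's production setup: synthetic "activations" are built as sparse combinations of more directions than dimensions (mimicking superposition), then reconstructed under an L1 penalty with hand-written gradients. All dimensions and hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: sparse mixtures of 16 ground-truth directions
# embedded in 8 dimensions (superposition: more features than dimensions)
d, m, N = 8, 16, 512
true_dirs = rng.normal(size=(m, d)) / np.sqrt(d)
codes = rng.random((N, m)) * (rng.random((N, m)) < 0.1)  # ~10% active
X = codes @ true_dirs

# Sparse autoencoder: f = relu(x @ We.T + be), x_hat = f @ Wd
We = 0.1 * rng.normal(size=(m, d)); be = np.zeros(m)
Wd = 0.1 * rng.normal(size=(m, d))
lam, lr = 1e-3, 0.1

losses = []
for step in range(300):
    pre = X @ We.T + be
    f = np.maximum(pre, 0.0)
    X_hat = f @ Wd
    err = X_hat - X
    loss = (err ** 2).sum(axis=1).mean() + lam * np.abs(f).sum(axis=1).mean()
    losses.append(loss)
    # Manual backprop of reconstruction + L1 sparsity loss
    dX_hat = 2.0 * err / N
    dWd = f.T @ dX_hat
    df = dX_hat @ Wd.T + lam * np.sign(f) / N
    dpre = df * (pre > 0)
    We -= lr * (dpre.T @ X); be -= lr * dpre.sum(axis=0); Wd -= lr * dWd

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

After training, each row of `Wd` is a candidate feature direction; at scale, the interpretability work lies in inspecting which inputs maximally activate each learned feature.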
Counterfactual explanations answer a different question than SHAP or LIME: not "why was this decision made?" but "what would need to change for a different decision to be made?" This framing is highly intuitive and actionable — especially in adverse decision contexts like loan rejections or insurance denials. "Your application was rejected. If your credit score were above 680 and your debt-to-income ratio below 35%, your application would be approved" is a counterfactual explanation.
Counterfactuals must satisfy several properties to be useful: proximity (the counterfactual should require minimal change from the factual input), feasibility (the changes must be possible for the individual to make — age cannot be changed, while savings rate can), actionability (the explanation should guide action, not just describe the nearest decision boundary), and diversity (provide multiple counterfactual paths, not just the nearest one, since different paths may be feasible for different individuals).
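For a linear scorer, the minimal counterfactual restricted to actionable features has a closed form — a toy sketch with invented weights and feature names (real systems use optimisers such as DiCE over non-linear models):

```python
import numpy as np

# Toy linear credit scorer (hypothetical weights): score > 0 => approve
features = ["credit_score_norm", "debt_ratio", "age_norm"]
w = np.array([2.0, -1.5, 0.3])
b = -0.5
actionable = np.array([True, True, False])  # age is immutable

x = np.array([0.1, 0.6, 0.4])               # rejected applicant
assert w @ x + b < 0

# Minimal-L2 counterfactual restricted to actionable features:
# move along the actionable component of w just past the decision boundary
w_a = w * actionable
margin = 0.01                               # land slightly inside "approve"
delta = -(w @ x + b - margin) / (w_a @ w_a) * w_a
x_cf = x + delta

print(dict(zip(features, np.round(delta, 3))))
assert w @ x_cf + b > 0                     # decision flips
assert x_cf[2] == x[2]                      # immutable feature unchanged
```

This captures proximity (smallest L2 change) and feasibility (the immutable feature never moves); diversity requires generating multiple such paths, which is where dedicated libraries earn their keep.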
The regulatory landscape for AI explainability is evolving rapidly and varies significantly across jurisdictions and sectors. As of 2026, the key frameworks — GDPR Art. 22, the EU AI Act, ECOA/Reg B and fair lending rules in the US, the FDA's AI/ML SaMD guidance, and NYC Local Law 144 — are mapped to industry use cases in the table below.
Model cards (Mitchell et al., 2019) are structured documentation for ML models that cover: intended use, limitations, evaluation metrics across demographic groups, ethical considerations, and recommendations for appropriate use. They are the minimum transparency standard for any publicly deployed model. Major platforms (HuggingFace, Google Vertex AI, AWS SageMaker) now require or strongly encourage model cards for published models. The EU AI Act's technical documentation requirements effectively mandate model card equivalents for high-risk AI.
| Industry | Use Case | Required Explanation Type | Regulatory Driver | Example Tool |
|---|---|---|---|---|
| Finance — Credit | Credit scoring & loan origination | Local: individual adverse action reasons. Global: demographic parity across protected groups. | ECOA / Reg B, EU AI Act (Annex III), GDPR Art. 22 | SHAP waterfall plots; DiCE counterfactuals; FICO Explainable AI |
| Finance — Fraud | Transaction fraud detection | Local: why this transaction was flagged. Audit: feature drift monitoring over time. | Internal compliance, PSD2 (EU), audit requirements | SHAP force plots; rule extraction from gradient boosted models |
| Healthcare | Clinical decision support (radiology, pathology, risk scoring) | Visual: GradCAM/saliency maps showing relevant image regions. Clinical: feature-based rationale clinician can verify. | FDA AI/ML-Based SaMD guidance, EU MDR, IEC 62304 clinical evidence requirements | GradCAM; Integrated Gradients; Captum; PathAI attention maps |
| Legal / Criminal Justice | Recidivism risk scoring (COMPAS-type tools) | Local: factors contributing to risk score. Counterfactual: what would reduce the score. | Due process requirements, judicial review rights, US state-level AI auditing laws | LIME; counterfactual tools; rule-based surrogate models |
| HR / Recruitment | Resume screening, interview scoring | Global: which features predict candidate success. Demographic: parity across protected groups. | NYC Local Law 144, EEOC guidelines, EU AI Act employment provisions | SHAP summary plots; bias audits; adverse impact analysis |
| Insurance | Underwriting & claims assessment | Local: why this premium was set / claim denied. Audit: proxy discrimination analysis. | Insurance regulation, GDPR, EU AI Act financial services provisions | SHAP; PDPs; Integrated Gradients for actuarial models |
Choosing the right XAI method for a given problem requires understanding the trade-offs across several dimensions. The following table provides a practical reference for method selection.
| Method | Type | Model-Agnostic? | Local / Global | Fidelity | Speed | Best For |
|---|---|---|---|---|---|---|
| SHAP (TreeSHAP) | Feature attribution | No — tree models only | Both | Exact | Fast (ms) | Tabular ML with tree models; credit, fraud, churn; real-time explanations |
| SHAP (KernelSHAP) | Feature attribution | Yes | Both | Approximate (sampling) | Slow (seconds–minutes) | Model-agnostic tabular; when TreeSHAP unavailable; batch explanation |
| LIME | Local surrogate | Yes | Local only | Approximate (local linear) | Medium (hundreds of ms) | Text classification; image classification; intuitive explanations for non-technical audiences |
| Attention Visualisation | Internal representation | No — transformers only | Local | Not a faithful explanation (debated) | Fast (single forward pass) | Debugging transformer models; NLP rationale generation; ViT spatial focus |
| Counterfactuals | Contrastive example | Yes (with constraints) | Local | High (boundary-based) | Slow (optimisation per sample) | Adverse decision contexts; actionable user explanations; regulatory recourse requirement |
| ICE / PDP | Partial dependence | Yes | Both | Approximate (assumes feature independence) | Medium | Understanding feature effects globally; non-linear relationship discovery; model debugging |
| Integrated Gradients | Gradient attribution | No — differentiable models only | Local | Satisfies axioms (exact) | Medium (multiple forward/backward passes) | Deep learning on text, images, tabular; when theoretical guarantees matter; medical imaging |
Train a GradientBoostingClassifier on the Titanic dataset (available via seaborn or Kaggle). Use shap.TreeExplainer to compute SHAP values for the test set. Generate: (a) a beeswarm summary plot showing global feature importances, (b) a waterfall plot for the passenger with the highest predicted survival probability, and (c) a dependence plot for the Age feature. Answer: what are the three most important features? Does the directionality (positive/negative SHAP values for each feature) match your intuition about Titanic survival? Where does the model deviate from historical knowledge?
Tools: Python, scikit-learn, SHAP library (pip install shap), Seaborn for data loading.
Train an XGBoost model on the UCI Adult Income dataset (predict income >50K). Generate SHAP values (TreeSHAP) and LIME explanations for 20 random test instances. For each instance, extract the top-5 features from each method and compute rank correlation (Spearman) between SHAP and LIME feature importance rankings. Find at least one instance where the methods disagree substantially (rank correlation <0.6). Investigate why: is it a region of high nonlinearity, a feature with strong interactions, or a sample near the decision boundary? Understanding when methods disagree is essential for responsible XAI deployment.
Tools: Python, XGBoost, SHAP, LIME, SciPy (for rank correlation). Dataset: sklearn.datasets.fetch_openml('adult', version=2).
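The comparison step — rank correlation between the two methods' importance rankings — can be sketched without SciPy; the feature names and attribution scores below are invented for illustration:

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the ranks
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical |attribution| scores for the same 5 features from two methods
features = ["age", "education", "hours", "capital_gain", "occupation"]
shap_imp = np.array([0.42, 0.31, 0.18, 0.15, 0.07])
lime_imp = np.array([0.05, 0.30, 0.22, 0.38, 0.12])

rho = spearman(shap_imp, lime_imp)
print(f"Spearman rho = {rho:.2f}")  # -0.30 for these invented scores
if rho < 0.6:
    print("Substantial disagreement -> inspect this instance")
```

`scipy.stats.spearmanr` computes the same quantity (with tie handling) and is the convenient choice once SciPy is installed.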
Train a binary classifier on the German Credit Dataset (predict loan default). Identify 10 instances predicted as "high default risk" (rejected applications). Use DiCE (pip install dice-ml) to generate diverse counterfactual explanations for each rejection. Constrain: age cannot decrease; employment years cannot decrease (actionable features only). For each rejection, report: the minimum credit score increase required for approval, the minimum employment years required, and whether both changes together are required or either is sufficient. Visualize the counterfactual distribution: what does the "approval boundary" look like in credit-score vs. debt-ratio space? Discuss: are these counterfactuals actionable for a typical applicant? What additional constraints would make them more useful?
Tools: Python, DiCE-ML, scikit-learn. Dataset: UCI German Credit. Estimated time: 3–5 hours.
Generate a structured XAI audit report for your AI system. Document the explanation methods used, findings, bias analysis, and recommendations for your compliance, governance, or internal review process.
The XAI landscape offers a spectrum of tools for different purposes and audiences. Intrinsically interpretable models — linear regression, decision trees, GAMs, EBMs — are the gold standard for high-stakes tabular applications where accuracy permits, offering full transparency without approximation. SHAP is the dominant post-hoc method for tabular and structured models: theoretically grounded in Shapley game theory, consistent across features, and supported by efficient exact algorithms for tree ensembles that make real-time explanation practical. LIME provides a lighter-weight alternative particularly suited to text and image modalities, producing intuitive local approximations at modest computational cost. Gradient-based methods (Integrated Gradients, GradCAM) are the appropriate tool when model gradients are available and the explanation consumer is a technical user. Counterfactual explanations provide actionable recourse for individuals affected by adverse automated decisions — a regulatory requirement in credit, employment, and other high-stakes domains.
Attention visualisation provides useful but imperfect insights into transformer reasoning and should be treated as a diagnostic tool rather than a causal explanation. Mechanistic interpretability is a research frontier with the potential to provide genuine understanding of neural network computations — its discoveries about circuits, superposition, and sparse autoencoders are some of the most important results in AI safety research — but remains impractical for production deployment today. In production, explanation method selection should be driven by three questions: who is the explanation for (model developer, regulator, affected individual), what decision does it support (debugging, compliance, recourse), and what regulatory framework applies? Model cards are the minimum documentation standard for any publicly deployed model.
The next part extends these ideas into the domain of fairness and bias — where interpretability tools are the primary instrument for detecting and correcting discriminatory patterns in AI systems before they cause harm. SHAP feature importance plots that reveal proxy discrimination, LIME explanations that expose differential treatment across demographic groups, and counterfactuals that quantify the actionable gap between protected groups are all tools we will apply in the context of AI fairness assessment.
In Part 19: AI Ethics & Bias Mitigation, we move from explaining model behaviour to evaluating it against fairness criteria — covering fairness metrics (demographic parity, equal opportunity, equalized odds), dataset auditing techniques, and the technical and organisational approaches to debiasing AI systems before and after deployment.