
Part 7: Linear Algebra

April 26, 2026 · Wasil Zafar · 42 min read

A neural network layer is a matrix multiplication. PCA is eigendecomposition. Attention is scaled dot-product between matrices. Linear algebra is the computational backbone of all of modern ML — and understanding it geometrically is the key to using it well.

Table of Contents

  1. Indices & Logarithms
  2. Vectors
  3. Matrices
  4. Linear Transformations
  5. Systems of Linear Equations
  6. Rank & Vector Spaces
  7. Eigenvalues & Eigenvectors
  8. Singular Value Decomposition
  9. Distance Metrics
  10. ML Connections
  11. Practice Exercises
  12. Conclusion & Next Steps

Indices & Logarithms

Index and logarithm rules underpin the math of loss functions, learning rate schedules, information theory, and activation functions (sigmoid/softmax). Mastering them removes a common obstacle when reading ML papers.

Laws of Indices (Exponents)

  • Product rule: $a^m \cdot a^n = a^{m+n}$, e.g. $x^2 \cdot x^3 = x^5$
  • Quotient rule: $a^m / a^n = a^{m-n}$, e.g. $x^5 / x^2 = x^3$
  • Power of a power: $(a^m)^n = a^{mn}$, e.g. $(x^2)^3 = x^6$
  • Zero exponent: $a^0 = 1\;(a \neq 0)$, e.g. $5^0 = 1$
  • Negative exponent: $a^{-n} = 1/a^n$, e.g. $2^{-3} = 1/8$
  • Fractional exponent: $a^{m/n} = \sqrt[n]{a^m}$, e.g. $8^{2/3} = (\sqrt[3]{8})^2 = 4$
  • Product to a power: $(ab)^n = a^n b^n$, e.g. $(2x)^3 = 8x^3$

Laws of Logarithms

The logarithm $\log_b x = y$ means $b^y = x$. For ML: $\ln = \log_e$ (natural log, base $e \approx 2.718$); $\log_2$ (bits, used in information theory); $\log_{10}$ (dB scale, signal processing).

  • Product: $\log(ab) = \log a + \log b$; ML use: joint log-likelihood becomes a sum of per-sample log-losses
  • Quotient: $\log(a/b) = \log a - \log b$; ML use: log of probability ratios in KL divergence
  • Power: $\log(a^n) = n \log a$; ML use: simplifying $\log(p^k)$ terms in a likelihood
  • Change of base: $\log_b a = \frac{\ln a}{\ln b}$; ML use: converting between $\log_2$ and $\ln$ for entropy
  • Log of 1: $\log 1 = 0$; ML use: a perfect prediction has zero cross-entropy loss
  • Log of the base: $\log_b b = 1$; ML use: base case in information content $I(x) = -\log_2 p(x)$
  • Inverse: $e^{\ln x} = x$, $\ln(e^x) = x$; ML use: the log-sum-exp trick for numerical stability in softmax
Log-Sum-Exp trick: prevents numerical overflow (and underflow) when computing softmax. Instead of computing $\sum_j e^{z_j}$ directly (which overflows for large $z$), use:
$$\log \sum_j e^{z_j} = c + \log \sum_j e^{z_j - c}, \quad c = \max_j z_j$$
Subtracting $c$ makes all exponents $\leq 0$, so $e^{z_j - c} \in (0, 1]$ — no overflow, no underflow.
import numpy as np
from scipy.special import logsumexp

# Laws of indices
print("Laws of Indices:")
print(f"  2^3 * 2^4    = 2^(3+4) = {2**7}   (got {2**3 * 2**4})")
print(f"  (3^2)^4      = 3^8     = {3**8}  (got {(3**2)**4})")
print(f"  5^0          = 1       = {5**0}")
print(f"  2^(-3)       = 1/8     = {2**-3:.4f}")
print(f"  8^(2/3)      = 4       = {8**(2/3):.4f}")

# Laws of logarithms
print("\nLaws of Logarithms:")
print(f"  log(6*7)     = log6+log7 = {np.log(6)+np.log(7):.6f}, direct = {np.log(42):.6f}")
print(f"  log(10/2)    = log10-log2= {np.log(10)-np.log(2):.6f}, direct = {np.log(5):.6f}")
print(f"  log(2^10)    = 10*log2  = {10*np.log(2):.6f}, direct = {np.log(2**10):.6f}")

# Log-sum-exp trick
z = np.array([1000.0, 1001.0, 1002.0])  # overflow risk: e^1000 = inf
c = np.max(z)
lse = c + np.log(np.sum(np.exp(z - c)))
print(f"\nLog-sum-exp trick (z=[1000,1001,1002])")
print(f"  Direct np.log(sum(exp(z))): {np.log(np.sum(np.exp(z - c) * np.exp(c))):.4f}")  # would overflow without trick
print(f"  Log-sum-exp (stable):  {lse:.4f}")
print(f"  scipy logsumexp:       {lse:.4f}  (matches)")

Vectors

A vector $\mathbf{v} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers. Geometrically, it represents a point or direction in $n$-dimensional space.

$$\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}$$

Vector operations:

  • Addition: $(\mathbf{u} + \mathbf{v})_i = u_i + v_i$ (component-wise)
  • Scalar multiplication: $(c\mathbf{v})_i = c \cdot v_i$
  • Dot product: $\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i = |\mathbf{u}||\mathbf{v}|\cos\theta$

Vector norms measure magnitude:

$$\|\mathbf{v}\|_1 = \sum_i |v_i|, \quad \|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}, \quad \|\mathbf{v}\|_\infty = \max_i |v_i|$$
import numpy as np

# Define two 3-dimensional vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])

# Basic operations
print("Vector operations:")
print(f"  u + v        = {u + v}")
print(f"  2 * u        = {2 * u}")
print(f"  u · v (dot)  = {np.dot(u, v):.2f}")

# Norms
print("\nVector norms of u:")
print(f"  L1 norm  ||u||_1 = {np.linalg.norm(u, ord=1):.4f}")
print(f"  L2 norm  ||u||_2 = {np.linalg.norm(u, ord=2):.4f}")
print(f"  Inf norm ||u||_∞ = {np.linalg.norm(u, ord=np.inf):.4f}")

# Cosine similarity (used extensively in NLP/embeddings)
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"\nCosine similarity(u, v) = {cos_sim:.4f}")
angle_deg = np.degrees(np.arccos(cos_sim))
print(f"Angle between u and v   = {angle_deg:.2f}°")

Matrices

A matrix $A \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns. In ML, a matrix typically represents a dataset (rows = samples, cols = features) or a linear transformation.

Transpose & Special Matrices

The transpose $(A^\top)_{ij} = A_{ji}$ swaps rows and columns. Key properties: $(AB)^\top = B^\top A^\top$, $(A^\top)^\top = A$.

Special matrix types:

  • Square ($m = n$): weight matrices in fully-connected layers
  • Symmetric ($A = A^\top$): covariance matrices, kernel matrices, Hessians
  • Diagonal ($A_{ij} = 0$ for $i \neq j$): scaling transformations, diagonal covariance
  • Identity $I_n$ (diagonal of 1s): skip connections in ResNets, $y = F(x) + Ix$
  • Orthogonal ($A^\top A = I$, so $A^{-1} = A^\top$): rotation matrices; preserves vector norms
  • Lower/upper triangular (all entries above/below the diagonal are 0): Cholesky decomposition for sampling Gaussians

Determinant

The determinant $\det(A)$ is a scalar encoding the signed volume scaling factor of the linear transformation represented by $A$.

  • $\det(A) \neq 0 \Leftrightarrow A$ is invertible (non-singular)
  • $\det(A) = 0 \Leftrightarrow$ rows/columns are linearly dependent
  • $\det(AB) = \det(A)\det(B)$; $\det(A^\top) = \det(A)$; $\det(A^{-1}) = 1/\det(A)$

For a 2×2 matrix:

$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$

For a 3×3 matrix (cofactor expansion along row 1):

$$\det(A) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$

Cofactor, Minor, and Adjoint

These are the building blocks of the matrix inverse formula.

  • Minor $M_{ij}$: determinant of the submatrix obtained by deleting row $i$ and column $j$
  • Cofactor $C_{ij}$: $C_{ij} = (-1)^{i+j} M_{ij}$ — minor with checkerboard sign pattern
  • Cofactor matrix: matrix of all cofactors $C_{ij}$
  • Adjoint (adjugate): $\text{adj}(A) = [C_{ij}]^\top$ — transpose of the cofactor matrix
$$\det(A) = \sum_{j=1}^{n} a_{ij} C_{ij} \quad \text{(cofactor expansion along row } i)$$
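
As a minimal NumPy sketch of these definitions (the 3×3 matrix below is just an illustrative example), we can build each minor by deleting a row and a column, apply the checkerboard sign to get the cofactors, transpose to get the adjugate, and check the cofactor expansion against np.linalg.det; this also previews the inverse formula from the next subsection.

import numpy as np

A = np.array([[ 2, -1,  0],
              [-1,  2, -1],
              [ 0, -1,  2]], dtype=float)

n = A.shape[0]
cof = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)  # delete row i, column j
        cof[i, j] = (-1) ** (i + j) * np.linalg.det(minor)     # cofactor C_ij

adj = cof.T                             # adjugate = transpose of cofactor matrix
det_row1 = np.sum(A[0, :] * cof[0, :])  # cofactor expansion along row 1

print(f"det(A) via cofactor expansion: {det_row1:.4f}")
print(f"det(A) via np.linalg.det:      {np.linalg.det(A):.4f}")
print(f"adj(A)/det(A) (= A^-1):\n{np.round(adj / det_row1, 4)}")
print(f"np.linalg.inv(A):\n{np.round(np.linalg.inv(A), 4)}")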

Matrix Inverse

For a square non-singular matrix, the inverse $A^{-1}$ satisfies $A^{-1}A = AA^{-1} = I$:

$$A^{-1} = \frac{1}{\det(A)} \text{adj}(A)$$

Key properties: $(AB)^{-1} = B^{-1}A^{-1}$; $(A^\top)^{-1} = (A^{-1})^\top$; for symmetric $A$: $A^{-1}$ is also symmetric.

In practice, never compute $A^{-1}$ explicitly — use np.linalg.solve(A, b) which is faster and more numerically stable than np.linalg.inv(A) @ b.

import numpy as np

# Define a 3x3 matrix
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

B = np.array([[9, 8, 7],
              [6, 5, 4],
              [3, 2, 1]], dtype=float)

print("Matrix operations:")
print(f"Shape: {A.shape}")
print(f"Transpose A.T:\n{A.T}")
print(f"Matrix multiply A @ B:\n{A @ B}")
print(f"Trace tr(A) = {np.trace(A):.1f}")
print(f"Determinant det(A) = {np.linalg.det(A):.4f}")  # 0 = singular

# Invertible matrix
C = np.array([[2, 1], [5, 3]], dtype=float)
print(f"\n2x2 matrix C:\n{C}")
print(f"Inverse C^-1:\n{np.linalg.inv(C)}")
print(f"C @ C^-1 (≈ Identity):\n{C @ np.linalg.inv(C)}")

Linear Transformations

A linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ satisfies $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$ and $T(c\mathbf{v}) = cT(\mathbf{v})$. Every linear transformation can be represented as left-multiplication by a matrix: $T(\mathbf{x}) = A\mathbf{x}$.

Geometric interpretations in $\mathbb{R}^2$:

  • Scaling: $\begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}$ — stretches or compresses axes independently
  • Rotation by $\theta$: $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ — rotates vectors counter-clockwise
  • Projection onto $\hat{\mathbf{u}}$: $\mathbf{x} \mapsto (\hat{\mathbf{u}} \cdot \mathbf{x})\hat{\mathbf{u}}$ — drops a dimension
Neural network layers as transformations: Each dense layer applies $\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$ — a linear transformation $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ followed by a nonlinear activation $\sigma$. The nonlinearity is essential: stacking linear transformations without it collapses to a single matrix multiplication, limiting the network to only linear relationships.
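
A small NumPy sketch of these ideas (the specific matrices, vectors, and layer sizes are made up for illustration): apply a rotation, a scaling, and a projection, then confirm that two stacked linear maps without a nonlinearity collapse to a single matrix.

import numpy as np

theta = np.deg2rad(90)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation by 90° counter-clockwise
S = np.diag([2.0, 0.5])                          # scale x by 2, y by 0.5

x = np.array([1.0, 0.0])
print(f"Rotate [1, 0] by 90°: {np.round(R @ x, 6)}")  # -> [0. 1.]
print(f"Scale  [1, 0]:        {S @ x}")               # -> [2. 0.]

# Projection onto the unit vector u_hat = [1, 1]/sqrt(2)
u_hat = np.array([1.0, 1.0]) / np.sqrt(2)
print(f"Project [1, 0] onto u_hat: {np.round((u_hat @ x) * u_hat, 4)}")  # -> [0.5 0.5]

# Without a nonlinearity, two stacked "layers" collapse to one matrix
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 2)), rng.normal(size=(3, 4))
x2 = rng.normal(size=2)
print(f"W2 @ (W1 @ x2) == (W2 @ W1) @ x2 ? {np.allclose(W2 @ (W1 @ x2), (W2 @ W1) @ x2)}")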

Systems of Linear Equations

A system $A\mathbf{x} = \mathbf{b}$ where $A \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{b} \in \mathbb{R}^m$:

  • Unique solution: $m = n$, $A$ invertible → $\mathbf{x} = A^{-1}\mathbf{b}$
  • Overdetermined ($m \gt n$): more equations than unknowns — least-squares solution: $\mathbf{x} = (A^\top A)^{-1} A^\top \mathbf{b}$ (the normal equations)
  • Underdetermined ($m \lt n$): infinitely many solutions — find the minimum norm solution via pseudo-inverse
Linear Regression via Normal Equations

Given $m$ training examples with feature matrix $X \in \mathbb{R}^{m \times n}$ and labels $\mathbf{y} \in \mathbb{R}^m$, the OLS solution minimizes $\|X\mathbf{w} - \mathbf{y}\|^2$:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$

This is the closed-form analytic solution — no gradient descent needed. It's $O(n^3)$ to compute the matrix inverse, making it feasible for small-to-medium $n$ but impractical for deep learning weight matrices with millions of parameters (hence gradient-based optimization).
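
A minimal NumPy check of the normal equations on a tiny synthetic regression problem (the data is made up for illustration). Note that np.linalg.lstsq, which solves the least-squares problem via SVD, is the numerically preferred route in practice.

import numpy as np

# Hypothetical data: y ≈ 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
X = np.column_stack([np.ones_like(x), x])      # add a bias column
y = 2 + 3 * x + rng.normal(0, 0.5, size=x.size)

# Closed-form normal equations (solve instead of forming an explicit inverse)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred in practice: SVD-based least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Normal equations: {np.round(w_normal, 4)}")
print(f"np.linalg.lstsq:  {np.round(w_lstsq, 4)}")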

Cramer's Rule

For a system $A\mathbf{x} = \mathbf{b}$ with $A$ square and non-singular, Cramer's Rule gives an explicit formula for each unknown:

$$x_i = \frac{\det(A_i)}{\det(A)}$$

where $A_i$ is the matrix $A$ with the $i$-th column replaced by $\mathbf{b}$.

Cramer's Rule: theoretical vs practical. Each determinant computed by naive cofactor expansion costs $O(n!)$, so Cramer's Rule is catastrophically slow for large systems. For $n = 50$, Gaussian elimination needs on the order of $n^3 \approx 125{,}000$ operations, while Cramer's Rule would need on the order of $50! \approx 3 \times 10^{64}$, which is hopeless. Use it for hand-solving 2×2 and 3×3 systems, and to prove existence/uniqueness of solutions analytically.
import numpy as np

# Solve 3x3 system using Cramer's Rule
# 2x + y - z = 8
# -3x - y + 2z = -11
# -2x + y + 2z = -3

A = np.array([[ 2,  1, -1],
              [-3, -1,  2],
              [-2,  1,  2]], dtype=float)
b = np.array([8, -11, -3], dtype=float)

det_A = np.linalg.det(A)
print(f"det(A) = {det_A:.4f}")

# Cramer's Rule: replace each column in turn
x_cramer = []
for i in range(3):
    Ai = A.copy()
    Ai[:, i] = b                           # replace column i with b
    xi = np.linalg.det(Ai) / det_A
    x_cramer.append(xi)
    print(f"  det(A_{i+1}) = {np.linalg.det(Ai):.4f}  ->  x{i+1} = {xi:.4f}")

print(f"\nCramer solution: {np.round(x_cramer, 6)}")

# Verify with numpy solve (more efficient)
x_solve = np.linalg.solve(A, b)
print(f"np.linalg.solve: {np.round(x_solve, 6)}")
print(f"Match: {np.allclose(x_cramer, x_solve)}")

Rank & Vector Spaces

Key subspaces of a matrix $A \in \mathbb{R}^{m \times n}$:

  • Column space (range): $\text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$
  • Null space (kernel): $\text{null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\} \subseteq \mathbb{R}^n$
  • Row space: column space of $A^\top$
  • Rank: $\text{rank}(A)$ = dimension of column space = dimension of row space

Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$
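
A short NumPy sketch (the matrix below is an illustrative example with a deliberately dependent row): compute the rank, read off the nullity from the rank-nullity theorem, and extract a null-space basis from the SVD.

import numpy as np

# Rank-deficient 3x4 matrix: row 3 = row 1 + row 2
A = np.array([[1, 2, 3, 4],
              [2, 0, 1, 1],
              [3, 2, 4, 5]], dtype=float)

rank = np.linalg.matrix_rank(A)
m, n = A.shape
nullity = n - rank                      # rank-nullity theorem
print(f"rank(A) = {rank}, nullity(A) = {nullity}, n = {n}")

# Null-space basis: right singular vectors beyond the rank
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T                # columns span null(A)
print(f"A @ null_basis ≈ 0: {np.allclose(A @ null_basis, 0)}")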

Eigenvalues & Eigenvectors

For a square matrix $A \in \mathbb{R}^{n \times n}$, a non-zero vector $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if:

$$A\mathbf{v} = \lambda\mathbf{v}$$

Eigenvectors point in directions that are only scaled (not rotated) by $A$. Eigenvalues tell us by how much.

Find eigenvalues by solving the characteristic equation: $\det(A - \lambda I) = 0$.

Eigendecomposition of a diagonalizable matrix $A$:

$$A = Q \Lambda Q^{-1}$$

Where $Q$ has eigenvectors as columns and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$. For symmetric matrices ($A = A^\top$), $Q$ is orthogonal ($Q^{-1} = Q^\top$), giving $A = Q\Lambda Q^\top$.

import numpy as np

# Symmetric matrix (common in ML: covariance matrices, kernel matrices)
A = np.array([[4, 2],
              [2, 3]], dtype=float)

eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigendecomposition of symmetric 2x2 matrix:")
print(f"A =\n{A}")
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors (columns):\n{eigenvectors}")

# Verify: A * v = lambda * v for first eigenpair
v0, lam0 = eigenvectors[:, 0], eigenvalues[0]
print(f"\nVerification A*v0 = lambda0*v0:")
print(f"  A*v0     = {A @ v0}")
print(f"  lam0*v0  = {lam0 * v0}")
print(f"  Match?   {np.allclose(A @ v0, lam0 * v0)}")

# Reconstruct A from eigendecomposition
Q = eigenvectors
Lambda = np.diag(eigenvalues)
A_reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print(f"\nReconstructed A:\n{np.round(A_reconstructed, 6)}")

Singular Value Decomposition

SVD generalizes eigendecomposition to non-square matrices. Every matrix $A \in \mathbb{R}^{m \times n}$ can be factored as:

$$A = U \Sigma V^\top$$
  • $U \in \mathbb{R}^{m \times m}$: orthogonal — left singular vectors (columns are orthonormal)
  • $\Sigma \in \mathbb{R}^{m \times n}$: diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
  • $V \in \mathbb{R}^{n \times n}$: orthogonal — right singular vectors

Low-rank approximation: Keeping only the top $k$ singular values gives the best rank-$k$ approximation of $A$ in the Frobenius norm (Eckart–Young theorem):

$$A_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$$
import numpy as np

# SVD of a 4x3 matrix
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=True)

print("SVD of 4x3 matrix:")
print(f"  U shape: {U.shape}")
print(f"  s (singular values): {np.round(s, 4)}")
print(f"  V^T shape: {Vt.shape}")

# Reconstruct A exactly from the full SVD (rank-1 approximation comes below)
Sigma_full = np.zeros_like(A)
Sigma_full[:3, :3] = np.diag(s)
A_reconstructed = U @ Sigma_full @ Vt
print(f"\nReconstructed A (full SVD):\n{np.round(A_reconstructed, 6)}")

# Rank-1 approximation
k = 1
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(f"\nRank-1 approximation:\n{np.round(A_rank1, 4)}")
fro_error = np.linalg.norm(A - A_rank1, 'fro')
print(f"Frobenius reconstruction error: {fro_error:.4f}")

Distance Metrics

Distance metrics quantify similarity between vectors — critical for k-NN classifiers, clustering (k-means), recommendation systems, and embedding spaces.

Common Distance Metrics

  • Euclidean ($L_2$): $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_i (u_i - v_i)^2}$. Straight-line distance; sensitive to scale, so standardize features first. Default in k-NN and k-means.
  • Manhattan ($L_1$): $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_1 = \sum_i |u_i - v_i|$. Grid ("city block") distance; more robust to outliers than $L_2$; often better in high dimensions.
  • Minkowski ($L_p$): $d(\mathbf{u}, \mathbf{v}) = \left(\sum_i |u_i - v_i|^p\right)^{1/p}$. Generalizes $L_1$ ($p=1$) and $L_2$ ($p=2$); $p \to \infty$ gives the Chebyshev distance $\max_i|u_i - v_i|$.
  • Cosine: $\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$. Angle-based, ignores magnitude. Standard for text embeddings and NLP; the distance is $1 - \cos(\mathbf{u}, \mathbf{v})$.
  • Hamming: $d(\mathbf{u}, \mathbf{v}) = \frac{1}{n}\sum_i \mathbf{1}[u_i \neq v_i]$. Fraction of positions that differ; for binary/categorical vectors and one-hot encoded features.
Curse of Dimensionality: In high-dimensional spaces, Euclidean distances concentrate — the ratio of max to min pairwise distance approaches 1 as $d \to \infty$. This makes k-NN unreliable in raw pixel space ($d \approx 784$ for MNIST). Solutions: dimensionality reduction (PCA, UMAP), cosine similarity, or learned embeddings (neural networks map inputs to compact metric spaces where distances are semantically meaningful).
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([4.0, 1.0, 2.0, 1.0])

# Manual computations
euclidean  = np.sqrt(np.sum((u - v)**2))
manhattan  = np.sum(np.abs(u - v))
cosine_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
hamming    = np.mean(u != v)  # fraction of positions that differ (meant for binary/categorical vectors)

print(f"Euclidean  distance: {euclidean:.4f}")
print(f"Manhattan  distance: {manhattan:.4f}")
print(f"Cosine similarity:   {cosine_sim:.4f}")
print(f"Cosine distance:     {1 - cosine_sim:.4f}")

# Minkowski generalisation
for p in [1, 2, 3, np.inf]:
    if p == np.inf:
        d = np.max(np.abs(u - v))
        label = "Chebyshev (p=inf)"
    else:
        d = np.sum(np.abs(u - v)**p)**(1/p)
        label = f"Minkowski  p={p}"
    print(f"{label}: {d:.4f}")

# scipy equivalents (same results)
print(f"\nscipy euclidean:  {distance.euclidean(u, v):.4f}")
print(f"scipy cityblock:  {distance.cityblock(u, v):.4f}")
print(f"scipy cosine:     {distance.cosine(u, v):.4f}")

ML Connections

PCA via Eigendecomposition of Covariance Matrix

PCA finds directions of maximum variance by computing eigenvectors of the covariance matrix $\Sigma = \frac{1}{n-1}X_c^\top X_c$ (where $X_c$ is mean-centered data):

  1. Center data: $X_c = X - \bar{X}$
  2. Compute covariance: $\Sigma = X_c^\top X_c / (n-1)$
  3. Eigendecompose: $\Sigma = Q\Lambda Q^\top$
  4. Sort eigenvectors by eigenvalue (descending)
  5. Project: $Z = X_c Q_k$ (keep top $k$ eigenvectors)

Equivalently, PCA is the SVD of the centered data matrix: the right singular vectors $V$ are the principal components. scikit-learn's PCA is implemented via SVD of the centered data rather than explicit eigendecomposition of the covariance matrix.

Self-attention as matrix operations: In the Transformer, attention is computed as: $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$ Every operation here is matrix multiplication. $QK^\top \in \mathbb{R}^{n \times n}$ computes pairwise token similarities; $\sqrt{d_k}$ prevents gradient saturation of softmax; multiplying by $V$ extracts a weighted combination of value representations. The entire Transformer architecture is built from matrix multiplications and element-wise nonlinearities.
import numpy as np

# PCA from scratch using eigendecomposition
np.random.seed(42)

# Generate correlated 2D data
mean = [2, 3]
cov = [[3, 2.5], [2.5, 3]]
X = np.random.multivariate_normal(mean, cov, size=200)

# Step 1: Center data
X_centered = X - X.mean(axis=0)

# Step 2: Compute covariance matrix
n = X_centered.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
print("Covariance matrix:")
print(np.round(cov_matrix, 4))

# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nPrincipal component eigenvalues: {np.round(eigenvalues, 4)}")
explained_var = eigenvalues / eigenvalues.sum()
print(f"Explained variance ratio:        {np.round(explained_var, 4)}")

# Step 5: Project to 1D (first PC)
X_pca_1d = X_centered @ eigenvectors[:, 0]
print(f"\nProjected to 1D: shape {X_pca_1d.shape}, var={X_pca_1d.var():.4f}")
Positive Definite Matrices: A symmetric matrix $A$ is positive definite (PD) if $\mathbf{x}^\top A \mathbf{x} \gt 0$ for all $\mathbf{x} \neq \mathbf{0}$. Equivalently, all eigenvalues are positive. PD matrices arise as covariance matrices and as Hessians at non-degenerate local minima. Positive semi-definite (PSD) allows zero eigenvalues; covariance matrices are always PSD. A positive definite Hessian $H$ at a point means the function is locally convex there; when the Hessian is PSD everywhere, the function is convex, and gradient descent with a suitable step size converges to the global minimum.
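
A quick NumPy sketch of these checks (the two matrices are illustrative): test positive definiteness via the eigenvalues, and note that np.linalg.cholesky succeeds only for positive definite matrices.

import numpy as np

A_pd  = np.array([[4.0, 2.0], [2.0, 3.0]])   # positive definite
A_ind = np.array([[1.0, 2.0], [2.0, 1.0]])   # indefinite (eigenvalues 3 and -1)

for name, M in [("A_pd", A_pd), ("A_ind", A_ind)]:
    eigvals = np.linalg.eigvalsh(M)          # eigenvalues of a symmetric matrix
    print(f"{name}: eigenvalues = {np.round(eigvals, 4)}, "
          f"positive definite = {bool(np.all(eigvals > 0))}")

# Cholesky factorization exists only for positive definite matrices
try:
    np.linalg.cholesky(A_pd)
    print("Cholesky of A_pd succeeded")
except np.linalg.LinAlgError:
    pass
try:
    np.linalg.cholesky(A_ind)
except np.linalg.LinAlgError:
    print("Cholesky of A_ind failed: not positive definite")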

Practice Exercises

Exercise 1: Verify the OLS Normal Equations

Given $X = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}$ and $\mathbf{y} = \begin{pmatrix}2 \\ 4 \\ 5\end{pmatrix}$, compute $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ by hand (or with NumPy). What is the fitted line?

Answer:

$X^\top X = \begin{pmatrix}3 & 6 \\ 6 & 14\end{pmatrix}$, $X^\top \mathbf{y} = \begin{pmatrix}11 \\ 25\end{pmatrix}$.

$(X^\top X)^{-1} = \frac{1}{3 \cdot 14 - 36}\begin{pmatrix}14 & -6 \\ -6 & 3\end{pmatrix} = \frac{1}{6}\begin{pmatrix}14 & -6 \\ -6 & 3\end{pmatrix}$.

$\mathbf{w}^* = \frac{1}{6}\begin{pmatrix}14 \cdot 11 - 6 \cdot 25 \\ -6 \cdot 11 + 3 \cdot 25\end{pmatrix} = \frac{1}{6}\begin{pmatrix}154 - 150 \\ -66 + 75\end{pmatrix} = \frac{1}{6}\begin{pmatrix}4 \\ 9\end{pmatrix} = \begin{pmatrix}2/3 \\ 3/2\end{pmatrix}$.

Fitted line: $\hat{y} = \frac{2}{3} + \frac{3}{2}x \approx 0.667 + 1.5x$.
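
A quick NumPy check of the hand calculation (the exercise statement allows NumPy):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([2, 4, 5], dtype=float)

w = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
print(np.round(w, 4))                  # [0.6667 1.5   ]  ->  y_hat = 2/3 + 1.5x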

Exercise 2: Find the Eigenvalues of a 2×2 Matrix

Find the eigenvalues and eigenvectors of $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$. What is the geometric interpretation?

Answer:

Characteristic equation: $\det(A - \lambda I) = (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = 0$.

Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$.

Eigenvectors: For $\lambda_1 = 4$: $(A - 4I)\mathbf{v} = 0 \Rightarrow \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix}$. For $\lambda_2 = 2$: $\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix}$.

Geometric interpretation: $A$ scales the $[1,1]$ direction by 4 and the $[1,-1]$ direction by 2. If $A$ were a covariance matrix in PCA, the first principal component would be along $[1,1]$ (diagonal direction), capturing twice the variance of the second.
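
A quick NumPy check (np.linalg.eigh returns eigenvalues in ascending order, so they appear as 2 then 4; eigenvector signs may differ from the hand calculation):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
vals, vecs = np.linalg.eigh(A)  # eigh is the right choice for symmetric matrices
print(f"Eigenvalues:  {vals}")
print(f"Eigenvectors (columns):\n{np.round(vecs, 4)}")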

Conclusion & Next Steps

Linear algebra provides the language in which all of modern ML is written:

  • Vectors: feature representations, embeddings, gradient updates
  • Matrix multiply: neural network forward pass, attention, convolution (im2col)
  • Eigendecomposition: PCA, spectral clustering, stability analysis
  • SVD: low-rank approximation, recommender systems, LSA, compression
  • Normal equations: closed-form linear regression
  • Positive definiteness: convexity of loss landscapes, covariance validity

Next in the Series

In Part 8: Calculus & Optimization, we cover derivatives, gradients, the chain rule (the foundation of backpropagation), and gradient descent algorithms — from vanilla SGD to Adam.