Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series overview:
- Mathematical Thinking: mindset, notation & functions
- Set Theory & Foundations: sets, operations & ML connections
- Combinatorics: counting, permutations & combinations
- Probability Fundamentals: rules, Bayes & distributions
- Statistics: descriptive to inferential
- Information Theory: entropy, cross-entropy & KL divergence
- Linear Algebra: vectors, matrices & transformations
- Calculus & Optimization: derivatives, gradients & descent
- ML-Specific Math: loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: multivariate stats & Bayesian inference
- Projects & Applications: build from scratch: regression, Bayes, PCA
Indices & Logarithms
Index and logarithm rules underpin the math of loss functions, learning rate schedules, information theory, and activation functions (sigmoid/softmax). Mastering them removes a common obstacle when reading ML papers.
Laws of Indices (Exponents)
| Rule | Formula | Example |
|---|---|---|
| Product rule | $a^m \cdot a^n = a^{m+n}$ | $x^2 \cdot x^3 = x^5$ |
| Quotient rule | $a^m / a^n = a^{m-n}$ | $x^5 / x^2 = x^3$ |
| Power of power | $(a^m)^n = a^{mn}$ | $(x^2)^3 = x^6$ |
| Zero exponent | $a^0 = 1\;(a \neq 0)$ | $5^0 = 1$ |
| Negative exponent | $a^{-n} = 1/a^n$ | $2^{-3} = 1/8$ |
| Fractional exponent | $a^{m/n} = \sqrt[n]{a^m}$ | $8^{2/3} = (\sqrt[3]{8})^2 = 4$ |
| Product to power | $(ab)^n = a^n b^n$ | $(2x)^3 = 8x^3$ |
Laws of Logarithms
The logarithm $\log_b x = y$ means $b^y = x$. For ML: $\ln = \log_e$ (natural log, base $e \approx 2.718$); $\log_2$ (bits, used in information theory); $\log_{10}$ (dB scale, signal processing).
| Rule | Formula | ML Use |
|---|---|---|
| Product | $\log(ab) = \log a + \log b$ | Joint log-likelihood → sum of per-sample log-losses |
| Quotient | $\log(a/b) = \log a - \log b$ | Log of probability ratios in KL divergence |
| Power | $\log(a^n) = n \log a$ | Simplifying $\log(p^k)$ terms in likelihood |
| Change of base | $\log_b a = \frac{\ln a}{\ln b}$ | Converting between $\log_2$ and $\ln$ for entropy |
| Log of 1 | $\log 1 = 0$ | Perfect prediction has zero cross-entropy loss |
| Log of base | $\log_b b = 1$ | Base case in information content $I(x) = -\log_2 p(x)$ |
| Inverse | $e^{\ln x} = x$, $\ln(e^x) = x$ | Log-sum-exp trick for numerical stability in softmax |
import numpy as np
from scipy.special import logsumexp
# Laws of indices
print("Laws of Indices:")
print(f" 2^3 * 2^4 = 2^(3+4) = {2**7} (got {2**3 * 2**4})")
print(f" (3^2)^4 = 3^8 = {3**8} (got {(3**2)**4})")
print(f" 5^0 = 1 = {5**0}")
print(f" 2^(-3) = 1/8 = {2**-3:.4f}")
print(f" 8^(2/3) = 4 = {8**(2/3):.4f}")
# Laws of logarithms
print("\nLaws of Logarithms:")
print(f" log(6*7) = log6+log7 = {np.log(6)+np.log(7):.6f}, direct = {np.log(42):.6f}")
print(f" log(10/2) = log10-log2 = {np.log(10)-np.log(2):.6f}, direct = {np.log(5):.6f}")
print(f" log(2^10) = 10*log2 = {10*np.log(2):.6f}, direct = {np.log(2**10):.6f}")
# Log-sum-exp trick: naive exponentiation overflows for large z, so subtract the max first
z = np.array([1000.0, 1001.0, 1002.0])  # overflow risk: e^1000 = inf
c = np.max(z)
lse = c + np.log(np.sum(np.exp(z - c)))
print(f"\nLog-sum-exp trick (z=[1000,1001,1002])")
print(" Direct np.log(sum(exp(z))) would overflow to inf")
print(f" Log-sum-exp (stable): {lse:.4f}")
print(f" scipy logsumexp: {logsumexp(z):.4f} (matches)")
Vectors
A vector $\mathbf{v} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers. Geometrically, it represents a point or direction in $n$-dimensional space.
Vector operations:
- Addition: $(\mathbf{u} + \mathbf{v})_i = u_i + v_i$ (component-wise)
- Scalar multiplication: $(c\mathbf{v})_i = c \cdot v_i$
- Dot product: $\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i = |\mathbf{u}||\mathbf{v}|\cos\theta$
Vector norms measure magnitude: the $L_1$ norm $\|\mathbf{v}\|_1 = \sum_i |v_i|$, the $L_2$ (Euclidean) norm $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$, and the $L_\infty$ norm $\|\mathbf{v}\|_\infty = \max_i |v_i|$.
import numpy as np
# Define two 3-dimensional vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])
# Basic operations
print("Vector operations:")
print(f" u + v = {u + v}")
print(f" 2 * u = {2 * u}")
print(f" u · v (dot) = {np.dot(u, v):.2f}")
# Norms
print("\nVector norms of u:")
print(f" L1 norm ||u||_1 = {np.linalg.norm(u, ord=1):.4f}")
print(f" L2 norm ||u||_2 = {np.linalg.norm(u, ord=2):.4f}")
print(f" Inf norm ||u||_∞ = {np.linalg.norm(u, ord=np.inf):.4f}")
# Cosine similarity (used extensively in NLP/embeddings)
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"\nCosine similarity(u, v) = {cos_sim:.4f}")
angle_deg = np.degrees(np.arccos(cos_sim))
print(f"Angle between u and v = {angle_deg:.2f}°")
Matrices
A matrix $A \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns. In ML, a matrix typically represents a dataset (rows = samples, cols = features) or a linear transformation.
Transpose & Special Matrices
The transpose $(A^\top)_{ij} = A_{ji}$ swaps rows and columns. Key properties: $(AB)^\top = B^\top A^\top$, $(A^\top)^\top = A$.
Special matrix types:
| Type | Condition | ML Role |
|---|---|---|
| Square | $m = n$ | Weight matrices in fully-connected layers |
| Symmetric | $A = A^\top$ | Covariance matrices, kernel matrices, Hessians |
| Diagonal | $A_{ij} = 0$ for $i \neq j$ | Scaling transformations, diagonal covariance |
| Identity $I_n$ | Diagonal with 1s | Skip connections in ResNets: $y = F(x) + Ix$ |
| Orthogonal | $A^\top A = I$, $A^{-1} = A^\top$ | Rotation matrices; preserves vector norms |
| Lower/Upper triangular | All entries above/below diagonal are 0 | Cholesky decomposition for sampling Gaussians |
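The code below is a small sketch (matrices chosen arbitrarily) verifying two of these properties: a rotation matrix is orthogonal and preserves norms, and a diagonal matrix acts as per-axis scaling.
import numpy as np
# Rotation by 45 degrees is an orthogonal matrix: R^T R = I
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(f"R^T R ≈ I: {np.allclose(R.T @ R, np.eye(2))}")
x = np.array([3.0, 4.0])
print(f"||x|| = {np.linalg.norm(x):.4f}, ||Rx|| = {np.linalg.norm(R @ x):.4f}")  # norms match
# Symmetric check (e.g. a covariance matrix)
S = np.array([[2.0, 0.5], [0.5, 1.0]])
print(f"S symmetric: {np.allclose(S, S.T)}")
# Diagonal matrix = independent per-axis scaling
D = np.diag([2.0, 0.5])
print(f"D @ [1, 1] = {D @ np.array([1.0, 1.0])}")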
Determinant
The determinant $\det(A)$ is a scalar encoding the signed volume scaling factor of the linear transformation represented by $A$.
- $\det(A) \neq 0 \Leftrightarrow A$ is invertible (non-singular)
- $\det(A) = 0 \Leftrightarrow$ rows/columns are linearly dependent
- $\det(AB) = \det(A)\det(B)$; $\det(A^\top) = \det(A)$; $\det(A^{-1}) = 1/\det(A)$
For a 2×2 matrix:
$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$
For a 3×3 matrix (cofactor expansion along row 1):
$$\det(A) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$
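As a quick numerical sanity check (a sketch with arbitrary 2×2 matrices), the following verifies the $ad - bc$ formula and the product, transpose, and inverse properties listed above.
import numpy as np
A = np.array([[2.0, 1.0], [5.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 0.0]])
# 2x2 formula: det = ad - bc
print(f"ad - bc = {A[0,0]*A[1,1] - A[0,1]*A[1,0]:.4f}, np.linalg.det = {np.linalg.det(A):.4f}")
# Multiplicative property: det(AB) = det(A) det(B)
print(f"det(AB) = {np.linalg.det(A @ B):.4f}, det(A)det(B) = {np.linalg.det(A)*np.linalg.det(B):.4f}")
# Transpose and inverse properties
print(f"det(A^T) = {np.linalg.det(A.T):.4f}")
print(f"det(A^-1) = {np.linalg.det(np.linalg.inv(A)):.4f}, 1/det(A) = {1/np.linalg.det(A):.4f}")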
Cofactor, Minor, and Adjoint
These are the building blocks of the matrix inverse formula.
- Minor $M_{ij}$: determinant of the submatrix obtained by deleting row $i$ and column $j$
- Cofactor $C_{ij}$: $C_{ij} = (-1)^{i+j} M_{ij}$ — minor with checkerboard sign pattern
- Cofactor matrix: matrix of all cofactors $C_{ij}$
- Adjoint (adjugate): $\text{adj}(A) = [C_{ij}]^\top$ — transpose of the cofactor matrix
Matrix Inverse
For a square non-singular matrix, the inverse $A^{-1}$ satisfies $A^{-1}A = AA^{-1} = I$ and can be written via the adjugate:
$$A^{-1} = \frac{1}{\det(A)}\,\text{adj}(A)$$
Key properties: $(AB)^{-1} = B^{-1}A^{-1}$; $(A^\top)^{-1} = (A^{-1})^\top$; for symmetric $A$: $A^{-1}$ is also symmetric.
In practice, never compute $A^{-1}$ explicitly — use np.linalg.solve(A, b) which is faster and more numerically stable than np.linalg.inv(A) @ b.
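To tie the cofactor/adjugate construction to the inverse, here is a from-scratch sketch. The helper inverse_via_adjugate is illustrative only (not a library routine) and is only sensible for tiny matrices; it is compared against np.linalg.inv.
import numpy as np
def inverse_via_adjugate(A):
    # Illustrative only: A^{-1} = adj(A) / det(A), built from minors and cofactors
    n = A.shape[0]
    cof = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)  # delete row i, column j
            cof[i, j] = (-1) ** (i + j) * np.linalg.det(minor)     # cofactor C_ij
    return cof.T / np.linalg.det(A)                                # adjugate / determinant
C = np.array([[2.0, 1.0], [5.0, 3.0]])
print(f"Adjugate inverse:\n{inverse_via_adjugate(C)}")
print(f"np.linalg.inv:\n{np.linalg.inv(C)}")
print(f"Match: {np.allclose(inverse_via_adjugate(C), np.linalg.inv(C))}")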
import numpy as np
# Define a 3x3 matrix
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=float)
B = np.array([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]], dtype=float)
print("Matrix operations:")
print(f"Shape: {A.shape}")
print(f"Transpose A.T:\n{A.T}")
print(f"Matrix multiply A @ B:\n{A @ B}")
print(f"Trace tr(A) = {np.trace(A):.1f}")
print(f"Determinant det(A) = {np.linalg.det(A):.4f}") # 0 = singular
# Invertible matrix
C = np.array([[2, 1], [5, 3]], dtype=float)
print(f"\n2x2 matrix C:\n{C}")
print(f"Inverse C^-1:\n{np.linalg.inv(C)}")
print(f"C @ C^-1 (≈ Identity):\n{C @ np.linalg.inv(C)}")
Linear Transformations
A linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ satisfies $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$ and $T(c\mathbf{v}) = cT(\mathbf{v})$. Every linear transformation can be represented as left-multiplication by a matrix: $T(\mathbf{x}) = A\mathbf{x}$.
Geometric interpretations in $\mathbb{R}^2$:
- Scaling: $\begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}$ — stretches or compresses axes independently
- Rotation by $\theta$: $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ — rotates vectors counter-clockwise
- Projection onto $\hat{\mathbf{u}}$: $\mathbf{x} \mapsto (\hat{\mathbf{u}} \cdot \mathbf{x})\hat{\mathbf{u}}$ — drops a dimension
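A short sketch (test vectors chosen arbitrarily) applying each of these transformations and checking linearity:
import numpy as np
x = np.array([1.0, 2.0])
# Scaling: stretch x-axis by 2, compress y-axis by 0.5
S = np.array([[2.0, 0.0], [0.0, 0.5]])
print(f"Scaled: {S @ x}")
# Rotation by 90 degrees counter-clockwise
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(f"Rotated: {np.round(R @ x, 6)}")
# Projection onto the unit vector along [1, 1]
u_hat = np.array([1.0, 1.0]) / np.sqrt(2)
print(f"Projected: {(u_hat @ x) * u_hat}")
# Linearity check: T(u + v) = T(u) + T(v)
v = np.array([3.0, -1.0])
print(f"Linearity holds: {np.allclose(R @ (x + v), R @ x + R @ v)}")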
Systems of Linear Equations
A system $A\mathbf{x} = \mathbf{b}$ where $A \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{b} \in \mathbb{R}^m$:
- Unique solution: $m = n$, $A$ invertible → $\mathbf{x} = A^{-1}\mathbf{b}$
- Overdetermined ($m \gt n$): more equations than unknowns — least-squares solution: $\mathbf{x} = (A^\top A)^{-1} A^\top \mathbf{b}$ (the normal equations)
- Underdetermined ($m \lt n$): infinitely many solutions — find the minimum norm solution via pseudo-inverse
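Each case in one sketch (the systems below are made up for illustration):
import numpy as np
# Unique solution: square invertible A
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
print(f"solve: {np.linalg.solve(A, b)}")
# Overdetermined (3 equations, 2 unknowns): least-squares via lstsq
A_over = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b_over = np.array([1.0, 2.0, 2.5])
w, *_ = np.linalg.lstsq(A_over, b_over, rcond=None)
print(f"lstsq (least squares): {w}")
# Underdetermined (1 equation, 2 unknowns): minimum-norm solution via pseudo-inverse
A_under = np.array([[1.0, 2.0]])
b_under = np.array([5.0])
print(f"pinv (minimum norm): {np.linalg.pinv(A_under) @ b_under}")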
Linear Regression via Normal Equations
Given $m$ training examples with feature matrix $X \in \mathbb{R}^{m \times n}$ and labels $\mathbf{y} \in \mathbb{R}^m$, the OLS solution minimizes $\|X\mathbf{w} - \mathbf{y}\|^2$:
$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$
This is the closed-form analytic solution — no gradient descent needed. It's $O(n^3)$ to compute the matrix inverse, making it feasible for small-to-medium $n$ but impractical for deep learning weight matrices with millions of parameters (hence gradient-based optimization).
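A minimal sketch on synthetic data (a bias column of ones is stacked into the feature matrix; numbers are made up), solving the normal equations with np.linalg.solve rather than an explicit inverse and comparing with np.linalg.lstsq:
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=x.shape)  # y ≈ 1 + 2x plus noise
X = np.column_stack([np.ones_like(x), x])             # design matrix with bias column
w_normal = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # dedicated least-squares routine
print(f"Normal equations: {np.round(w_normal, 4)}")
print(f"np.linalg.lstsq:  {np.round(w_lstsq, 4)}")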
Cramer's Rule
For a system $A\mathbf{x} = \mathbf{b}$ with $A$ square and non-singular, Cramer's Rule gives an explicit formula for each unknown:
$$x_i = \frac{\det(A_i)}{\det(A)}$$
where $A_i$ is the matrix $A$ with the $i$-th column replaced by $\mathbf{b}$.
import numpy as np
# Solve 3x3 system using Cramer's Rule
# 2x + y - z = 8
# -3x - y + 2z = -11
# -2x + y + 2z = -3
A = np.array([[ 2, 1, -1],
[-3, -1, 2],
[-2, 1, 2]], dtype=float)
b = np.array([8, -11, -3], dtype=float)
det_A = np.linalg.det(A)
print(f"det(A) = {det_A:.4f}")
# Cramer's Rule: replace each column in turn
x_cramer = []
for i in range(3):
    Ai = A.copy()
    Ai[:, i] = b  # replace column i with b
    xi = np.linalg.det(Ai) / det_A
    x_cramer.append(xi)
    print(f" det(A_{i+1}) = {np.linalg.det(Ai):.4f} -> x{i+1} = {xi:.4f}")
print(f"\nCramer solution: {np.round(x_cramer, 6)}")
# Verify with numpy solve (more efficient)
x_solve = np.linalg.solve(A, b)
print(f"np.linalg.solve: {np.round(x_solve, 6)}")
print(f"Match: {np.allclose(x_cramer, x_solve)}")
Rank & Vector Spaces
Key subspaces of a matrix $A \in \mathbb{R}^{m \times n}$:
- Column space (range): $\text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$
- Null space (kernel): $\text{null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\} \subseteq \mathbb{R}^n$
- Row space: column space of $A^\top$
- Rank: $\text{rank}(A)$ = dimension of column space = dimension of row space
Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$
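A numerical sketch of these ideas (the rank-deficient matrix below is constructed so its third row is the sum of the first two): compute the rank, extract a null-space basis from the SVD, and confirm rank-nullity.
import numpy as np
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 3.0, 1.0, 3.0]])  # row 3 = row 1 + row 2 -> rank deficient
m, n = A.shape
rank = np.linalg.matrix_rank(A)
print(f"rank(A) = {rank}")
# Null-space basis: right singular vectors beyond the rank
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T
print(f"nullity = {null_basis.shape[1]}")
print(f"A @ null basis ≈ 0: {np.allclose(A @ null_basis, 0)}")
print(f"rank + nullity = n: {rank + null_basis.shape[1] == n}")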
Eigenvalues & Eigenvectors
For a square matrix $A \in \mathbb{R}^{n \times n}$, a non-zero vector $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if:
$$A\mathbf{v} = \lambda\mathbf{v}$$
Eigenvectors point in directions that are only scaled (not rotated) by $A$. Eigenvalues tell us by how much.
Find eigenvalues by solving the characteristic equation: $\det(A - \lambda I) = 0$.
Eigendecomposition of a diagonalizable matrix $A$:
$$A = Q \Lambda Q^{-1}$$
where $Q$ has eigenvectors as columns and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$. For symmetric matrices ($A = A^\top$), $Q$ is orthogonal ($Q^{-1} = Q^\top$), giving $A = Q\Lambda Q^\top$.
import numpy as np
# Symmetric matrix (common in ML: covariance matrices, kernel matrices)
A = np.array([[4, 2],
[2, 3]], dtype=float)
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigendecomposition of symmetric 2x2 matrix:")
print(f"A =\n{A}")
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors (columns):\n{eigenvectors}")
# Verify: A * v = lambda * v for first eigenpair
v0, lam0 = eigenvectors[:, 0], eigenvalues[0]
print(f"\nVerification A*v0 = lambda0*v0:")
print(f" A*v0 = {A @ v0}")
print(f" lam0*v0 = {lam0 * v0}")
print(f" Match? {np.allclose(A @ v0, lam0 * v0)}")
# Reconstruct A from eigendecomposition
Q = eigenvectors
Lambda = np.diag(eigenvalues)
A_reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print(f"\nReconstructed A:\n{np.round(A_reconstructed, 6)}")
Singular Value Decomposition
SVD generalizes eigendecomposition to non-square matrices. Every matrix $A \in \mathbb{R}^{m \times n}$ can be factored as:
$$A = U \Sigma V^\top$$
- $U \in \mathbb{R}^{m \times m}$: orthogonal — left singular vectors (columns are orthonormal)
- $\Sigma \in \mathbb{R}^{m \times n}$: diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
- $V \in \mathbb{R}^{n \times n}$: orthogonal — right singular vectors
Low-rank approximation: Keeping only the top $k$ singular values gives the best rank-$k$ approximation of $A$ in the Frobenius norm (Eckart–Young theorem):
$$A_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$$
import numpy as np
# SVD of a 4x3 matrix
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print("SVD of 4x3 matrix:")
print(f" U shape: {U.shape}")
print(f" s (singular values): {np.round(s, 4)}")
print(f" V^T shape: {Vt.shape}")
# Reconstruct A from the full SVD (rank-1 approximation follows below)
Sigma_full = np.zeros_like(A)
Sigma_full[:3, :3] = np.diag(s)
A_reconstructed = U @ Sigma_full @ Vt
print(f"\nReconstructed A (full SVD):\n{np.round(A_reconstructed, 6)}")
# Rank-1 approximation
k = 1
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(f"\nRank-1 approximation:\n{np.round(A_rank1, 4)}")
fro_error = np.linalg.norm(A - A_rank1, 'fro')
print(f"Frobenius reconstruction error: {fro_error:.4f}")
Distance Metrics
Distance metrics quantify similarity between vectors — critical for k-NN classifiers, clustering (k-means), recommendation systems, and embedding spaces.
Common Distance Metrics
| Metric | Formula | Properties & Use Cases |
|---|---|---|
| Euclidean ($L_2$) | $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_i (u_i - v_i)^2}$ | Straight-line distance; sensitive to scale — must standardize features. Default in k-NN, k-means. |
| Manhattan ($L_1$) | $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_1 = \sum_i |u_i - v_i|$ | Grid distance ("city blocks"); more robust to outliers than $L_2$; better for high dimensions. |
| Minkowski ($L_p$) | $d(\mathbf{u}, \mathbf{v}) = \left(\sum_i |u_i - v_i|^p\right)^{1/p}$ | Generalizes $L_1$ ($p=1$) and $L_2$ ($p=2$); $p \to \infty$ gives Chebyshev distance $\max_i|u_i - v_i|$. |
| Cosine | $\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$ | Angle-based — ignores magnitude. Standard for text embeddings, NLP. $\text{dist} = 1 - \cos(\mathbf{u}, \mathbf{v})$. |
| Hamming | $d(\mathbf{u}, \mathbf{v}) = \frac{1}{n}\sum_i \mathbf{1}[u_i \neq v_i]$ | Count of differing positions. Binary/categorical vectors; one-hot encoded features. |
import numpy as np
from scipy.spatial import distance
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([4.0, 1.0, 2.0, 1.0])
# Manual computations
euclidean = np.sqrt(np.sum((u - v)**2))
manhattan = np.sum(np.abs(u - v))
cosine_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
hamming = np.mean(u != v)  # fraction of differing positions (intended for binary/categorical vectors)
print(f"Euclidean distance: {euclidean:.4f}")
print(f"Manhattan distance: {manhattan:.4f}")
print(f"Cosine similarity: {cosine_sim:.4f}")
print(f"Cosine distance: {1 - cosine_sim:.4f}")
print(f"Hamming distance: {hamming:.4f}")
# Minkowski generalisation
for p in [1, 2, 3, np.inf]:
    if p == np.inf:
        d = np.max(np.abs(u - v))
        label = "Chebyshev (p=inf)"
    else:
        d = np.sum(np.abs(u - v)**p)**(1/p)
        label = f"Minkowski p={p}"
    print(f"{label}: {d:.4f}")
# scipy equivalents (same results)
print(f"\nscipy euclidean: {distance.euclidean(u, v):.4f}")
print(f"scipy cityblock: {distance.cityblock(u, v):.4f}")
print(f"scipy cosine: {distance.cosine(u, v):.4f}")
ML Connections
PCA via Eigendecomposition of Covariance Matrix
PCA finds directions of maximum variance by computing eigenvectors of the covariance matrix $\Sigma = \frac{1}{n-1}X_c^\top X_c$ (where $X_c$ is mean-centered data):
- Center data: $X_c = X - \bar{X}$
- Compute covariance: $\Sigma = X_c^\top X_c / (n-1)$
- Eigendecompose: $\Sigma = Q\Lambda Q^\top$
- Sort eigenvectors by eigenvalue (descending)
- Project: $Z = X_c Q_k$ (keep top $k$ eigenvectors)
Equivalently: PCA is SVD of the centered data matrix. The right singular vectors $V$ are the principal components. sklearn's PCA uses truncated SVD for efficiency.
import numpy as np
# PCA from scratch using eigendecomposition
np.random.seed(42)
# Generate correlated 2D data
mean = [2, 3]
cov = [[3, 2.5], [2.5, 3]]
X = np.random.multivariate_normal(mean, cov, size=200)
# Step 1: Center data
X_centered = X - X.mean(axis=0)
# Step 2: Compute covariance matrix
n = X_centered.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
print("Covariance matrix:")
print(np.round(cov_matrix, 4))
# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"\nPrincipal component eigenvalues: {np.round(eigenvalues, 4)}")
explained_var = eigenvalues / eigenvalues.sum()
print(f"Explained variance ratio: {np.round(explained_var, 4)}")
# Step 5: Project to 1D (first PC)
X_pca_1d = X_centered @ eigenvectors[:, 0]
print(f"\nProjected to 1D: shape {X_pca_1d.shape}, var={X_pca_1d.var():.4f}")
Practice Exercises
Verify the OLS Normal Equations
Given $X = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3\end{pmatrix}$ and $\mathbf{y} = \begin{pmatrix}2 \\ 4 \\ 5\end{pmatrix}$, compute $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ by hand (or with NumPy). What is the fitted line?
Show Answer
$X^\top X = \begin{pmatrix}3 & 6 \\ 6 & 14\end{pmatrix}$, $X^\top \mathbf{y} = \begin{pmatrix}11 \\ 25\end{pmatrix}$.
$(X^\top X)^{-1} = \frac{1}{3 \cdot 14 - 36}\begin{pmatrix}14 & -6 \\ -6 & 3\end{pmatrix} = \frac{1}{6}\begin{pmatrix}14 & -6 \\ -6 & 3\end{pmatrix}$.
$\mathbf{w}^* = \frac{1}{6}\begin{pmatrix}14 \cdot 11 - 6 \cdot 25 \\ -6 \cdot 11 + 3 \cdot 25\end{pmatrix} = \frac{1}{6}\begin{pmatrix}154 - 150 \\ -66 + 75\end{pmatrix} = \frac{1}{6}\begin{pmatrix}4 \\ 9\end{pmatrix} = \begin{pmatrix}2/3 \\ 3/2\end{pmatrix}$.
Fitted line: $\hat{y} = \frac{2}{3} + \frac{3}{2}x \approx 0.667 + 1.5x$.
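A quick NumPy check of this result:
import numpy as np
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])
w = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
print(np.round(w, 4))                  # [0.6667 1.5]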
Find Eigenvalues of a 2×2 Matrix
Find the eigenvalues and eigenvectors of $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$. What is the geometric interpretation?
Show Answer
Characteristic equation: $\det(A - \lambda I) = (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = 0$.
Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$.
Eigenvectors: For $\lambda_1 = 4$: $(A - 4I)\mathbf{v} = 0 \Rightarrow \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix}$. For $\lambda_2 = 2$: $\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix}$.
Geometric interpretation: $A$ scales the $[1,1]$ direction by 4 and the $[1,-1]$ direction by 2. If $A$ were a covariance matrix in PCA, the first principal component would be along $[1,1]$ (diagonal direction), capturing twice the variance of the second.
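And a NumPy check (np.linalg.eigh returns eigenvalues in ascending order; eigenvector signs may differ):
import numpy as np
A = np.array([[3.0, 1.0], [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)   # [2. 4.]
print(eigenvectors)  # columns proportional to [1, -1]/sqrt(2) and [1, 1]/sqrt(2)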
Conclusion & Next Steps
Linear algebra provides the language in which all of modern ML is written:
- Vectors: feature representations, embeddings, gradient updates
- Matrix multiply: neural network forward pass, attention, convolution (im2col)
- Eigendecomposition: PCA, spectral clustering, stability analysis
- SVD: low-rank approximation, recommender systems, LSA, compression
- Normal equations: closed-form linear regression
- Positive definiteness: convexity of loss landscapes, covariance validity
Next in the Series
In Part 8: Calculus & Optimization, we cover derivatives, gradients, the chain rule (the foundation of backpropagation), and gradient descent algorithms — from vanilla SGD to Adam.