
Part 1: Mathematical Thinking

April 24, 2026 · Wasil Zafar · 18 min read

Build the mathematical mindset before diving into formulas — how mathematicians think, how to read notation fluently, and the core functions that power every ML and AI model you'll ever build.

Table of Contents

  1. Why Mathematical Thinking?
  2. The Mathematical Mindset
  3. Functions & Graphs
  4. Interactive Function Explorer
  5. Practice Exercises
  6. Conclusion & Next Steps

Why Mathematical Thinking?

Imagine two engineers. Both copy a neural network from a tutorial. It runs. Then the model starts predicting garbage. Engineer A opens Stack Overflow and tweaks random hyperparameters hoping something improves. Engineer B reads the loss curve, recalls that the sigmoid activation is causing vanishing gradients for deep networks, and swaps in ReLU. The model converges in minutes.

The difference isn't raw intelligence. It's mathematical literacy.

Three Levels of ML Understanding:
  • Level 1 — Black Box User: "I run the library and get predictions." No understanding of what happens inside.
  • Level 2 — Configurator: "I know which hyperparameters to tune and why." Some intuition, limited by library docs.
  • Level 3 — Builder: "I can derive the algorithm from first principles and modify it." Full mathematical fluency.

This bootcamp takes you from Level 1 to Level 3 — systematically. But before any formula, we need to talk about how mathematicians think. The mindset comes before the mechanics.

How Math Powers ML

Every concept in ML is underpinned by a branch of mathematics:

Mathematics Behind ML/AI
flowchart LR
    A[Your Data] --> B[Linear Algebra\nVectors & Matrices]
    A --> C[Probability\nUncertainty & Distributions]
    B --> D[Model\nFunction Composition]
    C --> D
    D --> E[Calculus\nGradient Descent]
    E --> F[Trained Model\nPredictions]
    G[Statistics\nEvaluation & Inference] --> F
    H[Information Theory\nLoss Functions] --> E
                            

None of these areas exist in isolation — they connect and reinforce each other. This series follows a deliberate order: each phase builds the vocabulary needed for the next. We start here, at Phase 0, with the prerequisite that makes everything else learnable: mathematical thinking itself.

The Mathematical Mindset

0.1.1 — Abstraction & Modeling

Abstraction is the act of stripping away irrelevant details to reveal underlying structure. It's not dumbing things down — it's identifying what actually matters for a given question.

Think of it like this: You're trying to predict house prices. A house has a front door, a paint colour, noisy neighbours, memories of the previous owner, and 1,400 square feet. Only a few of those matter for price. Abstraction means you decide: size, location, age, number of rooms, proximity to schools. Everything else is noise you deliberately ignore.

The mathematical model that emerges is:

$$\text{price} = f(\text{size},\; \text{location},\; \text{age},\; \text{rooms}) + \varepsilon$$

where $\varepsilon$ captures the noise we've chosen to ignore. That $\varepsilon$ is not laziness — it's an intentional modelling decision. All of machine learning is about building and refining such abstractions.

Real-World Example Spam Filter
Spam Classification as Abstraction

An email has a subject, body, sender, timestamp, HTML formatting, attached images, and metadata. A spam filter abstracts this to a vector of word frequencies — a bag of words. The model then learns: emails containing "FREE!!!", "CLICK HERE", and "WINNER" are spam. By abstracting text into numbers, we made an unstructured problem solvable with algebra.

The abstraction discards meaning, grammar, and context. That's a deliberate tradeoff: we lose nuance but gain tractability. More sophisticated models (like Transformers) preserve more context — they use a richer abstraction, not a different principle.
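To make the bag-of-words abstraction concrete, here is a minimal sketch in plain Python; the vocabulary and example emails are invented for illustration:

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Abstract raw text into a vector of word counts over a fixed vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

# Hypothetical spam-trigger vocabulary and example emails
vocab = ["free", "click", "winner", "meeting", "agenda"]
spam = "FREE prize WINNER click here to claim your FREE money"
ham = "Meeting moved to 3pm, agenda attached"

print(bag_of_words(spam, vocab))  # [2, 1, 1, 0, 0]
print(bag_of_words(ham, vocab))   # [0, 0, 0, 1, 1]
```

Notice what the vector keeps (which trigger words, and how often) and what it throws away (order, grammar, everything outside the vocabulary): that is the abstraction doing its job.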

Key insight: When you see a mathematical formula, ask yourself: what real-world phenomenon is this abstracting, and what is it intentionally ignoring? That question will unlock 90% of what a formula is trying to say.

0.1.2 — Notation Fluency

Mathematical notation is a compressed language. Like any language, it feels intimidating until you build vocabulary. The secret is that most ML papers use a small, consistent set of symbols repeatedly.

Core ML Notation Reference:
| Symbol | Name | Meaning in ML | Example |
| --- | --- | --- | --- |
| $\sum_{i=1}^{n}$ | Summation | Sum over $n$ items | $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$ |
| $\prod_{i=1}^{n}$ | Product | Multiply $n$ items | $\prod_{i=1}^{n} p_i = p_1 \cdot p_2 \cdots p_n$ |
| $\forall$ | For all | Applies to every element | $\forall x \in \mathbb{R}$: for every real number $x$ |
| $\exists$ | There exists | At least one element satisfies | $\exists\, w$ such that $f(w) = 0$ |
| $\in$ | Element of | Belongs to a set | $x_i \in \mathbb{R}^d$: $x_i$ is a $d$-dimensional vector |
| $\mathbb{R}^n$ | Real $n$-space | Feature vector space | $\mathbf{x} \in \mathbb{R}^{784}$ (28×28 image flattened) |
| $\hat{y}$ | "y hat" | Predicted value | $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ |
| $\lVert \cdot \rVert$ | Norm | Size/length of a vector | $\lVert \mathbf{w} \rVert^2 = \sum_i w_i^2$ |
| $\nabla$ | Nabla / Gradient | Direction of steepest ascent | $\nabla_\theta \mathcal{L}$: gradient of loss w.r.t. $\theta$ |
| $\mathbb{E}[\cdot]$ | Expectation | Average over a distribution | $\mathbb{E}[X] = \sum_x x \cdot P(X=x)$ |

Greek letters you'll see constantly: $\alpha$ (learning rate), $\beta$ (regression coefficients), $\theta$ (model parameters), $\mu$ (mean), $\sigma$ (standard deviation), $\lambda$ (regularisation strength), $\varepsilon$ (small noise / error), $\eta$ (step size).

How to get comfortable with notation: Don't try to memorise. Instead, every time you see an unfamiliar symbol, pause and decode it. Say it out loud: "for all $i$ from 1 to $n$, sum up $x_i$". Within a few weeks, notation fluency becomes automatic — exactly like learning a new spoken language.

import numpy as np

# Notation in code: the Sigma (summation) symbol
# Mathematical: sum_{i=1}^{n} x_i
x = np.array([3, 7, 2, 9, 1])
total = np.sum(x)           # Σ x_i
print("Sum:", total)        # 22

# The Product symbol: prod_{i=1}^{n} x_i
product = np.prod(x)
print("Product:", product)  # 378

# Norm ||w||^2 = sum_i w_i^2
w = np.array([0.3, -0.7, 0.1])
norm_squared = np.sum(w ** 2)
print("||w||²:", round(norm_squared, 4))  # 0.59

0.1.3 — Proof vs Intuition

Pure mathematics demands proof: a rigorous logical argument showing a statement is always true. Machine learning is more pragmatic — you'll use both proof and intuition, knowing when each is appropriate.

Think of it like this: Proof is the map. Intuition is the felt sense of the terrain. A good navigator uses both — the map to plan a route, the terrain sense to adapt when the map is incomplete. In ML, you'll often trust the intuition of a researcher who tested many approaches, then verify with proof (or experiments) once you have a hypothesis.

Example — Why does gradient descent converge?

  • Intuition: Imagine a ball rolling down a hilly surface. At every step, it moves in the direction of steepest descent. Eventually it reaches a valley (minimum). The ball "knows" where to go locally without needing to see the whole landscape.
  • Proof: For a convex function $f$ with Lipschitz-continuous gradient, with step size $\alpha \leq \frac{1}{L}$ (where $L$ is the Lipschitz constant), gradient descent converges at rate $O(1/k)$ after $k$ iterations.

The proof tells you when it works and how fast. The intuition tells you why it makes sense. Non-convex neural network loss landscapes complicate the proof — but the intuition still guides practice.
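The ball-rolling intuition is easy to verify on a toy problem. A minimal sketch, using the convex quadratic $f(w) = (w-3)^2$, whose gradient is $2(w-3)$ and whose Lipschitz constant is $L = 2$, so any step size $\alpha \leq 0.5$ is in the safe range:

```python
# Gradient descent on the convex quadratic f(w) = (w - 3)^2, a minimal sketch.
# The gradient is f'(w) = 2(w - 3) and the Lipschitz constant is L = 2,
# so the convergence result above applies for any step size alpha <= 1/L = 0.5.

def grad(w):
    return 2 * (w - 3)

w = 10.0        # start far from the minimum at w = 3
alpha = 0.4     # step size inside the safe range
for _ in range(100):
    w -= alpha * grad(w)

print(round(w, 6))  # 3.0: the "ball" has settled in the valley
```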

For this bootcamp: We'll develop both. Some sections are proof-heavy (set theory, probability axioms), others are intuition-led (gradient descent, neural networks). You'll learn to recognise which is appropriate for each context.

0.1.4 — Approximation vs Exactness

Here's a fact that surprises many beginners: virtually nothing in machine learning is exact. And that's completely intentional.

Key Concept Floating Point
Why Computers Can't Be Exact

The number $\frac{1}{3} = 0.33333...$ has infinitely many decimal places. Computers store numbers in 32-bit or 64-bit floating-point format, so any number needing more digits than the format holds is truncated: every irrational number (like $\pi$, $e$, $\sqrt{2}$), but also innocuous decimals like $0.1$, which has no finite binary representation. This creates floating-point error. In most ML applications this error is negligible, but in certain operations (like subtracting two nearly equal numbers) it can amplify catastrophically.

import numpy as np

# Floating point arithmetic — not always exact
a = 0.1 + 0.2
print(a)                         # 0.30000000000000004 (not 0.3!)
print(a == 0.3)                  # False

# Use np.isclose() for safe comparisons
print(np.isclose(a, 0.3))        # True

# Catastrophic cancellation — subtracting nearby numbers
x = 1.0000001
y = 1.0000000
result = x - y
print(result)                    # ~1e-7, but accumulated errors can dominate

# ML implication: always use log-sum-exp trick for numerical stability
# Instead of log(exp(a) + exp(b)):
a_val, b_val = 1000.0, 999.0
# Naive (overflows):
# np.log(np.exp(a_val) + np.exp(b_val))  # inf!
# Stable version:
m = max(a_val, b_val)
stable = m + np.log(np.exp(a_val - m) + np.exp(b_val - m))
print("log-sum-exp:", stable)    # 1000.3132...

In ML, we accept approximations willingly — stochastic gradient descent uses a random subset of data each step (an approximation of the full gradient). Monte Carlo methods estimate integrals by sampling. Variational inference approximates intractable posterior distributions. The art is knowing how much error is acceptable and where exactness genuinely matters.

When exactness matters in ML: Numerical stability in loss computation (use log_softmax, not log(softmax(x))); matrix inversion (prefer pseudoinverse for ill-conditioned matrices); probability calculations (work in log-space to prevent underflow with very small values).
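The first of these points can be demonstrated directly: naively computing log(softmax(x)) overflows for large logits, while the equivalent log-sum-exp form stays finite. A sketch with invented logits:

```python
import numpy as np

logits = np.array([1000.0, 0.0])

# Naive: exp(1000) overflows to inf, and the log turns the ratio into nan
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive = np.log(np.exp(logits) / np.exp(logits).sum())
print(naive)  # nan and -inf: useless as loss terms

# Stable: log_softmax(x) = x - logsumexp(x), with the usual max trick
m = logits.max()
stable = logits - (m + np.log(np.exp(logits - m).sum()))
print(stable)  # [0., -1000.]: finite and correct
```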

Functions & Graphs

A function is a rule that takes an input and produces a unique output. Every ML model is, fundamentally, a function: it takes in features and outputs a prediction. Before building complex models, you need deep familiarity with the building blocks.

$$f: X \to Y \quad \text{means "} f \text{ maps elements of } X \text{ to elements of } Y \text{"}$$

For example, a classifier maps from feature space to label space: $f: \mathbb{R}^d \to \{0, 1\}$. A regressor maps to real numbers: $f: \mathbb{R}^d \to \mathbb{R}$.

0.2.1 — Linear Functions

The simplest and most important class. A linear function of one variable:

$$f(x) = mx + b$$

where $m$ is the slope (rate of change) and $b$ is the y-intercept (value when $x=0$). The slope tells you: "for every unit increase in $x$, $f(x)$ increases by $m$."

In multiple dimensions (the foundation of linear regression):

$$f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b$$

where $\mathbf{w} = [w_1, w_2, \ldots, w_n]^\top$ is the weight vector and $b$ is the bias. Every linear regression model, every neuron in a neural network before activation, is exactly this formula.

Geometric interpretation: In 2D, a linear function draws a straight line. In 3D, it draws a plane. In $n$ dimensions, it draws a hyperplane — the decision boundary of a linear classifier. When you hear "linear model", visualise a straight hyperplane cutting through your feature space.

import numpy as np
import matplotlib.pyplot as plt

# Linear function: f(x) = 2x + 1
x = np.linspace(-5, 5, 100)
f = 2 * x + 1  # slope=2, intercept=1

plt.figure(figsize=(7, 4))
plt.plot(x, f, color='#3B9797', linewidth=2.5, label='f(x) = 2x + 1')
plt.axhline(0, color='gray', linewidth=0.5)
plt.axvline(0, color='gray', linewidth=0.5)
plt.scatter([0], [1], color='#BF092F', s=80, zorder=5, label='y-intercept (0, 1)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Linear Function: f(x) = mx + b')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

0.2.2 — Polynomial Functions

A polynomial extends the linear idea to higher powers:

$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

The degree $n$ determines the shape and complexity:

  • Degree 1 (linear): straight line — $f(x) = 2x + 1$
  • Degree 2 (quadratic): parabola — $f(x) = x^2 - 4$, opens up or down
  • Degree 3 (cubic): S-curve — one inflection point, can model more complex relationships
  • Degree $n$: up to $n-1$ turning points

ML connection — Polynomial Feature Engineering: Linear models struggle with curved relationships. The fix is to add polynomial features. If $x$ is house size, add $x^2$ and $x^3$ as new features. Now your linear model in the new feature space becomes a polynomial model in the original space — more flexible, still analytically tractable.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Original feature: one variable x
x = np.array([[1], [2], [3], [4], [5]])

# Add polynomial features up to degree 3
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

# Now each sample has features: [x, x^2, x^3]
print("Original features:\n", x.T)
print("Polynomial features:\n", x_poly.T)
# Output row 0: [1, 2, 3, 4, 5]    <- x
# Output row 1: [1, 4, 9, 16, 25]  <- x^2
# Output row 2: [1, 8, 27, 64, 125] <- x^3

Danger of high-degree polynomials: A degree-100 polynomial can pass through every training point perfectly — but it will behave wildly between points. This is overfitting: memorising the training data instead of learning the underlying pattern. Understanding polynomial behaviour is a prerequisite for understanding overfitting.
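This wild behaviour is easy to reproduce. A sketch using NumPy's polynomial fitting on invented noisy data; with 15 points, a degree-14 polynomial passes through every one of them exactly:

```python
import numpy as np
from numpy.polynomial import Polynomial

# 15 noisy samples of the simple relationship y = x (invented data)
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
y_train = x_train + rng.normal(0, 0.1, size=15)

low = Polynomial.fit(x_train, y_train, deg=1)    # sensible model
high = Polynomial.fit(x_train, y_train, deg=14)  # interpolates every point

# Evaluate between the training points, where overfitting shows up
x_test = np.linspace(-1, 1, 200)
err_low = np.max(np.abs(low(x_test) - x_test))
err_high = np.max(np.abs(high(x_test) - x_test))
print(f"max error, degree 1:  {err_low:.3f}")
print(f"max error, degree 14: {err_high:.3f}")  # typically far larger off the grid
```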

0.2.3 — Exponential & Logarithmic Functions

The exponential function is everywhere in ML:

$$f(x) = e^x \quad \text{where } e \approx 2.71828\ldots \text{ (Euler's number)}$$

Its defining property: the derivative of $e^x$ is $e^x$ itself. It's the only function that equals its own rate of change. This makes it uniquely tractable in calculus-heavy ML derivations.
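You can check this defining property numerically with a central finite difference; the point $x = 1.5$ is arbitrary:

```python
import numpy as np

# Central finite-difference check that d/dx e^x = e^x itself
x, h = 1.5, 1e-6
numeric = (np.exp(x + h) - np.exp(x - h)) / (2 * h)

print(round(numeric, 6))     # 4.481689
print(round(np.exp(x), 6))   # 4.481689, the same value
```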

The natural logarithm $\ln(x) = \log_e(x)$ is the inverse:

$$\ln(e^x) = x \qquad e^{\ln(x)} = x$$

Where you'll see these in ML:

  • Sigmoid activation: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ — squashes any value into $(0, 1)$
  • Softmax: $\text{softmax}(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$ — turns logits into a probability distribution
  • Cross-entropy loss: $\mathcal{L} = -\sum_i y_i \ln(\hat{p}_i)$ — penalises wrong probability assignments harshly
  • Log-likelihood: products of small probabilities become sums of logs — numerically stable and analytically convenient

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)

# Three core exponential/log functions in ML
sigmoid = 1 / (1 + np.exp(-x))
exp_x   = np.exp(x)

x_pos = np.linspace(0.01, 3, 200)
log_x = np.log(x_pos)

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

axes[0].plot(x, sigmoid, color='#3B9797', linewidth=2.5)
axes[0].set_title('Sigmoid: 1 / (1 + e^{-x})')
axes[0].axhline(0.5, linestyle='--', color='gray', alpha=0.5)
axes[0].set_xlabel('x'); axes[0].set_ylabel('σ(x)')

axes[1].plot(x, exp_x, color='#16476A', linewidth=2.5)
axes[1].set_title('Exponential: e^x')
axes[1].set_xlabel('x'); axes[1].set_ylabel('e^x')

axes[2].plot(x_pos, log_x, color='#BF092F', linewidth=2.5)
axes[2].set_title('Natural Log: ln(x)')
axes[2].set_xlabel('x'); axes[2].set_ylabel('ln(x)')

for ax in axes:
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
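One reason logarithms are so pervasive in ML is numerical: a product of many small probabilities underflows to zero, while the equivalent sum of logs stays representable. A quick demonstration with invented probabilities:

```python
import numpy as np

# 1000 independent events, each with probability 1e-5 (invented numbers)
p = np.full(1000, 1e-5)

print(np.prod(p))         # 0.0: the product underflows
print(np.sum(np.log(p)))  # about -11512.93: the log-likelihood survives
```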

0.2.4 — Function Transformations

Every activation function is a transformed version of a simpler function. Understanding transformations means you can decode any function visually:

| Transformation | Formula | Effect on Graph | ML Example |
| --- | --- | --- | --- |
| Vertical shift up by $c$ | $f(x) + c$ | Move graph $c$ units up | Adding bias $b$: $\mathbf{w}^\top\mathbf{x} + b$ |
| Horizontal shift right by $c$ | $f(x - c)$ | Move graph $c$ units right | Shifted activation threshold |
| Vertical scale by $a$ | $a \cdot f(x)$ | Stretch ($\lvert a\rvert > 1$) or compress ($\lvert a\rvert < 1$) vertically | Learning rate scaling the gradient |
| Reflection about the $x$-axis | $-f(x)$ | Flip upside down | Negating a loss for maximisation |
| Horizontal scale by $\frac{1}{a}$ | $f(ax)$ | Compress horizontally by $a$ | Temperature in softmax: $e^{x/T}$ |

Softmax temperature — a transformation in action: The softmax function can be parameterised with temperature $T$: $\text{softmax}(x_i/T)$. At $T=1$: standard softmax. At $T \to 0$: output becomes a hard argmax (one class gets probability 1). At $T \to \infty$: output becomes uniform (maximum uncertainty). This is a horizontal scaling transformation with profound practical implications for language model sampling.
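A minimal sketch of temperature-scaled softmax, with invented logits:

```python
import numpy as np

def softmax_T(logits, T):
    """Softmax with temperature T, using the max trick for stability."""
    z = np.asarray(logits) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1]   # invented logits
for T in [0.01, 1.0, 100.0]:
    print(f"T={T:>6}: {softmax_T(logits, T).round(3)}")
# Low T approaches a hard argmax; high T approaches the uniform distribution
```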

0.2.5 — Composition of Functions

Composition means applying one function to the output of another:

$$(f \circ g)(x) = f(g(x))$$

Read it right to left: first apply $g$, then apply $f$ to the result. Example:

$$g(x) = 2x + 1, \quad f(u) = u^2 \implies (f \circ g)(x) = (2x+1)^2$$

Neural networks ARE function composition. Every layer applies a linear transformation followed by an activation:
$$h_1 = \sigma(W_1 \mathbf{x} + b_1), \quad h_2 = \sigma(W_2 h_1 + b_2), \quad \hat{y} = \text{softmax}(W_3 h_2 + b_3)$$
Expanding: $\hat{y} = \text{softmax}(W_3 \cdot \sigma(W_2 \cdot \sigma(W_1 \mathbf{x} + b_1) + b_2) + b_3)$. This is a deeply nested composition. Understanding composition is essential for understanding why the chain rule (the engine of backpropagation) works the way it does.
Neural Network as Nested Function Composition
flowchart LR
    X["x ∈ ℝ^d\n(Input)"] --> L1["Layer 1\nh₁ = σ(W₁x + b₁)"]
    L1 --> L2["Layer 2\nh₂ = σ(W₂h₁ + b₂)"]
    L2 --> OUT["Output\nŷ = softmax(W₃h₂ + b₃)"]
    OUT --> LOSS["Loss ℒ(ŷ, y)"]
    LOSS -->|"Backprop:\nchain rule"| L2
    LOSS -->|"chain rule"| L1
                            

During training, we need $\frac{\partial \mathcal{L}}{\partial W_1}$ — how much does the loss change with respect to the first layer's weights? The chain rule for compositions gives us exactly that, layer by layer. We'll derive this fully in Part 8 (Calculus). For now, appreciate that knowing how composition works is what makes that derivation possible.

import numpy as np

# Composition of functions — the neural network forward pass
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Simulate a 3-layer network: input -> hidden1 -> hidden2 -> output
np.random.seed(42)
x  = np.random.randn(4)        # 4 input features

W1 = np.random.randn(6, 4) * 0.1   # 6 hidden units, 4 inputs
b1 = np.zeros(6)

W2 = np.random.randn(6, 6) * 0.1   # 6 -> 6
b2 = np.zeros(6)

W3 = np.random.randn(3, 6) * 0.1   # 6 -> 3 output classes
b3 = np.zeros(3)

# Forward pass: composition f3(f2(f1(x)))
h1 = sigmoid(W1 @ x + b1)          # g1(x)
h2 = sigmoid(W2 @ h1 + b2)         # g2(g1(x))
y_hat = softmax(W3 @ h2 + b3)      # g3(g2(g1(x)))

print("Input:        ", x.round(3))
print("Hidden layer 1:", h1.round(3))
print("Hidden layer 2:", h2.round(3))
print("Output probs: ", y_hat.round(4))
print("Predicted class:", np.argmax(y_hat))

Interactive Function Explorer

The chart below visualises the four core function families on the same axes. Use it to develop geometric intuition — notice how their rates of change differ.
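If the interactive chart doesn't render where you're reading this, the same comparison can be reproduced with matplotlib; the positive domain keeps $\ln(x)$ defined:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.1, 3, 200)   # positive range so ln(x) is defined

plt.figure(figsize=(7, 4))
plt.plot(x, 2 * x + 1, linewidth=2, label='linear: 2x + 1')
plt.plot(x, x ** 2, linewidth=2, label='quadratic: x^2')
plt.plot(x, np.exp(x), linewidth=2, label='exponential: e^x')
plt.plot(x, np.log(x), linewidth=2, label='logarithmic: ln(x)')
plt.xlabel('x'); plt.ylabel('f(x)')
plt.title('Four Core Function Families')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```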

Practice Exercises

Exercise 1 Notation
Decode These Expressions

Translate each mathematical expression into plain English, then compute the value:

  1. $\displaystyle\sum_{i=1}^{4} i^2$  (hint: $1^2 + 2^2 + \ldots$)
  2. $\displaystyle\prod_{k=1}^{3} (2k - 1)$  (hint: $1 \times 3 \times 5$)
  3. $\mathbf{w}^\top \mathbf{x}$ where $\mathbf{w} = [2, -1, 3]^\top$ and $\mathbf{x} = [1, 4, 2]^\top$
Show Answers
  1. $1 + 4 + 9 + 16 = 30$
  2. $(2\cdot1-1)(2\cdot2-1)(2\cdot3-1) = 1 \cdot 3 \cdot 5 = 15$
  3. $2(1) + (-1)(4) + 3(2) = 2 - 4 + 6 = 4$
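All three answers can be double-checked in a few lines of NumPy:

```python
import numpy as np

# 1. Sum of squares, i = 1..4
print(sum(i ** 2 for i in range(1, 5)))                 # 30

# 2. Product of the first three odd numbers
print(int(np.prod([2 * k - 1 for k in range(1, 4)])))   # 15

# 3. Dot product w^T x
w = np.array([2, -1, 3])
x = np.array([1, 4, 2])
print(w @ x)                                            # 4
```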
Exercise 2 Functions
Identify the Function Family

For each ML formula below, identify which function family it belongs to (linear, polynomial, exponential, or composition), and explain why:

  1. Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  2. Linear regression prediction: $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$
  3. ReLU activation: $\text{ReLU}(x) = \max(0, x)$
  4. A two-layer neural net: $\hat{y} = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$
Show Answers
  1. Composition of exponential and linear: First compute $-z$ (linear), then $e^{-z}$ (exponential), then $1/(1 + \cdot)$ (rational/transformation)
  2. Linear: It's $\mathbf{w}^\top \mathbf{x} + b$ — a dot product (sum of products) plus a constant. Exactly $f(x) = mx + b$ in multiple dimensions.
  3. Piecewise linear: Two linear pieces joined at $x=0$. Not polynomial — it has a non-smooth "kink".
  4. Nested composition: Apply linear, then sigmoid, then linear again, then sigmoid — $(f_4 \circ f_3 \circ f_2 \circ f_1)(x)$.
Exercise 3 Coding
Implement from Scratch

Without using any ML library, implement a linear function and evaluate it on a dataset. Then plot it against the data points.

import numpy as np
import matplotlib.pyplot as plt

# TODO: complete the linear function
def linear(x, w, b):
    # Your implementation here
    pass

# Generate toy data: y = 3x - 2 + noise
np.random.seed(0)
x_data = np.linspace(-3, 3, 30)
y_data = 3 * x_data - 2 + np.random.randn(30) * 0.8

# Evaluate your function with w=3, b=-2
# y_pred = linear(x_data, w=3, b=-2)

# Plot: scatter data, line prediction
# plt.scatter(x_data, y_data, ...)
# plt.plot(x_data, y_pred, ...)
# plt.show()
Show Solution
import numpy as np
import matplotlib.pyplot as plt

def linear(x, w, b):
    return w * x + b   # f(x) = wx + b

np.random.seed(0)
x_data = np.linspace(-3, 3, 30)
y_data = 3 * x_data - 2 + np.random.randn(30) * 0.8

y_pred = linear(x_data, w=3, b=-2)

plt.figure(figsize=(7, 4))
plt.scatter(x_data, y_data, color='#3B9797', label='Data', alpha=0.7)
plt.plot(x_data, y_pred, color='#BF092F', linewidth=2, label='f(x) = 3x - 2')
plt.xlabel('x'); plt.ylabel('y')
plt.legend(); plt.grid(True, alpha=0.3)
plt.title('Linear Function vs Data')
plt.tight_layout(); plt.show()

Conclusion & Next Steps

In this first part of the bootcamp, we've laid the psychological and conceptual groundwork for everything that follows:

  • Abstraction & Modeling — every ML model is an intentional simplification of reality; the art is choosing what to keep
  • Notation Fluency — $\Sigma$, $\Pi$, $\nabla$, $\mathbb{E}[\cdot]$, Greek letters — these are vocabulary, not barriers
  • Proof vs Intuition — use intuition to understand, use proof to validate; both are necessary
  • Approximation — nearly everything in ML is approximate by design; know when exactness matters (log-space, numerical stability)
  • Functions — linear, polynomial, exponential, and compositions are the atoms of all ML models

Next in the Series

In Part 2: Set Theory & Foundations, we'll build the rigorous language for talking about collections of objects — sets, subsets, power sets, operations (union, intersection, complement), De Morgan's Laws, and their direct connections to feature spaces and probability event spaces in ML.