
Part 2: Set Theory & Foundations

April 25, 2026 · Wasil Zafar · 28 min read

Sets are the mathematical language of probability, feature spaces, and decision boundaries — every time you say "the model predicts class A", you're implicitly partitioning a set. This part gives you the vocabulary.

Table of Contents

  1. Why Set Theory for ML?
  2. Sets and Notation
  3. Set Operations
  4. Venn Diagrams
  5. Laws of Sets
  6. De Morgan's Laws
  7. Cartesian Products & Feature Spaces
  8. Partitions & ML Decision Regions
  9. Jaccard Similarity & Set-Based Distance
  10. Practice Exercises
  11. Conclusion & Next Steps

Why Set Theory for ML?

Before learning probability, you need sets. Before defining a feature space, you need sets. Before talking about "predicted class A" or "ground truth label B", you need sets. Set theory is the bedrock of the mathematical language used across all of ML.

Where you'll see sets in ML:
  • Probability: Events are sets. $P(A)$ is "the probability that the outcome falls in set $A$."
  • Feature spaces: Your training data $\mathcal{X} \subseteq \mathbb{R}^d$ is a subset of a Euclidean space.
  • Classification: A classifier partitions the feature space into disjoint class regions.
  • Evaluation: Precision and recall are set-theoretic ratios — TP, FP, FN are sets of examples.
  • Boolean logic: Decision tree splits use AND (∩) and NOT (complement) implicitly.
Example: Spam Filter
Email Classification as Set Partition

Let $\mathcal{U}$ = all emails received. A spam filter classifies each email into one of two disjoint sets: $S$ = {spam emails} and $S^c$ = {non-spam emails}. These two sets cover all of $\mathcal{U}$ (every email is classified) and never overlap (an email is either spam or not-spam, never both). This is a partition of $\mathcal{U}$ — one of the most fundamental set-theoretic concepts in ML.
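A minimal sketch of the same idea in Python (the is_spam rule below is a made-up stand-in for a real classifier; the two checks at the end are exactly what make $\{S, S^c\}$ a partition):

# Toy spam partition; is_spam is a hypothetical stand-in for a real model
emails = {"win a free prize", "meeting at 3pm", "free money now", "lunch tomorrow"}

def is_spam(e):
    return "free" in e or "prize" in e

S  = {e for e in emails if is_spam(e)}
Sc = {e for e in emails if not is_spam(e)}

print("Disjoint:", S & Sc == set())    # True: no email is both
print("Covers U:", S | Sc == emails)   # True: every email classified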

Sets and Notation

A set is an unordered collection of distinct objects (called elements or members). The key properties: order doesn't matter, and elements can't repeat.

$$A = \{1, 2, 3\} \qquad B = \{x \in \mathbb{R} : x > 0\} \qquad C = \{\text{cat}, \text{dog}, \text{fish}\}$$

The second form uses set-builder notation: "$B$ is the set of all real numbers $x$ such that $x > 0$" — this is how infinite sets are described compactly.

1 — Membership & Cardinality

Membership: $x \in A$ means "$x$ is an element of $A$". Its negation: $x \notin A$.

$$3 \in \{1, 2, 3\} \qquad 5 \notin \{1, 2, 3\}$$

Cardinality $|A|$ is the number of elements in set $A$:

$$|\{1, 2, 3\}| = 3 \qquad |\emptyset| = 0 \qquad |\mathbb{R}| = \infty$$

The empty set $\emptyset = \{\}$ contains no elements. It is a subset of every set (by vacuous truth).

2 — Subsets & Power Sets

$A$ is a subset of $B$, written $A \subseteq B$, if every element of $A$ is also in $B$:

$$A \subseteq B \iff \forall x,\; x \in A \Rightarrow x \in B$$

If additionally $B$ has elements not in $A$, it's a proper subset: $A \subsetneq B$.

The power set $\mathcal{P}(A)$ is the set of ALL subsets of $A$ (including $\emptyset$ and $A$ itself):

$$A = \{1, 2\} \implies \mathcal{P}(A) = \{\emptyset,\; \{1\},\; \{2\},\; \{1,2\}\}$$

The size of the power set grows exponentially:

$$|\mathcal{P}(A)| = 2^{|A|}$$
ML connection — Feature subsets: If you have $d$ candidate features, the number of possible feature subsets is $2^d$. With $d = 30$ features, that's over one billion possible subsets to evaluate. This exponential blowup is why exhaustive feature selection is computationally infeasible — and why greedy methods (forward/backward selection) or regularization (L1/LASSO) are used instead; a forward-selection sketch follows below.
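As a sketch of the greedy alternative: forward selection evaluates on the order of $d^2$ subsets instead of $2^d$. The toy_score function here is a hypothetical stand-in for cross-validated accuracy.

# Greedy forward selection: O(d^2) subset evaluations instead of 2^d
def forward_selection(features, score, k):
    """Greedily add the feature that most improves score(subset)."""
    selected = set()
    while len(selected) < k:
        best = max(features - selected, key=lambda f: score(selected | {f}))
        selected.add(best)
    return selected

# toy_score is illustrative only: it rewards three "useful" features
useful = {"age", "income", "score"}
def toy_score(subset):
    return len(subset & useful) - 0.1 * len(subset - useful)

chosen = forward_selection({"age", "income", "height", "weight", "score"}, toy_score, 3)
print(chosen)  # {'age', 'income', 'score'} (set order may vary)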

3 — Special Number Sets

| Symbol | Name | Examples | ML Use |
|---|---|---|---|
| $\mathbb{N}$ | Natural numbers | $0, 1, 2, 3, \ldots$ | Indices, counts |
| $\mathbb{Z}$ | Integers | $\ldots, -2, -1, 0, 1, 2, \ldots$ | Discrete labels |
| $\mathbb{Q}$ | Rationals | $\frac{1}{2}, \frac{3}{4}, \ldots$ | Rational learning rates |
| $\mathbb{R}$ | Real numbers | $\pi, \sqrt{2}, 3.14\ldots$ | Model parameters, features |
| $\mathbb{R}^d$ | $d$-dim real space | $(x_1, \ldots, x_d)$ | Feature vectors |
| $[0,1]$ | Unit interval | $0.3, 0.7$ | Probabilities, dropout rates |

Note: $\mathbb{N} \subsetneq \mathbb{Z} \subsetneq \mathbb{Q} \subsetneq \mathbb{R}$ — each is a proper subset of the next.

## Python sets — basic operations
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# Membership
print(3 in A)       # True
print(7 in A)       # False

# Cardinality
print(len(A))       # 4

# Subset check
print({1, 2} <= A)  # True  (subset)
print({1, 2} < A)   # True  (proper subset)

# Power set (manual implementation)
def power_set(s):
    s = list(s)
    result = [[]]
    for elem in s:
        result += [subset + [elem] for subset in result]
    return [set(x) for x in result]

ps = power_set({1, 2, 3})
print(f"Power set size: {len(ps)}")  # 8 = 2^3
for s in sorted(ps, key=len):
    print(s)

Universal Set

The universal set $\mathcal{U}$ is the "master set" containing ALL objects under discussion in a given context. Every other set in the problem is automatically a subset of $\mathcal{U}$.

Context-dependent: The universal set changes with the problem:
  • Classifying emails: $\mathcal{U}$ = all emails received
  • ML dataset: $\mathcal{U}$ = all possible input examples
  • Survey: $\mathcal{U}$ = all survey respondents
  • Number theory: $\mathcal{U}$ could be $\mathbb{N}$, $\mathbb{Z}$, or $\mathbb{R}$ depending on context

In probability, $\mathcal{U}$ directly corresponds to the sample space $\Omega$ — the set of all possible outcomes. Complement always depends on $\mathcal{U}$: if $\mathcal{U} = \{1,2,3,4,5\}$ and $A = \{1,2\}$, then $A^c = \{3,4,5\}$. Change $\mathcal{U}$ and you change the complement.

$$\mathcal{U} = \text{all elements in context} \qquad \emptyset \subseteq A \subseteq \mathcal{U} \text{ for every set } A$$
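A two-line check that the complement is relative to $\mathcal{U}$, using the values above:

# Same A, different universal sets, different complements
A  = {1, 2}
U1 = {1, 2, 3, 4, 5}
U2 = {1, 2, 3}
print(U1 - A)  # {3, 4, 5}
print(U2 - A)  # {3}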

Singleton Set & Empty (Null) Set

A singleton set is a set with exactly one element:

$$\{5\},\quad \{\text{cat}\},\quad \{x_0\} \qquad \text{each a singleton set, with cardinality } 1$$
Singleton vs the element itself: $\{5\} \neq 5$. The set $\{5\}$ is a container holding the number 5:
  • $5 \in \{5\}$ — the number 5 is a member of the singleton
  • $|\{5\}| = 1$ — the singleton has cardinality 1
  • $\{5\} \subseteq \{1, 2, 5\}$ — the singleton is a subset, but $5 \in \{1,2,5\}$ is membership

The empty set (also called the null set) $\emptyset = \{\}$ contains no elements at all: $|\emptyset| = 0$. Key properties:

$$\emptyset \subseteq A \text{ for every set } A \qquad A \cup \emptyset = A \qquad A \cap \emptyset = \emptyset \qquad A \setminus A = \emptyset$$

The empty set appears in ML when: a filter returns no results, two disjoint classes have no overlap, or a search returns no valid solutions. It is the unique set of cardinality 0.

Set Builder Notation

Set builder notation defines a set by a rule (a property members must satisfy) rather than listing elements individually. It is essential for defining infinite sets and feature spaces:

$$\{x \mid P(x)\} \quad \text{or} \quad \{x : P(x)\} \qquad \text{read: "the set of all } x \text{ such that } P(x) \text{ is true"}$$
| Set Builder Form | Roster / Description | ML Use |
|---|---|---|
| $\{x \in \mathbb{Z} : x > 0\}$ | $\{1, 2, 3, \ldots\}$ | Positive integer class labels |
| $\{x \in \mathbb{R} : 0 \leq x \leq 1\}$ | The unit interval $[0, 1]$ | Probabilities, dropout rates |
| $\{(x,y) \in \mathbb{R}^2 : x^2 + y^2 \leq 1\}$ | Unit disk | Constrained parameter space |
| $\{(\mathbf{x}_i, y_i) : i = 1,\ldots,N\}$ | Labeled dataset of $N$ examples | Training data $\mathcal{D}$ |
| $\{\mathbf{w} \in \mathbb{R}^d : \lVert\mathbf{w}\rVert_2 \leq \lambda\}$ | $\ell_2$-ball of radius $\lambda$ | L2 regularization constraint |
import numpy as np

# Set builder notation ↔ Python list/set comprehensions
# "All even numbers from 0 to 20"
even_numbers = {x for x in range(0, 21) if x % 2 == 0}
print("Even numbers:", sorted(even_numbers))
# Even numbers: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# "All real x in [-3, 3] such that x^2 < 4"
xs = np.linspace(-3, 3, 1000)
satisfying = xs[xs**2 < 4]
print(f"\nx in [-3,3] satisfying x^2 < 4: range ({satisfying.min():.3f}, {satisfying.max():.3f})")
# Mathematically: (-2, 2)

# Feature filtering — select features with |correlation| > 0.5
correlations = {'age': 0.72, 'income': 0.61, 'height': 0.12, 'weight': 0.08, 'score': 0.89}
high_corr = {f for f, c in correlations.items() if abs(c) > 0.5}
print(f"\nHighly correlated features: {high_corr}")
# {'age', 'income', 'score'}

Equal & Equivalent Sets

These are two distinct relationships beginners frequently confuse:

Equal sets contain exactly the same elements (order does not matter in sets):

$$A = B \iff (A \subseteq B \text{ and } B \subseteq A) \iff \forall x,\; (x \in A \leftrightarrow x \in B)$$
$$\{1, 2, 3\} = \{3, 1, 2\} = \{2, 1, 3\} \quad \text{(same elements, order irrelevant)}$$

Equivalent sets (also called equinumerous) have the same cardinality but may contain completely different elements:

$$A \sim B \iff |A| = |B|$$
$$\{1, 2, 3\} \sim \{\text{cat}, \text{dog}, \text{fish}\} \quad \text{(both } |\cdot| = 3 \text{, but not equal)}$$
Direction matters: Equal sets are always equivalent, but equivalent sets are not necessarily equal:
  • $A = B \Rightarrow A \sim B$ ✓ (equal implies equivalent)
  • $A \sim B \not\Rightarrow A = B$ (equivalent does NOT imply equal)
Equal is a stronger condition — it requires same elements, not just same count.
# Equal sets: same elements regardless of order
A = {1, 2, 3}
B = {3, 2, 1}   # same elements
C = {'cat', 'dog', 'fish'}  # same size, different elements

print("A == B:", A == B)              # True — equal (same elements)
print("A == C:", A == C)              # False — not equal
print("|A| == |C|:", len(A) == len(C))  # True — equivalent (same cardinality)

# ML context: predicted classes vs actual class labels
predicted = {0, 1, 2}                 # numeric class indices
actual_names = {'cat', 'dog', 'bird'} # string class names
print("\nSame number of classes:", len(predicted) == len(actual_names))  # True
print("Same class sets:", predicted == actual_names)  # False — equivalent, not equal

Proper Subsets, Improper Subsets & Supersets

The general concept of "subset" has important subtypes that appear frequently in formal ML writing:

| Type | Symbol | Meaning | Example |
|---|---|---|---|
| Subset | $A \subseteq B$ | Every element of $A$ is in $B$ (includes $A = B$) | $\{1,2\} \subseteq \{1,2,3\}$ |
| Proper subset | $A \subsetneq B$ | $A \subseteq B$ AND $A \neq B$ — $B$ has at least one extra element | $\{1,2\} \subsetneq \{1,2,3\}$ |
| Improper subset | $A \subseteq A$ | A set is always a subset of itself — the only non-proper subset | $\{1,2,3\} \subseteq \{1,2,3\}$ |
| Superset | $B \supseteq A$ | $B$ contains all elements of $A$ (and possibly more) | $\{1,2,3\} \supseteq \{1,2\}$ |
| Proper superset | $B \supsetneq A$ | $B \supseteq A$ AND $B \neq A$ | $\{1,2,3\} \supsetneq \{1,2\}$ |
Every set has exactly one improper subset — itself. The empty set $\emptyset$ is a proper subset of every non-empty set. For $A = \{1,2,3\}$: the improper subset is $\{1,2,3\}$ itself; all other $2^3 - 1 = 7$ subsets are proper subsets.
# Python set comparison operators
A = frozenset({1, 2})
B = frozenset({1, 2, 3})

print("A ⊆ B  (subset):",         A <= B)   # True
print("A ⊂ B  (proper subset):",  A < B)    # True — A ≠ B
print("B ⊆ A  (subset):",         B <= A)   # False
print("B ⊇ A  (superset):",       B >= A)   # True
print("B ⊃ A  (proper superset):", B > A)   # True
print("A ⊆ A  (improper):",       A <= A)   # True — every set is its own subset
print("A ⊂ A  (proper):",         A < A)    # False — no proper subset of itself

Finite & Infinite Sets

A finite set has finitely many elements ($|A| = n$ for some $n \in \mathbb{N}$). An infinite set continues without bound.

| Type | Description | Example | ML context |
|---|---|---|---|
| Finite | $\lvert A \rvert = n$, $n \in \mathbb{N}$ | $\{0, 1\}$, $\{\text{cat, dog, bird}\}$ | Class label set, training dataset |
| Countably infinite | Can be listed as a sequence $a_1, a_2, \ldots$ | $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$ | Token vocabulary, layer indices |
| Uncountably infinite | Cannot be listed — strictly larger than $\mathbb{N}$ | $\mathbb{R}$, $(0,1)$, $\mathbb{R}^d$ | Feature values, model weights |
Why this matters in ML:
  • Classification: output space is finite ($k$ classes). Regression: output space is $\mathbb{R}$ — uncountably infinite
  • Continuous RVs: because $\mathbb{R}$ is uncountable, $P(X = x_0) = 0$ for any specific value — hence probability density rather than probability mass (see the simulation sketch after this list)
  • Neural weights: live in $\mathbb{R}^n$ — the search space is uncountably infinite, making exhaustive search impossible
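A quick simulation of the density-vs-mass point: among a million uniform draws from $(0, 1)$, none hits a fixed $x_0$ exactly, while a small interval around $x_0$ collects its share of probability. (Sketch only; exact float equality could in principle occur, just with vanishing probability.)

import numpy as np

# P(X = x0) = 0 for continuous X: exact hits essentially never occur
rng = np.random.default_rng(0)
samples = rng.uniform(0, 1, size=1_000_000)

x0 = 0.5
print("Exact hits on x0:", int(np.sum(samples == x0)))           # 0
print("Hits near x0:", int(np.sum(np.abs(samples - x0) < 1e-4)))
# ≈ 200: an interval of width 2e-4 catches about 2e-4 of 1e6 draws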

Set Operations

Set operations are the algebraic tools for combining and manipulating sets. They have direct parallels in Boolean logic and SQL WHERE clauses.

4 — Union & Intersection

Union $A \cup B$: all elements in $A$ OR in $B$ (or both):

$$A \cup B = \{x : x \in A \text{ or } x \in B\}$$

Intersection $A \cap B$: all elements in BOTH $A$ AND $B$:

$$A \cap B = \{x : x \in A \text{ and } x \in B\}$$

Two sets are disjoint if $A \cap B = \emptyset$.

Set Operations as Logic Gates
flowchart LR
    A["Set A\n{1,2,3,4}"] --> U["A ∪ B\n{1,2,3,4,5,6}\nOR — everything"]
    B["Set B\n{3,4,5,6}"] --> U
    A --> I["A ∩ B\n{3,4}\nAND — overlap only"]
    B --> I
    A --> D["A \\ B\n{1,2}\nIN A but NOT B"]
    B --> D
    A --> S["A △ B\n{1,2,5,6}\nXOR — one but not both"]
    B --> S

    style A fill:#e8f4f4,stroke:#3B9797,color:#132440
    style B fill:#f0f4f8,stroke:#16476A,color:#132440
    style U fill:#d4edda,stroke:#3B9797,color:#132440
    style I fill:#fff3cd,stroke:#F5A623,color:#132440
    style D fill:#fde8ec,stroke:#BF092F,color:#132440
    style S fill:#e8e0f0,stroke:#6f42c1,color:#132440

5 — Complement & Set Difference

The complement $A^c$ (also written $\bar{A}$ or $A'$) is everything in the universal set $\mathcal{U}$ that is NOT in $A$:

$$A^c = \mathcal{U} \setminus A = \{x \in \mathcal{U} : x \notin A\}$$

Set difference $A \setminus B$ (also $A - B$) removes elements of $B$ from $A$:

$$A \setminus B = \{x : x \in A \text{ and } x \notin B\}$$

Symmetric difference $A \triangle B$: elements in exactly one of $A$ or $B$:

$$A \triangle B = (A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B)$$
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print("Union:", A | B)              # {1, 2, 3, 4, 5, 6}
print("Intersection:", A & B)       # {3, 4}
print("Difference A-B:", A - B)     # {1, 2}
print("Difference B-A:", B - A)     # {5, 6}
print("Symmetric diff:", A ^ B)     # {1, 2, 5, 6}

# Complement requires knowing the universal set
U = set(range(1, 11))  # Universal set: {1,...,10}
A_complement = U - A
print("Complement of A:", A_complement)  # {5, 6, 7, 8, 9, 10}

6 — The Inclusion-Exclusion Principle

When counting elements in a union, naively adding $|A|$ and $|B|$ double-counts the intersection:

$$|A \cup B| = |A| + |B| - |A \cap B|$$

For three sets:

$$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$$
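A quick numeric check of both identities on small sets:

# Verify inclusion-exclusion for two and three sets
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 7}

assert len(A | B) == len(A) + len(B) - len(A & B)
assert len(A | B | C) == (len(A) + len(B) + len(C)
                          - len(A & B) - len(A & C) - len(B & C)
                          + len(A & B & C))
print("|A ∪ B| =", len(A | B))           # 6
print("|A ∪ B ∪ C| =", len(A | B | C))   # 7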
Real World: Evaluation Metrics
Precision & Recall as Set Operations

Let $P$ = {examples predicted positive} and $T$ = {examples actually positive}. Then:

  • True Positives: $|P \cap T|$ — correctly predicted positive
  • False Positives: $|P \setminus T|$ — predicted positive but actually negative
  • False Negatives: $|T \setminus P|$ — actually positive but missed

$$\text{Precision} = \frac{|P \cap T|}{|P|} \qquad \text{Recall} = \frac{|P \cap T|}{|T|}$$

F1 is the harmonic mean of precision and recall. The standard classification metrics (accuracy, precision, recall, F1) all reduce to set arithmetic over predictions and ground-truth labels.
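These formulas translate directly into set code; the sets below are the ones used in Exercise 2 later in this part:

# Precision, recall, F1 as pure set arithmetic
T = {1, 3, 5, 7, 9}   # actually positive (ground truth)
P = {1, 2, 3, 4, 5}   # predicted positive

tp = len(P & T)
precision = tp / len(P)
recall    = tp / len(T)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={len(P - T)}, FN={len(T - P)}")   # TP=3, FP=2, FN=2
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# precision=0.60, recall=0.60, F1=0.60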

Venn Diagrams

A Venn diagram is a visual tool for representing sets as overlapping circles inside a rectangle (the universal set). They make set relationships immediately intuitive — especially for beginners.

Standard layout: the rectangle = $\mathcal{U}$, each circle = one set, overlapping regions = intersections.

Venn Diagram — Two Sets A and B
[Figure: two overlapping circles $A$ and $B$ inside the rectangle $\mathcal{U}$, with labeled regions $A \setminus B$, $A \cap B$, $B \setminus A$, and $(A \cup B)^c$.]
Reading the two-set Venn diagram:
  • Left only = $A \setminus B$ — elements in A but not in B
  • Overlap = $A \cap B$ — elements in both A and B
  • Right only = $B \setminus A$ — elements in B but not in A
  • Both circles = $A \cup B$ — elements in A or B or both
  • Outside both = $(A \cup B)^c$ — elements in neither

Venn diagrams are powerful for verifying set identities visually. To check De Morgan's Law $(A \cup B)^c = A^c \cap B^c$: shade $A \cup B$ (both circles), then take its complement (everything outside both circles). This exactly equals the region outside A AND outside B — confirming the law.

Venn Diagrams: Spam Classification
Three-Set Venn Diagram: Email Features

Let $S$ = spam emails, $W_1$ = emails with "free", $W_2$ = emails with "prize". Three overlapping circles create $2^3 - 1 = 7$ interior regions (plus the exterior, for emails in none of the three sets):

  1. Only $W_1$ (contains "free", not spam, no "prize")
  2. Only $W_2$ (contains "prize", not spam, no "free")
  3. Only $S$ (spam with neither word)
  4. $W_1 \cap W_2$ only (both words, not spam)
  5. $S \cap W_1$ only (spam with "free", no "prize")
  6. $S \cap W_2$ only (spam with "prize", no "free")
  7. $S \cap W_1 \cap W_2$ (spam with both words — core region)

Counting emails in each region and dividing by totals gives the conditional probabilities Naive Bayes needs. Venn diagrams make these conditional counts concrete.
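A toy sketch of that counting step (the email ID sets are made up; a real filter would count over a labeled corpus):

# Hypothetical email IDs for each set
S  = {1, 2, 3, 4, 5}     # spam
W1 = {1, 2, 3, 6, 7}     # contains "free"
W2 = {2, 3, 4, 8}        # contains "prize"

# Conditional counts Naive Bayes needs, e.g. P("free" | spam) ≈ |S ∩ W1| / |S|
print(f"P(free  | spam) ≈ {len(S & W1) / len(S):.2f}")   # 0.60
print(f"P(prize | spam) ≈ {len(S & W2) / len(S):.2f}")   # 0.60
print("Core region S ∩ W1 ∩ W2:", S & W1 & W2)           # {2, 3}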

Laws of Sets

Set theory has a complete algebra with named laws — analogous to arithmetic laws for numbers. These laws let you simplify complex expressions and are the mathematical foundation of Boolean logic (the basis of all digital computing and ML decision conditions).

| Law | Union form ($\cup$) | Intersection form ($\cap$) |
|---|---|---|
| Commutative | $A \cup B = B \cup A$ | $A \cap B = B \cap A$ |
| Associative | $(A \cup B) \cup C = A \cup (B \cup C)$ | $(A \cap B) \cap C = A \cap (B \cap C)$ |
| Distributive | $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$ | $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ |
| Identity | $A \cup \emptyset = A$ | $A \cap \mathcal{U} = A$ |
| Domination | $A \cup \mathcal{U} = \mathcal{U}$ | $A \cap \emptyset = \emptyset$ |
| Idempotent | $A \cup A = A$ | $A \cap A = A$ |
| Complement | $A \cup A^c = \mathcal{U}$ | $A \cap A^c = \emptyset$ |
| Double complement | $(A^c)^c = A$ | |
| Absorption | $A \cup (A \cap B) = A$ | $A \cap (A \cup B) = A$ |
| De Morgan's I | $(A \cup B)^c = A^c \cap B^c$ | |
| De Morgan's II | $(A \cap B)^c = A^c \cup B^c$ | |
Boolean logic connection: Every set law maps directly to logic:
  • $\cup$ ↔ OR, $\cap$ ↔ AND, $A^c$ ↔ NOT
  • Commutative: A or B = B or A
  • Distributive: A and (B or C) = (A and B) or (A and C) — compiler optimization
  • De Morgan's: not (A and B) = not A or not B — crucial for simplifying filter conditions
# Verify Laws of Sets programmatically
U = frozenset(range(1, 11))  # Universal set: {1,...,10}
A = frozenset({1, 2, 3, 4, 5})
B = frozenset({3, 4, 5, 6, 7})
C = frozenset({5, 6, 7, 8, 9})
Ac = U - A
Bc = U - B

# Commutative
assert A | B == B | A and A & B == B & A
print("Commutative: A∪B = B∪A and A∩B = B∩A  ✓")

# Associative
assert (A | B) | C == A | (B | C)
assert (A & B) & C == A & (B & C)
print("Associative: (A∪B)∪C = A∪(B∪C)  ✓")

# Distributive
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)
print("Distributive: A∪(B∩C) = (A∪B)∩(A∪C)  ✓")

# Identity and Domination
assert A | frozenset() == A and A & U == A
assert A | U == U and A & frozenset() == frozenset()
print("Identity and Domination: ✓")

# Complement
assert A | Ac == U and A & Ac == frozenset()
print("Complement: A∪Aᶜ = U and A∩Aᶜ = ∅  ✓")

# De Morgan's
assert U - (A | B) == Ac & Bc  # (A∪B)^c = A^c ∩ B^c
assert U - (A & B) == Ac | Bc  # (A∩B)^c = A^c ∪ B^c
print("De Morgan's: both laws verified  ✓")

# Absorption
assert A | (A & B) == A and A & (A | B) == A
print("Absorption: A∪(A∩B) = A  ✓")

De Morgan's Laws

De Morgan's Laws are among the most useful identities in set theory — and they appear directly in Boolean logic, probability, and decision tree conditions.

$$\boxed{(A \cup B)^c = A^c \cap B^c} \qquad \boxed{(A \cap B)^c = A^c \cup B^c}$$

Reading Law 1: "NOT (A or B)" is the same as "NOT A AND NOT B." If an email is neither spam nor phishing, that's the same as "it's not spam" AND "it's not phishing."

Reading Law 2: "NOT (A and B)" is the same as "NOT A OR NOT B." If a transaction is not (large AND international), then it's either not large OR not international.

Proof sketch of Law 1: To show two sets are equal, show each is a subset of the other. Let $x \in (A \cup B)^c$. Then $x \notin A \cup B$, so $x \notin A$ AND $x \notin B$, so $x \in A^c$ AND $x \in B^c$, so $x \in A^c \cap B^c$. The reverse inclusion follows symmetrically. $\square$
U = set(range(1, 11))   # Universal set: 1..10
A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}

# De Morgan's Law 1: (A ∪ B)^c = A^c ∩ B^c
left_1  = U - (A | B)
right_1 = (U - A) & (U - B)
print("Law 1 holds:", left_1 == right_1)  # True
print("(A ∪ B)^c =", left_1)             # {8, 9, 10}

# De Morgan's Law 2: (A ∩ B)^c = A^c ∪ B^c
left_2  = U - (A & B)
right_2 = (U - A) | (U - B)
print("Law 2 holds:", left_2 == right_2)  # True
print("(A ∩ B)^c =", left_2)             # {1, 2, 6, 7, 8, 9, 10}

Cartesian Products & Feature Spaces

The Cartesian product $A \times B$ creates a new set of all ordered pairs $(a, b)$ where $a \in A$ and $b \in B$:

$$A \times B = \{(a, b) : a \in A,\; b \in B\} \qquad |A \times B| = |A| \cdot |B|$$
Feature space is a Cartesian product. If feature 1 can be any real number and feature 2 can also be any real number, the feature space is $\mathbb{R} \times \mathbb{R} = \mathbb{R}^2$. For $d$ features: $$\mathcal{X} = \mathbb{R}^d = \underbrace{\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}}_{d \text{ copies}}$$ A data point $\mathbf{x} = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$ is a $d$-tuple — an ordered element of this Cartesian product.

Mixed feature spaces are also Cartesian products. A dataset with 3 numeric features and 2 binary categorical features lives in $\mathbb{R}^3 \times \{0,1\}^2$.

import itertools

# Cartesian product of two sets
A = {1, 2, 3}
B = {'a', 'b'}
AxB = list(itertools.product(A, B))
print("A × B:", AxB)
print("|A × B| =", len(AxB))          # 6 = 3 × 2

# Feature space example: 2 continuous + 1 binary feature
# A data point from R × R × {0,1}
data_point = (3.14, -0.5, 1)           # (income, credit_score, employed)
print("Feature vector:", data_point)

# Cartesian product for hyperparameter grid search
learning_rates = [0.001, 0.01, 0.1]
hidden_sizes   = [32, 64, 128]
grid = list(itertools.product(learning_rates, hidden_sizes))
print(f"\nHyperparameter grid ({len(grid)} combinations):")
for lr, hs in grid:
    print(f"  lr={lr}, hidden_size={hs}")

Partitions & ML Decision Regions

A partition of set $S$ is a collection of non-empty subsets $\{A_1, A_2, \ldots, A_k\}$ such that:

$$A_i \cap A_j = \emptyset \text{ for all } i \neq j \quad \text{(disjoint)} \qquad \bigcup_{i=1}^k A_i = S \quad \text{(covers } S\text{)}$$
Multi-class Classifier as Partition of Feature Space
flowchart TD
    FS["Feature Space ℝ^d\n(All possible inputs)"]
    FS --> R1["Region R₁\n(Class A predictions)"]
    FS --> R2["Region R₂\n(Class B predictions)"]
    FS --> R3["Region R₃\n(Class C predictions)"]
    
    R1 --> P1["R₁ ∩ R₂ = ∅\nDisjoint"]
    R2 --> P1
    R2 --> P2["R₁ ∪ R₂ ∪ R₃ = ℝ^d\nEvery point classified"]
    R1 --> P2
    R3 --> P2

    style FS fill:#132440,stroke:#132440,color:#fff
    style R1 fill:#e8f4f4,stroke:#3B9797,color:#132440
    style R2 fill:#f0f4f8,stroke:#16476A,color:#132440
    style R3 fill:#fff5f5,stroke:#BF092F,color:#132440
    style P1 fill:#d4edda,stroke:#3B9797,color:#132440
    style P2 fill:#d4edda,stroke:#3B9797,color:#132440

A $k$-class classifier implicitly partitions $\mathbb{R}^d$ into $k$ decision regions. The decision boundaries are the borders between these regions. For a linear classifier, each region is a convex polytope; for a neural network, the regions can be arbitrarily shaped.
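A minimal sketch of the partition property for a toy nearest-centroid classifier (the centroids are made up): every grid point receives exactly one label, so the label regions are disjoint and together cover the grid.

import numpy as np

# Toy 3-class nearest-centroid classifier: argmin over distances partitions the grid
centroids = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # hypothetical class centers

xs, ys = np.meshgrid(np.linspace(-2, 5, 100), np.linspace(-2, 5, 100))
points = np.stack([xs.ravel(), ys.ravel()], axis=1)          # 10,000 grid points

dists  = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)   # exactly one label per point

for k in range(3):
    print(f"Region R{k+1}: {int(np.sum(labels == k))} points")
print("Total:", labels.size)    # 10000: every point classified exactly once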

K-fold cross-validation is a partition: The training set $\mathcal{D}$ is split into $k$ equal-sized, non-overlapping folds $F_1, F_2, \ldots, F_k$. Each pair of folds is disjoint ($F_i \cap F_j = \emptyset$) and together they cover all data ($\bigcup F_i = \mathcal{D}$). In each round, one fold is the validation set and the rest form the training set.
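And a short check that $k$-fold splits really form a partition of the index set:

# k-fold split as a partition of dataset indices
k = 5
D = set(range(100))                               # dataset indices 0..99
folds = [set(sorted(D)[i::k]) for i in range(k)]  # round-robin assignment

assert all(folds[i].isdisjoint(folds[j])          # pairwise disjoint
           for i in range(k) for j in range(i + 1, k))
assert set().union(*folds) == D                   # union covers all of D
print("Fold sizes:", [len(f) for f in folds])     # [20, 20, 20, 20, 20]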

Jaccard Similarity & Set-Based Distance

The Jaccard similarity measures how much two sets overlap, on a scale from 0 (no overlap) to 1 (identical):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Its complement, the Jaccard distance: $d_J(A, B) = 1 - J(A, B)$.

Application: Recommendation Systems
Jaccard in Practice
  • Document similarity: Represent documents as sets of words. Jaccard measures vocabulary overlap.
  • Recommendation systems: Represent users as sets of purchased items. Similar users have high Jaccard similarity.
  • MinHash: Approximates Jaccard similarity efficiently for large sets — used by Google, Amazon, Netflix at scale.
  • IoU in object detection: Intersection over Union is exactly Jaccard — measures overlap between predicted bounding box and ground truth box. Standard metric for YOLO, R-CNN etc.
def jaccard_similarity(A, B):
    """Jaccard similarity between two sets."""
    intersection = len(A & B)
    union = len(A | B)
    if union == 0:
        return 1.0  # both empty sets are identical
    return intersection / union

def jaccard_distance(A, B):
    return 1 - jaccard_similarity(A, B)

# Document similarity example
doc1_words = {"machine", "learning", "neural", "network", "deep"}
doc2_words = {"machine", "learning", "algorithm", "model", "deep"}
doc3_words = {"cooking", "recipe", "pasta", "sauce", "italian"}

print(f"doc1 vs doc2: {jaccard_similarity(doc1_words, doc2_words):.3f}")  # ~0.43
print(f"doc1 vs doc3: {jaccard_similarity(doc1_words, doc3_words):.3f}")  # 0.0

# Object detection IoU
def iou(box_pred, box_true):
    """
    box = (x1, y1, x2, y2) — top-left and bottom-right corners
    Returns Intersection over Union (Jaccard for 2D boxes)
    """
    xi1 = max(box_pred[0], box_true[0])
    yi1 = max(box_pred[1], box_true[1])
    xi2 = min(box_pred[2], box_true[2])
    yi2 = min(box_pred[3], box_true[3])

    inter_area = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    pred_area  = (box_pred[2]-box_pred[0]) * (box_pred[3]-box_pred[1])
    true_area  = (box_true[2]-box_true[0]) * (box_true[3]-box_true[1])
    union_area = pred_area + true_area - inter_area

    return inter_area / union_area if union_area > 0 else 0.0

predicted = (50, 50, 150, 150)
ground_truth = (75, 75, 175, 175)
print(f"\nIoU: {iou(predicted, ground_truth):.3f}")  # 0.143

Practice Exercises

Exercise 1: Power Sets
Enumerate the Power Set

For $A = \{a, b, c, d\}$: (1) What is $|\mathcal{P}(A)|$? (2) List all subsets of cardinality exactly 2. (3) How many subsets contain element $a$?

Show Answer

(1) $|\mathcal{P}(A)| = 2^4 = 16$. (2) $\binom{4}{2} = 6$ subsets of size 2: $\{a,b\},\{a,c\},\{a,d\},\{b,c\},\{b,d\},\{c,d\}$. (3) Half of all subsets contain any fixed element: $2^4 / 2 = 8$ subsets contain $a$ (for each subset, you either include $a$ or not, and the remaining 3 elements can vary freely: $2^3 = 8$).

Exercise 2: Metrics from Sets
Compute Precision, Recall & F1

Ground truth positives: $T = \{1, 3, 5, 7, 9\}$. Predicted positives: $P = \{1, 2, 3, 4, 5\}$. Compute precision, recall, F1-score, and Jaccard similarity between $P$ and $T$.

Show Answer

$|P \cap T| = |\{1,3,5\}| = 3$. Precision $= 3/5 = 0.6$. Recall $= 3/5 = 0.6$. $F1 = 2 \cdot 0.6 \cdot 0.6 / (0.6+0.6) = 0.6$. $|P \cup T| = |\{1,2,3,4,5,7,9\}| = 7$. Jaccard $= 3/7 \approx 0.43$. Note: Jaccard and F1 differ but are related by $\text{F1} = \frac{2J}{1+J}$, and indeed $\frac{2 \cdot 3/7}{1 + 3/7} = 0.6$.

Exercise 3: De Morgan's
Apply De Morgan's in NLP

A content filter blocks a post if it contains "hate" AND "violence". Using De Morgan's Law, rewrite the condition for when a post passes the filter (is NOT blocked).

Show Answer

Blocked $= H \cap V$ where $H$ = {posts with "hate"}, $V$ = {posts with "violence"}. Passes filter $= (H \cap V)^c = H^c \cup V^c$ (De Morgan's Law 2). A post passes if it does NOT contain "hate" OR does NOT contain "violence" — it only needs to be missing at least one of the two terms.

Conclusion & Next Steps

Set theory gives us a precise language for talking about collections, membership, and relationships. The key takeaways:

  • Sets are unordered, distinct collections — the atoms of probability and feature spaces
  • Power sets grow as $2^{|A|}$ — explaining why exhaustive search over feature subsets is computationally infeasible
  • Operations (union ∪, intersection ∩, complement, difference) have direct parallels in Boolean logic, SQL, and ML evaluation
  • De Morgan's Laws let you convert OR conditions to AND conditions and vice versa
  • Cartesian products define feature spaces — your data lives in $\mathbb{R}^d$
  • Partitions are what classifiers create — disjoint regions covering the entire input space
  • Jaccard / IoU quantifies set similarity — fundamental to recommendation and detection tasks

Next in the Series

In Part 3: Combinatorics, we'll build on sets to count: permutations, combinations, the binomial theorem, and the exponential growth of search spaces — giving you the tools to reason about hyperparameter grids, model configurations, and probabilistic counting arguments.