Complete Database Mastery Part 7: Data Modeling & Normalization

Introduction: Why Data Modeling Matters

Data modeling is the foundation of database design. A well-designed schema ensures data integrity, enables efficient queries, and scales gracefully as your application grows.

                        
                        Series Context: This is Part 7 of 15 in the Complete Database Mastery series. We're elevating from SQL skills to architect-level database design.
                    

Database Mastery

Your 15-step learning path • Currently on Step 7

1

7

Data Modeling & Normalization

ERDs, normal forms, schema design

You Are Here

8

MongoDB & Document Databases

NoSQL, aggregation, sharding

9

Redis & Caching Strategies

Data structures, caching patterns, pub/sub

10

Database Administration & Migrations

Backup, versioning, maintenance

11

Scaling & Distributed Systems

Replication, sharding, CAP theorem

12

Cloud Databases & Managed Services

AWS, Azure, GCP database offerings

13

Database Security & Governance

Encryption, access control, compliance

14

Data Warehousing & Analytics

OLAP, star schemas, columnar DBs

15

Capstone Projects

Portfolio-ready database implementations

Think of data modeling like architecture for a building—you need blueprints before construction. A poorly designed database leads to performance problems, data inconsistencies, and constant refactoring.

Signs of Poor Data Modeling

Same data stored in multiple places (update one, forget another)
Queries require complex workarounds
Adding features requires major schema changes
NULL values everywhere, unclear what they mean
Performance degrades as data grows

                        
                        Key Principle: Model your data based on how it will be queried, not just how it looks in the real world. The best schema depends on your access patterns.
                    

Modeling Approaches

Conceptual Modeling (What)

High-level view of business entities and relationships. No technical details—focuses on what data exists.

Three-tier data modeling progression from conceptual to logical to physical model — Data modeling progression: conceptual (business view), logical (abstract structure), physical (database-specific)

-- Conceptual: Business perspective
-- "Customers place Orders. Orders contain Products."

-- Entities:
-- - Customer (name, email)
-- - Order (date, total)
-- - Product (name, price)

-- Relationships:
-- - Customer places Order (one-to-many)
-- - Order contains Product (many-to-many)

Logical Modeling (How - Abstract)

Detailed design with tables, columns, data types, and keys—but database-agnostic. Defines the structure without implementation specifics.

-- Logical: Structure without implementation details
-- Customer(customer_id PK, name, email UNIQUE, created_at)
-- Order(order_id PK, customer_id FK, order_date, status, total)
-- Product(product_id PK, name, price, category)
-- OrderItem(order_id FK, product_id FK, quantity, unit_price)
--   PK: (order_id, product_id)

Physical Modeling (How - Concrete)

Database-specific implementation: exact data types, indexes, partitioning, storage parameters.

-- Physical: PostgreSQL-specific implementation
CREATE TABLE customers (
    customer_id BIGSERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_customers_email ON customers(email);

CREATE TABLE orders (
    order_id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(customer_id),
    order_date DATE NOT NULL DEFAULT CURRENT_DATE,
    status VARCHAR(20) DEFAULT 'pending' CHECK (status IN ('pending', 'shipped', 'delivered', 'cancelled')),
    total DECIMAL(12,2) NOT NULL
) PARTITION BY RANGE (order_date);

CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_date ON orders(order_date);

Entity Relationship Diagrams (ERDs)

ERDs visualize database structure. They show entities (tables), attributes (columns), and relationships (foreign keys).

Entity relationship diagram showing one-to-one, one-to-many, and many-to-many cardinality notations — ERD cardinality notations: one-to-one (1:1), one-to-many (1:N), and many-to-many (M:N) relationships

Relationship Types

Type	Description	Implementation
One-to-One (1:1)	User ↔ Profile	FK with UNIQUE, or same table
One-to-Many (1:N)	Customer → Orders	FK on the "many" side
Many-to-Many (M:N)	Orders ↔ Products	Junction/bridge table

-- One-to-One: User Profile (separate for privacy/performance)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL
);

CREATE TABLE user_profiles (
    user_id INT PRIMARY KEY REFERENCES users(id),
    bio TEXT,
    avatar_url VARCHAR(500)
);

-- One-to-Many: Customer Orders
CREATE TABLE customers (id SERIAL PRIMARY KEY, name VARCHAR(100));
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(id),  -- FK here
    total DECIMAL(10,2)
);

-- Many-to-Many: Order Products (junction table)
CREATE TABLE products (id SERIAL PRIMARY KEY, name VARCHAR(100));
CREATE TABLE order_items (
    order_id INT REFERENCES orders(id),
    product_id INT REFERENCES products(id),
    quantity INT NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, product_id)  -- Composite PK
);

Normal Forms

Normalization eliminates redundancy and ensures data integrity. Each normal form builds on the previous one.

Step-by-step normalization process from unnormalized data through 1NF, 2NF, and 3NF — Normalization progression: eliminating repeating groups (1NF), partial dependencies (2NF), and transitive dependencies (3NF)

First Normal Form (1NF): Atomic Values

Each column contains only atomic (indivisible) values. No repeating groups or arrays.

-- ❌ Violates 1NF: Multiple values in one column
CREATE TABLE orders_bad (
    order_id INT,
    customer VARCHAR(100),
    products VARCHAR(500)  -- "iPhone, iPad, MacBook"
);

-- ✅ 1NF: Atomic values, one product per row
CREATE TABLE order_items_1nf (
    order_id INT,
    customer VARCHAR(100),
    product VARCHAR(100)
);
-- But this has redundancy... customer repeated for every item!

Second Normal Form (2NF): No Partial Dependencies

1NF + every non-key column depends on the entire primary key (not just part of it).

-- ❌ Violates 2NF: customer depends only on order_id, not product
-- PK is (order_id, product) but customer doesn't need product
CREATE TABLE orders_bad (
    order_id INT,
    product VARCHAR(100),
    customer VARCHAR(100),  -- Partial dependency!
    quantity INT,
    PRIMARY KEY (order_id, product)
);

-- ✅ 2NF: Separate tables
CREATE TABLE orders_2nf (
    order_id INT PRIMARY KEY,
    customer VARCHAR(100)
);

CREATE TABLE order_items_2nf (
    order_id INT,
    product VARCHAR(100),
    quantity INT,
    PRIMARY KEY (order_id, product),
    FOREIGN KEY (order_id) REFERENCES orders_2nf(order_id)
);

Third Normal Form (3NF): No Transitive Dependencies

2NF + no non-key column depends on another non-key column.

-- ❌ Violates 3NF: customer_city depends on customer_id, not order_id
CREATE TABLE orders_bad (
    order_id INT PRIMARY KEY,
    customer_id INT,
    customer_city VARCHAR(100)  -- Transitive: order→customer→city
);

-- ✅ 3NF: City belongs with customer
CREATE TABLE customers_3nf (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    city VARCHAR(100)
);

CREATE TABLE orders_3nf (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customers_3nf(customer_id),
    total DECIMAL(10,2)
);

Boyce-Codd Normal Form (BCNF)

A stricter version of 3NF. Every determinant (column that determines others) must be a candidate key.

-- Example: Course scheduling
-- Student can take one course per semester
-- Each course is taught by one professor
-- Each professor teaches only one course

-- ❌ Anomaly: Professor determines Course, but Professor isn't a key
CREATE TABLE enrollments_bad (
    student_id INT,
    semester VARCHAR(20),
    professor VARCHAR(100),
    course VARCHAR(100),  -- Determined by professor!
    PRIMARY KEY (student_id, semester)
);

-- ✅ BCNF: Separate professor-course relationship
CREATE TABLE professor_courses (
    professor VARCHAR(100) PRIMARY KEY,
    course VARCHAR(100) NOT NULL
);

CREATE TABLE enrollments_bcnf (
    student_id INT,
    semester VARCHAR(20),
    professor VARCHAR(100) REFERENCES professor_courses(professor),
    PRIMARY KEY (student_id, semester)
);

                        
                        Practical Rule: Most applications target 3NF. BCNF is occasionally needed for complex business rules. Higher normal forms (4NF, 5NF) are rarely used in practice.
                    

Denormalization Tradeoffs

Denormalization intentionally adds redundancy to improve read performance. It trades storage space and write complexity for faster queries.

Tradeoff diagram comparing normalized vs denormalized schema designs for read and write performance — Normalization vs denormalization tradeoffs: balancing write simplicity against read performance

Normalization vs Denormalization

Aspect	Normalized	Denormalized
Storage	Less (no redundancy)	More (duplicated data)
Write Speed	Faster (update one place)	Slower (update many places)
Read Speed	Slower (more JOINs)	Faster (data co-located)
Consistency	Guaranteed	Must maintain manually
Best For	OLTP, frequent updates	OLAP, read-heavy workloads

-- Normalized: Separate product details
-- Query requires JOIN
SELECT o.order_id, p.name, p.price
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = 123;

-- Denormalized: Store product snapshot in order_items
CREATE TABLE order_items_denorm (
    order_id INT,
    product_id INT,
    quantity INT,
    -- Denormalized: snapshot of product at order time
    product_name VARCHAR(100),
    unit_price DECIMAL(10,2),
    PRIMARY KEY (order_id, product_id)
);

-- Benefits:
-- 1. Faster queries (no JOIN needed)
-- 2. Historical accuracy (price at time of order)
-- Downside: product_name won't update if product renamed

                        
                        When to Denormalize: Only after proving normalized schema is too slow. Profile first, then denormalize specific hotspots. Consider materialized views as an alternative.
                    

Designing for Scalability

-- 1. Use surrogate keys (not natural keys)
-- ✅ Good: system-generated ID
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,  -- Can handle billions
    email VARCHAR(255) UNIQUE
);

-- ❌ Risky: natural key that might change
CREATE TABLE users_bad (
    email VARCHAR(255) PRIMARY KEY  -- What if user changes email?
);

-- 2. Prepare for partitioning
-- Include partition key in primary key
CREATE TABLE events (
    id BIGSERIAL,
    event_date DATE NOT NULL,
    data JSONB,
    PRIMARY KEY (id, event_date)  -- Include partition key
) PARTITION BY RANGE (event_date);

-- 3. Avoid wide tables
-- Split into logical groups
CREATE TABLE user_core (id BIGINT PRIMARY KEY, email VARCHAR(255));
CREATE TABLE user_profile (user_id BIGINT PRIMARY KEY, bio TEXT, avatar_url VARCHAR(500));
CREATE TABLE user_settings (user_id BIGINT PRIMARY KEY, preferences JSONB);

-- 4. Use appropriate data types
-- INT vs BIGINT: Know your scale
-- VARCHAR(n) vs TEXT: PostgreSQL treats them similarly
-- DECIMAL vs FLOAT: DECIMAL for money, FLOAT for science

Real-World Schema Modeling

E-Commerce Schema

-- Core E-Commerce Schema
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE addresses (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT REFERENCES customers(id),
    type VARCHAR(20) CHECK (type IN ('billing', 'shipping')),
    street VARCHAR(255),
    city VARCHAR(100),
    postal_code VARCHAR(20),
    country CHAR(2)
);

CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    sku VARCHAR(50) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    inventory_count INT DEFAULT 0,
    category_id BIGINT REFERENCES categories(id)
);

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT REFERENCES customers(id),
    status VARCHAR(20) DEFAULT 'pending',
    shipping_address_id BIGINT REFERENCES addresses(id),
    total DECIMAL(12,2),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    order_id BIGINT REFERENCES orders(id),
    product_id BIGINT REFERENCES products(id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,  -- Snapshot at order time
    PRIMARY KEY (order_id, product_id)
);

Banking/Financial Schema

-- Banking: Every transaction must be traceable
CREATE TABLE accounts (
    id BIGSERIAL PRIMARY KEY,
    account_number VARCHAR(20) UNIQUE NOT NULL,
    customer_id BIGINT REFERENCES customers(id),
    account_type VARCHAR(20) CHECK (account_type IN ('checking', 'savings', 'credit')),
    balance DECIMAL(15,2) NOT NULL DEFAULT 0,
    currency CHAR(3) DEFAULT 'USD',
    created_at TIMESTAMP DEFAULT NOW()
);

-- Double-entry bookkeeping: Every transaction has debit AND credit
CREATE TABLE transactions (
    id BIGSERIAL PRIMARY KEY,
    transaction_date TIMESTAMP DEFAULT NOW(),
    description VARCHAR(255)
);

CREATE TABLE transaction_entries (
    id BIGSERIAL PRIMARY KEY,
    transaction_id BIGINT REFERENCES transactions(id),
    account_id BIGINT REFERENCES accounts(id),
    entry_type VARCHAR(10) CHECK (entry_type IN ('debit', 'credit')),
    amount DECIMAL(15,2) NOT NULL,
    CHECK (amount > 0)
);

-- Constraint: debits must equal credits (enforced by trigger)

-- Social: Optimized for feeds and relationships
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    display_name VARCHAR(100)
);

-- Self-referencing many-to-many for follows
CREATE TABLE follows (
    follower_id BIGINT REFERENCES users(id),
    following_id BIGINT REFERENCES users(id),
    created_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (follower_id, following_id),
    CHECK (follower_id != following_id)  -- Can't follow yourself
);

CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),
    content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Denormalized counters for performance
ALTER TABLE posts ADD COLUMN like_count INT DEFAULT 0;
ALTER TABLE posts ADD COLUMN comment_count INT DEFAULT 0;
ALTER TABLE users ADD COLUMN follower_count INT DEFAULT 0;

Data Integrity & Constraints

-- NOT NULL: Required fields
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

-- UNIQUE: No duplicates
ALTER TABLE users ADD CONSTRAINT unique_email UNIQUE (email);

-- CHECK: Domain validation
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    quantity INT CHECK (quantity > 0),
    status VARCHAR(20) CHECK (status IN ('pending', 'shipped', 'delivered'))
);

-- FOREIGN KEY: Referential integrity
-- ON DELETE options:
-- CASCADE: Delete related rows
-- SET NULL: Set FK to NULL
-- RESTRICT: Prevent delete if referenced
CREATE TABLE order_items (
    order_id INT REFERENCES orders(id) ON DELETE CASCADE,
    product_id INT REFERENCES products(id) ON DELETE RESTRICT
);

-- EXCLUSION: Prevent overlapping ranges (PostgreSQL)
CREATE TABLE room_bookings (
    room_id INT,
    during TSRANGE,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

Schema Evolution Planning

Schemas change over time. Plan for evolution without breaking existing applications.

Timeline showing safe schema migration strategies including add column, backfill, and constrain steps — Safe schema evolution strategy: add nullable column, backfill data, then apply constraints

-- Safe column additions (backward compatible)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);  -- Nullable by default

-- Dangerous: Adding NOT NULL without default
-- ALTER TABLE users ADD COLUMN age INT NOT NULL;  -- FAILS if rows exist!

-- Safe approach: Add nullable, backfill, then constrain
ALTER TABLE users ADD COLUMN age INT;
UPDATE users SET age = 0 WHERE age IS NULL;
ALTER TABLE users ALTER COLUMN age SET NOT NULL;
ALTER TABLE users ALTER COLUMN age SET DEFAULT 0;

-- Renaming columns: Create view for backward compatibility
ALTER TABLE users RENAME COLUMN name TO full_name;
CREATE VIEW users_v1 AS SELECT id, full_name AS name, email FROM users;

-- Soft deprecation: Add new column, keep old
ALTER TABLE products ADD COLUMN price_cents BIGINT;  -- New: cents
-- Old 'price' DECIMAL column stays until migration complete

                        
                        Migration Strategy: Use tools like Flyway, Liquibase, or Alembic to version control schema changes. Every change should be reversible.
                    

Conclusion & Next Steps

Data modeling is both art and science. Understanding normal forms, knowing when to denormalize, and designing for real-world requirements are skills that distinguish database architects from developers.

Cookie Consent

Cookie Preferences