Complete Database Mastery Part 14: Data Warehousing & Analytics
January 31, 2026 · Wasil Zafar · 42 min read
Master data warehousing and analytics. Learn OLTP vs OLAP, star and snowflake schemas, columnar databases, ETL pipelines, BigQuery, Redshift, Snowflake, and analytics best practices.
Data warehousing transforms raw operational data into actionable business intelligence. Understanding the differences between transactional (OLTP) and analytical (OLAP) workloads is fundamental to designing effective analytics systems.
OLTP vs OLAP — comparing transactional and analytical workload characteristics, query patterns, and optimization goals
Series Context: This is Part 14 of 15 in the Complete Database Mastery series. We're exploring the world of analytics and business intelligence.
Think of OLTP (Online Transaction Processing) as the cash register at a store—fast, frequent, small transactions. OLAP (Online Analytical Processing) is the quarterly sales analysis—complex queries across millions of records.
OLTP vs OLAP Comparison

| Characteristic | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Day-to-day operations | Historical analysis |
| Queries | Simple, focused | Complex, aggregations |
| Data Volume | Current data (GB) | Historical data (TB-PB) |
| Schema | Normalized (3NF) | Denormalized (Star/Snowflake) |
| Users | Many concurrent users | Fewer analysts/BI tools |
| Response Time | Milliseconds | Seconds to minutes |
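The contrast in query patterns can be sketched with an in-memory SQLite database (the `orders` table and its contents are hypothetical; SQLite stands in for both workload types here):

```python
import sqlite3

# Hypothetical orders table standing in for an operational database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "alice", 120.0, 2023), (2, "bob", 80.0, 2023),
    (3, "alice", 200.0, 2024), (4, "carol", 50.0, 2024),
])

# OLTP-style query: point lookup of one record by primary key
oltp = conn.execute("SELECT customer, amount FROM orders WHERE id = 3").fetchone()

# OLAP-style query: aggregation across all historical rows
olap = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()

print(oltp)  # ('alice', 200.0)
print(olap)  # [(2023, 200.0), (2024, 250.0)]
```

The OLTP query touches one row via an index; the OLAP query must scan and group every row, which is why the two workloads are optimized so differently.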
Dimensional Modeling
Dimensional modeling organizes data for analytical queries, optimizing for readability and query performance rather than normalization.
Star schema design — central fact table surrounded by date, product, store, and customer dimension tables
Star Schema
The star schema places a central fact table surrounded by dimension tables—like a star. It's the most common data warehouse design.
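A minimal star schema can be sketched in SQLite (the `fact_sales` and `dim_product` tables and their columns are illustrative, not a production design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive context for analysis
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")
# Fact table: numeric measures plus foreign keys to dimensions
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_key INTEGER, quantity INTEGER, total_amount REAL)")

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Electronics")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 20.0), (2, 1, 1, 10.0), (3, 2, 5, 250.0)])

# Typical star-schema query: join fact to dimension, aggregate by a dimension attribute
result = conn.execute("""
    SELECT p.category, SUM(f.total_amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(result)  # [('Electronics', 250.0), ('Hardware', 30.0)]
```

Every analytical question follows the same shape: join the fact table to one or more dimensions, filter on dimension attributes, aggregate the measures.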
Snowflake Schema
The snowflake schema normalizes dimension tables further, reducing redundancy at the cost of extra joins.
-- Snowflake: Product dimension normalized
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_name VARCHAR(100),
subcategory_key INT REFERENCES dim_subcategory(subcategory_key),
brand_key INT REFERENCES dim_brand(brand_key)
);
CREATE TABLE dim_subcategory (
subcategory_key INT PRIMARY KEY,
subcategory_name VARCHAR(50),
category_key INT REFERENCES dim_category(category_key)
);
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category_name VARCHAR(50)
);
Facts & Dimensions
Key Concepts:
Facts = numeric measures you analyze (sales, quantity, revenue)
Dimensions = context for analysis (who, what, when, where)
Surrogate Keys = auto-generated keys (product_key) vs natural keys (product_id)
Slowly Changing Dimensions (SCD) = handling historical changes in dimensions
-- SCD Type 2: Keep full history
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(20), -- Natural key
name VARCHAR(100),
city VARCHAR(50),
state VARCHAR(50),
-- SCD Type 2 tracking
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- When customer moves:
-- 1. Update current record: end_date = yesterday, is_current = false
-- 2. Insert new record: effective_date = today, is_current = true
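The two-step procedure above can be sketched as a function, here in SQLite against a schema matching `dim_customer` (a simplified sketch; a real warehouse would wrap both steps in one transaction and handle concurrent loads):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT, city TEXT,
    effective_date TEXT, end_date TEXT, is_current INTEGER)""")
conn.execute("INSERT INTO dim_customer (customer_id, city, effective_date, end_date, is_current) "
             "VALUES ('C001', 'NYC', '2020-01-01', NULL, 1)")

def scd2_update(conn, customer_id, new_city, change_date):
    """Close out the current row, then insert the new version as current."""
    prev_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    # Step 1: expire the current record (end_date = day before the change)
    conn.execute("UPDATE dim_customer SET end_date = ?, is_current = 0 "
                 "WHERE customer_id = ? AND is_current = 1", (prev_day, customer_id))
    # Step 2: insert the new version, effective from the change date
    conn.execute("INSERT INTO dim_customer (customer_id, city, effective_date, end_date, is_current) "
                 "VALUES (?, ?, ?, NULL, 1)", (customer_id, new_city, change_date))

scd2_update(conn, "C001", "LA", "2024-06-01")
history = conn.execute("SELECT city, effective_date, end_date, is_current "
                       "FROM dim_customer WHERE customer_id = 'C001' "
                       "ORDER BY effective_date").fetchall()
print(history)
# [('NYC', '2020-01-01', '2024-05-31', 0), ('LA', '2024-06-01', None, 1)]
```

Both versions of the customer survive, so historical facts still join to the city that was current when each sale happened.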
Columnar Databases
Column-Oriented Architecture
Traditional row-based databases store entire rows together. Columnar databases store each column separately, dramatically improving analytical query performance.
Columnar vs row storage — how column-oriented layout reduces I/O for analytical queries scanning specific fields
Row-Based Storage:
Row 1: [Alice, 28, NYC, Engineer]
Row 2: [Bob, 35, LA, Manager]
Row 3: [Carol, 42, CHI, Director]
Column-Based Storage:
Names: [Alice, Bob, Carol]
Ages: [28, 35, 42]
Cities: [NYC, LA, CHI]
Titles: [Engineer, Manager, Director]
-- Analytical query: SELECT AVG(age) FROM employees
-- Row-based: Must read ALL columns for ALL rows
-- Columnar: Only reads the 'age' column - much less I/O!
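The I/O difference can be simulated in plain Python by counting how many values each layout must touch to answer `AVG(age)` (a toy model of the two layouts, not a real storage engine):

```python
# The same data in both layouts
rows = [("Alice", 28, "NYC", "Engineer"),
        ("Bob", 35, "LA", "Manager"),
        ("Carol", 42, "CHI", "Director")]
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [28, 35, 42],
    "city": ["NYC", "LA", "CHI"],
    "title": ["Engineer", "Manager", "Director"],
}

# Row-based: every field of every row is read just to reach the ages
row_values_read = sum(len(r) for r in rows)
avg_from_rows = sum(r[1] for r in rows) / len(rows)

# Columnar: only the age column is read
col_values_read = len(columns["age"])
avg_from_cols = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_cols)      # 35.0 35.0
print(row_values_read, col_values_read)  # 12 3
```

Same answer, one quarter of the values read; on wide tables with billions of rows that ratio is what makes columnar scans fast.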
Compression Techniques
Columnar storage enables excellent compression because similar data is stored together.
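For example, a sorted column of repeated values compresses well under run-length encoding, one of the techniques columnar engines apply (a toy sketch, not a real storage format):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand the (value, run_length) pairs back into the full column."""
    return [v for v, n in pairs for _ in range(n)]

# A sorted 'country' column with long runs of identical values
column = ["US"] * 5 + ["UK"] * 3 + ["DE"] * 4
encoded = rle_encode(column)
print(encoded)  # [('US', 5), ('UK', 3), ('DE', 4)]
assert rle_decode(encoded) == column
```

Twelve values collapse to three pairs; sorting by a low-cardinality column before storage is what makes runs this long common in practice.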
ETL vs ELT Pipelines
Traditional ETL transforms data on an external server before loading; modern ELT loads raw data first and transforms it inside the warehouse.
ETL (Traditional):
[Sources] → [Extract] → [Transform (external server)] → [Load] → [Warehouse]
ELT (Modern):
[Sources] → [Extract] → [Load] → [Warehouse] → [Transform (inside warehouse)]
Popular ETL/ELT Tools
-- dbt (data build tool) - Modern ELT transformation
-- models/sales_summary.sql
{{ config(materialized='table') }}
SELECT
d.year,
d.month_name,
p.category,
SUM(f.total_amount) as total_sales,
COUNT(DISTINCT f.customer_key) as unique_customers
FROM {{ ref('fact_sales') }} f
JOIN {{ ref('dim_date') }} d ON f.date_key = d.date_key
JOIN {{ ref('dim_product') }} p ON f.product_key = p.product_key
GROUP BY d.year, d.month_name, p.category
# Apache Airflow DAG for ETL pipeline
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
dag = DAG(
    'sales_etl',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False  # don't backfill runs before deployment
)

def extract():
    # Extract data from source systems
    pass

def transform():
    # Clean and transform the extracted data
    pass

def load():
    # Load transformed data into the warehouse
    pass

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    dag=dag
)
transform_task = PythonOperator(
    task_id='transform',
    python_callable=transform,
    dag=dag
)
load_task = PythonOperator(
    task_id='load',
    python_callable=load,
    dag=dag
)

# Define task order: extract, then transform, then load
extract_task >> transform_task >> load_task
Modern Analytics Platforms
Google BigQuery
-- BigQuery: Serverless, pay-per-query
SELECT
DATE_TRUNC(order_date, MONTH) as month,
product_category,
SUM(total_amount) as revenue,
COUNT(*) as orders
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, revenue DESC;
-- BigQuery ML: Machine learning in SQL
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
model_type='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='revenue'
) AS
SELECT date, SUM(total_amount) as revenue
FROM `project.dataset.orders`
GROUP BY date;
Modern analytics platforms — BigQuery, Redshift, and Snowflake feature comparison for cloud data warehousing
Amazon Redshift
-- Redshift: Distribution and sort keys
CREATE TABLE fact_sales (
sale_id BIGINT IDENTITY(1,1),
date_key INT,
product_key INT,
store_key INT,
quantity INT,
total_amount DECIMAL(12,2)
)
DISTKEY(product_key) -- Distribute by product for JOINs
SORTKEY(date_key); -- Sort by date for range queries
-- Redshift Spectrum: Query S3 directly
CREATE EXTERNAL SCHEMA s3_data
FROM DATA CATALOG
DATABASE 'external_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3';
SELECT * FROM s3_data.parquet_table;
Snowflake
-- Snowflake: Virtual warehouses for compute scaling
CREATE WAREHOUSE analytics_wh
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Zero-copy cloning for dev/test
CREATE DATABASE analytics_dev CLONE analytics_prod;
-- Time travel: Query historical data
SELECT * FROM orders
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP);
-- Snowflake Streams: CDC for incremental processing
CREATE STREAM orders_changes ON TABLE orders;
SELECT * FROM orders_changes; -- Only changed rows
Analytics Query Optimization
-- Partitioning: Reduce data scanned
-- BigQuery partition by date
CREATE TABLE sales_partitioned
PARTITION BY DATE(order_date)
CLUSTER BY product_category
AS SELECT * FROM raw_sales;
-- Query only needed partitions
SELECT SUM(amount)
FROM sales_partitioned
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Only scans January partition!
-- Materialized views for common aggregations
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
order_date,
product_category,
SUM(amount) as total,
COUNT(*) as orders
FROM sales_partitioned
GROUP BY 1, 2;
Analytics Optimization Tips:
Partition by date (most common filter)
Cluster by frequently filtered columns
Use materialized views for dashboards
Avoid SELECT * — only request needed columns
Pre-aggregate where possible (summary tables)
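The "pre-aggregate where possible" tip can be sketched in SQLite: persist the aggregation as a summary table once, then serve dashboard queries from it instead of rescanning the fact table (tables and data are hypothetical; a real warehouse would refresh this incrementally or use a true materialized view):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-05", "toys", 10.0),
    ("2024-01-05", "toys", 15.0),
    ("2024-01-06", "books", 7.5),
])

# 'Materialized view': the aggregation persisted as a real table
conn.execute("""CREATE TABLE mv_daily_sales AS
    SELECT order_date, category, SUM(amount) AS total, COUNT(*) AS orders
    FROM sales GROUP BY order_date, category""")

# Dashboard query now reads the small summary table, not the fact table
result = conn.execute(
    "SELECT order_date, category, total, orders FROM mv_daily_sales ORDER BY order_date"
).fetchall()
print(result)
# [('2024-01-05', 'toys', 25.0, 2), ('2024-01-06', 'books', 7.5, 1)]
```

The dashboard reads one row per (day, category) instead of one row per sale; the cost of the scan is paid once at refresh time rather than on every query.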
Data Lakehouse Architecture
The Data Lakehouse combines data lake flexibility (store anything cheaply) with data warehouse reliability (ACID, schema enforcement).
Data lakehouse architecture — unifying cheap storage with ACID transactions via Delta Lake, Iceberg, and Hudi
Traditional Architecture:
[Sources] → [Data Lake (cheap storage)] → [Data Warehouse (expensive)]
                  ↓                               ↓
           ML/Data Science                   BI/Analytics
                  (data duplicated across both systems)
Lakehouse Architecture:
[Sources] → [Lakehouse (Delta Lake, Iceberg, Hudi)]
                  ↓
    Unified: BI + ML + Streaming
      (single source of truth)
# Delta Lake with PySpark
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Lakehouse") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table (df is a Spark DataFrame built upstream)
df.write.format("delta").save("/data/delta/sales")

# ACID transactions on the data lake!
DeltaTable.forPath(spark, "/data/delta/sales") \
    .update(
        condition="year = 2024",
        set={"status": "'archived'"}
    )

# Time travel: read an earlier version of the table
df_historical = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/data/delta/sales")
Conclusion & Next Steps
Data warehousing transforms raw data into business value. Mastering dimensional modeling, columnar storage, and modern analytics platforms enables you to build powerful business intelligence solutions.
Continue the Database Mastery Series
Part 13: Database Security & Governance
Secure your analytics data and comply with regulations.