Introduction: OLTP vs OLAP
Data warehousing transforms raw operational data into actionable business intelligence. Understanding the differences between transactional (OLTP) and analytical (OLAP) workloads is fundamental to designing effective analytics systems.
Series Context: This is Part 14 of 15 in the Complete Database Mastery series. We're exploring the world of analytics and business intelligence.
1. SQL Fundamentals & Syntax - Database basics, CRUD operations, joins, constraints
2. Advanced SQL & Query Mastery - CTEs, window functions, stored procedures
3. PostgreSQL Deep Dive - Advanced types, indexing, extensions, tuning
4. MySQL & MariaDB - Storage engines, replication, optimization
5. Transactions & Concurrency - ACID, isolation levels, locking, MVCC
6. Query Optimization & Indexing - EXPLAIN plans, index design, performance
7. Data Modeling & Normalization - ERDs, normal forms, schema design
8. MongoDB & Document Databases - NoSQL, aggregation, sharding
9. Redis & Caching Strategies - Data structures, caching patterns, pub/sub
10. Database Administration & Migrations - Backup, versioning, maintenance
11. Scaling & Distributed Systems - Replication, sharding, CAP theorem
12. Cloud Databases & Managed Services - AWS, Azure, GCP database offerings
13. Database Security & Governance - Encryption, access control, compliance
14. Data Warehousing & Analytics - OLAP, star schemas, columnar DBs (You Are Here)
15. Capstone Projects - Portfolio-ready database implementations
Think of OLTP (Online Transaction Processing) as the cash register at a store—fast, frequent, small transactions. OLAP (Online Analytical Processing) is the quarterly sales analysis—complex queries across millions of records.
OLTP vs OLAP Comparison

| Characteristic | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Day-to-day operations | Historical analysis |
| Queries | Simple, focused | Complex, aggregations |
| Data Volume | Current data (GB) | Historical data (TB-PB) |
| Schema | Normalized (3NF) | Denormalized (Star/Snowflake) |
| Users | Many concurrent users | Fewer analysts/BI tools |
| Response Time | Milliseconds | Seconds to minutes |
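The contrast shows up even in a toy example. Below is a minimal sketch using SQLite via Python's standard library; the `orders` table and its three rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 45.0)],
)

# OLTP-style query: fetch one row by primary key (small, fast, index-backed)
row = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style query: aggregate over the whole table
total, order_count = conn.execute(
    "SELECT SUM(amount), COUNT(*) FROM orders"
).fetchone()
```

A real OLTP system runs millions of queries like the first one concurrently; a warehouse runs fewer but far heavier queries like the second, over billions of rows.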
Dimensional Modeling
Dimensional modeling organizes data for analytical queries, optimizing for readability and query performance rather than normalization.
Star Schema
The star schema places a central fact table surrounded by dimension tables—like a star. It's the most common data warehouse design.
```sql
-- Fact Table: Stores metrics and foreign keys
CREATE TABLE fact_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    store_key INT REFERENCES dim_store(store_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    -- Measures (what we're analyzing)
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    total_amount DECIMAL(12,2)
);

-- Dimension Table: Date
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE,
    day_of_week VARCHAR(10),
    month_name VARCHAR(10),
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);

-- Dimension Table: Product
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(20),
    product_name VARCHAR(100),
    category VARCHAR(50),
    subcategory VARCHAR(50),
    brand VARCHAR(50),
    unit_cost DECIMAL(10,2)
);
```
Snowflake Schema
The snowflake schema normalizes dimension tables further, reducing redundancy but adding joins.
```sql
-- Snowflake: Product dimension normalized
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_name VARCHAR(100),
    subcategory_key INT REFERENCES dim_subcategory(subcategory_key),
    brand_key INT REFERENCES dim_brand(brand_key)
);

CREATE TABLE dim_subcategory (
    subcategory_key INT PRIMARY KEY,
    subcategory_name VARCHAR(50),
    category_key INT REFERENCES dim_category(category_key)
);

CREATE TABLE dim_category (
    category_key INT PRIMARY KEY,
    category_name VARCHAR(50)
);
```
Facts & Dimensions
Key Concepts:
- Facts = numeric measures you analyze (sales, quantity, revenue)
- Dimensions = context for analysis (who, what, when, where)
- Surrogate Keys = auto-generated keys (product_key) vs natural keys (product_id)
- Slowly Changing Dimensions (SCD) = handling historical changes in dimensions
```sql
-- SCD Type 2: Keep full history
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(20),  -- Natural key
    name VARCHAR(100),
    city VARCHAR(50),
    state VARCHAR(50),
    -- SCD Type 2 tracking
    effective_date DATE,
    end_date DATE,
    is_current BOOLEAN
);

-- When a customer moves:
-- 1. Update the current record: end_date = yesterday, is_current = false
-- 2. Insert a new record: effective_date = today, is_current = true
```
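The two-step change procedure above can be sketched in plain Python. An in-memory list of dicts stands in for `dim_customer`, and the helper name `apply_scd2_change` is invented for illustration.

```python
from datetime import date, timedelta

# Toy stand-in for dim_customer: one current row per customer_id
dim_customer = [
    {"customer_key": 1, "customer_id": "C100", "city": "NYC",
     "effective_date": date(2023, 1, 1), "end_date": None, "is_current": True},
]

def apply_scd2_change(dim, customer_id, new_city, today):
    """SCD Type 2: close out the current row, then insert a new current row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["end_date"] = today - timedelta(days=1)  # end_date = yesterday
            row["is_current"] = False
    dim.append({
        "customer_key": max(r["customer_key"] for r in dim) + 1,  # surrogate key
        "customer_id": customer_id,                               # natural key
        "city": new_city,
        "effective_date": today,   # effective_date = today
        "end_date": None,
        "is_current": True,
    })

# Customer C100 moves from NYC to LA on 2024-06-01
apply_scd2_change(dim_customer, "C100", "LA", date(2024, 6, 1))
```

After the change, both rows survive: the old row carries the NYC history with a closed date range, and the new row is the only one flagged `is_current`.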
Columnar Databases
Column-Oriented Architecture
Traditional row-based databases store entire rows together. Columnar databases store each column separately, dramatically improving analytical query performance.
Row-Based Storage:

```
Row 1: [Alice, 28, NYC, Engineer]
Row 2: [Bob, 35, LA, Manager]
Row 3: [Carol, 42, CHI, Director]
```

Column-Based Storage:

```
Names:  [Alice, Bob, Carol]
Ages:   [28, 35, 42]
Cities: [NYC, LA, CHI]
Titles: [Engineer, Manager, Director]
```

```sql
-- Analytical query: SELECT AVG(age) FROM employees
-- Row-based: must read ALL columns for ALL rows
-- Columnar: reads only the 'age' column - much less I/O!
```
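Here is a small Python sketch of the same idea, using the toy employee data above; the lists stand in for on-disk pages.

```python
# Row-based layout: each record's fields are stored together
rows = [
    ("Alice", 28, "NYC", "Engineer"),
    ("Bob", 35, "LA", "Manager"),
    ("Carol", 42, "CHI", "Director"),
]

# Column-based layout: each attribute is stored together
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [28, 35, 42],
    "city": ["NYC", "LA", "CHI"],
    "title": ["Engineer", "Manager", "Director"],
}

# SELECT AVG(age): the row layout must touch every field of every record...
avg_from_rows = sum(r[1] for r in rows) / len(rows)

# ...while the column layout reads exactly one list
avg_from_columns = sum(columns["age"]) / len(columns["age"])
```

With three rows the difference is invisible; with billions of rows and hundreds of columns, reading one column instead of all of them is the entire performance story.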
Compression Techniques
Columnar storage enables excellent compression because similar data is stored together.
1. Run-Length Encoding (RLE)

   Input:  [USA, USA, USA, USA, Canada, Canada, UK]
   Output: [(USA, 4), (Canada, 2), (UK, 1)]

2. Dictionary Encoding

   Values:     [Engineer, Manager, Engineer, Engineer, Director]
   Dictionary: {0: Engineer, 1: Manager, 2: Director}
   Encoded:    [0, 1, 0, 0, 2]

3. Delta Encoding (for sorted numbers)

   Values: [1000, 1001, 1003, 1004, 1010]
   Deltas: [1000, +1, +2, +1, +6]
Result: 10:1 compression ratios are common in analytics databases
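All three encodings are simple to sketch in Python; the function names here are invented for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encoding: collapse each run of repeats into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def dict_encode(values):
    """Dictionary encoding: map each distinct value to a small integer code."""
    dictionary = {}
    encoded = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, encoded

def delta_encode(values):
    """Delta encoding: keep the first value, then store differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

rle_encode(["USA"] * 4 + ["Canada"] * 2 + ["UK"])
# → [('USA', 4), ('Canada', 2), ('UK', 1)]

delta_encode([1000, 1001, 1003, 1004, 1010])
# → [1000, 1, 2, 1, 6]
```

Real columnar engines layer these techniques: a sorted, dictionary-encoded column often run-length encodes its codes, which is where the large compression ratios come from.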
ETL & ELT Pipelines
ETL vs ELT
ETL vs ELT Comparison

| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
| --- | --- | --- |
| Transform Location | External server | Within data warehouse |
| Best For | Traditional warehouses | Cloud warehouses (BigQuery, Snowflake) |
| Data Volume | Transform bottleneck | Scales with warehouse compute |
| Flexibility | Schema-on-write | Schema-on-read |
```sql
-- dbt (data build tool): modern ELT transformation
-- models/sales_summary.sql
{{ config(materialized='table') }}

SELECT
    d.year,
    d.month_name,
    p.category,
    SUM(f.total_amount) AS total_sales,
    COUNT(DISTINCT f.customer_key) AS unique_customers
FROM {{ ref('fact_sales') }} f
JOIN {{ ref('dim_date') }} d ON f.date_key = d.date_key
JOIN {{ ref('dim_product') }} p ON f.product_key = p.product_key
GROUP BY d.year, d.month_name, p.category
```
```python
# Apache Airflow DAG for the ETL pipeline
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG(
    'sales_etl',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
)

def extract():
    # Extract from source systems
    pass

def transform():
    # Clean and transform data
    pass

def load():
    # Load into warehouse
    pass

extract_task = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load, dag=dag)

extract_task >> transform_task >> load_task
```
Modern Analytics Platforms
Google BigQuery
```sql
-- BigQuery: Serverless, pay-per-query
SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    product_category,
    SUM(total_amount) AS revenue,
    COUNT(*) AS orders
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, revenue DESC;

-- BigQuery ML: Machine learning in SQL
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
    model_type='ARIMA_PLUS',
    time_series_timestamp_col='date',
    time_series_data_col='revenue'
) AS
SELECT date, SUM(total_amount) AS revenue
FROM `project.dataset.orders`
GROUP BY date;
```
Amazon Redshift
```sql
-- Redshift: Distribution and sort keys
CREATE TABLE fact_sales (
    sale_id BIGINT IDENTITY(1,1),
    date_key INT,
    product_key INT,
    store_key INT,
    quantity INT,
    total_amount DECIMAL(12,2)
)
DISTKEY(product_key)  -- Distribute by product for JOINs
SORTKEY(date_key);    -- Sort by date for range queries

-- Redshift Spectrum: Query S3 directly
CREATE EXTERNAL SCHEMA s3_data
FROM DATA CATALOG
DATABASE 'external_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3';

SELECT * FROM s3_data.parquet_table;
```
Snowflake
```sql
-- Snowflake: Virtual warehouses for compute scaling
CREATE WAREHOUSE analytics_wh
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;

-- Zero-copy cloning for dev/test
CREATE DATABASE analytics_dev CLONE analytics_prod;

-- Time travel: Query historical data
SELECT * FROM orders
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP);

-- Snowflake Streams: CDC for incremental processing
CREATE STREAM orders_changes ON TABLE orders;
SELECT * FROM orders_changes;  -- Only changed rows
```
Analytics Query Optimization
```sql
-- Partitioning: Reduce data scanned
-- BigQuery: partition by date, cluster by category
CREATE TABLE sales_partitioned
PARTITION BY DATE(order_date)
CLUSTER BY product_category
AS SELECT * FROM raw_sales;

-- Query only the partitions you need
SELECT SUM(amount)
FROM sales_partitioned
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Only scans the January partitions!

-- Materialized views for common aggregations
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
    order_date,
    product_category,
    SUM(amount) AS total,
    COUNT(*) AS orders
FROM sales_partitioned
GROUP BY 1, 2;
```
Analytics Optimization Tips:
- Partition by date (most common filter)
- Cluster by frequently filtered columns
- Use materialized views for dashboards
- Avoid SELECT * — only request needed columns
- Pre-aggregate where possible (summary tables)
Data Lakehouse Architecture
The Data Lakehouse combines data lake flexibility (store anything cheaply) with data warehouse reliability (ACID, schema enforcement).
Traditional Architecture:

```
[Sources] → [Data Lake (cheap storage)] → [Data Warehouse (expensive)]
                    ↓                              ↓
             ML/Data Science                  BI/Analytics
                       (duplicate data)
```

Lakehouse Architecture:

```
[Sources] → [Lakehouse (Delta Lake, Iceberg, Hudi)]
                    ↓
     Unified: BI + ML + Streaming
       (single source of truth)
```
```python
# Delta Lake with PySpark
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a Delta table (df is a DataFrame loaded earlier from your sources)
df.write.format("delta").save("/data/delta/sales")

# ACID transactions on a data lake!
DeltaTable.forPath(spark, "/data/delta/sales").update(
    condition="year = 2024",
    set={"status": "'archived'"},
)

# Time travel: read an earlier version of the table
df_historical = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/data/delta/sales")
)
```
Conclusion & Next Steps
Data warehousing transforms raw data into business value. Mastering dimensional modeling, columnar storage, and modern analytics platforms enables you to build powerful business intelligence solutions.
Next in the Series
In Part 15: Capstone Projects, we'll apply everything we've learned to build portfolio-ready database implementations.
Continue the Database Mastery Series

- Part 13: Database Security & Governance - Secure your analytics data and comply with regulations.
- Part 15: Capstone Projects - Build portfolio-ready analytics implementations.
- Part 7: Data Modeling & Normalization - Compare OLTP modeling with dimensional modeling.