
Complete Database Mastery Part 3: PostgreSQL Deep Dive

January 31, 2026 Wasil Zafar 45 min read

Master PostgreSQL—the most powerful open-source RDBMS. Learn advanced data types, indexing strategies, full-text search, extensions like PostGIS, partitioning, and performance tuning techniques.

Table of Contents

  1. Introduction
  2. PostgreSQL Architecture
  3. Advanced Data Types
  4. Index Types
  5. Full-Text Search
  6. Extensions
  7. Advanced Features
  8. Performance Tuning
  9. Monitoring with pg_stat
  10. Conclusion & Next Steps

Introduction: Why PostgreSQL?

PostgreSQL (often called "Postgres") is the world's most advanced open-source relational database. It combines SQL compliance with powerful features that rival—and often exceed—commercial databases like Oracle.

Series Context: This is Part 3 of 15 in the Complete Database Mastery series. We're diving deep into PostgreSQL's unique capabilities.

Here's why PostgreSQL dominates in many high-stakes environments:

  • ACID Compliant: Full transaction support with no compromises
  • Extensible: Custom data types, operators, indexes, and extensions
  • Standards-based: Among the most SQL-standard-compliant databases available
  • Rich Data Types: JSONB, arrays, ranges, UUIDs, geospatial, and more
  • Mature Ecosystem: Extensions like PostGIS, TimescaleDB, and pg_cron

Who Uses PostgreSQL?

  • Apple: Core infrastructure
  • Instagram: Primary datastore (billions of photos)
  • Spotify: Metadata and user data
  • Reddit: Core database
  • Twitch: Primary database for streaming platform

Version Note: This guide covers PostgreSQL 14-16. Most features work in earlier versions, but newer features like JSON path queries and enhanced partitioning require 12+.

PostgreSQL Architecture

Understanding PostgreSQL's internal architecture helps you optimize performance, troubleshoot issues, and make informed configuration decisions.

Process Model

PostgreSQL uses a process-per-connection model—each client connection gets its own backend process. This is simpler than thread-based models and provides better isolation.

# View PostgreSQL processes
ps aux | grep postgres

# Output:
# postgres   1234  postmaster (main daemon)
# postgres   1235  checkpointer
# postgres   1236  background writer
# postgres   1237  walwriter
# postgres   1238  autovacuum launcher
# postgres   1240  postgres: mydb user [local] idle  (client connection)

Key PostgreSQL Processes

  • postmaster: main daemon; listens for connections and spawns backends
  • checkpointer: writes dirty buffers to disk at checkpoint intervals
  • background writer: gradually writes dirty buffers to reduce checkpoint spikes
  • walwriter: flushes the Write-Ahead Log (WAL) to disk
  • autovacuum launcher: reclaims dead row space and updates statistics
  • backend: handles an individual client connection

Memory Structure

PostgreSQL memory is divided into shared memory (used by all processes) and local memory (per-process).

-- View current memory settings
SHOW shared_buffers;     -- Main cache for table/index data
SHOW work_mem;           -- Memory for sorts/hash joins (per operation!)
SHOW maintenance_work_mem;  -- Memory for maintenance ops (VACUUM, CREATE INDEX)
SHOW effective_cache_size;  -- Planner hint for available cache

-- Real-world example configuration (for 16GB RAM server)
-- In postgresql.conf:
-- shared_buffers = 4GB          # 25% of RAM
-- work_mem = 64MB               # Per-operation, be careful!
-- maintenance_work_mem = 1GB    # For maintenance operations
-- effective_cache_size = 12GB   # 75% of RAM (includes OS cache)

work_mem Danger: work_mem is allocated per operation, per connection. With 100 connections each running a query that performs 3 sorts, you could need 100 × 3 × 64MB ≈ 19GB! Start low (32-64MB) and increase carefully.
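One way to contain that risk is to raise work_mem only for the session or transaction that needs it, leaving the conservative global default in place. A minimal sketch (the query is illustrative):

```sql
-- Raise work_mem for one heavy reporting query only
BEGIN;
SET LOCAL work_mem = '256MB';   -- applies to this transaction only

SELECT customer_id, SUM(total) AS lifetime_value
FROM orders
GROUP BY customer_id
ORDER BY lifetime_value DESC;

COMMIT;  -- work_mem reverts to the global default automatically
```

SET LOCAL is safer than plain SET because the override cannot leak past the transaction, even if the connection is returned to a pool.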

Advanced Data Types

PostgreSQL's rich type system is a major competitive advantage. Beyond standard SQL types, it offers arrays, JSONB, ranges, UUIDs, and the ability to create custom types.

Arrays

PostgreSQL has native array support—store multiple values in a single column. Great for tags, preferences, or any multi-valued attribute.

-- Creating tables with arrays
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title VARCHAR(200),
    tags TEXT[],                    -- Array of text
    scores INTEGER[]                -- Array of integers
);

-- Inserting array values
INSERT INTO articles (title, tags, scores) VALUES
    ('PostgreSQL Tips', ARRAY['database', 'postgresql', 'sql'], ARRAY[85, 92, 88]),
    ('Python Guide', '{"programming", "python", "tutorial"}', '{90, 95}');

-- Querying arrays
SELECT title, tags FROM articles WHERE 'postgresql' = ANY(tags);
SELECT title, tags FROM articles WHERE tags @> ARRAY['sql'];  -- Contains all
SELECT title, tags FROM articles WHERE tags && ARRAY['python', 'java'];  -- Overlaps

-- Array operations
SELECT 
    title,
    tags[1] AS first_tag,           -- 1-indexed!
    array_length(tags, 1) AS tag_count,
    array_to_string(tags, ', ') AS tags_csv,
    scores[1] + scores[2] AS sum_first_two
FROM articles;

-- Unnest: Expand array into rows
SELECT title, unnest(tags) AS tag FROM articles;

JSONB: Binary JSON

JSONB is PostgreSQL's flagship semi-structured data type. It's binary (faster queries) and indexable—getting the best of both SQL and NoSQL worlds.

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    attributes JSONB
);

INSERT INTO products (name, attributes) VALUES
    ('Gaming Laptop', '{
        "brand": "ASUS",
        "specs": {"ram_gb": 32, "storage_gb": 1024, "gpu": "RTX 4080"},
        "ports": ["USB-C", "HDMI", "Ethernet"],
        "available": true
    }'),
    ('Wireless Mouse', '{
        "brand": "Logitech",
        "dpi": 25600,
        "buttons": 8,
        "wireless": true,
        "color_options": ["black", "white"]
    }');

-- JSONB querying
SELECT name, attributes->>'brand' AS brand FROM products;
SELECT name, attributes->'specs'->>'gpu' AS gpu FROM products;
SELECT name FROM products WHERE attributes->>'brand' = 'ASUS';
SELECT name FROM products WHERE (attributes->'specs'->>'ram_gb')::INT >= 16;

-- Check for existence
SELECT name FROM products WHERE attributes ? 'wireless';  -- Has key
SELECT name FROM products WHERE attributes @> '{"available": true}';  -- Contains

-- JSONB modifications
UPDATE products SET attributes = attributes || '{"warranty_years": 2}'
WHERE name = 'Gaming Laptop';

UPDATE products SET attributes = jsonb_set(attributes, '{specs,ram_gb}', '64')
WHERE name = 'Gaming Laptop';

-- Index JSONB for fast queries
CREATE INDEX idx_products_brand ON products ((attributes->>'brand'));
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);
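On PostgreSQL 12+, SQL/JSON path expressions offer another way to query nested JSONB, shown here against the products table above:

```sql
-- jsonb_path_query extracts values using a path expression
SELECT name, jsonb_path_query(attributes, '$.specs.gpu') AS gpu
FROM products;

-- @@ evaluates a boolean path predicate
SELECT name FROM products
WHERE attributes @@ '$.specs.ram_gb >= 16';
```

Path predicates with @@ can also be served by the GIN index created above, which makes them attractive for deeply nested documents.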

UUID & Custom Types

-- UUID: Universally unique identifiers
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE sessions (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    user_id INT,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO sessions (user_id) VALUES (1);
SELECT * FROM sessions;
-- id: 550e8400-e29b-41d4-a716-446655440000

-- Range types: Represent intervals
CREATE TABLE reservations (
    id SERIAL PRIMARY KEY,
    room_id INT,
    during TSRANGE  -- Timestamp range
);

INSERT INTO reservations (room_id, during) VALUES
    (101, '[2024-03-15 09:00, 2024-03-15 11:00)');

-- Prevent double-booking with an exclusion constraint
-- (btree_gist is required to use = on a plain integer column in a GiST index)
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations
ADD CONSTRAINT no_overlap EXCLUDE USING GIST (
    room_id WITH =,
    during WITH &&
);

-- Enum types: Fixed set of values
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered');

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    status order_status DEFAULT 'pending'
);

INSERT INTO orders (status) VALUES ('shipped');
-- INSERT INTO orders (status) VALUES ('cancelled');  -- ERROR: invalid

Other Useful Types: INET/CIDR for IP addresses, MACADDR, MONEY, INTERVAL for durations, POINT/BOX/POLYGON for geometry (or use PostGIS).
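Enums can also be extended later without rewriting the table. For example, the 'cancelled' status that was rejected above can be added in place:

```sql
-- Add a new value to an existing enum (no table rewrite needed)
ALTER TYPE order_status ADD VALUE 'cancelled' AFTER 'delivered';

INSERT INTO orders (status) VALUES ('cancelled');  -- now valid
```

Note that enum values cannot be removed or reordered once added, so choose names carefully.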

Index Types

PostgreSQL offers multiple index types, each optimized for different query patterns. Choosing the right index can make queries 100x faster.

B-Tree & Hash Indexes

B-Tree is the default and most versatile index type. It supports equality and range queries, ordering, and partial matches (prefixes).

-- B-Tree index (default)
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_date ON orders(order_date);

-- Multi-column index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- ✅ Fast: WHERE customer_id = 1
-- ✅ Fast: WHERE customer_id = 1 AND order_date > '2024-01-01'
-- ❌ Slow: WHERE order_date > '2024-01-01'  (leading column missing; index can't be used efficiently)

-- Partial index: Only index rows that match condition
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Index is small, fast, and perfect for "show pending orders" query

-- Index with expressions
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now LOWER(email) queries use the index

-- Hash index (equality only; stores only hashes, so it can be compact for long keys like tokens)
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);
-- Only for: WHERE token = 'abc123' (not LIKE, not ranges)
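A related B-Tree trick (PostgreSQL 11+) is the covering index: INCLUDE stores extra columns in the index's leaf pages so a query can be answered entirely from the index. The column choices here are illustrative:

```sql
-- Covering index: INCLUDE payload columns for index-only scans
CREATE INDEX idx_orders_customer_date_total
ON orders (customer_id, order_date) INCLUDE (total);

-- Can be satisfied by an index-only scan (no heap visits):
SELECT order_date, total FROM orders
WHERE customer_id = 1 AND order_date > '2024-01-01';
```

Included columns are not part of the search key, so they add no sorting overhead; they just save the trip to the table.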

GIN & GiST Indexes

GIN (Generalized Inverted Index) excels at indexing composite values—arrays, JSONB, full-text search.

GiST (Generalized Search Tree) is great for geometric data, ranges, and proximity searches.

-- GIN for JSONB (index all keys and values)
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);

-- Now these are fast:
SELECT * FROM products WHERE attributes @> '{"brand": "Apple"}';
SELECT * FROM products WHERE attributes ? 'wireless';
SELECT * FROM products WHERE attributes ?| ARRAY['wireless', 'bluetooth'];

-- GIN for arrays
CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
SELECT * FROM articles WHERE tags @> ARRAY['postgresql'];

-- GIN for full-text search
CREATE INDEX idx_articles_content ON articles 
USING GIN (to_tsvector('english', content));
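The full-text index above pays off at query time. A sketch, assuming articles has a content column:

```sql
-- Match documents against a tsquery (served by the GIN index above)
SELECT title,
       ts_rank(to_tsvector('english', content),
               to_tsquery('english', 'postgresql & index')) AS rank
FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'postgresql & index')
ORDER BY rank DESC;
```

The expression in the WHERE clause must match the indexed expression exactly (same configuration, same column) for the index to be used.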

-- GiST for range types (prevents overlapping reservations)
CREATE INDEX idx_reservations_period ON reservations 
USING GiST (during);

-- GiST for geometry (with PostGIS)
CREATE INDEX idx_locations_point ON locations 
USING GiST (coordinates);

BRIN Indexes

BRIN (Block Range Index) is incredibly compact—great for large, naturally-ordered data like time-series or log tables.

-- BRIN for time-series data (events inserted in order)
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    event_time TIMESTAMP NOT NULL,
    data JSONB
);

-- BRIN stores min/max per block range (very small!)
CREATE INDEX idx_events_time ON events USING BRIN (event_time);

-- Size comparison (1 billion rows):
-- B-Tree: ~20 GB
-- BRIN: ~200 KB (!!)

-- Works because data is physically ordered by time
-- Query: WHERE event_time BETWEEN '2024-01-01' AND '2024-01-31'
-- BRIN quickly eliminates most blocks, then scans remaining

Index Type Selection Guide

  • B-Tree: =, <, >, BETWEEN, LIKE 'prefix%', ORDER BY (medium size)
  • Hash: = only (small)
  • GIN: arrays, JSONB, full-text search, containment (large)
  • GiST: geometry, ranges, nearest-neighbor (medium)
  • BRIN: naturally ordered data such as timestamps or sequential IDs (tiny)

Extensions

Extensions are PostgreSQL's superpower—add specialized functionality without leaving the database. There are extensions for geospatial data, time-series, cryptography, foreign data, and much more.

-- List available extensions
SELECT * FROM pg_available_extensions;

-- Install an extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_stat_statements";

-- List installed extensions (psql meta-command)
\dx

PostGIS (Geospatial)

PostGIS adds geographic objects and functions—turning PostgreSQL into a powerful spatial database used by mapping applications worldwide.

CREATE EXTENSION postgis;

-- Create table with geometry
CREATE TABLE stores (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    location GEOGRAPHY(POINT, 4326)  -- SRID 4326 = WGS84 (GPS)
);

INSERT INTO stores (name, location) VALUES
    ('Downtown', ST_Point(-73.9857, 40.7484)),  -- NYC
    ('Brooklyn', ST_Point(-73.9442, 40.6782));

-- Find stores within 5km of a point
SELECT name, 
       ST_Distance(location, ST_Point(-73.9712, 40.7831)::geography) AS distance_m
FROM stores
WHERE ST_DWithin(location, ST_Point(-73.9712, 40.7831)::geography, 5000)
ORDER BY distance_m;

-- Create spatial index
CREATE INDEX idx_stores_location ON stores USING GIST (location);

Other Popular Extensions

Must-Know PostgreSQL Extensions

  • pg_stat_statements: track query execution statistics (essential for optimization)
  • uuid-ossp: generate UUIDs (v4 random, v1 timestamp-based)
  • pg_trgm: fuzzy text matching, similarity search, LIKE optimization
  • hstore: key-value store in a column (predecessor to JSONB)
  • pgcrypto: cryptographic functions (hashing, encryption)
  • postgres_fdw: query other PostgreSQL databases as local tables
  • TimescaleDB: time-series optimization (auto-partitioning, compression)
  • pg_cron: schedule jobs within the database

-- pg_trgm: Fuzzy search with LIKE optimization
CREATE EXTENSION pg_trgm;

CREATE INDEX idx_products_name_trgm ON products 
USING GIN (name gin_trgm_ops);

-- Now ILIKE '%gaming%' can use the trigram index!
SELECT * FROM products WHERE name ILIKE '%gaming%';

-- Similarity search
SELECT name, similarity(name, 'laptp') AS sim
FROM products
WHERE similarity(name, 'laptp') > 0.3
ORDER BY sim DESC;

-- pgcrypto: Password hashing
CREATE EXTENSION pgcrypto;

-- Hash password
UPDATE users SET password_hash = crypt('user_password', gen_salt('bf'))
WHERE id = 1;

-- Verify password
SELECT id FROM users
WHERE email = 'user@example.com'
  AND password_hash = crypt('user_password', password_hash);

Advanced Features

Table Partitioning

Partitioning splits large tables into smaller, more manageable pieces. Queries automatically route to relevant partitions, dramatically improving performance for time-series and historical data.

-- Create partitioned table (PostgreSQL 10+)
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR(50),
    data JSONB
) PARTITION BY RANGE (event_time);

-- Create partitions for each month
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE events_2024_03 PARTITION OF events
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Insert goes to correct partition automatically
INSERT INTO events (event_time, event_type, data) VALUES
    ('2024-02-15 10:30:00', 'page_view', '{"page": "/home"}');

-- Query automatically prunes irrelevant partitions
EXPLAIN SELECT * FROM events 
WHERE event_time >= '2024-02-01' AND event_time < '2024-03-01';
-- Only scans events_2024_02!

-- List partitioning (for categories)
CREATE TABLE orders (
    id SERIAL,
    region VARCHAR(10),
    total DECIMAL(10,2)
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU', 'UK');
CREATE TABLE orders_apac PARTITION OF orders FOR VALUES IN ('APAC');
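Two maintenance patterns are worth knowing: a DEFAULT partition catches rows that match no declared partition, and DETACH turns an old partition into a standalone table for archiving:

```sql
-- Catch-all for rows outside every declared range/list
CREATE TABLE events_default PARTITION OF events DEFAULT;

-- Archive an old month: detach it, then dump or drop it at leisure
ALTER TABLE events DETACH PARTITION events_2024_01;
```

Without a default partition, an INSERT that matches no partition fails with an error, which can be either a safety net or a nuisance depending on your ingestion pipeline.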

Materialized Views

Materialized views store query results physically—perfect for expensive reports that don't need real-time data.

-- Create materialized view
CREATE MATERIALIZED VIEW monthly_sales_summary AS
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    AVG(total) AS avg_order
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- Query is instant (reads stored data)
SELECT * FROM monthly_sales_summary WHERE month >= '2024-01-01';

-- Refresh when needed
REFRESH MATERIALIZED VIEW monthly_sales_summary;

-- Concurrent refresh (doesn't block reads) requires a unique index:
CREATE UNIQUE INDEX ON monthly_sales_summary (month);

REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales_summary;
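With the pg_cron extension (covered earlier), the refresh can be scheduled inside the database itself; the job name and schedule below are illustrative:

```sql
-- Refresh the summary every night at 03:00 (requires pg_cron)
SELECT cron.schedule(
    'refresh-monthly-sales',
    '0 3 * * *',
    $$REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales_summary$$
);
```

This keeps the report pipeline self-contained, at the cost of one more moving part to monitor (check the cron.job_run_details table for failures).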

Logical Replication

Logical replication streams changes at the row level—replicate specific tables, support different PostgreSQL versions, or feed data to other systems.

-- On PRIMARY (publisher)
-- postgresql.conf: wal_level = logical

CREATE PUBLICATION my_publication FOR TABLE orders, customers;
-- Or all tables:
CREATE PUBLICATION all_tables FOR ALL TABLES;

-- On REPLICA (subscriber)
CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=primary_host dbname=mydb user=replicator'
    PUBLICATION my_publication;

-- Check replication status
SELECT * FROM pg_stat_subscription;
SELECT * FROM pg_replication_slots;

Replication Types:
  • Physical (streaming): Block-level copy, identical replica, same PG version required
  • Logical: Row-level changes, selective tables, cross-version, cross-platform

Performance Tuning

PostgreSQL ships with conservative defaults. Tuning these settings for your hardware and workload can dramatically improve performance.

Essential postgresql.conf Settings

# Memory settings (for 32GB RAM server, OLTP workload)
shared_buffers = 8GB              # 25% of RAM
effective_cache_size = 24GB       # 75% of RAM (includes OS cache)
work_mem = 64MB                   # Per-operation (careful!)
maintenance_work_mem = 2GB        # For VACUUM, CREATE INDEX

# Write-Ahead Log
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB

# Query planner
random_page_cost = 1.1            # Lower for SSD (default 4.0)
effective_io_concurrency = 200    # Higher for SSD

# Connections
max_connections = 200             # Use connection pooling!

# Parallelism (multi-core)
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_worker_processes = 8

-- EXPLAIN ANALYZE: Understand query execution
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE customer_id = 1234
ORDER BY order_date DESC
LIMIT 10;

-- Output includes:
-- Execution time, rows examined, buffers read, index usage
-- Compare "estimated rows" vs "actual rows" for planner accuracy

-- Common issues to look for:
-- Seq Scan on large table → needs index
-- Large "rows removed by filter" → index not selective enough
-- "Sort" with high memory → increase work_mem or add index
-- Nested Loop with many iterations → check join order

Vacuum & Autovacuum: PostgreSQL's MVCC creates dead rows on UPDATE/DELETE. VACUUM reclaims this space. Never disable autovacuum; instead, tune its aggressiveness if needed.

-- Check table bloat
SELECT 
    relname AS table_name,
    n_live_tup AS live_rows,
    n_dead_tup AS dead_rows,
    ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Force vacuum on specific table
VACUUM (VERBOSE, ANALYZE) orders;

-- Tune autovacuum for high-write tables
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- Vacuum at 1% dead rows
    autovacuum_analyze_scale_factor = 0.005
);

Monitoring with pg_stat

PostgreSQL has built-in statistics views that expose everything happening in your database. These are essential for troubleshooting and optimization.

-- Active connections and queries
SELECT 
    pid,
    usename,
    application_name,
    client_addr,
    state,
    query_start,
    NOW() - query_start AS duration,
    LEFT(query, 80) AS query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Kill a long-running query
SELECT pg_cancel_backend(pid);   -- Graceful cancel
SELECT pg_terminate_backend(pid); -- Force terminate

-- Table statistics: Which tables are hot?
SELECT 
    relname AS table_name,
    seq_scan,                      -- Full table scans
    idx_scan,                      -- Index scans
    n_tup_ins AS inserts,
    n_tup_upd AS updates,
    n_tup_del AS deletes,
    n_live_tup AS live_rows
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;

-- Index usage: Find unused indexes (candidates for removal)
SELECT 
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 50  -- Rarely used
ORDER BY pg_relation_size(indexrelid) DESC;

-- pg_stat_statements: Top queries by time
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT 
    ROUND(total_exec_time::numeric, 2) AS total_time_ms,
    calls,
    ROUND((mean_exec_time)::numeric, 2) AS avg_time_ms,
    ROUND((total_exec_time / SUM(total_exec_time) OVER()) * 100, 2) AS pct,
    LEFT(query, 100) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Reset statistics periodically
SELECT pg_stat_statements_reset();
SELECT pg_stat_reset();  -- Reset all stats for the current database

Key Metrics to Monitor

  • Active connections (pg_stat_activity): alert above 80% of max_connections
  • Long-running queries (pg_stat_activity): alert above 5 minutes
  • Replication lag (pg_stat_replication): alert above 1 minute
  • Cache hit ratio (pg_stat_database): alert below 95%
  • Transaction wraparound (age(pg_database.datfrozenxid)): alert within 200M transactions of wraparound

-- Cache hit ratio (should be > 95%)
SELECT 
    SUM(heap_blks_read) AS disk_reads,
    SUM(heap_blks_hit) AS cache_hits,
    ROUND(SUM(heap_blks_hit)::numeric / NULLIF(SUM(heap_blks_hit) + SUM(heap_blks_read), 0) * 100, 2) AS cache_hit_ratio
FROM pg_statio_user_tables;
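The transaction-wraparound metric above can be checked directly; age() reports how many transactions old each database's frozen XID is:

```sql
-- Transaction ID age per database (alert well before ~2 billion)
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

If xid_age approaches autovacuum_freeze_max_age (default 200 million), autovacuum will launch aggressive anti-wraparound vacuums; intervene with a manual VACUUM before that point on very large tables.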

Conclusion & Next Steps

PostgreSQL's advanced features make it the database of choice for complex applications. From JSONB to PostGIS, from partitioning to logical replication—you now have the knowledge to leverage its full power.

Next in the Series

In Part 4: MySQL & MariaDB, we'll explore the industry-standard databases for web systems—storage engines, replication, and optimization strategies.
