Part 4: Logging Deep Dive — Monitoring, Observability & Reliability

Logging Fundamentals

Logs are discrete, timestamped records of events that occurred within a system. Every significant action — a request arriving, a database query executing, an error occurring, a user authenticating — can be captured as a log entry. Unlike metrics (which aggregate), logs record individual events with full context.

Log Types

Type	Purpose	Examples
Application logs	Record application events and errors	Request logs, error logs, business events
Access logs	Record all inbound requests	Nginx access.log, AWS ALB access logs
Audit logs	Record who did what, when	Admin actions, data modifications, auth events
Security logs	Record security-relevant events	Login attempts, privilege escalation, firewall blocks
Infrastructure logs	Record system-level events	Kernel messages, systemd events, container runtime logs

Log Levels — Choosing the Right Verbosity

Log levels control verbosity. In production, overly verbose logging creates noise and storage costs; insufficient logging leaves you blind when debugging. The standard levels:

Level	Use When	Production Default?
`TRACE`	Extremely detailed execution path (function entry/exit)	Never
`DEBUG`	Detailed diagnostic information for development	No — only during incidents
`INFO`	Normal significant events (request completed, job started)	Yes
`WARN`	Unexpected situations that don't cause errors yet	Yes
`ERROR`	Errors that need attention but don't crash the system	Yes
`FATAL`	Critical errors causing immediate shutdown	Yes

                            
Dynamic Log Levels: Modern systems support changing log levels at runtime without restarting. When an incident occurs, temporarily lower the threshold to DEBUG to capture more detail, then revert to INFO. This avoids the choice between always-verbose (expensive) and always-quiet (blind during incidents).
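As a minimal sketch of the idea, the snippet below uses Python's standard logging module to lower a logger's threshold for a limited scope and then restore it. The helper name temporary_log_level is hypothetical. In a real service the switch would more likely be triggered by an admin endpoint, a signal handler, or a watched config value, none of which is shown here.

```python
import logging
from contextlib import contextmanager

@contextmanager
def temporary_log_level(logger: logging.Logger, level: int):
    """Change a logger's threshold inside a with-block, then restore the old one."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Always revert, even if the block raises
        logger.setLevel(previous)

app_logger = logging.getLogger("order-service")
app_logger.setLevel(logging.INFO)  # production default

debug_before = app_logger.isEnabledFor(logging.DEBUG)
with temporary_log_level(app_logger, logging.DEBUG):
    debug_during = app_logger.isEnabledFor(logging.DEBUG)
debug_after = app_logger.isEnabledFor(logging.DEBUG)
```

No process restart is involved: setLevel takes effect on the next log call.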
                        

Structured Logging

Traditional logs are unstructured text strings. Structured logging is the practice of emitting log entries as machine-parseable data — typically JSON — where every field has a consistent key and predictable value type.

Unstructured log (bad):

2026-05-14 14:23:45 ERROR Failed to process order 12345 for user john@example.com: timeout after 5000ms

Structured log (good):

{
  "timestamp": "2026-05-14T14:23:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error_type": "TimeoutException",
  "duration_ms": 5000,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "environment": "production",
  "version": "2.4.1"
}
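One way to produce entries like this from Python is a custom formatter for the standard logging module. The sketch below is illustrative only: the class name JsonFormatter and the convention of passing extra fields under a "fields" key are assumptions made for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each LogRecord as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"fields": {...}} become top-level keys
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter("order-service"))

log = logging.getLogger("structured-demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.error("Failed to process order",
          extra={"fields": {"order_id": "12345", "duration_ms": 5000}})

parsed = json.loads(stream.getvalue())
```

Because every field has a stable key and type, a query such as "all entries with duration_ms above 1000" becomes a field comparison instead of a regex over free text.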

Mandatory Fields for Every Log Entry

Define a logging schema for your organisation. Every service must include these fields:

Field	Type	Purpose
`timestamp`	ISO 8601 UTC string	When the event occurred
`level`	Enum: TRACE/DEBUG/INFO/WARN/ERROR/FATAL	Severity filter
`service`	String	Which service emitted this
`message`	String	Human-readable description
`trace_id`	String (hex)	Link to distributed trace
`span_id`	String (hex)	Current span in trace
`environment`	String: prod/staging/dev	Filter by environment
`version`	Semver string	Identify which deployment introduced a bug
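A schema is only useful if it is checked. The following sketch shows one possible check that could run in a unit test or a CI step. The function name validate_entry and the exact rules are assumptions for illustration. The allowed level names are taken from the log level table earlier in this part.

```python
from datetime import datetime
from typing import List

MANDATORY_FIELDS = ("timestamp", "level", "service", "message",
                    "trace_id", "span_id", "environment", "version")
ALLOWED_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

def validate_entry(entry: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the entry is valid."""
    problems = []
    for field in MANDATORY_FIELDS:
        if not isinstance(entry.get(field), str) or not entry[field]:
            problems.append(f"missing or non-string field: {field}")
    level = entry.get("level")
    if isinstance(level, str) and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")
    timestamp = entry.get("timestamp")
    if isinstance(timestamp, str):
        try:
            # Older Python versions do not accept a trailing "Z" in fromisoformat()
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems

good_entry = {
    "timestamp": "2026-05-14T14:23:45.123Z", "level": "ERROR",
    "service": "order-service", "message": "Failed to process order",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7",
    "environment": "production", "version": "2.4.1",
}
bad_entry = {k: v for k, v in good_entry.items() if k != "trace_id"}
bad_entry["level"] = "VERBOSE"
```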

Correlation IDs — Connecting Logs Across Services

In a microservice system, a single user action may span 10 services. To trace the complete picture across logs, every service must propagate a correlation ID (also called request ID or trace ID) through all log entries.

import json
import sys
from contextvars import ContextVar
from datetime import datetime, timezone

# Store the correlation ID in a context variable so that concurrent requests
# (threads or asyncio tasks) each see their own value
_correlation_id: ContextVar[str] = ContextVar('correlation_id', default='unknown')

class JSONLogger:
    """Minimal structured logger that writes one JSON object per line to stdout."""

    def __init__(self, service_name: str):
        self.service = service_name

    def _log(self, level: str, message: str, **extra):
        # ISO 8601 UTC timestamp with millisecond precision, e.g. 2026-05-14T14:23:45.123Z
        now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        entry = {
            "timestamp": now.replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "message": message,
            "trace_id": _correlation_id.get(),
            **extra
        }
        # Output to stdout for collection by the log agent
        print(json.dumps(entry), file=sys.stdout, flush=True)

    def info(self, message: str, **kwargs):
        self._log("INFO", message, **kwargs)

    def error(self, message: str, **kwargs):
        self._log("ERROR", message, **kwargs)

logger = JSONLogger("order-service")

# Usage: set the ID once per request, then log as usual
_correlation_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("Processing order", order_id="12345", customer_id="cust-789")
logger.error("Payment failed", order_id="12345", error_code="CARD_DECLINED")
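The logger above only helps if the ID actually travels between services. A common approach is to carry it in an HTTP header. The sketch below assumes a header named X-Correlation-ID and two hypothetical helpers, accept_incoming and outgoing_headers. Real frameworks usually do this in middleware, and header lookup there is case-insensitive, unlike the plain dict used here.

```python
import uuid
from contextvars import ContextVar
from typing import Optional

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name, not a standard

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="unknown")

def accept_incoming(headers: dict) -> str:
    """Reuse the caller's ID if present; otherwise mint one at the edge."""
    cid = headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

def outgoing_headers(extra: Optional[dict] = None) -> dict:
    """Headers for a downstream call, with the current ID attached."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id.get()
    return headers

# A request that arrives with an ID keeps it on the way out
reused = accept_incoming({CORRELATION_HEADER: "4bf92f3577b34da6a3ce929d0e0e4736"})
downstream = outgoing_headers({"Accept": "application/json"})

# A request without an ID gets a fresh 32-character hex ID
minted = accept_incoming({})
```

The rule is simple: never generate a new ID when one was received, so that every hop logs the same value.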

Centralized Logging Architecture

Distributed systems generate massive log volumes across many hosts. Centralized logging aggregates all these logs into a single queryable system, enabling cross-service debugging and organisation-wide visibility.

Pipeline Design

Centralized Log Pipeline Architecture

flowchart TD
    A[Applications\nJSON logs to stdout] -->|container logs| B[Log Collector\nFluent Bit agent]
    C[System Logs\n/var/log/syslog] -->|tail| B
    D[Access Logs\nNginx / Envoy] -->|tail| B
    B -->|parse + enrich| E[Log Aggregator\nFluent Bit / Vector]
    E -->|forward| F[Loki\nLog Storage]
    E -->|forward| G[Elasticsearch\nFull-text Search]
    F -->|LogQL queries| H[Grafana]
    G -->|KQL queries| I[Kibana]

Fluent Bit — Lightweight Log Collector

Fluent Bit is a widely used lightweight log shipper. In Kubernetes it typically runs as a DaemonSet (one pod per node), collecting logs from all containers on that node and forwarding them to your log backend.

# fluent-bit.conf — collecting Kubernetes container logs
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf

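# INPUT: Collect container logs from Kubernetes
# The docker parser expects Docker's json-file log format; on nodes running
# containerd or CRI-O, use a cri parser instead
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker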
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

# FILTER: Enrich with Kubernetes metadata
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

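# FILTER: Parse JSON log bodies that Merge_Log did not already expand
# Reserve_Data keeps the other fields (including Kubernetes metadata) in the record
[FILTER]
    Name         parser
    Match        kube.*
    Key_Name     log
    Parser       json
    Reserve_Data On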

# OUTPUT: Forward to Loki
[OUTPUT]
    Name            loki
    Match           kube.*
    Host            loki.monitoring.svc.cluster.local
    Port            3100
    Labels          job=fluentbit, node=${NODE_NAME}
    Label_Keys      $kubernetes['namespace_name'],$kubernetes['pod_name'],$kubernetes['container_name']
    Remove_Keys     kubernetes,stream
    Auto_Kubernetes_Labels On
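To make the effect of this configuration concrete, here is a rough Python model of what happens to a single record. It is not how Fluent Bit is implemented. The function names and the sample metadata are hypothetical, and the input assumes Docker's json-file wrapper with log, stream, and time keys.

```python
import json

def parse_container_line(raw_line: str) -> dict:
    """Decode the container runtime wrapper around the application's output."""
    return json.loads(raw_line)

def merge_log_body(record: dict) -> dict:
    """If `log` holds a JSON object, lift its fields to the top level and drop `log`."""
    try:
        body = json.loads(record.get("log", ""))
    except (json.JSONDecodeError, TypeError):
        return record  # plain-text line: leave the record untouched
    if not isinstance(body, dict):
        return record
    merged = {k: v for k, v in record.items() if k != "log"}
    merged.update(body)
    return merged

def enrich(record: dict, metadata: dict) -> dict:
    """Attach pod metadata, as the kubernetes filter does from the API server."""
    return {**record, "kubernetes": metadata}

app_output = json.dumps({"level": "ERROR", "message": "Payment failed"}) + "\n"
raw = json.dumps({"log": app_output, "stream": "stdout",
                  "time": "2026-05-14T14:23:45.123456789Z"})
pod = {"namespace_name": "shop", "pod_name": "order-service-abc", "container_name": "app"}

structured = enrich(merge_log_body(parse_container_line(raw)), pod)
plain = merge_log_body({"log": "Starting server on :8080\n", "stream": "stdout"})
```

The second call shows why applications should log JSON: a plain-text line passes through as one opaque log string with no queryable fields.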

Grafana Loki — Log Storage for the Prometheus Generation

Loki is a horizontally scalable log aggregation system developed by Grafana Labs. Its key design decision is to index only labels, not log content, which makes it considerably cheaper to operate than Elasticsearch at scale.

Loki architecture components:

Distributor: Receives log streams, validates them, and routes them to ingesters
Ingester: Buffers recent logs in memory and flushes chunks to object storage
Querier: Executes LogQL queries against recent data held by ingesters and chunks in object storage
Compactor: Compacts index files and applies retention policies
Ruler: Evaluates alerting and recording rules written in LogQL

                            
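Loki vs Elasticsearch: Loki is usually much cheaper to operate than Elasticsearch at scale because it stores compressed log chunks in inexpensive object storage (S3/GCS) and indexes only labels. Treat the often-quoted 10x figure as a rough order of magnitude that depends on workload, not as a benchmark. The trade-off: there is no full-text index. A query first selects streams by label and then scans the matching chunks with substring or regex filters, which becomes slow when the selector matches a lot of data. If you need indexed full-text search across all log fields, Elasticsearch is the better fit. For most operational use cases, Loki is the right choice.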
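The "index only labels" design can be illustrated with a toy in-memory store. This is a conceptual sketch, not Loki's actual data structures, and the class name TinyLabelIndexedStore is made up for the example. The point is that label matching touches only the small index, while content matching has to scan the lines of every selected stream.

```python
import re
from collections import defaultdict
from typing import List

class TinyLabelIndexedStore:
    def __init__(self):
        # Index: one entry per unique label set (a "stream"), pointing at its raw lines
        self.streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, pattern: str = "") -> List[str]:
        regex = re.compile(pattern)
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # cheap: label lookup only
                # expensive: brute-force scan of the stream's content
                results.extend(line for line in lines if regex.search(line))
        return results

store = TinyLabelIndexedStore()
prod_orders = {"service": "order-service", "environment": "production"}
prod_payments = {"service": "payment-service", "environment": "production"}
store.push(prod_orders, "ERROR timeout after 5000ms")
store.push(prod_orders, "INFO order completed")
store.push(prod_payments, "ERROR timeout calling bank")

narrow = store.query({"service": "order-service"}, "timeout")
broad = store.query({"environment": "production"}, "timeout")
```

The narrow selector scans two lines, the broad one scans all three. At real volumes that difference is why precise stream selectors matter so much in Loki.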
                        

LogQL — Querying Logs in Loki

LogQL is Loki's query language, inspired by PromQL. It has two query types: log queries (return log lines) and metric queries (return numeric values derived from logs).

# Log query syntax: {stream selectors} |= "filter"
# Select all logs from the order-service in production
{service="order-service", environment="production"}

# Filter for ERROR level logs
{service="order-service"} | json | level="ERROR"

# Filter for logs containing "timeout"
{service="order-service"} |= "timeout"

# Exclude debug logs
{service="order-service"} != "DEBUG"

# Parse JSON and filter on fields
{service="order-service"}
  | json
  | level="ERROR"
  | duration_ms > 1000

# Per-second error rate, averaged over a 1-minute window (metric query)
sum(rate({service="order-service"} | json | level="ERROR" [1m])) by (service)

# p99 latency from log lines (metric query)
quantile_over_time(0.99,
  {service="order-service"} | json | unwrap duration_ms [5m]
) by (service)

# Find all logs for a specific trace ID
{environment="production"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
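These queries can also be run from scripts through Loki's HTTP API, whose query_range endpoint takes the LogQL expression plus a time window. The sketch below only builds the request URL and does not send it. The base URL reuses the host and port from the Fluent Bit output above with plain HTTP assumed, and the helper names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL: host and port from the Fluent Bit config, scheme is a guess
LOKI_BASE_URL = "http://loki.monitoring.svc.cluster.local:3100"

def to_unix_nanos(moment: datetime) -> int:
    """Loki accepts start/end as Unix epoch nanoseconds (sub-second part dropped here)."""
    return int(moment.timestamp()) * 1_000_000_000

def build_query_range_url(logql: str, start: datetime, end: datetime, limit: int = 100) -> str:
    params = {
        "query": logql,
        "start": to_unix_nanos(start),
        "end": to_unix_nanos(end),
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"

logql = '{service="order-service"} | json | level="ERROR"'
window_start = datetime(2026, 5, 14, 14, 20, tzinfo=timezone.utc)
window_end = datetime(2026, 5, 14, 14, 30, tzinfo=timezone.utc)
url = build_query_range_url(logql, window_start, window_end)
decoded = parse_qs(urlsplit(url).query)
```

Letting urlencode handle the escaping matters, because LogQL is full of braces, quotes, and pipes that are not URL-safe.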

Debugging Pattern

The Log-Based Incident Investigation Workflow

When a metric alert fires (say, error rate spike at 14:23), here is the standard log investigation workflow:

1. Query for ERROR/FATAL logs in the time window: {environment="production"} | json | level=~"ERROR|FATAL" | line_format "{{.service}}: {{.message}}"
2. Identify the service(s) generating errors
3. Get the error message patterns: {service="payment-service"} | json | level="ERROR" | pattern "<_> error: <error>"
4. Pick a specific trace_id from an error log and query all services for that trace
5. Reconstruct the full request timeline from log entries
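The last step of the workflow can be sketched in a few lines of Python, assuming the log lines for the time window have already been fetched and follow the JSON schema described earlier. The function build_timeline and the sample entries are hypothetical.

```python
import json
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

def parse_timestamp(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat()
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def build_timeline(raw_lines: Iterable[str], trace_id: str) -> List[Tuple[int, str, str, str]]:
    """Return (offset_ms, service, level, message) for one trace, oldest first."""
    entries = []
    for line in raw_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            entries.append(entry)
    entries.sort(key=lambda e: parse_timestamp(e["timestamp"]))
    if not entries:
        return []
    first = parse_timestamp(entries[0]["timestamp"])
    return [((parse_timestamp(e["timestamp"]) - first) // timedelta(milliseconds=1),
             e["service"], e["level"], e["message"]) for e in entries]

TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"

def line(ts: str, service: str, level: str, message: str, trace: str = TRACE) -> str:
    return json.dumps({"timestamp": ts, "service": service, "level": level,
                       "message": message, "trace_id": trace})

sample_lines = [
    line("2026-05-14T14:23:45.180Z", "payment-service", "INFO", "Charging card"),
    line("2026-05-14T14:23:45.100Z", "order-service", "INFO", "Processing order"),
    "plain text line that is not JSON",
    line("2026-05-14T14:23:46.000Z", "order-service", "INFO", "Unrelated request", trace="other"),
    line("2026-05-14T14:23:50.180Z", "payment-service", "ERROR", "Payment failed"),
]
timeline = build_timeline(sample_lines, TRACE)
for offset_ms, service, level, message in timeline:
    print(f"+{offset_ms:>5} ms  {service:<16} {level:<5} {message}")
```

Printing offsets relative to the first entry makes gaps stand out. In this made-up trace, the five seconds between "Charging card" and "Payment failed" point at a timeout.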


Advanced Logging Topics

Log Enrichment — Adding Context in the Pipeline

Log enrichment adds metadata to log entries as they flow through the pipeline, without requiring application code changes. Useful enrichment fields:

Kubernetes metadata: namespace, pod name, container name, labels, node name (added by Fluent Bit kubernetes filter)
GeoIP: Country, region, city for source IP addresses (useful for security logs)
Service ownership: Team name, on-call contact, runbook URL (added via label lookup tables)
Deployment context: Git commit SHA, deployment ID (injected as environment variables into pods)

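# Fluent Bit: Add deployment metadata via record_modifier
# ${VAR} is resolved from the Fluent Bit process's own environment, so the same
# value is stamped on every record this agent handles. Per-service values such
# as a commit SHA must come from the application or from pod labels instead.
[FILTER]
    Name          record_modifier
    Match         kube.*
    Record        git_commit  ${GIT_COMMIT_SHA}
    Record        deploy_id   ${DEPLOY_ID}
    Record        cluster     production-us-east-1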

# Fluent Bit: GeoIP enrichment for access logs
# Record syntax: <new key> <lookup key> %{<path in the GeoIP2 database>}
[FILTER]
    Name          geoip2
    Match         nginx.*
    Database      /etc/fluent-bit/GeoLite2-City.mmdb
    Lookup_Key    remote_addr
    Record        geo_country_code  remote_addr  %{country.iso_code}
    Record        geo_city          remote_addr  %{city.names.en}
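Service ownership enrichment, listed above as coming from a lookup table, has no single built-in filter, so here is a conceptual Python sketch of the lookup itself. The table contents, URLs, and the function enrich_with_ownership are all invented for illustration. In a real pipeline the same logic would live in a collector plugin or script.

```python
# Hypothetical ownership table, keyed by the `service` field of each record
OWNERSHIP = {
    "payment-service": {"team": "payments", "runbook_url": "https://runbooks.example.com/payment-service"},
    "order-service": {"team": "checkout", "runbook_url": "https://runbooks.example.com/order-service"},
}

def enrich_with_ownership(record: dict, table: dict = OWNERSHIP) -> dict:
    """Add owner fields to a record; fields already set by the application win."""
    owner = table.get(record.get("service"))
    if owner is None:
        return record  # unknown service: pass through unchanged
    return {**owner, **record}

enriched = enrich_with_ownership({"service": "payment-service", "level": "ERROR",
                                  "message": "Payment failed"})
unknown = enrich_with_ownership({"service": "legacy-batch", "level": "INFO"})
```

Doing this in the pipeline means ownership data changes in one place, without redeploying every service.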

Log Retention & Cost Management

Log storage is one of the largest observability cost drivers. A busy microservice architecture can generate terabytes of logs per day. Strategies to manage cost:

Tiered retention: Keep full-fidelity logs for 7-30 days; move to compressed cold storage for 90-365 days; delete after that
Sampling: For very high-volume INFO-level logs (e.g., access logs), sample at 10% — only ship 1 in 10 lines
Level-based routing: Route DEBUG/TRACE logs to cheap short-term storage; route ERROR/FATAL to full-fidelity long-term storage
Deduplication: Identify and deduplicate repetitive error messages (same error, different timestamps)
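Sampling and level-based routing from the list above can be combined in one routing decision. The sketch below is one possible shape, with hypothetical destination names and a simplification: it samples every INFO entry, where a real pipeline might sample only access logs. Hashing the trace ID, rather than picking lines at random, is a design choice made here so that a sampled request keeps all of its lines.

```python
import zlib

def keep_sample(key: str, percent: int) -> bool:
    """Deterministic sampling: the same key always gets the same decision."""
    return zlib.crc32(key.encode("utf-8")) % 100 < percent

def route(entry: dict, info_sample_percent: int = 10) -> str:
    """Pick a destination: 'long-term', 'standard', 'short-term', or 'drop'."""
    level = entry.get("level", "INFO")
    if level in ("ERROR", "FATAL"):
        return "long-term"   # full fidelity, never sampled
    if level in ("DEBUG", "TRACE"):
        return "short-term"  # cheap storage, short retention
    if level == "INFO":
        key = entry.get("trace_id") or entry.get("message", "")
        return "standard" if keep_sample(key, info_sample_percent) else "drop"
    return "standard"        # WARN and anything unrecognised

decisions = [route({"level": "INFO", "trace_id": f"trace-{i}"}) for i in range(10_000)]
kept = decisions.count("standard")
```

With a 10% rate, kept should land near 1,000 of the 10,000 synthetic entries. The exact count depends on the hash values.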

                            
Cost Warning: Never log large payloads at INFO level in hot paths. Logging full request/response bodies for every API call on a high-traffic service can generate gigabytes per minute. Always profile your log volume before going to production.
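A quick back-of-the-envelope estimate shows how fast this adds up. The traffic numbers below are invented for illustration, and the function name daily_log_volume_gb is hypothetical.

```python
def daily_log_volume_gb(requests_per_second: float, lines_per_request: float,
                        avg_line_bytes: float) -> float:
    """Uncompressed log volume per day, in decimal gigabytes."""
    bytes_per_day = requests_per_second * lines_per_request * avg_line_bytes * 86_400
    return bytes_per_day / 1e9

# Assumed workload: 2,000 requests/s, 5 log lines per request, 500 bytes per line
compact = daily_log_volume_gb(2_000, 5, 500)

# Same traffic, but one extra line per request carrying a 20 KB request/response body
with_bodies = compact + daily_log_volume_gb(2_000, 1, 20_000)

per_minute_gb = with_bodies / (24 * 60)
print(f"{compact:.0f} GB/day without bodies, {with_bodies:.0f} GB/day with bodies "
      f"({per_minute_gb:.1f} GB/min)")
```

In this example the single body-logging line multiplies daily volume by nine, from 432 GB to 3,888 GB.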
                        

Conclusion & Next Steps

Logs are the most detailed observability signal — they tell the full story of what happened. The key insights from Part 4:

Structured logging (JSON) is non-negotiable for modern systems — unstructured text cannot be reliably queried
Correlation IDs in every log entry enable tracing a request's journey across services
Fluent Bit is the standard lightweight log collector for Kubernetes; configure it to enrich and route logs
Loki is ideal for operational log queries (label-based filtering + content regex); Elasticsearch for full-text search
LogQL enables both log-line queries and metric queries derived from log patterns
Retention tiering is essential for cost management — not all logs need to live in hot storage forever

