
Part 4: Logging Deep Dive — From Fundamentals to Centralized

May 14, 2026 Wasil Zafar 21 min read

Logs explain events. They are the timestamped story of what happened inside your system. This part covers structured logging design, building centralized log pipelines with Fluent Bit, and querying logs at scale with Loki and LogQL.

Table of Contents

  1. Logging Fundamentals
  2. Structured Logging
  3. Centralized Logging Architecture
  4. LogQL Query Language
  5. Advanced Topics
  6. Conclusion & Next Steps

Logging Fundamentals

Logs are discrete, timestamped records of events that occurred within a system. Every significant action — a request arriving, a database query executing, an error occurring, a user authenticating — can be captured as a log entry. Unlike metrics (which aggregate), logs record individual events with full context.

Log Types

Type | Purpose | Examples
Application logs | Record application events and errors | Request logs, error logs, business events
Access logs | Record all inbound requests | Nginx access.log, AWS ALB access logs
Audit logs | Record who did what, when | Admin actions, data modifications, auth events
Security logs | Record security-relevant events | Login attempts, privilege escalation, firewall blocks
Infrastructure logs | Record system-level events | Kernel messages, systemd events, container runtime logs

Log Levels — Choosing the Right Verbosity

Log levels control verbosity. In production, overly verbose logging creates noise and storage costs; insufficient logging leaves you blind when debugging. The standard levels:

Level | Use When | Production Default?
TRACE | Extremely detailed execution path (function entry/exit) | Never
DEBUG | Detailed diagnostic information for development | No, only during incidents
INFO | Normal significant events (request completed, job started) | Yes
WARN | Unexpected situations that don't cause errors yet | Yes
ERROR | Errors that need attention but don't crash the system | Yes
FATAL | Critical errors causing immediate shutdown | Yes
Dynamic Log Levels: Many modern logging frameworks support changing log levels at runtime without a restart. When an incident occurs, temporarily switch to DEBUG to capture more detail, then revert to INFO. This avoids having to choose between always-verbose (expensive) and always-quiet (blind during incidents).
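
As a minimal sketch of the idea, the snippet below uses Python's standard logging module to switch a logger to DEBUG for the duration of a block and then restore the previous level. The helper name temporary_level is hypothetical. In a real service the switch would more likely be triggered by an admin endpoint, a signal, or a config reload than by a with-block.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(logger: logging.Logger, level: int):
    """Set a logger's level for the duration of a block, then restore it."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Always revert, even if the block raises
        logger.setLevel(previous)


log = logging.getLogger("order-service")
log.setLevel(logging.INFO)

with temporary_level(log, logging.DEBUG):
    # DEBUG records are emitted only inside this block
    debug_enabled_during = log.isEnabledFor(logging.DEBUG)

debug_enabled_after = log.isEnabledFor(logging.DEBUG)
```

Because the level is restored in a finally clause, a failure during the investigation cannot leave the service stuck at DEBUG.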

Structured Logging

Traditional logs are unstructured text strings. Structured logging is the practice of emitting log entries as machine-parseable data — typically JSON — where every field has a consistent key and predictable value type.

Unstructured log (bad):

2026-05-14 14:23:45 ERROR Failed to process order 12345 for user john@example.com: timeout after 5000ms

Structured log (good):

{
  "timestamp": "2026-05-14T14:23:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error_type": "TimeoutException",
  "duration_ms": 5000,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "environment": "production",
  "version": "2.4.1"
}

Mandatory Fields for Every Log Entry

Define a logging schema for your organisation. Every service must include these fields:

Field | Type | Purpose
timestamp | ISO 8601 UTC string | When the event occurred
level | Enum: DEBUG/INFO/WARN/ERROR/FATAL | Severity filter
service | String | Which service emitted this
message | String | Human-readable description
trace_id | String (hex) | Link to distributed trace
span_id | String (hex) | Current span in trace
environment | String: prod/staging/dev | Filter by environment
version | Semver string | Identify which deployment introduced a bug
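
One way to keep services honest about the schema is to validate entries in a test or CI step. The following is a sketch under stated assumptions, not a complete validator: the function name validate_entry and the exact timestamp pattern are illustrative, and only the fields from the table above are checked.

```python
import re

REQUIRED_FIELDS = ("timestamp", "level", "service", "message",
                   "trace_id", "span_id", "environment", "version")
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
# Assumed format: UTC with a trailing "Z" and optional fractional seconds
UTC_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")
HEX_ID = re.compile(r"^[0-9a-f]+$")


def validate_entry(entry: dict) -> list:
    """Return a list of schema problems; an empty list means the entry conforms."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "timestamp" in entry and not UTC_TIMESTAMP.match(str(entry["timestamp"])):
        problems.append("timestamp is not an ISO 8601 UTC string")
    if "level" in entry and str(entry["level"]) not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {entry['level']}")
    for name in ("trace_id", "span_id"):
        if name in entry and not HEX_ID.match(str(entry[name])):
            problems.append(f"{name} is not a hex string")
    return problems


good = {
    "timestamp": "2026-05-14T14:23:45.123Z", "level": "ERROR",
    "service": "order-service", "message": "Failed to process order",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7",
    "environment": "production", "version": "2.4.1",
}
bad = {"timestamp": "14/05/2026 14:23", "level": "NOTICE", "message": "hi"}

good_problems = validate_entry(good)
bad_problems = validate_entry(bad)
```

Returning a list of problems rather than raising on the first one makes it easier to report every schema violation for a service in a single pass.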

Correlation IDs — Connecting Logs Across Services

In a microservice system, a single user action may span 10 services. To trace the complete picture across logs, every service must propagate a correlation ID (also called request ID or trace ID) through all log entries.

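import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Store the correlation ID in a context variable so it follows the current
# request (thread or async task) without being passed to every call
_correlation_id: ContextVar[str] = ContextVar('correlation_id', default='unknown')

class JSONLogger:
    def __init__(self, service_name: str):
        self.service = service_name
        self.logger = logging.getLogger(service_name)

    def _log(self, level: str, message: str, **extra):
        entry = {
            # ISO 8601 UTC timestamp with millisecond precision, e.g. 2026-05-14T14:23:45.123Z
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "message": message,
            "trace_id": _correlation_id.get(),
            **extra
        }
        # One JSON object per line on stdout, ready for the log collector;
        # default=str keeps non-JSON-native values from raising
        print(json.dumps(entry, default=str))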

    def info(self, message: str, **kwargs):
        self._log("INFO", message, **kwargs)

    def error(self, message: str, **kwargs):
        self._log("ERROR", message, **kwargs)

logger = JSONLogger("order-service")

# Usage
_correlation_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("Processing order", order_id="12345", customer_id="cust-789")
logger.error("Payment failed", order_id="12345", error_code="CARD_DECLINED")
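
The logger above only reads the correlation ID; something still has to set it at the edge of each service and pass it on to downstream calls. A minimal sketch of that propagation follows. The header name X-Correlation-ID and both helper functions are assumptions for illustration, and a plain dict stands in for real HTTP headers (which are case-insensitive).

```python
import uuid
from contextvars import ContextVar

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name, not a standard
_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="unknown")


def bind_correlation_id(inbound_headers: dict) -> str:
    """Reuse the caller's ID if one was sent, otherwise start a new one."""
    correlation_id = inbound_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(correlation_id)
    return correlation_id


def outbound_headers(extra=None) -> dict:
    """Headers for a downstream call, carrying the current correlation ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = _correlation_id.get()
    return headers


# An inbound request that already carries an ID: the same ID is passed on
bind_correlation_id({CORRELATION_HEADER: "4bf92f3577b34da6a3ce929d0e0e4736"})
downstream = outbound_headers({"Accept": "application/json"})

# An inbound request without an ID: a fresh 32-character hex ID is generated
generated = bind_correlation_id({})
downstream_generated = outbound_headers()
```

In a web framework this logic would usually live in middleware, so that individual handlers never touch the header directly.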

Centralized Logging Architecture

Distributed systems generate massive log volumes across many hosts. Centralized logging aggregates all these logs into a single queryable system, enabling cross-service debugging and organisation-wide visibility.

Pipeline Design

Centralized Log Pipeline Architecture

flowchart TD
    A[Applications\nJSON logs to stdout] -->|container logs| B[Log Collector\nFluent Bit agent]
    C[System Logs\n/var/log/syslog] -->|tail| B
    D[Access Logs\nNginx / Envoy] -->|tail| B
    B -->|parse + enrich| E[Log Aggregator\nFluent Bit / Vector]
    E -->|forward| F[Loki\nLog Storage]
    E -->|forward| G[Elasticsearch\nFull-text Search]
    F -->|LogQL queries| H[Grafana]
    G -->|KQL queries| I[Kibana]

Fluent Bit — Lightweight Log Collector

Fluent Bit is a widely used lightweight log shipper. In Kubernetes it typically runs as a DaemonSet (one pod per node), collecting logs from all containers on that node and forwarding them to your log backend.

# fluent-bit.conf — collecting Kubernetes container logs
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf

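# INPUT: Collect container logs from Kubernetes.
# multiline.parser handles both the Docker JSON and the CRI (containerd/CRI-O)
# line formats; the plain docker parser only matches the Docker JSON format.
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri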
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

# FILTER: Enrich with Kubernetes metadata
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

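# FILTER: Parse JSON log bodies that Merge_Log did not already expand.
# Reserve_Data keeps the other fields (including Kubernetes metadata) on the record.
[FILTER]
    Name          parser
    Match         kube.*
    Key_Name      log
    Parser        json
    Reserve_Data  On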

# OUTPUT: Forward to Loki
[OUTPUT]
    Name            loki
    Match           kube.*
    Host            loki.monitoring.svc.cluster.local
    Port            3100
    Labels          job=fluentbit, node=${NODE_NAME}
    Label_Keys      $kubernetes['namespace_name'],$kubernetes['pod_name'],$kubernetes['container_name']
    Remove_Keys     kubernetes,stream
    Auto_Kubernetes_Labels On

Grafana Loki — Log Storage for the Prometheus Generation

Loki is a horizontally-scalable log aggregation system designed by Grafana Labs. Its key design decision: index only labels, not log content. This makes it dramatically cheaper to operate than Elasticsearch at scale.

Loki architecture components:

  • Distributor: Receives log streams, validates, and routes to ingesters
  • Ingester: Buffers recent logs in memory, flushes chunks to object storage
  • Querier: Executes LogQL queries against chunks in object storage
  • Compactor: Compacts index files and applies retention and deletion policies
  • Ruler: Evaluates alerting rules based on log patterns
Loki vs Elasticsearch: At scale, Loki is usually far cheaper to operate than Elasticsearch (roughly 10x by the estimate used here, though the ratio depends on workload) because it stores raw compressed log chunks in cheap object storage (S3/GCS) with minimal indexing overhead. The trade-off: there is no full-text index. Queries must first narrow by labels, and any content filter or regex then scans the selected chunks. If you need indexed full-text search across all log fields, Elasticsearch is better. For most operational use cases, Loki is the right choice.
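
To make the "index only labels" design concrete, the sketch below builds the JSON body that a client sends to Loki's push endpoint (/loki/api/v1/push): each stream is identified by its label set, and the log lines travel as opaque strings paired with a nanosecond timestamp. The function name build_push_payload and the sample entries are illustrative, and nothing is sent over the network.

```python
from collections import defaultdict


def build_push_payload(entries) -> dict:
    """Group (labels, timestamp_ns, line) tuples into Loki streams by label set."""
    streams = defaultdict(list)
    for labels, timestamp_ns, line in entries:
        # The label set is the stream identity; the line itself is not indexed
        stream_key = tuple(sorted(labels.items()))
        streams[stream_key].append([str(timestamp_ns), line])
    return {
        "streams": [
            {"stream": dict(stream_key), "values": values}
            for stream_key, values in streams.items()
        ]
    }


entries = [
    ({"service": "order-service", "environment": "production"},
     1778768625123000000, '{"level": "ERROR", "message": "Failed to process order"}'),
    ({"service": "order-service", "environment": "production"},
     1778768625456000000, '{"level": "INFO", "message": "Retrying order"}'),
    ({"service": "payment-service", "environment": "production"},
     1778768625789000000, '{"level": "ERROR", "message": "Payment failed"}'),
]
payload = build_push_payload(entries)
```

Three log lines collapse into two streams here, which is why a small, stable label set keeps the index small however many lines are ingested.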

LogQL — Querying Logs in Loki

LogQL is Loki's query language, inspired by PromQL. It has two query types: log queries (return log lines) and metric queries (return numeric values derived from logs).

# Log query syntax: {stream selectors} |= "filter"
# Select all logs from the order-service in production
{service="order-service", environment="production"}

# Filter for ERROR level logs
{service="order-service"} | json | level="ERROR"

# Filter for logs containing "timeout"
{service="order-service"} |= "timeout"

# Exclude debug logs
{service="order-service"} != "DEBUG"

# Parse JSON and filter on fields
{service="order-service"}
  | json
  | level="ERROR"
  | duration_ms > 1000

# Error rate in errors per second, averaged over a 1-minute window (metric query)
sum(rate({service="order-service"} | json | level="ERROR" [1m])) by (service)

# p99 latency from log lines (metric query)
quantile_over_time(0.99,
  {service="order-service"} | json | unwrap duration_ms [5m]
) by (service)

# Find all logs for a specific trace ID
{environment="production"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
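
For readers new to pipeline-style queries, here is a rough Python analogue of what the stages in `| json | level="ERROR" | duration_ms > 1000` do to a stream of lines. It is only an illustration of the semantics, not how Loki is implemented, and the function name and sample lines are made up.

```python
import json


def error_slow_requests(lines, min_duration_ms=1000):
    """Yield parsed entries that are ERROR level and slower than the threshold."""
    for line in lines:
        try:
            fields = json.loads(line)  # the "| json" stage
        except json.JSONDecodeError:
            # Unparseable lines yield no level field, so they never match the filters
            continue
        if not isinstance(fields, dict):
            continue
        duration = fields.get("duration_ms")
        if (fields.get("level") == "ERROR"               # | level="ERROR"
                and isinstance(duration, (int, float))
                and duration > min_duration_ms):         # | duration_ms > 1000
            yield fields


sample_lines = [
    '{"level": "ERROR", "duration_ms": 5000, "order_id": "12345"}',
    '{"level": "ERROR", "duration_ms": 200, "order_id": "12346"}',
    '{"level": "INFO", "duration_ms": 4000, "order_id": "12347"}',
    'plain text line that is not JSON',
]
matches = list(error_slow_requests(sample_lines))
```

Only the first sample line survives both filters; each stage narrows the set of lines passed to the next one.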
Debugging Pattern

The Log-Based Incident Investigation Workflow

When a metric alert fires (say, error rate spike at 14:23), here is the standard log investigation workflow:

  1. Query for ERROR/FATAL logs in the time window: {environment="production"} | json | level=~"ERROR|FATAL" | line_format "{{.service}}: {{.message}}"
  2. Identify the service(s) generating errors
  3. Get the error message patterns: {service="payment-service"} | json | level="ERROR" | pattern "<_> error: <error>"
  4. Pick a specific trace_id from an error log and query all services for that trace
  5. Reconstruct the full request timeline from log entries
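
Steps 1 and 4 of the workflow above can be scripted against Loki's HTTP API. The sketch below only builds the request URLs for the query_range endpoint. The http scheme, the five-minute window, and the helper names are assumptions, while the host and port are taken from the Fluent Bit output configuration shown earlier.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Host and port from the Fluent Bit [OUTPUT] section; the http scheme is assumed
LOKI_URL = "http://loki.monitoring.svc.cluster.local:3100"


def to_ns(moment: datetime) -> int:
    """Convert a datetime to Unix epoch nanoseconds (second precision)."""
    return int(moment.timestamp()) * 1_000_000_000


def query_range_url(logql: str, alert_time: datetime,
                    window: timedelta = timedelta(minutes=5), limit: int = 200) -> str:
    """Build a query_range URL covering the window before and after the alert."""
    params = {
        "query": logql,
        "start": to_ns(alert_time - window),
        "end": to_ns(alert_time + window),
        "limit": limit,
        "direction": "forward",  # oldest first, which helps when rebuilding a timeline
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


alert_time = datetime(2026, 5, 14, 14, 23, tzinfo=timezone.utc)

# Step 1: all ERROR/FATAL logs around the alert
step1 = query_range_url('{environment="production"} | json | level=~"ERROR|FATAL"', alert_time)

# Step 4: every log line for one trace, across all services
step4 = query_range_url(
    '{environment="production"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"',
    alert_time,
)
```

The URLs can then be fetched with any HTTP client. Requesting results in forward order means step 5 (reconstructing the timeline) needs no extra sorting.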

Advanced Logging Topics

Log Enrichment — Adding Context in the Pipeline

Log enrichment adds metadata to log entries as they flow through the pipeline, without requiring application code changes. Useful enrichment fields:

  • Kubernetes metadata: namespace, pod name, container name, labels, node name (added by Fluent Bit kubernetes filter)
  • GeoIP: Country, region, city for source IP addresses (useful for security logs)
  • Service ownership: Team name, on-call contact, runbook URL (added via label lookup tables)
  • Deployment context: Git commit SHA, deployment ID (injected as environment variables into pods)
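
# Fluent Bit: Add deployment metadata via record_modifier
# Note: ${...} is resolved from the Fluent Bit process environment, so these
# values describe the agent's own deployment, not each application pod.
[FILTER]
    Name          record_modifier
    Match         kube.*
    Record        git_commit  ${GIT_COMMIT_SHA}
    Record        deploy_id   ${DEPLOY_ID}
    Record        cluster     production-us-east-1

# Fluent Bit: GeoIP enrichment for access logs
# Record syntax: <new field> <lookup key> %{<path in the GeoIP database>}
[FILTER]
    Name          geoip2
    Match         nginx.*
    Database      /etc/fluent-bit/GeoLite2-City.mmdb
    Lookup_Key    remote_addr
    Record        geo_country_code  remote_addr  %{country.iso_code}
    Record        geo_city          remote_addr  %{city.names.en}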
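
The service-ownership enrichment mentioned in the list above is essentially a lookup keyed on existing metadata. As a sketch of that logic, outside of any particular pipeline tool: the OWNERSHIP table, its keys and URLs, and the enrich function are all hypothetical.

```python
# Hypothetical lookup table keyed by Kubernetes namespace
OWNERSHIP = {
    "payments": {"team": "payments-core", "runbook_url": "https://runbooks.example.com/payments"},
    "orders": {"team": "order-platform", "runbook_url": "https://runbooks.example.com/orders"},
}
UNKNOWN_OWNER = {"team": "unowned", "runbook_url": ""}


def enrich(record: dict, deploy_context: dict) -> dict:
    """Return a copy of the record with ownership and deployment fields added."""
    enriched = dict(record)  # shallow copy, so the caller's record is not modified
    namespace = record.get("kubernetes", {}).get("namespace_name", "")
    enriched.update(OWNERSHIP.get(namespace, UNKNOWN_OWNER))
    enriched.update(deploy_context)
    return enriched


record = {
    "level": "ERROR",
    "message": "Payment failed",
    "kubernetes": {"namespace_name": "payments", "pod_name": "payment-service-7d9f"},
}
enriched = enrich(record, {"git_commit": "abc1234", "deploy_id": "deploy-42"})
```

Falling back to an explicit "unowned" value instead of omitting the field makes services that are missing from the table easy to find with a single query.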

Log Retention & Cost Management

Log storage is one of the largest observability cost drivers. A busy microservice architecture can generate terabytes of logs per day. Strategies to manage cost:

  • Tiered retention: Keep full-fidelity logs for 7-30 days; move to compressed cold storage for 90-365 days; delete after that
  • Sampling: For very high-volume INFO-level logs (e.g., access logs), sample at 10% — only ship 1 in 10 lines
  • Level-based routing: Route DEBUG/TRACE logs to cheap short-term storage; route ERROR/FATAL to full-fidelity long-term storage
  • Deduplication: Identify and deduplicate repetitive error messages (same error, different timestamps)
Cost Warning: Never log large, unbounded payloads at INFO level in hot paths. Logging full request/response bodies for every API call on a high-traffic service can generate gigabytes per minute. Always measure your log volume before going to production.
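
The sampling and level-based routing strategies above can be combined in a few lines. The following is a sketch under assumptions: the destination names are placeholders, and sampling is done by hashing the trace ID (a swapped-in choice, not something the strategies above prescribe) so that all INFO lines belonging to one request are kept or dropped together.

```python
import hashlib


def keep_sampled(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate


def route(entry: dict, info_sample_rate: float = 0.10) -> str:
    """Pick a destination for a log entry: long_term, short_term, default, or drop."""
    level = entry.get("level", "INFO")
    if level in ("ERROR", "FATAL"):
        return "long_term"   # full fidelity, long retention
    if level in ("DEBUG", "TRACE"):
        return "short_term"  # cheap, short retention
    if level == "INFO":
        # Sample high-volume INFO logs; WARN falls through and is always kept
        return "default" if keep_sampled(entry.get("trace_id", ""), info_sample_rate) else "drop"
    return "default"


error_route = route({"level": "ERROR", "trace_id": "abc"})
debug_route = route({"level": "DEBUG", "trace_id": "abc"})
warn_route = route({"level": "WARN", "trace_id": "abc"})

# Fraction of 10,000 synthetic trace IDs kept at a 10% sample rate
kept_fraction = sum(keep_sampled(f"trace-{i}", 0.10) for i in range(10_000)) / 10_000
```

Hash-based sampling gives the same decision for the same trace ID on every node, so a sampled request keeps its complete set of log lines across services.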

Conclusion & Next Steps

Logs are the most detailed observability signal — they tell the full story of what happened. The key insights from Part 4:

  • Structured logging (JSON) is non-negotiable for modern systems — unstructured text cannot be reliably queried
  • Correlation IDs in every log entry enable tracing a request's journey across services
  • Fluent Bit is a widely used lightweight log collector for Kubernetes; configure it to enrich and route logs
  • Loki is ideal for operational log queries (label-based filtering + content regex); Elasticsearch for full-text search
  • LogQL enables both log-line queries and metric queries derived from log patterns
  • Retention tiering is essential for cost management — not all logs need to live in hot storage forever