Logging Fundamentals
Logs are discrete, timestamped records of events that occurred within a system. Every significant action — a request arriving, a database query executing, an error occurring, a user authenticating — can be captured as a log entry. Unlike metrics (which aggregate), logs record individual events with full context.
Log Types
| Type | Purpose | Examples |
|---|---|---|
| Application logs | Record application events and errors | Request logs, error logs, business events |
| Access logs | Record all inbound requests | Nginx access.log, AWS ALB access logs |
| Audit logs | Record who did what, when | Admin actions, data modifications, auth events |
| Security logs | Record security-relevant events | Login attempts, privilege escalation, firewall blocks |
| Infrastructure logs | Record system-level events | Kernel messages, systemd events, container runtime logs |
Log Levels — Choosing the Right Verbosity
Log levels control verbosity. In production, overly verbose logging creates noise and storage costs; insufficient logging leaves you blind when debugging. The standard levels:
| Level | Use When | Production Default? |
|---|---|---|
| TRACE | Extremely detailed execution path (function entry/exit) | Never |
| DEBUG | Detailed diagnostic information for development | No, only during incidents |
| INFO | Normal significant events (request completed, job started) | Yes |
| WARN | Unexpected situations that don't cause errors yet | Yes |
| ERROR | Errors that need attention but don't crash the system | Yes |
| FATAL | Critical errors causing immediate shutdown | Yes |
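One way to make the production default adjustable without a redeploy is to read the level from the environment. The sketch below assumes a `LOG_LEVEL` environment variable (a hypothetical name, not something the standard library defines) and maps the level names from the table onto Python's `logging` constants. Python has no built-in TRACE level, so this sketch treats TRACE as DEBUG.

```python
import logging
import os
from typing import Mapping, Optional

# Level names from the table, mapped to stdlib constants.
# Assumption: TRACE is folded into DEBUG because `logging` has no TRACE level.
LEVELS = {
    "TRACE": logging.DEBUG,
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
}


def resolve_level(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the numeric log level named by LOG_LEVEL, defaulting to INFO."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    # Unknown names fall back to the production default instead of failing.
    return LEVELS.get(name, logging.INFO)


logging.basicConfig(level=resolve_level())
log = logging.getLogger("order-service")
log.debug("cache lookup details")  # suppressed unless LOG_LEVEL=DEBUG or TRACE
log.info("request completed")      # emitted at the default level
```

During an incident, setting `LOG_LEVEL=DEBUG` on the affected service and restarting it would raise verbosity for that service only.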
Structured Logging
Traditional logs are unstructured text strings. Structured logging is the practice of emitting log entries as machine-parseable data — typically JSON — where every field has a consistent key and predictable value type.
Unstructured log (bad):
2026-05-14 14:23:45 ERROR Failed to process order 12345 for user john@example.com: timeout after 5000ms
Structured log (good):
{
  "timestamp": "2026-05-14T14:23:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error_type": "TimeoutException",
  "duration_ms": 5000,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "environment": "production",
  "version": "2.4.1"
}
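The difference shows up on the consuming side. The sketch below extracts the order ID and duration from both forms of the entry above. The regular expression is a hypothetical pattern written for this one message; it would break as soon as the wording changed, whereas the JSON parse is generic.

```python
import json
import re

unstructured = (
    "2026-05-14 14:23:45 ERROR Failed to process order 12345 "
    "for user john@example.com: timeout after 5000ms"
)
structured = (
    '{"timestamp": "2026-05-14T14:23:45.123Z", "level": "ERROR", '
    '"message": "Failed to process order", "order_id": "12345", "duration_ms": 5000}'
)

# Unstructured: a hand-written pattern tied to the exact wording of the message.
pattern = re.compile(r"order (?P<order_id>\d+) .*timeout after (?P<duration_ms>\d+)ms")
match = pattern.search(unstructured)
from_text = {
    "order_id": match.group("order_id"),
    "duration_ms": int(match.group("duration_ms")),  # numbers must be converted by hand
}

# Structured: one generic parse, and the fields arrive with their types intact.
from_json = json.loads(structured)

print(from_text)
print(from_json["order_id"], from_json["duration_ms"])
```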
Mandatory Fields for Every Log Entry
Define a logging schema for your organisation. Every service must include these fields:
| Field | Type | Purpose |
|---|---|---|
| timestamp | ISO 8601 UTC string | When the event occurred |
| level | Enum: TRACE/DEBUG/INFO/WARN/ERROR/FATAL | Severity filter |
| service | String | Which service emitted this |
| message | String | Human-readable description |
| trace_id | String (hex) | Link to distributed trace |
| span_id | String (hex) | Current span in trace |
| environment | String: production/staging/development | Filter by environment |
| version | Semver string | Identify which deployment introduced a bug |
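A schema only helps if it is enforced. The sketch below checks a parsed log entry against the fields in the table; it could run in a unit test or a CI check. The function names and the problem messages are illustrative choices, and the level set includes TRACE from the levels table earlier.

```python
import re
from datetime import datetime, timedelta

# Required fields and their JSON types, taken from the table above.
REQUIRED_FIELDS = {
    "timestamp": str, "level": str, "service": str, "message": str,
    "trace_id": str, "span_id": str, "environment": str, "version": str,
}
LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
HEX = re.compile(r"[0-9a-fA-F]+")


def is_utc_iso8601(value: str) -> bool:
    """True if value parses as an ISO 8601 timestamp with a zero UTC offset."""
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalise it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return parsed.utcoffset() == timedelta(0)


def validate_entry(entry: dict) -> list:
    """Return a list of schema problems; an empty list means the entry conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    if problems:
        return problems
    if not is_utc_iso8601(entry["timestamp"]):
        problems.append("timestamp is not ISO 8601 UTC")
    if entry["level"] not in LEVELS:
        problems.append(f"unknown level: {entry['level']}")
    for field in ("trace_id", "span_id"):
        if not HEX.fullmatch(entry[field]):
            problems.append(f"{field} is not hex")
    return problems


example = {
    "timestamp": "2026-05-14T14:23:45.123Z", "level": "ERROR",
    "service": "order-service", "message": "Failed to process order",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7",
    "environment": "production", "version": "2.4.1",
}
print(validate_entry(example))
```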
Correlation IDs — Connecting Logs Across Services
In a microservice system, a single user action may span 10 services. To trace the complete picture across logs, every service must propagate a correlation ID (also called request ID or trace ID) through all log entries.
import json
from contextvars import ContextVar
from datetime import datetime, timezone

# Store the correlation ID in a context variable so each request sees its own value
_correlation_id: ContextVar[str] = ContextVar('correlation_id', default='unknown')


class JSONLogger:
    def __init__(self, service_name: str):
        self.service = service_name

    def _log(self, level: str, message: str, **extra):
        # Current time as an ISO 8601 UTC string with millisecond precision
        now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        entry = {
            "timestamp": now.replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "message": message,
            "trace_id": _correlation_id.get(),
            **extra
        }
        print(json.dumps(entry))  # Output to stdout for collection

    def info(self, message: str, **kwargs):
        self._log("INFO", message, **kwargs)

    def error(self, message: str, **kwargs):
        self._log("ERROR", message, **kwargs)


logger = JSONLogger("order-service")

# Usage
_correlation_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("Processing order", order_id="12345", customer_id="cust-789")
logger.error("Payment failed", order_id="12345", error_code="CARD_DECLINED")
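The logger above only covers one service. For the ID to follow a request, each service also has to read it from the incoming call and attach it to every outgoing call. The sketch below shows that hand-off with plain dictionaries standing in for HTTP headers. The header name `X-Correlation-ID` is an assumption (teams also use `X-Request-ID` or the W3C `traceparent` header), and the helper names are hypothetical.

```python
import uuid
from contextvars import ContextVar
from typing import Mapping, Optional

HEADER = "X-Correlation-ID"  # assumed header name; pick one and use it everywhere

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="unknown")


def accept_request(headers: Mapping[str, str]) -> str:
    """Adopt the caller's correlation ID, or mint one if this is the first hop."""
    # Real frameworks expose case-insensitive header mappings; a plain dict is not.
    incoming = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(incoming)
    return incoming


def outgoing_headers(extra: Optional[Mapping[str, str]] = None) -> dict:
    """Build headers for a downstream call, carrying the current correlation ID."""
    headers = dict(extra or {})
    headers[HEADER] = correlation_id.get()
    return headers


# An edge service receives a request without an ID and creates one.
edge_id = accept_request({})
# A downstream call carries the same ID, so both services log the same trace_id.
downstream = outgoing_headers({"Content-Type": "application/json"})
print(edge_id == downstream[HEADER])
```

In a real service these two functions would live in request middleware and in the HTTP client wrapper, so application code never handles the ID directly.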
Centralized Logging Architecture
Distributed systems generate massive log volumes across many hosts. Centralized logging aggregates all these logs into a single queryable system, enabling cross-service debugging and organisation-wide visibility.
Pipeline Design
flowchart TD
A[Applications\nJSON logs to stdout] -->|container logs| B[Log Collector\nFluent Bit agent]
C[System Logs\n/var/log/syslog] -->|tail| B
D[Access Logs\nNginx / Envoy] -->|tail| B
B -->|parse + enrich| E[Log Aggregator\nFluent Bit / Vector]
E -->|forward| F[Loki\nLog Storage]
E -->|forward| G[Elasticsearch\nFull-text Search]
F -->|LogQL queries| H[Grafana]
G -->|KQL queries| I[Kibana]
Fluent Bit — Lightweight Log Collector
Fluent Bit is a widely used lightweight log shipper. In Kubernetes it runs as a DaemonSet (one pod per node), collecting logs from all containers on that node and forwarding them to your log backend.
# fluent-bit.conf: collecting Kubernetes container logs
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf

# INPUT: Collect container logs from Kubernetes
# (the "docker" parser matches Docker's JSON log files; containerd/CRI-O nodes need the "cri" parser)
[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/*.log
    Parser           docker
    DB               /var/log/flb_kube.db
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 10

# FILTER: Enrich with Kubernetes metadata
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

# FILTER: Parse JSON log bodies
# (Reserve_Data keeps the other fields, such as the Kubernetes metadata, after parsing)
[FILTER]
    Name         parser
    Match        kube.*
    Key_Name     log
    Parser       json
    Reserve_Data On

# OUTPUT: Forward to Loki
[OUTPUT]
    Name                   loki
    Match                  kube.*
    Host                   loki.monitoring.svc.cluster.local
    Port                   3100
    Labels                 job=fluentbit, node=${NODE_NAME}
    Label_Keys             $kubernetes['namespace_name'],$kubernetes['pod_name'],$kubernetes['container_name']
    Remove_Keys            kubernetes,stream
    Auto_Kubernetes_Labels On
Grafana Loki — Log Storage for the Prometheus Generation
Loki is a horizontally scalable log aggregation system built by Grafana Labs. Its key design decision is to index only labels, not log content. The resulting small index makes Loki much cheaper to operate than Elasticsearch at scale, at the price of content searches that must scan the matching log streams.
Loki architecture components:
- Distributor: Receives log streams, validates, and routes to ingesters
- Ingester: Buffers recent logs in memory, flushes chunks to object storage
- Querier: Executes LogQL queries against recent data held by the ingesters and against chunks in object storage
- Compactor: Compacts index files and applies retention policies
- Ruler: Evaluates alerting rules based on log patterns
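To make the "index only labels" decision concrete, here is a toy in-memory model. It is not how Loki is implemented; the class and method names are invented for illustration. Lines are grouped into streams keyed by their label set, a query first narrows by labels using the index, and only then scans the raw lines of the selected streams.

```python
from collections import defaultdict


class ToyLabelIndexedStore:
    """Toy model: labels are indexed, log content is stored unindexed."""

    def __init__(self):
        # Index: frozenset of (label, value) pairs -> list of raw lines (one "stream").
        self.streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its label set."""
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Select streams by labels, then brute-force scan their lines."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # cheap: label lookup only
                # expensive: every line in the selected streams is scanned
                results.extend(line for line in lines if contains in line)
        return results


store = ToyLabelIndexedStore()
store.push({"service": "order-service", "environment": "production"}, "timeout after 5000ms")
store.push({"service": "order-service", "environment": "production"}, "order completed")
store.push({"service": "payment-service", "environment": "production"}, "timeout calling bank")

print(store.query({"service": "order-service"}, contains="timeout"))
```

The model also suggests why narrow label selectors matter: the fewer streams a selector matches, the less content has to be scanned.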
LogQL — Querying Logs in Loki
LogQL is Loki's query language, inspired by PromQL. It has two query types: log queries (return log lines) and metric queries (return numeric values derived from logs).
# Log query syntax: {stream selectors} |= "filter"

# Select all logs from the order-service in production
{service="order-service", environment="production"}

# Filter for ERROR level logs
{service="order-service"} | json | level="ERROR"

# Filter for logs containing "timeout"
{service="order-service"} |= "timeout"

# Exclude lines containing "DEBUG"
{service="order-service"} != "DEBUG"

# Parse JSON and filter on fields
{service="order-service"}
  | json
  | level="ERROR"
  | duration_ms > 1000

# Error rate: errors per second, averaged over a 1-minute window (metric query)
sum(rate({service="order-service"} | json | level="ERROR" [1m])) by (service)

# p99 latency from log lines (metric query)
quantile_over_time(0.99,
  {service="order-service"} | json | unwrap duration_ms [5m]
) by (service)

# Find all logs for a specific trace ID
{environment="production"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
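The same queries can be run outside Grafana through Loki's HTTP API. The sketch below only builds the request URL for the `query_range` endpoint and does not send it. The base URL reuses the in-cluster host from the Fluent Bit configuration and is an assumption about your deployment; the helper name is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed in-cluster address; adjust to wherever Loki is reachable.
LOKI_URL = "http://loki.monitoring.svc.cluster.local:3100"


def query_range_url(logql: str, start: datetime, end: datetime, limit: int = 100) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        # Loki accepts Unix epoch nanoseconds; whole seconds are enough here.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


start = datetime(2026, 5, 14, 14, 20, tzinfo=timezone.utc)
end = datetime(2026, 5, 14, 14, 30, tzinfo=timezone.utc)
url = query_range_url('{service="order-service"}', start, end)
print(url)
```

Fetching the URL with any HTTP client should return JSON containing the matching streams, provided the cluster address is reachable and no authentication proxy sits in front of Loki.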
The Log-Based Incident Investigation Workflow
When a metric alert fires (say, error rate spike at 14:23), here is the standard log investigation workflow:
- Query for ERROR/FATAL logs in the time window:
  {environment="production"} | json | level=~"ERROR|FATAL" | line_format "{{.service}}: {{.message}}"
- Identify the service(s) generating errors
- Get the error message patterns:
  {service="payment-service"} | json | level="ERROR" | pattern "<_> error: <error>"
- Pick a specific trace_id from an error log and query all services for that trace
- Reconstruct the full request timeline from log entries
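The last two steps can be scripted once the log lines have been exported. The sketch below assumes each line is a JSON object that follows the logging schema described earlier; the sample lines and service names are made up for illustration.

```python
import json


def build_timeline(raw_lines, trace_id):
    """Return the entries for one trace, ordered by timestamp."""
    entries = []
    for line in raw_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            entries.append(entry)
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings.
    return sorted(entries, key=lambda e: e.get("timestamp", ""))


lines = [
    '{"timestamp": "2026-05-14T14:23:45.120Z", "service": "payment-service", "level": "ERROR", "message": "Payment failed", "trace_id": "t1"}',
    '{"timestamp": "2026-05-14T14:23:45.001Z", "service": "api-gateway", "level": "INFO", "message": "Request received", "trace_id": "t1"}',
    'plain text line that is not JSON',
    '{"timestamp": "2026-05-14T14:23:45.050Z", "service": "order-service", "level": "INFO", "message": "Processing order", "trace_id": "t1"}',
    '{"timestamp": "2026-05-14T14:23:45.060Z", "service": "order-service", "level": "INFO", "message": "Other request", "trace_id": "t2"}',
]

timeline = build_timeline(lines, "t1")
for entry in timeline:
    print(entry["timestamp"], entry["service"], entry["level"], entry["message"])
```

Sorting by timestamp assumes the hosts' clocks are reasonably synchronised; span IDs from a tracing backend give a more reliable ordering when they are available.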
Advanced Logging Topics
Log Enrichment — Adding Context in the Pipeline
Log enrichment adds metadata to log entries as they flow through the pipeline, without requiring application code changes. Useful enrichment fields:
- Kubernetes metadata: namespace, pod name, container name, labels, node name (added by Fluent Bit kubernetes filter)
- GeoIP: Country, region, city for source IP addresses (useful for security logs)
- Service ownership: Team name, on-call contact, runbook URL (added via label lookup tables)
- Deployment context: Git commit SHA, deployment ID (injected as environment variables into pods)
# Fluent Bit: Add deployment metadata via record_modifier
[FILTER]
    Name   record_modifier
    Match  kube.*
    Record git_commit ${GIT_COMMIT_SHA}
    Record deploy_id ${DEPLOY_ID}
    Record cluster production-us-east-1

# Fluent Bit: GeoIP enrichment for access logs
# (each Record line is: new key, lookup key, %{path in the GeoIP database})
[FILTER]
    Name       geoip2
    Match      nginx.*
    Database   /etc/fluent-bit/GeoLite2-City.mmdb
    Lookup_Key remote_addr
    Record     geo_country_code remote_addr %{country.iso_code}
    Record     geo_city remote_addr %{city.names.en}
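The service-ownership enrichment from the list above has no dedicated filter in these examples, so here is a sketch of the idea in Python. The lookup table, team name, and runbook URL are hypothetical; in practice the table might be loaded from a service catalogue, and the logic would run inside the pipeline instead of in a standalone script.

```python
from typing import Mapping, Optional

# Hypothetical ownership table keyed by service name.
OWNERSHIP = {
    "order-service": {
        "team": "checkout",
        "runbook_url": "https://runbooks.example.com/order-service",
    },
}


def enrich(entry: Mapping, ownership: Mapping = OWNERSHIP,
           deploy: Optional[Mapping] = None) -> dict:
    """Return a copy of the entry with ownership and deployment fields added."""
    enriched = dict(entry)
    additions = {**ownership.get(entry.get("service"), {}), **(deploy or {})}
    for key, value in additions.items():
        # Never overwrite a field the application already set.
        enriched.setdefault(key, value)
    return enriched


entry = {"service": "order-service", "level": "ERROR", "message": "Payment failed"}
print(enrich(entry, deploy={"git_commit": "abc123"}))
```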
Log Retention & Cost Management
Log storage is one of the largest observability cost drivers. A busy microservice architecture can generate terabytes of logs per day. Strategies to manage cost:
- Tiered retention: Keep full-fidelity logs for 7-30 days; move to compressed cold storage for 90-365 days; delete after that
- Sampling: For very high-volume INFO-level logs (e.g., access logs), sample at 10% — only ship 1 in 10 lines
- Level-based routing: Route DEBUG/TRACE logs to cheap short-term storage; route ERROR/FATAL to full-fidelity long-term storage
- Deduplication: Identify and deduplicate repetitive error messages (same error, different timestamps)
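Sampling and level-based routing are simple to prototype. The sketch below samples by hashing the trace ID instead of picking lines at random, which is a design choice of this example: every line of a kept trace survives, so sampled requests stay complete. The destination names are hypothetical placeholders for whatever storage tiers you run.

```python
import hashlib


def keep_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically keep roughly `rate` of all traces."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def route(entry: dict) -> str:
    """Pick a storage tier from the log level (tier names are placeholders)."""
    level = entry.get("level", "INFO")
    if level in ("ERROR", "FATAL"):
        return "long_term"   # full fidelity, long retention
    if level in ("DEBUG", "TRACE"):
        return "short_term"  # cheap storage, short retention
    return "standard"


access_logs = [{"level": "INFO", "trace_id": f"trace-{i}"} for i in range(10_000)]
shipped = [e for e in access_logs if keep_sample(e["trace_id"])]
print(f"shipped {len(shipped)} of {len(access_logs)} access log lines")
print(route({"level": "ERROR"}), route({"level": "DEBUG"}), route({"level": "INFO"}))
```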
Conclusion & Next Steps
Logs are the most detailed observability signal — they tell the full story of what happened. The key insights from Part 4:
- Structured logging (JSON) is non-negotiable for modern systems — unstructured text cannot be reliably queried
- Correlation IDs in every log entry enable tracing a request's journey across services
- Fluent Bit is the standard lightweight log collector for Kubernetes; configure it to enrich and route logs
- Loki is ideal for operational log queries (label-based filtering + content regex); Elasticsearch for full-text search
- LogQL enables both log-line queries and metric queries derived from log patterns
- Retention tiering is essential for cost management — not all logs need to live in hot storage forever