Tool Deep Dive: Loki Complete Guide

Loki Architecture

Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. Unlike traditional log systems (Elasticsearch, Splunk), Loki indexes only labels — not the full text of log lines — making it significantly cheaper to operate at scale.

                            
Key Insight: Loki is "like Prometheus, but for logs." It uses the same label-based approach for discovery, the same service discovery mechanisms, and integrates natively with Grafana. Logs are stored as compressed chunks indexed only by their label set and timestamp range.
                        

Loki Architecture — Write & Read Paths

flowchart TD
    A[Promtail / Alloy] -->|Push logs| B[Distributor]
    B -->|Hash ring routing| C[Ingester]
    C -->|Flush chunks| D["Object Storage<br/>S3 / GCS / Azure Blob"]
    C -->|Write index| E[Index Gateway]
    E -->|Store index| D

    F[Grafana / LogCLI] -->|LogQL query| G[Query Frontend]
    G -->|Split & cache| H[Querier]
    H -->|Recent data| C
    H -->|Historical data| D
    H -->|Index lookups| E

    I[Compactor] -->|Merge & deduplicate| D
    I -->|Retention enforcement| D

The major components in the Loki architecture:

Component	Role
Distributor	Receives incoming log streams, validates labels, and routes to ingesters via consistent hashing
Ingester	Builds compressed chunks in memory, flushes to object storage on size/time thresholds
Querier	Executes LogQL queries — reads from both ingesters (recent) and object storage (historical)
Query Frontend	Splits large queries into smaller sub-queries, caches results, enforces query limits
Compactor	Merges small index files, deduplicates chunks, enforces retention policies
Index Gateway	Serves index queries to queriers, reducing direct object storage reads
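
To make the distributor's "consistent hashing" concrete, here is a small Python sketch of a hash ring. It is an illustration under stated assumptions, not Loki's implementation: the `Ring` class, the SHA-256 based hash, and the 64 tokens per ingester are all hypothetical choices. The point it shows is that a stream's tenant and label set always map to the same ring position, and the next distinct ingesters clockwise from that position receive the replicas.

```python
import bisect
import hashlib


def _hash32(text):
    # First 4 bytes of SHA-256 as an unsigned 32-bit ring position (illustrative choice).
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")


class Ring:
    def __init__(self, ingesters, tokens_per_ingester=64):
        # Every ingester owns several tokens so load spreads evenly around the ring.
        self._tokens = sorted(
            (_hash32(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._positions = [pos for pos, _ in self._tokens]
        self._members = set(ingesters)

    def owners(self, tenant, labels, replication_factor=3):
        """Return the distinct ingesters that should receive this stream."""
        # Sort labels so the same label set always produces the same key.
        key = tenant + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        idx = bisect.bisect_left(self._positions, _hash32(key))
        want = min(replication_factor, len(self._members))
        chosen = []
        while len(chosen) < want:
            # Walk clockwise, wrapping around, and skip ingesters already chosen.
            name = self._tokens[idx % len(self._tokens)][1]
            if name not in chosen:
                chosen.append(name)
            idx += 1
        return chosen


if __name__ == "__main__":
    ring = Ring(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
    print(ring.owners("tenant-a", {"namespace": "production", "service": "api-gateway"}))
```

Because the key depends only on the tenant and label set, every line of a stream lands on the same ingesters, which is what lets an ingester build one chunk per stream.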

LogQL Essentials

LogQL is Loki's query language — structurally similar to PromQL but designed for logs. Every query begins with a log stream selector followed by optional filter, parser, and metric stages.

Log Stream Selectors

Stream selectors use label matchers to identify which log streams to query:

# Exact match
{namespace="production", service="api-gateway"}

# Regex match
{namespace="production", service=~"api-.*"}

# Not equal
{namespace!="kube-system"}

# Regex not match
{service!~"debug-.*"}
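
The four matcher operators can be modeled in a few lines of Python. This is a sketch for intuition rather than Loki's code: the `matches` function and its tuple format are made up for the example, Python's `re` module stands in for Go's RE2 engine, and a missing label is assumed to behave like an empty value. It does reflect one documented detail: regex matchers in stream selectors are fully anchored, so `api-.*` must match the whole label value.

```python
import re


def matches(labels, matchers):
    """Return True if a stream's labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # assumption: a missing label acts like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # anchored at both ends
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False  # matchers are combined with AND
    return True


if __name__ == "__main__":
    stream = {"namespace": "production", "service": "api-gateway"}
    selector = [("namespace", "=", "production"), ("service", "=~", "api-.*")]
    print(matches(stream, selector))
```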

Line Filter Expressions

After selecting streams, filter log lines by content:

# Contains string (case-sensitive)
{service="api-gateway"} |= "error"

# Does not contain
{service="api-gateway"} != "health"

# Regex match
{service="api-gateway"} |~ "status=(4|5)\\d{2}"

# Regex not match
{service="api-gateway"} !~ "GET /healthz"

# Chain multiple filters (AND logic)
{service="api-gateway"} |= "error" != "health" |~ "timeout|connection refused"
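
The AND behavior of chained filters is easy to check with a short Python model. The `keep_line` helper below is hypothetical and uses Python's `re` in place of Loki's regex engine; it mirrors the last query above, where a line must contain "error", must not contain "health", and must match the regex somewhere in the line (line filter regexes are not anchored).

```python
import re


def keep_line(line, filters):
    """Apply (op, pattern) line filters in order; every filter must pass."""
    for op, pattern in filters:
        if op == "|=":
            ok = pattern in line  # case-sensitive substring match
        elif op == "!=":
            ok = pattern not in line
        elif op == "|~":
            ok = re.search(pattern, line) is not None  # unanchored regex
        elif op == "!~":
            ok = re.search(pattern, line) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


CHAIN = [("|=", "error"), ("!=", "health"), ("|~", "timeout|connection refused")]

if __name__ == "__main__":
    for line in ["error: upstream timeout", "error in health check", "ERROR timeout"]:
        print(keep_line(line, CHAIN), line)
```

Putting the cheapest and most selective filter first means later, costlier regex filters run on fewer lines.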

Parser Expressions

Extract structured fields from log lines for filtering and aggregation:

# JSON parser — extracts all JSON keys as labels
{service="api-gateway"} | json

# Filter on extracted field
{service="api-gateway"} | json | status >= 500

# logfmt parser — for key=value formatted logs
{service="payments"} | logfmt | level="error" | duration > 500ms

# Regexp parser — named capture groups become labels
{service="nginx"} | regexp `(?P<method>\w+) (?P<path>\S+) (?P<status>\d+)`
| status >= 400

# Line format — rewrite the log line for display
{service="api-gateway"} | json
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"

# Label format — rename or modify extracted labels
{service="api-gateway"} | json | label_format duration_s="{{divf .duration_ms 1000}}"
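
The regexp parser example maps directly onto Python, which uses the same `(?P<name>...)` syntax for named groups. The sketch below is only an approximation of the pipeline: `extract` and `status_at_least` are hypothetical helpers, and it is assumed here that a line which does not match gains no labels and therefore fails the numeric comparison.

```python
import re

# Same pattern as the nginx example above: named groups become label names.
PATTERN = re.compile(r"(?P<method>\w+) (?P<path>\S+) (?P<status>\d+)")


def extract(line):
    """Return the extracted labels for a line, or an empty dict if nothing matches."""
    match = PATTERN.search(line)
    return match.groupdict() if match else {}


def status_at_least(labels, threshold):
    """Model of the `| status >= 400` label filter on extracted labels."""
    try:
        return int(labels["status"]) >= threshold
    except (KeyError, ValueError):
        return False  # no usable status label, so the line is filtered out


if __name__ == "__main__":
    for line in ["GET /api/users 200", "POST /login 503", "garbage"]:
        labels = extract(line)
        print(status_at_least(labels, 400), labels)
```

Note that extracted values are strings until a comparison such as `>=` forces a numeric interpretation, which is why the helper converts `status` before comparing.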

Metric Queries

Convert log streams into numeric time series for dashboards and alerting:

# Count log lines per second (error rate)
rate({service="api-gateway"} |= "error" [5m])

# Total count over time window
count_over_time({service="api-gateway"} |= "error" [1h])

# Bytes rate — ingestion throughput per stream
bytes_rate({namespace="production"}[5m])

# Sum by label for top error producers
sum by (service) (rate({namespace="production"} |= "error" [5m]))

# Quantile over extracted numeric values
quantile_over_time(0.99,
  {service="api-gateway"} | json | unwrap duration_ms [5m]
) by (method)

# Average request size using unwrap
avg_over_time(
  {service="api-gateway"} | json | unwrap bytes | __error__="" [5m]
) by (endpoint)
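
What `count_over_time` and `rate` compute can be shown with plain arithmetic over log timestamps. This Python sketch is a simplification: the function names simply echo the LogQL ones, timestamps are seconds as floats, and the window is assumed to be open on the left and closed on the right.

```python
def count_over_time(timestamps, now, window_s):
    """Number of log lines whose timestamp falls in (now - window_s, now]."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def rate(timestamps, now, window_s):
    """Log lines per second, averaged over the window."""
    return count_over_time(timestamps, now, window_s) / window_s


if __name__ == "__main__":
    # Timestamps (seconds) of lines that passed the |= "error" filter.
    error_lines = [0, 100, 250, 299, 300, 301]
    print(count_over_time(error_lines, now=300, window_s=300))
    print(rate(error_lines, now=300, window_s=300))
```

A `[5m]` range therefore reports a per-second average over five minutes, so short bursts are smoothed out; a shorter range reacts faster but is noisier.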

Unwrap Expressions

The unwrap operator extracts a numeric value from a parsed label, enabling mathematical aggregations over log data:

# Extract duration_ms from JSON logs and compute p99 latency per endpoint.
# The __error__="" filter drops lines where parsing or unwrapping failed.
quantile_over_time(0.99,
  {service="api-gateway"}
    | json
    | unwrap duration_ms
    | __error__="" [5m]
) by (endpoint)

# Total response bytes per method over each 5m window
# (sum_over_time takes no grouping clause, so wrap it in sum by)
sum by (method) (
  sum_over_time(
    {service="api-gateway"}
      | logfmt
      | unwrap response_bytes
      | __error__="" [5m]
  )
)

# Bytes processed per second, per handler
sum by (handler) (
  rate(
    {service="api-gateway"}
      | json
      | unwrap bytes_processed
      | __error__="" [5m]
  )
)
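
The unwrap step amounts to "parse each line, pull out one numeric field, discard lines where that fails, then aggregate the numbers". The Python sketch below models that flow for a p99 grouped by endpoint. The helper names are hypothetical, the window is assumed to have been applied already, and the quantile uses linear interpolation between neighboring ranks, which is an assumption about the method rather than a statement of how Loki computes it.

```python
import json
from collections import defaultdict


def unwrap(lines, field, group_by):
    """Group the numeric `field` of each JSON line by the `group_by` key."""
    groups = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
            value = float(record[field])
        except (ValueError, KeyError, TypeError):
            continue  # plays the role of | __error__="": skip unusable lines
        groups[record.get(group_by, "")].append(value)
    return groups


def quantile(q, values):
    """q-quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


if __name__ == "__main__":
    logs = ['{"endpoint": "/a", "duration_ms": %d}' % d for d in (10, 20, 30, 40, 50)]
    logs += ["not json", '{"endpoint": "/a", "duration_ms": "oops"}']
    for endpoint, samples in unwrap(logs, "duration_ms", "endpoint").items():
        print(endpoint, quantile(0.99, samples))
```

Without the error filter, a single malformed line would surface as an error in the query result, which is why the filter appears in every unwrap query in this section.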

                            
Performance Warning: Metric queries over large time ranges are expensive. Always include tight label selectors and line filters before parsers and unwrap to reduce data scanned. Use recording rules for dashboard queries that aggregate across many streams.
                        

Label Strategy

Labels are the foundation of Loki's indexing model. Each unique combination of labels creates a separate stream. Too many streams (high cardinality) bloat the index and fragment logs into many small chunks, which sharply degrades both ingestion and query performance.

Category	Label	Cardinality	Recommendation
Use ✓	`namespace`	Low (10-50)	Kubernetes namespace — primary query dimension
	`service`	Low-Medium (50-200)	Service or deployment name
	`level`	Very Low (4-6)	info, warn, error, debug, fatal
	`cluster`	Low (2-10)	Multi-cluster identification
	`env`	Very Low (3-4)	dev, staging, production
Avoid ✗	`user_id`	Unbounded	Extract at query time with `| json | user_id="abc123"`
	`request_id`	Unbounded	Use line filter: `|= "req-abc123"`
	`ip_address`	High (thousands)	Extract with parser at query time
	`trace_id`	Unbounded	Use derived fields in Grafana for linking
	`pod_name`	High (dynamic)	Pods are ephemeral — use `service` + `namespace`

                            
The 10-Label Rule: Keep total unique label combinations (active streams) under 100,000 per tenant. Each added label multiplies the potential stream count by the number of distinct values it takes. A good target is 5-8 static labels per log stream. Anything you would grep for at query time should be extracted with a parser, not stored as a label.
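
The multiplication behind this rule is worth seeing with numbers. The Python sketch below uses the cardinality ranges from the table above to compute a worst-case stream count; `max_streams` is a hypothetical helper, and the 10,000 `user_id` values are an invented figure for illustration. The result is an upper bound, since real deployments never use every combination.

```python
from math import prod


def max_streams(label_cardinalities):
    """Upper bound on streams: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Upper ends of the "Use" rows in the table above.
RECOMMENDED = {"namespace": 50, "service": 200, "level": 6, "cluster": 10, "env": 4}

if __name__ == "__main__":
    print(max_streams(RECOMMENDED))
    # Adding one high-cardinality label multiplies the bound by its value count.
    # The 10,000 distinct user IDs are an assumed figure, not from the table.
    print(max_streams({**RECOMMENDED, "user_id": 10_000}))
```

Even the recommended labels can exceed the 100,000-stream budget in theory, so the practical stream count depends on how few combinations actually occur; a single unbounded label removes that safety margin entirely.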
                        

Storage Backends

Loki stores two types of data: chunks (compressed log data) and index (label-to-chunk mappings). Both can target different backends depending on scale and cost requirements.

Backend	Chunks	Index	Best For	Limitations
Filesystem	✓	✓	Development, single-node testing	No HA, limited scalability, data loss risk
Amazon S3	✓	✓ (TSDB)	AWS production deployments	Request and cross-region data transfer costs
Google GCS	✓	✓ (TSDB)	GCP production deployments	Less cost-effective for frequent reads
Azure Blob	✓	✓ (TSDB)	Azure production deployments	Higher latency for small objects
MinIO	✓	✓ (TSDB)	On-premise S3-compatible storage	Self-managed, capacity planning needed

# loki-config.yaml — S3 storage with TSDB index
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    # Credentials are read from the environment; start Loki with
    # -config.expand-env=true so the ${...} references are expanded.
    bucketnames: my-loki-bucket
    region: us-east-1
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: s3

Deployment Modes

Loki offers three deployment modes, each suited to different scale requirements:

Mode	Components	Scale	Best For
Monolithic	All in single binary	Up to roughly 20 GB/day	Development, small teams, single-node
Simple Scalable	Read path + Write path + Backend	Up to a few TB/day	Most production workloads, Kubernetes
Microservices	Each component independently scaled	Beyond a few TB/day	Large-scale multi-tenant platforms

Monolithic Mode

# Single binary — all components in one process
loki -config.file=/etc/loki/loki-config.yaml -target=all

Simple Scalable Mode (Recommended)

The Simple Scalable deployment groups components into three targets that can be independently scaled:

# Helm values for simple-scalable deployment
loki:
  auth_enabled: true
  commonConfig:
    replication_factor: 3

write:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }
  persistence:
    size: 50Gi

read:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }

backend:
  replicas: 2
  resources:
    requests: { cpu: "500m", memory: "1Gi" }
    limits: { cpu: "1", memory: "2Gi" }

gateway:
  replicas: 2
  ingress:
    enabled: true
    hosts:
      - host: loki.internal.example.com

Microservices Mode

# Each component runs as a separate deployment
loki -config.file=/etc/loki/loki-config.yaml -target=distributor
loki -config.file=/etc/loki/loki-config.yaml -target=ingester
loki -config.file=/etc/loki/loki-config.yaml -target=querier
loki -config.file=/etc/loki/loki-config.yaml -target=query-frontend
loki -config.file=/etc/loki/loki-config.yaml -target=compactor
loki -config.file=/etc/loki/loki-config.yaml -target=index-gateway

                            
Recommendation: Start with Simple Scalable mode for most production deployments. It provides horizontal scaling with far less operational complexity than full microservices. Only move to microservices when you need fine-grained control over individual component resources (typically at multi-TB/day scale).
                        

Production Checklist

Loki Production Readiness

Use object storage (S3/GCS/Azure Blob) for chunks and TSDB index — never rely on filesystem storage in production
Set replication_factor: 3 for ingesters to survive node failures without data loss
Keep active stream count below 100,000 per tenant and enforce it cluster-wide with the max_global_streams_per_user limit
Configure retention with compactor — set retention_enabled: true and define retention_period per tenant
Enable query frontend caching (memcached or Redis) to avoid repeated object storage reads
Set per-tenant limits: ingestion_rate_mb and ingestion_burst_size_mb for writes, max_query_series for queries
Use structured logging (JSON or logfmt) at the application level to enable efficient parser-based queries
Deploy Promtail/Alloy with pipeline stages that drop debug logs before shipping — reduce ingestion volume at the source
Configure chunk_target_size: 1572864 (1.5 MB) for optimal compression ratio and read performance
Monitor Loki itself with Prometheus — track loki_ingester_chunk_utilization, loki_distributor_bytes_received_total, and query latency histograms
