Tool Deep Dive: Loki Complete Guide

May 14, 2026 Wasil Zafar

The definitive reference for Grafana Loki — from architecture internals and LogQL mastery to label strategies, storage backends, deployment modes, and production tuning for cost-effective log aggregation at scale.

Table of Contents

  1. Loki Architecture
  2. LogQL Essentials
  3. Label Strategy
  4. Storage Backends
  5. Deployment Modes
  6. Production Checklist

Loki Architecture

Loki is a horizontally scalable, highly available log aggregation system inspired by Prometheus. Unlike traditional log systems such as Elasticsearch or Splunk, which index the full text of every log line, Loki indexes only labels. This makes it significantly cheaper to operate at scale.

Key Insight: Loki is "like Prometheus, but for logs." It uses the same label-based approach for discovery, the same service discovery mechanisms, and integrates natively with Grafana. Logs are stored as compressed chunks indexed only by their label set and timestamp range.
Loki Architecture: Write & Read Paths
flowchart TD
    A[Promtail / Alloy] -->|Push logs| B[Distributor]
    B -->|Hash ring routing| C[Ingester]
    C -->|Flush chunks| D[Object Storage<br/>S3 / GCS / Azure Blob]
    C -->|Upload TSDB index files| D
    E[Index Gateway] -->|Download index| D
    F[Grafana / LogCLI] -->|LogQL query| G[Query Frontend]
    G -->|Split & cache| H[Querier]
    H -->|Recent data| C
    H -->|Historical data| D
    H -->|Index lookups| E
    I[Compactor] -->|Merge & deduplicate index| D
    I -->|Retention enforcement| D

The major components in the Loki architecture:

Component | Role
--- | ---
Distributor | Receives incoming log streams, validates labels, and routes each stream to ingesters via consistent hashing
Ingester | Builds compressed chunks in memory and flushes them to object storage when size or age thresholds are reached
Querier | Executes LogQL queries, reading recent data from ingesters and historical data from object storage
Query Frontend | Splits large queries into smaller sub-queries, caches results, and enforces query limits
Compactor | Merges small index files, deduplicates index entries, and enforces retention policies
Index Gateway | Serves index lookups to queriers, so each querier does not have to download and read the index from object storage itself

LogQL Essentials

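LogQL is Loki's query language: structurally similar to PromQL, but designed for logs. Every log query starts with a log stream selector, optionally followed by a pipeline of line filters, parsers, label filters, and formatters. Metric queries wrap such a log query in a range aggregation such as rate() or count_over_time().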

Log Stream Selectors

Stream selectors use label matchers to identify which log streams to query:

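# Exact match
{namespace="production", service="api-gateway"}

# Regex match (anchored: the regex must match the whole label value)
{namespace="production", service=~"api-.*"}

# Not equal (negative matchers need at least one positive matcher alongside them)
{env="production", namespace!="kube-system"}

# Regex not match
{namespace="production", service!~"debug-.*"}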
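
Label matchers behave like PromQL's: regexes are fully anchored, and a missing label is treated as an empty value. Loki also rejects a selector unless at least one `=` or `=~` matcher cannot match the empty string, which is why the negative examples above are paired with a positive matcher. The Python sketch below models those rules with `re.fullmatch`; `Matcher`, `selects`, and `is_valid_selector` are hypothetical names, and Python's `re` only approximates the RE2 syntax Loki uses.

```python
import re
from dataclasses import dataclass

OPS = ("=", "!=", "=~", "!~")


@dataclass(frozen=True)
class Matcher:
    """One label matcher from a stream selector, e.g. service=~"api-.*" (sketch)."""
    name: str
    op: str
    value: str

    def __post_init__(self) -> None:
        if self.op not in OPS:
            raise ValueError(f"unknown operator {self.op!r}")

    def matches(self, labels: dict[str, str]) -> bool:
        actual = labels.get(self.name, "")  # a missing label behaves like an empty value
        if self.op == "=":
            return actual == self.value
        if self.op == "!=":
            return actual != self.value
        hit = re.fullmatch(self.value, actual) is not None  # anchored, like LogQL label regexes
        return hit if self.op == "=~" else not hit


def is_valid_selector(matchers: list[Matcher]) -> bool:
    """Require at least one = or =~ matcher that rejects the empty string."""
    return any(m.op in ("=", "=~") and not m.matches({}) for m in matchers)


def selects(matchers: list[Matcher], labels: dict[str, str]) -> bool:
    """A stream is selected only if every matcher accepts its labels."""
    return all(m.matches(labels) for m in matchers)


prod_api = [Matcher("namespace", "=", "production"), Matcher("service", "=~", "api-.*")]
print(selects(prod_api, {"namespace": "production", "service": "api-gateway"}))    # True
print(selects(prod_api, {"namespace": "production", "service": "legacy-api-v1"}))  # False
print(is_valid_selector([Matcher("namespace", "!=", "kube-system")]))             # False
```

The second call fails only because of anchoring: `api-.*` appears inside `legacy-api-v1`, but does not match it from start to end.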

Line Filter Expressions

After selecting streams, filter log lines by content:

# Contains string (case-sensitive)
{service="api-gateway"} |= "error"

# Does not contain
{service="api-gateway"} != "health"

# Regex match
{service="api-gateway"} |~ "status=(4|5)\\d{2}"

# Regex not match
{service="api-gateway"} !~ "GET /healthz"

# Chain multiple filters (AND logic)
{service="api-gateway"} |= "error" != "health" |~ "timeout|connection refused"
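
Line filters are applied left to right, and a line survives only if it passes every stage. Unlike label matchers, the regex line filters `|~` and `!~` are not anchored: they match anywhere in the line. The Python sketch below models that chaining with hypothetical helper names; Python's `re.search` stands in for Loki's RE2 matching, which is close enough for patterns like these.

```python
import re
from typing import Callable, Iterable, Iterator

LineFilter = Callable[[str], bool]


def contains(s: str) -> LineFilter:           # |= "s"  (case-sensitive substring)
    return lambda line: s in line


def not_contains(s: str) -> LineFilter:       # != "s"
    return lambda line: s not in line


def matches(pattern: str) -> LineFilter:      # |~ "pattern"  (unanchored search)
    rx = re.compile(pattern)
    return lambda line: rx.search(line) is not None


def not_matches(pattern: str) -> LineFilter:  # !~ "pattern"
    rx = re.compile(pattern)
    return lambda line: rx.search(line) is None


def apply_filters(lines: Iterable[str], filters: list[LineFilter]) -> Iterator[str]:
    """Keep a line only if every filter in the chain accepts it (AND logic)."""
    return (line for line in lines if all(f(line) for f in filters))


lines = [
    'level=error msg="upstream timeout" path=/orders',
    'level=error msg="health check failed" path=/healthz',
    'level=info msg="connection refused, retrying"',
    'level=error msg="connection refused" path=/payments',
]
chain = [contains("error"), not_contains("health"), matches("timeout|connection refused")]
for kept in apply_filters(lines, chain):
    print(kept)  # the /orders and /payments lines
```

Cheap substring filters such as `|= "error"` reject most lines before any regex runs, which is why ordering them first is a common habit.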

Parser Expressions

Extract structured fields from log lines for filtering and aggregation:

# JSON parser — extracts all JSON keys as labels
{service="api-gateway"} | json

# Filter on extracted field
{service="api-gateway"} | json | status >= 500

# logfmt parser — for key=value formatted logs
{service="payments"} | logfmt | level="error" | duration > 500ms

# Regexp parser: named capture groups become labels (matches the "$request" $status part of nginx's combined log format)
{service="nginx"} | regexp `"(?P<method>\w+) (?P<path>\S+) [^"]*" (?P<status>\d{3})`
| status >= 400

# Line format — rewrite the log line for display
{service="api-gateway"} | json
| line_format "{{.timestamp}} [{{.level}}] {{.message}}"

# Label format — rename or modify extracted labels
{service="api-gateway"} | json | label_format duration_s="{{ divf .duration_ms 1000 }}"
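
The json parser flattens nested objects into labels by joining keys with underscores, so `{"request": {"method": "GET"}}` becomes `request_method="GET"`, and a line that is not valid JSON is marked with `__error__="JSONParserErr"`. The Python sketch below approximates that extraction followed by the `status >= 500` filter. It is a rough model under those assumptions, not Loki's parser: it skips arrays and ignores Loki's label-name sanitisation rules, and `flatten`, `json_stage`, and `server_errors` are hypothetical helpers.

```python
import json
from typing import Iterable, Iterator


def flatten(obj: dict, prefix: str = "") -> dict[str, str]:
    """Flatten nested objects into label-style keys joined with '_' (arrays skipped)."""
    out: dict[str, str] = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        elif isinstance(value, list):
            continue
        elif isinstance(value, str):
            out[name] = value
        else:
            out[name] = json.dumps(value)  # numbers and booleans keep their JSON spelling
    return out


def json_stage(line: str) -> dict[str, str]:
    """Rough stand-in for `| json`: extracted labels, or an __error__ marker."""
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        return {"__error__": "JSONParserErr"}
    if not isinstance(parsed, dict):
        return {"__error__": "JSONParserErr"}
    return flatten(parsed)


def server_errors(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Equivalent in spirit to `| json | status >= 500`."""
    for line in lines:
        labels = json_stage(line)
        status = labels.get("status", "")
        if status.isdigit() and int(status) >= 500:  # non-numeric values are skipped here
            yield labels


logs = [
    '{"status": 503, "request": {"method": "GET", "path": "/orders"}}',
    '{"status": 200, "request": {"method": "POST", "path": "/pay"}}',
    "not json at all",
]
print(list(server_errors(logs)))
```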

Metric Queries

Convert log streams into numeric time series for dashboards and alerting:

# Count log lines per second (error rate)
rate({service="api-gateway"} |= "error" [5m])

# Total count over time window
count_over_time({service="api-gateway"} |= "error" [1h])

# Bytes rate — ingestion throughput per stream
bytes_rate({namespace="production"}[5m])

# Sum by label for top error producers
sum by (service) (rate({namespace="production"} |= "error" [5m]))

# Quantile over extracted numeric values
quantile_over_time(0.99,
  {service="api-gateway"} | json | unwrap duration_ms [5m]
) by (method)

# Average request size using unwrap
avg_over_time(
  {service="api-gateway"} | json | unwrap bytes | __error__="" [5m]
) by (endpoint)
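
For a plain log range, `count_over_time` counts the entries inside each window and `rate` divides that count by the window length in seconds, so `rate(... [5m])` equals `count_over_time(... [5m]) / 300`. The Python sketch below evaluates both at a single instant, plus a `sum by`-style grouping. It assumes the window covers entries newer than `at - range` up to and including `at`; check Loki's exact boundary handling before relying on edge cases. All function names are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_over_time(timestamps: list[datetime], at: datetime, window: timedelta) -> int:
    """Entries in the window (at - window, at]; the boundary convention is an assumption."""
    start = at - window
    return sum(1 for ts in timestamps if start < ts <= at)


def rate(timestamps: list[datetime], at: datetime, window: timedelta) -> float:
    """Log lines per second: the windowed count divided by the window length."""
    return count_over_time(timestamps, at, window) / window.total_seconds()


def sum_by(label: str, series: list[tuple[dict[str, str], float]]) -> dict[str, float]:
    """Like `sum by (label) (...)`: add up per-stream values that share a label value."""
    totals: dict[str, float] = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


now = datetime(2026, 5, 14, 12, 0, 0)
errors = [now - timedelta(seconds=s) for s in range(600)]  # one hypothetical error line per second
print(count_over_time(errors, now, timedelta(minutes=5)))  # 300
print(rate(errors, now, timedelta(minutes=5)))             # 1.0
print(sum_by("service", [({"service": "api", "pod": "a"}, 1.0),
                         ({"service": "api", "pod": "b"}, 0.5),
                         ({"service": "web"}, 2.0)]))      # {'api': 1.5, 'web': 2.0}
```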

Unwrap Expressions

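The unwrap operator extracts a numeric value from a parsed label so that range aggregations can do arithmetic over log data. unwrap is only valid inside a range aggregation such as quantile_over_time() or sum_over_time(): the whole log pipeline, including the [range] selector, goes inside the function call, and __error__="" drops samples whose value could not be parsed or converted: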

# p99 latency per endpoint from the duration_ms field of JSON logs
quantile_over_time(0.99,
  {service="api-gateway"}
  | json
  | unwrap duration_ms
  | __error__=""
  [5m]
) by (endpoint)

# Total response bytes per method (sum_over_time does not accept by/without, so group with an outer sum)
sum by (method) (
  sum_over_time(
    {service="api-gateway"}
    | logfmt
    | unwrap response_bytes
    | __error__=""
    [5m]
  )
)

# Bytes processed per second per handler (rate over unwrapped values, grouped with an outer sum)
sum by (handler) (
  rate(
    {service="api-gateway"}
    | json
    | unwrap bytes_processed
    | __error__=""
    [5m]
  )
)

Performance Warning: Metric queries over large time ranges are expensive. Always include tight label selectors and line filters before parsers and unwrap to reduce data scanned. Use recording rules for dashboard queries that aggregate across many streams.
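
Inside each window, `quantile_over_time` works on the unwrapped numeric samples. The Python sketch below assumes Prometheus-style linear interpolation between the two nearest ranks (rank = q × (n − 1)) and treats values that fail float conversion the way `| __error__=""` does, by dropping them. `unwrap` and `quantile_over_time` here are illustrative functions, not Loki APIs.

```python
import math


def unwrap(values: list[str]) -> list[float]:
    """Convert extracted label values to floats; failures are dropped, as with | __error__=""."""
    samples: list[float] = []
    for raw in values:
        try:
            samples.append(float(raw))
        except ValueError:
            continue  # in Loki this sample gets an __error__ label, which __error__="" filters out
    return samples


def quantile_over_time(q: float, samples: list[float]) -> float:
    """q-quantile with linear interpolation between neighbouring ranks (assumed Prometheus style)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not samples:
        return math.nan
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


durations = unwrap(["100", "200", "oops", "300", "400", "500"])  # one 5m window of duration_ms values
print(quantile_over_time(0.5, durations))   # 300.0
print(quantile_over_time(0.99, durations))  # approximately 496
```

With only a handful of samples per window, a p99 is mostly interpolation between the two largest values, which is one more reason to keep metric query windows wide enough to hold meaningful sample counts.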

Label Strategy

Labels are the foundation of Loki's indexing model. Each unique combination of label values creates a separate stream. Too many streams (high cardinality) inflate the index, spread ingester memory across many small, poorly compressed chunks, and slow down queries.

Category | Label | Cardinality | Recommendation
--- | --- | --- | ---
Use ✓ | namespace | Low (10-50) | Kubernetes namespace; the primary query dimension
Use ✓ | service | Low-Medium (50-200) | Service or deployment name
Use ✓ | level | Very Low (4-6) | info, warn, error, debug, fatal
Use ✓ | cluster | Low (2-10) | Multi-cluster identification
Use ✓ | env | Very Low (3-4) | dev, staging, production
Avoid ✗ | user_id | Unbounded | Extract at query time with \| json \| user_id="abc123"
Avoid ✗ | request_id | Unbounded | Use a line filter: \|= "req-abc123"
Avoid ✗ | ip_address | High (thousands) | Extract with a parser at query time
Avoid ✗ | trace_id | Unbounded | Use derived fields in Grafana for linking
Avoid ✗ | pod_name | High (dynamic) | Pods are ephemeral; use service + namespace
The 10-Label Rule: Keep each stream to roughly 10 labels or fewer (5-8 static labels is a good target), and keep total unique label combinations (active streams) under 100,000 per tenant. Each added label multiplies the stream count by up to its number of distinct values. Anything you would grep for at query time should be extracted with a parser, not stored as a label.
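
Because a stream is one distinct label set, the worst-case stream count is the product of each label's cardinality, and the real count is the number of distinct label sets actually observed. The Python sketch below uses hypothetical clusters, services, and pod names to show how a single pod_name label multiplies the stream count; `stream_upper_bound` and `count_streams` are illustrative helpers.

```python
from itertools import product
from math import prod


def stream_upper_bound(cardinalities: dict[str, int]) -> int:
    """Worst case: every combination of label values appears."""
    return prod(cardinalities.values())


def count_streams(label_sets: list[dict[str, str]]) -> int:
    """Each distinct label set (key order irrelevant) is one stream."""
    return len({frozenset(ls.items()) for ls in label_sets})


clusters = ["us-east", "eu-west"]
services = ["api-gateway", "payments", "web"]
levels = ["debug", "info", "warn", "error"]

base = [
    {"cluster": c, "service": s, "level": lvl}
    for c, s, lvl in product(clusters, services, levels)
]
# Hypothetical: 10 pods per service, each pod emitting every level
with_pods = [
    {**labels, "pod_name": f"{labels['service']}-{i}"}
    for labels in base
    for i in range(10)
]

print(stream_upper_bound({"cluster": 2, "service": 3, "level": 4}))  # 24
print(count_streams(base))       # 24
print(count_streams(with_pods))  # 240
```

Since pods are replaced on every rollout, the pod_name variant keeps creating new streams over time, while the base label set stays at 24.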

Storage Backends

Loki stores two types of data: chunks (compressed log data) and the index (label-to-chunk mappings). With the TSDB index, both are usually written to the same backend, chosen according to scale and cost requirements.

Backend | Chunks | Index | Best For | Limitations
--- | --- | --- | --- | ---
Filesystem | ✓ | ✓ (TSDB) | Development, single-node testing | No HA, limited scalability, data loss risk
Amazon S3 | ✓ | ✓ (TSDB) | AWS production deployments | Egress costs on cross-AZ queries
Google GCS | ✓ | ✓ (TSDB) | GCP production deployments | Less cost-effective for frequent reads
Azure Blob | ✓ | ✓ (TSDB) | Azure production deployments | Higher latency for small objects
MinIO | ✓ | ✓ (TSDB) | On-premise S3-compatible storage | Self-managed, capacity planning needed

# loki-config.yaml: S3 storage with TSDB index
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    bucketnames: my-loki-bucket
    region: us-east-1
    # ${VAR} references are only expanded when Loki runs with -config.expand-env=true;
    # on AWS, prefer an IAM role over static keys
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: s3

limits_config:
  retention_period: 744h  # example: 31 days
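
With `period: 24h`, the index is split into one table per day. The Python sketch below assumes Loki's periodic-table naming convention of the configured prefix followed by the number of whole periods since the Unix epoch (for example `loki_index_19723` for 2024-01-01); verify the names against the objects in your own bucket. `index_table` and `tables_for_range` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone


def index_table(prefix: str, ts: datetime, period: timedelta = timedelta(hours=24)) -> str:
    """Name of the periodic index table covering `ts` (assumed naming convention)."""
    period_s = int(period.total_seconds())
    return f"{prefix}{int(ts.timestamp()) // period_s}"


def tables_for_range(prefix: str, start: datetime, end: datetime,
                     period: timedelta = timedelta(hours=24)) -> list[str]:
    """Every index table a query over [start, end] would need to consult."""
    period_s = int(period.total_seconds())
    first = int(start.timestamp()) // period_s
    last = int(end.timestamp()) // period_s
    return [f"{prefix}{n}" for n in range(first, last + 1)]


schema_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(index_table("loki_index_", schema_start))  # loki_index_19723
print(tables_for_range("loki_index_", schema_start,
                       datetime(2024, 1, 3, 12, tzinfo=timezone.utc)))
```

Under this model, a query's index work grows with the number of days it spans, which is one reason long-range queries are expensive.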

Deployment Modes

Loki offers three deployment modes, each suited to different scale requirements:

Mode | Components | Scale | Best For
--- | --- | --- | ---
Monolithic | All in a single binary | < 100 GB/day | Development, small teams, single-node
Simple Scalable | Read path + Write path + Backend | 100 GB – 10 TB/day | Most production workloads, Kubernetes
Microservices | Each component independently scaled | > 10 TB/day | Large-scale multi-tenant platforms

Monolithic Mode

# Single binary — all components in one process
loki -config.file=/etc/loki/loki-config.yaml -target=all

Simple Scalable Mode (Recommended)

The Simple Scalable deployment groups components into three targets (write, read, and backend) that can be scaled independently:

# Helm values for simple-scalable deployment (grafana/loki chart)
deploymentMode: SimpleScalable

loki:
  auth_enabled: true  # clients must send an X-Scope-OrgID tenant header
  commonConfig:
    replication_factor: 3
  # loki.storage (object store) and loki.schemaConfig also need to be set; omitted here

write:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }
  persistence:
    size: 50Gi

read:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }

backend:
  replicas: 2
  resources:
    requests: { cpu: "500m", memory: "1Gi" }
    limits: { cpu: "1", memory: "2Gi" }

gateway:
  replicas: 2
  ingress:
    enabled: true
    hosts:
      - host: loki.internal.example.com
        paths:
          - path: /
            pathType: Prefix
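
With replication_factor: 3, each stream is written to three ingesters, and the distributor treats a push as successful once a quorum of them acknowledges it. Assuming the usual majority quorum of floor(RF / 2) + 1, the arithmetic looks like this in Python (the function names are illustrative):

```python
def write_quorum(replication_factor: int) -> int:
    """Majority of replicas that must acknowledge a write (assumed floor(RF/2) + 1)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_ingester_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={write_quorum(rf)}, tolerates {tolerated_ingester_failures(rf)} down")
```

Under this model, 2 replicas add no write tolerance over 1, while 3 replicas keep accepting writes with one ingester down. This is why 3 is the usual production choice.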

Microservices Mode

# Each component runs as a separate deployment
loki -config.file=/etc/loki/loki-config.yaml -target=distributor
loki -config.file=/etc/loki/loki-config.yaml -target=ingester
loki -config.file=/etc/loki/loki-config.yaml -target=querier
loki -config.file=/etc/loki/loki-config.yaml -target=query-frontend
loki -config.file=/etc/loki/loki-config.yaml -target=compactor
loki -config.file=/etc/loki/loki-config.yaml -target=index-gateway
Recommendation: Start with Simple Scalable mode for most production deployments. It provides horizontal scaling with far less operational complexity than full microservices. Only move to microservices when you need fine-grained control over individual component resources (typically at multi-TB/day scale).

Production Checklist

Loki Production Readiness

  1. Use object storage (S3/GCS/Azure Blob) for chunks and TSDB index — never rely on filesystem storage in production
  2. Set replication_factor: 3 for ingesters to survive node failures without data loss
  3. Keep active stream count below 100,000 per tenant; enforce it with the max_global_streams_per_user limit (max_streams_per_user is the per-ingester variant)
  4. Configure retention with compactor — set retention_enabled: true and define retention_period per tenant
  5. Enable query frontend caching (memcached or Redis) to avoid repeated object storage reads
  6. Set per-tenant rate limits: ingestion_rate_mb, ingestion_burst_size_mb, max_query_series
  7. Use structured logging (JSON or logfmt) at the application level to enable efficient parser-based queries
  8. Deploy Promtail/Alloy with pipeline stages that drop debug logs before shipping — reduce ingestion volume at the source
  9. Keep chunk_target_size at 1572864 bytes (1.5 MiB, the Loki default) for a good balance between compression ratio and read performance
  10. Monitor Loki itself with Prometheus — track loki_ingester_chunk_utilization, loki_distributor_bytes_received_total, and query latency histograms
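
Several checklist items map to concrete config keys. The Python sketch below checks an already-parsed Loki config (a plain dict, for example loaded from YAML with a library of your choice) against a few of them. The key paths reflect Loki's config layout as used in this guide (common.replication_factor, compactor.retention_enabled, limits_config.retention_period, limits_config.max_global_streams_per_user, ingester.chunk_target_size), the thresholds are this checklist's recommendations, and `check_loki_config` is a hypothetical helper.

```python
from typing import Any

MiB = 1024 * 1024
DEFAULT_CHUNK_TARGET = int(1.5 * MiB)  # 1572864 bytes


def get_path(cfg: dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Follow a dotted path like 'limits_config.retention_period' through nested dicts."""
    node: Any = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def check_loki_config(cfg: dict[str, Any]) -> list[str]:
    """Return human-readable findings; an empty list means every check passed."""
    findings: list[str] = []
    if get_path(cfg, "common.replication_factor", 1) < 3:
        findings.append("common.replication_factor should be 3")
    if not get_path(cfg, "compactor.retention_enabled", False):
        findings.append("compactor.retention_enabled is not true")
    if not get_path(cfg, "limits_config.retention_period"):
        findings.append("limits_config.retention_period is not set")
    streams = get_path(cfg, "limits_config.max_global_streams_per_user")
    if streams is None or streams > 100_000:
        findings.append("set limits_config.max_global_streams_per_user explicitly, at or below 100,000")
    # A missing chunk_target_size is treated as the default
    if get_path(cfg, "ingester.chunk_target_size", DEFAULT_CHUNK_TARGET) != DEFAULT_CHUNK_TARGET:
        findings.append("ingester.chunk_target_size differs from 1572864 (1.5 MiB)")
    return findings


cfg = {
    "common": {"replication_factor": 3},
    "compactor": {"retention_enabled": True},
    "limits_config": {"retention_period": "744h", "max_global_streams_per_user": 100_000},
}
print(check_loki_config(cfg))  # []
```

A check like this can run in CI against rendered config, but it only covers the static settings. Items such as caching, pipeline stages, and self-monitoring still need a manual review.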