Introducing Grafana Loki
Grafana Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. Unlike traditional log management systems that index the full text of every log line, Loki indexes only a small set of labels (key-value pairs) associated with each log stream. The actual log content is stored compressed in object storage, making Loki significantly cheaper to operate at scale.
Loki was created at Grafana Labs in 2018 to solve a fundamental problem: organizations needed a logging backend that was cost-effective, operationally simple, and deeply integrated with their existing Prometheus and Grafana workflows. The result was a system that treats logs like metrics — using the same label-based model that made Prometheus successful.
Why Loki?
The core philosophy behind Loki can be summarized in three principles:
- Logs should be cheap — By not indexing log content, Loki dramatically reduces storage and compute costs compared to full-text search engines
- Labels are the index — The same labels you use in Prometheus (namespace, pod, container, job) identify log streams, creating a natural correlation between metrics and logs
- Simple operations — Loki uses object storage (S3, GCS, Azure Blob) for chunks, requires no complex cluster management like sharding or rebalancing, and scales horizontally by adding read or write replicas
Label-Based Indexing Explained
In Loki's model, a log stream is a unique combination of labels. For example:
# These are three distinct log streams:
{namespace="production", app="checkout", container="api"}
{namespace="production", app="checkout", container="worker"}
{namespace="staging", app="checkout", container="api"}
Each stream receives a continuous flow of timestamped log entries. Loki's index maps label sets to chunk locations in object storage. When you query, Loki first resolves which streams match your label selectors, then scans only those chunks — rather than searching the entire log corpus.
This approach creates a massive performance advantage for targeted queries. If you have 10,000 log streams but your query matches only 3, Loki reads data from just those 3 streams. A full-text search engine would need to consult its inverted index across all documents regardless.
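The lookup described above can be sketched in a few lines of Python. This is a minimal illustration that assumes equality matchers only; the chunk names and the `build_index` / `resolve_chunks` helpers are hypothetical, and Loki's real index is considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical streams: each label set maps to the chunks that hold its entries.
STREAMS = {
    frozenset({"namespace": "production", "app": "checkout", "container": "api"}.items()): ["chunk-001", "chunk-002"],
    frozenset({"namespace": "production", "app": "checkout", "container": "worker"}.items()): ["chunk-003"],
    frozenset({"namespace": "staging", "app": "checkout", "container": "api"}.items()): ["chunk-004"],
}


def build_index(streams):
    """Map each (label, value) pair to the set of streams carrying it."""
    index = defaultdict(set)
    for labels in streams:
        for pair in labels:
            index[pair].add(labels)
    return index


def resolve_chunks(streams, index, selector):
    """Return the chunks of every stream that matches all equality matchers."""
    matching = None
    for pair in selector.items():
        candidates = index.get(pair, set())
        matching = candidates if matching is None else matching & candidates
    return sorted(chunk for labels in (matching or ()) for chunk in streams[labels])


INDEX = build_index(STREAMS)
print(resolve_chunks(STREAMS, INDEX, {"namespace": "production", "container": "api"}))
```

Only the chunks of the one matching stream are returned; the other two streams are never read.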
Loki vs Elasticsearch
Understanding Loki's trade-offs compared to Elasticsearch (the traditional log search engine) helps you decide when each tool is appropriate:
Loki vs Elasticsearch: Key Differences
| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only (metadata) | Full-text inverted index |
| Storage Cost | Low (compressed chunks in object storage) | High (indexed data on fast disks) |
| Query Speed | Fast for label-scoped queries; slower for grep-all | Fast for arbitrary full-text search |
| Operational Complexity | Low (stateless components + object storage) | High (JVM tuning, shard management, disk I/O) |
| Ingestion Format | Push-based (Promtail, Alloy, OTel Collector) | Push-based (Beats, Logstash, Fluentd) |
| Query Language | LogQL (Prometheus-inspired) | KQL / Lucene / ES\|QL |
| Best For | Cloud-native, Kubernetes, cost-sensitive | Security analytics, full-text search, complex aggregations |
Understanding LogQL
LogQL is Loki's query language, designed to feel familiar to Prometheus users. It combines label-based stream selection with a powerful pipeline of transformations that filter, parse, and reshape log lines. LogQL has two main query types:
- Log queries — Return log lines (the actual text content)
- Metric queries — Return numeric values computed from logs (rates, counts, percentiles)
The LogQL Query Builder in Grafana
Before diving into raw LogQL syntax, it's worth noting that Grafana provides an excellent visual query builder for Loki. In the Explore view, select your Loki data source and switch to "Builder" mode. The builder provides:
- Label browser — Dropdown menus showing all available label names and their values
- Pipeline stages — Visual drag-and-drop for adding line filters, parsers, and label filters
- Operation selector — Choose between log queries and metric queries with guided parameter input
- Query preview — Real-time rendering of the raw LogQL expression as you build visually
The builder is excellent for learning and exploration. However, for complex queries and dashboards, writing raw LogQL gives you full control. The rest of this article focuses on the raw syntax.
LogQL Feature Overview
A LogQL query is structured as a pipeline, flowing from left to right:
flowchart LR
    A["Stream Selector<br/>{labels}"] --> B["Line Filters<br/>|= |~ !~ !="]
    B --> C["Parser<br/>| json | logfmt<br/>| pattern | regexp"]
    C --> D["Label Filter<br/>| status >= 400"]
    D --> E["Line Format<br/>| line_format"]
    E --> F["Unwrap<br/>| unwrap duration"]
    style A fill:#3B9797,color:#fff
    style B fill:#16476A,color:#fff
    style C fill:#132440,color:#fff
    style D fill:#BF092F,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#3B9797,color:#fff
Each stage is optional — you can have as few or as many pipeline stages as needed. The only required element is the stream selector.
Log Stream Selectors
Every LogQL query begins with a stream selector enclosed in curly braces. This selects which log streams to read from, using label matchers:
# Exact match — select logs from the "checkout" app
{app="checkout"}
# Multiple labels — AND logic (all must match)
{namespace="production", app="checkout", container="api"}
# Not equal — exclude a specific value
{namespace!="kube-system"}
# Regex match — labels matching a pattern
{app=~"checkout|payment|inventory"}
# Regex not match — exclude labels matching a pattern
{namespace=~".+", app!~"debug-.*"}
The four matcher operators are:
- = for exact equality
- != for not equal
- =~ for regex match (RE2 syntax)
- !~ for regex not match
Every stream selector needs at least one = or =~ matcher that cannot match the empty string. You cannot query {} (all streams) or use only exclusions like {app!="debug"}. Loki needs a positive selector to identify which streams to read.
The Log Pipeline
After the stream selector, you can chain multiple pipeline stages using the pipe operator |. Each stage processes log lines sequentially, transforming or filtering them before passing to the next stage.
Line Filters
Line filters are the simplest and fastest pipeline stage — they match against the text content of each log line:
# Contains — keep lines containing "error"
{app="checkout"} |= "error"
# Does not contain: drop lines containing "healthcheck"
{app="checkout"} != "healthcheck"
# Regex match — lines matching a pattern
{app="checkout"} |~ "status=(4|5)\\d{2}"
# Regex not match — drop lines matching pattern
{app="checkout"} !~ "DEBUG|TRACE"
# Case-insensitive match: use the (?i) flag in a regex line filter
{app="checkout"} |~ "(?i)error"
# Multiple filters — AND logic (all must pass)
{app="checkout"} |= "error" != "timeout" |~ "user_id=\\d+"
Line filters are evaluated before any parsing happens, making them extremely fast. Always place the most selective filter first to reduce the volume of data flowing through subsequent stages.
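As a rough mental model, a chain of line filters behaves like a series of substring and regex checks applied to the raw text. The Python sketch below assumes that model; `apply_line_filters` and the sample lines are hypothetical, and Python's `re` module stands in for the RE2 syntax Loki uses.

```python
import re

# Each LogQL line filter operator expressed as a predicate on the raw line.
OPERATORS = {
    "|=": lambda line, arg: arg in line,
    "!=": lambda line, arg: arg not in line,
    "|~": lambda line, arg: re.search(arg, line) is not None,
    "!~": lambda line, arg: re.search(arg, line) is None,
}


def apply_line_filters(lines, filters):
    """Keep only the lines that pass every (operator, argument) filter."""
    return [
        line
        for line in lines
        if all(OPERATORS[op](line, arg) for op, arg in filters)
    ]


LINES = [
    'level=error msg="payment failed" user_id=42',
    'level=error msg="upstream timeout" user_id=7',
    'level=info msg="healthcheck ok"',
    'level=error msg="card declined"',
]

# Equivalent in spirit to: |= "error" != "timeout" |~ "user_id=\\d+"
FILTERS = [("|=", "error"), ("!=", "timeout"), ("|~", r"user_id=\d+")]
print(apply_line_filters(LINES, FILTERS))
```

Because `all()` stops at the first failing check, putting the most selective filter first means later checks run on fewer lines.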
Parsers
Parsers extract structured fields from log lines and create new labels from them. This section covers Loki's four main built-in parsers:
JSON Parser
# Parse entire JSON log line — all top-level keys become labels
{app="checkout"} | json
# Parse specific fields only
{app="checkout"} | json level, method, status, duration
# Access nested fields with parameter expressions
{app="checkout"} | json request_id="request.id", user="context.user_id"
Given a log line like:
{"level":"error","method":"POST","path":"/api/checkout","status":500,"duration":1.23,"request":{"id":"abc-123"},"context":{"user_id":"user-456"}}
After | json, you can filter on extracted labels: | status >= 400
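To see what the extracted labels look like, the following Python sketch flattens the sample line into label/value pairs. It assumes nested keys are joined with an underscore and arrays are skipped; `json_to_labels` is a hypothetical helper, not Loki's implementation.

```python
import json


def json_to_labels(line):
    """Flatten a JSON log line into a dict of label name -> string value."""
    labels = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}_{key}" if prefix else key, child)
        elif isinstance(value, list):
            return  # arrays are skipped in this sketch
        else:
            labels[prefix] = str(value)

    walk("", json.loads(line))
    return labels


LINE = (
    '{"level":"error","method":"POST","path":"/api/checkout","status":500,'
    '"duration":1.23,"request":{"id":"abc-123"},"context":{"user_id":"user-456"}}'
)
LABELS = json_to_labels(LINE)

# A label filter such as "| status >= 400" then compares the extracted value.
print(LABELS["request_id"], int(LABELS["status"]) >= 400)
```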
Logfmt Parser
# Parse logfmt-formatted lines
{app="ingester"} | logfmt
# Parse specific keys
{app="ingester"} | logfmt level, caller, msg, duration
Logfmt lines look like: level=info caller=ingester.go:123 msg="chunk flushed" duration=2.5s bytes=1048576
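A small Python sketch shows how such a line breaks down into key/value pairs. It leans on `shlex` for quote handling, which is an approximation of logfmt rather than a full implementation, and `parse_logfmt` is a hypothetical helper.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict, honouring double-quoted values."""
    labels = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:  # ignore bare words that carry no "="
            labels[key] = value
    return labels


LINE = 'level=info caller=ingester.go:123 msg="chunk flushed" duration=2.5s bytes=1048576'
FIELDS = parse_logfmt(LINE)
print(FIELDS["msg"], FIELDS["duration"], FIELDS["bytes"])
```

Note that every extracted value is still a string at this point; numeric or duration comparisons happen later, in label filters or unwrap.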
Pattern Parser
# Extract fields from structured text using a pattern template
# Underscores (_) are unnamed captures (discarded), named fields become labels
{app="nginx"} | pattern "<ip> - - [<_>] \"<method> <path> <_>\" <status> <bytes>"
# Common log format extraction
{job="apache"} | pattern "<ip> <_> <user> [<timestamp>] \"<method> <uri> <_>\" <status> <size>"
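The template semantics can be illustrated with a short Python sketch. For brevity it translates the template into a regular expression, which is not how Loki implements the pattern parser; `compile_pattern`, `extract`, and the sample access-log line are hypothetical.

```python
import re


def compile_pattern(template):
    """Translate a <name>-style template into a compiled regular expression."""
    parts = []
    # re.split with a capture group alternates literal text and capture names.
    for position, token in enumerate(re.split(r"<(\w+)>", template)):
        if position % 2 == 0:
            parts.append(re.escape(token))      # literal text between captures
        elif token == "_":
            parts.append(r".+?")                # unnamed capture: matched, then discarded
        else:
            parts.append(rf"(?P<{token}>.+?)")  # named capture becomes a label
    return re.compile("".join(parts))


def extract(template, line):
    """Return the captured labels, or an empty dict when the line does not fit."""
    match = compile_pattern(template).fullmatch(line)
    return match.groupdict() if match else {}


TEMPLATE = '<ip> - - [<_>] "<method> <path> <_>" <status> <bytes>'
LINE = '203.0.113.7 - - [15/Jun/2026:10:30:00 +0000] "GET /api/cart HTTP/1.1" 200 512'
print(extract(TEMPLATE, LINE))
```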
Regexp Parser
# Use named capture groups to extract labels
{app="legacy"} | regexp "(?P<timestamp>\\d{4}-\\d{2}-\\d{2}T[\\d:.]+Z) \\[(?P<level>\\w+)\\] (?P<msg>.+)"
# Extract specific values
{app="gateway"} | regexp "request_id=(?P<request_id>[a-f0-9-]+)"
Use json for JSON logs, logfmt for key=value logs, pattern for fixed-format logs with variable fields, and regexp as a last resort for irregular formats. Pattern is faster than regexp because it doesn't use regex internally.
Label Filters
After parsing, you can filter on the extracted labels using comparison operators:
# Numeric comparison
{app="checkout"} | json | status >= 400
# String equality
{app="checkout"} | json | level = "error"
# Multiple conditions (AND)
{app="checkout"} | json | status >= 500 | method = "POST"
# Duration comparison (auto-parses units like ms, s, m, h)
{app="checkout"} | json | duration > 2s
# Byte size comparison
{app="ingester"} | logfmt | bytes > 1MB
# Regex filter on extracted label
{app="checkout"} | json | path =~ "/api/(checkout|payment)/.*"
# Combining label filters with OR using parentheses
{app="checkout"} | json | (status >= 500 or duration > 5s)
Line Format Expression
The line_format expression rewrites the entire log line using Go's text/template syntax. This is useful for creating clean, readable output from structured logs:
# Reformat log output for readability
{app="checkout"} | json
| line_format "{{.level | ToUpper}} [{{.method}}] {{.path}} → {{.status}} ({{.duration}})"
# Include conditional formatting
{app="checkout"} | json
| line_format "{{if eq .level \"error\"}}🔴{{else}}🟢{{end}} {{.msg}}"
# Align columns for terminal viewing
{app="checkout"} | json
| line_format "{{printf \"%-5s\" .level}} {{printf \"%-6s\" .method}} {{printf \"%-30s\" .path}} {{.status}}"
Unwrap Expression
The unwrap expression converts an extracted label into a numeric sample value, bridging the gap between log queries and metric queries. This is essential for creating metrics from log data:
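# unwrap only works inside a range aggregation; it cannot end a plain log query
# Unwrap a numeric duration field (values are parsed as floats)
avg_over_time({app="checkout"} | json | unwrap duration [5m])
# Duration strings such as "2.5s" need the duration() conversion (result in seconds)
avg_over_time({app="checkout"} | logfmt | unwrap duration(duration) [5m])
# Unwrap a numeric byte count for sum calculations
sum_over_time({app="ingester"} | logfmt | unwrap bytes [5m])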
Unwrapped values feed into unwrap range aggregations like quantile_over_time, avg_over_time, and max_over_time, which we'll cover next.
LogQL Metric Queries
Metric queries transform logs into numeric time series — enabling you to build dashboards, set alerts, and compute statistics from log data. They wrap a log query in an aggregation function that operates over a time range.
Log Range Aggregations
These functions operate on the count or size of log lines within a time window:
# rate() — log lines per second over the range
# "How many errors per second is checkout generating?"
rate({app="checkout"} |= "error" [5m])
# count_over_time() — total log lines in the range
# "How many 500 errors in the last hour?"
count_over_time({app="checkout"} | json | status = 500 [1h])
# bytes_over_time() — total bytes of log lines in the range
# "How much log data is the ingester producing?"
bytes_over_time({app="ingester"} [5m])
# bytes_rate() — bytes per second
bytes_rate({app="ingester"} [5m])
# absent_over_time() — returns 1 if no logs exist in the range (deadman alert)
# "Alert if checkout stops logging entirely"
absent_over_time({app="checkout"} [15m])
The [5m] at the end is the range selector; it defines the time window for the aggregation. Choose a range that balances smoothness vs. responsiveness: [1m] is noisy but responsive, [15m] is smooth but lags behind changes.
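The arithmetic behind `count_over_time` and `rate` is simple enough to sketch in Python. The example assumes a window that excludes its left edge and includes its right edge (the exact boundary handling in Loki is an assumption here), and the timestamps are made up.

```python
def count_over_time(timestamps, now, range_seconds):
    """Count entries whose timestamp falls in the window (now - range, now]."""
    return sum(1 for ts in timestamps if now - range_seconds < ts <= now)


def rate(timestamps, now, range_seconds):
    """Entries per second: the window count divided by the window length."""
    return count_over_time(timestamps, now, range_seconds) / range_seconds


# Hypothetical error-line timestamps, in seconds since an arbitrary origin.
ERROR_TIMESTAMPS = [30, 100, 250, 299, 300]

print(count_over_time(ERROR_TIMESTAMPS, now=300, range_seconds=300))
print(rate(ERROR_TIMESTAMPS, now=300, range_seconds=60))
```

The same five entries give a very different rate depending on the range: a short window reacts to the recent burst, while a long one averages it away.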
Unwrap Range Aggregations
When you've used unwrap to extract a numeric value from logs, these functions compute statistics over the unwrapped values:
# quantile_over_time() — percentiles of a numeric field
# "What's the p99 response time from checkout logs?"
quantile_over_time(0.99,
{app="checkout"} | json | unwrap duration [5m]
)
# avg_over_time() — average of numeric values
avg_over_time(
{app="checkout"} | json | unwrap duration [5m]
)
# max_over_time() / min_over_time()
max_over_time(
{app="checkout"} | json | status >= 200 | unwrap duration [5m]
)
# sum_over_time() — total of numeric values
# "Total bytes processed by ingester in 5-minute windows"
sum_over_time(
{app="ingester"} | logfmt | unwrap bytes [5m]
)
# stddev_over_time() / stdvar_over_time() — variability
stddev_over_time(
{app="checkout"} | json | unwrap duration [5m]
)
# first_over_time() / last_over_time() — boundary values
last_over_time(
{app="checkout"} | json | unwrap status [5m]
)
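To make the unwrap aggregations less abstract, here is a Python sketch of a quantile over a time window. It assumes linear interpolation between neighbouring ranks, in the style of Prometheus; whether Loki computes it exactly this way is an assumption, and the sample data is hypothetical.

```python
import math


def quantile(q, values):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def quantile_over_time(q, samples, now, range_seconds):
    """Apply quantile() to the (timestamp, value) samples inside the window."""
    window = [value for ts, value in samples if now - range_seconds < ts <= now]
    return quantile(q, window)


# Hypothetical unwrapped durations in seconds, keyed by timestamp.
SAMPLES = [(10, 0.1), (70, 0.2), (130, 0.3), (190, 0.4), (250, 2.0)]
print(quantile_over_time(0.5, SAMPLES, now=300, range_seconds=300))
print(quantile_over_time(0.99, SAMPLES, now=300, range_seconds=300))
```

A single slow request barely moves the median but dominates the p99, which is why latency alerts usually target high quantiles.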
Aggregation Operators
Aggregation operators combine multiple time series produced by metric queries. They work identically to Prometheus aggregation operators:
# sum — total error rate across all containers
sum(rate({namespace="production"} |= "error" [5m]))
# sum by — error rate grouped by app
sum by (app) (rate({namespace="production"} |= "error" [5m]))
# topk — top 5 apps by error rate
topk(5, sum by (app) (rate({namespace="production"} |= "error" [5m])))
# avg — average request duration across pods
avg(
avg_over_time({app="checkout"} | json | unwrap duration [5m])
)
# max by — highest p99 latency per service
max by (app) (
quantile_over_time(0.99, {namespace="production"} | json | unwrap duration [5m])
)
# count — number of services with errors
count(sum by (app) (rate({namespace="production"} |= "error" [5m])) > 0)
Other available aggregation operators include: min, stddev, stdvar, bottomk, and sort / sort_desc.
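Grouping with `sum by` and ranking with `topk` can be pictured with the Python sketch below. The series values and the `sum_by` / `topk` helpers are hypothetical stand-ins for the per-stream output of a `rate(...)` query.

```python
from collections import defaultdict


def sum_by(series, keys):
    """Sum series values after reducing each label set to the given keys."""
    totals = defaultdict(float)
    for labels, value in series:
        group = tuple((key, labels.get(key, "")) for key in keys)
        totals[group] += value
    return dict(totals)


def topk(k, grouped):
    """Return the k groups with the largest values, highest first."""
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-stream error rates (errors per second).
SERIES = [
    ({"app": "checkout", "container": "api"}, 2.0),
    ({"app": "checkout", "container": "worker"}, 0.5),
    ({"app": "payment", "container": "api"}, 1.0),
    ({"app": "inventory", "container": "api"}, 0.25),
]

BY_APP = sum_by(SERIES, ["app"])
print(topk(2, BY_APP))
```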
Practical Metric Query Examples
Here are real-world LogQL metric queries you'd use in production dashboards:
# Error rate percentage — what % of requests are errors?
sum(rate({app="checkout"} | json | status >= 500 [5m]))
/
sum(rate({app="checkout"} | json | status > 0 [5m]))
* 100
# Request throughput by HTTP method
sum by (method) (rate({app="checkout"} | json [5m]))
# Slow request detection — requests over 2 seconds
sum by (path) (
count_over_time({app="checkout"} | json | duration > 2s [5m])
)
# Log volume by severity level
sum by (level) (bytes_rate({namespace="production"} | json [5m]))
# Unique user count approximation (using distinct pattern counts)
count(
sum by (user_id) (
count_over_time({app="checkout"} | json | user_id != "" [1h])
)
)
Building a Log-Based Error Budget
Using the OpenTelemetry Demo from Part 3, create a dashboard panel that shows the error budget consumption for the frontend service over the past 24 hours:
# SLO: 99.5% of requests should be non-5xx
# Error budget = 0.5% of total requests
# Current error ratio (should be < 0.005 for healthy SLO)
sum(rate({app="frontend"} | json | status >= 500 [24h]))
/
sum(rate({app="frontend"} | json | status > 0 [24h]))
If this value exceeds 0.005, you've consumed your entire error budget for that window.
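The budget arithmetic can be checked with a few lines of Python. The request counts are invented for illustration and `error_budget_consumed` is a hypothetical helper.

```python
def error_budget_consumed(error_count, total_count, allowed_error_ratio):
    """Return the share of the error budget used (1.0 means fully spent)."""
    if total_count == 0:
        return 0.0  # no traffic, so nothing has been spent
    return (error_count / total_count) / allowed_error_ratio


# Hypothetical 24-hour counts for the frontend service, with a 99.5% SLO.
consumed = error_budget_consumed(error_count=250, total_count=100_000, allowed_error_ratio=0.005)
print(f"{consumed:.0%} of the error budget used")
```

With 250 errors in 100,000 requests the error ratio is 0.0025, which is half of the 0.005 allowed by the SLO.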
Exploring Loki's Architecture
Understanding Loki's internal architecture helps you optimize queries, plan capacity, and troubleshoot performance issues. Loki is composed of several microservices that can run together (monolithic mode) or independently (microservices mode).
flowchart TB
    subgraph Clients["Log Shippers"]
        Alloy[Grafana Alloy]
        Prom[Promtail]
        OTel[OTel Collector]
    end
    subgraph WritePath["Write Path"]
        Dist["Distributor<br/>validates, rate-limits,<br/>hashes to ingesters"]
        Ing["Ingester<br/>batches chunks in memory,<br/>flushes to storage"]
    end
    subgraph ReadPath["Read Path"]
        QF["Query Frontend<br/>splits, caches,<br/>retries queries"]
        Sched["Query Scheduler<br/>fair queuing across<br/>tenants"]
        Quer["Querier<br/>executes query against<br/>ingesters + storage"]
    end
    subgraph Storage["Storage Layer"]
        Obj[("Object Storage<br/>S3 / GCS / Azure<br/>Chunks")]
        Idx[("Index<br/>BoltDB / TSDB<br/>Label → Chunk mapping")]
    end
    subgraph Maintenance["Background"]
        Comp["Compactor<br/>merges index files,<br/>applies retention"]
        IG["Index Gateway<br/>serves index queries<br/>from cache"]
    end
Alloy --> Dist
Prom --> Dist
OTel --> Dist
Dist --> Ing
Ing --> Obj
Ing --> Idx
QF --> Sched
Sched --> Quer
Quer --> Ing
Quer --> IG
IG --> Idx
Quer --> Obj
Comp --> Idx
Comp --> Obj
The Write Path
When log data arrives at Loki, it flows through these components:
Distributor
The distributor is the first component to receive log pushes. It performs several critical functions:
- Validation — Checks that timestamps are within acceptable bounds, labels are valid, and line sizes don't exceed limits
- Rate limiting — Enforces per-tenant ingestion rate limits to prevent noisy neighbors
- Consistent hashing — Uses a hash ring to determine which ingester(s) should receive each stream, ensuring all entries for a given stream go to the same ingester
- Replication — Writes to multiple ingesters (typically RF=3) for durability
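The hashing and replication steps can be sketched with a small hash ring in Python. This is a simplified model under stated assumptions: the `Ring` class, the token count, the SHA-256 hash, and the stream key format are all hypothetical and do not mirror Loki's ring implementation.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, ingesters, tokens_per_ingester=64):
        # Each ingester owns several tokens so load spreads evenly.
        self._tokens = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._positions = [position for position, _ in self._tokens]

    def owners(self, stream_key, replication_factor=3):
        """Walk clockwise from the stream's hash, collecting distinct ingesters."""
        start = bisect.bisect_right(self._positions, _hash(stream_key))
        owners = []
        for offset in range(len(self._tokens)):
            name = self._tokens[(start + offset) % len(self._tokens)][1]
            if name not in owners:
                owners.append(name)
            if len(owners) == replication_factor:
                break
        return owners


RING = Ring([f"ingester-{n}" for n in range(5)])
STREAM_KEY = 'tenant-a/{app="checkout", container="api", namespace="production"}'
print(RING.owners(STREAM_KEY))
```

Because the result depends only on the stream key and the ring contents, every distributor sends a given stream to the same set of ingesters.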
Ingester
Ingesters are stateful components that build compressed chunks in memory:
- Chunk building — Accumulates log entries into chunks (typically ~1.5 MB compressed). Each chunk represents a time window of a single stream
- Write-Ahead Log (WAL) — Persists incoming data to disk immediately for crash recovery
- Flushing — When a chunk reaches its target size or age, the ingester flushes it to object storage and updates the index
- Live tail — Serves real-time log queries against in-memory (not-yet-flushed) data
The Read Path
Query Frontend
The query frontend is an optional (but recommended) component that sits in front of queriers:
- Query splitting — Large time-range queries are split into smaller sub-queries executed in parallel
- Result caching — Caches query results to accelerate repeated dashboard loads
- Query retries — Automatically retries failed sub-queries
- Query limits — Enforces maximum query length and complexity per tenant
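Query splitting is easy to picture with a Python sketch. It assumes sub-queries are aligned to multiples of the split interval; `split_by_interval` is a hypothetical helper, and the one-hour interval is just an example value.

```python
def split_by_interval(start, end, interval):
    """Split [start, end) into sub-ranges aligned to multiples of `interval`."""
    ranges = []
    cursor = start
    while cursor < end:
        boundary = (cursor // interval + 1) * interval  # next aligned boundary
        upper = min(boundary, end)
        ranges.append((cursor, upper))
        cursor = upper
    return ranges


# A query from 00:30 to 03:00 (in seconds) split into one-hour pieces.
print(split_by_interval(start=1800, end=10800, interval=3600))
```

Aligned boundaries matter for caching: two dashboards that query overlapping ranges produce identical sub-queries, so their results can be reused.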
Query Scheduler
In microservices mode, the scheduler provides fair queuing across tenants, preventing any single tenant's expensive queries from starving others.
Querier
The querier executes the actual LogQL queries by reading from both sources:
- Ingesters — For recent, in-memory data
- Object storage — For historical, flushed chunks
The querier merges results from both sources, deduplicates entries (since replication means data exists on multiple ingesters), and applies the full LogQL pipeline.
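The merge-and-deduplicate step can be sketched in Python as follows. It assumes each source returns entries already sorted by timestamp and that duplicates are exact (timestamp, line) matches; `merge_and_dedupe` and the sample entries are hypothetical.

```python
import heapq


def merge_and_dedupe(*sources):
    """Merge time-sorted (timestamp, line) lists and drop exact duplicates."""
    merged = []
    previous = None
    for entry in heapq.merge(*sources):
        if entry != previous:  # duplicates are adjacent after a sorted merge
            merged.append(entry)
        previous = entry
    return merged


# The same recent entries held by two ingester replicas, plus older flushed data.
REPLICA_A = [(3, "payment failed"), (4, "retrying")]
REPLICA_B = [(3, "payment failed"), (4, "retrying")]
STORAGE = [(1, "request received"), (2, "cart loaded")]

print(merge_and_dedupe(REPLICA_A, REPLICA_B, STORAGE))
```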
Storage & Compaction
Chunk Storage
Log content is stored as compressed chunks in object storage. Each chunk is a compressed block of log entries for a single stream within a time window. Loki supports S3, GCS, Azure Blob Storage, and local filesystem (for testing).
Index
The index maps label sets to chunk locations. Loki has evolved through several index formats:
- BoltDB Shipper (legacy) — BoltDB files uploaded to object storage periodically
- TSDB (current, recommended) — Prometheus TSDB-based index, more efficient and simpler to operate
Compactor
The compactor runs as a background process that:
- Merges index files — Combines many small index files into larger, more efficient ones
- Applies retention — Deletes chunks and index entries older than the configured retention period
- Manages delete requests — Processes any explicit log deletion requests (compliance use cases)
Deployment Modes
Loki offers three deployment modes to match different scale requirements:
Loki Deployment Modes
| Mode | Scale | Use Case |
|---|---|---|
| Monolithic | < 100 GB/day | Development, small teams, single-node deployment |
| Simple Scalable (SSD) | 100 GB – 1 TB/day | Mid-size production, read/write path separation |
| Microservices | > 1 TB/day | Large-scale production, independent scaling of each component |
The Simple Scalable Deployment (SSD) mode is recommended for most production environments. It splits Loki into a write path and read path, each independently scalable, while keeping operational complexity low.
Tips, Tricks & Best Practices
Operating Loki effectively requires understanding its constraints and working with — not against — its design philosophy. These practices come from real-world production deployments.
Label Cardinality
Label cardinality is the single most important factor affecting Loki's performance and cost. Each unique combination of labels creates a new stream, and each stream has overhead in the index and ingester memory.
Good labels (low cardinality, stable):
# These create a bounded, predictable number of streams
labels:
  namespace: production     # 3-5 values
  app: checkout             # 20-50 values
  container: api            # 2-3 per app
  environment: prod         # 3 values (dev/staging/prod)
  region: us-east-1         # 5-10 values
  level: error              # 5 values (debug/info/warn/error/fatal)
Bad labels (high cardinality, dangerous):
# DO NOT DO THIS — creates millions of streams
labels:
  user_id: "user-123456"            # millions of values
  request_id: "abc-def-ghi"         # unique per request
  ip_address: "192.168.1.42"        # thousands of values
  timestamp: "2026-06-15T10:30"     # infinite values
  url_path: "/users/123/orders"     # unbounded (includes IDs)
A healthy Loki deployment typically has fewer than 100,000 active streams. Each tenant should aim for the lowest number of streams that still allows efficient querying.
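A quick way to sanity-check a label scheme is to multiply the number of values each label can take, which gives the worst-case stream count. The Python sketch below does that with illustrative value counts; `max_streams` is a hypothetical helper, and real deployments usually sit well below the upper bound because not every combination occurs.

```python
import math


def max_streams(label_value_counts):
    """Upper bound on streams: the product of distinct values per label."""
    return math.prod(label_value_counts.values())


GOOD = {"namespace": 5, "app": 50, "container": 3, "level": 5}
BAD = {**GOOD, "user_id": 1_000_000}  # one high-cardinality label added

print(max_streams(GOOD))
print(max_streams(BAD))
```

A single unbounded label multiplies the bound by its own cardinality, which is why one careless label can overwhelm an otherwise healthy scheme.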
Structured Metadata
Structured metadata, experimental in Loki 2.9 and enabled by default since Loki 3.0, is a middle ground between labels and log content. Structured metadata fields are stored alongside log entries but do NOT create new streams:
# Structured metadata in a Loki push request
# (the HTTP push API takes JSON; the same structure is laid out YAML-style here)
streams:
  - stream:
      app: checkout
      namespace: production
    values:
      - ["1718447400000000000", "POST /api/checkout 200 1.2s",
         {"trace_id": "abc123", "user_id": "user-456", "pod": "checkout-7b4f8c-xk2p9"}]
You can filter on structured metadata in LogQL using the same syntax as extracted labels:
# Filter by structured metadata field
{app="checkout"} | trace_id = "abc123"
# Combine with other pipeline stages
{app="checkout"} | trace_id = "abc123" | json | status >= 400
This is the recommended approach for fields that are high-cardinality but frequently queried (trace IDs, pod names, request IDs).
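For reference, the JSON body for such a push request can be assembled as in the Python sketch below. It only builds and prints the payload; `build_push_payload` is a hypothetical helper, and sending the result to a Loki push endpoint (typically `/loki/api/v1/push`) with the appropriate headers is left out.

```python
import json


def build_push_payload(labels, entries):
    """Build a push body; each entry is (timestamp_ns, line, metadata_dict)."""
    values = []
    for timestamp_ns, line, metadata in entries:
        value = [str(timestamp_ns), line]  # timestamps travel as strings
        if metadata:
            value.append(metadata)         # optional third element
        values.append(value)
    return {"streams": [{"stream": labels, "values": values}]}


PAYLOAD = build_push_payload(
    {"app": "checkout", "namespace": "production"},
    [
        (
            1718447400000000000,
            "POST /api/checkout 200 1.2s",
            {"trace_id": "abc123", "user_id": "user-456", "pod": "checkout-7b4f8c-xk2p9"},
        )
    ],
)
print(json.dumps(PAYLOAD, indent=2))
```

The stream keeps only two labels, while the high-cardinality fields ride along with the individual entry.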
Retention & Limits
Configure retention and limits to control costs and prevent abuse:
# loki-config.yaml — key retention and limits settings
limits_config:
  # Ingestion limits
  ingestion_rate_mb: 10                 # MB/s per tenant
  ingestion_burst_size_mb: 20           # burst allowance
  max_streams_per_user: 50000           # stream limit per tenant
  max_line_size: 256KB                  # reject lines larger than this
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
  # Query limits
  max_query_length: 721h                # max query time range (30 days + 1 hour)
  max_query_series: 5000                # max streams per query
  max_entries_limit_per_query: 10000    # max log lines returned
  # Retention
  retention_period: 744h                # 31 days global retention
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: s3
Query Optimization
Write efficient LogQL queries by following these principles:
- Narrow stream selection first: Use specific labels to minimize the number of streams scanned
- Line filter before parser: Filter on raw text (fast) before invoking a parser (slower)
- Shortest time range possible: Query [1h] instead of [7d] when possible
- Use specific parsers: | json level, status is faster than | json (extracts only needed fields)
- Avoid regex when substring works: |= "error" is much faster than |~ ".*error.*"
# ❌ SLOW — broad selector, regex line filter, full JSON parse
{namespace=~".+"} |~ ".*error.*" | json
# ✅ FAST — specific selector, substring filter, targeted parse
{namespace="production", app="checkout"} |= "error" | json level, status
# ❌ SLOW — 7-day range, aggregating all streams
sum(rate({namespace="production"} |= "error" [7d]))
# ✅ FAST — 5-minute range, pre-filtered to specific app
sum by (app) (rate({namespace="production", app="checkout"} |= "error" [5m]))
Alerting on Logs
Loki integrates with Grafana Alerting to fire alerts based on log patterns. Log-based alerts use metric queries as their condition:
# Alert rule: fire when error rate exceeds threshold
# (shown in Prometheus-style rule syntax, as used by the Loki ruler; the same
# expression can be configured in Grafana Alerting via the UI or provisioning)
alert: HighErrorRate
expr: |
  sum(rate({app="checkout", namespace="production"} |= "error" [5m])) > 5
for: 2m
labels:
  severity: critical
  team: checkout
annotations:
  summary: "Checkout error rate is {{ $value }} errors/sec"
  description: "The checkout service in production is generating more than 5 errors/sec for 2+ minutes."
  runbook: "https://wiki.internal/runbooks/checkout-errors"
# Deadman alert: fire when a critical service stops logging
alert: CheckoutSilent
expr: |
  absent_over_time({app="checkout", namespace="production"} [15m])
for: 0m
labels:
  severity: critical
annotations:
  summary: "No logs from checkout for 15 minutes"
  description: "The checkout service has not produced any log output. The service may be down or disconnected."
Common Log-Based Alert Patterns
- Error rate threshold: rate({app="X"} |= "error" [5m]) > N
- Deadman / silence: absent_over_time({app="X"} [15m])
- Latency spike: quantile_over_time(0.99, {app="X"} | json | unwrap duration [5m]) > 5
- Log volume anomaly: bytes_rate({app="X"} [5m]) > 10000000 (more than about 10 MB/s, a sudden log flood; the threshold must be a plain number of bytes)
- Specific error pattern: count_over_time({app="X"} |= "OOMKilled" [5m]) > 0
- Error budget burn: Ratio of error lines to total lines exceeding the SLO threshold
Summary & Next Steps
In this article, we've covered the full spectrum of working with Grafana Loki:
- Loki's philosophy — Label-based indexing, cost-efficiency, and Prometheus-inspired design
- LogQL fundamentals — Stream selectors, line filters, parsers (json, logfmt, pattern, regexp), label filters, line_format, and unwrap
- Metric queries — rate, count_over_time, bytes_over_time, quantile_over_time, aggregation operators, and practical dashboard patterns
- Architecture — Distributors, ingesters, queriers, query frontend, compactor, and the read/write path separation
- Best practices — Label cardinality management, structured metadata, retention configuration, query optimization, and log-based alerting
With LogQL mastery, you can build powerful log-based dashboards, create targeted alerts, and correlate log events with metrics and traces. The key takeaway is to think of Loki as a targeted search tool rather than a full-text search engine — start with labels, filter with pipelines, and extract only what you need.
Next in the Grafana Track
In Part 5: Monitoring with Metrics Using Grafana Mimir & Prometheus, we'll explore the metrics pillar — PromQL fundamentals, Mimir's architecture, building production dashboards, recording rules, and long-term metrics storage strategies.