Grafana Deep Dive Part 4: Looking at Logs with Grafana Loki

Introducing Grafana Loki

Grafana Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. Unlike traditional log management systems that index the full text of every log line, Loki indexes only a small set of labels (key-value pairs) associated with each log stream. The actual log content is stored compressed in object storage, making Loki significantly cheaper to operate at scale.

Loki was created at Grafana Labs in 2018 to solve a fundamental problem: organizations needed a logging backend that was cost-effective, operationally simple, and deeply integrated with their existing Prometheus and Grafana workflows. The result was a system that treats logs like metrics — using the same label-based model that made Prometheus successful.

Why Loki?

The core philosophy behind Loki can be summarized in three principles:

Logs should be cheap — By not indexing log content, Loki dramatically reduces storage and compute costs compared to full-text search engines
Labels are the index — The same labels you use in Prometheus (namespace, pod, container, job) identify log streams, creating a natural correlation between metrics and logs
Simple operations — Loki uses object storage (S3, GCS, Azure Blob) for chunks, requires no complex cluster management like sharding or rebalancing, and scales horizontally by adding read or write replicas

                            
Key Insight: Loki's power comes from the realization that most log queries start with a known context — "show me logs from the checkout service in production for the last hour." Labels narrow the search space, then line filters and parsers handle the rest. You rarely need to search all logs for a random string.

Label-Based Indexing Explained

In Loki's model, a log stream is a unique combination of labels. For example:

# These are three distinct log streams:
{namespace="production", app="checkout", container="api"}
{namespace="production", app="checkout", container="worker"}
{namespace="staging", app="checkout", container="api"}

Each stream receives a continuous flow of timestamped log entries. Loki's index maps label sets to chunk locations in object storage. When you query, Loki first resolves which streams match your label selectors, then scans only those chunks — rather than searching the entire log corpus.

This approach creates a massive performance advantage for targeted queries. If you have 10,000 log streams but your query matches only 3, Loki reads data from just those 3 streams. A full-text search engine would need to consult its inverted index across all documents regardless.
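The lookup described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Loki's index format: the `index` dictionary, the chunk names, and the `select_chunks` helper are all hypothetical, and only equality matchers are handled.

```python
# Toy label index: each stream (a frozen set of label pairs) maps to its chunk names.
# Chunk names and select_chunks are illustrative, not Loki internals.
index = {
    frozenset({("namespace", "production"), ("app", "checkout"), ("container", "api")}): ["chunk-001", "chunk-002"],
    frozenset({("namespace", "production"), ("app", "checkout"), ("container", "worker")}): ["chunk-003"],
    frozenset({("namespace", "staging"), ("app", "checkout"), ("container", "api")}): ["chunk-004"],
}


def select_chunks(index, **matchers):
    """Return the chunks of every stream whose labels include all equality matchers."""
    wanted = set(matchers.items())
    chunks = []
    for labels, stream_chunks in index.items():
        if wanted <= labels:  # subset test: every matcher is present on the stream
            chunks.extend(stream_chunks)
    return sorted(chunks)


# Only the production "api" stream matches, so only its chunks would be read.
print(select_chunks(index, namespace="production", container="api"))
```

The point of the sketch is that the selector is resolved against label sets only; log content is never consulted until the matching chunks have been chosen.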

Loki vs Elasticsearch

Understanding Loki's trade-offs compared to Elasticsearch (the traditional log search engine) helps you decide when each tool is appropriate:

Comparison

Loki vs Elasticsearch: Key Differences

Aspect	Loki	Elasticsearch
Indexing	Labels only (metadata)	Full-text inverted index
Storage Cost	Low (compressed chunks in object storage)	High (indexed data on fast disks)
Query Speed	Fast for label-scoped queries; slower for grep-all	Fast for arbitrary full-text search
Operational Complexity	Low (stateless components + object storage)	High (JVM tuning, shard management, disk I/O)
Ingestion Format	Push-based (Promtail, Alloy, OTel Collector)	Push-based (Beats, Logstash, Fluentd)
Query Language	LogQL (Prometheus-inspired)	KQL / Lucene / ES|QL
Best For	Cloud-native, Kubernetes, cost-sensitive	Security analytics, full-text search, complex aggregations


                            
When NOT to Use Loki: If your primary use case is security analytics requiring arbitrary substring searches across petabytes of logs (e.g., "find all occurrences of a specific IP address across all services for the past year"), Elasticsearch or a dedicated SIEM is more appropriate. Loki excels when you know where to look.

Understanding LogQL

LogQL is Loki's query language, designed to feel familiar to Prometheus users. It combines label-based stream selection with a powerful pipeline of transformations that filter, parse, and reshape log lines. LogQL has two main query types:

Log queries — Return log lines (the actual text content)
Metric queries — Return numeric values computed from logs (rates, counts, percentiles)

The LogQL Query Builder in Grafana

Before diving into raw LogQL syntax, it's worth noting that Grafana provides an excellent visual query builder for Loki. In the Explore view, select your Loki data source and switch to "Builder" mode. The builder provides:

Label browser — Dropdown menus showing all available label names and their values
Pipeline stages — Visual drag-and-drop for adding line filters, parsers, and label filters
Operation selector — Choose between log queries and metric queries with guided parameter input
Query preview — Real-time rendering of the raw LogQL expression as you build visually

The builder is excellent for learning and exploration. However, for complex queries and dashboards, writing raw LogQL gives you full control. The rest of this article focuses on the raw syntax.

LogQL Feature Overview

A LogQL query is structured as a pipeline, flowing from left to right:

LogQL Query Pipeline

flowchart LR
    A["Stream Selector<br/>{labels}"] --> B["Line Filters<br/>|= |~ !~ !="]
    B --> C["Parser<br/>| json | logfmt<br/>| pattern | regexp"]
    C --> D["Label Filter<br/>| status >= 400"]
    D --> E["Line Format<br/>| line_format"]
    E --> F["Unwrap<br/>| unwrap duration"]

    style A fill:#3B9797,color:#fff
    style B fill:#16476A,color:#fff
    style C fill:#132440,color:#fff
    style D fill:#BF092F,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#3B9797,color:#fff

Each stage is optional — you can have as few or as many pipeline stages as needed. The only required element is the stream selector.

Log Stream Selectors

Every LogQL query begins with a stream selector enclosed in curly braces. This selects which log streams to read from, using label matchers:

# Exact match — select logs from the "checkout" app
{app="checkout"}

# Multiple labels — AND logic (all must match)
{namespace="production", app="checkout", container="api"}

# Not equal — exclude a specific value
{namespace!="kube-system"}

# Regex match — labels matching a pattern
{app=~"checkout|payment|inventory"}

# Regex not match — exclude labels matching a pattern
{namespace=~".+", app!~"debug-.*"}

The four matcher operators are:

= — Exact equality
!= — Not equal
=~ — Regex match (RE2 syntax)
!~ — Regex not match

                            
Performance Rule: At least one label matcher must be a non-empty = or =~ match. You cannot query {} (all streams) or use only exclusions like {app!="debug"}. Loki needs a positive selector to identify which streams to read.
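A rough way to reason about this rule is to ask whether each matcher could be satisfied by an empty label value. The sketch below models that check in Python; `is_valid_selector` and the tuple representation are hypothetical, and Python's `re` module stands in for RE2, so edge cases in regex syntax may differ from Loki.

```python
import re


def is_valid_selector(matchers):
    """matchers: list of (label, op, value) tuples, op in {"=", "!=", "=~", "!~"}.

    A selector is accepted when at least one = or =~ matcher
    cannot be satisfied by an empty label value.
    """
    for _label, op, value in matchers:
        if op == "=" and value != "":
            return True
        # Label regexes are fully anchored, so fullmatch against "" models
        # "this pattern would also accept a missing label".
        if op == "=~" and re.fullmatch(value, "") is None:
            return True
    return False


print(is_valid_selector([("app", "!=", "debug")]))        # exclusion only
print(is_valid_selector([("namespace", "=~", ".+")]))     # positive regex
```

Under this model `{namespace=~".+"}` is accepted while `{namespace=~".*"}` is not, because `.*` also matches the empty string.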
                        

The Log Pipeline

After the stream selector, you can chain multiple pipeline stages using the pipe operator |. Each stage processes log lines sequentially, transforming or filtering them before passing to the next stage.

Line Filters

Line filters are the simplest and fastest pipeline stage — they match against the text content of each log line:

# Contains — keep lines containing "error"
{app="checkout"} |= "error"

# Does not contain — drop lines with "healthcheck"
{app="checkout"} != "healthcheck"

# Regex match — lines matching a pattern
{app="checkout"} |~ "status=(4|5)\\d{2}"

# Regex not match — drop lines matching pattern
{app="checkout"} !~ "DEBUG|TRACE"

# Case-insensitive contains (the (?i) flag makes the regex ignore case)
{app="checkout"} |~ "(?i)error"

# Multiple filters — AND logic (all must pass)
{app="checkout"} |= "error" != "timeout" |~ "user_id=\\d+"

Line filters placed before a parser run against the raw log text, which makes them extremely fast. Always place the most selective filter first to reduce the volume of data flowing through subsequent stages.

Parsers

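Parsers extract structured fields from log lines and create new labels from them. This section covers the four most commonly used built-in parsers (Loki also provides an unpack parser for lines produced by Promtail's pack stage):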

JSON Parser

# Parse entire JSON log line — all top-level keys become labels
{app="checkout"} | json

# Parse specific fields only
{app="checkout"} | json level, method, status, duration

# Access nested fields with parameter expressions
{app="checkout"} | json request_id="request.id", user="context.user_id"

Given a log line like:

{"level":"error","method":"POST","path":"/api/checkout","status":500,"duration":1.23,"request":{"id":"abc-123"},"context":{"user_id":"user-456"}}

After | json, you can filter on extracted labels: | status >= 400
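To see which label names a bare | json stage would produce for the sample line, the following Python sketch mimics the documented behaviour: nested keys are flattened with an underscore separator and arrays are skipped. The `flatten` helper is hypothetical and only approximates the parser (for example, it does not sanitize unusual key names).

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into label-style keys joined by underscores."""
    labels = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            labels.update(flatten(value, name))
        elif isinstance(value, list):
            continue  # arrays are skipped, mirroring the documented | json behaviour
        else:
            labels[name] = str(value)  # extracted label values are strings
    return labels


line = (
    '{"level":"error","method":"POST","path":"/api/checkout","status":500,'
    '"duration":1.23,"request":{"id":"abc-123"},"context":{"user_id":"user-456"}}'
)
labels = flatten(json.loads(line))
print(sorted(labels))
```

With the sample line, the nested values end up under `request_id` and `context_user_id`, which is why the parameterized form `| json user="context.user_id"` is handy when you want a shorter label name.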

Logfmt Parser

# Parse logfmt-formatted lines
{app="ingester"} | logfmt

# Parse specific keys
{app="ingester"} | logfmt level, caller, msg, duration

Logfmt lines look like: level=info caller=ingester.go:123 msg="chunk flushed" duration=2.5s bytes=1048576
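For intuition, the key=value splitting that logfmt relies on can be approximated with Python's standard library. This is a minimal sketch, assuming well-formed input; `parse_logfmt` is a hypothetical helper and does not reproduce every rule of Loki's parser.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict; shlex handles the double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore bare words that have no '='
            fields[key] = value
    return fields


line = 'level=info caller=ingester.go:123 msg="chunk flushed" duration=2.5s bytes=1048576'
fields = parse_logfmt(line)
print(fields["msg"], fields["duration"])
```

Every value comes out as a string, which is why later stages such as label filters or unwrap have to convert `2.5s` or `1048576` before comparing them numerically.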

Pattern Parser

# Extract fields from structured text using a pattern template
# Underscores (_) are unnamed captures (discarded), named fields become labels
{app="nginx"} | pattern "<ip> - - [<_>] \"<method> <path> <_>\" <status> <bytes>"

# Common log format extraction
{job="apache"} | pattern "<ip> <_> <user> [<timestamp>] \"<method> <uri> <_>\" <status> <size>"

Regexp Parser

# Use named capture groups to extract labels
{app="legacy"} | regexp "(?P<timestamp>\\d{4}-\\d{2}-\\d{2}T[\\d:.]+Z) \\[(?P<level>\\w+)\\] (?P<msg>.+)"

# Extract specific values
{app="gateway"} | regexp "request_id=(?P<request_id>[a-f0-9-]+)"

                            
Parser Selection Guide: Use json for JSON logs, logfmt for key=value logs, pattern for fixed-format logs with variable fields, and regexp as a last resort for irregular formats. Pattern is faster than regexp because it doesn't use regex internally.

Label Filters

After parsing, you can filter on the extracted labels using comparison operators:

# Numeric comparison
{app="checkout"} | json | status >= 400

# String equality
{app="checkout"} | json | level = "error"

# Multiple conditions (AND)
{app="checkout"} | json | status >= 500 | method = "POST"

# Duration comparison (the value needs a unit such as ms, s, m, or h, e.g. "2.5s")
{app="checkout"} | json | duration > 2s

# Byte size comparison
{app="ingester"} | logfmt | bytes > 1MB

# Regex filter on extracted label
{app="checkout"} | json | path =~ "/api/(checkout|payment)/.*"

# Combining label filters with OR using parentheses
{app="checkout"} | json | (status >= 500 or duration > 5s)

Line Format Expression

The line_format expression rewrites the entire log line using Go's text/template syntax. This is useful for creating clean, readable output from structured logs:

# Reformat log output for readability
{app="checkout"} | json
    | line_format "{{.level | ToUpper}} [{{.method}}] {{.path}} → {{.status}} ({{.duration}})"

# Include conditional formatting
{app="checkout"} | json
    | line_format "{{if eq .level \"error\"}}🔴{{else}}🟢{{end}} {{.msg}}"

# Align columns for terminal viewing
{app="checkout"} | json
    | line_format "{{printf \"%-5s\" .level}} {{printf \"%-6s\" .method}} {{printf \"%-30s\" .path}} {{.status}}"

Unwrap Expression

The unwrap expression converts an extracted label into a numeric sample value, bridging the gap between log queries and metric queries. It is only valid inside a range aggregation, so each example here wraps the pipeline in one:

# Unwrap a numeric duration field for quantile calculations
quantile_over_time(0.99, {app="checkout"} | json | unwrap duration [5m])

# Convert duration strings such as "2.5s" to seconds with duration()
avg_over_time({app="ingester"} | logfmt | unwrap duration(duration) [5m])

# Unwrap a plain numeric bytes field for sum calculations
sum_over_time({app="ingester"} | logfmt | unwrap bytes [5m])

Unwrapped values feed into unwrap range aggregations like quantile_over_time, avg_over_time, and max_over_time, which we'll cover next.

LogQL Metric Queries

Metric queries transform logs into numeric time series — enabling you to build dashboards, set alerts, and compute statistics from log data. They wrap a log query in an aggregation function that operates over a time range.

Log Range Aggregations

These functions operate on the count or size of log lines within a time window:

# rate() — log lines per second over the range
# "How many errors per second is checkout generating?"
rate({app="checkout"} |= "error" [5m])

# count_over_time() — total log lines in the range
# "How many 500 errors in the last hour?"
count_over_time({app="checkout"} | json | status = 500 [1h])

# bytes_over_time() — total bytes of log lines in the range
# "How much log data is the ingester producing?"
bytes_over_time({app="ingester"} [5m])

# bytes_rate() — bytes per second
bytes_rate({app="ingester"} [5m])

# absent_over_time() — returns 1 if no logs exist in the range (deadman alert)
# "Alert if checkout stops logging entirely"
absent_over_time({app="checkout"} [15m])
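What count_over_time and rate compute for a single evaluation step can be written out directly. The sketch below is a simplification under stated assumptions: timestamps are plain seconds, the window excludes its start and includes its end, and the sample data is made up.

```python
def count_over_time(timestamps, now, window):
    """Number of log entries with now - window < t <= now."""
    return sum(1 for t in timestamps if now - window < t <= now)


def rate(timestamps, now, window):
    """Entries per second over the window: the count divided by the window length."""
    return count_over_time(timestamps, now, window) / window


# Hypothetical timestamps (seconds) of lines that matched |= "error"
error_ts = [10, 50, 120, 200, 290, 310, 400]

print(count_over_time(error_ts, now=400, window=300))
print(rate(error_ts, now=400, window=300))
```

A longer window averages over more entries, which smooths the resulting series but makes it slower to react to a sudden burst.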

                            
The Range Vector: The [5m] at the end is the range vector — it defines the time window for the aggregation. Choose a range that balances smoothness vs. responsiveness: [1m] is noisy but responsive, [15m] is smooth but lags behind changes.

Unwrap Range Aggregations

When you've used unwrap to extract a numeric value from logs, these functions compute statistics over the unwrapped values:

# quantile_over_time() — percentiles of a numeric field
# "What's the p99 response time from checkout logs?"
quantile_over_time(0.99,
    {app="checkout"} | json | unwrap duration [5m]
)

# avg_over_time() — average of numeric values
avg_over_time(
    {app="checkout"} | json | unwrap duration [5m]
)

# max_over_time() / min_over_time()
max_over_time(
    {app="checkout"} | json | status >= 200 | unwrap duration [5m]
)

# sum_over_time() — total of numeric values
# "Total bytes processed by ingester in 5-minute windows"
sum_over_time(
    {app="ingester"} | logfmt | unwrap bytes [5m]
)

# stddev_over_time() / stdvar_over_time() — variability
stddev_over_time(
    {app="checkout"} | json | unwrap duration [5m]
)

# first_over_time() / last_over_time() — boundary values
last_over_time(
    {app="checkout"} | json | unwrap status [5m]
)
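As a rough illustration of what a quantile over unwrapped values means, the sketch below computes an exact quantile with linear interpolation between ranks. That interpolation scheme is how Prometheus computes exact quantiles; that Loki's quantile_over_time matches it in every detail is an assumption here, and the duration samples are invented.

```python
import math


def quantile(q, values):
    """Exact quantile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical unwrapped durations (seconds) that fell inside one [5m] window
durations = [0.2, 0.3, 0.25, 1.5, 0.4]

print(quantile(0.5, durations))
print(quantile(0.99, durations))
```

With only five samples the p99 is dominated by the single slow request, which is a reminder that high percentiles over short windows on low-traffic services are noisy.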

Aggregation Operators

Aggregation operators combine multiple time series produced by metric queries. They work identically to Prometheus aggregation operators:

# sum — total error rate across all containers
sum(rate({namespace="production"} |= "error" [5m]))

# sum by — error rate grouped by app
sum by (app) (rate({namespace="production"} |= "error" [5m]))

# topk — top 5 apps by error rate
topk(5, sum by (app) (rate({namespace="production"} |= "error" [5m])))

# avg — average request duration across pods
avg(
    avg_over_time({app="checkout"} | json | unwrap duration [5m])
)

# max by — highest p99 latency per service
max by (app) (
    quantile_over_time(0.99, {namespace="production"} | json | unwrap duration [5m])
)

# count — number of services with errors
count(sum by (app) (rate({namespace="production"} |= "error" [5m])) > 0)

Other available aggregation operators include: min, stddev, stdvar, bottomk, and sort / sort_desc.

Practical Metric Query Examples

Here are real-world LogQL metric queries you'd use in production dashboards:

# Error rate percentage — what % of requests are errors?
sum(rate({app="checkout"} | json | status >= 500 [5m]))
/
sum(rate({app="checkout"} | json | status > 0 [5m]))
* 100

# Request throughput by HTTP method
sum by (method) (rate({app="checkout"} | json [5m]))

# Slow request detection — requests over 2 seconds
sum by (path) (
    count_over_time({app="checkout"} | json | duration > 2s [5m])
)

# Log volume by severity level
sum by (level) (bytes_rate({namespace="production"} | json [5m]))

# Unique user count (one inner series per user_id, so expensive for large user populations)
count(
    sum by (user_id) (
        count_over_time({app="checkout"} | json | user_id != "" [1h])
    )
)

Lab Exercise

Building a Log-Based Error Budget

Using the OpenTelemetry Demo from Part 3, create a dashboard panel that shows the error budget consumption for the frontend service over the past 24 hours:

# SLO: 99.5% of requests should be non-5xx
# Error budget = 0.5% of total requests

# Current error ratio (should be < 0.005 for healthy SLO)
sum(rate({app="frontend"} | json | status >= 500 [24h]))
/
sum(rate({app="frontend"} | json | status > 0 [24h]))

If this value exceeds 0.005, you've consumed your entire error budget for that window.
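The arithmetic behind that threshold is worth spelling out. The sketch below turns an error ratio into a fraction of budget consumed; `budget_consumed` is a hypothetical helper and the request counts are invented for illustration.

```python
def budget_consumed(error_ratio, slo=0.995):
    """Fraction of the error budget used; 1.0 means the budget is exhausted."""
    budget = 1.0 - slo  # 0.005 for a 99.5% SLO
    return error_ratio / budget


# Hypothetical 24h totals: 1,200 5xx responses out of 400,000 requests
ratio = 1200 / 400_000  # 0.003
print(f"{budget_consumed(ratio):.0%} of the error budget used")
```

Multiplying the LogQL ratio by 200 (that is, dividing by 0.005) in the panel query gives the same consumption figure directly, which is often easier to read on a gauge than the raw ratio.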


Exploring Loki's Architecture

Understanding Loki's internal architecture helps you optimize queries, plan capacity, and troubleshoot performance issues. Loki is composed of several microservices that can run together (monolithic mode) or independently (microservices mode).

Loki Architecture — Write & Read Paths

flowchart TB
    subgraph Clients["Log Shippers"]
        Alloy[Grafana Alloy]
        Prom[Promtail]
        OTel[OTel Collector]
    end

    subgraph WritePath["Write Path"]
        Dist["Distributor<br/>validates, rate-limits,<br/>hashes to ingesters"]
        Ing["Ingester<br/>batches chunks in memory,<br/>flushes to storage"]
    end

    subgraph ReadPath["Read Path"]
        QF["Query Frontend<br/>splits, caches,<br/>retries queries"]
        Sched["Query Scheduler<br/>fair queuing across<br/>tenants"]
        Quer["Querier<br/>executes query against<br/>ingesters + storage"]
    end

    subgraph Storage["Storage Layer"]
        Obj[("Object Storage<br/>S3 / GCS / Azure<br/>Chunks")]
        Idx[("Index<br/>BoltDB / TSDB<br/>Label → Chunk mapping")]
    end

    subgraph Maintenance["Background"]
        Comp["Compactor<br/>merges index files,<br/>applies retention"]
        IG["Index Gateway<br/>serves index queries<br/>from cache"]
    end

    Alloy --> Dist
    Prom --> Dist
    OTel --> Dist
    Dist --> Ing
    Ing --> Obj
    Ing --> Idx
    QF --> Sched
    Sched --> Quer
    Quer --> Ing
    Quer --> IG
    IG --> Idx
    Quer --> Obj
    Comp --> Idx
    Comp --> Obj

The Write Path

When log data arrives at Loki, it flows through these components:

Distributor

The distributor is the first component to receive log pushes. It performs several critical functions:

Validation — Checks that timestamps are within acceptable bounds, labels are valid, and line sizes don't exceed limits
Rate limiting — Enforces per-tenant ingestion rate limits to prevent noisy neighbors
Consistent hashing — Uses a hash ring to determine which ingester(s) should receive each stream, ensuring all entries for a given stream go to the same ingester
Replication — Writes to multiple ingesters (typically RF=3) for durability
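The consistent hashing and replication steps can be sketched with a small token ring. This is a simplified model, not Loki's ring implementation: the `Ring` class, the SHA-256 based tokens, and the ingester names are assumptions made for illustration.

```python
import bisect
import hashlib


def _token(key):
    # Stable 32-bit token derived from a string (the hashing scheme is illustrative)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    def __init__(self, ingesters, tokens_per_ingester=64):
        # Each ingester owns several tokens so load spreads evenly around the ring.
        self._ring = sorted(
            (_token(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]
        self._size = len(set(ingesters))

    def owners(self, stream_labels, replication_factor=3):
        """Walk clockwise from the stream's token, collecting distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _token(stream_labels))
        owners = []
        for offset in range(len(self._ring)):
            name = self._ring[(start + offset) % len(self._ring)][1]
            if name not in owners:
                owners.append(name)
            if len(owners) == min(replication_factor, self._size):
                break
        return owners


ring = Ring([f"ingester-{i}" for i in range(5)])
print(ring.owners('{app="checkout", namespace="production"}'))
```

Because the token depends only on the stream's labels, every entry for a given stream lands on the same set of ingesters, and adding an ingester moves only the streams whose tokens fall next to its new positions on the ring.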

Ingester

Ingesters are stateful components that build compressed chunks in memory:

Chunk building — Accumulates log entries into chunks (typically ~1.5 MB compressed). Each chunk represents a time window of a single stream
Write-Ahead Log (WAL) — Persists incoming data to disk immediately for crash recovery
Flushing — When a chunk reaches its target size or age, the ingester flushes it to object storage and updates the index
Live tail — Serves real-time log queries against in-memory (not-yet-flushed) data

                            
Why Ingesters Matter for Queries: Recent logs (last ~2 hours) are served directly from ingester memory, which is much faster than reading from object storage. This is why recent queries feel instantaneous while older queries may take longer.

The Read Path

Query Frontend

The query frontend is an optional (but recommended) component that sits in front of queriers:

Query splitting — Large time-range queries are split into smaller sub-queries executed in parallel
Result caching — Caches query results to accelerate repeated dashboard loads
Query retries — Automatically retries failed sub-queries
Query limits — Enforces maximum query length and complexity per tenant
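Query splitting is the easiest of these to picture in code. The sketch below cuts a time range into fixed-size pieces; it is a simplification that ignores the alignment to interval boundaries a real frontend may apply, and `split_range` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def split_range(start, end, interval):
    """Yield (sub_start, sub_end) pairs covering [start, end) in interval-sized steps."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + interval, end)
        yield cursor, nxt
        cursor = nxt


start = datetime(2024, 6, 15, 0, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=2, minutes=30)
pieces = list(split_range(start, end, timedelta(hours=1)))
for sub_start, sub_end in pieces:
    print(sub_start.isoformat(), "->", sub_end.isoformat())
```

Each piece can be handed to a different querier and cached independently, so a dashboard refresh only needs to recompute the most recent piece.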

Query Scheduler

In microservices mode, the scheduler provides fair queuing across tenants, preventing any single tenant's expensive queries from starving others.

Querier

The querier executes the actual LogQL queries by reading from both sources:

Ingesters — For recent, in-memory data
Object storage — For historical, flushed chunks

The querier merges results from both sources, deduplicates entries (since replication means data exists on multiple ingesters), and applies the full LogQL pipeline.
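The merge-and-deduplicate step can be sketched as a sorted merge that drops repeated entries. This is a minimal model, assuming each source is already sorted and that duplicates are exact (timestamp, line) matches; `merge_entries` and the sample data are hypothetical.

```python
import heapq


def merge_entries(*sources):
    """Merge time-sorted (timestamp_ns, line) lists and drop exact duplicates."""
    merged = []
    for entry in heapq.merge(*sources):
        if not merged or merged[-1] != entry:
            merged.append(entry)
    return merged


# Two replicas hold overlapping recent data; storage holds an older entry.
ingester_a = [(1, "request started"), (2, "request failed")]
ingester_b = [(1, "request started"), (2, "request failed"), (3, "retrying")]
storage = [(0, "service booted")]

print(merge_entries(ingester_a, ingester_b, storage))
```

Because the merged stream is ordered by timestamp and then by line, replicated copies of the same entry are always adjacent and a single comparison with the previous element is enough.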

Storage & Compaction

Chunk Storage

Log content is stored as compressed chunks in object storage. Each chunk is a compressed block of log entries for a single stream within a time window. Loki supports S3, GCS, Azure Blob Storage, and local filesystem (for testing).

Index

The index maps label sets to chunk locations. Loki has evolved through several index formats:

BoltDB Shipper (legacy) — BoltDB files uploaded to object storage periodically
TSDB (current, recommended) — Prometheus TSDB-based index, more efficient and simpler to operate

Compactor

The compactor runs as a background process that:

Merges index files — Combines many small index files into larger, more efficient ones
Applies retention — Deletes chunks and index entries older than the configured retention period
Manages delete requests — Processes any explicit log deletion requests (compliance use cases)

Deployment Modes

Loki offers three deployment modes to match different scale requirements:

Deployment

Loki Deployment Modes

Mode	Scale	Use Case
Monolithic	Up to ~20 GB/day	Development, small teams, single-node deployment
Simple Scalable (SSD)	Up to ~1 TB/day	Mid-size production, read/write path separation
Microservices	> 1 TB/day	Large-scale production, independent scaling of each component

The Simple Scalable Deployment (SSD) mode is recommended for most production environments. It splits Loki into a write path and read path, each independently scalable, while keeping operational complexity low.


Tips, Tricks & Best Practices

Operating Loki effectively requires understanding its constraints and working with — not against — its design philosophy. These practices come from real-world production deployments.

Label Cardinality

Label cardinality is the single most important factor affecting Loki's performance and cost. Each unique combination of labels creates a new stream, and each stream has overhead in the index and ingester memory.

                            
Critical Rule: Never use high-cardinality values as labels. User IDs, request IDs, IP addresses, trace IDs, and timestamps must NEVER be labels. These should remain in the log line content and be extracted at query time with parsers.

Good labels (low cardinality, stable):

# These create a bounded, predictable number of streams
labels:
  namespace: production    # 3-5 values
  app: checkout           # 20-50 values
  container: api          # 2-3 per app
  environment: prod       # 3 values (dev/staging/prod)
  region: us-east-1       # 5-10 values
  level: error            # 5 values (debug/info/warn/error/fatal)

Bad labels (high cardinality, dangerous):

# DO NOT DO THIS — creates millions of streams
labels:
  user_id: "user-123456"         # millions of values
  request_id: "abc-def-ghi"      # unique per request
  ip_address: "192.168.1.42"     # thousands of values
  timestamp: "2026-06-15T10:30"  # infinite values
  url_path: "/users/123/orders"  # unbounded (includes IDs)

A healthy Loki deployment typically has fewer than 100,000 active streams. Each tenant should aim for the lowest number of streams that still allows efficient querying.
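A quick back-of-the-envelope calculation shows why a single high-cardinality label is so damaging. The upper bound on stream count is the product of the distinct values of each label; the counts below are illustrative, loosely based on the ranges in the examples above.

```python
import math


def max_streams(label_value_counts):
    """Upper bound on stream count: the product of distinct values per label."""
    return math.prod(label_value_counts.values())


# Hypothetical distinct-value counts per label
good = {"namespace": 5, "app": 50, "container": 3, "level": 5}
bad = dict(good, user_id=1_000_000)  # the same labels plus one unbounded label

print(f"{max_streams(good):,}")
print(f"{max_streams(bad):,}")
```

Real deployments sit well below the upper bound because not every combination occurs, but the multiplication explains why one unbounded label can push a tenant from thousands of streams to billions.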

Structured Metadata

Structured metadata, experimental in Loki 2.9 and enabled by default since Loki 3.0, is a middle ground between labels and log content. Structured metadata fields are stored alongside log entries but do NOT create new streams:

# JSON body for the Loki push API (POST /loki/api/v1/push)
# Each entry in "values" is [timestamp_ns, log line, structured metadata]
{
  "streams": [
    {
      "stream": {"app": "checkout", "namespace": "production"},
      "values": [
        ["1718447400000000000", "POST /api/checkout 200 1.2s",
         {"trace_id": "abc123", "user_id": "user-456", "pod": "checkout-7b4f8c-xk2p9"}]
      ]
    }
  ]
}
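If you are building this payload from application code, a small helper keeps the structure straight. The sketch below only constructs and serializes the JSON body; `build_push_payload` is a hypothetical name, and sending the request (host, authentication, tenant header) is left out because those details depend on your deployment.

```python
import json
import time


def build_push_payload(labels, line, metadata, ts_ns=None):
    """Build a push body with one stream and one entry carrying structured metadata."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    return {
        "streams": [
            {
                "stream": labels,
                # Each value: [timestamp in ns as a string, log line, structured metadata]
                "values": [[str(ts_ns), line, metadata]],
            }
        ]
    }


payload = build_push_payload(
    labels={"app": "checkout", "namespace": "production"},
    line="POST /api/checkout 200 1.2s",
    metadata={"trace_id": "abc123", "user_id": "user-456"},
    ts_ns=1718447400000000000,
)
body = json.dumps(payload)
# body would be sent as application/json to the /loki/api/v1/push endpoint
print(body)
```

Note that the stream labels stay low-cardinality (app, namespace) while the per-request identifiers travel in the third element of each value.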

You can filter on structured metadata in LogQL using the same syntax as extracted labels:

# Filter by structured metadata field
{app="checkout"} | trace_id = "abc123"

# Combine with other pipeline stages
{app="checkout"} | trace_id = "abc123" | json | status >= 400

This is the recommended approach for fields that are high-cardinality but frequently queried (trace IDs, pod names, request IDs).

Retention & Limits

Configure retention and limits to control costs and prevent abuse:

# loki-config.yaml — key retention and limits settings
limits_config:
  # Ingestion limits
  ingestion_rate_mb: 10              # MB/s per tenant
  ingestion_burst_size_mb: 20        # burst allowance
  max_streams_per_user: 50000        # active streams per tenant, per ingester
  max_line_size: 256KB               # reject lines larger than this
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30

  # Query limits
  max_query_length: 721h             # max query time range (30 days + 1 hour)
  max_query_series: 5000             # max series returned by a metric query
  max_entries_limit_per_query: 10000 # max log lines returned

  # Retention
  retention_period: 744h             # 31 days global retention

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: s3

                            
Per-Tenant Retention: You can configure different retention periods per tenant using runtime overrides. Critical production logs might need 90 days while debug logs can be discarded after 7 days.

Query Optimization

Write efficient LogQL queries by following these principles:

Narrow stream selection first — Use specific labels to minimize the number of streams scanned
Line filter before parser — Filter on raw text (fast) before invoking a parser (slower)
Shortest time range possible — Query [1h] instead of [7d] when possible
Use specific parsers — | json level, status is faster than | json (extracts only needed fields)
Avoid regex when substring works — |= "error" is much faster than |~ ".*error.*"

# ❌ SLOW — broad selector, regex line filter, full JSON parse
{namespace=~".+"} |~ ".*error.*" | json

# ✅ FAST — specific selector, substring filter, targeted parse
{namespace="production", app="checkout"} |= "error" | json level, status

# ❌ SLOW — 7-day range, aggregating all streams
sum(rate({namespace="production"} |= "error" [7d]))

# ✅ FAST — 5-minute range, pre-filtered to specific app
sum by (app) (rate({namespace="production", app="checkout"} |= "error" [5m]))

Alerting on Logs

Loki integrates with Grafana Alerting to fire alerts based on log patterns. Log-based alerts use metric queries as their condition:

# Prometheus-style alerting rules, as evaluated by the Loki ruler
# (Grafana-managed rules express the same query and threshold via the UI or provisioning)
groups:
  - name: checkout-logs              # group name is a placeholder
    rules:
      # Fire when the error rate exceeds the threshold
      - alert: HighErrorRate
        expr: |
          sum(rate({app="checkout", namespace="production"} |= "error" [5m])) > 5
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate is {{ $value }} errors/sec"
          description: "The checkout service in production is generating more than 5 errors/sec for 2+ minutes."
          runbook: "https://wiki.internal/runbooks/checkout-errors"

      # Deadman alert: fire when a critical service stops logging
      - alert: CheckoutSilent
        expr: |
          absent_over_time({app="checkout", namespace="production"} [15m])
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "No logs from checkout for 15 minutes"
          description: "The checkout service has not produced any log output. The service may be down or disconnected."

Alert Pattern

Common Log-Based Alert Patterns

Error rate threshold — rate({app="X"} |= "error" [5m]) > N
Deadman / silence — absent_over_time({app="X"} [15m])
Latency spike — quantile_over_time(0.99, {app="X"} | json | unwrap duration [5m]) > 5
Log volume anomaly — bytes_rate({app="X"} [5m]) > 10000000 (sudden log flood, roughly 10 MB/s)
Specific error pattern — count_over_time({app="X"} |= "OOMKilled" [5m]) > 0
Error budget burn — Ratio of error lines to total lines exceeding SLO threshold


Summary & Next Steps

In this article, we've covered the full spectrum of working with Grafana Loki:

Loki's philosophy — Label-based indexing, cost-efficiency, and Prometheus-inspired design
LogQL fundamentals — Stream selectors, line filters, parsers (json, logfmt, pattern, regexp), label filters, line_format, and unwrap
Metric queries — rate, count_over_time, bytes_over_time, quantile_over_time, aggregation operators, and practical dashboard patterns
Architecture — Distributors, ingesters, queriers, query frontend, compactor, and the read/write path separation
Best practices — Label cardinality management, structured metadata, retention configuration, query optimization, and log-based alerting

With LogQL mastery, you can build powerful log-based dashboards, create targeted alerts, and correlate log events with metrics and traces. The key takeaway is to think of Loki as a targeted search tool rather than a full-text search engine — start with labels, filter with pipelines, and extract only what you need.

Next in the Grafana Track

In Part 5: Monitoring with Metrics Using Grafana Mimir & Prometheus, we'll explore the metrics pillar — PromQL fundamentals, Mimir's architecture, building production dashboards, recording rules, and long-term metrics storage strategies.
