Common Log Formats
Logs are the most universal form of telemetry — every application produces them. Yet the format you choose determines how effectively you can search, filter, alert on, and correlate log data in systems like Grafana Loki. Understanding the spectrum from unstructured to fully structured logging is the first step toward effective instrumentation.
flowchart LR
A[Unstructured<br/>Free-form text] --> B[Semi-Structured<br/>Consistent patterns]
B --> C[Structured<br/>Machine-parseable]
C --> D[Contextualized<br/>Correlated with<br/>traces & metrics]
style A fill:#f8d7da,stroke:#dc3545
style B fill:#fff3cd,stroke:#ffc107
style C fill:#d1ecf1,stroke:#17a2b8
style D fill:#d4edda,stroke:#28a745
Unstructured Logs
Unstructured logs are free-form text strings written to stdout, files, or syslog. They are human-readable but require regex parsing or pattern matching to extract meaningful fields. Legacy applications, third-party software, and quick debug statements typically produce unstructured logs.
# Examples of unstructured log output
ERROR Connection timeout to database at 10.0.1.5:5432 after 30s
Starting server on port 8080...
User john.doe logged in from 192.168.1.100
WARN: Disk usage at 87% on /dev/sda1
Payment processed successfully for order #12345 ($149.99)
Challenges with unstructured logs:
- No consistent schema — every developer formats differently
- Expensive regex-based parsing at query time
- Difficult to aggregate or create alerts on specific fields
- Hard to correlate with traces without manual effort
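To make the parsing cost concrete, here is a minimal sketch that extracts fields from one of the sample lines above. The pattern and the `parse_timeout` helper are hypothetical and written for exactly one message shape; every other message in the application would need its own pattern.

```python
import re
from typing import Optional

# Hypothetical pattern that matches only the "Connection timeout" sample line.
TIMEOUT_RE = re.compile(
    r"^(?P<level>[A-Z]+) Connection timeout to database at "
    r"(?P<host>[\d.]+):(?P<port>\d+) after (?P<timeout_s>\d+)s$"
)


def parse_timeout(line: str) -> Optional[dict]:
    """Return the extracted fields, or None when the line has another shape."""
    match = TIMEOUT_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Regex groups are always strings; numeric fields must be converted by hand.
    fields["port"] = int(fields["port"])
    fields["timeout_s"] = int(fields["timeout_s"])
    return fields
```

A line with any other wording, such as the disk usage warning, simply returns `None`, which is why this approach scales poorly across a codebase.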
Semi-Structured Logs
Semi-structured logs follow a consistent pattern (timestamp, level, source, message) but aren't machine-parseable without format-specific parsers. Common formats include Apache/Nginx access logs, syslog (RFC 5424), and custom formats using logging frameworks with configured patterns.
# Apache Combined Log Format
192.168.1.100 - john [15/Jun/2026:14:30:02 +0000] "GET /api/users HTTP/1.1" 200 1234 "https://app.example.com" "Mozilla/5.0"
# Syslog RFC 5424
<165>1 2026-06-15T14:30:02.341Z app-server-01 myapp 1234 ID47 - Connection pool exhausted, waiting for available connection
# Log4j pattern layout
2026-06-15 14:30:02,341 [http-thread-42] ERROR com.example.UserService - Failed to authenticate user: timeout after 5000ms
# Python standard logging
2026-06-15 14:30:02,341 - myapp.auth - ERROR - Authentication failed for user_id=abc123 reason=token_expired
Loki can parse semi-structured logs with the | pattern and | regexp pipeline stages. You can extract fields at query time without pre-processing, though this is slower than using labels on structured logs.
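A format-specific parser for the Apache Combined Log Format can be sketched as follows. The regular expression and the `parse_combined` name are illustrative assumptions, not taken from any particular library, and the timestamp is left as a raw string.

```python
import re
from typing import Optional

# One named group per field of the Combined Log Format.
COMBINED_RE = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line: str) -> Optional[dict]:
    """Parse one access-log line into a dict, or return None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # The size field is "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Because the format is fixed, one pattern covers every line the server writes, which is the practical difference from unstructured logs.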
Structured Logs
Structured logs emit each event as a machine-parseable record (typically JSON) with typed fields. This is the gold standard for observability: every field is immediately queryable, aggregatable, and correlatable. Most modern logging libraries can emit structured output, and many default to it.
{
"timestamp": "2026-06-15T14:30:02.341Z",
"level": "error",
"service": "payment-service",
"version": "2.4.1",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"message": "Payment processing failed",
"error.type": "TimeoutException",
"error.message": "Gateway timeout after 30000ms",
"payment.order_id": "ORD-12345",
"payment.amount": 149.99,
"payment.currency": "USD",
"payment.provider": "stripe",
"user.id": "usr_abc123",
"http.method": "POST",
"http.url": "/api/v2/payments",
"http.status_code": 504,
"duration_ms": 30042
}
Here's how to emit structured JSON logs in Python using the structlog library:
import structlog
import logging
# Configure structlog for JSON output
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer()
],
wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)
# Create a logger with bound context
logger = structlog.get_logger()
log = logger.bind(service="payment-service", version="2.4.1")
# Log with rich context — all fields become queryable in Loki
log.error(
"Payment processing failed",
error_type="TimeoutException",
order_id="ORD-12345",
amount=149.99,
currency="USD",
duration_ms=30042,
trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
)
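If adding a dependency is not an option, similar output can be approximated with the standard library alone. The sketch below is an assumption-laden alternative to structlog, not its equivalent: the `JsonFormatter` class and the `fields` key passed through `extra` are conventions invented for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra context arrives via extra={"fields": {...}}; "fields" is a
        # convention of this sketch, not part of the logging API.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Usage: write one event to an in-memory stream and read it back.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.error(
    "Payment processing failed",
    extra={"fields": {"order_id": "ORD-12345", "duration_ms": 30042}},
)
event = json.loads(stream.getvalue())
```

In a real service the handler would write to stdout rather than a `StringIO` buffer, so that a collector can pick the lines up.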
Choosing a Log Format
| Aspect | Unstructured | Semi-Structured | Structured (JSON) |
|---|---|---|---|
| Human Readability | Excellent | Good | Moderate (verbose) |
| Machine Parseability | Poor (regex required) | Moderate (format-specific) | Excellent (native) |
| Query Performance | Slow (full text scan) | Moderate | Fast (field extraction) |
| Storage Efficiency | Compact | Moderate | Verbose (keys repeated) |
| Trace Correlation | Manual effort | Possible with parsing | Native (trace_id field) |
| Alerting Capability | Pattern matching only | Limited field extraction | Full field-level alerting |
| Best For | Legacy apps, debugging | Web servers, syslog | Microservices, cloud-native |
Use a loki.process stage in Grafana Alloy to strip sensitive values before ingestion.
Metric Types & Best Practices
Metrics are the most cost-effective telemetry type — a single time series costs the same whether your service handles 10 or 10 million requests per second. Understanding the four fundamental metric types and when to use each is essential for effective monitoring.
Counters
A counter is a cumulative metric that only goes up (or resets to zero on restart). Use counters for things you want to count: requests, errors, bytes transferred, tasks completed.
Always query counters with rate() or increase() in PromQL. The raw counter value is meaningless without knowing the time window: "5000 errors" means nothing, but "200 errors/minute" is actionable.
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Define a counter with labels for method and status
var httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests processed",
},
[]string{"method", "handler", "status_code"},
)
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Process request...
httpRequestsTotal.WithLabelValues(r.Method, "/api/users", "200").Inc()
w.WriteHeader(http.StatusOK)
}
func main() {
http.HandleFunc("/api/users", handleRequest)
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe(":8080", nil)
}
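To see why raw counter values need rate() or increase(), here is a simplified sketch of the arithmetic behind them. It is not how Prometheus implements these functions (the real ones also extrapolate to the edges of the query window); the function names and sample data are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase across the samples, treating any drop as a restart."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The counter went backwards: the process restarted from zero,
            # so everything counted since the restart is new.
            total += cur
    return total


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second rate between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0
```

With samples `(0, 4800)` and `(60, 5000)`, the increase is 200 over one minute, which is the actionable "200 errors/minute" figure rather than the raw value 5000.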
Gauges
A gauge represents a point-in-time value that can go up or down: temperature, queue depth, active connections, memory usage, number of goroutines. Gauges are ideal for resource utilization monitoring and capacity planning.
from prometheus_client import Gauge, start_http_server
import psutil
import time
# Define gauges for system resources
cpu_usage_percent = Gauge(
'system_cpu_usage_percent',
'Current CPU usage as a percentage',
['cpu']
)
memory_usage_bytes = Gauge(
'system_memory_usage_bytes',
'Current memory usage in bytes',
['type']
)
active_connections = Gauge(
'app_active_connections',
'Number of currently active client connections',
['pool']
)
def collect_system_metrics():
    """Collect system metrics and update gauges."""
    # CPU per-core usage
    for i, percent in enumerate(psutil.cpu_percent(percpu=True)):
        cpu_usage_percent.labels(cpu=f"cpu{i}").set(percent)
    # Memory breakdown
    mem = psutil.virtual_memory()
    memory_usage_bytes.labels(type="used").set(mem.used)
    memory_usage_bytes.labels(type="available").set(mem.available)
    memory_usage_bytes.labels(type="cached").set(mem.cached)

if __name__ == '__main__':
    start_http_server(8080)
    while True:
        collect_system_metrics()
        time.sleep(15)
Histograms
Histograms sample observations (usually durations or sizes) and count them in configurable buckets. They enable percentile calculations server-side using PromQL's histogram_quantile(). Histograms are the preferred type for latency measurement in Prometheus-based systems.
package main
import (
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Define a histogram with custom buckets for HTTP latency
var httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latency in seconds",
Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
},
[]string{"method", "handler", "status_code"},
)
func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Simulate work with variable latency
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
w.WriteHeader(http.StatusOK)
// Observe the duration
duration := time.Since(start).Seconds()
httpDuration.WithLabelValues(r.Method, "/api/orders", "200").Observe(duration)
}
func main() {
http.HandleFunc("/api/orders", instrumentedHandler)
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe(":8080", nil)
}
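The server-side percentile calculation can be illustrated with a small sketch. It follows the general idea of histogram_quantile(), linear interpolation inside the bucket that contains the target rank, but it is a simplification written for this article rather than the Prometheus implementation. The bucket bounds match the Go example above; the helper names are assumptions.

```python
import math
from bisect import bisect_left

BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


def bucketize(observations, bounds=BOUNDS):
    """Return cumulative counts: result[i] = observations <= bounds[i]."""
    counts = [0] * len(bounds)
    for value in observations:
        counts[bisect_left(bounds, value)] += 1  # first bound >= value
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


def histogram_quantile(q, cumulative, bounds=BOUNDS):
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    idx = next(i for i, count in enumerate(cumulative) if count >= rank)
    if math.isinf(bounds[idx]):
        # Nothing is known beyond the last finite bound.
        return bounds[idx - 1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[idx] - lower) * (rank - below) / (cumulative[idx] - below)
```

For 90 requests at 30 ms and 10 requests at 300 ms, the estimated p95 lands at 0.375 s even though every slow request took 0.3 s. The estimate can only be as precise as the bucket boundaries around it, so choose buckets that bracket your SLO thresholds.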
Summaries
Summaries calculate streaming quantiles on the client side. Unlike histograms, they report quantiles within a configurable error margin instead of estimating them from bucket boundaries, but they cannot be aggregated across instances. Use summaries only when you need precise quantiles from a single instance and cannot use histograms.
| Feature | Histogram | Summary |
|---|---|---|
| Quantile Calculation | Server-side (PromQL) | Client-side (streaming) |
| Aggregation | Aggregatable across instances | Not aggregatable |
| Accuracy | Depends on bucket boundaries | Configurable error margin |
| Cost | One time series per bucket | One time series per quantile |
| Recommendation | Preferred for most use cases | Only for single-instance precision |
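The "not aggregatable" row deserves a worked example. The sketch below uses made-up latency data for two hypothetical instances and a simple nearest-rank percentile to show that averaging per-instance p99 values does not give the fleet-wide p99.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
instance_a = [10] * 900              # busy instance, uniformly fast
instance_b = [10] * 90 + [900] * 10  # quiet instance with a slow tail

p99_a = percentile(instance_a, 99)
p99_b = percentile(instance_b, 99)
averaged_p99 = (p99_a + p99_b) / 2
true_p99 = percentile(instance_a + instance_b, 99)
```

Here the averaged value is 455 ms while the p99 over all requests is 10 ms, because the slow instance handles only a small share of the traffic. Histogram buckets avoid this problem: bucket counts can be summed across instances before the quantile is estimated.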
Metric Protocols
Multiple protocols exist for transmitting metrics from applications to backends. Your choice depends on the ecosystem, existing infrastructure, and whether you want push or pull semantics.
flowchart TD
subgraph Pull["Pull-Based (Scrape)"]
P1[Prometheus Exposition<br/>Format]
P2[OpenMetrics]
end
subgraph Push["Push-Based"]
Q1[OTLP<br/>OpenTelemetry Protocol]
Q2[StatsD / DogStatsD]
Q3[Graphite Plaintext]
Q4[InfluxDB Line Protocol]
end
subgraph Backends["Storage Backends"]
B1[Grafana Mimir]
B2[Prometheus]
B3[Datadog]
B4[InfluxDB]
end
P1 --> B1
P1 --> B2
P2 --> B1
Q1 --> B1
Q1 --> B2
Q2 --> B3
Q3 --> B4
Q4 --> B4
| Protocol | Model | Format | Best For |
|---|---|---|---|
| Prometheus Exposition | Pull (scrape) | Text/protobuf | Kubernetes services, long-running processes |
| OTLP (OpenTelemetry) | Push (gRPC/HTTP) | Protobuf | Vendor-neutral, multi-signal (metrics + logs + traces) |
| StatsD | Push (UDP/TCP) | Plaintext | Low-overhead fire-and-forget, legacy apps |
| DogStatsD | Push (UDP) | Extended StatsD | Datadog ecosystem, tags support |
| OpenMetrics | Pull (scrape) | Text/protobuf | Prometheus successor, exemplar support |
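The text-based formats in the table are simple enough to render by hand, which helps when debugging what an application actually sends. The helpers below are a sketch with hypothetical names; they skip label-value escaping and cover only counters.

```python
def prometheus_line(name, labels, value):
    """One sample in the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def statsd_line(name, value):
    """A StatsD counter increment: <name>:<value>|c"""
    return f"{name}:{value}|c"


def dogstatsd_line(name, value, tags):
    """A DogStatsD counter increment, which appends tags after '|#'."""
    tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{statsd_line(name, value)}|#{tag_str}"
```

Note the difference in semantics: the Prometheus line reports the current cumulative total and is read by the scraper, while each StatsD line is an increment pushed to the server, which does the summing.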
Avoid high-cardinality labels: {user_id="..."} across 1M users creates at least 1M time series, which can crash Prometheus or explode your Mimir costs.
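A quick way to estimate the damage before shipping a new label is to multiply the number of distinct values per label. The sketch below computes that worst case; the label counts are invented for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for http_requests_total.
baseline = {"method": 5, "handler": 20, "status_code": 10}
with_user_id = {**baseline, "user_id": 1_000_000}

baseline_series = worst_case_series(baseline)
exploded_series = worst_case_series(with_user_id)
```

The baseline stays at 1,000 series, while adding `user_id` raises the ceiling to one billion. Identifiers like this belong in logs or span attributes, not metric labels.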
Tracing Protocols & Best Practices
Distributed tracing follows a single request as it traverses multiple services, showing exactly where time is spent and where failures occur. While metrics tell you something is slow and logs tell you what happened, traces tell you where in the call chain the problem lives.
Spans and Traces
A trace represents the entire journey of a request through a distributed system. It is composed of one or more spans — each span representing a unit of work (an HTTP call, a database query, a message publish). Spans form a directed acyclic graph (DAG) with parent-child relationships.
flowchart TD
A["Root Span: POST /api/checkout<br/>trace_id: abc123<br/>duration: 850ms"] --> B["Span: Validate Cart<br/>span_id: span_01<br/>duration: 45ms"]
A --> C["Span: Process Payment<br/>span_id: span_02<br/>duration: 620ms"]
A --> D["Span: Send Confirmation<br/>span_id: span_03<br/>duration: 180ms"]
C --> E["Span: Stripe API Call<br/>span_id: span_04<br/>duration: 580ms"]
C --> F["Span: Update DB<br/>span_id: span_05<br/>duration: 35ms"]
D --> G["Span: Email Service<br/>span_id: span_06<br/>duration: 150ms"]
D --> H["Span: Push Notification<br/>span_id: span_07<br/>duration: 25ms"]
Each span carries essential metadata:
- Trace ID — unique identifier shared by all spans in a trace
- Span ID — unique identifier for this specific span
- Parent Span ID — links child spans to their parent
- Operation Name — describes the work performed
- Start/End Timestamps — precise timing
- Attributes — key-value metadata (http.method, db.statement, etc.)
- Status — OK, ERROR, or UNSET
- Events — timestamped annotations within a span (exceptions, log entries)
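The parent-child structure is easy to model directly. The sketch below rebuilds the checkout trace from the diagram as plain data and walks from the root to the slowest child at each level, which is roughly what you do by eye in a trace viewer. The `Span` class is a simplified stand-in for a real SDK span, and the root's `span_00` ID is an assumption since the diagram does not show one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


SPANS = [
    Span("abc123", "span_00", None, "POST /api/checkout", 850),
    Span("abc123", "span_01", "span_00", "Validate Cart", 45),
    Span("abc123", "span_02", "span_00", "Process Payment", 620),
    Span("abc123", "span_03", "span_00", "Send Confirmation", 180),
    Span("abc123", "span_04", "span_02", "Stripe API Call", 580),
    Span("abc123", "span_05", "span_02", "Update DB", 35),
    Span("abc123", "span_06", "span_03", "Email Service", 150),
    Span("abc123", "span_07", "span_03", "Push Notification", 25),
]


def slowest_path(spans: List[Span]) -> List[str]:
    """Follow the longest-running child at each level, starting from the root."""
    children = {}
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children.setdefault(span.parent_span_id, []).append(span)
    path = []
    current = root
    while current is not None:
        path.append(current.name)
        kids = children.get(current.span_id, [])
        current = max(kids, key=lambda s: s.duration_ms) if kids else None
    return path
```

For this trace the walk ends at the Stripe API call, which accounts for 580 ms of the 850 ms total.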
Tracing Protocols
Context propagation — passing trace/span IDs across service boundaries — requires standardized header formats. The industry has converged on W3C Trace Context, but you'll still encounter legacy formats.
| Protocol | Headers | Status | Notes |
|---|---|---|---|
| W3C Trace Context | traceparent, tracestate | W3C Standard (recommended) | Universal standard, all modern SDKs default to this |
| B3 (Zipkin) | X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId | Legacy (still widely used) | Zipkin ecosystem, Istio service mesh |
| B3 Single | b3 (single header) | Legacy compact | Compressed single-header variant of B3 |
| Jaeger | uber-trace-id | Deprecated | Jaeger-specific, migrating to W3C |
| AWS X-Ray | X-Amzn-Trace-Id | AWS-specific | Required within AWS services |
# W3C Trace Context header format
# traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# version: 00 (current)
# trace-id: 32 hex chars (16 bytes)
# parent-span-id: 16 hex chars (8 bytes)
# trace-flags: 01 = sampled, 00 = not sampled
# tracestate carries vendor-specific data
tracestate: grafana=t:1,congo=t:456
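In practice an OpenTelemetry propagator handles this header for you, but parsing it by hand makes the layout concrete. The sketch below validates a version `00` traceparent value; the `TraceParent` type and `parse_traceparent` function are hypothetical names, and the strict pattern deliberately rejects anything that does not have exactly four fields.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of trace-flags is the sampled flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A service that receives an invalid header should start a new trace instead of failing the request, which is why the function returns `None` rather than raising.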
Best Practices for Distributed Tracing
For tail-based sampling in Grafana Alloy, use the otelcol.processor.tail_sampling component.
Essential tracing best practices:
- Name spans semantically: use HTTP GET /api/users/{id}, not HTTP GET /api/users/abc123 (high cardinality)
- Add meaningful attributes: include http.status_code, db.system, rpc.method using OpenTelemetry semantic conventions
- Record errors properly: set span status to ERROR and attach exception events with stack traces
- Propagate context everywhere: HTTP headers, gRPC metadata, message queue headers, even background jobs
- Use span links for async flows: when a message consumer processes work triggered by a producer, link the consumer span back to the producer's trace
- Set resource attributes: service.name, service.version, deployment.environment on every span for filtering
A team noticed p99 checkout latency spiking to 12 seconds (SLO: 3s). Metrics showed the payment-service was slow, but which downstream call? Distributed tracing revealed that the Stripe API span was taking 8-10 seconds during peak hours due to rate limiting. The fix: implement a token bucket with retry backoff. Without tracing, they might have optimized the wrong service for weeks.
Using Libraries to Instrument Efficiently
OpenTelemetry (OTel) has become the industry standard for application instrumentation. It provides a single, vendor-neutral API and SDK for emitting metrics, logs, and traces. The key advantage: instrument once, send to any backend (Grafana, Datadog, New Relic, Jaeger) by changing only the exporter configuration.
flowchart LR
subgraph App["Application Code"]
A1[Your Code] --> A2[OTel API]
end
subgraph SDK["OTel SDK"]
A2 --> B1[TracerProvider]
A2 --> B2[MeterProvider]
A2 --> B3[LoggerProvider]
B1 --> C1[SpanProcessor]
B2 --> C2[MetricReader]
B3 --> C3[LogRecordProcessor]
end
subgraph Export["Exporters"]
C1 --> D1[OTLP Exporter]
C2 --> D1
C3 --> D1
end
D1 --> E[Grafana Alloy<br/>or OTel Collector]
E --> F1[Mimir]
E --> F2[Loki]
E --> F3[Tempo]
Go
Go has first-class OpenTelemetry support with minimal overhead. The SDK is production-ready and widely deployed.
package main
import (
"context"
"log"
"net/http"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
"go.opentelemetry.io/otel/trace"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
func initTracer() (*sdktrace.TracerProvider, error) {
ctx := context.Background()
// Create OTLP exporter (sends to Grafana Alloy or OTel Collector)
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("localhost:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
// Define service resource attributes
res, _ := resource.Merge(
resource.Default(),
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("checkout-service"),
semconv.ServiceVersion("1.2.0"),
semconv.DeploymentEnvironment("production"),
),
)
// Create TracerProvider with batch export
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1), // Sample 10% of new traces
)),
)
otel.SetTracerProvider(tp)
return tp, nil
}
func checkoutHandler(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
tracer := otel.Tracer("checkout")
// Create a custom span for business logic
ctx, span := tracer.Start(ctx, "process-checkout",
trace.WithAttributes(
attribute.String("user.id", "usr_abc123"),
attribute.Float64("order.total", 149.99),
),
)
defer span.End()
// Simulate processing
time.Sleep(50 * time.Millisecond)
span.AddEvent("payment-validated")
w.WriteHeader(http.StatusOK)
}
func main() {
tp, err := initTracer()
if err != nil {
log.Fatal(err)
}
defer tp.Shutdown(context.Background())
// Wrap handler with automatic HTTP instrumentation
handler := otelhttp.NewHandler(
http.HandlerFunc(checkoutHandler), "POST /checkout",
)
http.Handle("/checkout", handler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
Python
Python's OpenTelemetry SDK supports auto-instrumentation for popular frameworks (Flask, Django, FastAPI, SQLAlchemy, requests, etc.) with zero code changes.
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from flask import Flask
# Configure resource (identifies this service)
resource = Resource.create({
SERVICE_NAME: "order-service",
SERVICE_VERSION: "2.1.0",
"deployment.environment": "production",
})
# Configure tracing: spans are batched by a processor, then sent by the exporter
trace_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)
# Configure metrics
metric_exporter = OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=30000)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
# Create Flask app with auto-instrumentation
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument() # Auto-instruments outgoing HTTP calls
# Custom metrics
meter = metrics.get_meter("order-service")
order_counter = meter.create_counter("orders_processed_total", description="Total orders")
order_value = meter.create_histogram("order_value_dollars", description="Order value in USD")
# Custom tracing
tracer = trace.get_tracer("order-service")
@app.route("/api/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("validate-order") as span:
        span.set_attribute("order.items_count", 3)
        # Validation logic...
    order_counter.add(1, {"status": "completed", "payment_method": "card"})
    order_value.record(149.99, {"currency": "USD"})
    return {"status": "created"}, 201

if __name__ == "__main__":
    app.run(port=8080)
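For zero-code instrumentation, use the opentelemetry-instrument command: opentelemetry-instrument --service_name order-service flask run. Provided the matching instrumentation packages are installed, this auto-instruments Flask, database drivers, HTTP clients, and more.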
Java
Java has the most mature auto-instrumentation via the OpenTelemetry Java Agent — a single JAR that attaches at startup and instruments 100+ libraries (Spring Boot, JDBC, gRPC, Kafka, etc.) without any code changes.
# Download the OpenTelemetry Java Agent
curl -L -o opentelemetry-javaagent.jar \
https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Run your application with the agent attached
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=inventory-service \
-Dotel.exporter.otlp.protocol=grpc \
-Dotel.exporter.otlp.endpoint=http://localhost:4317 \
-Dotel.metrics.exporter=otlp \
-Dotel.logs.exporter=otlp \
-Dotel.traces.sampler=parentbased_traceidratio \
-Dotel.traces.sampler.arg=0.1 \
-jar my-application.jar
// Manual instrumentation for custom business spans (Spring Boot)
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import org.springframework.web.bind.annotation.*;
@RestController
@RequestMapping("/api/inventory")
public class InventoryController {
private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");
private final Meter meter = GlobalOpenTelemetry.getMeter("inventory-service");
private final LongCounter stockChecks = meter.counterBuilder("inventory_stock_checks_total")
.setDescription("Total stock availability checks")
.build();
@GetMapping("/{sku}")
public InventoryResponse checkStock(@PathVariable String sku) {
// Create a custom span for the stock check
Span span = tracer.spanBuilder("check-stock-availability")
.setAttribute("inventory.sku", sku)
.setAttribute("inventory.warehouse", "us-east-1")
.startSpan();
try {
// Business logic...
int available = queryWarehouse(sku);
span.setAttribute("inventory.available_quantity", available);
stockChecks.add(1, Attributes.builder()
.put("sku_category", "electronics")
.put("result", available > 0 ? "in_stock" : "out_of_stock")
.build());
return new InventoryResponse(sku, available);
} catch (Exception e) {
span.recordException(e);
span.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR, e.getMessage());
throw e;
} finally {
span.end();
}
}
private int queryWarehouse(String sku) {
// Simulated warehouse query
return 42;
}
}
JavaScript / Node.js
Node.js auto-instrumentation works by patching modules as they are loaded, so the SDK must be started before the application calls require() on any library you want instrumented.
// tracing.js — Load this FIRST via: node --require ./tracing.js app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'api-gateway',
[ATTR_SERVICE_VERSION]: '3.0.1',
'deployment.environment': 'production',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({ url: 'http://localhost:4317' }),
exportIntervalMillis: 30000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instruments: express, http, pg, mysql, redis, grpc, etc.
'@opentelemetry/instrumentation-fs': { enabled: false }, // Disable noisy FS spans
}),
],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
// app.js — Your Express application (auto-instrumented by tracing.js)
const express = require('express');
const { trace, metrics } = require('@opentelemetry/api');
const app = express();
const tracer = trace.getTracer('api-gateway');
const meter = metrics.getMeter('api-gateway');
// Custom metrics
const requestCounter = meter.createCounter('gateway_requests_total', {
description: 'Total requests through the API gateway',
});
const latencyHistogram = meter.createHistogram('gateway_request_duration_ms', {
description: 'Request latency in milliseconds',
});
app.get('/api/products/:id', async (req, res) => {
const start = Date.now();
// Custom span for business logic
const span = tracer.startSpan('fetch-product-details', {
attributes: { 'product.id': req.params.id },
});
try {
// Simulated product fetch
const product = { id: req.params.id, name: 'Widget', price: 29.99 };
span.setAttribute('product.name', product.name);
span.addEvent('product-fetched-from-cache');
requestCounter.add(1, { method: 'GET', route: '/api/products/:id', status: '200' });
res.json(product);
} catch (err) {
span.recordException(err);
span.setStatus({ code: 2, message: err.message }); // ERROR
res.status(500).json({ error: 'Internal error' });
} finally {
span.end();
latencyHistogram.record(Date.now() - start, { route: '/api/products/:id' });
}
});
app.listen(3000, () => console.log('API Gateway on :3000'));
.NET
.NET has excellent OpenTelemetry integration through the System.Diagnostics API and the OpenTelemetry .NET SDK. ASP.NET Core applications get rich auto-instrumentation out of the box.
# Install OpenTelemetry packages for an ASP.NET Core app
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
// Program.cs: ASP.NET Core with OpenTelemetry
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
var builder = WebApplication.CreateBuilder(args);
// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource => resource
.AddService(
serviceName: "catalog-service",
serviceVersion: "1.5.0"))
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
options.SetDbStatementForText = true)
.AddOtlpExporter(options =>
options.Endpoint = new Uri("http://localhost:4317")))
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddOtlpExporter(options =>
options.Endpoint = new Uri("http://localhost:4317")));
var app = builder.Build();
app.MapGet("/api/catalog/{id}", (string id) => {
// Auto-instrumented: HTTP span, SQL spans, outgoing HTTP spans
return Results.Ok(new { Id = id, Name = "Product", Price = 49.99 });
});
app.Run();
Infrastructure Data Technologies
Beyond application code, your infrastructure components (servers, containers, networks, databases) generate critical telemetry. Infrastructure monitoring completes the observability picture — when an application is slow, is it the code or the underlying hardware/platform?
Common Infrastructure Components
flowchart TD
subgraph Host["Host Layer"]
H1[node_exporter<br/>CPU, Memory, Disk, Network]
H2[Windows Exporter<br/>Windows performance counters]
end
subgraph Container["Container Layer"]
C1[cAdvisor<br/>Container resource usage]
C2[kube-state-metrics<br/>Kubernetes object states]
C3[kubelet metrics<br/>Pod lifecycle]
end
subgraph Network["Network Layer"]
N1[SNMP Exporter<br/>Network devices]
N2[Blackbox Exporter<br/>Endpoint probing]
end
subgraph Data["Data Layer"]
D1[Database Exporters<br/>PostgreSQL, MySQL, MongoDB]
D2[Redis Exporter]
D3[Kafka Exporter]
end
H1 --> A[Grafana Alloy<br/>Collection Agent]
H2 --> A
C1 --> A
C2 --> A
C3 --> A
N1 --> A
N2 --> A
D1 --> A
D2 --> A
D3 --> A
A --> M[Grafana Mimir<br/>Metrics Storage]
Monitoring Standards & Tools
node_exporter is the standard for Linux host metrics. It exposes hardware and OS metrics (CPU, memory, disk I/O, filesystem, network) in Prometheus exposition format on port 9100.
# Install and run node_exporter on a Linux host
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.0.linux-amd64.tar.gz
cd node_exporter-1.8.0.linux-amd64
# Start with specific collectors enabled
./node_exporter \
--collector.systemd \
--collector.processes \
--no-collector.wifi \
--web.listen-address=":9100"
# Key metrics exposed:
# node_cpu_seconds_total — CPU time per mode (user, system, idle, iowait)
# node_memory_MemAvailable_bytes — Available memory
# node_filesystem_avail_bytes — Available disk space
# node_disk_io_time_seconds_total — Disk I/O time
# node_network_receive_bytes_total — Network bytes received
cAdvisor (Container Advisor) provides container-level resource usage and performance data. In Kubernetes, kubelet embeds cAdvisor, so these metrics are available automatically.
// Grafana Alloy configuration to scrape infrastructure metrics
// Scrape node_exporter for host metrics
prometheus.scrape "node" {
targets = [
{"__address__" = "node-01:9100", "instance" = "node-01", "environment" = "production"},
{"__address__" = "node-02:9100", "instance" = "node-02", "environment" = "production"},
{"__address__" = "node-03:9100", "instance" = "node-03", "environment" = "production"},
]
scrape_interval = "15s"
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Scrape cAdvisor for container metrics (Kubernetes).
// Assumes a discovery.kubernetes "cadvisor" component is defined elsewhere in this config.
prometheus.scrape "cadvisor" {
targets = discovery.kubernetes.cadvisor.targets
scrape_interval = "30s"
metrics_path = "/metrics/cadvisor"
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Scrape kube-state-metrics for Kubernetes object state
prometheus.scrape "kube_state" {
targets = [{"__address__" = "kube-state-metrics:8080"}]
scrape_interval = "30s"
forward_to = [prometheus.remote_write.mimir.receiver]
}
// SNMP monitoring for network switches
prometheus.scrape "snmp" {
targets = [
{"__address__" = "switch-core-01", "__param_module" = "if_mib"},
{"__address__" = "switch-core-02", "__param_module" = "if_mib"},
]
scrape_interval = "60s"
metrics_path = "/snmp"
params = {"target" = [""]} // Set by relabeling
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Remote write to Grafana Mimir
prometheus.remote_write "mimir" {
endpoint {
url = "http://mimir:9009/api/v1/push"
}
}
SNMP (Simple Network Management Protocol) remains the standard for monitoring network devices (switches, routers, firewalls, load balancers). The Prometheus SNMP Exporter translates SNMP OIDs into Prometheus metrics.
| Tool | Layer | Key Metrics | Collection Method |
|---|---|---|---|
| node_exporter | Host (Linux) | CPU, memory, disk, network, filesystem | Prometheus scrape (:9100) |
| Windows Exporter | Host (Windows) | CPU, memory, disk, IIS, .NET CLR | Prometheus scrape (:9182) |
| cAdvisor | Container | Container CPU, memory, network, disk I/O | Embedded in kubelet |
| kube-state-metrics | Kubernetes | Pod status, deployment replicas, node conditions | Prometheus scrape (:8080) |
| SNMP Exporter | Network | Interface traffic, errors, device status | SNMP polling → Prometheus |
| Blackbox Exporter | Endpoint | HTTP status, DNS resolution, TCP connect, TLS expiry | Active probing |
| Database Exporters | Data | Connections, queries/sec, replication lag, cache hit ratio | Prometheus scrape (varies) |
In Kubernetes, use ServiceMonitor and PodMonitor CRDs (from the Prometheus Operator) to automatically discover and scrape new services. For Grafana Alloy, use its discovery.kubernetes component for dynamic target discovery.
Summary & Next Steps
In this second part of the Grafana Deep Dive track, we covered the full spectrum of application and infrastructure instrumentation:
- Log formats: Progress from unstructured to structured JSON logging for maximum queryability in Loki
- Metric types: Counters, gauges, histograms, and summaries — each with specific use cases and cardinality considerations
- Metric protocols: Prometheus (pull), OTLP (push), StatsD — choose based on your architecture
- Distributed tracing: W3C Trace Context as the standard, with sampling strategies for production
- OpenTelemetry: Single API/SDK across Go, Python, Java, Node.js, and .NET with auto-instrumentation
- Infrastructure: node_exporter, cAdvisor, kube-state-metrics, SNMP, and Blackbox Exporter for full-stack visibility
Next in the Grafana Track
In Part 3: Setting Up a Learning Environment, we'll build a complete local Grafana stack using Docker Compose — Mimir, Loki, Tempo, Alloy, and Grafana — with a sample application that generates all three telemetry types for hands-on practice.