Grafana Deep Dive Part 2: Instrumenting Applications & Infrastructure

June 15, 2026 Wasil Zafar 28 min read

A practitioner's guide to generating high-quality telemetry — from choosing log formats and metric types to implementing distributed tracing with OpenTelemetry, selecting the right instrumentation libraries, and monitoring infrastructure components at scale.

Table of Contents

  1. Common Log Formats
  2. Metric Types & Best Practices
  3. Tracing Protocols & Best Practices
  4. Using Libraries to Instrument Efficiently
  5. Infrastructure Data Technologies
  6. Summary & Next Steps

Common Log Formats

Logs are the most universal form of telemetry — every application produces them. Yet the format you choose determines how effectively you can search, filter, alert on, and correlate log data in systems like Grafana Loki. Understanding the spectrum from unstructured to fully structured logging is the first step toward effective instrumentation.

Log Format Maturity Spectrum
flowchart LR
    A[Unstructured<br/>Free-form text] --> B[Semi-Structured<br/>Consistent patterns]
    B --> C[Structured<br/>Machine-parseable]
    C --> D[Contextualized<br/>Correlated with<br/>traces & metrics]
    style A fill:#f8d7da,stroke:#dc3545
    style B fill:#fff3cd,stroke:#ffc107
    style C fill:#d1ecf1,stroke:#17a2b8
    style D fill:#d4edda,stroke:#28a745

Unstructured Logs

Unstructured logs are free-form text strings written to stdout, files, or syslog. They are human-readable but require regex parsing or pattern matching to extract meaningful fields. Legacy applications, third-party software, and quick debug statements typically produce unstructured logs.

# Examples of unstructured log output
ERROR Connection timeout to database at 10.0.1.5:5432 after 30s
Starting server on port 8080...
User john.doe logged in from 192.168.1.100
WARN: Disk usage at 87% on /dev/sda1
Payment processed successfully for order #12345 ($149.99)

Challenges with unstructured logs:

  • No consistent schema — every developer formats differently
  • Expensive regex-based parsing at query time
  • Difficult to aggregate or create alerts on specific fields
  • Impossible to correlate with traces without manual effort
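
To make the parsing cost concrete, here is a minimal Python sketch that extracts fields from two of the sample lines above. The patterns and field names (`host`, `port`, `timeout_s`, `percent`, `device`) are illustrative assumptions rather than any standard schema. Each message shape needs its own hand-written pattern, and lines that match none yield nothing.

```python
import re
from typing import Optional

# One hand-written pattern per message shape (hypothetical field names).
DB_TIMEOUT = re.compile(
    r"^(?P<level>[A-Z]+) Connection timeout to database at "
    r"(?P<host>[\d.]+):(?P<port>\d+) after (?P<timeout_s>\d+)s$"
)
DISK_USAGE = re.compile(
    r"^(?P<level>[A-Z]+): Disk usage at (?P<percent>\d+)% on (?P<device>\S+)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the extracted fields, or None when no pattern matches."""
    for pattern in (DB_TIMEOUT, DISK_USAGE):
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None


print(parse_line("ERROR Connection timeout to database at 10.0.1.5:5432 after 30s"))
print(parse_line("WARN: Disk usage at 87% on /dev/sda1"))
print(parse_line("Starting server on port 8080..."))  # matches no pattern
```

Every extracted value is still a string, so numeric comparisons need a further conversion step.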

Semi-Structured Logs

Semi-structured logs follow a consistent pattern (timestamp, level, source, message) but aren't machine-parseable without format-specific parsers. Common formats include Apache/Nginx access logs, syslog (RFC 5424), and custom formats using logging frameworks with configured patterns.

# Apache Combined Log Format
192.168.1.100 - john [15/Jun/2026:14:30:02 +0000] "GET /api/users HTTP/1.1" 200 1234 "https://app.example.com" "Mozilla/5.0"

# Syslog RFC 5424
<165>1 2026-06-15T14:30:02.341Z app-server-01 myapp 1234 ID47 - Connection pool exhausted, waiting for available connection

# Log4j pattern layout
2026-06-15 14:30:02,341 [http-thread-42] ERROR com.example.UserService - Failed to authenticate user: timeout after 5000ms

# Python standard logging
2026-06-15 14:30:02,341 - myapp.auth - ERROR - Authentication failed for user_id=abc123 reason=token_expired

Loki Tip: Semi-structured logs work well with Loki's | pattern and | regexp parser expressions in LogQL. You can extract fields at query time without pre-processing, though this is slower than filtering on indexed labels.
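
Because the layout is fixed, a single format-specific parser covers every line. The sketch below converts the Apache Combined Log Format sample above into a structured record. The output field names are an assumption chosen for illustration, and the timestamp parsing assumes an English locale for the month abbreviation.

```python
import json
import re
from datetime import datetime

# Pattern for the Apache Combined Log Format; group names are illustrative.
COMBINED = re.compile(
    r'^(?P<client_ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def combined_to_record(line: str) -> dict:
    """Parse one access-log line into a dict with typed fields."""
    match = COMBINED.match(line)
    if match is None:
        raise ValueError("line is not in Combined Log Format")
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    # Convert the Apache timestamp into ISO 8601
    ts = datetime.strptime(record.pop("time"), "%d/%b/%Y:%H:%M:%S %z")
    record["timestamp"] = ts.isoformat()
    return record


line = (
    '192.168.1.100 - john [15/Jun/2026:14:30:02 +0000] '
    '"GET /api/users HTTP/1.1" 200 1234 '
    '"https://app.example.com" "Mozilla/5.0"'
)
print(json.dumps(combined_to_record(line), indent=2))
```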

Structured Logs

Structured logs emit each event as a machine-parseable record (typically JSON) with typed fields. This is the gold standard for observability: every field is immediately queryable, aggregatable, and correlatable. Most modern logging libraries support structured output, though many still need it enabled explicitly.

{
  "timestamp": "2026-06-15T14:30:02.341Z",
  "level": "error",
  "service": "payment-service",
  "version": "2.4.1",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Payment processing failed",
  "error.type": "TimeoutException",
  "error.message": "Gateway timeout after 30000ms",
  "payment.order_id": "ORD-12345",
  "payment.amount": 149.99,
  "payment.currency": "USD",
  "payment.provider": "stripe",
  "user.id": "usr_abc123",
  "http.method": "POST",
  "http.url": "/api/v2/payments",
  "http.status_code": 504,
  "duration_ms": 30042
}

Here's how to emit structured JSON logs in Python using the structlog library:

import structlog
import logging

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)

# Create a logger with bound context
logger = structlog.get_logger()
log = logger.bind(service="payment-service", version="2.4.1")

# Log with rich context — all fields become queryable in Loki
log.error(
    "Payment processing failed",
    error_type="TimeoutException",
    order_id="ORD-12345",
    amount=149.99,
    currency="USD",
    duration_ms=30042,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
)

Choosing a Log Format

| Aspect | Unstructured | Semi-Structured | Structured (JSON) |
| --- | --- | --- | --- |
| Human Readability | Excellent | Good | Moderate (verbose) |
| Machine Parseability | Poor (regex required) | Moderate (format-specific) | Excellent (native) |
| Query Performance | Slow (full text scan) | Moderate | Fast (field extraction) |
| Storage Efficiency | Compact | Moderate | Verbose (keys repeated) |
| Trace Correlation | Manual effort | Possible with parsing | Native (trace_id field) |
| Alerting Capability | Pattern matching only | Limited field extraction | Full field-level alerting |
| Best For | Legacy apps, debugging | Web servers, syslog | Microservices, cloud-native |

Common Mistake: Don't log sensitive data (passwords, API keys, PII) in structured fields. Use a redaction processor in your logging pipeline, or a replace stage in Grafana Alloy's loki.process component, to strip sensitive values before ingestion.
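
A redaction processor can be as small as the sketch below. It follows the structlog processor signature `(logger, method_name, event_dict)`. The key list and the `[REDACTED]` placeholder are assumptions for illustration, and nested dictionaries are not inspected.

```python
# Keys treated as sensitive; a hypothetical list, extend it for your own schema.
SENSITIVE_KEYS = {"password", "api_key", "authorization", "card_number"}


def redact_sensitive(logger, method_name, event_dict):
    """structlog-style processor that masks values of sensitive keys."""
    for key in event_dict:
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict


# Called directly here for demonstration; no structlog import is needed.
event = redact_sensitive(
    None,
    "info",
    {"event": "login", "user_id": "usr_abc123", "password": "hunter2"},
)
print(event)
```

In a structlog configuration, a processor like this would sit in the `processors` list ahead of `JSONRenderer`, so values are masked before the event is serialized.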

Metric Types & Best Practices

Metrics are the most cost-effective telemetry type — a single time series costs the same whether your service handles 10 or 10 million requests per second. Understanding the four fundamental metric types and when to use each is essential for effective monitoring.

Counters

A counter is a cumulative metric that only goes up (or resets to zero on restart). Use counters for things you want to count: requests, errors, bytes transferred, tasks completed.

Key Rule: Never use a counter's raw value for alerting. Always apply rate() or increase() in PromQL. The raw counter value is meaningless without knowing the time window — "5000 errors" means nothing, but "200 errors/minute" is actionable.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Define a counter with labels for method and status
var httpRequestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests processed",
    },
    []string{"method", "handler", "status_code"},
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Process request...
    httpRequestsTotal.WithLabelValues(r.Method, "/api/users", "200").Inc()
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/api/users", handleRequest)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
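
The Key Rule above is easier to see with numbers. This Python sketch computes a simplified `increase()` and `rate()` over raw counter samples, including the reset handling that makes raw values unusable. It is an approximation for illustration: PromQL also extrapolates to the edges of the query window, which this omits, and the sample values are invented.

```python
def increase(samples):
    """Total growth of a counter. samples: [(timestamp_s, value)], oldest first."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so the whole current value counts as new growth.
        total += curr if curr < prev else curr - prev
    return total


def rate_per_second(samples):
    """Average per-second growth across the sampled window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0


# Scrapes every 15s; the service restarts between t=30 and t=45.
samples = [(0, 4800), (15, 4850), (30, 4900), (45, 20), (60, 70)]
print(increase(samples))         # growth over the minute, despite the reset
print(rate_per_second(samples))  # per-second rate over the same window
```

The last raw value (70) is lower than the first (4800), yet the counter grew by 170 over the window, which is why alerts should be built on the rate.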

Gauges

A gauge represents a point-in-time value that can go up or down: temperature, queue depth, active connections, memory usage, number of goroutines. Gauges are ideal for resource utilization monitoring and capacity planning.

from prometheus_client import Gauge, start_http_server
import psutil
import time

# Define gauges for system resources
cpu_usage_percent = Gauge(
    'system_cpu_usage_percent',
    'Current CPU usage as a percentage',
    ['cpu']
)

memory_usage_bytes = Gauge(
    'system_memory_usage_bytes',
    'Current memory usage in bytes',
    ['type']
)

active_connections = Gauge(
    'app_active_connections',
    'Number of currently active client connections',
    ['pool']
)

def collect_system_metrics():
    """Collect system metrics and update gauges."""
    # CPU per-core usage
    for i, percent in enumerate(psutil.cpu_percent(percpu=True)):
        cpu_usage_percent.labels(cpu=f"cpu{i}").set(percent)

    # Memory breakdown
    mem = psutil.virtual_memory()
    memory_usage_bytes.labels(type="used").set(mem.used)
    memory_usage_bytes.labels(type="available").set(mem.available)
    # psutil reports 'cached' only on Linux and BSD, so guard the lookup
    if hasattr(mem, "cached"):
        memory_usage_bytes.labels(type="cached").set(mem.cached)

if __name__ == '__main__':
    start_http_server(8080)
    while True:
        collect_system_metrics()
        time.sleep(15)

Histograms

Histograms sample observations (usually durations or sizes) and count them in configurable buckets. They enable percentile calculations server-side using PromQL's histogram_quantile(). Histograms are the preferred type for latency measurement in Prometheus-based systems.

package main

import (
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Define a histogram with custom buckets for HTTP latency
var httpDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds",
        Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
    },
    []string{"method", "handler", "status_code"},
)

func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // Simulate work with variable latency
    time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
    w.WriteHeader(http.StatusOK)

    // Observe the duration
    duration := time.Since(start).Seconds()
    httpDuration.WithLabelValues(r.Method, "/api/orders", "200").Observe(duration)
}

func main() {
    http.HandleFunc("/api/orders", instrumentedHandler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Bucket Selection: Choose histogram buckets based on your SLO targets. If your p99 SLO is 500ms, include buckets at 0.1, 0.25, 0.5, 1.0, and 2.5 seconds. Too few buckets reduce accuracy; too many increase cardinality. The Prometheus default buckets (.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10) work for most HTTP services.
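
Bucket boundaries matter because quantiles are estimated by interpolating inside a bucket. The sketch below approximates how `histogram_quantile()` does this, using linear interpolation over cumulative bucket counts. It is a simplified model rather than the Prometheus source, and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: [(upper_bound_seconds, cumulative_count)], sorted, ending with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper edge to interpolate toward: report the last finite bound
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.9, 0.99):
    print(q, round(histogram_quantile(q, buckets), 3))
```

With these counts the p99 estimate lands on the 1.0s boundary. Without a bucket edge at 0.5s, the histogram could not say whether a 500ms SLO was met.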

Summaries

Summaries calculate streaming quantiles on the client side. Unlike histograms, their accuracy does not depend on bucket boundaries (each quantile is tracked within a configurable error margin), but the resulting quantiles cannot be aggregated across instances. Use summaries only when you need precise quantiles from a single instance and cannot use histograms.

| Feature | Histogram | Summary |
| --- | --- | --- |
| Quantile Calculation | Server-side (PromQL) | Client-side (streaming) |
| Aggregation | Aggregatable across instances | Not aggregatable |
| Accuracy | Depends on bucket boundaries | Configurable error margin |
| Cost | One time series per bucket | One time series per quantile |
| Recommendation | Preferred for most use cases | Only for single-instance precision |
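
The aggregation row is the one that most often surprises people. The sketch below uses invented latencies for two instances to show that averaging per-instance p99 values gives a different answer from the p99 of the combined traffic, whereas histogram bucket counts can simply be added. The nearest-rank `percentile` helper and the bucket bounds are assumptions for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic (pct is 1..100)."""
    ordered = sorted(values)
    index = (pct * len(ordered) + 99) // 100 - 1
    return ordered[max(index, 0)]


instance_a = [0.05] * 990 + [2.0] * 10  # busy instance, mostly fast
instance_b = [1.5] * 10                 # quiet instance, all slow

avg_of_p99 = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2
true_p99 = percentile(instance_a + instance_b, 99)
print(avg_of_p99, true_p99)  # the two numbers disagree

# Histogram buckets are plain counts, so merging instances is addition.
BOUNDS = [0.1, 0.5, 1.0, 2.5]


def bucket_counts(values):
    """Cumulative count of observations at or below each bound."""
    return [sum(1 for v in values if v <= bound) for bound in BOUNDS]


merged = [a + b for a, b in zip(bucket_counts(instance_a), bucket_counts(instance_b))]
print(merged == bucket_counts(instance_a + instance_b))
```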

Metric Protocols

Multiple protocols exist for transmitting metrics from applications to backends. Your choice depends on the ecosystem, existing infrastructure, and whether you want push or pull semantics.

Metric Protocol Landscape
flowchart TD
    subgraph Pull["Pull-Based (Scrape)"]
        P1[Prometheus Exposition<br/>Format]
        P2[OpenMetrics]
    end
    subgraph Push["Push-Based"]
        Q1[OTLP<br/>OpenTelemetry Protocol]
        Q2[StatsD / DogStatsD]
        Q3[Graphite Plaintext]
        Q4[InfluxDB Line Protocol]
    end
    subgraph Backends["Storage Backends"]
        B1[Grafana Mimir]
        B2[Prometheus]
        B3[Datadog]
        B4[InfluxDB]
    end
    P1 --> B1
    P1 --> B2
    P2 --> B1
    Q1 --> B1
    Q1 --> B2
    Q2 --> B3
    Q3 --> B4
    Q4 --> B4

| Protocol | Model | Format | Best For |
| --- | --- | --- | --- |
| Prometheus Exposition | Pull (scrape) | Text/protobuf | Kubernetes services, long-running processes |
| OTLP (OpenTelemetry) | Push (gRPC/HTTP) | Protobuf | Vendor-neutral, multi-signal (metrics + logs + traces) |
| StatsD | Push (UDP/TCP) | Plaintext | Low-overhead fire-and-forget, legacy apps |
| DogStatsD | Push (UDP) | Extended StatsD | Datadog ecosystem, tags support |
| OpenMetrics | Pull (scrape) | Text/protobuf | Standardized evolution of the Prometheus format, exemplar support |

Cardinality Warning: Unbounded label cardinality is one of the most common causes of metric system failure. Never use user IDs, request IDs, email addresses, or timestamps as metric label values. Each unique label combination creates a new time series, so a metric with a {user_id="..."} label across 1M users creates at least 1M time series, enough to exhaust Prometheus memory or sharply inflate your Mimir costs.
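
The arithmetic behind the warning is a simple product. This sketch estimates the worst-case series count for a metric from the number of distinct values each label can take. The label counts are invented, and the result is an upper bound that assumes every combination actually occurs.

```python
from math import prod


def series_count(label_values: dict) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sizes for the http_requests_total counter shown earlier
bounded = {"method": 5, "handler": 20, "status_code": 8}
print(series_count(bounded))

# Adding one unbounded label multiplies everything by the number of users
with_user_id = {**bounded, "user_id": 1_000_000}
print(series_count(with_user_id))

# A histogram adds a series per bucket (11 explicit + the +Inf bucket),
# plus the _sum and _count series, for each label combination
histogram_series = series_count(bounded) * (11 + 1 + 2)
print(histogram_series)
```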

Tracing Protocols & Best Practices

Distributed tracing follows a single request as it traverses multiple services, showing exactly where time is spent and where failures occur. While metrics tell you something is slow and logs tell you what happened, traces tell you where in the call chain the problem lives.

Spans and Traces

A trace represents the entire journey of a request through a distributed system. It is composed of one or more spans — each span representing a unit of work (an HTTP call, a database query, a message publish). Spans form a directed acyclic graph (DAG) with parent-child relationships.

Anatomy of a Distributed Trace
flowchart TD
    A["Root Span: POST /api/checkout<br/>trace_id: abc123<br/>duration: 850ms"] --> B["Span: Validate Cart<br/>span_id: span_01<br/>duration: 45ms"]
    A --> C["Span: Process Payment<br/>span_id: span_02<br/>duration: 620ms"]
    A --> D["Span: Send Confirmation<br/>span_id: span_03<br/>duration: 180ms"]
    C --> E["Span: Stripe API Call<br/>span_id: span_04<br/>duration: 580ms"]
    C --> F["Span: Update DB<br/>span_id: span_05<br/>duration: 35ms"]
    D --> G["Span: Email Service<br/>span_id: span_06<br/>duration: 150ms"]
    D --> H["Span: Push Notification<br/>span_id: span_07<br/>duration: 25ms"]

Each span carries essential metadata:

  • Trace ID — unique identifier shared by all spans in a trace
  • Span ID — unique identifier for this specific span
  • Parent Span ID — links child spans to their parent
  • Operation Name — describes the work performed
  • Start/End Timestamps — precise timing
  • Attributes — key-value metadata (http.method, db.statement, etc.)
  • Status — OK, ERROR, or UNSET
  • Events — timestamped annotations within a span (exceptions, log entries)
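
The parent-child links are all a backend needs to rebuild the call tree. The sketch below models a reduced span (ID, parent ID, name, duration) with the values from the checkout diagram above and walks the slowest child at each level. The `Span` class is a simplification for illustration, not the OpenTelemetry data model, and the root span ID `root` is an assumption because the diagram does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


spans = [
    Span("root", None, "POST /api/checkout", 850),
    Span("span_01", "root", "Validate Cart", 45),
    Span("span_02", "root", "Process Payment", 620),
    Span("span_03", "root", "Send Confirmation", 180),
    Span("span_04", "span_02", "Stripe API Call", 580),
    Span("span_05", "span_02", "Update DB", 35),
    Span("span_06", "span_03", "Email Service", 150),
    Span("span_07", "span_03", "Push Notification", 25),
]


def children_of(parent_id):
    return [s for s in spans if s.parent_id == parent_id]


def render(span, depth=0):
    """Return the subtree as indented lines."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms}ms)"]
    for child in children_of(span.span_id):
        lines.extend(render(child, depth + 1))
    return lines


def slowest_path(span):
    """Follow the slowest child at each level to see where the time goes."""
    path = [span.name]
    kids = children_of(span.span_id)
    if kids:
        path += slowest_path(max(kids, key=lambda s: s.duration_ms))
    return path


root = children_of(None)[0]
print("\n".join(render(root)))
print(" -> ".join(slowest_path(root)))
```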

Tracing Protocols

Context propagation — passing trace/span IDs across service boundaries — requires standardized header formats. The industry has converged on W3C Trace Context, but you'll still encounter legacy formats.

| Protocol | Headers | Status | Notes |
| --- | --- | --- | --- |
| W3C Trace Context | traceparent, tracestate | W3C Standard (recommended) | Universal standard, all modern SDKs default to this |
| B3 (Zipkin) | X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId | Legacy (still widely used) | Zipkin ecosystem, Istio service mesh |
| B3 Single | b3 (single header) | Legacy compact | Compressed single-header variant of B3 |
| Jaeger | uber-trace-id | Deprecated | Jaeger-specific, migrating to W3C |
| AWS X-Ray | X-Amzn-Trace-Id | AWS-specific | Required within AWS services |

# W3C Trace Context header format
# traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# version: 00 (current)
# trace-id: 32 hex chars (16 bytes)
# parent-span-id: 16 hex chars (8 bytes)
# trace-flags: 01 = sampled, 00 = not sampled

# tracestate carries vendor-specific data
tracestate: grafana=t:1,congo=t:456
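
For a sense of what an SDK does with the header on every incoming request, here is a minimal parser for the version `00` layout described above. It is a sketch, not a complete implementation of the W3C specification: it ignores `tracestate` and rejects anything that does not match the four-field form exactly.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase only
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
print(parse_traceparent("not-a-valid-header"))
```

Returning `None` for a malformed header lets the caller start a fresh trace instead of propagating bad context.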

Best Practices for Distributed Tracing

Sampling Strategy: Don't trace 100% of requests in production. Use head-based sampling (decide at the edge) at 1-10% for normal traffic, combined with tail-based sampling (keep all traces with errors or high latency) for debugging. Head-based sampling is usually configured in the SDK's sampler; Grafana Alloy handles the tail-based side via its otelcol.processor.tail_sampling component.
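
The two strategies differ in when the decision is made and what information is available. The sketch below shows both in simplified form: a head decision derived deterministically from the trace ID, and a tail decision made once all spans are known. The function names, the span dictionary shape, and the use of the low 64 bits of the trace ID are assumptions for illustration, not the algorithm of any particular SDK or of Grafana Alloy.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide up front, before any span has finished.

    Deriving the decision from the trace ID means every service that sees
    the same trace makes the same choice.
    """
    value = int(trace_id_hex[-16:], 16)  # low 64 bits as a uniform number
    return value < ratio * (1 << 64)


def tail_keep(spans, latency_threshold_ms: float) -> bool:
    """Decide after the trace is complete: keep errors and slow traces."""
    has_error = any(s["status"] == "ERROR" for s in spans)
    slowest = max(s["duration_ms"] for s in spans)
    return has_error or slowest > latency_threshold_ms


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, 0.1))

fast_ok = [{"status": "OK", "duration_ms": 120}]
slow_ok = [{"status": "OK", "duration_ms": 3200}]
failed = [{"status": "OK", "duration_ms": 120}, {"status": "ERROR", "duration_ms": 30}]
print(tail_keep(fast_ok, 1000), tail_keep(slow_ok, 1000), tail_keep(failed, 1000))
```

A tail decision requires buffering every span of a trace until it completes, which is why it normally runs in a collector rather than in the application.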

Essential tracing best practices:

  • Name spans semantically — use HTTP GET /api/users/{id} not HTTP GET /api/users/abc123 (high cardinality)
  • Add meaningful attributes — include http.status_code, db.system, rpc.method using OpenTelemetry semantic conventions
  • Record errors properly — set span status to ERROR and attach exception events with stack traces
  • Propagate context everywhere — HTTP headers, gRPC metadata, message queue headers, even background jobs
  • Use span links for async flows — when a message consumer processes work triggered by a producer, link the consumer span back to the producer's trace
  • Set resource attributes: service.name, service.version, deployment.environment on every span for filtering
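
The first practice in the list, naming spans by route template rather than by raw path, can be sketched as a small lookup. The `ROUTES` list and the `unmatched` fallback name are hypothetical. In practice most web frameworks expose the matched route, so instrumentation libraries can read it directly.

```python
# Known route templates for the service (hypothetical examples)
ROUTES = ["/api/users/{id}", "/api/orders/{order_id}/items/{item_id}"]


def match_route(path: str):
    """Return the template whose fixed segments match the path, else None."""
    parts = path.strip("/").split("/")
    for template in ROUTES:
        template_parts = template.strip("/").split("/")
        if len(parts) == len(template_parts) and all(
            t.startswith("{") or t == p for t, p in zip(template_parts, parts)
        ):
            return template
    return None


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name from the method and route template."""
    # Fall back to one constant so unknown paths cannot inflate cardinality
    return f"HTTP {method} {match_route(path) or 'unmatched'}"


print(span_name("GET", "/api/users/abc123"))
print(span_name("GET", "/api/orders/ORD-12345/items/7"))
print(span_name("GET", "/healthz"))
```

The raw path still belongs on the span as an attribute, where high cardinality is acceptable.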

Case Study: E-Commerce Checkout Debugging

A team noticed p99 checkout latency spiking to 12 seconds (SLO: 3s). Metrics showed the payment-service was slow, but which downstream call? Distributed tracing revealed that the Stripe API span was taking 8-10 seconds during peak hours due to rate limiting. The fix: implement a token bucket with retry backoff. Without tracing, they might have optimized the wrong service for weeks.

Using Libraries to Instrument Efficiently

OpenTelemetry (OTel) has become the industry standard for application instrumentation. It provides a single, vendor-neutral API and SDK for emitting metrics, logs, and traces. The key advantage: instrument once, send to any backend (Grafana, Datadog, New Relic, Jaeger) by changing only the exporter configuration.

OpenTelemetry SDK Architecture
flowchart LR
    subgraph App["Application Code"]
        A1[Your Code] --> A2[OTel API]
    end

    subgraph SDK["OTel SDK"]
        A2 --> B1[TracerProvider]
        A2 --> B2[MeterProvider]
        A2 --> B3[LoggerProvider]
        B1 --> C1[SpanProcessor]
        B2 --> C2[MetricReader]
        B3 --> C3[LogRecordProcessor]
    end

    subgraph Export["Exporters"]
        C1 --> D1[OTLP Exporter]
        C2 --> D1
        C3 --> D1
    end

    D1 --> E[Grafana Alloy<br/>or OTel Collector]
    E --> F1[Mimir]
    E --> F2[Loki]
    E --> F3[Tempo]

Go

Go has first-class OpenTelemetry support with minimal overhead. The SDK is production-ready and widely deployed.

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    ctx := context.Background()

    // Create OTLP exporter (sends to Grafana Alloy or OTel Collector)
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // Define service resource attributes
    res, _ := resource.Merge(
        resource.Default(),
        resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("checkout-service"),
            semconv.ServiceVersion("1.2.0"),
            semconv.DeploymentEnvironment("production"),
        ),
    )

    // Create TracerProvider with batch export
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // Sample 10% of new traces
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("checkout")

    // Create a custom span for business logic
    ctx, span := tracer.Start(ctx, "process-checkout",
        trace.WithAttributes(
            attribute.String("user.id", "usr_abc123"),
            attribute.Float64("order.total", 149.99),
        ),
    )
    defer span.End()

    // Simulate processing
    time.Sleep(50 * time.Millisecond)
    span.AddEvent("payment-validated")

    w.WriteHeader(http.StatusOK)
}

func main() {
    tp, err := initTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer tp.Shutdown(context.Background())

    // Wrap handler with automatic HTTP instrumentation
    handler := otelhttp.NewHandler(
        http.HandlerFunc(checkoutHandler), "POST /checkout",
    )
    http.Handle("/checkout", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Python

Python's OpenTelemetry SDK supports auto-instrumentation for popular frameworks (Flask, Django, FastAPI, SQLAlchemy, requests, etc.) with zero code changes.

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from flask import Flask

# Configure resource (identifies this service)
resource = Resource.create({
    SERVICE_NAME: "order-service",
    SERVICE_VERSION: "2.1.0",
    "deployment.environment": "production",
})

# Configure tracing: spans are batched by the processor, then sent by the exporter
trace_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure metrics
metric_exporter = OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=30000)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Create Flask app with auto-instrumentation
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()  # Auto-instruments outgoing HTTP calls

# Custom metrics
meter = metrics.get_meter("order-service")
order_counter = meter.create_counter("orders_processed_total", description="Total orders")
order_value = meter.create_histogram("order_value_dollars", description="Order value in USD")

# Custom tracing
tracer = trace.get_tracer("order-service")

@app.route("/api/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("validate-order") as span:
        span.set_attribute("order.items_count", 3)
        # Validation logic...

    order_counter.add(1, {"status": "completed", "payment_method": "card"})
    order_value.record(149.99, {"currency": "USD"})
    return {"status": "created"}, 201

if __name__ == "__main__":
    app.run(port=8080)
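
Zero-Code Instrumentation: For Python, you can add tracing without modifying any code using the opentelemetry-instrument command: opentelemetry-instrument --service_name order-service flask run. This auto-instruments Flask, database drivers, HTTP clients, and more, provided the matching opentelemetry-instrumentation-* packages are installed (the opentelemetry-bootstrap -a install command detects and installs them).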

Java

Java has the most mature auto-instrumentation via the OpenTelemetry Java Agent — a single JAR that attaches at startup and instruments 100+ libraries (Spring Boot, JDBC, gRPC, Kafka, etc.) without any code changes.

# Download the OpenTelemetry Java Agent
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Run your application with the agent attached
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=inventory-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.metrics.exporter=otlp \
  -Dotel.logs.exporter=otlp \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar my-application.jar

// Manual instrumentation for custom business spans (Spring Boot)
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/inventory")
public class InventoryController {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");
    private final Meter meter = GlobalOpenTelemetry.getMeter("inventory-service");
    private final LongCounter stockChecks = meter.counterBuilder("inventory_stock_checks_total")
            .setDescription("Total stock availability checks")
            .build();

    @GetMapping("/{sku}")
    public InventoryResponse checkStock(@PathVariable String sku) {
        // Create a custom span for the stock check
        Span span = tracer.spanBuilder("check-stock-availability")
                .setAttribute("inventory.sku", sku)
                .setAttribute("inventory.warehouse", "us-east-1")
                .startSpan();

        try {
            // Business logic...
            int available = queryWarehouse(sku);
            span.setAttribute("inventory.available_quantity", available);
            stockChecks.add(1, Attributes.builder()
                    .put("sku_category", "electronics")
                    .put("result", available > 0 ? "in_stock" : "out_of_stock")
                    .build());

            return new InventoryResponse(sku, available);
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }

    private int queryWarehouse(String sku) {
        // Simulated warehouse query
        return 42;
    }
}

JavaScript / Node.js

Node.js auto-instrumentation works by patching modules as they are loaded, so the SDK must be registered before the application calls require() on any library you want instrumented.

// tracing.js — Load this FIRST via: node --require ./tracing.js app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'api-gateway',
    [ATTR_SERVICE_VERSION]: '3.0.1',
    'deployment.environment': 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4317' }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: express, http, pg, mysql, redis, grpc, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Disable noisy FS spans
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

// app.js: your Express application (auto-instrumented by tracing.js)
const express = require('express');
const { trace, metrics } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('api-gateway');
const meter = metrics.getMeter('api-gateway');

// Custom metrics
const requestCounter = meter.createCounter('gateway_requests_total', {
  description: 'Total requests through the API gateway',
});
const latencyHistogram = meter.createHistogram('gateway_request_duration_ms', {
  description: 'Request latency in milliseconds',
});

app.get('/api/products/:id', async (req, res) => {
  const start = Date.now();

  // Custom span for business logic
  const span = tracer.startSpan('fetch-product-details', {
    attributes: { 'product.id': req.params.id },
  });

  try {
    // Simulated product fetch
    const product = { id: req.params.id, name: 'Widget', price: 29.99 };
    span.setAttribute('product.name', product.name);
    span.addEvent('product-fetched-from-cache');

    requestCounter.add(1, { method: 'GET', route: '/api/products/:id', status: '200' });
    res.json(product);
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: 2, message: err.message }); // ERROR
    res.status(500).json({ error: 'Internal error' });
  } finally {
    span.end();
    latencyHistogram.record(Date.now() - start, { route: '/api/products/:id' });
  }
});

app.listen(3000, () => console.log('API Gateway on :3000'));

.NET

.NET has excellent OpenTelemetry integration through the System.Diagnostics API and the OpenTelemetry .NET SDK. ASP.NET Core applications get rich auto-instrumentation out of the box.

# Install OpenTelemetry packages for an ASP.NET Core app
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Instrumentation.Runtime
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

// Program.cs: ASP.NET Core with OpenTelemetry
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "catalog-service",
            serviceVersion: "1.5.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(options =>
            options.SetDbStatementForText = true)
        .AddOtlpExporter(options =>
            options.Endpoint = new Uri("http://localhost:4317")))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(options =>
            options.Endpoint = new Uri("http://localhost:4317")));

var app = builder.Build();
app.MapGet("/api/catalog/{id}", (string id) => {
    // Auto-instrumented: HTTP span, SQL spans, outgoing HTTP spans
    return Results.Ok(new { Id = id, Name = "Product", Price = 49.99 });
});
app.Run();

Infrastructure Data Technologies

Beyond application code, your infrastructure components (servers, containers, networks, databases) generate critical telemetry. Infrastructure monitoring completes the observability picture — when an application is slow, is it the code or the underlying hardware/platform?

Common Infrastructure Components

Infrastructure Monitoring Landscape
flowchart TD
    subgraph Host["Host Layer"]
        H1[node_exporter<br/>CPU, Memory, Disk, Network]
        H2[Windows Exporter<br/>Windows performance counters]
    end
    subgraph Container["Container Layer"]
        C1[cAdvisor<br/>Container resource usage]
        C2[kube-state-metrics<br/>Kubernetes object states]
        C3[kubelet metrics<br/>Pod lifecycle]
    end
    subgraph Network["Network Layer"]
        N1[SNMP Exporter<br/>Network devices]
        N2[Blackbox Exporter<br/>Endpoint probing]
    end
    subgraph Data["Data Layer"]
        D1[Database Exporters<br/>PostgreSQL, MySQL, MongoDB]
        D2[Redis Exporter]
        D3[Kafka Exporter]
    end
    H1 --> A[Grafana Alloy<br/>Collection Agent]
    H2 --> A
    C1 --> A
    C2 --> A
    C3 --> A
    N1 --> A
    N2 --> A
    D1 --> A
    D2 --> A
    D3 --> A
    A --> M[Grafana Mimir<br/>Metrics Storage]

Monitoring Standards & Tools

node_exporter is the de facto standard for Linux host metrics. It exposes hardware and OS metrics (CPU, memory, disk I/O, filesystem, network) in Prometheus exposition format, on port 9100 by default.

# Install and run node_exporter on a Linux host
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.0.linux-amd64.tar.gz
cd node_exporter-1.8.0.linux-amd64

# Start with specific collectors enabled
./node_exporter \
  --collector.systemd \
  --collector.processes \
  --no-collector.wifi \
  --web.listen-address=":9100"

# Key metrics exposed:
# node_cpu_seconds_total          — CPU time per mode (user, system, idle, iowait)
# node_memory_MemAvailable_bytes  — Available memory
# node_filesystem_avail_bytes     — Available disk space
# node_disk_io_time_seconds_total — Disk I/O time
# node_network_receive_bytes_total — Network bytes received
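
Most of these metrics are counters, so a single scrape says little on its own; what matters is the change between two scrapes. As a minimal sketch of that idea, the Python below parses two hand-written samples of exposition text and derives the busy CPU fraction from node_cpu_seconds_total. The sample values, the function names, and the simplified parser (no escaped quotes or braces inside label values) are assumptions for illustration, not part of node_exporter.

```python
import re
from collections import defaultdict

# Simplified matcher for lines such as:
#   node_cpu_seconds_total{cpu="0",mode="idle"} 1000.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape results taken 15 seconds apart (values are made up).
SCRAPE_T0 = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.0
node_cpu_seconds_total{cpu="0",mode="user"} 200.0
node_cpu_seconds_total{cpu="0",mode="system"} 100.0
node_cpu_seconds_total{cpu="1",mode="idle"} 900.0
node_cpu_seconds_total{cpu="1",mode="user"} 300.0
node_cpu_seconds_total{cpu="1",mode="system"} 100.0
node_memory_MemAvailable_bytes 4.0e+09
"""

SCRAPE_T1 = """\
node_cpu_seconds_total{cpu="0",mode="idle"} 1012.0
node_cpu_seconds_total{cpu="0",mode="user"} 202.0
node_cpu_seconds_total{cpu="0",mode="system"} 101.0
node_cpu_seconds_total{cpu="1",mode="idle"} 909.0
node_cpu_seconds_total{cpu="1",mode="user"} 305.0
node_cpu_seconds_total{cpu="1",mode="system"} 101.0
node_memory_MemAvailable_bytes 3.9e+09
"""


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def cpu_seconds_by_mode(text):
    """Sum node_cpu_seconds_total across all CPUs, grouped by mode."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "node_cpu_seconds_total":
            totals[labels["mode"]] += value
    return dict(totals)


def cpu_busy_fraction(earlier, later):
    """Fraction of CPU time spent in non-idle modes between two scrapes."""
    before = cpu_seconds_by_mode(earlier)
    after = cpu_seconds_by_mode(later)
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 1.0 - deltas.get("idle", 0.0) / total


if __name__ == "__main__":
    print(f"CPU busy: {cpu_busy_fraction(SCRAPE_T0, SCRAPE_T1):.1%}")
```

In a real deployment this arithmetic is left to the query layer; the sketch only shows why the raw counter values need to be compared across scrapes before they mean anything.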

cAdvisor (Container Advisor) provides container-level resource usage and performance data. In Kubernetes, the kubelet embeds cAdvisor and serves its metrics at /metrics/cadvisor, so no separate cAdvisor deployment is needed.

// Grafana Alloy configuration to scrape infrastructure metrics
// Scrape node_exporter for host metrics
prometheus.scrape "node" {
  targets = [
    {"__address__" = "node-01:9100", "instance" = "node-01", "environment" = "production"},
    {"__address__" = "node-02:9100", "instance" = "node-02", "environment" = "production"},
    {"__address__" = "node-03:9100", "instance" = "node-03", "environment" = "production"},
  ]
  scrape_interval = "15s"
  forward_to     = [prometheus.remote_write.mimir.receiver]
}

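// Discover cluster nodes; each node's kubelet serves the embedded cAdvisor metrics
discovery.kubernetes "cadvisor" {
  role = "node"
}

// Scrape cAdvisor for container metrics (Kubernetes).
// The kubelet endpoint is HTTPS and normally requires a service account token;
// the paths below are the usual in-cluster defaults, adjust them for your cluster.
prometheus.scrape "cadvisor" {
  targets           = discovery.kubernetes.cadvisor.targets
  scrape_interval   = "30s"
  scheme            = "https"
  metrics_path      = "/metrics/cadvisor"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  }
  forward_to        = [prometheus.remote_write.mimir.receiver]
}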

// Scrape kube-state-metrics for Kubernetes object state
prometheus.scrape "kube_state" {
  targets         = [{"__address__" = "kube-state-metrics:8080"}]
  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.mimir.receiver]
}

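// SNMP monitoring for network switches via the Prometheus SNMP Exporter.
// The relabel rules pass each switch as the ?target= parameter and point the
// scrape at the exporter itself (assumed here to run at snmp-exporter:9116).
discovery.relabel "snmp" {
  targets = [
    {"__address__" = "switch-core-01", "__param_module" = "if_mib"},
    {"__address__" = "switch-core-02", "__param_module" = "if_mib"},
  ]

  rule {
    source_labels = ["__address__"]
    target_label  = "__param_target"
  }
  rule {
    source_labels = ["__param_target"]
    target_label  = "instance"
  }
  rule {
    target_label = "__address__"
    replacement  = "snmp-exporter:9116"
  }
}

prometheus.scrape "snmp" {
  targets         = discovery.relabel.snmp.output
  scrape_interval = "60s"
  metrics_path    = "/snmp"
  forward_to      = [prometheus.remote_write.mimir.receiver]
}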

// Remote write to Grafana Mimir
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

SNMP (Simple Network Management Protocol) remains the standard for monitoring network devices (switches, routers, firewalls, load balancers). The Prometheus SNMP Exporter translates SNMP OIDs into Prometheus metrics.
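
Interface traffic from SNMP arrives as ever-increasing octet counters (for example ifHCInOctets from the if_mib module), so bandwidth has to be derived from the difference between two polls. The Python below is a minimal sketch of that calculation, including a simple guard for counter resets after a device reboot. The function names and sample values are hypothetical, and the reset handling only approximates what a query-language rate function does.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a monotonically increasing counter.

    If the counter went backwards (for example after a device reboot),
    assume it restarted from zero and count only the current value.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value >= prev_value:
        increase = curr_value - prev_value
    else:
        increase = curr_value
    return increase / interval_seconds


def octets_to_bits_per_second(prev_octets, curr_octets, interval_seconds):
    """Convert two octet-counter readings into average bits per second."""
    return counter_rate(prev_octets, curr_octets, interval_seconds) * 8


if __name__ == "__main__":
    # Two hypothetical polls of an interface counter, 60 seconds apart.
    print(octets_to_bits_per_second(1_000_000, 1_750_000, 60))  # 100000.0
    # Counter reset between polls: only the post-reset value is counted.
    print(octets_to_bits_per_second(5_000_000, 120_000, 60))    # 16000.0
```

With a 60-second scrape interval, as in the configuration above, short traffic bursts are averaged out; a shorter interval gives finer resolution at the cost of more polling load on the device.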

| Tool | Layer | Key Metrics | Collection Method |
|---|---|---|---|
| node_exporter | Host (Linux) | CPU, memory, disk, network, filesystem | Prometheus scrape (:9100) |
| Windows Exporter | Host (Windows) | CPU, memory, disk, IIS, .NET CLR | Prometheus scrape (:9182) |
| cAdvisor | Container | Container CPU, memory, network, disk I/O | Embedded in kubelet |
| kube-state-metrics | Kubernetes | Pod status, deployment replicas, node conditions | Prometheus scrape (:8080) |
| SNMP Exporter | Network | Interface traffic, errors, device status | SNMP polling → Prometheus |
| Blackbox Exporter | Endpoint | HTTP status, DNS resolution, TCP connect, TLS expiry | Active probing |
| Database Exporters | Data | Connections, queries/sec, replication lag, cache hit ratio | Prometheus scrape (varies) |

Infrastructure as Code for Monitoring: Define your scrape targets declaratively. In Kubernetes, use ServiceMonitor and PodMonitor CRDs (from the Prometheus Operator) to automatically discover and scrape new services. For Grafana Alloy, use its discovery.kubernetes component for dynamic target discovery.
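
Outside Kubernetes, one way to keep targets declarative is to generate them from an inventory rather than hand-editing scrape configs. The Python below is a sketch under that assumption: it turns a hypothetical host inventory into the JSON layout used by Prometheus file-based service discovery (a list of objects with `targets` and `labels`). The inventory contents, the function name, and the choice to group by environment are illustrative only.

```python
import json

# Hypothetical inventory; in practice this might come from a CMDB or Terraform output.
INVENTORY = [
    {"host": "node-01", "environment": "production"},
    {"host": "node-02", "environment": "production"},
    {"host": "node-03", "environment": "staging"},
]


def build_file_sd(inventory, port=9100):
    """Group hosts by environment into file-based service discovery target groups."""
    groups = {}
    for entry in inventory:
        address = f"{entry['host']}:{port}"
        groups.setdefault(entry["environment"], []).append(address)
    return [
        {"targets": sorted(targets), "labels": {"environment": environment}}
        for environment, targets in sorted(groups.items())
    ]


if __name__ == "__main__":
    # Write this output to a file that the collector watches for target changes.
    print(json.dumps(build_file_sd(INVENTORY), indent=2))
```

Keeping the inventory in version control and regenerating the file in CI means a new host shows up in monitoring as part of the same change that provisions it.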

Summary & Next Steps

In this second part of the Grafana Deep Dive track, we covered the full spectrum of application and infrastructure instrumentation:

  • Log formats: Progress from unstructured to structured JSON logging for maximum queryability in Loki
  • Metric types: Counters, gauges, histograms, and summaries — each with specific use cases and cardinality considerations
  • Metric protocols: Prometheus (pull), OTLP (push), StatsD — choose based on your architecture
  • Distributed tracing: W3C Trace Context as the standard, with sampling strategies for production
  • OpenTelemetry: Single API/SDK across Go, Python, Java, Node.js, and .NET with auto-instrumentation
  • Infrastructure: node_exporter, cAdvisor, kube-state-metrics, SNMP, and Blackbox Exporter for full-stack visibility

Next in the Grafana Track

In Part 3: Setting Up a Learning Environment, we'll build a complete local Grafana stack using Docker Compose — Mimir, Loki, Tempo, Alloy, and Grafana — with a sample application that generates all three telemetry types for hands-on practice.