
When a production incident strikes at 2 AM, the difference between a 5-minute fix and a 4-hour investigation is observability. Teams with mature observability practices can identify the affected service, trace the failing request, correlate it with a recent deployment, and roll back — all within minutes. Teams without observability start with “which server should I SSH into?” This guide covers the implementation of all three pillars: logs, metrics, and traces.

The Three Pillars

Logs: What happened

Logs are timestamped records of discrete events. They tell you what happened, when, and in what context.

Implementation checklist (a logging sketch follows the list):

  • Use structured logging (JSON) — not unstructured text lines
  • Include standard fields in every log entry: timestamp, service name, trace ID, severity, message
  • Add business context: user ID, request ID, operation type, resource identifier
  • Use appropriate log levels: ERROR (action needed), WARN (degraded but functioning), INFO (business events), DEBUG (development only, disabled in production)
  • Centralize logs from all services into a single platform
  • Set retention policies: 30 days full resolution, 90-365 days aggregated or archived
  • Implement log sampling for high-volume, low-value events (health checks, routine operations)
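
A minimal sketch of structured JSON logging using Python's standard library. The service name, field names, and example values are illustrative assumptions; in practice many teams use a library such as structlog or python-json-logger instead of a hand-rolled formatter:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the standard fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "service": "checkout-api",            # placeholder service name
            "severity": record.levelname,
            "message": record.getMessage(),
            # Business context passed via `extra=` lands on the record object
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "operation": getattr(record, "operation", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)   # DEBUG stays disabled in production

logger.info(
    "order placed",
    extra={"trace_id": "4bf92f3577b34da6", "user_id": "u-1042", "operation": "create_order"},
)
```

Each entry comes out as a single JSON line that a log shipper (Promtail, Filebeat, or a vendor agent) can forward to the central platform without custom parsing.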

Metrics: How much and how fast

Metrics are numerical measurements collected at regular intervals. They answer quantitative questions about system behavior.

Implementation checklist (an instrumentation sketch follows the list):

  • Instrument the four golden signals for every service:
    • Latency — request duration (p50, p95, p99)
    • Traffic — requests per second
    • Errors — error rate (4xx, 5xx as percentage of total)
    • Saturation — resource utilization (CPU, memory, connections, queue depth)
  • Use histograms for latency — averages hide tail latency problems
  • Add business metrics: orders per minute, sign-ups per hour, revenue per transaction
  • Set appropriate scrape intervals: 15-30 seconds for infrastructure, 30-60 seconds for application
  • Define metric naming conventions: service_component_metric_unit (e.g., api_http_request_duration_seconds)
  • Set retention policies: 15-second resolution for 30 days, 1-minute for 90 days, 5-minute for 1 year
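
A sketch of golden-signal instrumentation with the Python prometheus_client library. The metric names follow the convention above, but the label sets, buckets, and handler stub are assumptions made to keep the example self-contained:

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "api_http_requests_total", "Total HTTP requests",      # traffic and errors
    ["method", "path", "status"],
)
LATENCY = Histogram(
    "api_http_request_duration_seconds", "Request duration in seconds",  # latency
    ["method", "path"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
QUEUE_DEPTH = Gauge("api_worker_queue_depth", "Jobs waiting in the queue")  # saturation

def handle_request(method: str, path: str) -> int:
    start = time.perf_counter()
    status = 200                      # placeholder for real handler logic
    LATENCY.labels(method, path).observe(time.perf_counter() - start)
    REQUESTS.labels(method, path, str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(8000)           # Prometheus scrapes /metrics on :8000
    QUEUE_DEPTH.set(0)
    while True:
        handle_request("GET", "/orders")
        time.sleep(1)
```

Error rate and tail latency then fall out at query time, for example rate() over the counter and histogram_quantile() over the histogram buckets in PromQL.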

Traces: How requests flow

Traces follow a single request as it traverses multiple services, revealing latency bottlenecks and failure points.

Implementation checklist (a tracing sketch follows the list):

  • Instrument all service-to-service calls with distributed tracing
  • Use OpenTelemetry SDKs for vendor-neutral instrumentation
  • Propagate trace context (W3C Trace Context headers) across all service boundaries
  • Add custom spans for significant internal operations: database queries, cache lookups, external API calls
  • Implement sampling: 100% of errors and slow requests, 1-10% of normal requests
  • Set span attributes that enable filtering: HTTP method, URL path, status code, user ID
  • Configure trace retention: 7-14 days for full traces, 30 days for sampled summaries
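
A sketch of OpenTelemetry instrumentation in Python covering the sampler, a custom span with filterable attributes, and W3C trace-context propagation. The console exporter, the 10% ratio, and the span and attribute names are assumptions chosen to keep the example runnable without a backend:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces, but always follow the parent's decision downstream.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-api"}),
    sampler=ParentBased(TraceIdRatioBased(0.10)),
)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str, amount: float) -> None:
    # Custom span around a significant internal operation (an external API call).
    with tracer.start_as_current_span("payment.charge") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)

        headers: dict[str, str] = {}
        inject(headers)   # adds W3C traceparent/tracestate headers to the carrier
        # requests.post(payment_url, headers=headers, ...)  # propagate downstream

charge_card("o-981", 42.50)
```

A ratio sampler alone cannot guarantee that every error and slow request is kept; that is what tail-based sampling at the collector, covered in the processing layer below, is for.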

Tool Selection

Logging stack comparison

| Criteria | ELK (Elasticsearch + Logstash + Kibana) | Loki + Grafana | Datadog Logs |
|---|---|---|---|
| Cost | Infrastructure-heavy, self-managed | Lower storage cost (indexes labels, not content) | Per-GB ingestion pricing |
| Query language | KQL (powerful, complex) | LogQL (Prometheus-inspired, simpler) | Natural language + structured queries |
| Scalability | Requires careful cluster management | Horizontally scalable, simpler ops | Fully managed |
| Best for | Full-text search, complex analysis | Cost-effective log aggregation | Teams preferring SaaS |
| Operational burden | High (cluster management, index tuning) | Medium (simpler architecture) | None (SaaS) |

Metrics stack comparison

| Criteria | Prometheus + Grafana | Datadog | VictoriaMetrics |
|---|---|---|---|
| Cost | Free (self-hosted) + infrastructure | Per-host pricing ($15-23/host/month) | Free (self-hosted), lower resource usage than Prometheus |
| Query language | PromQL | Proprietary + PromQL support | PromQL-compatible |
| Long-term storage | Requires Thanos or Cortex | Built-in | Built-in, efficient compression |
| Best for | Cloud-native, Kubernetes | Full-stack monitoring | High-cardinality, long-retention |
| Operational burden | Medium (federation for scale) | None (SaaS) | Low-medium |

Tracing stack comparison

| Criteria | Jaeger | Zipkin | Tempo + Grafana |
|---|---|---|---|
| Backend storage | Elasticsearch, Cassandra, Kafka | Elasticsearch, MySQL, Cassandra | Object storage (S3, GCS) |
| Cost | Infrastructure-dependent | Infrastructure-dependent | Lower storage cost (object storage) |
| UI | Good, standalone | Simple, lightweight | Grafana integration |
| Best for | Production-grade, large scale | Simpler deployments | Grafana ecosystem |

Budget-conscious (self-hosted): Loki + Prometheus + Tempo + Grafana — unified Grafana UI, lower operational cost, fully open source.

Enterprise (managed): Datadog or Grafana Cloud — full-stack observability, no infrastructure management, higher per-unit cost but lower total cost of ownership for teams without dedicated platform engineers.

Architecture Design

Collection layer

  • Deploy the OpenTelemetry Collector as a per-pod sidecar or as a DaemonSet on every node
  • Configure the collector to receive traces, metrics, and logs via OTLP
  • Use the collector for batching, filtering, and routing — keep application-side instrumentation lightweight (the application-side wiring is sketched after this list)
  • Implement buffering in the collector to handle backend outages without data loss
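
On the application side, pairing with a local collector might look like the sketch below: the SDK batches spans in memory and ships them over OTLP, while filtering, enrichment, and routing stay in the collector. The localhost:4317 endpoint is an assumption for a sidecar or node-local collector:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))

# BatchSpanProcessor queues spans and flushes them in batches over OTLP/gRPC,
# keeping per-request overhead in the application low.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```

Switching or adding backends then means reconfiguring the collector, not redeploying every service.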

Processing layer

  • Filter out high-volume, low-value data before storage (health check logs, internal metrics)
  • Enrich data with metadata: Kubernetes labels, deployment version, region
  • Sample traces at the collector level — tail-based sampling keeps traces with errors or high latency and drops most routine ones (see the sketch after this list)
  • Transform log formats into a consistent schema across all services
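
Tail-based sampling is normally configured in the collector (for example the OpenTelemetry Collector's tail sampling processor), but the decision it makes can be sketched in Python; the span fields, thresholds, and base rate here are illustrative:

```python
import random

def keep_trace(spans: list[dict], slow_ms: float = 1000.0, base_rate: float = 0.05) -> bool:
    """Decide, after the whole trace has been buffered, whether to keep it."""
    if any(span["status"] == "ERROR" for span in spans):
        return True                                 # always keep failed traces
    if max(span["duration_ms"] for span in spans) > slow_ms:
        return True                                 # always keep slow traces
    return random.random() < base_rate              # keep a small share of the rest

trace_spans = [
    {"name": "GET /orders", "status": "OK", "duration_ms": 1840.0},
    {"name": "SELECT orders", "status": "OK", "duration_ms": 1700.0},
]
print(keep_trace(trace_spans))  # True: the trace is slow, so it is retained
```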

Storage layer

  • Separate hot storage (recent data, fast queries) from cold storage (archived data, cheaper)
  • Configure retention policies per data type and severity
  • Monitor storage growth and set up capacity alerts
  • Test data restoration from cold storage regularly

Alerting Strategy

Alert hierarchy

| Level | Response time | Channel | Example |
|---|---|---|---|
| Critical (P1) | 5 minutes | PagerDuty/phone | Service down, data loss risk |
| High (P2) | 30 minutes | Slack + PagerDuty | Error rate above SLO, degraded performance |
| Medium (P3) | 4 hours | Slack | Elevated error rate, approaching capacity |
| Low (P4) | Next business day | Email/ticket | Drift detected, non-urgent maintenance |

Alert design principles

  • Alert on symptoms (user-facing impact), not causes (CPU, memory)
  • Every alert must have a documented runbook — what to check, what to do
  • Set appropriate thresholds — avoid alerting on normal variance
  • Use multi-window, multi-burn-rate alerts for SLO-based monitoring (see the sketch after this list)
  • Review alert frequency monthly — if an alert fires more than 5 times without action, fix the root cause or remove the alert
  • Track alert fatigue: if the on-call engineer ignores more than 10% of alerts, the signal-to-noise ratio is too low
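
Burn rate is the ratio between the observed error rate and the error budget, and a multi-window rule pages only when both a short and a long window are burning fast. A sketch of that arithmetic; the 14.4x threshold and the 5-minute/1-hour window pairing follow commonly cited SRE guidance, and the input numbers are illustrative:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 spends it exactly over the SLO window."""
    return error_rate / (1.0 - slo)

def should_page(error_rate_5m: float, error_rate_1h: float, slo: float = 0.999) -> bool:
    # Page only when both windows exceed the threshold: the 1-hour window filters
    # out short blips, the 5-minute window confirms the problem is still happening.
    threshold = 14.4   # at 14.4x, a 30-day error budget is gone in about 2 days
    return (
        burn_rate(error_rate_5m, slo) >= threshold
        and burn_rate(error_rate_1h, slo) >= threshold
    )

print(should_page(error_rate_5m=0.02, error_rate_1h=0.018))  # True: both windows burn >14.4x
```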

Anti-patterns to avoid

  • Alerting on every 500 error — instead, alert when the error rate exceeds the SLO threshold for a sustained period
  • Alerting on CPU > 80% — high CPU is not a problem unless it affects latency or availability
  • Duplicate alerts from multiple systems for the same issue — deduplicate and correlate
  • Alerts without owners — every alert must route to a specific team

Dashboard Design

Service dashboard template

Every service should have a dashboard with:

  • Request rate (requests per second, broken down by endpoint)
  • Error rate (percentage, broken down by status code)
  • Latency distribution (p50, p95, p99)
  • Saturation (CPU, memory, connections, queue depth)
  • Dependency health (latency and error rate of downstream services)
  • Recent deployments (annotated on time-series graphs)

Executive dashboard

  • Overall system availability (uptime percentage)
  • SLO burn rate — are we consuming error budget faster than planned?
  • Incident count and mean time to recovery (MTTR)
  • Cost metrics — infrastructure spend per service, per environment

Dashboard anti-patterns

  • Too many graphs on one dashboard — keep it to 8-12 panels maximum
  • Graphs without context — every panel needs a title, description, and expected range
  • Default time ranges too wide — 1-6 hours for operational dashboards, 7-30 days for trend dashboards
  • No drill-down — dashboards should link to traces and logs for investigation

How ARDURA Consulting Supports Observability Implementation

Building production-grade observability requires DevOps engineers, platform engineers, and SREs with hands-on experience in monitoring tooling, distributed tracing, and incident response. ARDURA Consulting provides the expertise:

  • 500+ senior specialists including SREs, platform engineers, and DevOps experts experienced in ELK, Prometheus, Grafana, Datadog, and OpenTelemetry — available within 2 weeks
  • 40% cost savings compared to permanent hiring, allowing you to bring in observability expertise for implementation and knowledge transfer without long-term commitments
  • 99% client retention — engineers who stay through implementation, tuning, and operationalization of your observability stack
  • 211+ completed projects including enterprise monitoring platforms, incident response automation, and SRE practice establishment

Whether you need a platform engineer to architect your observability stack or an SRE team to implement and operate it, ARDURA Consulting provides the talent to make your systems truly observable.