When a production incident strikes at 2 AM, the difference between a 5-minute fix and a 4-hour investigation is observability. Teams with mature observability practices can identify the affected service, trace the failing request, correlate it with a recent deployment, and roll back — all within minutes. Teams without observability start with “which server should I SSH into?” This guide covers the implementation of all three pillars: logs, metrics, and traces.
The Three Pillars
Logs: What happened
Logs are timestamped records of discrete events. They tell you what happened, when, and in what context.
Implementation checklist:
- Use structured logging (JSON) — not unstructured text lines
- Include standard fields in every log entry: timestamp, service name, trace ID, severity, message
- Add business context: user ID, request ID, operation type, resource identifier
- Use appropriate log levels: ERROR (action needed), WARN (degraded but functioning), INFO (business events), DEBUG (development only, disabled in production)
- Centralize logs from all services into a single platform
- Set retention policies: 30 days full resolution, 90-365 days aggregated or archived
- Implement log sampling for high-volume, low-value events (health checks, routine operations)
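A minimal sketch of what this looks like in practice, using Python's standard logging module with a hand-rolled JSON formatter (libraries such as structlog or python-json-logger follow the same pattern); the service name and context fields below are illustrative:

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line with the standard fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service_name,
            "severity": record.levelname,
            "message": record.getMessage(),
            # Business context passed via `extra=` below; field names are illustrative.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "operation": getattr(record, "operation", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("payments-api")            # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service_name="payments-api"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)                         # DEBUG stays disabled in production

# An INFO-level business event with request context attached.
logger.info(
    "payment captured",
    extra={"trace_id": uuid.uuid4().hex, "user_id": "u-123", "operation": "capture"},
)
```

Each record comes out as one JSON line that a log shipper can forward to the central platform without custom parsing rules.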
Metrics: How much and how fast
Metrics are numerical measurements collected at regular intervals. They answer quantitative questions about system behavior.
Implementation checklist:
- Instrument the four golden signals for every service:
  - Latency — request duration (p50, p95, p99)
  - Traffic — requests per second
  - Errors — error rate (4xx, 5xx as percentage of total)
  - Saturation — resource utilization (CPU, memory, connections, queue depth)
- Use histograms for latency — averages hide tail latency problems
- Add business metrics: orders per minute, sign-ups per hour, revenue per transaction
- Set appropriate scrape intervals: 15-30 seconds for infrastructure, 30-60 seconds for application
- Define metric naming conventions: service_component_metric_unit (e.g., api_http_request_duration_seconds)
- Set retention policies: 15-second resolution for 30 days, 1-minute for 90 days, 5-minute for 1 year
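As a sketch of golden-signal instrumentation, the following uses the official prometheus_client Python library; the metric names, labels, and bucket boundaries are illustrative and should match your own naming convention and latency profile:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Golden-signal instruments, named service_component_metric_unit.
REQUEST_DURATION = Histogram(
    "api_http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "path"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
REQUESTS_TOTAL = Counter(
    "api_http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
IN_FLIGHT = Gauge("api_http_requests_in_flight", "Requests currently being handled")


def handle_request(method: str, path: str) -> int:
    """Simulated handler instrumented for latency, traffic, errors, and saturation."""
    IN_FLIGHT.inc()
    start = time.perf_counter()
    status = 500
    try:
        time.sleep(random.uniform(0.01, 0.2))         # stand-in for real work
        status = 500 if random.random() < 0.02 else 200
        return status
    finally:
        REQUEST_DURATION.labels(method, path).observe(time.perf_counter() - start)
        REQUESTS_TOTAL.labels(method, path, str(status)).inc()
        IN_FLIGHT.dec()


if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape every 15-30s
    while True:
        handle_request("GET", "/orders")
```

The histogram buckets are what later make p95/p99 queries possible; a plain average or gauge would hide the tail.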
Traces: How requests flow
Traces follow a single request as it traverses multiple services, revealing latency bottlenecks and failure points.
Implementation checklist:
- Instrument all service-to-service calls with distributed tracing
- Use OpenTelemetry SDKs for vendor-neutral instrumentation
- Propagate trace context (W3C Trace Context headers) across all service boundaries
- Add custom spans for significant internal operations: database queries, cache lookups, external API calls
- Implement sampling: 100% of errors and slow requests, 1-10% of normal requests
- Set span attributes that enable filtering: HTTP method, URL path, status code, user ID
- Configure trace retention: 7-14 days for full traces, 30 days for sampled summaries
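One way to wire this up with the OpenTelemetry Python SDK is sketched below; the service name is hypothetical, the ConsoleSpanExporter stands in for an OTLP exporter, and the 10% ratio sampler covers only head-based sampling (keeping 100% of errors and slow requests requires tail-based sampling in the collector, covered later):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep roughly 10% of new traces; child spans inherit the decision.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-api"}),   # hypothetical name
    sampler=ParentBased(TraceIdRatioBased(0.1)),
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def place_order(order_id: str) -> None:
    # Parent span for the inbound request; attributes enable filtering in the backend.
    with tracer.start_as_current_span("POST /orders") as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.route", "/orders")
        span.set_attribute("order.id", order_id)
        # Custom child span around a significant internal operation.
        with tracer.start_as_current_span("db.insert_order"):
            pass  # real database call goes here


place_order("ord-42")
```

The SDK propagates W3C Trace Context headers by default once the instrumentation libraries for your HTTP client and server frameworks are installed, so the trace ID crosses service boundaries without extra code.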
Tool Selection
Logging stack comparison
| Criteria | ELK (Elasticsearch + Logstash + Kibana) | Loki + Grafana | Datadog Logs |
|---|---|---|---|
| Cost | Infrastructure-heavy, self-managed | Lower storage cost (indexes labels, not content) | Per-GB ingestion pricing |
| Query language | KQL / Query DSL (powerful, complex) | LogQL (Prometheus-inspired, simpler) | Natural language + structured queries |
| Scalability | Requires careful cluster management | Horizontally scalable, simpler ops | Fully managed |
| Best for | Full-text search, complex analysis | Cost-effective log aggregation | Teams preferring SaaS |
| Operational burden | High (cluster management, index tuning) | Medium (simpler architecture) | None (SaaS) |
Metrics stack comparison
| Criteria | Prometheus + Grafana | Datadog | Victoria Metrics |
|---|---|---|---|
| Cost | Free (self-hosted) + infrastructure | Per-host pricing ($15-23/host/month) | Free (self-hosted), lower resource usage than Prometheus |
| Query language | PromQL | Proprietary + PromQL support | PromQL-compatible |
| Long-term storage | Requires Thanos or Cortex | Built-in | Built-in, efficient compression |
| Best for | Cloud-native, Kubernetes | Full-stack monitoring | High-cardinality, long-retention |
| Operational burden | Medium (federation for scale) | None (SaaS) | Low-medium |
Tracing stack comparison
| Criteria | Jaeger | Zipkin | Tempo + Grafana |
|---|---|---|---|
| Backend storage | Elasticsearch, Cassandra, Kafka | Elasticsearch, MySQL, Cassandra | Object storage (S3, GCS) |
| Cost | Infrastructure-dependent | Infrastructure-dependent | Lower storage cost (object storage) |
| UI | Good, standalone | Simple, lightweight | Grafana integration |
| Best for | Production-grade, large scale | Simpler deployments | Grafana ecosystem |
Recommended stacks
Budget-conscious (self-hosted): Loki + Prometheus + Tempo + Grafana — unified Grafana UI, lower operational cost, CNCF-backed.
Enterprise (managed): Datadog or Grafana Cloud — full-stack observability, no infrastructure management, higher per-unit cost but lower total cost of ownership for teams without dedicated platform engineers.
Architecture Design
Collection layer
- Deploy the OpenTelemetry Collector as a per-pod sidecar or as a node-level agent (a DaemonSet in Kubernetes)
- Configure the collector to receive traces, metrics, and logs via OTLP
- Use the collector for batching, filtering, and routing — keep application-side instrumentation lightweight
- Implement buffering in the collector to handle backend outages without data loss
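As an illustration of keeping application-side instrumentation thin, the snippet below (assuming the opentelemetry-exporter-otlp-proto-grpc package) points the SDK at a local collector on the default OTLP gRPC port; batching, filtering, enrichment, and backend routing all stay in the collector:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export OTLP to the local collector (sidecar or node agent) on the default gRPC port.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```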
Processing layer
- Filter out high-volume, low-value data before storage (health check logs, internal metrics)
- Enrich data with metadata: Kubernetes labels, deployment version, region
- Sample traces at the collector level — tail-based sampling keeps interesting traces
- Transform log formats into a consistent schema across all services
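In practice the tail-based sampling decision is made by the collector's tail_sampling processor from configuration, but the underlying logic is roughly the following (a conceptual Python sketch, not the collector's actual code):

```python
import random


def keep_trace(spans: list[dict], slow_threshold_s: float = 1.0, base_rate: float = 0.05) -> bool:
    """Decide after the whole trace has been buffered (tail-based): always keep
    traces containing an error or a slow span, keep a small fraction of the rest."""
    has_error = any(s.get("status") == "ERROR" for s in spans)
    is_slow = any(s.get("duration_s", 0.0) > slow_threshold_s for s in spans)
    return has_error or is_slow or random.random() < base_rate
```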
Storage layer
- Separate hot storage (recent data, fast queries) from cold storage (archived data, cheaper)
- Configure retention policies per data type and severity
- Monitor storage growth and set up capacity alerts
- Test data restoration from cold storage regularly
Alerting Strategy
Alert hierarchy
| Level | Response time | Channel | Example |
|---|---|---|---|
| Critical (P1) | 5 minutes | PagerDuty/phone | Service down, data loss risk |
| High (P2) | 30 minutes | Slack + PagerDuty | Error rate above SLO, degraded performance |
| Medium (P3) | 4 hours | Slack | Elevated error rate, approaching capacity |
| Low (P4) | Next business day | Email/ticket | Drift detected, non-urgent maintenance |
Alert design principles
- Alert on symptoms (user-facing impact), not causes (CPU, memory)
- Every alert must have a documented runbook — what to check, what to do
- Set appropriate thresholds — avoid alerting on normal variance
- Use multi-window, multi-burn-rate alerts for SLO-based monitoring
- Review alert frequency monthly — if an alert fires more than 5 times without action, fix the root cause or remove the alert
- Track alert fatigue: if the on-call engineer ignores more than 10% of alerts, the signal-to-noise ratio is too low
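To make the multi-window, multi-burn-rate idea concrete, here is a small Python sketch; the 14.4x threshold and the 1-hour/5-minute window pair follow the commonly cited SRE-workbook example and are illustrative:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo              # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / error_budget


def should_page(err_1h: float, err_5m: float, slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn fast: the long window
    proves the problem is sustained, the short window lets the alert clear quickly
    once the problem stops."""
    return burn_rate(err_1h, slo) >= threshold and burn_rate(err_5m, slo) >= threshold


# 2% of requests failing over both windows against a 99.9% SLO is a 20x burn rate: page.
print(should_page(err_1h=0.02, err_5m=0.02))   # True
```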
Anti-patterns to avoid
- Alerting on every 500 error — instead, alert when the error rate exceeds the SLO threshold for a sustained period
- Alerting on CPU > 80% — high CPU is not a problem unless it affects latency or availability
- Duplicate alerts from multiple systems for the same issue — deduplicate and correlate
- Alerts without owners — every alert must route to a specific team
Dashboard Design
Service dashboard template
Every service should have a dashboard with:
- Request rate (requests per second, broken down by endpoint)
- Error rate (percentage, broken down by status code)
- Latency distribution (p50, p95, p99)
- Saturation (CPU, memory, connections, queue depth)
- Dependency health (latency and error rate of downstream services)
- Recent deployments (annotated on time-series graphs)
Executive dashboard
- Overall system availability (uptime percentage)
- SLO burn rate — are we consuming error budget faster than planned?
- Incident count and mean time to recovery (MTTR)
- Cost metrics — infrastructure spend per service, per environment
Dashboard anti-patterns
- Too many graphs on one dashboard — keep it to 8-12 panels maximum
- Graphs without context — every panel needs a title, description, and expected range
- Default time ranges too wide — 1-6 hours for operational dashboards, 7-30 days for trend dashboards
- No drill-down — dashboards should link to traces and logs for investigation
How ARDURA Consulting Supports Observability Implementation
Building production-grade observability requires DevOps engineers, platform engineers, and SREs with hands-on experience in monitoring tooling, distributed tracing, and incident response. ARDURA Consulting provides the expertise:
- 500+ senior specialists including SREs, platform engineers, and DevOps experts experienced in ELK, Prometheus, Grafana, Datadog, and OpenTelemetry — available within 2 weeks
- 40% cost savings compared to permanent hiring, allowing you to bring in observability expertise for implementation and knowledge transfer without long-term commitments
- 99% client retention — engineers who stay through implementation, tuning, and operationalization of your observability stack
- 211+ completed projects including enterprise monitoring platforms, incident response automation, and SRE practice establishment
Whether you need a platform engineer to architect your observability stack or an SRE team to implement and operate it, ARDURA Consulting provides the talent to make your systems truly observable.