Service Observability Layer

A shared tracing and metrics layer across a fleet of platform services, built so on-call engineers can find the failing hop in minutes, not hours.

  • OpenTelemetry
  • Datadog APM
  • Grafana
  • Python

Repo: https://github.com/your-handle/observability-layer

The problem

Incidents were slow to diagnose. Traces stopped at service boundaries, so a request that crossed five services left five disconnected fragments. Alerts were threshold-based and noisy enough that the team had learned to ignore them.

The approach

I instrumented distributed tracing end to end with OpenTelemetry and standardized span conventions across services, so a single trace follows a request through the whole system. I then replaced the threshold alerts with SLI/SLO-based signals surfaced in Grafana, so alerts fire on user-visible symptoms rather than internal noise.
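
To make the "single trace across services" idea concrete, here is a minimal sketch of the underlying mechanism: each hop reads the W3C `traceparent` header, records a span under the same trace ID, and passes an updated header downstream. It uses only the standard library and is not the project's code or OpenTelemetry's API; the `handle_request` function, the service names, and the in-memory `spans` list are assumptions made for illustration.

```python
import re
import secrets
from typing import Dict, List, Optional, Tuple

# W3C Trace Context header, version 00: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (trace_id, parent_span_id); start a new trace if no valid header."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match:
        return match.group(1), match.group(2)
    return secrets.token_hex(16), None  # 32 hex chars, no parent


def inject(trace_id: str, span_id: str) -> Dict[str, str]:
    """Build outbound headers so the next service joins the same trace."""
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}  # 01 = sampled


def handle_request(service: str, headers: Dict[str, str], spans: List[dict]) -> Dict[str, str]:
    """Hypothetical handler: record one span, return headers for the next hop."""
    trace_id, parent_id = extract(headers)
    span_id = secrets.token_hex(8)  # 16 hex chars
    spans.append({"service": service, "trace_id": trace_id,
                  "span_id": span_id, "parent_id": parent_id})
    return inject(trace_id, span_id)


# Simulate one request crossing five services (names are made up).
spans: List[dict] = []
headers: Dict[str, str] = {}
for service in ["gateway", "auth", "orders", "inventory", "billing"]:
    headers = handle_request(service, headers, spans)
```

After the loop, all five spans share one `trace_id` and each span's `parent_id` points at the previous hop. If any service dropped the header, `extract` would start a fresh trace, which is how the disconnected fragments described in the problem arise.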

The tradeoff was upfront discipline: shared conventions only pay off if every service follows them, which meant template code and review gates rather than letting each team instrument ad hoc.
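
A review gate for shared conventions can be as small as a function that a test or CI step calls on every span definition. The sketch below shows one possible shape. The naming rule and the required attribute set are hypothetical stand-ins, since the project's actual conventions are not spelled out here.

```python
import re
from typing import Dict, List

# Hypothetical conventions for illustration only.
# Span names: lowercase dotted form, for example "orders.create".
SPAN_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
# Attributes every span is assumed to carry.
REQUIRED_ATTRIBUTES = {"service.name", "deployment.environment"}


def check_span(name: str, attributes: Dict[str, str]) -> List[str]:
    """Return a list of convention violations; an empty list means the span passes."""
    problems = []
    if not SPAN_NAME_RE.match(name):
        problems.append(f"span name {name!r} is not lowercase dotted form")
    missing = sorted(REQUIRED_ATTRIBUTES - set(attributes))
    if missing:
        problems.append(f"missing attributes: {', '.join(missing)}")
    return problems
```

For example, `check_span("CreateOrder", {"service.name": "orders"})` reports both a naming violation and a missing attribute, while a span named `orders.create` with both attributes returns an empty list. Failing the build on a non-empty result is one way to enforce the conventions without relying on reviewers to catch every deviation.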

The outcome

  • Mean time to detect on production incidents cut by ~52%
  • Alert noise reduced by ~40%
  • On-call engineers could pinpoint the failing hop from one trace
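
For readers unfamiliar with SLO-based alerting, one common form is a multi-window burn-rate check: page only when the error budget is being consumed quickly over both a long and a short window. The sketch below illustrates that general technique. Whether this project used this exact form is an assumption, and the 99.9% target and 14.4 threshold are illustrative defaults rather than the project's settings.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: Tuple[int, int],
                short_window: Tuple[int, int],
                slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only if both (errors, total) windows burn budget faster than the threshold."""
    return (burn_rate(*long_window, slo_target) > threshold
            and burn_rate(*short_window, slo_target) > threshold)
```

Requiring both windows to exceed the threshold filters out two noisy cases: a brief spike that has already passed (the short window recovers) and a slow trickle of errors (the long window stays under the threshold).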