Alerting and Monitoring in Distributed Systems

Published on 15 Dec 2024
Backend

Distributed systems give us scalability, resilience, and flexibility—but they also make failures harder to see, diagnose, and respond to. Problems are rarely binary. Instead of “the system is down,” you get partial outages, slow degradation, and failures that only show up under specific conditions.

That’s why alerting and monitoring aren’t optional add-ons. They’re foundational to running distributed systems in production.

This post explores how to think about alerting and monitoring in distributed systems, common mistakes to avoid, and practical examples using Azure resources to make the ideas concrete.


Why Distributed Systems Are Different

In a monolithic application, failures tend to be obvious. In distributed systems, they are often subtle and partial.

For example:

  • One service is slow but still responding
    An Azure App Service may continue returning HTTP 200 responses, but P95 latency steadily increases. From a metrics perspective, everything looks “up,” yet users experience sluggish pages and timeouts. If you only alert on error rate, you’ll miss this entirely.

  • A downstream dependency is timing out intermittently
    An App Service calling Azure SQL Database or Cosmos DB may encounter brief throttling. Requests often succeed on retry, so errors remain low, but latency spikes appear in distributed traces. These issues are easy to overlook without end-to-end visibility.

  • Messages are backing up in a queue
    An Azure Service Bus queue may show steadily increasing ActiveMessages. Nothing is technically broken, but processing lag grows until users start noticing delays. Left unchecked, this often leads to cascading failures.

  • A single region or availability zone is degraded
    The application remains available, but performance drops due to a regional dependency issue. Without region-level metrics, this can look like random application slowness rather than an infrastructure problem.

Monitoring in distributed systems must answer three questions:

  1. Is the system healthy right now?

  2. Are users being impacted?

  3. Where should we investigate first?


Monitoring vs. Alerting

Although they’re closely related, monitoring and alerting serve very different purposes.

Monitoring: Visibility

Monitoring helps you understand what your system is doing over time. In Azure, this typically includes:

  • Azure Monitor metrics for latency, traffic, errors, and resource usage

  • Application Insights dashboards for request behaviour and dependency calls

  • Log Analytics queries for deep debugging

  • Distributed tracing across services and dependencies

Monitoring answers:

What is happening, and how did we get here?

Dashboards, graphs, and logs are for engineers actively investigating issues or analysing trends.
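
As a concrete example, the same request data that feeds a dashboard can be pulled from Log Analytics programmatically. Here is a minimal sketch using the azure-monitor-query Python package; the workspace ID is a placeholder, and it assumes the standard workspace-based Application Insights schema (the AppRequests table):

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Failed vs. total requests per 5-minute bin (workspace-based App Insights schema).
QUERY = """
AppRequests
| summarize total = count(), failed = countif(Success == false)
    by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
"""

# Error and partial-result handling omitted for brevity.
client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        print(row)  # [TimeGenerated, total, failed]

The same KQL can also back a workbook or dashboard tile, so the ad-hoc view and the dashboard view of "error rate per 5 minutes" stay consistent.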

Alerting: Action

Alerting exists to trigger action. In Azure, alerts are usually configured through Azure Monitor Alerts and routed via Action Groups to email, Teams, PagerDuty, or other on-call systems.

A simple rule of thumb:

If no action is required, it should not be an alert.

Alerts answer:

Does someone need to act right now?


The Signals That Matter

Not all metrics are equally important. One widely used framework for distributed systems is the four Golden Signals, popularised by Google's SRE book: latency, traffic, errors, and saturation.

Latency

Latency directly affects user experience. Track percentile latency (P95 or P99) rather than averages.

Azure example:

  • Application Insights requests/duration

  • Alert when P95 latency exceeds an agreed threshold for a sustained period
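
To see why percentiles matter, here is a small self-contained sketch with invented durations; a modest tail of slow requests barely moves the mean but dominates the P95:

from statistics import mean, quantiles

# 95 fast requests around 120 ms plus a small tail of slow ones (invented values).
durations_ms = [120] * 95 + [2500] * 5

p95 = quantiles(durations_ms, n=100)[94]  # 95th percentile
print(f"mean = {mean(durations_ms):.0f} ms, P95 = {p95:.0f} ms")
# -> mean = 239 ms, P95 = 2381 ms: the mean still looks plausible, the P95 does not.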

Traffic

Traffic shows demand and helps detect unexpected drops or spikes.

Azure example:

  • Request count in App Service or API Management

  • Sudden drops may indicate routing or authentication issues

  • Sudden spikes may expose scaling or throttling limits
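
A drop is easiest to spot relative to a recent baseline rather than as an absolute number. A minimal sketch, with invented per-bin request counts and an illustrative 50% threshold:

# Flag a sudden traffic drop against a simple rolling baseline.
requests_per_bin = [940, 910, 965, 930, 950, 920, 410]  # most recent 5-minute bin last

baseline = sum(requests_per_bin[:-1]) / len(requests_per_bin[:-1])
latest = requests_per_bin[-1]

if latest < 0.5 * baseline:  # illustrative threshold: less than half of normal
    print(f"Traffic drop: {latest} requests vs. a baseline of ~{baseline:.0f}")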

Errors

Error rate is often more meaningful than raw error count.

Azure example:

  • Failed request percentage in Application Insights

  • 2 errors out of 10 requests is very different from 2 errors out of 10,000
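
A tiny sketch makes the difference concrete; the counts mirror the example above:

# The same error count means very different things at different request volumes.
def error_rate(failed: int, total: int) -> float:
    return failed / total if total else 0.0

print(f"{error_rate(2, 10):.2%}")      # 20.00% -> almost certainly an incident
print(f"{error_rate(2, 10_000):.2%}")  # 0.02%  -> probably background noise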

Saturation

Saturation shows how close a system is to its limits.

Azure example:

  • CPU or memory for App Service

  • DTU or vCore utilisation for Azure SQL

  • Queue depth or message age for Service Bus

In asynchronous systems, queue-related metrics are often the earliest warning signs.
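
One way to turn that early warning into a signal is to look at the trend rather than any absolute depth. A minimal sketch, with invented ActiveMessages samples taken once per minute:

# Treat a queue that grows on every sample as an early saturation signal.
def continuously_growing(samples: list[int]) -> bool:
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))

depth = [1200, 1450, 1730, 2010, 2390, 2800, 3300, 3900, 4600, 5400]

if continuously_growing(depth):
    print("Queue depth rose for 10 consecutive samples: consumers are falling behind")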


Logs, Metrics, and Traces

Each observability signal answers a different question:

  • Metrics tell you something is wrong

  • Logs tell you why

  • Traces tell you where

A common Azure incident flow looks like this:

  1. An alert fires due to elevated error rate

  2. Dashboards show increased dependency latency

  3. Distributed traces reveal slow SQL calls

  4. Logs identify throttling or an inefficient query

Relying on only one of these creates blind spots. Together, they dramatically reduce time to diagnosis.


Designing Effective Alerts

Poor alerts create noise. Too much noise leads to alert fatigue—and eventually, missed incidents.

Effective alerts share a few characteristics:

  • They focus on symptoms, not causes
    High CPU might be acceptable during peak traffic. Elevated error rates or breached latency targets are not.

  • They are sustained, not spiky
    Alerting on single data points creates noise. Alerting on sustained conditions builds trust.

  • They are actionable
    An on-call engineer should know where to start investigating.

  • They include context
    Azure alerts can link directly to dashboards, log queries, or runbooks.

For example:

  • Alert when failed request percentage exceeds 2% for 5 minutes

  • Alert when Service Bus queue depth grows continuously for 10 minutes

Each alert should answer:

What is broken, how bad is it, and where do I look first?
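
Both example alerts above boil down to the same pattern: evaluate a condition over a window and fire only if it holds throughout. A minimal sketch, using invented per-minute failure percentages and the 2% / 5-minute rule from above:

# Fire only when the condition holds for the whole window, not on a single spike.
def sustained_breach(samples: list[float], threshold_pct: float) -> bool:
    return len(samples) > 0 and all(s > threshold_pct for s in samples)

five_bad_minutes = [2.4, 3.1, 2.8, 2.6, 3.0]
single_spike = [0.3, 0.2, 7.5, 0.4, 0.3]

print(sustained_breach(five_bad_minutes, threshold_pct=2.0))  # True  -> alert
print(sustained_breach(single_spike, threshold_pct=2.0))      # False -> no alert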


SLOs, SLIs, and Error Budgets

More mature systems drive alerting from Service Level Objectives (SLOs) rather than raw metrics.

  • SLI: What you measure (e.g., request success rate)

  • SLO: The target (e.g., 99.9% success over 30 days)

  • Error Budget: How much failure is acceptable

In Azure, this often means:

  • Using Application Insights for SLIs

  • Tracking rolling success rates

  • Alerting on error budget burn rate rather than isolated failures

This approach aligns alerting with customer impact and business goals.
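
Burn rate is simply how quickly you are consuming the error budget relative to the rate the SLO allows. A minimal sketch against the 99.9% / 30-day example above; the request counts and the fast-burn threshold are illustrative, not prescriptive:

# Burn-rate check against a 99.9% success SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(failed: int, total: int) -> float:
    observed_failure_ratio = failed / total if total else 0.0
    return observed_failure_ratio / ERROR_BUDGET

rate = burn_rate(failed=90, total=36_000)  # 0.25% failures in the last hour
print(f"burn rate = {rate:.1f}x")          # 2.5x: budget draining 2.5x faster than allowed

if rate > 14:  # illustrative fast-burn threshold for a one-hour window
    print("Fast burn: page the on-call engineer")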


Common Pitfalls

Some mistakes appear repeatedly in distributed systems:

  • Alerting on every available Azure metric “just in case”

  • Treating dashboards as alerts

  • Alerting on auto-scaling events instead of user impact

  • Ignoring queue depth and processing lag

  • Shipping alerts without runbooks or guidance

A common failure mode:

Everything looks green in the Azure Portal, but users are complaining.

This usually means you’re monitoring resources, not outcomes.


Summary

Distributed systems rarely fail loudly. They fail gradually, partially, and quietly.

Effective alerting and monitoring:

  • Focus on user-facing signals

  • Alert only when action is required

  • Treat queues and async workflows as first-class citizens

  • Combine metrics, logs, and traces

  • Continuously refine alerts as the system evolves

The goal isn’t more dashboards or more alerts.
It’s fewer surprises—and faster understanding when something goes wrong.