SRE

Distributed Systems: Monitoring and Alerting

In day to day life, SRE deals with operating complex distributed software systems. Cloud native is becoming a first class citizen and these systems leverage many managed services (like databases, queues, VMs) to provide business functionalities to the end user. SRE has to make sure that the system is always up and operating normally.
The development teams are also interested in understanding how their software behaves in the production environments.

Having an efficient monitoring and alerting system will help not only the SRE teams, but also the development teams.

In this post I am sharing my thoughts about setting up useful monitors and alerts.

Monitoring

Modern software systems spans across multiple data centers in different regions and often uses polyglot databases and queuing systems. Including business, various stakeholders are interested in understanding the system behaviors.

Logging & Instrumentation are the two pillars of monitoring. Monitors gathers data from these systems and stores them so that they can be queried and visualized.

We should monitor all the components of the software system. Also, to get better visibility of the systems, we should set up monitors from different perspectives.

A successful monitoring implementation will reduce the time to identify the root cause of a problem and helps to reduce the MTTR and increase the MTBR.

In an ideal monitored world, the operations engineer working at 3 AM should be able to pin point issues without typing any commands.

Alerting

Though dashboards can help us to find anomalies, not everyone likes to stare at the dashboards often.

So, we should setup some mechanism which periodically analyze the data and alert us when some component is not behaving correctly. There are many tools out there to analyze and raise alerts, which can be paged to the respective stakeholders.

Like monitors, different stake holders are interested in different types alerts, their severity and how they are delivered to them (via email, phone calls etc.)

For example,

Operations Engineers should be called when a system failed, whereas get emails in other cases
Development teams might want to get notified on certain error messages via emails
Business might want to get notified when the end users are affected via emails
Make a phone call to the on call engineer on high severity issues.

So, I would like to categorize the alerts based on their severity, and assign different delivery mechanisms for each category so that no one gets alert fatigue.