Site Reliability Engineering (SRE)
đŸ‘ïž

Site Reliability Engineering (SRE)

Created: Jun 20, 2023 04:55 PM
Language: English
Summary: General concepts around observability and best practices around it.
Attention Status: Ignorable
SRE is a concept developed inside Google to address the problems with the standard way of running software systems in the early 2000s and before. Under that model, an operations team of sysadmins was responsible for deploying systems and keeping them running, while the development team was responsible for building new features. Each team used its own technical vocabulary and had its own concerns. That division ended up causing conflicts between the two teams, and each started using tactics to bypass constraints created by the other.
Google then started staffing its operations work with software engineers who had some sysadmin knowledge. That led them to automate a set of tasks that were previously done manually, and it eased the interaction between development and operations.
đŸ”„
Postmortems should be written for any significant incident and follow a blameless approach.

Principles of SRE

The focus of an SRE team differs from team to team, but they all share a set of core responsibilities for their services:
  • availability;
  • latency;
  • performance;
  • efficiency;
  • change management;
  • monitoring;
  • emergency response;
  • and capacity planning.

Balancing Velocity and Availability

No software system will ever be 100% available. That is an unrealistic goal: hard to achieve and irrelevant to the end user. Getting the final 0.001% of availability is extremely hard, and basically no user will notice the difference between 99.999% (less than 6 minutes of downtime per year) and 100%.
Since 100% availability is not the right SLO (Service-Level Objective) for your system, what is? In SRE, the answer is an error budget. This is a requirement defined by the product, which should take into consideration how different levels of availability affect the product and its users. We then spend this error budget, e.g. 0.001% over the year, with maximum velocity for feature development.
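To make the arithmetic concrete, here is a minimal sketch (not from the source; the function names and numbers are illustrative) of turning an availability SLO into an error budget and tracking how much of it has been spent:

```python
def error_budget_minutes(slo: float, period_minutes: int = 365 * 24 * 60) -> float:
    """Minutes of allowed downtime for a given availability SLO over a period (default: one year)."""
    return (1.0 - slo) * period_minutes

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO was violated."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget

print(error_budget_minutes(0.99999))    # ~5.3 minutes per year for a 99.999% SLO
print(budget_remaining(0.99999, 2.0))   # ~0.62 of the budget left, so keep shipping
```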

Monitoring

Monitoring is how service owners keep track of their systems’ health and availability. It includes collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.
Great monitoring should never require a human to interpret any data; software should do the interpreting. Humans should be involved only when they need to take action.
There are three kinds of possible monitoring outputs:
  • Alerts: when a human needs to take immediate action in response to something that is happening or about to happen.
  ‱ Tickets: when a human needs to take action, but not immediately. The system cannot handle the situation automatically, but no damage will result if the action takes a few days.
  • Logging: when no human needs to take action, but it is recorded for diagnostic or forensic purposes. No one should read the logs unless something else prompts them to do so.
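As a rough illustration (the function and flag names are invented, not from the source), these three output kinds can be thought of as a simple routing decision:

```python
from enum import Enum

class Output(Enum):
    ALERT = "page a human now"
    TICKET = "queue for a human, not urgent"
    LOG = "record only, for later diagnosis"

def route(needs_human: bool, urgent: bool) -> Output:
    """Map a monitoring finding onto one of the three output kinds."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET

print(route(needs_human=True, urgent=True))    # Output.ALERT
print(route(needs_human=True, urgent=False))   # Output.TICKET
print(route(needs_human=False, urgent=False))  # Output.LOG
```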

Emergency Response

Reliability depends on both mean time to failure (MTTF) and mean time to repair (MTTR). The latter is the most relevant metric for evaluating emergency response.
It’s intuitive that humans are slower than software when it comes to MTTR, so the less human intervention a system requires, the higher its availability will be. Even so, some incidents will require human intervention. For those cases, keeping a playbook of best practices for troubleshooting and recovery improves MTTR roughly 3x. Beyond that, for issues the playbook does not cover, SRE engineers can run exercises such as the Wheel of Misfortune to prepare and stay sharp.
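For reference, the standard relation between these quantities (not spelled out in the source) is availability = MTTF / (MTTF + MTTR); the sketch below, with made-up numbers, shows how cutting MTTR raises availability:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Same failure rate, but a playbook that cuts repair time roughly 3x:
print(availability(mttf_hours=700, mttr_hours=3.0))  # ~0.9957
print(availability(mttf_hours=700, mttr_hours=1.0))  # ~0.9986
```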

Change Management

Roughly 70% of outages are due to changes in a live system. This trio of best practices can minimize the number of users and operations exposed to bad changes:
  • implementing progressive rollouts,
  • quickly and accurately detecting problems,
  • and rolling back changes safely when problems arise.
These best practices remove humans from the deployment flow and reduce the common problems they cause. As a result, both velocity and safety increase.
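A minimal sketch of how the three practices could be combined into an automated rollout loop; the deploy_to, error_rate, and rollback helpers are hypothetical stand-ins for real deployment and monitoring hooks:

```python
import random
import time

STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic per rollout stage
ERROR_THRESHOLD = 0.001            # abort if more than 0.1% of requests fail
BAKE_SECONDS = 0                   # soak time per stage (minutes or hours in practice)

# Hypothetical stand-ins for real deployment and monitoring systems.
def deploy_to(version: str, fraction: float) -> None:
    print(f"routing {fraction:.0%} of traffic to {version}")

def error_rate(version: str) -> float:
    return random.uniform(0.0, 0.002)   # pretend metric lookup

def rollback(version: str) -> None:
    print(f"rolling back {version}")

def progressive_rollout(version: str) -> bool:
    """Roll out in stages, detect problems quickly, and roll back automatically."""
    for fraction in STAGES:
        deploy_to(version, fraction)
        time.sleep(BAKE_SECONDS)
        if error_rate(version) > ERROR_THRESHOLD:
            rollback(version)
            return False
    return True

progressive_rollout("v2.3.1")
```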

Efficiency and Performance

If you care about money, you need to use resources efficiently. A system’s resource usage is described by its demand, its provisioned capacity, and its software efficiency; these three factors account for a large part of a system’s efficiency. With poor software efficiency and high demand, effective capacity shrinks. At some point the system will stop serving and performance goes down the hole, or the system may go down entirely.
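A back-of-the-envelope sketch, with invented numbers, of how demand, provisioned capacity, and software efficiency interact:

```python
def utilization(demand_qps: float, provisioned_qps: float, sw_efficiency: float) -> float:
    """Fraction of effective capacity consumed; near or above 1.0 the service starts failing."""
    effective_capacity_qps = provisioned_qps * sw_efficiency
    return demand_qps / effective_capacity_qps

print(utilization(demand_qps=800, provisioned_qps=1000, sw_efficiency=1.0))  # 0.80 -> healthy
print(utilization(demand_qps=800, provisioned_qps=1000, sw_efficiency=0.7))  # ~1.14 -> overloaded
```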

Monitoring Distributed Systems

Monitoring watches certain indicators of a system’s health, triggers actions to fix what can be fixed automatically, and notifies humans when manual action is required.
  ‱ White-box monitoring: based on metrics exposed by the internals of the system, such as logs, profiling interfaces, or HTTP handlers that emit internal statistics.
    ‱ This type of monitoring is essential for debugging systems.
    ‱ You need to collect data from every node in your system. E.g. if web servers seem slow on database-heavy requests, you need to know both how fast the web server perceives the database to be and how fast the database believes itself to be. Otherwise, you cannot tell whether the issue is in one of the systems or in the network between them.
  • Black-box monitoring: monitoring externally visible behavior, as the user would see it.
    ‱ This type of monitoring is very useful for ensuring that mission-critical flows are always functional and get attention as soon as they stop working, but it is useless for predicting issues before they arise.
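As a rough illustration (the URLs and endpoints are invented placeholders), a black-box probe measures only what an external user would see, while a white-box check reads the statistics the service exposes about itself:

```python
import json
import time
import urllib.request

def black_box_probe(url: str = "https://example.com/checkout") -> float:
    """End-to-end latency of a user-visible flow, measured from outside the system (placeholder URL)."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=5).read()
    return time.monotonic() - start

def white_box_check(stats_url: str = "http://localhost:8080/debug/vars") -> dict:
    """Internal statistics emitted by the service itself, e.g. the DB latency as the web server perceives it (placeholder endpoint)."""
    with urllib.request.urlopen(stats_url, timeout=5) as resp:
        return json.loads(resp.read())
```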

Why Monitor?

Monitoring brings several advantages, such as:
  ‱ Analyzing long-term trends: how large is my database, and how fast is it growing? Should I consider increasing its capacity? Should I change to a different DBMS?
  ‱ Comparing over time or experiment groups: is my system slower after I released that new feature? Is Postgres any slower than MongoDB for that specific use case?
  ‱ Alerting: something is broken, or about to break soon, and somebody needs to take a look.
  • Building dashboards: you can create dashboards to quickly answer basic questions about your system’s health and performance.
  • Debugging: our system latency just shot up. What else happened around the same time that might have caused this?
  • Business and security analytics: it provides raw input data for business analytics as well as analysis of security breaches.

Common Rules

  ‱ Avoid complexity: systems that try to learn thresholds or automatically detect causality, as well as rules that are too reliant on dependencies, are generally too complex for monitoring. Monitoring should be kept simple and comprehensible for everyone on the team.
    ‱ This rule is recommended because more complexity equals more fragility.
    ‱ Monitoring systems that learn from end-user behavior are one scenario where such complexity can apply. Another valid use case is experiments that will not affect the monitoring of mission-critical systems.
    ‱ Alerting rules should depend only on very stable parts of the system.
    ‱ To keep monitoring reliable, low-noise, and high-signal, keep it simple and robust.
  ‱ Answer “what” and “why” something broke or is about to break: the “what” signals the symptom; the “why” indicates a possible root cause (nothing guarantees that it is really the root cause and not just an intermediate cause).

The Four Golden Signals

If you measure all four of the following signals, you will have a decently monitored service (a short sketch after this list shows one way to compute them).
  • Latency: time it takes to serve a request.
    ‱ Segregate latency for successful and failing requests; otherwise, failing requests might skew your interpretation of the overall latency.
    ‱ In white-box monitoring, latency can be taken from each service’s response time. In black-box monitoring, it should mimic end-user perception, so it should measure end-to-end latency.
  ‱ Traffic: a high-level measure of how much demand is being placed on your system.
    ‱ What you measure depends on your system:
      • for a web service, how many HTTP requests per second;
      • for databases, how many transactions and retrievals per second;
      • for a streaming service, the network I/O rate or concurrent sessions.
  • Errors: rate of requests that fail in any way.
    ‱ Not only requests that return a 5xx HTTP status, but also, for example, an HTTP 200 response with a compromised response body.
  ‱ Saturation: how “full” your service is, i.e. how heavily it is being used.
    ‱ Focus on the resources your system is most constrained on (e.g. memory, I/O, storage, 
).
    ‱ Keep a target utilization. If your system uses far less than the available resources, you are wasting money; if you over-utilize them, your application will fail.
    ‱ In complex systems, you can use a higher-level load measurement to ensure saturation stays within the target utilization range. For example: can the service handle 10% more traffic? What about double? And less?
    ‱ Saturation measurements should also be used to predict impending saturation, e.g. “the database will fill its hard drive in about 4 hours.”
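A minimal sketch of computing the four signals over a window of request records; the record shape, field names, and CPU numbers are assumptions for illustration, not from the source:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_s: float
    ok: bool          # False also covers cases like an HTTP 200 with a broken body

def golden_signals(requests: list[Request], window_s: float,
                   cpu_used: float, cpu_total: float) -> dict:
    ok = [r.latency_s for r in requests if r.ok]
    failed = [r.latency_s for r in requests if not r.ok]
    return {
        # Latency: keep successful and failing requests separate.
        "latency_p99_ok_s": quantiles(ok, n=100)[98] if len(ok) > 1 else None,
        "latency_p99_failed_s": quantiles(failed, n=100)[98] if len(failed) > 1 else None,
        # Traffic: demand placed on the system.
        "traffic_qps": len(requests) / window_s,
        # Errors: rate of requests that failed in any way.
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: utilization of the most constrained resource (CPU here).
        "saturation": cpu_used / cpu_total,
    }

reqs = [Request(0.12, True), Request(0.30, True), Request(1.80, False)]
print(golden_signals(reqs, window_s=1.0, cpu_used=6.5, cpu_total=8.0))
```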

How to Extract the Data

  ‱ You should not use the mean, as it easily becomes misleading. Instead, look at some portion of the data (e.g. the x% of requests with the highest latency) or bucket the data into groups to visualize it as a histogram.
  ‱ Different measurements require different levels of granularity. Collecting frequent measurements on your resources might yield interesting data, but it will be expensive to collect, analyze, and store. Consider that cost when deciding how frequently to collect data.
    ‱ If you require high resolution but can tolerate some latency in collection, the system can do internal sampling and leave it to an external service to further aggregate and analyze that data. For instance, you could collect per-second measurements of CPU load, increment buckets of 5% utilization, and aggregate those values every minute (see the sketch after this list).
  ‱ Your monitoring system should be as simple as possible. When designing it, follow these guidelines:
      1. Rules that catch incidents should be simple, predictable, and reliable.
      2. Data collection, aggregation, and alerting configurations that are rarely used are candidates for removal.
      3. Signals that are collected but not exposed on any dashboard nor used by any alert are candidates for removal.
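A rough sketch of the bucketing approach mentioned above (the sample values are invented; the bucket width follows the 5% example): fold per-second CPU samples into 5%-wide histogram buckets each minute, and derive percentiles from the buckets instead of shipping every raw sample:

```python
from collections import Counter

BUCKET_WIDTH = 0.05   # 5% utilization buckets

def bucketize(per_second_cpu: list[float]) -> Counter:
    """Fold one minute of per-second CPU samples into 5%-wide histogram buckets."""
    hist = Counter()
    for sample in per_second_cpu:
        bucket = round(int(sample / BUCKET_WIDTH) * BUCKET_WIDTH, 2)
        hist[bucket] += 1
    return hist

def percentile(hist: Counter, p: float) -> float:
    """Approximate p-th percentile (p in 0..1) from the histogram, without raw samples."""
    total = sum(hist.values())
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen / total >= p:
            return bucket
    return max(hist)

minute = [0.42, 0.44, 0.47, 0.95, 0.51] * 12   # 60 fake per-second samples
hist = bucketize(minute)
print(hist)                   # compact histogram to hand off to the external aggregation service
print(percentile(hist, 0.9))  # lower edge of the bucket containing the 90th percentile -> 0.95
```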
 
Â