Once something is in production, you are no longer just building software. You are also keeping it alive. That sounds obvious, but teams forget it all the time. We get excited about features, migrations, rewrites, platform work, whatever is next. Then the system starts failing in small, boring ways. Search is slow. Booking breaks. A background job dies repeatedly. One customer cannot load the dashboard. Another gets the same timeout three times and opens a support ticket.
If you are a SaaS company, that customer may be running part of their business on your product. If you are building for consumers, they just leave. They do not care whether the issue was caused by a dependency, a queue, a database lock, or a deployment that looked safe.
They came to do something. Your system got in the way. Sometimes you pay for that with credits or refunds. More often, you pay with trust. That is harder to see on a dashboard, and much harder to win back. This is where reliability gets practical. You cannot just say the system should be reliable. Reliable compared to what? For whom? Over what period? At what cost?
This is where error budgeting, Mean Time to Detect (MTTD), and Mean Time to Recovery (MTTR) come in. Think of the error budget as your allowance for risk: how much unreliability you can tolerate before you shift focus from new features to stability. MTTD and MTTR answer the follow-up questions: how quickly you noticed the issue, and how long it took to recover. Let us dive in and see how these metrics help manage risk and strengthen long-term reliability.
Setting the Foundations
Before we talk about error budgets and recovery, let's cover Service Level Indicators (SLIs) and Service Level Objectives (SLOs). They form the language of reliability and create a shared understanding across teams.
SLI
This is a metric that shows how well your service is performing. It could be latency, error rate, or availability. SLIs are the performance indicators that help you understand if your system meets its goals.
SLO
This is the target for an SLI. For example, if your SLI is uptime, your SLO might be 99.9% uptime. SLOs are the internal goals that help ensure your service stays reliable for users. They create a clear threshold for acceptable performance.
SLA
An SLA is a formal contract between you and your customers that outlines the expected level of service. It includes specific SLOs, and if those objectives are not met, you typically face consequences, usually financial ones such as credits or refunds. SLAs are external commitments to customers, and they require consistent operational discipline.
Reliability management depends on SLIs and SLOs. They provide a way to measure and improve your service quality. They also form the foundation of error budgets. You can then define what level of service is acceptable before you shift priorities away from feature delivery and towards stability.
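To make the distinction concrete, here is a minimal sketch of how an availability SLI might be computed and compared against an SLO. The request counts and the 99.9% target are illustrative assumptions, not values from any real system.

```python
# Minimal sketch: computing an availability SLI and checking it against an SLO.
# The request counts and the 99.9% target are illustrative assumptions.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    return successful_requests / total_requests

SLO_TARGET = 0.999  # 99.9% availability

sli = availability_sli(successful_requests=1_995_400, total_requests=1_997_000)
print(f"Availability SLI: {sli:.4%}")  # roughly 99.92%
print("Within SLO" if sli >= SLO_TARGET else "SLO breached")
```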
Error Budgeting
An error budget quantifies the acceptable amount of system unreliability over a specified period; in plain terms, it tells you how much downtime is okay. Say you want 99.9% uptime for your system. You can use error budgeting to see what that means in practice:
- Service Level Objective (SLO): 99.9% uptime
- Total Time Period: 30 days (43,200 minutes)
The allowable downtime is:
- Allowable Downtime = Total Time * (1 - Uptime %)
- Allowable Downtime = 43,200 minutes * (1 - 0.999)
That gives you 43.2 minutes of allowable downtime in 30 days. So, your error budget is 43.2 minutes for that period.
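Expressed as code, the same arithmetic looks like this. It is a minimal sketch assuming the 30-day window and 99.9% SLO from the example above.

```python
# Minimal sketch of the error budget arithmetic above (30-day window, 99.9% SLO).

def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowable downtime = total time * (1 - SLO)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(slo=0.999, period_days=30)
print(f"Error budget: {budget:.1f} minutes")  # 43.2 minutes
```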
An error budget helps you understand how fast you should go. You want to introduce new features while keeping systems reliable. Once you are out of the error budget, leadership needs to make a decision. Should you slow down or stop feature development? Should you focus solely on reliability? Clear thresholds make these trade offs explicit instead of emotional.
Error budgets also provide clarity. They become a communication tool and make reliability tangible: you are speaking with data instead of assumptions. When the error budget is nearly spent, everyone becomes aware that they need to focus on reliability. Then you can agree on the actions to take, such as slowing down releases or prioritizing fixes. This shared understanding reduces friction between product and engineering teams.
MTTD
Mean Time to Detect (MTTD) is the average time it takes to find an issue. The faster you detect, the faster you can respond, which directly impacts reliability. MTTD measures how well your monitoring and alerting systems catch issues before users feel the impact.
A low MTTD means your detection mechanisms are effective. A high MTTD means problems stay undetected longer, causing more damage. Reducing MTTD is critical for keeping services smooth and customers happy and for keeping incident scope as small as possible.
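To make the measurement concrete, here is a minimal sketch of how MTTD could be computed from incident records. The field names and timestamps are assumptions for illustration, not a prescribed schema.

```python
from datetime import datetime

# Minimal sketch: MTTD as the average gap between when an incident started
# and when it was detected. Field names and timestamps are illustrative.
incidents = [
    {"started": datetime(2024, 5, 1, 10, 0), "detected": datetime(2024, 5, 1, 10, 6)},
    {"started": datetime(2024, 5, 9, 14, 0), "detected": datetime(2024, 5, 9, 14, 2)},
    {"started": datetime(2024, 5, 20, 3, 0), "detected": datetime(2024, 5, 20, 3, 22)},
]

detection_minutes = [
    (i["detected"] - i["started"]).total_seconds() / 60 for i in incidents
]
mttd = sum(detection_minutes) / len(detection_minutes)
print(f"MTTD: {mttd:.1f} minutes")  # (6 + 2 + 22) / 3 = 10.0 minutes
```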
In some cases, your monitoring and alerting may not even catch the issue, so you learn about it from the customer. That is plain bad. If your users discover issues before your systems do, it is a sign that your observability strategy needs serious revision, and those blind spots should be closed with proper alerting.
MTTR
Mean Time to Recovery (MTTR) measures how quickly you can fix a problem and get back on track.
For example:
- Total Downtime: 240 minutes
- Number of Incidents: 6
The MTTR is:
- MTTR = Total Downtime / Number of Incidents
- MTTR = 240 minutes / 6 = 40 minutes
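The same calculation as a minimal sketch. The per-incident durations are illustrative, chosen so they sum to the 240 minutes of total downtime in the example.

```python
# Minimal sketch of the MTTR calculation above. Per-incident durations are
# illustrative; they sum to the 240 minutes of total downtime in the example.
incident_durations_minutes = [12, 25, 18, 95, 30, 60]  # 6 incidents

mttr = sum(incident_durations_minutes) / len(incident_durations_minutes)
print(f"MTTR: {mttr:.0f} minutes")  # 240 / 6 = 40 minutes
```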
A lower MTTR means higher resilience since you are recovering faster. But measuring MTTR can be tricky, especially when recovery is not a single service restart but a complex, multi-step process. MTTR also needs a clear incident lifecycle; without one, the number is unreliable. Measuring how fast and how well you recover provides additional data points for leadership and helps teams understand where recovery efforts break down.
There is some criticism towards MTTR as a metric. People often argue that MTTR can be misleading because it assumes all incidents are equal, which they are not. Some incidents are minor, while others are major and take much longer to resolve. This variability makes it hard to truly assess reliability using just MTTR. This is why I think we need to consider other relevant metrics such as MTTD to see the overall impact of an incident and to understand the full lifecycle cost of a failure.
Moreover, MTTR is often conflated with Mean Time to Respond, which focuses more on the response initiation rather than the complete recovery. A good response time does not always mean a quick recovery, especially if the issue is more complex. These should be treated as distinct metrics with different purposes.
Another point worth noting is that not all recovery times are created equal. The MTTR that matters is the one that minimizes the overall impact on users. In practice, we need to focus on decreasing user-experience-based downtime. A partial outage that does not lead to user-facing downtime should not be part of the calculation. It is about understanding which parts of the recovery are most critical for the customer experience and optimizing those first. This shifts the focus from theoretical recovery to meaningful recovery.
Services vs. Platforms
The reliability metrics we use, such as MTTR, MTTD, and error budgets, are not straightforward to apply across different types of systems. The distinction between a service, potentially a microservice, and a platform like a Spark application is important here. Microservices are usually small, self-contained, and have clear boundaries, which makes it easier to define and measure metrics like MTTR and MTTD. Each service can be monitored independently, and issues are typically localized, allowing teams to detect and recover quickly with a predictable blast radius.
On the other hand, a platform that offers Spark execution is a different beast, since it provides a distributed data processing system. It often relies on many interdependent components: resource managers, executors, drivers, storage layers, schedulers, networking, and sometimes external systems like Kafka or object storage. A failure may start in one place but surface somewhere completely different. Measuring MTTR and MTTD for such an application can be tricky because an issue in a Spark job may not be isolated. Problems can pop up due to data ingestion, resource allocation, or even networking. In such cases, the recovery is not just about restarting a single service; it involves diagnosing multiple components, which can drastically increase recovery time.
This inherent complexity makes defining SLIs and SLOs much more challenging. For distributed compute, you must consider availability of the control plane, job completion success rate, queueing delay, and cluster health. If you pick the wrong SLI, you will end up measuring noise instead of reliability. All of this, in turn, complicates the implementation of a realistic error budget.
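One way to make this concrete is to define platform-level SLIs explicitly instead of reusing service-level ones. The sketch below lists candidate SLIs for a Spark-style platform; the names and targets are assumptions for illustration, not a recommended standard.

```python
# Candidate SLIs for a distributed compute platform.
# Names and targets are illustrative assumptions, not a recommended standard.
platform_slos = {
    "control_plane_availability": {"sli": "successful API calls / total API calls", "target": "99.9%"},
    "job_success_rate":           {"sli": "completed jobs / submitted jobs",        "target": "99%"},
    "queueing_delay_p95":         {"sli": "95th percentile wait before execution",  "target": "<= 5 min"},
    "cluster_health":             {"sli": "healthy executors / expected executors", "target": "98%"},
}

for name, slo in platform_slos.items():
    print(f"{name}: target {slo['target']} ({slo['sli']})")
```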
Getting Metrics Implemented
Defining these metrics is one thing, but implementing them properly is a real challenge. If you own a service that depends on multiple departments and services, convincing everyone to do their part can be tough. Resiliency cuts across teams, and often people do not see it as their responsibility. Each and every service contributes to overall reliability. Nonetheless, it is easy for teams to think, “That is not my job.” Hence, educating people about the importance of resilience is also a big part of the job and requires continuous reinforcement, not a one-time announcement.
The other part is the measurement. All teams need to use the same tools to create a consistent, global understanding of how reliable your systems are. Otherwise, you will not be able to report a consistent view. This again goes back to education. You need to train the staff to use the same tools and to interpret the metrics in the same way. Without shared definitions, even good data becomes misleading.
When the Budget Runs Out
Most teams burn through their error budget and nothing happens. Deployments continue. Features ship. The number resets next month. At that point, the budget is theater.
The fix is a written policy with teeth. Google’s version is simple enough: when a service burns its error budget over a four-week window, non-critical changes stop until the service is back within SLO. It gives teams permission to stop shipping and fix the system without having to fight the same argument every time.
Somebody still has to own the call. Usually that means the SRE engineering leaders, but the policy only works if they have enough authority to say no to product. Otherwise it becomes a polite suggestion. I have seen this fail. The budget runs out, the owner flags it, a launch is declared too important, and the freeze never happens.
The sneakier failure is renegotiation. The team does not ignore the budget. They lower the SLO. 99.9% becomes 99.5%, the dashboard improves, and production gets worse. SLO changes need the same scrutiny as budget violations. Otherwise every team gets a pressure valve for uncomfortable moments.
Automation helps. A deploy gate during critical burn removes the human decision when people are least likely to make the right one. It will not fix weak governance, but it makes ignoring the policy harder. That alone is worth something.
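As an illustration of such a gate, the sketch below blocks a deploy once error budget burn crosses a threshold. The function names and the 100% freeze threshold are assumptions; a real implementation would pull burn data from your monitoring system and run inside the CI/CD pipeline.

```python
# Minimal sketch of an error-budget deploy gate (illustrative; a real gate would
# query your monitoring system for actual burn and run inside the CI/CD pipeline).

def error_budget_burn(observed_downtime_min: float, budget_min: float) -> float:
    """Fraction of the error budget consumed in the current window."""
    return observed_downtime_min / budget_min

def can_deploy(burn: float, freeze_threshold: float = 1.0) -> bool:
    """Block non-critical deploys once the budget is fully burned."""
    return burn < freeze_threshold

burn = error_budget_burn(observed_downtime_min=51.0, budget_min=43.2)
if not can_deploy(burn):
    raise SystemExit(f"Deploy blocked: error budget burn at {burn:.0%}")
```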
In Consequence
Reliability is an organisational problem. The metrics are the easy part. You can define these in an afternoon. But a dashboard full of well-defined metrics does nothing if teams treat reliability as someone else's responsibility. The oncall engineer paged at 3am is not the reliability strategy. They are the last line of defence in a system that has already failed.
The organisations that get this right treat reliability as a constraint, not a goal, the same way they treat legal or compliance. You do not negotiate with it when it gets inconvenient. You build around it from the start. That shift in framing matters: error budgets become a forcing function, MTTR becomes a design requirement, and teams start asking "how do we make this failure mode impossible?"
Reliability compounds. A team that invests in it early moves faster later, because they are not constantly fighting fires they started themselves. A team that ignores it spends more and more of its capacity just standing still. The question is not whether reliability is worth prioritising, but how much longer you can afford to act as though deprioritising it is not already a choice.
