Calculating Service Level Objectives (SLO) and using Service Level Indicators (SLI)


About the author

Principal Site Reliability Engineer (SRE), Startup Founder and IT Leader. Building, tuning and supporting mission critical systems.  Platform creator, inquisitive technologist and innovation enthusiast with more than 23 years of Information Technology Industry experience.

Experience includes the Airlines, E-Commerce, Fintech, Hospitality and Streaming Services industries as well as State and Local Government. Launched several startups, providing an expert technical perspective on key technology decisions.  For more information about the author please visit his site at www.briandibella.com or LinkedIn.

Summary

This article shows how to calculate Service Level Objectives (SLOs), explains how they differ from Service Level Indicators (SLIs) and shows how to leverage both.  SLOs are internal agreements within the development and product team that define the reliability of a service, as a percentage, over a defined period of time.  

SLOs can be used to display the health of a service, but their real power lies in measuring the reliability of a component so that the team can decide whether to deploy new features or to fix bugs and performance issues.  

An SLO is often expressed as a percentage of time when things are going as planned vs when they are not.  However, SLOs can also be expressed as a number of errors or another metric over time, e.g. number of errors over 30 days.  

In essence, an SLO should tell you how reliable a piece of software is, using a metric for your application and/or each component that makes up that application, e.g. micro-service, database, API, third party API, network, service, et cetera.  

SLO Thresholds

An SRE thinks about the user's experience: whether we are frustrating the user and at what point they will stop using our application.  SLO thresholds should be set at the point where most users are so frustrated with our application that they will begin to look for other options.  

It takes time, effort, and money to make an application run at 99.9999% reliability. If the user does not care whether the app runs with less reliability, then neither should we, which saves time, effort and money.  When selecting SLO thresholds, find the value of the metric being measured at which users begin to get frustrated.  If we are measuring latency, for example, we know that users begin to notice slowness around 400 ms.  An SLO target set to 99.9% is a good starting point for most applications and underlying components. 
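As a rough illustration of measuring against a threshold, the sketch below computes a latency SLI as the percentage of requests that finished within 400 ms. The function name `latency_sli` and the sample latencies are hypothetical and invented for this example; real values would come from your monitoring system.

```python
def latency_sli(latencies_ms, threshold_ms=400.0):
    """Return the percentage of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return 100.0 * good / len(latencies_ms)


# Hypothetical sample of 10 request latencies in milliseconds.
samples = [120, 95, 310, 405, 220, 180, 399, 650, 150, 90]
print(f"{latency_sli(samples):.1f}% of requests finished within 400 ms")
```

The resulting percentage can then be compared with the SLO target in the same way as the availability figure used in the rest of this article.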

SLO Calculation

100 - ((minutes failed / total minutes) X 100) = measured reliability as a percentage.  The SLO target it is compared against is often represented as nines, e.g. 99.9% or 3 nines.

Example SLO: 

Availability = a measure of whether the service is reachable and working (up) vs not reachable or not working (down). 

Over a period of 30 days our service was down for 5 minutes and 54 seconds.  This measurement comes from the monitoring system, whatever that may be. We set our SLO target to 3 nines of availability or 99.9% up time. 

What We Know

  • We are measuring around the clock (24 hours a day, 7 days a week) for a 30 day month
  • 24 hours X 30 days = 720 hours
  • 720 hours X 60 minutes = 43,200 minutes in the month
  • Our monitoring system has reported 5 minutes 54 seconds of down time, which we round up to 6 bad minutes

Note: The total number of minutes varies slightly with the number of days in the month, or if an average over the year is used.  To keep the equation simple, we choose 30 days for a month.
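To see how much the total varies, here is a minimal sketch that counts the minutes in an actual calendar month using Python's standard `calendar` module. The helper name `minutes_in_month` and the year 2023 are arbitrary choices for illustration, and the sketch assumes every day has exactly 24 hours (no daylight saving adjustments).

```python
import calendar


def minutes_in_month(year, month):
    """Minutes in a calendar month, assuming every day has 24 hours."""
    days = calendar.monthrange(year, month)[1]  # second item is the day count
    return days * 24 * 60


for month in (2, 4, 7):
    print(calendar.month_name[month], minutes_in_month(2023, month))
```

A 30 day month such as April gives the 43,200 minutes used in this article, while shorter and longer months move the total by a few thousand minutes.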

Down Time Percentage

    (6 / 43,200) X 100 = 0.0139% of down time used (rounded).

Up Time Percentage 

    100% - 0.0139% = 99.9861% of up time, which is our SLI: it indicates the current status of the service.

    Because we have 99.9861% up time, our SLO target of 99.9% for availability was met.  Success! 
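The same arithmetic can be sketched in a few lines of Python. The function name `availability_percent` is made up for this example; the inputs are the 6 bad minutes and 43,200 total minutes from the example above.

```python
def availability_percent(bad_minutes, total_minutes):
    """Up time percentage: 100 - ((bad minutes / total minutes) X 100)."""
    return 100.0 - (bad_minutes / total_minutes) * 100.0


slo_target = 99.9  # three nines
sli = availability_percent(bad_minutes=6, total_minutes=43_200)

print(f"SLI: {sli:.4f}%")
print("SLO met" if sli >= slo_target else "SLO missed")
```

Keeping the SLI calculation separate from the comparison with the target makes it easy to reuse the same function for other targets or time periods.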


Calculating the Error Budget

Based on the math above, we now know that we have extra time in our error budget.  The team can now decide whether to release a new feature or focus on fixing bugs and performance.  If we choose to release a new feature, the rollback plan must be able to execute and complete without using more than the remainder of the error budget.  But how much time do we have? 

The error budget is the complement of the SLO target: 100% minus the target.  Error budgets are used for making decisions on where to focus your effort as a developer.  To calculate an error budget, we take the SLO target (99.9%) and work out, in minutes (or seconds if necessary), how much down time it allows for the month.  

Error Budget Calculation

    Down time allowed by the SLO target in minutes - minutes used = Error budget remaining in minutes 

What We Know

  • 43,200 minutes in a month 
  • SLO Target is 99.9%.  We set this target after talking with the team about the service's purpose, criticality, historical performance and user expectations.

Calculating the Percentage and Minutes of the Error Budget 

    100% - 99.9% = 0.1% 

    Convert 0.1% to a decimal: 0.1 / 100 = 0.001

    43,200 X 0.001 = 43.20 minutes of down time available as an error budget.  We can round down to 43 minutes.

Note: This number may vary slightly depending on the number of days in the month or on whether an average over the year is used.  To keep the equation simple, we choose 30 days for a month.
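To show how quickly the budget shrinks as nines are added, the sketch below applies the same calculation to a few targets. The function name `error_budget_minutes` is hypothetical, and the 99% and 99.99% targets are included only for comparison with the 99.9% target used in this article.

```python
def error_budget_minutes(slo_target_percent, total_minutes=43_200):
    """Down time allowed by an SLO target over the period, in minutes."""
    return total_minutes * (100.0 - slo_target_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} minutes")
```

Each extra nine divides the allowed down time by ten, which is why the target should follow what users actually need.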

Calculating the Used Error Budget

    100% - 99.9861% = 0.0139% used for the month.

    Convert 0.0139% to a decimal. 

        0.0139 / 100 = 0.000139  

        43,200 X 0.000139 = 6.0048 minutes used, or 6 minutes after rounding.  

    Subtract minutes used from total error budget for the month.  

        43.20 - 6 = 37.20 minutes or 37 minutes of available down time within the error budget.    
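In practice down time usually arrives as several separate incidents rather than one figure. A minimal sketch, assuming a hypothetical helper `remaining_budget` and a list of incident durations in minutes, subtracts each incident from the budget. Only the single 6 minute incident comes from the example above.

```python
def remaining_budget(budget_minutes, incident_minutes):
    """Subtract the duration of every incident from the error budget."""
    return budget_minutes - sum(incident_minutes)


budget = 43.20     # minutes allowed by a 99.9% target over 30 days
incidents = [6.0]  # the rounded 5 minutes 54 seconds from the example

left = remaining_budget(budget, incidents)
print(f"{left:.2f} minutes of error budget remaining")
```

A negative result would mean the budget is exhausted and the SLO for the period has been missed.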

Conclusion

The development team can confidently roll back an environment in 10 to 15 minutes maximum.  

In this case there is plenty of time to roll back a deployment.  Users will be disrupted during a rollback but, based on our assessment of reliability, they will tolerate the disruption.   

Therefore, releasing new features can proceed, with the knowledge that any impact to users must stay within the generous window of 37 minutes remaining for this month.  
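The release decision itself can be written as a simple check. This is only a sketch: the function name `rollback_fits_budget` and the optional `safety_margin_minutes` parameter are assumptions for illustration, not part of any standard tool.

```python
def rollback_fits_budget(remaining_minutes, rollback_minutes,
                         safety_margin_minutes=0.0):
    """True if a worst-case rollback fits inside the remaining error budget."""
    return rollback_minutes + safety_margin_minutes <= remaining_minutes


# 37.2 minutes remaining, worst-case rollback of 15 minutes.
if rollback_fits_budget(remaining_minutes=37.2, rollback_minutes=15):
    print("Release can proceed")
else:
    print("Focus on reliability work first")
```

A team that wants extra headroom could pass a non-zero safety margin so that one bad release does not consume the entire budget.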

I recommend this math be automated and displayed within a dashboard of SLIs illustrating the resulting numbers for the following:

  • Current percentage of up time (the SLI)
  • Percentage of target error budget used
  • Minutes of error budget used
  • Minutes of error budget remaining
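As a starting point for that automation, the sketch below derives the four dashboard values from three inputs. The `SloDashboard` class and `build_dashboard` function are hypothetical names, and how the bad minutes are collected from a monitoring system is left out.

```python
from dataclasses import dataclass


@dataclass
class SloDashboard:
    current_percent: float           # current up time percentage (the SLI)
    budget_used_percent: float       # share of the error budget consumed
    budget_used_minutes: float       # minutes of error budget used
    budget_remaining_minutes: float  # minutes of error budget remaining


def build_dashboard(bad_minutes, total_minutes, slo_target_percent):
    """Derive the four dashboard values from the raw measurements."""
    budget_minutes = total_minutes * (100.0 - slo_target_percent) / 100.0
    return SloDashboard(
        current_percent=100.0 - bad_minutes / total_minutes * 100.0,
        budget_used_percent=bad_minutes / budget_minutes * 100.0,
        budget_used_minutes=bad_minutes,
        budget_remaining_minutes=budget_minutes - bad_minutes,
    )


board = build_dashboard(bad_minutes=6, total_minutes=43_200,
                        slo_target_percent=99.9)
print(f"Current up time:        {board.current_percent:.4f}%")
print(f"Error budget used:      {board.budget_used_percent:.1f}%")
print(f"Budget minutes used:    {board.budget_used_minutes:.2f}")
print(f"Budget minutes left:    {board.budget_remaining_minutes:.2f}")
```

Note that the percentage of the error budget used (6 of 43.2 minutes) is a different number from the down time percentage for the month, so it helps to label both clearly on the dashboard.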

I recommend SLOs on each component that measure availability, latency, error rate, saturation and throughput.  In addition, all time frames should remain fixed so that reliability is reported consistently from one period to the next. 

This example is simplified to keep the math as easy as possible.  It is recommended to measure down time to the second, especially if your application has the potential of losing thousands of dollars or more per minute. 
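For comparison, here is a sketch of the same example worked to the second with Python's standard `datetime.timedelta`, using the unrounded 5 minutes 54 seconds of down time. The variable names are arbitrary.

```python
from datetime import timedelta

period = timedelta(days=30)
downtime = timedelta(minutes=5, seconds=54)  # unrounded monitoring figure
slo_target_percent = 99.9

period_s = period.total_seconds()
downtime_s = downtime.total_seconds()

# Error budget and remaining budget, both in whole seconds.
budget_s = period_s * (100.0 - slo_target_percent) / 100.0
remaining = timedelta(seconds=round(budget_s - downtime_s))

availability = 100.0 - downtime_s / period_s * 100.0

print(f"Availability: {availability:.4f}%")
print(f"Error budget remaining: {remaining}")
```

Working in seconds avoids rounding 5 minutes 54 seconds up to 6 minutes, so the remaining budget comes out slightly larger than the rounded 37 minutes.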

Good luck using this process to produce useful SLOs and Error Budgets for your application teams to use with the purpose of improving the reliability of an application and its components. 

