Introduction to Grafana SLO
With Grafana SLO, you can create, manage, monitor, and alert on your Service Level Objectives.
Key features
- Learn about SLOs interactively with in-app guidance and this documentation.
- Set up your SLOs in Grafana using the UI, Terraform, or the API.
- Use our query builder to easily make ratio-based SLIs.
- Create an SLI from any PromQL query.
- Iterate on the query and objective to build a high-quality SLO. Update your SLI, evaluation windows, and objectives before deploying alerts.
- Use multi-dimensional SLIs.
- Look back at your SLO’s performance over time, including before the SLO was created.
- Generate multi-window, multi-burn rate alerting rules to ensure you’re notified at the right time.
- Automatically generate a dashboard for each SLO that lets you drill down into the underlying metrics when investigating alerts.
Overview
There are two key components to the Grafana SLO framework.
Service Level Indicators (SLIs)
A key performance metric, such as availability. An SLI is a measurement you take over time that tells you about the health or performance of a service, expressed as a fraction of actual results: for example, 99.9% system availability, or 0.999. Grafana SLO can help you create a high-quality SLI with the ratio query builder, and it also supports any custom PromQL query as an SLI.
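As a minimal sketch, a ratio-style availability SLI divides successful requests by all requests. The metric name `http_requests_total` and its `job` and `code` labels below are illustrative assumptions, not a fixed Grafana requirement:

```promql
# Fraction of successful requests: non-5xx responses divided by all responses.
# Assumes a counter named http_requests_total with "job" and "code" labels.
sum(rate(http_requests_total{job="web", code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="web"}[5m]))
```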
Service Level Objectives (SLOs)
The target an SLI should achieve. You define what an acceptable level of service is. In the availability example, this is the percentage of error responses that is acceptable within a given time frame, so that customers don’t notice a degradation in the service you provide: 99.9% of requests to a web service return without server errors over 28 days.
When defining your SLOs, it is important to remember that you are not aiming for 100%. The cost and complexity of availability get higher the closer you get to 100%, so it’s important to build a margin of error into your target, known as the error budget.
To highlight how this works, let’s use the example of a credit card processing app. The company has 99.95% availability written into its contracts, but that figure alone doesn’t give a clear picture of what customers actually expect from the service.
Using the SLO framework, this company can be more specific about its availability goals. The SLO in this case is that 99.97% of requests to validate a credit card should return without a 500 error in less than 100 ms. Validation should feel instantaneous, because e-commerce websites need to show customers a mistyped number before they hit the “buy” button. The SLI in this case is the percentage of credit card validation requests that return without errors in less than 100 ms.
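As a rough sketch, assuming the validation service exposes a Prometheus latency histogram named `http_request_duration_seconds` with `handler` and `code` labels (illustrative names, not part of the example above), that SLI could be expressed as:

```promql
# Fraction of credit card validation requests that complete in under 100 ms
# (the le="0.1" bucket) without a server error.
sum(rate(http_request_duration_seconds_bucket{handler="/validate", code!~"5..", le="0.1"}[5m]))
/
sum(rate(http_request_duration_seconds_count{handler="/validate"}[5m]))
```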
Some other key concepts to be familiar with when implementing your SLO strategy are:
Error budget
Error budgets allow for a certain amount of failure when measuring the performance of a service. An error budget is the difference between actual and desired performance: using the example above, the difference between perfect service (100%) and the service level objective (99.97%).
In this case, the error budget can be measured as a percentage (like 0.03% failures) or an amount of time (12 minutes of non-compliance per 28 days).
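To see where the 12 minutes comes from: a 28-day window contains 28 × 24 × 60 = 40,320 minutes, and 0.03% of 40,320 minutes is roughly 12 minutes (40,320 × 0.0003 ≈ 12.1).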
Burn rate
Burn rate is the rate at which your service is using up its error budget, the amount of imperfection you’re willing to accept in your service.
By setting an SLO of 99.5%, you’re saying it’s okay for 0.5% of requests to return errors or take longer than 500 ms. With a constant error rate of 0.5%, your service uses up its entire error budget exactly at the end of the 28-day window. That’s a burn rate of 1. A slower burn (such as 0.75) is good: it means you’re beating your SLO. A faster burn is bad: it means you’re providing lower-quality service than your users expect, and you should do something about it.
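As a sketch, burn rate is the observed error rate divided by the error budget fraction. Reusing the illustrative `http_requests_total` counter from earlier and the 99.5% objective (a 0.5% budget):

```promql
# Burn rate = observed error rate / error budget fraction.
# A value of 1 means the budget is used up exactly over the SLO window;
# values above 1 mean the budget is being consumed too fast.
(
  1 - (
      sum(rate(http_requests_total{job="web", code!~"5.."}[1h]))
    /
      sum(rate(http_requests_total{job="web"}[1h]))
  )
)
/ 0.005
```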
Alert on your burn rate
SLO alert rules trigger alerts when you’re in danger of using up the error budget within your SLO time frame. This ensures support teams are notified only when an issue impacts their business objectives, not every time a monitored resource or process breaches a set threshold.
Grafana generates both fast and slow burn rate alerts, because you’ll likely want to react differently when your service is burning its error budget slowly (for example, open a ticket if a bug has increased your error rate) versus quickly (for example, notify the on-call engineer for a regional outage).
For example, if you’re burning error budget at a rate of 2% per hour (in our case, that corresponds to an error rate of X%), Grafana triggers an alert that should page an on-call engineer using a tool like Grafana OnCall. This catches urgent events, such as outages or hardware failures.
If you’re burning through your error budget at 0.8% per hour, Grafana sends a less critical alert, intended for you to open a ticket in Jira, ServiceNow, GitHub, or another ticketing system. This catches less urgent events, such as bugs or network slowdowns.
Fast-burn alert rule:
Over short time scales, Grafana alerts when the burn rate is very high. This alerts you to very serious conditions, such as outages or hardware failures.
Slow-burn alert rule:
Over longer time scales, Grafana alerts when the burn rate is lower. This alerts you to ongoing issues that require attention.
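Grafana creates and manages these rules for you, but to illustrate the multi-window, multi-burn-rate pattern, a hand-written fast-burn condition for a 99.9% objective (0.1% budget) might look like the sketch below. The 14.4x threshold over 1-hour and 5-minute windows follows a common SRE convention; the metric names are assumptions, and this is not the exact rule Grafana generates:

```promql
# Fast-burn condition: error rate exceeds 14.4x the 0.1% budget over both a
# long (1h) and a short (5m) window. The short window keeps the alert from
# staying active long after the problem is fixed.
(
  1 - (
      sum(rate(http_requests_total{job="web", code!~"5.."}[1h]))
    /
      sum(rate(http_requests_total{job="web"}[1h]))
  )
) > (14.4 * 0.001)
and
(
  1 - (
      sum(rate(http_requests_total{job="web", code!~"5.."}[5m]))
    /
      sum(rate(http_requests_total{job="web"}[5m]))
  )
) > (14.4 * 0.001)
```

A slow-burn rule follows the same shape, typically with longer windows (for example, 6 hours and 30 minutes) and a lower burn-rate factor.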
SLO usage and billing
Each Grafana instance is limited to 25 SLOs by default. If you need to increase this limit, contact us or ask your account executive, support engineer, or technical account manager.
SLOs create new data points every 60 seconds, or 1 data point per minute (DPM), of Prometheus metrics. Each SLO creates 10-12 Prometheus recording rules, and each recording rule creates one or more series depending on the grouping labels you provide. If the output of your SLI query has very high cardinality, an SLO will create many new series.
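To illustrate how grouping labels affect series count, a recording-rule-style expression that preserves hypothetical `cluster` and `service` labels produces one series per label combination:

```promql
# One output series per (cluster, service) combination present in the data.
# 10 clusters x 20 services would yield up to 200 series for this single rule.
sum by (cluster, service) (rate(http_requests_total{code!~"5.."}[5m]))
```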
For more information about metrics and how to manage DPM, refer to the Grafana Cloud metrics optimization documentation.