Create SLOs
Create SLOs to measure the quality of service you provide to your users.
Each SLO contains an SLI, a target, and an error budget. You can also choose to add SLO alert rules to alert on your error budget burn rate.
In the following sections, we’ll guide you through the process of creating your SLOs.
To create an SLO, use the in-product setup wizard or follow these steps:
- Define a service level indicator
- Set a target
- Add a name and description
- Add SLO alert rules
- Review and save your SLO
Before you begin
Before creating an SLO, complete the steps to Set up Grafana SLO.
Set up Alert Notifications [Optional]
To use a custom notification policy for your SLO alert rules, complete the following steps:
Configure a notification policy and add SLO labels as label matchers in the Alerting app.
Add the following SLO labels to make severity-based routing decisions. Custom labels that were added to the SLO or its alerting configuration can also be used.
- grafana_slo_severity # "warning" or "critical"
Configure contact points for the notification policy in the Alerting application.
Note
For fast-burn alert rules (for example, grafana_slo_severity="critical"), we suggest using a paging service, such as Grafana OnCall.
For slow-burn alert rules (for example, grafana_slo_severity="warning"), we suggest opening a ticket in Jira, ServiceNow, GitHub, or another ticketing system.
Define a service level indicator
Start with your service level indicator (SLI), the metric you want to measure for your SLO.
SLIs (Service Level Indicators) are the metrics you measure over time that indicate the health or performance of a service.
Click Alerts & IRM > SLO > + Create SLO.
Enter time window.
Select a time frame over which to evaluate the SLO.
Note
The default time window for Grafana SLOs is 28 days, because it always captures the same number of weekends, no matter what day of the week it is. This accounts better for traffic variation over weekends than a 30-day SLO.
Choose a data source.
Select a data source from the data source drop-down picker.
Note
Grafana SLOs can be created for Grafana Cloud Metrics, Grafana Enterprise Metrics, and Grafana Mimir data sources.
To use a Grafana Mimir data source, the Ruler API must be enabled, and the user creating an SLO must be allowed to create recording and alerting rules for the data source.
Choose query type.
Ratio query
The Ratio Query Builder builds a properly formatted SLI ratio query without the need to write PromQL. Syntax, required variables, and expression format are generated based on the metrics you enter. Use the Ratio Query Builder if you are new to PromQL or if your SLO doesn't require data manipulation at the query level.
a. Enter a success metric.
b. Enter a totals metric.
c. [Optional]: Enter grouping labels to generate an SLO with multiple series, breaking the SLO down into logical groups.
Grouping labels are useful, for example, when an SLO tracks the performance of a service that you want to break down by region: you can then see how much error budget you have left per region, as shown in the example below.
Grafana creates multidimensional alerts, so if you group by cluster, each cluster has its own associated alerts.
Note
Grouping labels can change the number of Data Points per Minute (DPM) series that are produced. You can see the number of series created in the SLI ratio, where each series is represented by its own line on the graph displayed in the wizard. For example, grouping by "cluster" creates a separate series for each individual cluster in the SLO. This means that if you have 5 different clusters, the SLO creates 5 time series, each with the same namespace, in the Define SLI step of the wizard.
The generated expression and a preview of the query are displayed in the Computed SLI ratio window.
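For illustration, a generated ratio query takes roughly the following shape. The metric name (http_requests_total), its status_class label, and the cluster grouping label are hypothetical, and the exact expression the builder produces may differ:
# Success events: requests that returned a 2xx status, grouped by cluster.
sum by (cluster) (rate(http_requests_total{status_class="2xx"}[$__rate_interval]))
/
# Total events: all requests, grouped by the same label.
sum by (cluster) (rate(http_requests_total[$__rate_interval]))
Each value of the cluster label becomes its own SLI series, with its own error budget and alerts.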
Advanced query
The advanced query editor lets you write a freeform query, using the same query editing experience found throughout Grafana. You can choose between a graphical query building experience or writing PromQL in a code textbox.
All advanced queries must use a ratio format.
To create an advanced query, enter a query that returns a number between zero and one.
For example, divide the rate of successful requests to a web service by the rate of total requests to the web service.
"Successful" events could be a count of requests to your web service that received a response in less than 300ms, and "total" events could be the count of all requests to your web service.
See this example from the Prometheus documentation about successful events divided by total events:
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[$__rate_interval])) by (job)
/
sum(rate(http_request_duration_seconds_count[$__rate_interval])) by (job)
Set a target
Set the target to a value that indicates “good performance” for the system.
For example, if you define a 99% target for requests with a status code of 200, then as long as at least 99% of requests have a 200 status code over your time window, you are meeting your SLO.
The error budget is the amount of error that your service can accumulate over a certain period of time before your users start being unhappy.
For example, a service with an SLO of 99.5% availability has an error budget of 0.5% downtime, which amounts to about 3.4 hours of downtime over a 28-day period.
To set a target, enter a percentage greater than 0 and less than 100%.
The error budget is automatically calculated as 100% - target.
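As an illustration (not the exact recording rules the SLO app generates), the fraction of error budget remaining over the full window can be expressed in PromQL along these lines, assuming a hypothetical http_requests_total metric whose code="200" label marks successful requests (as in the 99% example above):
# Error budget remaining = 1 - (observed error rate / allowed error rate).
1 - (
  (1 - sum(rate(http_requests_total{code="200"}[28d])) / sum(rate(http_requests_total[28d])))
  /
  (1 - 0.99)
)
A result of 1 means no budget has been spent, 0 means the budget is exhausted, and negative values mean the SLO has been violated.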
Statistical Predictions (Beta)
To help you select a realistic SLO target, the SLO app queries 90 days of history from the raw metrics used to define the SLI, then simulates many scenarios to compute a cumulative distribution function. This helps you choose a target that has a high likelihood of being met based on historical data.
Select different target probabilities on the presented graph to adjust your SLO target and the likelihood of hitting that target.
Note
At times, predictions can't be generated. In those instances, an Error budget panel based on the provided query is displayed instead.
Currently, only ratio queries are supported. Freeform queries that can be parsed as a ratio type will be supported in a future release.
Name the SLO
Give your SLO a name. You can also add an optional description or labels to the SLO to give it more context for searches and management.
Good SLO names, descriptions, and labeling practices are a critical part of SLO maintenance and management. A single sentence that is understandable to stakeholders clarifies meaning and adds value to the SLO. Consistent naming conventions make communication about your SLOs and searches easier.
SLO names identify your SLO in dashboards and alert rules.
Add a name for your SLO.
Make sure the name is short and meaningful, so that anyone can tell what it’s measuring at a glance.
Add a description for your SLO.
Make sure the description clearly explains what the SLO measures and why it’s important.
Add SLO labels.
Add team or service labels to your SLO, or add your own custom key-value pair.
Add SLO alert rules
SLO-based alerting prevents noisy alerting while making sure you’re alerted when your SLO is in danger. Add predefined alert rules to alert on your error budget burn rate, so you can respond to problems before consuming too much of your error budget and violating your SLO.
SLO alert rules alert on burn rate over different time windows. Burn rate is the rate at which your SLO is consuming its error budget. A burn rate of 1 consumes exactly your error budget (for example, 1%) over your entire time window (for example, 28 days); in this scenario, you would exactly meet your SLO. A burn rate of 10 consumes all of your error budget in one-tenth of the time window (2.8 days), breaking your SLO.
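As an illustrative sketch (not the exact expressions the app installs), the burn rate over a given window is the observed error rate divided by the error budget. Assuming the hypothetical http_requests_total metric used above and a 99% target:
# Burn rate over the last hour: observed error rate / error budget.
# 1 means the budget is being spent at exactly the sustainable rate;
# 14.4 means the entire 28-day budget would be gone in roughly 2 days.
(1 - sum(rate(http_requests_total{code="200"}[1h])) / sum(rate(http_requests_total[1h])))
/
(1 - 0.99)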
Fast-burn alert rules:
Over short time scales, Grafana sends alerts when the burn rate is very high. This alerts you to very serious conditions, such as outages or hardware failures. The fast-burn alert triggers when the error budget will be used up in a matter of minutes or hours.
Slow-burn alert rules:
Over longer time scales, Grafana alerts when the burn rate is lower. This alerts you to ongoing issues that require attention. The slow-burn alert triggers when the error budget will be used up in a matter of hours or days.
If you decide to generate alert rules for your SLO to notify you when an event (like an outage) consumes a large portion of your error budget, SLO alert rules are automatically added with predefined conditions and routed to a notification policy.
Note
When you add SLO alert rules, they are installed in Grafana Cloud Alerting in the stack where you create the SLO. The unmodified SLO name is included in the alert name.
To automatically add SLO alert rules, select the Add alert rules checkbox.
SLO alert rules are added once you save your SLO in the Review SLO step.
SLO alert rules are generated with default alert rule conditions.
Advanced Options
The options and features under the Advanced Options header are not required to build SLOs with either the UI or Terraform. They give advanced users control over specific conditions of their SLO.
Minimum Failures
Minimum Failures defines the minimum number of failure events (total events - success events) that must occur in a window before an alert fires. To use this feature, you must have a defined SLI that parses as a ratio in the SLO app. It adds a term to the PromQL that suppresses alerting until the number of failure events exceeds the threshold. Because it is applied to all alert rules, the minimum window MinFailures applies to is 1 hour.
To reset to the default behavior, set MinFailures to 0 in the UI or in Terraform, or remove MinFailures from your Terraform configuration. To learn more about MinFailures, view the best practices here.
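Conceptually, the added term gates alerting on an absolute failure count, roughly as sketched below. The metric names are the hypothetical ones used earlier and the threshold of 10 is arbitrary; this is not the exact rule the app generates:
# Only allow the alert to fire if at least 10 failures occurred in the last hour.
(
  sum(increase(http_requests_total[1h]))
  -
  sum(increase(http_requests_total{code="200"}[1h]))
) > 10
The burn-rate condition is then combined with a term like this using the and operator.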
View alert rules
View the conditions, name, description, and labels for fast-burn and slow-burn alert rules.
Once you have saved your SLO, you can view your SLO alert rules in the Alert list view in the Alerting application. Here, you can view the status of your alert rules and whether there are any firing alert instances associated with them.
For more information on the state and health of alert rules and instances, refer to View the state and health of alert rules.
Conditions
The fast-burn alert rule creates alerts under two conditions:
- The error rate is at least 14.4 x the error budget (a burn rate of 14.4) when averaged over both the last 5 minutes and the last hour.
- The error rate is at least 6 x the error budget (a burn rate of 6) when averaged over both the last 30 minutes and the last 6 hours.
The slow-burn alert rule creates alerts under two conditions:
- The error rate is at least 3 x the error budget (a burn rate of 3) when averaged over both the last 2 hours and the last 24 hours.
- The error rate is at least 1 x the error budget (a burn rate of 1) when averaged over both the last 6 hours and the last 72 hours.
These alert rules are designed so that alerts are created in response to either severe or sustained burn rate, without alerting for minor, transient burn rate.
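In PromQL terms, the first fast-burn condition looks roughly like the following sketch. The slo_burn_rate_5m and slo_burn_rate_1h names stand in for hypothetical recording rules over the burn-rate expression shown earlier; they are not the rule names the app actually creates:
# Fire only when the burn rate is high over both a short and a long window,
# which filters out brief spikes that don't meaningfully threaten the budget.
(slo_burn_rate_5m > 14.4) and (slo_burn_rate_1h > 14.4)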
Learn more about alerting on SLOs in the Google SRE workbook.
Name and Description
View the Name and Description fields.
Alert rule labels and annotations
When your SLO is created, a set of labels is automatically created to uniquely identify the alert rules for your SLO. This includes the grafana_slo_severity label, but not the severity label. Label values can be overridden in either the GUI or the API/Terraform resources, but the grafana_slo_severity label name itself cannot be changed.
Each SLO alert rule contains predefined labels according to its severity and evaluation behavior. This means a fast-burn alert rule has a grafana_slo_severity=critical label and a slow-burn alert rule has a grafana_slo_severity=warning label.
Note that if you create a custom notification policy, you have to add these labels as label matchers in the notification policy.
Custom annotations can also be added to both the slow-burn and fast-burn alert rules. Annotations will be attached to any alerts generated by these rules. For more information on alerting annotations, refer to the annotations documentation for alerts.
Note
SLO alert rules are stored alongside the recording rules in a Grafana Mimir namespace called "grafana_slo_<stack_id>", where the stack ID refers to the stack in which the SLO plugin is running. This enables you to quickly search for and uniquely identify SLO alert rules.
For more information on labels, refer to label matchers.
For more information on rule groups and namespaces, refer to Namespaces and groups.
View notification policies
When alerts fire, they are routed to a default notification policy or to a custom notification policy with matching labels.
Note
If you have custom notification policies defined, the labels in the alert rule must match the labels in the notification policy for notifications to be sent out.
Notifications are sent to the contact point integrations (for example, Slack or email) defined in the notification policy. Email is the contact point in the default notification policy, but you can add contact points as required.
For more information on notification policies, refer to Manage notification policies.
Review and save your SLO
Review each section of your SLO and once you are happy, save it to generate dashboards, recording rules, and alert rules.
Note
If you selected the option of adding SLO alert rules, they are displayed here. They are not created until you save your SLO.