Best practices for Grafana SLOs
Because SLOs are still a relatively new practice, creating them for the first time can feel overwhelming. To simplify things, this page provides best practices for SLOs and SLO queries.
What is a good SLO?
A Service Level Objective (SLO) defines specific, measurable targets that represent the quality of service a provider delivers to its users. The best place to start is with the level of service your customers expect. Sometimes these expectations are written into formal service level agreements (SLAs) with customers, and sometimes they are simply what customers implicitly expect from a service.
Good SLOs are simple. Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it’ll make it hard to pay attention to the ones that matter.
A good SLO is attainable, not aspirational
Start with a realistic target. Unrealistic goals create unnecessary frustration, which can eclipse the useful feedback an SLO provides. Remember, an SLO is meant to be achievable and to reflect the user experience. An SLO is not an OKR.
It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.
Target services with good traffic
Too little traffic makes trends hard to monitor, can cause noisy alerts, and lets small irregularities show up disproportionately in low-traffic environments. Conversely, too much traffic can mask customer-specific issues.
Team alignment
Teams, not managers, should create SLOs and SLIs. SLOs should give you feedback about your services and your customers’ experience with them, so it’s best for the team to work together to create them.
Embed SLO review in team rituals
As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.
Iterate and adjust
Once SLO review is part of your team rituals, iterate on the information you gather so you can make increasingly informed decisions.
As you learn more from your SLOs, you may learn your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may find there are a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly.
Alerts and labels
SLO alerts are different from typical data source alerts. Because alerts for SLOs let you know there is a trend in your burn rate that needs attention, it’s important to understand how to set up and balance fast-burn and slow-burn alerts to keep you informed without inducing alerting fatigue.
Prioritize your alerts
Have your alerts routed first to designated individuals to validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire so that the appropriate people can quickly respond to possible pressing issues. Send group notifications for slow-burn alerts to analyze and respond to as a team during normal working hours.
Use labels
Set up good label practices. Keep them limited to make them navigable and consumable for triage.
Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.
Minimum Failures
To reduce alert fatigue, a team may want to set a minimum number of failures (defined as total events minus success events) before an alert is triggered. This is most common for SLIs built on processes with heavily periodic or spiky traffic, where low traffic rates make alerting rules unreliable.
The ideal solution for low-traffic services is to supplement your traffic with synthetics so you always get a clear signal on whether your failure events represent an issue.
If you are unable to use synthetics, you can choose to change the Minimum Failures advanced feature number. This number is applied to all your alerting time windows for the SLO, the smallest of which is 1 hour. This means that, if your service never gets enough traffic to exceed the Minimum Failures number, it won’t trigger an alert.
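To pick a sensible Minimum Failures value, it can help to check how many failures your service actually accumulates in that smallest one-hour window. A rough sketch, assuming a requests_total counter labeled by HTTP status code and counting failures as total events minus success (non-5xx) events:
# Failures accumulated over the smallest alerting window (1 hour).
sum(increase(requests_total[1h])) - sum(increase(requests_total{code!~"5.."}[1h]))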
Query tips and pitfalls
There are many ways to configure your SLO queries, and the right approach depends on your needs. Just remember: if you don’t have metrics that represent your users’ experience, you need new metrics.
Keep queries simple
The best SLIs are based on Prometheus counter metrics (that is, monotonically increasing series) and use labels to encode each counted event as either a success or a failure (for example: requests_total{code="200"}). If your metrics don’t look like this, it’s usually better to reinstrument your service with well-suited metrics than to try to work around the issue with complex SLI query definitions.
Availability and Latency are the most common SLOs to start with for request-driven services. For example:
- Availability (non-5xx responses):
  requests_total{code!~"5.."} / requests_total
- Latency (less than 1 second):
  requests_duration_seconds_bucket{code!~"5..", le="1.0"} / requests_duration_seconds_count{code!~"5.."}
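When these ratios are evaluated, the counters are typically wrapped in rate() over a time window rather than divided as raw cumulative counts. A minimal sketch of the availability SLI above, assuming a 5-minute window:
# Fraction of non-5xx requests over the last 5 minutes.
sum(rate(requests_total{code!~"5.."}[5m]))
/
sum(rate(requests_total[5m]))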
Don’t use histogram_quantile to calculate the P95 latency and then compare it to a threshold. Calculating percentiles from histograms can be inaccurate, and you can’t aggregate the P95 values together for higher-level reporting.
Do use Prometheus histograms with le buckets to count how many requests were returned with latency “less than or equal to” your le="1.0" threshold.
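To make the contrast concrete, here is a sketch of both approaches using the requests_duration_seconds histogram from the latency example above (label filters omitted for brevity):
# Avoid: calculate P95 latency, then compare it to a 1-second threshold.
histogram_quantile(0.95, sum by (le) (rate(requests_duration_seconds_bucket[5m])))
# Prefer: count the requests that completed within the le="1.0" bucket.
sum(rate(requests_duration_seconds_bucket{le="1.0"}[5m]))
/
sum(rate(requests_duration_seconds_count[5m]))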
See Google Cloud’s Building good SLOs—CRE life lessons for a worked example of count-based Latency SLOs.
Freshness is a common SLO for message queues or batch processes where you want to ensure that each item (perhaps after several retries) gets completed before the work request grows too stale.
- Freshness (work spent less than 120 sec in queue):
  completed_duration_seconds_bucket{le="120"} / completed_duration_seconds_count
Advanced SLIs
Define advanced SLIs as a “success/total” ratio to get the most out of your dashboards. The “Ratio” SLO type enforces this success/total style, but you’ll get more dashboard features if you follow the same approach with your advanced SLOs.
- Do:
  <success rate> / <total rate>
- Avoid:
  1 - (<failure rate> / <total rate>)
If you can’t reinstrument your metrics to encode success/failure with labels and you must work with failure_total and all_total counters, you can do (total - fail) / total. For example:
(
sum by (...) (rate(all_total[$__rate_interval]))
- sum by (...) (rate(failure_total[$__rate_interval]))
)
/ sum by (...) (rate(all_total[$__rate_interval]))
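For illustration, the same pattern with a concrete grouping label (a hypothetical cluster label) in place of the ... placeholder:
(
sum by (cluster) (rate(all_total[$__rate_interval]))
- sum by (cluster) (rate(failure_total[$__rate_interval]))
)
/ sum by (cluster) (rate(all_total[$__rate_interval]))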
SLOs from K6 Probes
Grafana K6 Synthetic Monitoring probes emit metrics that can be used in count-based SLOs. The following Ratio SLO is a good example:
- success: probe_all_success_sum
- total: probe_all_success_count
- group_by: job, probe
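If you want to preview the same ratio as a query (for example, in Explore), a rough sketch of the equivalent expression follows; the recording rules Grafana generates for the SLO may differ in detail:
sum by (job, probe) (rate(probe_all_success_sum[5m]))
/
sum by (job, probe) (rate(probe_all_success_count[5m]))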
Read more about Grafana K6 metrics
Time-based SLIs
Recording each minute as up or down isn’t as powerful as counting successful and total events, but it can still provide useful information. The Advanced type accepts anything that includes an _over_time(...[$__interval]) expression and returns a value between 0 and 1. The $__interval is replaced by the varying time windows used in the alerting rules (windows range from 5 minutes to 3 days).
Here’s an example of an SLI built from a threshold check that returns a 0 or 1 at each point in time:
avg_over_time((max(metric) < bool 10)[$__interval:])
In the above example, you start with a simple threshold, max(metric) < bool 10, which returns 0 or 1, and then add the required _over_time() function to calculate the SLI over various time windows such as 5 minutes or 1 hour. Note that the trailing : in the square brackets means this is a Prometheus subquery, which is more computationally expensive but allows the < bool 10 comparison inside the avg_over_time function.
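For example, when an alerting rule evaluates this SLI over a 1-hour window, the substituted expression looks roughly like this:
avg_over_time((max(metric) < bool 10)[1h:])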
If you want a multidimensional SLI with something like a cluster dashboard variable, so that you can visualize both aggregate and per-cluster SLI values, you need to structure the expression as a ratio. Use a comparison to keep only the successful time points and count them with count_over_time. The denominator simply counts all time points, giving you a success / total ratio.
avg by(cluster) (
count_over_time((metric2 == 1)[$__interval:])
)
/
avg by(cluster) (
count_over_time((metric2)[$__interval:])
)
Know your SLIs
There are many SLI types. Brief explanations of Multidimensional and Rollup SLIs follow.
Multidimensional SLI
A Multidimensional SLI reports a ratio for each value of a given label. For example: sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m])).
When you specify “group by” labels on the Ratio SLO type, it becomes a multidimensional SLI. A common use is to specify cluster and/or namespace in the grouping.
Multidimensional SLIs enable per-cluster alerting and support more flexible dashboards where you can include or exclude values for the chosen dimension labels (see Rollup SLI below).
Rollup SLI
A rollup SLI (or aggregated SLI) is a calculation of a multidimensional SLI where the numerator and denominator are further aggregated before the final ratio is calculated.
When you select cluster=all on the dashboard of a multidimensional SLO that defines cluster as a group label, the dashboard calculates the aggregate ratio: the sum of all successes over the sum of all requests. This gives you alerting on each cluster and reporting on the overall rollup results.
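Conceptually, the rollup aggregates away the grouping label before dividing. A sketch of the difference, using the same <success> and <total> placeholders as above:
# Multidimensional SLI: one ratio per cluster.
sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m]))
# Rollup SLI: sum the numerator and denominator across clusters, then divide.
sum(rate(<success>[5m])) / sum(rate(<total>[5m]))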
Additional reference materials
Google provides good introductory documentation on SLOs in their SRE Book. They also provide useful guides on SLO implementation and alerting on SLOs.