Grafana Alerting: A beginner's guide to templating alert notifications
We often see questions about how to template alerts. In Grafana, you can template information about your alerts with custom labels and annotations, and you can also template how notifications look and what information they contain with notification templates. Many users confuse the two, even though they are separate features with different use cases. However, the confusion is understandable for those who aren’t already familiar with the Prometheus model of designing alerting systems, which Grafana Alerting is based on.
In this blog post, we’ll attempt to clear up some of the confusion. We’ll look at how Prometheus-based alerting systems work and how that relates to templating alerts. We’ll also explore the differences between templating custom labels and annotations and writing notification templates, including when to use one or the other. Lastly, we’ll look at four examples you can use when setting up your own alerts.
Understanding Prometheus-based alerting systems
In Prometheus-based alerting systems, you have an alert generator that creates alerts and an alert receiver that receives alerts. Prometheus is an alert generator and is responsible for evaluating rules, while Alertmanager is an alert receiver and is responsible for grouping, inhibiting, silencing, and sending notifications about firing and resolved alerts. You template custom labels and annotations in the alert generator and template notifications in the alert receiver.
To understand the distinction between templating custom labels and annotations and notification templates, we first need to understand what labels and annotations are and why they are critical in Prometheus-like alerting systems.
What are labels and annotations?
Labels are sets of key/value pairs. Each alert has its own set of labels, and no two alerts can have the same set of labels. As a result, labels are a unique identifier for an alert such that you can find a specific alert in a list of thousands of alerts just by knowing its labels. For example, an alert with the labels {alertname="test",instance="foo"} is a different alert from {alertname="test",instance="bar"}, even though both alerts might have fired from the same rule.
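To make the idea of labels as an identifier concrete, here is a minimal Go sketch. The labelKey helper is hypothetical and written only for this post; it is not how Grafana or Prometheus fingerprint alerts internally. It simply builds a canonical string from a label set so that two sets can be compared.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelKey builds a canonical string for a label set by sorting the keys,
// so two maps holding the same pairs always produce the same key.
// Hypothetical helper for illustration only, not Grafana's fingerprinting code.
func labelKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	foo := map[string]string{"alertname": "test", "instance": "foo"}
	bar := map[string]string{"alertname": "test", "instance": "bar"}

	fmt.Println(labelKey(foo))
	fmt.Println(labelKey(bar))
	// Same rule, different instance label: the keys differ, so these are two alerts.
	fmt.Println(labelKey(foo) == labelKey(bar))
}
```

Changing any single label value changes the key, which is why the two alerts above are tracked separately even though they come from the same rule.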
Like labels, annotations are also key/value pairs. However, instead of acting as a unique identifier for an alert, annotations add information to the alert. For example, you might have a summary annotation telling you why this alert has fired, or a runbook annotation that links to the runbook for this alert.
What does this have to do with Grafana?
Grafana Alerting is built on the Prometheus model of designing alerting systems. It has an internal alert generator responsible for scheduling and evaluating rules, as well as an internal alert receiver responsible for grouping, inhibiting, silencing, and sending notifications. Grafana doesn’t use Prometheus as its alert generator because Grafana Alerting needs to work with many other data sources in addition to Prometheus. However, it does use Alertmanager as its alert receiver.
The Grafana user interface does a pretty good job of hiding this complexity, but it still helps to understand that when you use Grafana Alerting, you are actually using two separate subsystems.
When should I template labels and annotations?
You should template labels when you want to add or change the unique identifier of an alert. Typically you will want to do this if the labels returned by your query are incomplete or not as descriptive as you want. But remember: Long sentences are best left for the summary and description annotations. You should also avoid using the value of the query in labels because it’s likely that every evaluation of the alert will return a different value, causing Grafana to create tens or even hundreds of alerts when you really only want one.
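The following Go sketch illustrates why query values do not belong in labels. It assumes three evaluations of a hypothetical disk usage rule with slightly different readings, and counts how many distinct label sets (and therefore alerts) result with and without a usage label. The distinctAlerts helper and the label names are made up for this example.

```go
package main

import "fmt"

// distinctAlerts counts how many different label sets appear across evaluations.
// fmt.Sprint prints map keys in sorted order, which is good enough as an
// identity key for this illustration.
func distinctAlerts(evaluations []map[string]string) int {
	seen := map[string]bool{}
	for _, labels := range evaluations {
		seen[fmt.Sprint(labels)] = true
	}
	return len(seen)
}

func main() {
	// Hypothetical disk usage readings, one per rule evaluation.
	readings := []string{"76.1", "76.4", "77.0"}

	var stable, unstable []map[string]string
	for _, r := range readings {
		// Labels that only identify the server stay the same on every evaluation.
		stable = append(stable, map[string]string{"alertname": "disk", "instance": "db1"})
		// Putting the reading in a label changes the identity on every evaluation.
		unstable = append(unstable, map[string]string{"alertname": "disk", "instance": "db1", "usage": r})
	}

	fmt.Println("without value label:", distinctAlerts(stable))
	fmt.Println("with value label:", distinctAlerts(unstable))
}
```

With stable labels there is one alert for db1. With the reading in a label, each evaluation produces a label set that has not been seen before.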
You should template annotations whenever you want to add a summary, description, or information to the alert that does not change its unique identifier. You can use the value of the query in an annotation if you think it will add meaning to the alert. In fact, this is a very common practice when writing rules in Prometheus. You can find more information on templating labels and annotations in the Grafana Alerting documentation.
When should I template notifications?
You should template notifications when you want to change how your alerts look. This is where you get to choose which labels, annotations, and links should be included in the notification. Notification templates are not the right place to add information to individual alerts. That is best done in the rule with labels and annotations. You can find more information on templating notifications in the Grafana Alerting documentation.
Why do the default notifications have so much information?
We see this question quite often. The answer is that the default notification template is trying to show you as much information as possible. We understand that some Grafana users might want to see only certain information in their notifications, while others might want to see more information. At present, the default template shows you everything, and you can create your own templates to meet your individual preferences.
4 examples of how to use templates in Grafana Alerting
Now that we’ve covered the basics, let’s take a look at some examples that address common use cases and some of the different approaches you can take with templating. If you are unfamiliar with how to template labels and annotations, check out the corresponding documentation. You can also read the new and improved documentation on how to write notification templates in Go’s templating language, and how to use notification templates in Grafana. (Note: The following examples assume you are running Grafana 9.4.)
1. Firing and resolved alerts, with summary annotation
This example shows the number of Firing and Resolved alerts, with just the summary annotation of each alert included in the notification. This is a very opinionated example because you must write descriptive summary annotations for all of your rules. However, if done properly, it can produce simple yet highly actionable notifications.
The annotation template
This is the Summary annotation for a rule that fires when disk usage of a database server exceeds 75%. It uses the instance label from the query to tell you which database server(s) are low on disk space.
The database server {{ index $labels "instance" }} has exceeded 75% of available disk space, please resize the disk within the next 24 hours
You can also show the amount of disk space used with the $values variable. For example, if your rule has a query called A that queries the disk space usage of all database servers and a Reduce expression called B that averages the result of query A, then you can use $values to show the average disk space usage for each database server.
The database server {{ index $labels "instance" }} has exceeded 75% of available disk space. Disk space used is {{ index $values "B" }}%, please resize the disk within the next 24 hours
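If you want to try an annotation template like this outside of Grafana, the following Go sketch renders it with the standard text/template package. The alertData type, the expand function, and the preamble that binds $labels and $values are assumptions made for this example. They imitate the way Prometheus-style rule templates expose these variables and are not Grafana's actual implementation. In particular, values are plain numbers here.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData is a hypothetical stand-in for what a rule evaluation exposes
// to an annotation template.
type alertData struct {
	Labels map[string]string
	Values map[string]float64
}

// preamble binds $labels and $values so the template body can refer to them.
// This is an assumption for the sketch, similar in spirit to Prometheus rules.
const preamble = `{{ $labels := .Labels }}{{ $values := .Values }}`

// expand parses the annotation text and executes it against the alert data.
func expand(text string, data alertData) (string, error) {
	tmpl, err := template.New("annotation").Parse(preamble + text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	summary := `The database server {{ index $labels "instance" }} has exceeded 75% of available disk space. Disk space used is {{ index $values "B" }}%, please resize the disk within the next 24 hours`

	data := alertData{
		Labels: map[string]string{"instance": "db1"},
		Values: map[string]float64{"B": 76},
	}

	out, err := expand(summary, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Running the sketch with different labels and values is a quick way to check the wording of a summary before you save it on a rule.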
The notification template
This is the notification template. It prints the Summary annotation for all firing and resolved alerts, such as the Summary annotation in the example above.
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
{{ len .Alerts.Firing }} firing alert(s)
{{ template "alerts.summarize" .Alerts.Firing }}
{{- end }}
{{- if .Alerts.Resolved -}}
{{ len .Alerts.Resolved }} resolved alert(s)
{{ template "alerts.summarize" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "alerts.summarize" -}}
{{ range . -}}
- {{ index .Annotations "summary" }}
{{ end }}
{{ end }}
The output of this template looks like this:
1 firing alert(s)
- The database server db1 has exceeded 75% of available disk space. Disk space used is 76%, please resize the disk within the next 24 hours
1 resolved alert(s)
- The web server web1 has been responding to 5% of HTTP requests with 5xx errors for the last 5 minutes
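To experiment with this notification template locally, you can execute it with Go's text/template package. The sketch below is only an approximation: the Alert, Alerts, and Data types are hypothetical stand-ins that provide just the fields and methods this template touches (.Alerts.Firing, .Alerts.Resolved, and .Annotations). The data Grafana passes to real templates contains much more.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Alert is a hypothetical, trimmed-down alert with only what the template needs.
type Alert struct {
	Status      string
	Annotations map[string]string
}

// Alerts offers Firing and Resolved so the template can call .Alerts.Firing.
type Alerts []Alert

func (as Alerts) withStatus(status string) []Alert {
	var out []Alert
	for _, a := range as {
		if a.Status == status {
			out = append(out, a)
		}
	}
	return out
}

func (as Alerts) Firing() []Alert   { return as.withStatus("firing") }
func (as Alerts) Resolved() []Alert { return as.withStatus("resolved") }

// Data is the root object handed to the notification template in this sketch.
type Data struct {
	Alerts Alerts
}

const notification = `{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
{{ len .Alerts.Firing }} firing alert(s)
{{ template "alerts.summarize" .Alerts.Firing }}
{{- end }}
{{- if .Alerts.Resolved -}}
{{ len .Alerts.Resolved }} resolved alert(s)
{{ template "alerts.summarize" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "alerts.summarize" -}}
{{ range . -}}
- {{ index .Annotations "summary" }}
{{ end }}
{{ end }}`

// render parses both definitions and executes "alerts.message".
func render(data Data) (string, error) {
	tmpl, err := template.New("notification").Parse(notification)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.ExecuteTemplate(&out, "alerts.message", data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	data := Data{Alerts: Alerts{
		{Status: "firing", Annotations: map[string]string{"summary": "The database server db1 has exceeded 75% of available disk space"}},
		{Status: "resolved", Annotations: map[string]string{"summary": "The web server web1 has been responding to 5% of HTTP requests with 5xx errors for the last 5 minutes"}},
	}}
	out, err := render(data)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Note how the trim markers ({{- and -}}) control which line breaks survive. In this sketch each group of summaries is followed by a blank line, because the line break after the range block in alerts.summarize is not trimmed.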
2. Firing and resolved alerts, with summary, description, and runbook URL
This example is a little more complex, but it still takes an opinionated view on how you should set up your alerts. This template shows the summary, the description, and the runbook URL of all firing and resolved alerts. You have to write a summary for each alert rule, but the description and runbook URL are optional.
The annotation template
This is the same as the first example and shows the summary annotation for a rule that fires when disk usage of a database server exceeds 75%. It uses the instance label from the query to tell you which database server(s) are low on disk space.
The database server {{ index $labels "instance" }} has exceeded 75% of available disk space. Disk space used is {{ index $values "B" }}%, please resize the disk within the next 24 hours
The notification template
This is the notification template. It prints the Summary annotation for all firing and resolved alerts, such as the Summary annotation in the example above, followed by the Description and Runbook URL. The Description and Runbook URL are optional and are omitted if absent from the alert.
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
{{ len .Alerts.Firing }} firing alert(s)
{{ template "alerts.summarize_large" .Alerts.Firing }}
{{- end }}
{{- if .Alerts.Resolved -}}
{{ len .Alerts.Resolved }} resolved alert(s)
{{ template "alerts.summarize_large" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "alerts.summarize_large" -}}
{{ range . }}
Summary: {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
Description: {{ index .Annotations "description" }}{{ end }}
{{- if index .Annotations "runbook_url" }}
Runbook: {{ index .Annotations "runbook_url" }}{{ end }}
{{ end }}
{{ end }}
The output of this template looks like this:
1 firing alert(s)
Summary: The database server db1 has exceeded 75% of available disk space. Disk space used is 76%, please resize the disk within the next 24 hours
Description: This alert fires when a database server is at risk of running out of disk space. You should take measures to increase the maximum available disk space as soon as possible to avoid possible corruption.
Runbook: https://example.com/on-call/database_server_high_disk_usage
1 resolved alert(s)
Summary: The web server web1 has been responding to 5% of HTTP requests with 5xx errors for the last 5 minutes
Description: This alert fires when a web server responds with more 5xx errors than is expected. This could be an issue with the web server or a backend service. Please refer to the runbook for more information.
Runbook: https://example.com/on-call/web_server_high_5xx_rate
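The optional lines work because, in Go templates, index on a map returns an empty string for a missing key, and an empty string is false in an if. The following sketch isolates that behavior. It passes the annotations map directly as the template's dot, which is a simplification for this example, and the renderAnnotations function is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertText prints the summary and, only when present, the description and runbook.
// The dot is the annotations map itself, a simplification for this sketch.
const alertText = `Summary: {{ index . "summary" }}
{{- if index . "description" }}
Description: {{ index . "description" }}{{ end }}
{{- if index . "runbook_url" }}
Runbook: {{ index . "runbook_url" }}{{ end }}`

// renderAnnotations executes alertText against a single alert's annotations.
func renderAnnotations(annotations map[string]string) (string, error) {
	tmpl, err := template.New("alert").Parse(alertText)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, annotations); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// No description annotation: the Description line is skipped entirely.
	out, err := renderAnnotations(map[string]string{
		"summary":     "The database server db1 has exceeded 75% of available disk space",
		"runbook_url": "https://example.com/on-call/database_server_high_disk_usage",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the {{- marker trims the line break before each if, a missing annotation leaves no empty line behind.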
3. Labels with values of instant queries and expressions
Unlike the previous examples, this one does not require a summary annotation to be written for each alert. Instead it shows the labels and the values of any instant queries; Threshold, Reduce or Math expressions; or Classic Conditions.
The notification template
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
{{ len .Alerts.Firing }} firing alert(s)
{{ template "alerts.summarize_labels_and_values" .Alerts.Firing }}
{{- end }}
{{- if .Alerts.Resolved -}}
{{ len .Alerts.Resolved }} resolved alert(s)
{{ template "alerts.summarize_labels_and_values" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "alerts.summarize_labels_and_values" -}}
{{ range . -}}
- {{ range $k, $v := .Labels }}{{ $k }}={{ $v }} {{ end }}{{ range $k, $v := .Values }}{{ $k }}={{ $v }} {{ end }}
{{ end }}
{{ end }}
The output of this template looks like this:
1 firing alert(s)
- alertname=database_high_disk_usage server=db1 B=0.76 C=1
1 resolved alert(s)
- alertname=web_server_high_5xx_rate server=web1 B=0 C=0
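The labels in this output appear in alphabetical order because Go's text/template visits map entries in sorted key order when you range over a map with string keys. The sketch below shows just that inner loop, using a plain map[string]string as a simplified stand-in for an alert's labels. The renderLabels function is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// labelsLine prints every key=value pair followed by a space.
const labelsLine = `{{ range $k, $v := . }}{{ $k }}={{ $v }} {{ end }}`

// renderLabels executes labelsLine against a label set.
func renderLabels(labels map[string]string) (string, error) {
	tmpl, err := template.New("labels").Parse(labelsLine)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, labels); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// The map literal lists server first, but the output is sorted by key.
	out, err := renderLabels(map[string]string{
		"server":    "db1",
		"alertname": "database_high_disk_usage",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The stable ordering means the same alert always renders the same line, which makes notifications easier to scan and compare.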
4. Firing and resolved alerts, with labels, summary, and silencing
The last example combines the first and third examples: it prints the labels, the summary annotation, and then links to both silence the alert and show the alert in Grafana. However, the link to silence the alert is omitted for resolved alerts, as it doesn’t make sense to silence an alert that is no longer firing.
The notification template
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
{{ len .Alerts.Firing }} firing alert(s)
{{ template "alerts.summarize_with_links" .Alerts.Firing }}
{{- end }}
{{- if .Alerts.Resolved -}}
{{ len .Alerts.Resolved }} resolved alert(s)
{{ template "alerts.summarize_with_links" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "alerts.summarize_with_links" -}}
{{ range . -}}
{{ range $k, $v := .Labels }}{{ $k }}={{ $v }} {{ end }}
{{ index .Annotations "summary" }}
{{- if eq .Status "firing" }}
- Silence this alert: {{ .SilenceURL }}{{ end }}
- View on Grafana: {{ .GeneratorURL }}
{{ end }}
{{ end }}
The output of this template looks like this:
1 firing alert(s)
alertname=database_high_disk_usage server=db1
The database server db1 has exceeded 75% of available disk space. Disk space used is 76%, please resize the disk within the next 24 hours
- Silence this alert: https://example.com/grafana/alerting/silence/new
- View on Grafana: https://example.com/grafana/alerting/grafana/view
1 resolved alert(s)
alertname=web_server_high_5xx_rate server=web1
The web server web1 has been responding to 5% of HTTP requests with 5xx errors for the last 5 minutes
- View on Grafana: https://example.com/grafana/alerting/grafana/view
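The conditional silence link relies on comparing each alert's status with eq. Here is a reduced Go sketch of that part of the template. The alert type is a hypothetical stand-in with only the fields used here, and the template is shortened to the summary plus the two links.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alert is a hypothetical stand-in with only the fields this sketch uses.
type alert struct {
	Status       string
	Summary      string
	SilenceURL   string
	GeneratorURL string
}

// linksText prints the silence link only for alerts whose status is "firing".
const linksText = `{{ range . }}{{ .Summary }}
{{- if eq .Status "firing" }}
- Silence this alert: {{ .SilenceURL }}{{ end }}
- View on Grafana: {{ .GeneratorURL }}
{{ end }}`

// renderLinks executes linksText against a list of alerts.
func renderLinks(alerts []alert) (string, error) {
	tmpl, err := template.New("links").Parse(linksText)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, alerts); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	const silence = "https://example.com/grafana/alerting/silence/new"
	const view = "https://example.com/grafana/alerting/grafana/view"

	out, err := renderLinks([]alert{
		{Status: "firing", Summary: "db1 is low on disk space", SilenceURL: silence, GeneratorURL: view},
		{Status: "resolved", Summary: "web1 has recovered", SilenceURL: silence, GeneratorURL: view},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Only the firing alert gets a silence link. The resolved alert keeps its link back to Grafana.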
These are just some of the many ways you can template alerts. If you would like to see more examples, check out our new and improved documentation.