Get started with Grafana Alerting - Part 5
Introduction
Get started with Grafana Alerting - Part 5 is a continuation of Get started with Grafana Alerting - Part 4.
In this tutorial, we focus on optimizing your alerting strategy using Grafana for monitoring system health, particularly when working with Prometheus. Imagine you are managing a web application or a fleet of servers, tracking critical metrics such as CPU, memory, and disk usage. While monitoring is essential, managing alerts well lets your team act on issues without being overwhelmed by noise.
In this tutorial you will learn how to:
- Leverage notification policies for dynamic routing based on query values: Use notification policies to route alerts based on dynamically generated labels, so that critical alerts reach the on-call team and less urgent ones go to a general monitoring channel.
- Set mute timings to suppress certain alerts during maintenance or weekends.
- Link alerts to dashboards to provide more context to resolve issues.
Before you begin
Interactive learning environment
- Alternatively, you can try out this example in our interactive learning environment. It’s a fully configured environment with all the dependencies already installed.
Grafana OSS
- If you opt to run a Grafana stack locally, ensure you have the following applications installed:
- Docker Compose (included in Docker for Desktop for macOS and Windows)
- Git
Set up the Grafana stack
To observe data using the Grafana stack, download the tutorial files and run the following commands.
Clone the tutorial environment repository:
git clone https://github.com/tonypowa/grafana-prometheus-alerting-demo.git
Change to the directory where you cloned the repository:
cd grafana-prometheus-alerting-demo
Build the Grafana stack:
docker compose build
Bring up the containers:
docker compose up -d
The first time you run `docker compose up -d`, Docker downloads all the necessary resources for the tutorial. This might take a few minutes, depending on your internet connection.
Note
If you already have Grafana, Loki, or Prometheus running on your system, you might see errors, because the Docker image is trying to use ports that your local installations are already using. If this is the case, stop the services, then run the command again.
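To verify that the stack is running, you can list the containers:
docker compose ps
Each service defined in the compose file should report an "Up" or "running" status.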
Use case: Monitoring and alerting for system health with Prometheus and Grafana
In this use case, we focus on monitoring the system’s CPU, memory, and disk usage as part of a monitoring setup. This example is based on the Grafana Prometheus Alerting Demo, which collects and visualizes system metrics via Prometheus and Grafana.
Your team is responsible for ensuring the health of your servers, and you want to leverage advanced alerting features in Grafana to:
- Set who should receive an alert notification based on query values.
- Suppress certain alerts during maintenance windows or weekends.
- Integrate alert rules into visualizations for better context.
Scenario
In the provided demo setup, you’re monitoring:
- CPU Usage.
- Memory Consumption.
You have a mixture of critical alerts (e.g., CPU usage over 75%) and warning alerts (e.g., memory usage over 60%).
At times, you also have scheduled maintenance windows, where you might temporarily suppress certain alerts during planned downtime.
Create a visualization to monitor metrics
To keep track of these metrics and understand system behavior across different environments, you can set up a visualization for CPU usage and memory consumption. This will make it easier to see how the system is performing and how alerts are distributed based on the environment label, including during scheduled maintenance windows.
The time-series visualization supports alert rules to provide more context in the form of annotations and alert rule state. Follow these steps to create a visualization to monitor the application’s metrics.
Log in to Grafana:
- Navigate to http://localhost:3000, where Grafana should be running.
- Username and password: `admin`
Create a time series panel:
- Navigate to Dashboards.
- Click New.
- Select New Dashboard.
- Click + Add visualization.
- Select Prometheus as the data source (provided with the demo).
- Enter a title for your panel, e.g., CPU and Memory Usage.
Add queries for metrics:
In the query area, copy and paste the following PromQL query:
Switch to Code mode if it is not already selected.
flask_app_cpu_usage{environment="prod"}
Click Run queries.
This query should display the simulated CPU usage data in the prod environment.
Add memory usage query:
Click + Add query.
In the query area, paste the following PromQL query:
flask_app_memory_usage{environment="prod"}
Time-series panel displaying CPU and memory usage metrics in production. Both metrics return labels that we’ll use later to link alert instances with the appropriate routing. These labels help define how alerts are routed based on their environment or other criteria.
Click Save dashboard.
We have our time-series panel ready. Feel free to combine metrics with labels such as `environment="staging"`.
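For example, the following query returns the same CPU metric for the staging environment. You can add it as an extra query in the panel to compare environments:
flask_app_cpu_usage{environment="staging"}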
Create Notification Policies
Notification policies route alert instances to contact points via label matchers. Since we know which labels our application returns (i.e., `environment`, `job`, `instance`), we can use them to match alert instances in our notification policies.
Navigate to Alerts & IRM > Alerting > Notification Policies.
Add a child policy:
- In the Default policy, click + New child policy.
- Label: `environment`
- Operator: `=`
- Value: `production`
- This matcher routes alert instances whose `environment` label is `production`. Later in this tutorial, a template in the alert rule sets this label to production for metrics where the environment label is `prod`.
Choose a contact point:
- If you don’t have any contact points, add a Contact point.
For a quick test, you can use a public webhook from webhook.site to capture and inspect alert notifications. If you choose this method, select Webhook from the drop-down menu in contact points.
Enable continue matching:
- Turn on Continue matching subsequent sibling nodes so that evaluation continues with the next sibling policies even after this policy matches.
Save and repeat:
- Create another child policy by following the same steps.
- Use `environment = staging` as the label/value pair.
- Feel free to use a different contact point.
Now that the notification policies are in place, we can create alert rules for the CPU and memory metrics. These alert rules will use the labels of the metrics collected and stored in Prometheus.
Create alert rules to monitor CPU and memory usage
Follow these steps to manually create alert rules and link them to a visualization.
Create an alert rule for CPU usage
- Navigate to Alerts & IRM > Alerting > Alert rules from the Grafana sidebar.
- Click + New alert rule to create a new alert rule.
Enter alert rule name
Make it short and descriptive, as this will appear in your alert notification. For instance, `CPU usage`.
Define query and alert condition
Select Prometheus data source from the drop-down menu.
In the query section, enter the following query:
Switch to Code mode if it is not already selected.
flask_app_cpu_usage{}
Alert condition section:
- Enter 75 as the value for WHEN QUERY IS ABOVE to set the threshold for the alert.
- Click Preview alert rule condition to run the queries.
Preview of a query returning alert instances in Grafana. Among the labels returned for `flask_app_cpu_usage`, the environment label is particularly important, as it enables dynamic alert routing based on the environment value, ensuring the right team receives the relevant notifications.
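For reference, the alert condition above is conceptually equivalent to the following PromQL comparison, which returns only the series with CPU usage above 75. It is shown here for illustration only; in this tutorial the threshold is set in the alert condition section rather than in the query:
flask_app_cpu_usage > 75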
Add folders and labels
In this section, we add a label whose value is templated from the query results so that it maps to the notification policies created earlier.
In Folder, click + New folder and enter a name. For example: `App metrics`. This folder stores our alert rules.
Click + Add labels.
In the Key field, enter `environment`.
In the Value field, copy in the following template:
{{- if eq $labels.environment "prod" -}} production {{- else if eq $labels.environment "staging" -}} staging {{- else -}} development {{- end -}}
In this context, the template is used to route alert notifications based on the `environment` label. When a metric like CPU usage exceeds a threshold, the template checks the environment (e.g., `prod`, `staging`, or any other value) and generates a label based on the query value (e.g., production, staging, or development). The notification policies then match this label to route alerts to the appropriate team, so notifications reach the right group without unnecessary overlap.
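To make the logic easier to read, here is the same template with each branch on its own line. The `{{-` and `-}}` trim markers remove the surrounding whitespace, so the rendered label value is identical to the single-line version above:
{{- if eq $labels.environment "prod" -}}
production
{{- else if eq $labels.environment "staging" -}}
staging
{{- else -}}
development
{{- end -}}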
Set evaluation behavior
- Click + New evaluation group. Name it `System usage`.
- Choose an Evaluation interval (how often the alert rule is evaluated). Choose `1m` and click Create.
- Set the pending period to `0s` (zero seconds), so the alert rule fires the moment the condition is met (this minimizes the waiting time for the demonstration).
Configure notifications
Select who should receive a notification when an alert rule fires.
Toggle the Advanced options button.
Click Preview routing. The preview should display which firing alert instances are routed to contact points based on the notification policies that match the `environment` label.
Notification policies matched by the environment label matcher. The environment label matcher should map to the notification policies created earlier, making sure that firing alert instances are routed to the contact points associated with each policy.
Configure notification message
Link your dashboard panel to this alert rule to display alert annotations in your visualization whenever the alert rule triggers or resolves.
- Click Link dashboard and panel.
- Find the panel that you created earlier.
- Click Confirm.
Create a second alert rule for memory usage
- Duplicate the existing alert rule (More > Duplicate), or create a new alert rule for memory usage, defining a threshold condition (e.g., memory usage exceeding 60%).
- Query: flask_app_memory_usage{}
- Link it to the same visualization to obtain memory usage annotations whenever the alert rule triggers or resolves.
Now that the CPU and memory alert rules are set up, they are linked to the notification policies through the custom label matcher we added. The value of the label changes dynamically based on the environment template, using `$labels.environment`. This ensures that the label value is set to production, staging, or development, depending on the environment.
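To illustrate the resulting routing (the dev value below is just an example of "any other value"):
- `environment="prod"`: the template outputs production, so the alert instance matches the production child policy.
- `environment="staging"`: the template outputs staging, so the alert instance matches the staging child policy.
- `environment="dev"`, or any other value: the template outputs development; no child policy matches, so the alert instance follows the Default policy.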
Visualizing metrics and alert annotations
Check how your dashboard looks now that both alerts have been linked to your dashboard panel.

After the alert rules are created, they should appear as health indicators on the linked panel (colored heart icons: a red heart when the alert is in Alerting state, and a green heart when it is in Normal state). In addition, the annotations include helpful context, such as the time the alert was triggered.
Create mute timings
Now that we’ve set up notification policies, we can demonstrate how to mute alerts for recurring periods of time. You can mute notifications for either the production or staging policies, depending on your needs.
Mute timings are useful for suppressing alerts with certain labels during maintenance windows or weekends.
- Navigate to Alerts & IRM > Alerting > Notification Policies.
- In the Mute Timings section, click + Add mute timing.
- Enter a name, e.g. `Planned downtime` or `Non-business hours`.
- Select Sat and Sun to apply the mute timing to all Saturdays and Sundays.
- Click Save mute timing.
- Add the mute timing to the desired policy:
  - Go to the notification policy that routes instances with the `staging` label.
  - Select More > Edit.
  - Choose the mute timing from the drop-down menu.
  - Click Update policy.
This mute timing suppresses notifications for any alerts from the staging environment that fire on Saturdays and Sundays.
Conclusion
By using notification policies, you can route alerts based on query values, directing them to the appropriate teams. Integrating alerts into dashboards provides more context, and mute timings allow you to suppress alerts during maintenance or low-priority periods.
Learn more
Explore related topics covered in this tutorial:
- Understand how alert routing works in Get started with Grafana Alerting - Part 2.
- Learn how templating works in Get started with Grafana Alerting - Part 4.