Alertmanager integration for Grafana IRM
Note
⚠️ Legacy integration ⚠️ Integrations created before version 1.3.21 (1 August 2023) were marked as (Legacy) and migrated. These integrations continue to receive and escalate alerts, but some manual adjustments may be required.
The Alertmanager integration handles alerts from Prometheus Alertmanager. This integration is the recommended way to send alerts to Grafana IRM from Prometheus deployed in your infrastructure.
Tip
Create one integration per team, and configure an Alertmanager label selector so that each integration only receives alerts related to that team (see the per-team routing sketch after the example configuration below).
Configure Grafana IRM to receive alerts from Prometheus Alertmanager
- In Grafana IRM, navigate to IRM > Integrations > Monitoring Systems
- Click + New integration
- Select Alertmanager Prometheus from the list of available integrations
- Enter a name and description for the integration, click Create
- A new page will open with the integration details. Copy the IRM Integration URL from the HTTP Endpoint section. You will need it when configuring Alertmanager
Configure Alertmanager to send alerts to Grafana IRM
- Add a new Webhook receiver to the `receivers` section of your Alertmanager configuration
- Set `url` to the IRM Integration URL from the previous section
  - Note: the URL has a trailing slash that is required for it to work properly
- Set `send_resolved` to `true`, so Grafana IRM can autoresolve alert groups when they are resolved in Alertmanager
- It is recommended to set `max_alerts` to less than `100` to avoid requests that are too large
- Use this receiver in your route configuration
Here is an example of the final configuration:
```yaml
route:
  receiver: 'oncall'
  group_by: [alertname, datacenter, app]

receivers:
  - name: 'oncall'
    webhook_configs:
      - url: <integration-url>
        send_resolved: true
        max_alerts: 100
```
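If you follow the tip above and create one integration per team, the Alertmanager routing might look like the following minimal sketch. The `team` label, the receiver names, and the placeholder URLs are illustrative assumptions; use the labels from your own alerts and the integration URLs from your own IRM integrations (older Alertmanager versions use `match` instead of `matchers`):

```yaml
route:
  receiver: 'oncall-default'
  group_by: [alertname, datacenter, app]
  routes:
    # Send alerts labelled team=backend to the backend team's IRM integration
    - matchers:
        - team = backend
      receiver: 'oncall-backend'
    # Send alerts labelled team=frontend to the frontend team's IRM integration
    - matchers:
        - team = frontend
      receiver: 'oncall-frontend'

receivers:
  - name: 'oncall-default'
    webhook_configs:
      - url: <default-integration-url>    # catch-all IRM integration URL
        send_resolved: true
  - name: 'oncall-backend'
    webhook_configs:
      - url: <backend-integration-url>    # IRM integration created for the backend team
        send_resolved: true
  - name: 'oncall-frontend'
    webhook_configs:
      - url: <frontend-integration-url>   # IRM integration created for the frontend team
        send_resolved: true
```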
Complete the integration configuration
Complete the configuration by setting up routes, templates, maintenances, and other options as needed.
Note about grouping and autoresolution
Grafana IRM relies on the Alertmanager grouping and autoresolution mechanism to ensure consistency between alert state in IRM and Alertmanager. It's recommended to configure grouping on the Alertmanager side and use the default grouping and autoresolution templates on the IRM side. Changing these templates might lead to incorrect grouping and autoresolution behavior, which is unlikely to be what you want unless you have disabled grouping on the Alertmanager side.
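For reference, here is a minimal sketch of Alertmanager-side grouping. The label names and timing values are illustrative, not prescriptive; tune them to your environment:

```yaml
route:
  receiver: 'oncall'
  # Alerts that share these label values are collapsed into a single group,
  # which Grafana IRM receives as one payload
  group_by: [alertname, datacenter, app]
  group_wait: 30s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to the group
  repeat_interval: 4h  # re-send the group while it is still firing
```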
Configure IRM heartbeats (optional)
An IRM heartbeat acts as monitoring for your monitoring system. If your monitoring is down and stops sending alerts, Grafana IRM will notify you about it.
Configure Grafana IRM heartbeat
- Go to the integration page, click the three dots in the top right, and click Heartbeat settings
- Copy the IRM Heartbeat URL; you will need it when configuring Alertmanager
- Set the Heartbeat Interval, the time period after which Grafana IRM will create a new alert group if it does not receive a heartbeat request
Configure Alertmanager to send heartbeats to Grafana IRM
You can configure Alertmanager to regularly send alerts to the heartbeat endpoint. Add a heartbeat alerting rule with the expression `vector(1)` to `prometheus.yaml`. This expression always returns a value, so the rule acts as an always-firing alert that is sent to Grafana IRM once in a given period of time:
```yaml
groups:
  - name: meta
    rules:
      - alert: heartbeat
        expr: vector(1)
        labels:
          severity: none
        annotations:
          description: This is a heartbeat alert for Grafana IRM
          summary: Heartbeat for Grafana IRM
```
Add a receiver configuration with the IRM Heartbeat URL to your Alertmanager configuration:
```yaml
...
route:
  ...
  routes:
    - match:
        alertname: heartbeat
      receiver: 'grafana-irm-heartbeat'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 50s

receivers:
  - name: 'grafana-irm-heartbeat'
    webhook_configs:
      - url: https://oncall-dev-us-central-0.grafana.net/oncall/integrations/v1/alertmanager/1234567890/heartbeat/
        send_resolved: false
```
Note about legacy integration
The legacy integration treated each alert from an Alertmanager group as a separate payload:
```json
{
  "labels": {
    "severity": "critical",
    "alertname": "InstanceDown"
  },
  "annotations": {
    "title": "Instance localhost:8081 down",
    "description": "Node has been down for more than 1 minute"
  },
  ...
}
```
This behavior led to mismatches in alert state between IRM and Alertmanager and drained rate limits, since each Alertmanager alert was counted separately.
We decided to change this behavior to respect Alertmanager grouping by using the Alertmanager group as one payload:
```json
{
  "alerts": [...],
  "groupLabels": {
    "alertname": "InstanceDown"
  },
  "commonLabels": {
    "job": "node",
    "alertname": "InstanceDown"
  },
  "commonAnnotations": {
    "description": "Node has been down for more than 1 minute"
  },
  "groupKey": "{}:{alertname=\"InstanceDown\"}",
  ...
}
```
You can read more about the Alertmanager data model in the Prometheus Alertmanager documentation.
After-migration checklist
Note
The integration URL stays the same, so there is no need to change your Alertmanager or Grafana Alerting configuration. Integration templates will be reset to suit the new payload. Routes and outgoing webhooks need to be adjusted to the new payload manually.
- Send a new demo alert to the migrated integration
- Adjust routes to the new shape of the payload. You can use the payload of the demo alert from the previous step as an example
- If outgoing webhooks used the alert payload from the migrated integration in their trigger or data templates, adjust them to the new payload as well