Alert insights and metrics
Grafana IRM provides detailed metrics and logs to help you monitor your alert handling performance and analyze trends. These insights enable you to identify bottlenecks, measure response effectiveness, and continuously improve your alerting processes.
About alert metrics
Alert metrics in Grafana IRM track key performance indicators related to alert handling, including:
- Alert volume across integrations
- Response times for alert acknowledgment
- Notification patterns
- Team and user metrics
These metrics are exposed in Prometheus format, making them easy to query and visualize in Grafana dashboards.
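For example, to chart the current number of firing alert groups per integration, a PromQL query along these lines can be used (a sketch that uses the Grafana Cloud metric name and labels described later in this guide):
sum by (integration) (grafanacloud_oncall_instance_alert_groups_total{state="firing"})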
Available metrics
Grafana IRM provides the following core metrics:
| Metric | Type | Description |
|---|---|---|
| alert_groups_total | Gauge | Total count of alert groups for each integration by state (firing, acknowledged, resolved, silenced) |
| alert_groups_response_time | Histogram | Response time for alert groups (time from alert start to the first action), calculated over the last 7 days |
| user_was_notified_of_alert_groups_total | Counter | Total count of alert groups users were notified of |
Access metrics
For Grafana Cloud customers
Alert metrics are automatically collected in the preinstalled grafanacloud-usage data source and have the prefix grafanacloud_oncall_instance, for example:
- grafanacloud_oncall_instance_alert_groups_total
- grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket
- grafanacloud_oncall_instance_user_was_notified_of_alert_groups_total
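To verify that these metrics are present in the grafanacloud-usage data source, a quick discovery query along these lines can list the matching series (an illustrative query, not one of the official examples):
count by (__name__) ({__name__=~"grafanacloud_oncall_instance_.+"})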
Metric details and examples
Alert groups total
This metric tracks the count of alert groups in different states, with the following labels:
| Label | Description |
|---|---|
| id | ID of Grafana instance (stack) |
| slug | Slug of Grafana instance (stack) |
| org_id | ID of Grafana organization |
| team | Team name |
| integration | Integration name |
| service_name | Value of alert group service_name label |
| state | Alert group state (firing, acknowledged, resolved, silenced) |
Example query:
Get the number of alert groups in the “firing” state for the “Grafana Alerting” integration:
grafanacloud_oncall_instance_alert_groups_total{integration="Grafana Alerting", state="firing"}
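The documented labels can be used for further aggregation. For example, a per-team breakdown of firing alert groups (a sketch that assumes your integrations are assigned to teams so the team label is populated):
sum by (team) (grafanacloud_oncall_instance_alert_groups_total{state="firing"})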
Alert groups response time
This metric tracks response times with the following labels:
| Label | Description |
|---|---|
| id | ID of Grafana instance (stack) |
| slug | Slug of Grafana instance (stack) |
| org_id | ID of Grafana organization |
| team | Team name |
| integration | Integration name |
| service_name | Value of alert group service_name label |
| le | Histogram bucket value in seconds (60, 300, 600, 3600, +Inf) |
Example query:
Get the number of alert groups with a response time under 10 minutes (600 seconds):
grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting", le="600"}
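Because this metric is a histogram, the buckets can also be combined into percentiles. For example, an approximate 90th-percentile response time for the same integration (a sketch using the standard histogram_quantile function; the 0.9 quantile is an arbitrary choice):
histogram_quantile(0.9, sum by (le) (grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{integration="Grafana Alerting"}))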
User notification metrics
This metric tracks how many alert groups each user was notified about:
| Label | Description |
|---|---|
| id | ID of Grafana instance (stack) |
| slug | Slug of Grafana instance (stack) |
| org_id | ID of Grafana organization |
| username | User’s username |
Example query:
Get the number of alert groups a specific user was notified of:
grafanacloud_oncall_instance_user_was_notified_of_alert_groups_total{username="alex"}
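To rank users by notification load, the counter can be windowed and sorted. For example, the five most-notified users over the last 7 days (a sketch; topk and increase are standard PromQL functions, and the window length is an arbitrary choice):
topk(5, increase(grafanacloud_oncall_instance_user_was_notified_of_alert_groups_total[7d]))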
Alert metrics dashboard
A pre-built “OnCall Insights” dashboard is available to visualize key alert metrics. To access it:
- Navigate to your dashboards list in the General folder
- Find the dashboard with the tag oncall
- Select your Prometheus data source (for Cloud customers, use grafanacloud-usage)
- Filter data by Grafana instances, teams, and integrations
To re-import the dashboard:
- Go to Administration > Plugins
- Find OnCall in the plugins list
- Open the Dashboards tab
- Click “Re-import” next to “OnCall Metrics”
Note: Re-importing or updating the plugin will reset any customizations. To preserve changes, save a copy of the dashboard using “Save As” in dashboard settings.
You can also view insights directly in Grafana IRM by clicking Insights in the navigation menu.
Alert insight logs
Alert insight logs provide an audit trail of configuration changes and system events in your IRM environment. In Grafana Cloud, these logs are available automatically through the Usage Insights Loki data source.
Access insight logs
To retrieve all logs related to your IRM instance:
{instance_type="oncall"} | logfmt | __error__=``
Types of insight logs
IRM captures three primary types of insight logs:
Resource logs
Track changes to resources (integrations, escalation chains, schedules, etc.):
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource`
Resource logs include the following key fields:
| Field | Description |
|---|---|
| action_name | Type of action (created, updated, deleted) |
| action_type | Always resource for resource logs |
| author | Username who performed the action |
| resource_id | ID of the modified resource |
| resource_name | Name of the modified resource |
| resource_type | Type of resource (integration, escalation chain, etc.) |
| team | Team the resource belongs to |
| prev_state | JSON representation of the resource before the update |
| new_state | JSON representation of the resource after the update |
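These fields can be combined in the label filter to narrow the audit trail. For example, to list only deletions (an illustrative query; filter further by resource_type or team using the values that appear in your own logs):
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource` and action_name = `deleted`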
Maintenance logs
Track when maintenance mode is started or finished:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance`
Maintenance logs include:
| Field | Description |
|---|---|
| action_name | Maintenance action (started or finished) |
| action_type | Always maintenance for maintenance logs |
| maintenance_mode | Type of maintenance (maintenance or debug) |
| resource_id | ID of the integration under maintenance |
| resource_name | Name of the integration under maintenance |
| team | Team the integration belongs to |
| author | Username who performed the action |
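Insight logs can also feed LogQL metric queries. For example, a rough count of maintenance windows started per integration over the last 30 days (a sketch; the grouping and the 30-day range are arbitrary choices and may be capped by your Loki query limits):
sum by (resource_name) (count_over_time({instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance` and action_name = `started` [30d]))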
ChatOps logs
Track configuration changes to chat integrations:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `chat_ops`
ChatOps logs include:
| Field | Description |
|---|---|
| action_name | Type of ChatOps action |
| action_type | Always chat_ops for ChatOps logs |
| author | Username who performed the action |
| chat_ops_type | Type of integration (telegram, slack, msteams, mobile_app) |
| channel_name | Name of the linked channel |
| linked_user | Username linked to the ChatOps integration |
Example log queries
Here are some practical log queries to analyze your alert handling configuration:
Actions by specific user:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource` and author="username"
Changes to schedules:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource` and (resource_type=`web_schedule` or resource_type=`calendar_schedule` or resource_type=`ical_schedule`)
Changes to escalation policies:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource` and resource_type=`escalation_policy` and escalation_chain_id=`CHAIN_ID`
Maintenance events for an integration:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance` and resource_id=`INTEGRATION_ID`
Slack chatops configuration changes:
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `chat_ops` and chat_ops_type=`slack`
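The same pattern extends to trend analysis. For example, counting configuration changes per author over the last 7 days (an illustrative aggregation, not one of the built-in examples):
sum by (author) (count_over_time({instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource` [7d]))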