Define metrics aggregation rules

The aggregations service lets you aggregate metrics into lower-cardinality versions of themselves. You can define and apply your own aggregation rules, or apply the rules recommended by the recommendations service.

Aggregation rule format

The aggregations service expects the following format:

| Field name | Data type | Description |
| --- | --- | --- |
| metric | string | The metric name or metric name matcher to which the aggregation rule applies. |
| match_type | string (optional) | The type of matching to be done against the value of the metric field. For valid values, see Substring matchers. If you do not specify match_type, the value is exact. |
| drop | bool (optional) | If set to true, the entire metric is dropped instead of aggregated. If you set this to true, you cannot use the drop_labels and aggregations fields. If you do not specify drop, the value is false. |
| drop_labels | string array | The list of labels that will be aggregated away; each of these labels that is present in the original series will have its value set to <aggregated>. You can specify either drop_labels or keep_labels, but you can’t use both fields within the same rule. |
| keep_labels | string array | The list of labels that will be retained. The value of all labels not present in this list will be replaced by <aggregated>. You can specify either keep_labels or drop_labels, but you can’t use both fields within the same rule. |
| aggregations | string array | The list of aggregation functions to apply to the metric or metrics that are matched by this rule. For valid values, see Supported aggregation types. |
| aggregation_interval | string duration (optional) | The interval of samples that are included in a single emitted aggregated sample. See Configure the aggregation interval for valid values. If you set aggregation_interval, you also need to specify the aggregation_delay field. |
| aggregation_delay | string duration (optional) | The delay after which the aggregated samples are emitted. See Configure the aggregation delay for valid values. If you set aggregation_delay, you also need to specify the aggregation_interval field. |

The following example shows an aggregation rule for the metric proxy_sql_queries_total:

json
{
  "metric": "proxy_sql_queries_total",
  "drop_labels": ["container", "instance", "namespace", "pod"],
  "aggregations": ["sum:counter"]
}
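The field constraints in the table above can be checked locally before you submit a rule. The following is a minimal sketch of such a pre-check, not the service's own validation: the function name validate_rule and the error messages are hypothetical, and only the constraints stated in the table are covered.

```python
def validate_rule(rule: dict) -> list:
    """Return a list of problems found in one aggregation rule (empty if none)."""
    problems = []
    if not isinstance(rule.get("metric"), str):
        problems.append("metric must be a string")
    # match_type defaults to exact when it is not specified.
    if rule.get("match_type", "exact") not in ("exact", "prefix", "suffix"):
        problems.append("match_type must be exact, prefix, or suffix")
    if rule.get("drop", False):
        # A dropped metric cannot also be aggregated.
        for field in ("drop_labels", "aggregations"):
            if field in rule:
                problems.append(f"{field} cannot be used when drop is true")
    if "drop_labels" in rule and "keep_labels" in rule:
        problems.append("use either drop_labels or keep_labels, not both")
    # aggregation_interval and aggregation_delay must be specified together.
    if ("aggregation_interval" in rule) != ("aggregation_delay" in rule):
        problems.append("aggregation_interval and aggregation_delay must be set together")
    return problems


example = {
    "metric": "proxy_sql_queries_total",
    "drop_labels": ["container", "instance", "namespace", "pod"],
    "aggregations": ["sum:counter"],
}
print(validate_rule(example))  # an empty list means no problems were found
```

A rule that sets both drop_labels and keep_labels, or only one of the two timing fields, yields one entry in the returned list.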

Supported aggregation types

The following values are supported for the aggregations field of an aggregation rule:

| Aggregation function | Definition |
| --- | --- |
| sum:counter | The running sum of all increases of raw series values. Applicable to counter type metrics, and correctly accounts for counter resets. A counter type metric is conceptually similar to elevation gain. For example, if a cyclist counts their elevation gain by peak, they can sum several peaks’ worth of elevation gain to understand how much they’ve climbed in total. The elevation gain for each peak over time is a raw series. If you specify the sum:counter aggregation with "drop_labels": ["peak"] for this metric, the per-peak raw series would be aggregated into one series that would tell the cyclist the total amount they climbed over time. From this aggregated data, they can no longer tell how much they have climbed in total for a given peak. |
| sum | The sum of all values across the aggregated series at a given time stamp. The sum aggregation is not useful for counter type metrics; for counter type metrics, use sum:counter instead. |
| min | The minimum of all values across all the aggregated series at a given time stamp. |
| max | The maximum of all values across all the aggregated series at a given time stamp. |
| count | The number of raw series that feed into the aggregated series at a given time stamp. |
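To illustrate why sum:counter differs from sum for counters, the following sketch sums reset-aware increases across raw series. It assumes the common convention that a decrease in a counter value signals a reset to zero; the actual implementation in the aggregation service is not described here, and the helper names and sample data are hypothetical.

```python
def counter_increase(samples: list) -> float:
    """Total increase of one raw counter series, treating any decrease as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        total += (cur - prev) if cur >= prev else cur
    return total


def sum_counter(raw_series: dict) -> float:
    """Sum the reset-aware increases of all raw series (hypothetical helper)."""
    return sum(counter_increase(samples) for samples in raw_series.values())


# Hypothetical per-peak elevation gain; "peak_a" resets between its 2nd and 3rd sample.
raw = {"peak_a": [10, 15, 3, 8], "peak_b": [0, 4, 9]}

total_gain = sum_counter(raw)                            # (5 + 3 + 5) + (4 + 5)
naive_sum = sum(samples[-1] for samples in raw.values())  # plain sum at the last time stamp
print(total_gain, naive_sum)
```

The plain sum at the last time stamp loses the increase that happened before the reset, which is why the reset-aware variant is the better fit for counters.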

Substring matchers

By default, a rule is applied to the metric name specified in the rule’s metric field. In addition, Adaptive Metrics allows you to write rules that apply to all metrics whose names match a given prefix or suffix. To apply rules to all such metrics, use the optional field match_type in your rule and set it to prefix or suffix.

The match_type field supports the following values:

  • exact: Apply the rule to the metric whose name is specified in the rule’s metric field. Because metric names are unique, the rule will only apply to one metric.
  • prefix: Apply the rule to all metrics whose names start with the string in the rule’s metric field.
  • suffix: Apply the rule to all metrics whose names end with the string in the rule’s metric field.

An example rule that matches all metrics beginning with http_requests_total_, and that aggregates away their instance label using the sum:counter function, looks as follows:

json
{
  "metric": "http_requests_total_",
  "match_type": "prefix",
  "drop_labels": ["instance"],
  "aggregations": ["sum:counter"]
}

An exact match takes precedence over a prefix or suffix match. In the following rule file, two rules potentially apply to the metric http_requests_total_abc. Because the exact match takes precedence over the prefix match, both the instance and pod labels are aggregated away for http_requests_total_abc:

json
[
  {
    "metric": "http_requests_total_",
    "match_type": "prefix",
    "drop_labels": ["instance"],
    "aggregations": ["sum:counter"]
  },
  {
    "metric": "http_requests_total_abc",
    "drop_labels": ["instance", "pod"],
    "aggregations": ["sum:counter"]
  }
]

If multiple prefix or suffix matchers match a metric and no exact rule does, the first matching rule in the list wins. Consider a rule file with the following two rules:

json
[
  {
    "metric": "http_requests_total_",
    "match_type": "prefix",
    "drop_labels": ["instance"],
    "aggregations": ["sum:counter"]
  },
  {
    "metric": "_abc",
    "match_type": "suffix",
    "drop_labels": ["pod"],
    "aggregations": ["sum:counter"]
  }
]

In this scenario, the metric http_requests_total_abc is matched by both rules. Because neither rule is an exact match, the first rule in the list takes precedence. This means that the instance label, not the pod label, is aggregated away for http_requests_total_abc.
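The precedence described in this section can be sketched as a small lookup: an exact rule wins if one exists; otherwise the first prefix or suffix rule in list order applies. This is an illustrative model of the documented behavior, not the service's code, and the function names matches and select_rule are hypothetical.

```python
def matches(rule: dict, name: str) -> bool:
    """Check whether a rule's matcher applies to a metric name."""
    match_type = rule.get("match_type", "exact")  # exact is the default
    if match_type == "prefix":
        return name.startswith(rule["metric"])
    if match_type == "suffix":
        return name.endswith(rule["metric"])
    return name == rule["metric"]


def select_rule(name: str, rules: list):
    """Return the rule that applies to a metric name, or None if no rule matches."""
    # An exact match takes precedence over any prefix or suffix match.
    for rule in rules:
        if rule.get("match_type", "exact") == "exact" and rule["metric"] == name:
            return rule
    # Otherwise the first matching rule in the list wins.
    for rule in rules:
        if matches(rule, name):
            return rule
    return None


prefix_rule = {"metric": "http_requests_total_", "match_type": "prefix",
               "drop_labels": ["instance"], "aggregations": ["sum:counter"]}
exact_rule = {"metric": "http_requests_total_abc",
              "drop_labels": ["instance", "pod"], "aggregations": ["sum:counter"]}
suffix_rule = {"metric": "_abc", "match_type": "suffix",
               "drop_labels": ["pod"], "aggregations": ["sum:counter"]}

with_exact = select_rule("http_requests_total_abc", [prefix_rule, exact_rule])
without_exact = select_rule("http_requests_total_abc", [prefix_rule, suffix_rule])
print(with_exact["drop_labels"], without_exact["drop_labels"])
```

With the exact rule present, the selected rule drops instance and pod; with only the prefix and suffix rules, the prefix rule is selected because it comes first.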

Configure an aggregation

As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about the total energy consumption per street and the average consumption per building on a street, you could configure two aggregations where one sums the consumption of all buildings in a street and the other counts the buildings of the street.

Because the metric electrical_throughput_total is a counter, you need to use the sum:counter aggregation (instead of the sum aggregation) to handle counter resets correctly:

json
{
  "metric": "electrical_throughput_total",
  "drop_labels": ["building_number"],
  "aggregations": ["sum:counter", "count"]
}

Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total. In its place, it would compute and store aggregated values per street for this metric.

The sum:counter aggregation function computes the total electrical throughput for each value of the street_name label. The count aggregation function computes the number of buildings per street. Together, these two values can be used to compute the average consumption per building for each street.

However, because the building_number label has been discarded, it is no longer possible to understand how much power a specific building consumes.

The following PromQL queries show examples of the sum, sum by, count, count by, and avg aggregation operators:

  • Sum the rate of electrical throughput per street:

    promql
    sum by (street_name) (rate(electrical_throughput_total[5m]))
  • Sum the rate of electrical throughput for buildings on <EXAMPLE-STREET>:

    promql
    sum(rate(electrical_throughput_total{street_name="<EXAMPLE-STREET>"}[5m]))
  • Count the number of buildings per street that are producing electrical throughput:

    promql
    count by (street_name) (electrical_throughput_total)
  • Count the total number of buildings that are producing electrical throughput:

    promql
    count(electrical_throughput_total)
  • Get the average rate of electrical throughput:

    promql
    avg(rate(electrical_throughput_total[5m]))

Limits on the aggregation service

The Adaptive Metrics feature has limits, which are necessary to guarantee a highly reliable service. These limits are designed to adjust automatically to your usage of the service. This means that as long as your usage of Adaptive Metrics increases gradually, you should not expect to hit limits under normal circumstances. However, if your usage increases substantially over a short period of time, you might experience rate limiting. In this case, limits adapt to the changed usage pattern automatically after some time (usually within 24 hours). If you are experiencing sustained rate limiting beyond this time frame, contact Grafana Labs Support.

Number of aggregated series

The Adaptive Metrics aggregation service enforces limits on the number of series that can be aggregated. If these limits are exceeded, the aggregation service begins to discard incoming samples.

When this happens, you will see an increase in aggregator-too-many-aggregated-series or aggregator-too-many-raw-series errors in the Discarded Metrics Samples panel of your billing dashboard.

Rate of samples to aggregate

There is also a limit on the rate at which samples can be forwarded to the Adaptive Metrics aggregation service. If this limit is exceeded, the API returns a 429 status code and you will see an increase in aggregations-max-ingestion-rate-exceeded errors in the Discarded Metrics Samples panel of your billing dashboard.

Drop a metric

You can also configure an aggregation rule that causes the entire metric to be dropped. If you don’t want to persist any time series at all for electrical_throughput_total, from the example in Configure an aggregation, you would configure a rule as follows:

json
{
  "metric": "electrical_throughput_total",
  "drop": true
}

This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.

Note

Generally, aggregation is preferable to dropping a metric entirely. By aggregating a metric, you can usually reduce its cardinality by 80-90% while keeping a lower-fidelity version of it in the database. This can be useful during the investigation of an incident. If you drop a metric, you reduce costs a bit more, but you eliminate all traces of the metric. This means that you do not see this metric when looking in the metric-name browser in Grafana Explore.

If you drop a metric, it shows up on the Discarded Metrics Samples panel with a label that provides context about why it was dropped.

Most of these labels are self-explanatory. The requested-by-configuration label means that the samples were dropped intentionally by aggregation rules that the aggregation service applies.

Drop a label

You can drop a label before you ingest data.

Dropping a label is useful in cases where granularity is not needed. For example, if you want to measure energy consumption in locations where the temperatures are typically high in the summer, you can drop labels for locations whose temperatures are low if you do not need to monitor those.

Dropping a label reduces the cardinality for that metric name, and thereby decreases the total number of billable active series.
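The effect on cardinality can be estimated by counting the distinct label sets that remain once a label is removed. The following sketch uses hypothetical series data based on the power grid example; the helper cardinality_after_drop is an illustration, not part of any Grafana API.

```python
def cardinality_after_drop(series: list, drop_labels: set) -> int:
    """Count distinct label sets that remain after removing the given labels."""
    remaining = {
        frozenset((key, value) for key, value in labels.items() if key not in drop_labels)
        for labels in series
    }
    return len(remaining)


# Hypothetical raw series for electrical_throughput_total.
series = [
    {"street_name": "main", "building_number": "1"},
    {"street_name": "main", "building_number": "2"},
    {"street_name": "oak", "building_number": "1"},
]

before = cardinality_after_drop(series, set())
after = cardinality_after_drop(series, {"building_number"})
print(before, after)
```

In this small data set, three raw series collapse into two once building_number is dropped, one per street.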

Configure the aggregation interval and the DPM of the aggregated metric

The number of data points per minute (DPM) that are stored for the aggregated metric depends directly on the aggregation interval of the metric, which is the interval at which the aggregated samples are emitted.

The default aggregation_interval value matches the included DPM per series of your organization. For organizations with the default resolution of 1 DPM, this means a default interval of 60s.

The valid values for aggregation_interval are 6s, 10s, 15s, 20s, 30s, and 60s, corresponding to 10 DPM, 6 DPM, 4 DPM, 3 DPM, 2 DPM, and 1 DPM respectively.

Note

Changing the value of the aggregation_interval setting causes a small gap in the data for the affected aggregated metrics while the aggregation is being initialized with the new parameters.

If you want to increase the DPM of the aggregated metric, decrease the aggregation_interval to one of the supported values.
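The relationship between the interval and the resulting DPM is simply 60 seconds divided by the interval. A minimal sketch, using the valid values listed above (the function name dpm_for_interval is hypothetical):

```python
# Valid aggregation_interval values, in seconds, as listed in this section.
VALID_INTERVALS_SECONDS = (6, 10, 15, 20, 30, 60)


def dpm_for_interval(interval_seconds: int) -> int:
    """Return the data points per minute produced by a given aggregation interval."""
    if interval_seconds not in VALID_INTERVALS_SECONDS:
        raise ValueError(f"unsupported aggregation_interval: {interval_seconds}s")
    return 60 // interval_seconds


dpm_table = {f"{seconds}s": dpm_for_interval(seconds) for seconds in VALID_INTERVALS_SECONDS}
print(dpm_table)
```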

Note

You can set the aggregation_interval individually for each aggregation rule.

You can also ask Grafana Cloud support to set a global value for aggregation_interval as the default for all aggregation rules. Open a support ticket in the Cloud Portal to request this.

Caution

Increasing the DPM of the aggregated metric might incur additional costs.

Configure the aggregation delay

The aggregation_delay is the delay after which the aggregated samples are emitted. The default value is 90s (1m30s). The valid values for aggregation_delay are 15s, 30s, 60s, 1m30s, 2m, 2m30s, and 3m.

Note

Changing the value of the aggregation_delay setting causes a small gap in the data for the affected aggregated metrics while the aggregation is being initialized with the new parameters.

Increase the aggregation_delay to emit the aggregated samples later and reduce the risk of excluding the samples that are received late (because of a lagging remote write client, for example).

The total delay between the time of the raw sample arriving at Grafana Cloud and the time that the aggregated sample becomes queryable is usually the sum of the aggregation_interval and the aggregation_delay. It’s possible that there are transient fluctuations in the real delay with which aggregations are produced. The aggregation_delay is a minimum that guarantees that aggregates never get emitted sooner than the configured duration.
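As a worked example of the usual total delay, the following sketch parses the two duration strings of a rule and adds them. The parser only handles the minute and second forms used in this document (such as 30s, 2m, and 1m30s); the helper names are hypothetical, and the rule reuses the electrical_throughput_total example with both timing fields set, since they must be specified together.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '30s', '2m', or '1m30s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


def typical_total_delay(rule: dict) -> int:
    """Usual delay in seconds before an aggregated sample becomes queryable."""
    return parse_duration(rule["aggregation_interval"]) + parse_duration(rule["aggregation_delay"])


rule = {
    "metric": "electrical_throughput_total",
    "drop_labels": ["building_number"],
    "aggregations": ["sum:counter", "count"],
    "aggregation_interval": "30s",
    "aggregation_delay": "1m30s",
}

total_seconds = typical_total_delay(rule)  # 30 + 90
print(total_seconds)
```

Because of the transient fluctuations mentioned above, treat the result as a typical value rather than an upper bound.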

Note

You can set the aggregation_delay individually for each aggregation rule.

You can also ask Grafana Cloud support to set a global value for aggregation_delay as the default for all aggregation rules. Open a support ticket in the Cloud Portal to request this.