
Reduce metrics costs via Adaptive Metrics

Adaptive Metrics is a cardinality optimization feature that allows you to identify and eliminate unused time series metrics data by means of aggregation. Recommended rules identify what metrics to aggregate based on usage within your cloud environment.

Adaptive Metrics consists of the following services:

  • The recommendations service generates recommended rules for aggregation.
  • The aggregations service implements those rules.

The recommended rules of Adaptive Metrics are updated daily, and are available for you to review using the Adaptive Metrics plugin or the Adaptive Metrics HTTP API.

After reviewing the recommended rules, use the plugin or the API to indicate which rules you want to apply. You can also create your own aggregation rules.

Use a GitHub Action to automatically apply recommended rules. Refer to the Adaptive Metrics auto-apply template repository in GitHub for more information.

Caution

Auto-apply mode for Adaptive Metrics is currently in public preview. This feature is still under development and support is limited at this time.

Get started

You can use the Adaptive Metrics plugin GUI or the Adaptive Metrics HTTP API to interact with recommended rules.

Supported metrics formats

Grafana Cloud accepts metrics data in a variety of formats, and Adaptive Metrics is compatible with the following subset of formats:

Metrics format        Supported?  Notes
Prometheus            Yes         Fully supported.
OpenTelemetry         Yes         Fully supported.
Influx Line protocol  Yes         Recommendations are limited because metadata is not sent.
Datadog               No
Graphite              No

Check if you are sending metadata for your metrics

To check whether you are sending metrics metadata, send a request to the HTTP API metadata endpoint:

console
curl -u "$METRICS_INSTANCE_ID:$API_KEY" "https://<cluster>.grafana.net/prometheus/api/v1/metadata"
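
As a rough sketch of how the response to this request could be inspected, the following Python snippet parses a response body and reports which metrics have no usable type information. It assumes the response follows the standard Prometheus metadata format (`{"status": "success", "data": {"<metric>": [{"type": ..., "help": ..., "unit": ...}]}}`). The function name `metrics_without_metadata`, the sample metric names, and the sample body are hypothetical and used only for illustration.

```python
import json


def metrics_without_metadata(response_body, expected_names):
    """Return the expected metric names that have no usable type metadata.

    Assumes the Prometheus-style metadata response shape:
    {"status": "success", "data": {"<metric>": [{"type": "...", ...}]}}
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("metadata request did not succeed")

    data = payload.get("data", {})
    missing = []
    for name in expected_names:
        # A metric absent from the response has no metadata entries at all.
        entries = data.get(name, [])
        types = {entry.get("type", "") for entry in entries}
        # Treat an empty or "unknown" type as no usable metadata.
        if not types - {"", "unknown"}:
            missing.append(name)
    return missing


# Hypothetical response body, for illustration only.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "http_requests_total": [
            {"type": "counter", "help": "Total HTTP requests.", "unit": ""}
        ],
        "queue_depth": [{"type": "unknown", "help": "", "unit": ""}],
    },
})

print(metrics_without_metadata(
    sample_body, ["http_requests_total", "queue_depth", "cpu_usage"]
))
```

In this sketch, a metric counts as lacking metadata if it is absent from the response or if its only reported type is empty or `unknown`.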

Note

Adaptive Metrics uses Prometheus metrics metadata stored in your Grafana Hosted Metrics instance to make sure that recommendations are safe to apply mathematically.

For example, for a counter-type metric, recommendations by Adaptive Metrics make sure that counter resets are handled correctly during aggregation.

If metrics metadata is not available for a metric, and Adaptive Metrics is unable to infer the metric’s type from its name or usage patterns, it produces a default recommendation for that metric that supports the most common aggregation functions (sum(…), count(…), avg(…), and sum(rate(…))). If you are using a metrics format other than Prometheus or OpenTelemetry, metrics metadata is not preserved. As a result, recommendations for those metrics may store more data than strictly necessary and produce lower cost savings.

Aggregation service: requirements on sample age

The aggregation service can only aggregate raw samples that are relatively recent. For metrics that are being aggregated, it rejects samples that arrive with too much delay.

The delay in this case is not calculated relative to wall clock time. It is the difference between the timestamp of a sample being ingested into an aggregated series and the timestamp of the newest sample that has already been ingested into that same aggregated series.

This means that if all samples that get ingested into a given aggregated series have a delay of 10 minutes relative to wall clock time, this is OK. However, if some have a timestamp which is equal to wall clock time and others have a delay of 10 minutes relative to wall clock time, then the delayed samples get dropped.

By default, the maximum allowed delay is 90 seconds. You can tune this in each aggregation rule with the aggregation_delay parameter, as documented in Define metrics aggregation rules.
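
The rule described above can be illustrated with a minimal Python sketch. This is not the aggregation service's implementation. The class `AggregatedSeries`, its method `offer`, and the exact boundary behavior (a sample is dropped only when it is strictly more than the allowed delay behind the newest sample) are assumptions made for illustration. Only the 90-second default and the "relative to the newest ingested sample" comparison come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Default maximum delay, in seconds, as stated in the text.
DEFAULT_AGGREGATION_DELAY_S = 90.0


@dataclass
class AggregatedSeries:
    """Hypothetical model of one aggregated series (illustration only)."""

    max_delay_s: float = DEFAULT_AGGREGATION_DELAY_S
    newest_ts: Optional[float] = None  # timestamp of newest accepted sample

    def offer(self, sample_ts: float) -> bool:
        """Return True if the sample is accepted, False if it is too old.

        The delay is measured against the newest sample already ingested
        into this series, not against wall clock time.
        """
        if self.newest_ts is not None:
            if self.newest_ts - sample_ts > self.max_delay_s:
                return False  # would count as aggregator-sample-too-old
        if self.newest_ts is None or sample_ts > self.newest_ts:
            self.newest_ts = sample_ts
        return True


# Assume wall clock time is t=1000 seconds.
# Case 1: every sample is about 10 minutes late, so none is late
# relative to the others and all are accepted.
uniformly_late = AggregatedSeries()
print([uniformly_late.offer(ts) for ts in (400.0, 415.0, 430.0)])

# Case 2: one sample is current, then one arrives 10 minutes behind it.
# The delayed sample is dropped with the default 90-second delay.
mixed = AggregatedSeries()
print([mixed.offer(ts) for ts in (1000.0, 400.0)])

# Case 3: the same mixed samples with a longer allowed delay of 900 seconds.
tuned = AggregatedSeries(max_delay_s=900.0)
print([tuned.offer(ts) for ts in (1000.0, 400.0)])
```

The first case accepts every sample because all of them are equally late. The second case drops the delayed sample. The third case accepts it because the allowed delay was raised.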

If Grafana Cloud rejects samples for this reason, you can see an increase in aggregator-sample-too-old errors on the Discarded Metrics Samples panel of your billing dashboard.

This sample age requirement only applies to samples that belong to metrics that are being aggregated by Adaptive Metrics. For more information and troubleshooting steps, refer to Troubleshoot Discarded Raw Samples.

Why this happens

To compute an aggregation, the aggregation service must wait for the raw samples associated with that metric to arrive. We don’t know how many samples will arrive, and we cannot wait indefinitely for them, because the longer we wait, the longer it takes for the data to become queryable or visible in dashboards.

If a sample arrives after our configured waiting time, it is not taken into account when the aggregated value is computed. Because our metrics database is immutable, we cannot update the aggregated value after it has been computed to reflect this late-arriving data point.

Manage access to Adaptive Metrics

You can use role-based access control to manage access to Adaptive Metrics. For more information, refer to Manage access to Adaptive Metrics using role-based access control.

Troubleshooting

If you encounter issues querying a metric that has been aggregated, see Troubleshoot your aggregated metrics query. For any other questions or feedback, contact your Customer Success Manager or file a support request.