Reduce metrics costs via Adaptive Metrics
Adaptive Metrics is a cardinality optimization feature that allows you to identify and eliminate unused time series metrics data by means of aggregation. Recommended rules identify what metrics to aggregate based on usage within your cloud environment.
Adaptive Metrics consists of the following services:
- The recommendations service generates recommended rules for aggregation.
- The aggregations service implements those rules.
The recommended rules of Adaptive Metrics are updated daily, and are available for you to review using the Adaptive Metrics plugin or the Adaptive Metrics HTTP API.
After reviewing the recommended rules, use the plugin or the API to indicate which rules you want to apply. You can also create your own aggregation rules.
Use a GitHub Action to automatically apply recommended rules. Refer to the Adaptive Metrics auto-apply template repository in GitHub for more information.
Caution
Auto-apply mode for Adaptive Metrics is currently in public preview. This feature is still under development and support is limited at this time.
Get started
You can use the Adaptive Metrics plugin GUI or the Adaptive Metrics HTTP API to interact with recommended rules.
From the Rules tab, review the current aggregation rule recommendations.
Select the rules you want to apply. For more information, refer to Adaptive Metrics plugin.
Download the JSON file with recommended and applied aggregation rules.
From the Rules tab, click the download button.
From the Download recommendations dialog box, select which set of rules you want to download. You can download recommendations or applied rules in verbose or non-verbose formats.
Review and update the rules, as required.
Upload the JSON file to the $URL/aggregations/rules Adaptive Metrics API endpoint. For more information, refer to the Adaptive Metrics API.
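As a rough illustration of the upload step, the following Python sketch builds, but does not send, an HTTP request for the $URL/aggregations/rules endpoint. Only the endpoint path comes from this page. The POST method, basic authentication, the JSON content type, and the optional If-Match header are assumptions, and the function and parameter names are hypothetical. Confirm the details against the Adaptive Metrics API reference before relying on it.

```python
import base64
import json
import urllib.request


def build_rules_upload_request(base_url, user, token, rules, etag=None):
    """Build (but do not send) a request that uploads aggregation rules.

    Assumptions, not confirmed by this page: the endpoint accepts POST with
    basic authentication and a JSON body, and an If-Match header may be
    needed to guard against concurrent updates.
    """
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    headers = {
        "Authorization": f"Basic {credentials}",
        "Content-Type": "application/json",
    }
    if etag is not None:
        # Hypothetical optimistic-concurrency header; check the API reference.
        headers["If-Match"] = etag
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/aggregations/rules",
        data=json.dumps(rules).encode(),
        headers=headers,
        method="POST",
    )
```

To send the request, load the reviewed JSON file with json.load, pass the result as rules, and hand the returned object to urllib.request.urlopen.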
Supported metrics formats
Grafana Cloud accepts metrics data in a variety of formats, and Adaptive Metrics is compatible with the following subset of formats:
Metrics format | Supported? | Notes |
---|---|---|
Prometheus | Yes | Fully supported. |
OpenTelemetry | Yes | Fully supported. |
Influx Line protocol | Yes | Recommendations are limited because metadata is not sent. |
Datadog | No | |
Graphite | No | |
Check if you are sending metadata for your metrics
To check whether you are sending metrics metadata, send a request to the HTTP API metadata endpoint:
curl -u "$METRICS_INSTANCE_ID:$API_KEY" "https://<cluster>.grafana.net/prometheus/api/v1/metadata"
Note
Adaptive Metrics uses Prometheus metrics metadata stored in your Grafana Hosted Metrics instance to make sure that recommendations are safe to apply mathematically.
For example, for a counter-type metric, recommendations by Adaptive Metrics make sure that counter resets are handled correctly during aggregation.
If metrics metadata is not available for a metric, and Adaptive Metrics cannot infer the metric’s type from its name or usage patterns, it produces a default recommendation for that metric that supports the most common aggregation functions (sum(…), count(…), avg(…), and sum(rate(…))). If you are using a metrics format other than Prometheus or OpenTelemetry, metrics metadata is not preserved. As a result, recommendations for those metrics may store more data than strictly necessary and yield lower cost savings.
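To make the metadata check concrete, here is a minimal Python sketch that inspects a response from the metadata endpoint and lists which metrics have no metadata. It assumes the standard Prometheus metadata response shape: a "data" object that maps each metric name to a list of entries with "type", "help", and "unit" fields. The metric names and the function name are hypothetical.

```python
import json


def metrics_without_metadata(metadata_response, metric_names):
    """Return the metric names that have no metadata entry in the response."""
    data = metadata_response.get("data", {})
    # A metric counts as missing if it is absent or has an empty entry list.
    return [name for name in metric_names if not data.get(name)]


# Hypothetical response body, shaped like Prometheus /api/v1/metadata output.
sample_body = """
{
  "status": "success",
  "data": {
    "http_requests_total": [
      {"type": "counter", "help": "Total HTTP requests.", "unit": ""}
    ]
  }
}
"""

response = json.loads(sample_body)
missing = metrics_without_metadata(
    response, ["http_requests_total", "queue_depth"]
)
print(missing)  # prints ['queue_depth']
```

Metrics reported as missing are the ones for which Adaptive Metrics would have to infer the type or fall back to a default recommendation.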
Aggregation service: requirements on sample age
The aggregation service can only aggregate raw samples that are relatively recent. It rejects samples for aggregated metrics that arrive with too much delay.
The delay in this case is not calculated relative to wall clock time. It is the difference between the timestamp of a sample being ingested into an aggregated series and the timestamp of the newest sample already ingested into that same aggregated series.
This means that if all samples that get ingested into a given aggregated series have a delay of 10 minutes relative to wall clock time, this is OK. However, if some have a timestamp which is equal to wall clock time and others have a delay of 10 minutes relative to wall clock time, then the delayed samples get dropped.
By default, the maximum allowed delay is 90 seconds. You can tune this in each aggregation rule with the aggregation_delay parameter, as documented in Define metrics aggregation rules.
If Grafana Cloud rejects samples for this reason, you can see an increase in aggregator-sample-too-old errors on the Discarded Metrics Samples panel of your billing dashboard.
This sample age requirement only applies to samples that belong to metrics that are being aggregated by Adaptive Metrics. For more information and troubleshooting steps, refer to Troubleshoot Discarded Raw Samples.
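The sample age rule described above can be sketched as a small simulation. This is an illustrative model, not the aggregation service's actual implementation: it tracks the newest timestamp accepted into one aggregated series and rejects any sample that is more than the allowed delay behind it. Treating a delay exactly equal to the limit as acceptable is an assumption, and the function name is hypothetical.

```python
DEFAULT_MAX_DELAY_SECONDS = 90  # default maximum delay stated above


def split_by_sample_age(timestamps, max_delay=DEFAULT_MAX_DELAY_SECONDS):
    """Split sample timestamps (in arrival order) into accepted and rejected.

    All timestamps belong to one aggregated series. A sample is rejected when
    it is more than max_delay seconds older than the newest accepted sample.
    """
    accepted, rejected = [], []
    newest = None
    for ts in timestamps:
        if newest is not None and newest - ts > max_delay:
            rejected.append(ts)
            continue
        accepted.append(ts)
        if newest is None or ts > newest:
            newest = ts
    return accepted, rejected


wall_clock = 10_000  # arbitrary "now", in seconds

# Every sample is 10 minutes behind wall clock time: nothing is rejected,
# because the delay is measured against the series, not the wall clock.
uniformly_late = [wall_clock - 600 + i * 15 for i in range(4)]
print(split_by_sample_age(uniformly_late))
# prints ([9400, 9415, 9430, 9445], [])

# One sample at wall clock time, then one that is 10 minutes late:
# the late sample is rejected.
mixed = [wall_clock, wall_clock - 600, wall_clock + 15]
print(split_by_sample_age(mixed))
# prints ([10000, 10015], [9400])
```

Passing a larger max_delay models the effect of raising aggregation_delay for a rule: late samples are tolerated for longer, at the cost of aggregated data becoming queryable later.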
Why this happens
To compute an aggregation, we must wait for all raw samples associated with that metric to arrive. We don’t know how many samples will arrive, nor can we wait indefinitely for them, because the longer we wait, the longer the delay before the data is queryable or visible in dashboards.
If a sample arrives after our configured waiting time, it is not taken into account when the aggregated value is computed. Because our metrics database is immutable once the aggregation has been computed, we cannot update the aggregated value to reflect this late-arriving data point.
Manage access to Adaptive Metrics
You can use role-based access control to manage access to Adaptive Metrics. For more information, refer to Manage access to Adaptive Metrics using role-based access control.
Troubleshooting
If you encounter issues querying a metric that has been aggregated, see Troubleshoot your aggregated metrics query. For any other questions or feedback, contact your Customer Success Manager or file a support request.