Analyze Graphite metrics costs
This page explains:
- Finding the most commonly used metric prefixes or patterns among your metrics.
- Finding metrics that have the most cardinality (distinct combinations of tag values).
Knowing this can help you reduce your Grafana Cloud Graphite usage.
The “Grafana Cloud Billing/Usage” dashboard reports the total number of active time series.
You might want to find out what is driving this number up. Often the largest groups of metrics have the same prefix, the same pattern, or the same name (if you use tags to differentiate metrics).
There are several methods for investigation.
The following scripts require you to authenticate with your grafana.com credentials. You can find the instance ID and the URL on your Grafana Cloud details page in the Grafana Cloud Portal.
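For the curl examples below, it can be convenient to export the credentials once as environment variables. This is a sketch: the variable names simply match the placeholders used in the examples on this page, and the values shown are illustrative.

```shell
# Illustrative values only; substitute your own from the Grafana Cloud Portal.
export USER_ID="123456"                 # Graphite instance ID
export ACCESS_POLICY_TOKEN="glc_..."    # access policy token with read scope for metrics
export GRAPHITE_QUERY_ENDPOINT="https://graphite-us-central1.grafana.net/graphite"
```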
Analysis of a metrics listing
Below, we explain two approaches to obtain a metrics listing, which is a text file with one line per metric.
For the analysis examples below, we assume we have a file metrics-index.txt
with these contents:
stats.count.app1._root.visits
stats.timers.app1._root.timer.mean
stats.timers.app1._root.timer.lower
stats.timers.app1._root.timer.upper
stats.timers.app1._root.timer.upper_99
servers.dc1.server1.disk_free
servers.dc1.server2.disk_free
servers.dc1.server3.disk_free
Breakdown per prefix
Graphite users tend to classify their metrics by a prefix, which is typically the first two or more “nodes” (dot-separated segments) of the metric name. In this example, we get an overview of the different prefixes used, along with their counts.
cut -f1,2 -d. metrics-index.txt | sort | uniq -c | sort -n
1 stats.count
3 servers.dc1
4 stats.timers
This is helpful for showing the most commonly used prefixes. In this example, we can easily see that half of our metrics are statsd timers.
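The same pipeline works at any depth. For instance, to break down by the first three nodes instead of two (a sketch that recreates the sample metrics-index.txt from above so it is self-contained):

```shell
# Recreate the sample metrics listing from this page.
cat > metrics-index.txt <<'EOF'
stats.count.app1._root.visits
stats.timers.app1._root.timer.mean
stats.timers.app1._root.timer.lower
stats.timers.app1._root.timer.upper
stats.timers.app1._root.timer.upper_99
servers.dc1.server1.disk_free
servers.dc1.server2.disk_free
servers.dc1.server3.disk_free
EOF

# Same breakdown, but three nodes deep; largest groups sort last.
breakdown=$(cut -f1-3 -d. metrics-index.txt | sort | uniq -c | sort -n)
echo "$breakdown"
```

At this depth the per-server prefixes separate out, while the statsd timers still group under a single prefix.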
Estimating savings when aggregating metrics
In this example, we simulate the effect of introducing an aggregation. Each set of metrics that share the same name except for the server name is reduced to a single metric per datacenter that covers all servers. In this example, we can reduce our set of metrics by 25% (from 8 series to 6).
$ wc -l metrics-index.txt
8 metrics-index.txt
$ sed 's/^servers\.\(dc[0-9]\+\)\.server[0-9]\+\./servers.\1.servers_total./' metrics-index.txt | sort | uniq | wc -l
6
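The arithmetic above can be checked end to end with a self-contained sketch. It recreates the sample listing and uses a portable spelling of the same sed expression (`[0-9][0-9]*` instead of the GNU-only `[0-9]\+`):

```shell
# Recreate the sample metrics listing from this page.
cat > metrics-index.txt <<'EOF'
stats.count.app1._root.visits
stats.timers.app1._root.timer.mean
stats.timers.app1._root.timer.lower
stats.timers.app1._root.timer.upper
stats.timers.app1._root.timer.upper_99
servers.dc1.server1.disk_free
servers.dc1.server2.disk_free
servers.dc1.server3.disk_free
EOF

before=$(wc -l < metrics-index.txt)
# Collapse the per-server series into one servers_total series per datacenter,
# then count the distinct metric names that remain.
after=$(sed 's/^servers\.\(dc[0-9][0-9]*\)\.server[0-9][0-9]*\./servers.\1.servers_total./' metrics-index.txt | sort -u | wc -l)
echo "$before -> $after"
```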
Manual drilldown with the Metrics Find API
The Find API can be used to explore the subtree matching a certain query. This can be a useful first step in diagnosis.
You can manually query the /metrics/find
endpoint, which is documented in the official Graphite Documentation.
Note
This /metrics/find endpoint is for untagged metrics. Tagged metrics are neither matched nor returned.
Example:
curl -s -G \
  -u "${USER_ID}:${ACCESS_POLICY_TOKEN}" \
  --data-urlencode "query=stats.*" \
  "${GRAPHITE_QUERY_ENDPOINT}/metrics/find/" \
  | jq -r '.[].text'
...
count
timers
The result shows possible children. Drill down by modifying the query string and appending .* or any other supported pattern, for example query=stats.count.*.
Automating find queries using the walk_metrics.py script
The walk_metrics.py
script automates the process of recursively calling the /metrics/find
API.
It explores an entire hierarchy under a given root prefix (which might be "" to cover the entire metrics space).
The output is a metrics listing as described above: a list of all the metric names seen under the provided prefix, one per line.
The script is multi-threaded, so the output is not perfectly sorted; you may want to pipe it through the sort
utility.
Scanning through the list visually might make certain patterns obvious, but various kinds of precise analysis can be done using nothing more than standard shell tools, as shown in the section above.
Note
- This does not work when Graphite tags are used.
- If there are hundreds of thousands of series, then the script might take over an hour to finish.
Installation:
mkdir walk_metrics
cd walk_metrics
wget https://raw.githubusercontent.com/grafana/cloud-graphite-scripts/master/query/walk_metrics.py
chmod +x walk_metrics.py
The script requires the requests library. You can install it in a virtualenv, but that is out of scope for this documentation. To install it system-wide:
- sudo dnf install python3-requests (RedHat-based distributions)
- sudo apt install python3-requests (Debian-based distributions)
Using the walk_metrics.py
script:
usage: walk_metrics.py [-h] --url URL [--prefix PREFIX] [--user USER] [--password PASSWORD] [--concurrency CONCURRENCY] [--from SERIESFROM] [--depth DEPTH]
optional arguments:
-h, --help show this help message and exit
--url URL Graphite URL
--prefix PREFIX Metrics prefix
--user USER Basic Auth username
--password PASSWORD Basic Auth password
--concurrency CONCURRENCY
Concurrency
--from SERIESFROM Only get series that have been active since this time.
--depth DEPTH Maximum depth to traverse. If set, then the branches at the depth are printed.
Example of using the walk_metrics.py
script:
walk_metrics.py \
--url https://graphite-us-central1.grafana.net/graphite \
--user <user> \
--password <API Token> \
--from=-1w \
| tee metrics-index.txt
# Optionally:
sort metrics-index.txt > metrics-index-sorted.txt
The countSeries() function
If you know which metrics you need to monitor,
then you can use the countSeries()
function in Graphite’s query language to count the number of series found in a seriesList.
Note
This function does not resolve the pattern recursively. E.g. countSeries(foo.*)
would take into account foo.bar
but not foo.x.y.z.
This is why people sometimes query countSeries(foo.*)&countSeries(foo.*.*)&countSeries(foo.*.*.*).
If the backend returns an error that you’re trying to query too many series at once, you should try one of the other approaches mentioned above.
For more information, refer to the countSeries documentation.
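In practice, the note above translates into multiple target parameters on a single render request, for example (a sketch; URL-encode the expressions when calling the API):

```
/render?target=countSeries(foo.*)&target=countSeries(foo.*.*)&target=countSeries(foo.*.*.*)&format=json
```

Each target returns its own series, so you get one count per depth in a single response.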
Note
This also works for tagged metrics. If you have these metrics:
foo.bar;t=v1
foo.bar;t=v2
Then countSeries(seriesByTag('name=foo.bar')) will return 2.
Measuring cardinality via carbon-relay-ng
All above methods query the API of your Grafana Cloud Graphite service to obtain insights. You can also get insights using carbon-relay-ng itself. Carbon-relay-ng is the agent typically used to send data to Grafana Cloud Graphite, and we can leverage its features to analyze the cardinality of metrics traffic passing through.
Setting up aggregations to capture insights for specific series
This is done by leveraging the aggregator functionality.
Let’s say you have metrics in a format like this flowing into your carbon-relay-ng, and you would like to know how many metrics per datacenter (dc) are seen during each 10 second interval.
servers.dc1.foo 123 1599854045
servers.dc1.bar 123 1599854046
servers.dc2.foo 123 1599854045
This can easily be achieved with a count aggregation like so:
[[aggregation]]
# count how many metrics are seen each 10s, broken down by dc
function = 'count'
regex = '^servers\.(dc[0-9]+)\..*'
format = 'aggregate_count.servers.$1'
interval = 10
wait = 10
This will cause the relay to emit time series such as aggregate_count.servers.dc1
and aggregate_count.servers.dc2,
measuring at each point in time how many metrics (points) are seen for each datacenter. This is not quite the same as counting active series, but it’s a good proxy measure, especially if all metrics are sent at the same interval.
Note
The wait parameter is important. It should be set to the max time delay expected in the data. See the carbon-relay-ng aggregator documentation for more info.
Deriving insights for existing aggregations
Each aggregator defined in carbon-relay-ng emits interesting metrics that give you good clues into the volume (and the reduction of volume) of data they process.
Note
These metrics pertain to the entire aggregator and are not segmented by output key as in the above example.
service_is_carbon-relay-ng.instance_is_$instance.mtype_is_counter.unit_is_Metric.direction_is_in.aggregator_is_*
service_is_carbon-relay-ng.instance_is_$instance.mtype_is_counter.unit_is_Metric.direction_is_out.aggregator_is_*
The meanings are quite simple: the number of points going into an aggregator, and the number of points getting flushed out of the aggregator.
These are counters, so use perSecond()
to see the rate per second.
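For example, a target expression along these lines (a sketch; substitute your instance name, and optionally a specific aggregator instead of the wildcard) plots the per-second rate of points entering each aggregator:

```
perSecond(service_is_carbon-relay-ng.instance_is_$instance.mtype_is_counter.unit_is_Metric.direction_is_in.aggregator_is_*)
```

Swapping direction_is_in for direction_is_out shows the flushed rate, and the ratio between the two indicates how much the aggregation reduces your volume.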
Exploration methods that only work for tagged metrics
Many of the approaches mentioned above also work for tagged metrics, but there are a couple of additional API endpoints available to drill into tags specifically.
FindSeries API
The /tags/findSeries
endpoint is similar to /metrics/find,
except that for a given query it returns the full nameWithTags of all matching metrics. This way you can find out how many metrics match a given combination of tags.
curl -s -G \
  -u "${USER_ID}:${ACCESS_POLICY_TOKEN}" \
  --data-urlencode "expr=os=ubuntu" \
  "${GRAPHITE_QUERY_ENDPOINT}/tags/findSeries"
["foo.bar;os=ubuntu;tag1=tag2","foo.bar;os=ubuntu;tag1=tag1"]