Caution
Grafana Alloy is the new name for our distribution of the OTel collector. Grafana Agent has been deprecated and is in Long-Term Support (LTS) through October 31, 2025. Grafana Agent will reach an End-of-Life (EOL) on November 1, 2025. Read more about why we recommend migrating to Grafana Alloy.
Important: This documentation is about an older version. It is relevant only to the release noted; many of the features and functions have been updated or replaced. Please view the current version.
Operation guide
This guide helps you operate Grafana Agent.
Horizontal Scaling
There are three options to horizontally scale your deployment of Grafana Agents:
- Host filtering requires you to run one Agent on every machine you wish to collect metrics from. Agents will only collect metrics from the machines they run on.
- Hashmod sharding allows you to roughly shard the discovered set of targets by using hashmod/keep relabel rules.
- The scraping service allows you to cluster Grafana Agents and have them distribute per-tenant configs throughout the cluster.
Each has its own set of tradeoffs:
- Host filtering (Beta)
  - Pros
    - Does not need specialized configs per Agent
    - No external dependencies required to operate
  - Cons
    - Can cause significant load on service discovery APIs
    - Requires each Agent to have the same list of scrape configs/remote_writes
- Hashmod sharding (Stable)
  - Pros
    - Exact control over the number of shards to run
    - Smaller load on SD compared to host filtering (as there are fewer Agents)
    - No external dependencies required to operate
  - Cons
    - Each Agent must have a specialized config with its shard number inserted into the hashmod/keep relabel rule pair.
    - Requires each Agent to have the same list of scrape configs/remote_writes, with the exception of the hashmod rule being different.
    - Hashmod is not consistent hashing, so up to 100% of jobs will move to a new machine when scaling shards.
- Scraping service (Beta)
  - Pros
    - Agents don’t have to have a synchronized set of scrape configs/remote_writes (they pull from a centralized location).
    - Exact control over the number of shards to run.
    - Uses consistent hashing, so only about 1/N of jobs will move to a new machine when scaling shards.
    - Smallest load on SD compared to host filtering, as only one Agent is responsible for a config.
  - Cons
    - Centralized configs must discover a minimal set of targets to distribute evenly.
    - Requires running a separate KV store to store the centralized configs.
    - Managing centralized configs adds operational burden over managing a config file.
Host filtering (Beta)
Host filtering implements a form of “dumb sharding,” where operators may deploy one Grafana Agent instance per machine in a cluster, all using the same configuration, and the Grafana Agents will only scrape targets that are running on the same node as the Agent.
Running with host_filter: true means that if you have a target whose host machine is not also running a Grafana Agent process, that target will not be scraped!
Host filtering is usually paired with a dedicated Agent process that is used for scraping targets that are running outside of a given cluster. For example, when running the Grafana Agent on GKE, you would have a DaemonSet with host_filter for scraping in-cluster targets, and a single dedicated Deployment for scraping other targets that are not running on a cluster node, such as the Kubernetes control plane API.
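As a rough sketch of the DaemonSet side of that setup, the fragment below enables host filtering on a single metrics instance. The instance name, WAL directory, job name, and remote_write URL are placeholders chosen for illustration, not values taken from this guide.

```yaml
metrics:
  wal_directory: /tmp/agent/wal        # placeholder path
  configs:
    - name: in-cluster                 # hypothetical instance name
      # Only keep targets that run on the same node as this Agent.
      host_filter: true
      scrape_configs:
        - job_name: kubernetes-pods    # hypothetical job name
          kubernetes_sd_configs:
            - role: pod
      remote_write:
        - url: https://prometheus.example.com/api/v1/write  # placeholder endpoint
```

Every Agent in the DaemonSet would run this same file. The dedicated Deployment would use a separate config without host_filter for targets that do not run on a cluster node.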
If you want to scale your scrape load without host filtering, you can use the scraping service instead.
The hostname of the Agent is determined by reading $HOSTNAME. If $HOSTNAME isn’t defined, the Agent will use Go’s os.Hostname to determine the hostname.
The following meta-labels are used to determine if a target is running on the same machine as the Agent:
- __address__
- __meta_consul_node
- __meta_dockerswarm_node_id
- __meta_dockerswarm_node_hostname
- __meta_dockerswarm_node_address
- __meta_kubernetes_pod_node_name
- __meta_kubernetes_node_name
- __host__
The final label, __host__, isn’t a label added by any Prometheus service discovery mechanism. Rather, __host__ can be generated by using host_filter_relabel_configs. This allows for custom relabeling rules to determine the hostname where the predefined ones fail. Relabeling rules added with host_filter_relabel_configs are temporary and just used for the host filtering mechanism. Full relabeling rules should be applied in the appropriate scrape_config instead.
Note that scrape_config relabel_configs do not apply to the host filtering logic; only host_filter_relabel_configs will work.
If the determined hostname matches any of the meta labels, the discovered target is allowed. Otherwise, the target is ignored, and will not show up in the targets API.
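To illustrate how __host__ might be populated when none of the predefined meta-labels apply, here is a minimal sketch for EC2 service discovery. It assumes the Agent's hostname equals the instance's private DNS name, which may not hold in every environment. The instance name, job name, and region are placeholders.

```yaml
metrics:
  configs:
    - name: ec2-hosts                  # hypothetical instance name
      host_filter: true
      # Used only by the host filtering logic; these rules do not
      # change the labels on scraped series.
      host_filter_relabel_configs:
        - source_labels: [__meta_ec2_private_dns_name]
          target_label: __host__
      scrape_configs:
        - job_name: ec2-nodes          # hypothetical job name
          ec2_sd_configs:
            - region: us-east-1        # placeholder region
```

With this rule, a discovered target is kept only when its private DNS name matches the hostname the Agent determined for itself.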
Hashmod sharding (Stable)
Grafana Agents can be sharded by using a pair of hashmod/keep relabel rules. These rules hash the address of a target and take the result modulo the number of Agent shards that are running.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs
    relabel_configs:
      - source_labels: [__address__]
        modulus: 4 # 4 shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^1$ # This is the 2nd shard
        action: keep
Add the relabel_configs to all of your scrape_config blocks. Ensure that each running Agent shard has a different value for the regex; the first Agent shard should have ^0$, the second should have ^1$, and so on, up to ^3$.
This sharding mechanism means each Agent will scrape roughly 1/N of the total targets and ignore the rest, where N is the number of shards. This allows for horizontally scaling the number of Agents and distributing load between them.
Note that the hashmod used here is not a consistent hashing algorithm; this means that changing the number of shards may cause any number of targets to move to a new shard, up to 100%. When moving to a new shard, any existing data in the WAL from the old machine is effectively discarded.
Prometheus instances
The Grafana Agent defines a concept of a Prometheus Instance, which is its own mini Prometheus-lite server. The instance runs a combination of Prometheus service discovery, scraping, a WAL for storage, and remote_write.
Instances allow for fine-grained control of what data gets scraped and where it gets sent. Users can easily define two Instances that scrape different subsets of metrics and send them to two completely different remote_write systems.
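A minimal sketch of two such Instances is shown here. The instance names, job names, target addresses, and remote_write URLs are all hypothetical and only illustrate the shape of the config.

```yaml
metrics:
  wal_directory: /tmp/agent/wal        # placeholder path
  configs:
    - name: infrastructure             # hypothetical instance name
      scrape_configs:
        - job_name: node
          static_configs:
            - targets: ['localhost:9100']
      remote_write:
        - url: https://metrics-a.example.com/api/v1/write  # placeholder endpoint
    - name: applications               # hypothetical instance name
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: ['localhost:8080']
      remote_write:
        - url: https://metrics-b.example.com/api/v1/write  # placeholder endpoint
```

Because the two Instances differ in their remote_write settings, they would not be treated as compatible and would not be merged into one shared Instance.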
Instances are especially relevant to the scraping service mode, where breaking up your scrape configs into multiple Instances is required for sharding and balancing scrape load across a cluster of Agents.
Instance sharing (Stable)
The v0.5.0 release of the Agent introduced the concept of instance sharing, which combines scrape_configs from compatible instance configs into a single, shared Instance. Instance configs are compatible when they have no differences in configuration with the exception of what they scrape. remote_write configs may also differ in the order in which endpoints are declared, but the unsorted remote_writes must still be an exact match.
In the shared instances mode, the name field of remote_write configs is ignored. The resulting remote_write configs will have a name identical to the first six characters of the group name and the first six characters of the hash from that remote_write config separated by a -.
The shared instances mode is the new default, and the previous behavior is deprecated. If you wish to restore the old behavior, set instance_mode: distinct in the metrics_config block of your config file.
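As a sketch, and assuming the metrics_config block is the top-level metrics key of the config file, opting out of shared instances might look like the following. The instance name and scrape target are placeholders.

```yaml
metrics:
  # Restore the deprecated one-Instance-per-config behavior.
  # Omit this line (or set it to "shared") to keep the default.
  instance_mode: distinct
  configs:
    - name: default                    # hypothetical instance name
      scrape_configs:
        - job_name: agent
          static_configs:
            - targets: ['localhost:12345']  # placeholder target
```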
Shared instances are completely transparent to the user with the exception of exposed metrics. With instance_mode: shared, metrics for Prometheus components (WAL, service discovery, remote_write, etc.) have an instance_group_name label, which is the hash of all settings used to determine the shared instance. When instance_mode: distinct is set, the metrics for Prometheus components will instead have an instance_name label, which matches the name set on the individual Instance config. It is recommended to use the default of instance_mode: shared unless you don’t mind the performance hit and really need granular metrics.
Users can use the targets API to see all scraped targets, and the name of the shared instance they were assigned to.