How we use metamonitoring Prometheus servers to monitor all other Prometheus servers at Grafana Labs
One of the big questions in monitoring can be summed up as: Who watches the watchers? If you rely on Prometheus for your monitoring, and your monitoring fails, how will you know?
The answer is a concept known as metamonitoring. At Grafana Labs, a handful of geographically distributed metamonitoring Prometheus servers monitor all other Prometheus servers and each other cross-cluster, while their alerting chain is secured by a dead-man’s-switch-like mechanism.
This article explains how we came up with this solution.
As a leader in the observability space, we already had a fairly solid setup. For each Kubernetes cluster, we have a Prometheus HA pair set up, and we run a global Alertmanager cluster with geographically separated instances in the E.U. and the U.S. The Prometheus HA pair discovers and scrapes all services in-cluster. The services are found through the built-in Kubernetes service discovery. As such, the Prometheus instances will also scrape each other and any Alertmanager in the cluster. Alerts go from any Prometheus instance to all Alertmanager instances, then they are globally deduplicated by the Alertmanager cluster and finally sent to PagerDuty for further escalation.
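To make the alert flow concrete, here is a sketch of the alerting section of such a per-cluster Prometheus configuration. The hostnames and the use of static_configs are placeholders rather than our actual setup; the point is simply that every Prometheus sends its alerts to every Alertmanager instance.

# Sketch: alerting section of a per-cluster Prometheus config.
# Hostnames and the use of static_configs are placeholders.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-eu.example.internal:9093
            - alertmanager-us.example.internal:9093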
Pretty solid, right? We can afford to lose a few elements in this setup. For example, if one instance in a Prometheus HA pair dies, we get notified. Or, if an Alertmanager instance isn’t sending out alerts, another one still notifies us.
But what if disaster strikes and multiple instances die at once, or a whole cluster loses network connectivity to the outside world? Let’s see how we can handle these scenarios:
- Outage of both members of a Prometheus HA pair, or of a complete Kubernetes cluster (equivalent): Who watches the watchers?
- Outage of the global Alertmanager cluster: Who will sound the horn if the watchers see something?
- Outage of PagerDuty: What if the horn is gone?
Simply put, if anything in the Prometheus → Alertmanager → PagerDuty chain breaks completely, we won’t get notified. Such global outages are rare, but they do happen. When they do, it is often because of a configuration rollout: since we want a homogeneous Prometheus configuration, changes are generally rolled out to all instances globally. And as we move toward more continuous deployment, a bad rollout is even more likely to go unnoticed than with manual rollouts.
Prometheus is usually run inside the cluster it monitors, but to make cross-cluster service discovery work for our metamonitoring Prometheis, we had to overcome a few hurdles.
Networking
At Grafana Labs, we have a global cross-cluster network setup. This allows us to let services from different clusters talk to each other. The setup is a combination of routes-based legacy networks and VPCs at Google, and a VPN tunnel to our network on Azure. Together they form a global network fabric.
One of the main reasons for this network is our global Alertmanager setup. It enables gossiping between the Alertmanagers before alerts are escalated to PagerDuty, and it allows all Prometheis to send alerts to those instances. Building on this, we simply allow our metamonitoring Prometheis to discover and scrape all Prometheis in all interconnected clusters, which are generally served on port 9090.
Authentication
For an out-of-cluster Prometheus to use Kubernetes service discovery, it needs to talk to the Kubernetes API, which requires authentication. Prometheus supports two methods for authenticating to a Kubernetes cluster: basic auth or a bearer token. Because basic auth has a wider attack surface, GKE has disabled it since version 1.12, leaving us with bearer token authentication for an out-of-cluster Prometheus. A bearer token is created by the token controller for a Kubernetes service account.
We solved this hurdle with Vault. We use Vault to manage secrets and it’s already available cross-cluster, so it’s the perfect candidate. We create a service account with cluster-wide permissions and deploy a small process to ship the token to Vault.
The required cluster-wide permissions:
- get/list/watch Pods (for Prometheus service discovery)
- TokenReview and SubjectAccessReview capabilities (for Vault authentication)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metamonitoring-sd-prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
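On its own, the ClusterRole does nothing; it has to be bound to the service account whose token we ship to Vault. A minimal sketch of that ServiceAccount and ClusterRoleBinding, with names matching the examples in this post, could look like this:

# Sketch: the service account whose token is shipped to Vault, bound to the
# ClusterRole above. Names and namespace match the examples in this post.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metamonitoring-sd-prometheus
  namespace: metamonitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metamonitoring-sd-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metamonitoring-sd-prometheus
subjects:
  - kind: ServiceAccount
    name: metamonitoring-sd-prometheus
    namespace: metamonitoring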
We also need a Vault policy to limit the service account to a write-only path. Here is a Terraform sample that does the trick for us. (Note the path, which is specific to our setup with Pentagon.)
## path "secret/data/<namespace>/<destination_cluster_name>/*"
resource "vault_policy" "metamonitoring" {
name = "metamonitoring"
policy = <<EOT
path "secret/data/metamonitoring/dev-us-central1/*"{
policy = "write"
}
EOT
}
resource "vault_kubernetes_auth_backend_role" "metamonitoring-dev-us-central1" {
backend = vault_auth_backend.dev-us-central1-k8s.path
role_name = "metamonitoring"
bound_service_account_names = ["metamonitoring-sd-prometheus"]
bound_service_account_namespaces = ["metamonitoring"]
token_ttl = 3600
token_policies = ["metamonitoring"]
}
With this in place, we can now write the service account token to Vault. We repeat this for each cluster.
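For illustration, the "small process" boils down to a loop around the Vault CLI: authenticate with the cluster’s Kubernetes auth backend, then write the mounted service account token and CA certificate to the agreed-upon path. The sketch below assumes a CronJob running the hashicorp/vault image; the image, schedule, Vault address, auth mount, and secret path are illustrative (following the per-cluster naming shown earlier), not our exact implementation.

# Sketch of the process that ships a cluster's service account token to
# Vault. Image, schedule, Vault address, auth mount, and secret path are
# illustrative; this example would run in dev-us-central2.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ship-sa-token-to-vault
  namespace: metamonitoring
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: metamonitoring-sd-prometheus
          restartPolicy: OnFailure
          containers:
            - name: ship-token
              image: hashicorp/vault:1.15
              env:
                - name: VAULT_ADDR
                  value: https://vault.example.internal  # placeholder
              command:
                - /bin/sh
                - -ec
                - |
                  SA=/var/run/secrets/kubernetes.io/serviceaccount
                  # Log in via this cluster's Kubernetes auth backend.
                  VAULT_TOKEN=$(vault write -field=token \
                    auth/dev-us-central2-k8s/login \
                    role=metamonitoring jwt="$(cat $SA/token)")
                  export VAULT_TOKEN
                  # Write the credentials where the metamonitoring cluster's
                  # Pentagon mapping expects them (path layout assumed to
                  # follow the Vault policy shown earlier).
                  vault kv put \
                    secret/metamonitoring/dev-us-central1/dev-us-central2 \
                    token=@$SA/token ca.crt=@$SA/ca.crt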
Putting it all together
Now that all our credentials are in one place (Vault) and the metamonitoring Prometheus can reach all other Prometheis, we need to pull it all together. Since clusters are rarely added or removed, we opted for a static list of clusters, leaving dynamic cluster discovery as a possible future improvement.
To pull the secrets from Vault, we use our modified fork of Pentagon, which maps secrets from Vault paths to Kubernetes secrets in the namespace it runs in. Based on the static list of clusters, we map each service account token from its respective Vault path to a Kubernetes secret. These secrets are then mounted into the Prometheus container for consumption.
{
  _config+:: {
    namespace: 'metamonitoring',
    cluster_name: 'dev-us-central1',
    clusters: {
      'dev-us-central2': { server: 'https://1.1.1.2' },
      'dev-us-central3': { server: 'https://1.1.1.3' },
    },
    pentagon_mappings+:: [
      self.pentagonKVMapping(
        'secret/data/%(namespace)s/%(cluster_name)s/%(cluster)s' % $._config { cluster: c },
        'metamonitoring-%s' % c
      )
      for c in std.objectFields($._config.clusters)
    ],
  },
}
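Each mapping results in a Kubernetes Secret in the metamonitoring namespace, which is then mounted into the Prometheus container under /secrets/<cluster>/. Roughly, and assuming key names that match the file names the scrape config below expects, a generated secret looks like this:

# Sketch of the Secret a Pentagon mapping produces for one cluster.
# Key names are assumed to match the files referenced by the scrape config
# below (/secrets/dev-us-central2/token and /secrets/dev-us-central2/ca.crt).
apiVersion: v1
kind: Secret
metadata:
  name: metamonitoring-dev-us-central2
  namespace: metamonitoring
type: Opaque
stringData:
  token: <service account token>
  ca.crt: <cluster CA certificate>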
Since we have all secrets available in the Kubernetes cluster now, we can deploy Prometheus. We generate the scrape configs based on the static list of clusters and refer to the secrets mounted in the container.
{
  prometheus_config+:: {
    scrape_configs: [
      {
        job_name: 'prometheus-%s' % c,
        kubernetes_sd_configs: [
          {
            role: 'pod',
            api_server: $._config.clusters[c].server,
            // Credentials for talking to the remote cluster's Kubernetes API.
            bearer_token_file: '/secrets/%s/token' % c,
            tls_config: {
              ca_file: '/secrets/%s/ca.crt' % c,
              insecure_skip_verify: false,
            },
          },
        ],
        // The same credentials are used when scraping the discovered pods.
        bearer_token_file: '/secrets/%s/token' % c,
        tls_config: {
          ca_file: '/secrets/%s/ca.crt' % c,
          insecure_skip_verify: false,
        },
        relabel_configs:
          // Shared relabel rules defined elsewhere (not shown), plus a label
          // identifying which cluster the target lives in.
          config.relabel_configs
          + [
            {
              target_label: 'cluster',
              replacement: c,
            },
          ],
      }
      for c in std.objectFields($._config.clusters)
    ],
  },
}
Now we have our metamonitoring Prometheus setup, and it’s able to discover and scrape all Prometheis in all clusters.
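Alerting on the scraped Prometheis is then ordinary Prometheus work. As an illustration (these are not our production rules), a minimal metamonitoring rule could fire when any remote Prometheus has been unreachable for a few minutes:

# Illustrative rule, not our production rule set: fire when a scraped
# Prometheus has been unreachable for five minutes. The job names and the
# cluster label come from the scrape config above.
groups:
  - name: metamonitoring-targets
    rules:
      - alert: MetamonitoredPrometheusDown
        expr: up{job=~"prometheus-.*"} == 0
        for: 5m
        annotations:
          message: Prometheus {{ $labels.instance }} in cluster {{ $labels.cluster }} is down.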
Sounding the horn if the chain is broken
We also want to verify the integrity of our chain, from Prometheus to Alertmanager to PagerDuty. If anything in this chain is not passing along our alerts, we want to get notified. For this we leveraged a service called Dead Man’s Snitch, a dead-man’s-switch-like mechanism for monitoring scheduled tasks.
We implemented a Prometheus alert rule that is always firing through our chain:
groups:
  - name: metamonitoring
    rules:
      - alert: AlwaysFiringAlert
        annotations:
          message: Always firing alert for Prometheus metamonitoring.
        expr: vector(1)
        for: 1m
From Alertmanager we route this alert to PagerDuty every 15 minutes, essentially creating a heartbeat:
route:
  receiver: pd-AlwaysFiring-Heartbeat
  repeat_interval: 15m
  continue: false
  match_re:
    alertname: AlwaysFiringAlert
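The pd-AlwaysFiring-Heartbeat receiver is a standard PagerDuty receiver pointed at a dedicated heartbeat service, sketched here with a placeholder integration key:

# Sketch of the matching receiver; the integration key is a placeholder.
receivers:
  - name: pd-AlwaysFiring-Heartbeat
    pagerduty_configs:
      - service_key: <pagerduty-heartbeat-integration-key>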
PagerDuty is configured to forward this alert to Dead Man’s Snitch and autoresolve it after 10 minutes, ensuring that the next repeat interval from Alertmanager triggers a new alert. In turn, Dead Man’s Snitch is configured to expect at least one alert every hour.
Finally, the heartbeat chain
This chain is now set up to be completely redundant. The metamonitoring Prometheis run as HA pairs and are geographically spread, and the same goes for our global Alertmanager setup. PagerDuty itself employs an active-active architecture (see their FAQ).
And with this configuration, we have addressed the aforementioned scenarios:
Outage of both members of a Prometheus HA pair or a complete Kubernetes cluster (equivalent)
Who watches the watchers? The metamonitoring Prometheus servers are spread geographically and monitor each other. If one cluster goes down, the other will fire an alert. If all Prometheis or all clusters are down, the AlwaysFiringAlert will stop firing, Alertmanager will stop sending the heartbeat to PagerDuty, and Dead Man’s Snitch will sound the horn.
Outage of the global Alertmanager cluster
Who will sound the horn if the watchers see something? In case the global Alertmanager cluster is down, Alertmanager will stop sending the heartbeat to PagerDuty and Dead Man’s Snitch will sound the horn.
Outage of PagerDuty
What if the horn is gone? If Dead Man’s Snitch does not receive a message from PagerDuty, Dead Man’s Snitch will sound the horn.