How a production outage in Grafana Cloud's Hosted Prometheus service was caused by a bad etcd client setup
On March 16, Grafana Cloud’s Hosted Prometheus service experienced a ~12-minute partial outage on the write path in our London region, resulting in delayed data storage but no data loss. To the customers who were affected by the incident, I apologize. It’s our job to provide you with the monitoring tools you need, and when they are not available, we make your life harder. We take this outage very seriously. This blog post explains what happened, how we responded to it, and what we’re doing to ensure it doesn’t happen again.
Background
The Grafana Cloud Hosted Prometheus service is based on Cortex, a CNCF project to build a horizontally scalable, highly available, multi-tenant Prometheus service.
In January 2019, we added the ability to send metrics from pairs of HA Prometheus replicas to the same Hosted Prometheus instance and have those metrics be deduplicated on ingestion. To enable this, we track the identity of the replica from which we have “elected” to accept writes; if we don’t see metrics from this replica for a period of 15 seconds, we accept writes from the replica we receive a write from next, thus changing the “elected” replica. To track these identities we use etcd, a raft-based distributed key-value store that is also used by Kubernetes to store runtime and configuration data.
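To make that election concrete, here is a simplified sketch of the per-write decision, assuming the standard time package is imported; it is illustrative only, not the actual Cortex code, and the election type, accept function, and the read-modify-write against etcd around them are assumptions made for the example.

// Simplified sketch of the election decision made for each incoming write.
// In Cortex the election record lives in etcd and is updated with a
// compare-and-swap, which is omitted here.
type election struct {
	Replica  string    // replica we have elected to accept writes from
	LastSeen time.Time // when we last saw a write from that replica
}

const failoverTimeout = 15 * time.Second

// accept reports whether a write from `replica` should be kept, and returns
// the (possibly updated) election record to store back.
func accept(now time.Time, cur *election, replica string) (bool, election) {
	if cur == nil || replica == cur.Replica || now.Sub(cur.LastSeen) > failoverTimeout {
		// First write, a write from the elected replica, or the elected
		// replica has been silent for too long: accept and (re-)elect.
		return true, election{Replica: replica, LastSeen: now}
	}
	// A write from the other replica while the elected one is still
	// healthy: drop it, so only one copy of the HA pair's data is stored.
	return false, *cur
}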
The incident
At 11:19:00 UTC on March 16, the node running the etcd leader was abruptly terminated by a scheduled Kubernetes version upgrade. By design, another etcd replica automatically became the leader ~10 seconds later.
About 60 seconds later, two replicas of the Cortex distributor (the write-path service responsible for deduplicating samples) started logging “context deadline exceeded” errors when trying to fetch keys from our etcd cluster, and writes from some customers routed through those replicas started failing.
The issue was caused by a stuck TCP connection from those distributors to the old etcd leader: when the underlying node dies abruptly, the connection is never gracefully closed, and the client keeps trying to use it.
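On the client side, the failing calls looked roughly like the sketch below; it is illustrative rather than the actual Cortex code, and the lookupElected name, key name, and 2-second timeout are made up for the example. With a half-open connection and no keepalive probes, the request is written into a connection nobody is reading from, so the only thing that ends the call is the context deadline.

// lookupElected fetches the election record for a tenant, bounded by a
// per-request timeout. cli is an etcd clientv3 client. When the underlying
// TCP connection is stuck, nothing ever comes back, so the call returns only
// when the deadline fires, with err = "context deadline exceeded".
func lookupElected(cli *clientv3.Client, userID string) (*clientv3.GetResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return cli.Get(ctx, "ha-tracker/"+userID)
}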
Detection and resolution
We use SLO-based alerting, which paged our on-call engineer at 11:31, a full 11 minutes after the problem started. Because the problem manifested as only a ~20% error rate, it would have taken ~18 hours for our monthly SLA to actually be breached.
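For illustration, assuming a 99.5% monthly availability target (a figure chosen here only to make the arithmetic work out), the error budget for a 30-day month is 0.5% of requests; a sustained 20% error rate burns through that budget in roughly 0.005 / 0.20 × 30 days ≈ 18 hours, which is why alerting on error-budget burn rate pages in minutes while alerting on the SLA itself would not.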
The affected distributors were restarted at 11:32, and the errors stopped, ending the incident.
Takeaway
It is important that we learn from this outage and put in place steps to ensure it does not happen again.
The total length of the incident was 12 minutes. Because Prometheus’s remote_write treats the timeout error as recoverable, the failed writes were retried by the sending side and succeeded when they hit other distributor replicas. This means the incident resulted in no data loss.
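That retry behaviour follows the general pattern sketched below; this is a generic sketch, not Prometheus’s actual remote_write code, and sendWithRetry, send, and isRecoverable are illustrative names. A recoverable failure is retried with capped exponential backoff rather than being dropped.

// Generic retry-on-recoverable-error loop with capped exponential backoff.
func sendWithRetry(ctx context.Context, send func(context.Context) error, isRecoverable func(error) bool) error {
	backoff := 30 * time.Millisecond
	for {
		err := send(ctx)
		if err == nil || !isRecoverable(err) {
			return err // delivered, or failed with an unrecoverable error
		}
		select {
		case <-time.After(backoff): // back off, then retry the same batch
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
	}
}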
The etcd client supports gRPC keepalive probes, which were not correctly configured in Cortex. We reproduced the incident in our dev environment with iptables rules that dropped packets between a distributor instance and the etcd instance it was connected to, and enabling the keepalive probes in the etcd client was shown to prevent the problem:
 cli, err := clientv3.New(clientv3.Config{
 	Endpoints:   cfg.Endpoints,
 	DialTimeout: cfg.DialTimeout,
+	DialKeepAliveTime:    10 * time.Second,
+	DialKeepAliveTimeout: 2 * cfg.DialTimeout,
+	PermitWithoutStream:  true,
 })
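For context, a self-contained version of the fixed configuration could look like the following sketch; the endpoint values and the newClient wrapper are illustrative, and the import path may differ depending on the etcd client version in use. DialKeepAliveTime controls how often the client pings the server, DialKeepAliveTimeout how long it waits for a response before closing the connection and reconnecting, and PermitWithoutStream allows the probes to be sent even when no requests are in flight.

package main

import (
	"time"

	"go.etcd.io/etcd/clientv3"
)

// newClient builds an etcd v3 client with gRPC keepalive probes enabled,
// so that a stuck TCP connection is detected and replaced automatically.
func newClient(endpoints []string, dialTimeout time.Duration) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   endpoints, // e.g. []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"}
		DialTimeout: dialTimeout,

		// Ping the server every 10 seconds to check the transport is alive,
		// and give up on the connection if no reply arrives in time.
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 2 * dialTimeout,

		// Send the probes even when there are no in-flight requests.
		PermitWithoutStream: true,
	})
}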
We have also followed up with our infrastructure team to work out why the machine was terminated so abruptly. Because the etcd Pods are managed by the etcd operator, our upgrade scripts didn’t know how to gracefully reschedule them; what’s more, the API call the scripts used to terminate an instance only gave it around 90 seconds to shut down cleanly. We are working to ensure our Kubernetes upgrade process gracefully terminates Pods and machines going forward.
This outage is not all bad news: we relied heavily on Grafana Loki, our new log aggregation system, to dig into logs quickly during the outage and recovery, and again for the postmortem. We wouldn’t have been able to do that work as quickly and precisely without Loki, which meant a shorter time to recovery for our users and less mental overhead for us during a stressful phase.