How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage -- and Taught Us Some Important Lessons
At Grafana Labs, we run our hosted metrics Graphite service in Kubernetes clusters across several locations, each running one or more Metrictank clusters that process millions of data points per second. Some of them use Google Bigtable as their backend store, but the majority are backed by Cassandra clusters, which we also run on Kubernetes. This makes Cassandra a mission-critical piece of infrastructure for our Grafana Cloud Graphite service.
On Dec. 7-8, for periods ranging from 6 to 23 hours, Grafana Cloud Graphite customers in our US-East cluster could not execute queries that required data that was not present in our in-memory caches and thus had to be loaded from Cassandra. (By design, the last few hours of data are always available in memory, as this is the most frequently read data.) At its peak, 20% of queries were failing.
In this blog post, we will provide a postmortem of the outage and share what we learned and what we’ve done as a result.
The Problem
The outage was triggered by a Google Cloud Platform incident: GCP solid-state disk (SSD) persistent volumes were operating with such degraded performance that many I/O operations failed for an extended period of time.
We received an alert that some Metrictank instances could not drain their write queue, which indicated that they were unable to write to Cassandra. At the time, Google had not yet published an incident report, but one of our team members noticed that his Discord chat server was down. That chat server was also hosted on GCP, so we already suspected that the issue was related to GCP. About 90 minutes after the incident started, Google published an incident report confirming our suspicion.
With the SSD persistent volumes being so severely impacted by the outage, compute nodes were experiencing kernel deadlocks and mild filesystem corruption. As a result of these issues, monitoring processes running on the nodes automatically triggered reboots. Due to the Google outage, when nodes started up again, some were unable to attach the disk volumes needed by our Cassandra pods – leading to further automated node reboots. As a result, write operations to our Cassandra cluster either failed completely or took over 2 seconds to execute. Under these conditions, the Cassandra cluster failed to perform its task. We tried scaling up the number of compute nodes in our cluster in the hope that new nodes would have a better chance at being able to attach disks, but even this did not work.
We utilize Kubernetes StatefulSets for running our Cassandra clusters, and this StatefulSet was configured with a Pod Management Policy of “OrderedReady.” This policy ensures that pods in the StatefulSet are started up sequentially rather than all at once. During the storage outage, all of our Cassandra pods had failed, so for the Cassandra cluster to come back online, each pod in the cluster needed to be started in sequence, with Kubernetes waiting for a pod to become healthy before starting the next one. Google had communicated that most storage volumes should be working; however, to make a bad day worse, the second Cassandra pod in the sequence was unable to start because its storage volume was still unavailable. Because of the Pod Management Policy, no other Cassandra pods could be started until this one problematic pod had started.
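For readers less familiar with StatefulSets, the sketch below shows roughly what this looks like when expressed with the Kubernetes Go API types. The StatefulSet name, service name, and replica count are illustrative, not our actual configuration, and the pod template and volume claims are omitted:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Abbreviated StatefulSet definition; names and sizes are illustrative.
	sts := appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "cassandra"},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: "cassandra",
			Replicas:    int32Ptr(12),

			// "OrderedReady" (the default) starts pods one at a time and
			// waits for each to become Ready before starting the next. If a
			// pod can never become Ready (for example because its volume
			// cannot be attached), every pod after it stays down.
			PodManagementPolicy: appsv1.OrderedReadyPodManagement,

			// "Parallel" starts and deletes all pods at once instead:
			// PodManagementPolicy: appsv1.ParallelPodManagement,
		},
	}
	fmt.Println(sts.Name, sts.Spec.PodManagementPolicy)
}
```

The alternative policy, “Parallel,” starts and deletes all pods at once, so a single pod with a stuck volume does not block the rest of the cluster from coming up.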
Resolution
To recover the Cassandra cluster once the GCP outage was resolved, we had to delete and then recreate the Kubernetes StatefulSet with the Pod Management Policy changed to “Parallel,” since this setting cannot be changed in place on an existing StatefulSet. This allowed all except one of our Cassandra pods to start up, which was enough for the Cassandra cluster to return to normal operation.
The Next Problem
The problems didn’t stop there. After the Cassandra cluster was restored to an operational state, none of the Metrictank instances could reconnect to the new cluster. As long as Metrictank is connected to at least one node in the Cassandra cluster, it obtains the IP addresses of any re-created nodes from the nodes it is already connected to. Since all of the Cassandra nodes had to be re-created and were assigned new IP addresses, Metrictank completely lost connectivity to Cassandra. It had the right service hostname, but it was simply not re-resolving it to the new IPs, because this failure mode had never been considered.
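Metrictank talks to Cassandra through the gocql driver. Below is a minimal sketch of the kind of setup involved (the service hostname and keyspace are illustrative, not our actual configuration): the contact point is resolved when the session is created, and afterwards the driver learns about topology changes from the nodes it is already connected to.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// The contact point is a Kubernetes service hostname (illustrative).
	// gocql resolves it to IP addresses when the session is created; after
	// that, it learns about cluster topology from the nodes it is already
	// connected to. If every node is replaced with a new IP at the same
	// time, an existing session has nothing left to learn from.
	cluster := gocql.NewCluster("cassandra.metrictank.svc.cluster.local")
	cluster.Keyspace = "metrictank" // illustrative keyspace name

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("failed to create cassandra session: %v", err)
	}
	defer session.Close()
}
```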
Resolution (Step 2)
In order to restore connectivity, we had to restart all Metrictank instances – which took more than 20 hours. A common operational procedure during extended downtime is to increase the retention on our Kafka topics to avoid data loss. This means that when the Metrictank instances restart, they have a higher than usual amount of data to consume. We also temporarily added more nodes to the cluster to ensure that the restarts did not get slowed down by CPU starvation. While the Metrictank instances were replaying their write-ahead log from Kafka – and consuming significantly more data than usual – some of them encountered another bug in Metrictank that added to the total recovery time.
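For illustration, raising the retention on a topic can be done through Kafka’s admin API. Here is a hedged sketch using the sarama Go client; the broker address, topic name, and retention value are assumptions, and this is not necessarily how we perform the change:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0 // the admin API needs a broker protocol version set

	admin, err := sarama.NewClusterAdmin([]string{"kafka-0:9092"}, cfg)
	if err != nil {
		log.Fatalf("failed to create cluster admin: %v", err)
	}
	defer admin.Close()

	// Temporarily raise retention.ms on the ingest topic so data is kept
	// until the Metrictank instances can consume it again.
	retention := "172800000" // 48 hours in milliseconds; value is illustrative
	err = admin.AlterConfig(sarama.TopicResource, "mdm", map[string]*string{
		"retention.ms": &retention,
	}, false)
	if err != nil {
		log.Fatalf("failed to alter topic config: %v", err)
	}
}
```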
That bug led to a deadlock when Metrictank’s queue of data waiting to be written to Cassandra was full and the ingester attempted to add additional data for a series that already had items in the queue. By design, the ingester blocks when the write queue is full; while it is blocked, new metrics are simply spooled in Kafka. However, the code path where the block occurs also holds an internal lock that prevents race conditions when concurrently reading and writing state information about the series. Once data for a series has been successfully persisted to Cassandra, a callback is made to update some of this state information, an action that requires the same internal lock. This led to a deadlock in which the ingester couldn’t release its lock until the write queue was drained, but the write queue couldn’t drain until the ingester released the lock. It was a longstanding bug that we had not addressed because we thought fixing it needed a significant code refactor, and we could normally work around it by simply making the write queue larger than needed. Over the course of many years, we saw this happen maybe once or twice, and it was easily resolved by deploying a config tweak.
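The following is a heavily simplified sketch of the pattern (not Metrictank’s actual code; all names are illustrative), showing how a bounded write queue plus a shared per-series lock produces the circular wait described above:

```go
package main

import (
	"sync"
	"time"
)

// series holds per-series state. In this simplified model, the same mutex
// guards both the ingest path and the post-persist callback.
type series struct {
	mu            sync.Mutex
	lastPersisted int64
}

type chunk struct {
	s  *series
	ts int64
}

func main() {
	writeQueue := make(chan chunk, 1) // a tiny queue, so it fills up immediately
	s := &series{}

	// Consumer: drains the write queue, "persists" each chunk (simulated by a
	// sleep), then calls back into the series to record what was written.
	go func() {
		for c := range writeQueue {
			time.Sleep(50 * time.Millisecond) // stand-in for the Cassandra write
			c.s.mu.Lock()                     // needs the lock the ingester holds
			c.s.lastPersisted = c.ts
			c.s.mu.Unlock()
		}
	}()

	// Ingester: holds the series lock while pushing chunks onto the queue.
	// Once the queue is full, the send blocks while the lock is still held;
	// the consumer's callback then blocks on the lock, so the queue never
	// drains. Running this ends in Go's "all goroutines are asleep" error.
	for ts := int64(0); ts < 10; ts++ {
		s.mu.Lock()
		writeQueue <- chunk{s: s, ts: ts}
		s.mu.Unlock()
	}
}
```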
The fix was never a high priority, but when we hit the issue at the least opportune time, we decided to address it once and for all. We realized that we could fix it cleanly without requiring a significant refactor, by combining the use of locks with atomics.
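A minimal sketch of the idea behind the fix, using the same illustrative names as above: the persist callback updates the per-series bookkeeping with sync/atomic instead of taking the series lock, so a blocked ingester can no longer stop the write queue from draining.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// series keeps the persist bookkeeping in a field that is only accessed via
// sync/atomic, so the persist callback no longer needs the series lock.
type series struct {
	lastPersisted int64
}

// onPersisted is called by the write-queue consumer after a chunk has been
// written to Cassandra. It can run even while the ingester holds the series
// lock, which breaks the circular wait.
func onPersisted(s *series, ts int64) {
	atomic.StoreInt64(&s.lastPersisted, ts)
}

func lastPersisted(s *series) int64 {
	return atomic.LoadInt64(&s.lastPersisted)
}

func main() {
	s := &series{}
	onPersisted(s, 42)
	fmt.Println(lastPersisted(s))
}
```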
Lessons Learned and Positive Takeaways
It was a good decision to use Kafka as a WAL (write-ahead log) to quickly ingest and store the received data before processing it: after the Cassandra cluster was recovered, we were still able to consume all of the data from Kafka and process and store it in Cassandra. At the same time, we were lucky that our Kafka cluster uses non-SSD disks; otherwise it might have been affected by the same issues that broke Cassandra.
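The pattern itself is simple: append incoming data to Kafka first, and let the processing pipeline consume it at its own pace. Here is a generic sketch using the sarama Go client; the broker address, topic name, and encoding are assumptions, and this is not Metrictank’s actual ingestion code:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll // only ack once the log entry is durable
	cfg.Producer.Return.Successes = true          // required by the sync producer

	producer, err := sarama.NewSyncProducer([]string{"kafka-0:9092"}, cfg)
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}
	defer producer.Close()

	// Append the raw, encoded data point to the ingest topic first; the
	// consumers process it (and persist it to Cassandra) later, so a backend
	// outage only delays processing instead of losing data.
	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "mdm", // illustrative topic name
		Value: sarama.StringEncoder("encoded-metric-point"),
	})
	if err != nil {
		log.Fatalf("failed to append to the write-ahead log: %v", err)
	}
}
```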
Our globally distributed team worked together effectively to cover all time zones during an incident that took 23 hours. Involved in the handling of this incident were three people located in Paraguay, Albania, and Australia.
This incident prompted us to fix two old bugs in the Metrictank code. If we were to get into a similar situation again, the recovery should take less than 3 hours, instead of 23 hours.
We fixed the problem of Metrictank not being able to reconnect to Cassandra when all Cassandra IPs have changed: we now re-instantiate the gocql connection whenever we detect that it is dead.
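A simplified sketch of that reconnect-on-dead approach (not the actual Metrictank code; the hostname, keyspace, probe query, and check interval are illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// newSession re-resolves the Cassandra service hostname and creates a fresh
// gocql session. Hostname and keyspace are illustrative.
func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("cassandra.metrictank.svc.cluster.local")
	cluster.Keyspace = "metrictank"
	return cluster.CreateSession()
}

func main() {
	session, err := newSession()
	if err != nil {
		log.Fatalf("initial connect failed: %v", err)
	}

	// Periodically probe the session. If the probe fails, assume the session
	// is dead, throw it away, and create a new one, which re-resolves the
	// service hostname to the current pod IPs.
	for range time.Tick(30 * time.Second) {
		if session != nil {
			if err := session.Query("SELECT release_version FROM system.local").Exec(); err == nil {
				continue // session still healthy
			}
			log.Print("cassandra session looks dead, reconnecting")
			session.Close()
			session = nil
		}
		s, err := newSession()
		if err != nil {
			log.Printf("reconnect failed, will retry: %v", err)
			continue
		}
		session = s
		log.Print("reconnected to cassandra")
	}
}
```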
The second problem was the deadlock that could happen when Metrictank’s write queue completely filled up while new data was still being ingested. We were able to fix this by using atomics to update one internal data structure instead of protecting it with a lock. Sometimes looking at an old problem with fresh eyes reveals that the solution can be much simpler than previously thought.
Our monitoring was effective at detecting that the write instances of Metrictank were unable to flush their write queue. Without this alert, detecting the problem could have taken much longer, as the error rates increased very slowly because Metrictank kept serving the majority of queries from memory. We spend a lot of time thinking about best practices for monitoring Metrictank clusters and curate a list of metrics that you will typically want to alert on. Check out the operations guide in the Metrictank documentation for useful recommendations.
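As a generic illustration of the kind of signal involved, the sketch below exposes a write-queue depth gauge using the Prometheus Go client. The metric name and instrumentation library are assumptions for the example, not how Metrictank actually reports its metrics:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// writeQueueItems tracks how many chunks are waiting to be written to the
// backend store; a queue that stays full is the kind of signal described above.
var writeQueueItems = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "write_queue_items", // illustrative metric name
	Help: "Number of chunks waiting in the backend write queue.",
})

func main() {
	prometheus.MustRegister(writeQueueItems)

	// The write path would call writeQueueItems.Set(float64(len(queue)))
	// whenever the queue length changes; here we just initialize it.
	writeQueueItems.Set(0)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```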
Luckily, this happened during the weekend, when our customers are typically not working and their workload is reduced to alerting queries (which mostly query cached data).
Conclusion
We apologize for this incident. While the impact was relatively minor, a few customers were still noticeably affected, and for a much longer time than is the norm. We strive to do better and are working on making our systems more reliable. For future status updates, please see our status page.