Incident management that actually makes sense: SLOs, error budgets, and blameless reviews
Incident response is about more than just putting out fires. Yes, there are definitely those all-hands-on-deck moments when an incident arises. But you also need a structure in place ahead of time to provide the right information to the right people when they need it — while still avoiding alert fatigue. Plus, you need a culture that actually encourages people to continually improve what’s in place without just pointing fingers.
That’s the focus of the latest episode of “Grafana’s Big Tent” podcast, hosted by Tom Wilkie, Grafana Labs CTO, and Mat Ryer, Grafana Labs Engineering Director. The pair is joined by Devin Cheevers, Grafana Labs Product Director, and Alexander Koehler, Senior SRE at Prezi, a video and visual communications software company.
Alex’s team is responsible for platform provisioning and on-call coverage, and it turns out the way they operate has a lot in common with how we handle incident response and management at Grafana Labs. You can read some of the highlights below, but listen to the full episode to hear the quartet discuss things in more detail. Plus, you’ll get to hear each of them describe the worst incident they ever caused, and you’ll get random musings about who is actually an AI bot (Devin: definitely not; Tom: maybe?), the perils of gravity, and why this podcast is the MCU of observability.
Note: The following are highlights from episode 2, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
Common ground on centralization
Alex Koehler: We are big believers in “you build it, you run it.” Every team has its own on-call rotation and maintains the services they write. There’s also an additional group, which we call SRE, that’s more like an infrastructure or platform team, and we are also available 24 hours a day, seven days a week to handle AWS issues or to help the developers debug things.
Devin Cheevers: What has the history of that team been at Prezi?
Alex: The team was originally created to provide a platform for the developers, a golden path to production, so to speak. They provision resources and deploy in the same way, so that every service looks at least 80% the same. And the platform team is there to manage the underlying infrastructure, to make sure the deployment engine is working, and to keep an eye on cost control. And we provide our developers with central tooling: logging, an alerting stack, a monitoring stack, and access management.
Tom Wilkie: This sounds very familiar to Grafana Labs. We have a central platform team with a very similar set of responsibilities — looking after our infrastructure, our costs, and our observability stack as well. So it’s good to meet someone else who follows the same pattern.
SLOs, error budgets, and flexibility for failure
Mat Ryer: What’s the appetite for risk then at Prezi?
Alex: Every team has an error budget and is allowed to act inside this budget, and deploy small batches, and deploy fast and early. But when it comes to bigger changes on an infrastructure level, we are a little bit conservative.
For example, if you update a Kubernetes cluster on AWS, there is only one way, and it’s: Update the version and be happy or not. But on the application level, we are quite flexible and fast. So it’s a mixed approach.
Mat: For some things you do need to be more careful, but it is nice as an engineer to innovate and do things. And error budgets are great; we do have an episode on SLOs in season one, for anyone interested. We dig into what error budgets are, and how to use them. It’s great.
Tom: Do you think engineers think about the error budgets on a day-to-day basis? Or do you think it’s more of a release management or engineering leadership thing?
Alex: It’s a fail-safe mechanism. I don’t think the application developers think about error budgets the whole day. But it brings them back to reality when they mess up.
Mat: Just knowing that there’s an error budget, being in a culture where there is an error budget I think is enough, because it tells you, “Look, it’s OK. Do things.”
Tom: Failure is tolerated.
Mat: Yeah, exactly.
Alert fatigue and a lack of proper context
Alex: We have alerts on fast error-budget burn rate. So imagine you have one hour of error budget left, and you will burn it within half an hour: you get an alert. I think that’s also part of the fast feedback cycle for developers, to make them aware that when they do something, it has consequences.
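Note: To make the fast-burn idea concrete, here is a minimal sketch of the arithmetic behind that kind of alert. It is not Prezi’s actual rule set; the SLO target, error rates, and thresholds below are purely illustrative.

```python
# Minimal sketch of a fast-burn check (illustrative numbers, not Prezi's rules).
# "Burn rate" is how many times faster than the sustainable rate you are
# spending error budget; the remaining budget is measured in the hours it
# would last if errors arrived at exactly the allowed rate.

def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    allowed_error_ratio = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_ratio / allowed_error_ratio

def should_page(observed_error_ratio: float, slo_target: float,
                remaining_budget_hours: float,
                page_within_hours: float = 1.0) -> bool:
    rate = burn_rate(observed_error_ratio, slo_target)
    if rate <= 1.0:
        return False  # burning at or below the sustainable rate
    hours_until_gone = remaining_budget_hours / rate
    return hours_until_gone <= page_within_hours

# One hour of budget left and errors arriving at twice the allowed rate:
# the budget is gone in 30 minutes, so page someone.
print(should_page(observed_error_ratio=0.002, slo_target=0.999,
                  remaining_budget_hours=1.0))  # True
```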
Tom: Would you say that’s the primary source of your paging alerts at Prezi?
Alex: I don’t have any statistics, but we get a lot of alerts about error-budget burn rate.
Tom: One of the things I really love about SLOs is they’ve got a really good signal-to-noise ratio. If they do alert, something’s really wrong, and you really do have to go and look at it.
The downside of these kinds of alerts though is they really lack context. They’re gonna tell you “You’re about to break your SLO on this service,” but they’re not going to tell you why. So how do you tackle that problem?
Alex: That’s an issue for us as well. The current SLO setup was done by infrastructure engineers, and they are pretty service-focused. So we have alerts on every microservice we run, and those alerts aren’t user-facing. We get an alert, for example, that an authentication service is broken, but what does that mean for the customer? And is there an effect on our paying customers as well, or not?
Tom: We make this distinction internally between critical alerts that will page someone and warning alerts that basically just go to Slack. And I’m sure, like every company, you’ve got a Slack channel somewhere with thousands of warning level alerts going off, that you’ve trained all your engineers to ignore, basically. Often I find the context for that critical alert is going off somewhere in that warning Slack channel. But unfortunately, it’s very, very hard to find.
Alex: The Prezi culture is “all hands on deck” when a certain criticality is reached. So we distinguish between warning alerts and critical alerts. And that brings different parts of the company together to look at things, which is better than being alone and trying to decide, “Is this a critical incident or not?”
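Note: The critical-versus-warning split Tom and Alex describe is easy to picture as a small router. The sketch below is hypothetical: page() and post_to_slack() stand in for a real paging or chat integration, and it keeps a short buffer of recent warnings per service so a critical page arrives with some of the context Tom says usually gets lost in the warning channel.

```python
# Hypothetical alert router: critical alerts page a human, warnings go to a
# Slack-style channel, and recent warnings for the same service are attached
# to the page as context. page() and post_to_slack() are placeholders.
from collections import defaultdict, deque
from dataclasses import dataclass, field
import time

@dataclass
class Alert:
    service: str
    severity: str  # "critical" or "warning"
    summary: str
    fired_at: float = field(default_factory=time.time)

# Keep a short ring buffer of recent warnings per service.
recent_warnings = defaultdict(lambda: deque(maxlen=50))

def page(alert: Alert, context: list) -> None:
    print(f"PAGE on-call: {alert.service}: {alert.summary}")
    for warning in context:
        print(f"  context: {warning.summary}")

def post_to_slack(alert: Alert) -> None:
    print(f"#alerts-warning: {alert.service}: {alert.summary}")

def route(alert: Alert, context_window_s: float = 3600) -> None:
    if alert.severity == "warning":
        recent_warnings[alert.service].append(alert)
        post_to_slack(alert)
    elif alert.severity == "critical":
        cutoff = alert.fired_at - context_window_s
        context = [a for a in recent_warnings[alert.service] if a.fired_at >= cutoff]
        page(alert, context)

route(Alert("auth", "warning", "p99 latency above 800ms"))
route(Alert("auth", "critical", "SLO fast burn: error budget gone in 30 minutes"))
```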
Understanding the value of blameless post-incident reviews
Tom: With incidents and outages, the main thing is how you handle it afterwards: how you get the team together, how you honestly discuss what the problem was, and how you make sure you victimize and blame the person that caused it and fire them immediately. That’s really what the culture should be. Sorry, I guess the British sarcasm doesn’t come across on a podcast.
Alex: I was at an enterprise company earlier in my career, and I saw exactly what you just described. They had operations meetings every morning at 9, and the vice president was there, and you had to explain what happened during the night. And that was so frustrating, because every time something went wrong you had to go up there and explain yourself. That was not blameless, that’s for sure.
Tom: It’s so important to have that blameless post-mortem culture. Otherwise you’re never actually going to discover what went wrong, and you won’t discover it quickly, and you won’t be able to have that kind of open conversation about preventing it in the future. If you’re just pointing fingers, it’s horrible.
We have this enviable problem in some teams now in Grafana Labs, where it’s a bit like that scene from “Life of Brian” where everyone goes “I’m Brian. No, I’m Brian. I’m Brian, and so is my wife.” Everyone wants to take responsibility for the incident, because there’s this culture of like “No, this is how I did wrong, and how we can make it better in the future”, and it’s so healthy.
Alex: It also encourages people to actually change things and not wait for someone else to change them, or wait for a longer period. Right now you have to be quick and adapt to changing market situations, and release new features, and so on. If you have a culture of fear, you will fall behind pretty quickly.
Tom: Yes, exactly. A culture of fear just discourages people from making changes. And in today’s world, your ability to move quickly, change quickly, and accept change is really what determines whether you survive. So I guess it goes without saying, you have a similar blame-free post-mortem culture at Prezi.
Alex: Yeah. After every incident we have, we just come together and collect all the necessary information, the useful information, and conduct a post-mortem. We try to be blameless, and record what went well, where we have been lucky, where we could improve, and so on.
Tom: Mat, so you actually started the Grafana Incident tool internally at Grafana Labs, and what we’re talking about here was one of your motivations, wasn’t it?
Mat: Yeah, exactly. If you look at the tool today, you’ll see that we really sort of celebrate the people that are dealing with the incident. Forget blameless, we’re almost like celebrating the people that are getting stuck in and doing things, and finding the places where we can break it. And then if you have good post-incident reviews, you’re getting in there and following up.
Spreading the pain
Devin: So the blameless thing is great. I’ve been part of a number of companies that have that. But what about the kind of natural phase or rhythm that I’ve seen happen, which is: You release the project or product, you go a couple of quarters focused on feature work, and then you get to a place where you’re like “Listen, we need to focus on reliability a little bit more this month, or the next few months.” But how do you go back and represent that to the business? How do you go to the business and say, “Listen, we want to focus on reliability. Here’s the value.”
Tom: I can represent the business in this situation, right? Tell me why you need to give a higher focus to reliability. Do you have data that shows your SLO performance has been poor? Do you have data that shows your latencies are high, or that customer churn is high because the product experience suffers from poor reliability? We were talking about error budgets earlier. Are you seeing that you’re routinely breaching your error budget? How many incidents a day, a week, a month are you having? All of this kind of data can really help make the case…
I feel very deeply about a bottom-up DevOps culture where it should be up to the individual product teams, the individual engineering teams to decide what the most important priority is. And if they come to me and say “I want to focus on reliability,” of course I’m going to try and accommodate that.
Alex: As an engineer it’s not only about reliability, it’s also about maintenance. You have to update your libraries, you have to make sure you’re using the latest OpenSSL library, or any other tool you use. So maybe it goes hand in hand, working on reliability and working on maintenance. And in my mind, it comes down to this: spend 20% of every week or month or quarter on foundational things before they come back to bite you. Because when you don’t do that, they will come after you at some point.
Tom: Well, hopefully no one’s gonna come after you…
Mat: Consequences will.
Tom: You can also look at the amount of time you’re having to spend on toil. If it’s updating dependencies, or it’s manually adjusting the scaling on your service — the kind of manual labor that could be automated. And literally, go and tag the tickets or PRs into your repo as toil or not toil, and present the ratio of those.
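Note: Here is a rough sketch of the measurement Tom suggests, assuming toil work gets a “toil” label in GitHub. The repository name, label, and date are placeholders; it uses the public GitHub search API via the requests library.

```python
# Rough sketch: ratio of PRs labeled "toil" to all PRs in a repo.
# Assumes toil work is labeled in GitHub; repo, label, and date are placeholders.
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"

def count(query: str) -> int:
    resp = requests.get(GITHUB_SEARCH,
                        params={"q": query, "per_page": 1},
                        headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return resp.json()["total_count"]

def toil_ratio(repo: str, since: str) -> float:
    base = f"repo:{repo} is:pr created:>={since}"
    total = count(base)
    toil = count(base + " label:toil")
    return toil / total if total else 0.0

ratio = toil_ratio("example-org/example-service", since="2024-01-01")
print(f"{ratio:.0%} of recent PRs were tagged as toil")
```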
Alex: One thing that’s cool about managed Kubernetes is that they force you to update. When you run vanilla Kubernetes, you just install it and you’re good to go, and it runs basically forever. But with managed Kubernetes they force you to update; on AWS, for example, it’s every three months. And when you don’t do that, at some point they update it for you, and then stuff breaks. So that basically drives you to do it.
Tom: Yeah. And there are other cool things you can do in those situations. We put a limit on the age of the binaries that are allowed into production. So you can’t run a binary that’s, like, three months old. You’re just not allowed to. The system stops you. And this forces you to have a recent build.
And given that you know you’re going to have to do this very frequently, why not have a CI system building it for you, why not have a CD system doing the automatic deployment for you, so you don’t have to worry about any of this? And use something like Dependabot or Renovate Bot to make sure your dependencies are always up to date.
I would say these kinds of things should be relatively easy to justify with data. It’s just that gathering that data can sometimes be quite painful… But yeah, it’s about knowing the right data to gather: the number of issues that are tagged as toil, the number of incidents, SLO performance, error budget, these kinds of things.
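Note: Tom doesn’t describe how the binary-age limit is enforced at Grafana Labs, so the sketch below is just one hypothetical way to build such a gate: read the build timestamp from the standard OCI org.opencontainers.image.created image label and refuse to deploy anything older than the cutoff. Whether this runs as a CI step, a deploy-script check, or an admission webhook is up to you.

```python
# Hypothetical "maximum binary age" gate: reject artifacts whose build
# timestamp (here, the standard OCI image label) is older than a cutoff.
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=90)  # "you can't run a binary that's three months old"

def build_time(labels: dict) -> datetime:
    # OCI images record their creation time in this label (RFC 3339).
    return datetime.fromisoformat(labels["org.opencontainers.image.created"])

def allow_deploy(labels: dict, now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return now - build_time(labels) <= MAX_AGE

labels = {"org.opencontainers.image.created": "2024-01-15T09:30:00+00:00"}
print(allow_deploy(labels, now=datetime(2024, 6, 1, tzinfo=timezone.utc)))  # False: too old
```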
Mat: Yeah, I like that approach where you put the tooling in place, and you sort of put these gates in place. And the earlier you do that, the better. You make it completely everyday normal stuff. And then it becomes easy. It’s very difficult if you try to do all this stuff manually.
Tom: Especially with dependency updates, right? Dependabot has really changed the game here, because if I was only updating my dependencies when there was a major CVE, I wouldn’t be doing it that frequently, which means I’d have to learn how to do it each time… And if I’m on some old major version, and I’m having to move to a new major version, that might actually involve rewriting some of my code. That might actually be painful. Whereas if I’ve taken that pain and I’ve spread it out over lots of small increments, every day or every week for the past six months, and got the team used to doing it, got everyone comfortable with the idea that we’re just always going to be on the latest version, then hopefully we’ve made it more predictable, we’ve made it less stressful, we’ve kind of taken the edge off of it.
Mat: I like that also, for just releasing main. I love projects where you can just release main. You can only really do it if you’ve got amazing test coverage, basically; that’s the trick to do it. But if you do have that, and if you trust your tests that much, the freedom you have is just great. And of course, it’s a trade-off. There’ll be times when you wish – you know, “This is a bad idea. This feels like a bad idea.” But yeah, I agree. Spread the pain out, I love that point. Spread the pain out. It’s necessary anyway.
Tom: That’s my motto for life.
Migrating to Grafana OnCall, and the importance of simplicity
Note: Read Alex’s blog post for more details about Prezi’s switch from PagerDuty to Grafana IRM, which includes Grafana Incident and Grafana OnCall.
Alex: We started using Grafana Cloud for ingesting logs in 2022, and in that process we discovered that we could have the same features we used in PagerDuty in Grafana OnCall, without additional costs, which is a huge saving for us.
And we’re quite happy with that, because it just works. And I think our engineers love it, because it gives them the same features. And as I mentioned earlier, we gave them the responsibility for planning their own on-call shifts. Before, they had to define the shifts in Terraform and get approval from an SRE, and then Terraform had to be run. And it’s hard to keep things transparent in Terraform, or to see what the shifts actually look like.
Tom: Very cool. And how’s the experience with the tool been so far? Have you enjoyed it?
Alex: We really like it. Also, it gives us one UI to go into and look at our logs. And then, if something is wrong, we can look at our alerts and handle the escalation there. So it’s basically one tool with more features. We’ve even increased our usage with Grafana Incident recently.
Tom: One of the things we’re always keen to talk about, and always keen to understand with users, is: where do you want to go with this tool?
Alex: The more time I spend in Grafana OnCall and Grafana IRM, the more stressful it is. And that’s not because the product is bad; it’s because you only look at it when you’ve been paged. My wish is just that the tool stays simple: usable, not overly complex, and friendly for our engineers who don’t do on-call work that often.
I spoke about the different teams having responsibility for their tools, and there are teams that are in Grafana OnCall quite often, because their services are quite important and produce errors. But there are other teams that aren’t in there as often. So if the product stays usable for them as well, that is a big, big plus.
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.