The engineering on-call experience: misconceptions, lessons learned, and how to prepare

2024-03-14 9 min

The on-call experience is sometimes a dreaded one for software engineers. Those late-night alerts and frantic Slack messages, after all, don’t exactly sound pleasant. But what’s an on-call shift really like? Is that perception of constant fire-fighting and 3 AM wake-up calls actually realistic? 

Michael Mandrus and Owen Smallwood, both senior software engineers here at Grafana Labs, wanted to set the record straight.

Michael and Owen are part of a team that works primarily on Grafana Cloud, building new features and supporting the underlying infrastructure. They both joined the company in 2022, and started their first on-call shifts in 2023. Despite months of training and shadowing experienced on-call engineers, they had some reservations.

“As an outsider looking in and seeing people on call, they seemed to know everything,” Michael said. “They know all the logs and metrics you can look at for the systems, they know all the points of integration that could be causing a problem when an alert comes up — I just thought to myself, ‘I could never do this.’”

Fast forward to today, and Michael and Owen have successfully completed dozens of on-call shifts. They’re based in North America, but share on-call rotations with Grafana Labs engineers in South America and Europe. In general, they have a shift about every four weeks, during which they’re available for 12 hours a day for five consecutive days. During those shifts, they step away from their usual day-to-day work to focus exclusively on ensuring the best possible user experience in Grafana Cloud.     

Over time, Michael and Owen learned a lot about the on-call experience, including the fact that it’s not as dire as it’s sometimes portrayed. Here, they wanted to share their biggest misconceptions about being on call, some tips and tricks they’ve learned along the way, and why the experience can be rewarding — and, dare they say, fun.

A headshot of Michael Mandrus and Owen Smallwood.

Note: The following excerpts have been edited for clarity and length.

What were some of the biggest challenges you faced when starting your on-call shifts? 

Michael: We both started out as Grafana developers, and we’d never been on an on-call rotation. I think, traditionally, a lot of developers build a feature, it gets deployed somewhere, and then it’s someone else’s job to maintain it. That was the mindset I was used to working with.

Then, when I went on call, everything changed. First of all, it’s a whole new category of work. You have to make quick decisions and understand enough about a lot of things so that you can investigate issues, or at least find the right people to investigate issues.

You’re on shift for 12 hours a day, five days in a row. During those 12 hours, you need to stay at home, or close enough to home, so you can help when needed. There’s also a lot of tooling that we both had to learn, and we had to learn enough about Grafana Cloud infrastructure to investigate issues effectively.

What’s surprised you the most about the on-call experience? 

Michael: When you’re on call, your job isn’t just to fix things. If you can fix it and you have time to fix it, then, good. But it’s really about organization and communication and doing what you can. As soon as you recognize that you’re at your limit, or something more pressing comes up, you need to find someone else to hand off to. There’s a lot of juggling. You’ll get alerts during an incident and you’ll get pinged for an escalation, and then people internally will tag you to help with things. It’s a big balancing act. But the expectation isn’t to know everything. It’s to keep the ship moving. 

Another big surprise is how collaborative it is. The alerting squad is also on call, then there’s the customer support team — all these people are on call the same week. So you have this little on-call community and build a rapport with the same people. I expected to be on an island, but there are a lot of other people involved and that communal aspect is cool.

Lastly, I was surprised that the level of disruption really depends on the week. You’ll have a week where almost nothing happens, and then you’ll have a week where you’re really busy. Fortunately, Grafana Labs encourages flex time, so if you do have a really busy week, then you can claim some of your time back when you’re not on call.

What advice would you give an engineer about to embark on their first on-call shift?

Owen: You have to change the way you work. When I’m on call, if I’m not getting any pages, I make an effort to step away from my computer to, for example, go for a walk around the block with my dogs. I take my phone with me, in case I get paged, but it’s so important to take those short breaks. 

There are a bunch of other little things to make your life easier. For example, originally, when I got a page on my phone, it would make the standard sound of a text message. But it got to the point where I would get a text message from a family member or friend, and immediately think it was related to work. Grafana OnCall, our on-call management tool that we also use internally, has a mobile app and a unique ringtone for alerts; it can even push alerts through when your phone is on silent. As small as it sounds, that’s made a big difference for me, in terms of quality of life.

Another big part of the on-call experience for me was just handling things under pressure. At times, especially in the beginning, there would be an incident and I’d find myself feeling a bit flustered. You have to put yourself in a different mindset. Yes, it’s important to react quickly, but no one’s getting hurt. It’s also so important to be comfortable throwing up your hands and saying, “I have no idea what this is.” Ask for help instead of just grinding away and trying to figure it out by yourself.

Michael: Yeah, it’s so important to collaborate with your fellow engineers. If you are finding the on-call experience to be difficult, you probably aren’t alone. We collaborate on ways to improve our shifts — for example, we began periodically reviewing and refining alerts to reduce noise, and we are now toying around with different on-call schedules that would reduce the number of days in a shift.

Tell me more about the tools you use to manage and streamline the on-call experience.

Michael: As Owen said, we use Grafana OnCall, and we dogfood it pretty heavily internally. We use OnCall for schedules, overrides, shift swaps, and paging. You can set it up and configure it specifically for you and your team. We also use Grafana Incident for all of our incident response, which minimizes overhead and lets us focus on finding a resolution.

The OnCall and Incident apps are tightly integrated with Slack, so you can declare any incident from OnCall or Slack and they’ll sync with each other. So we have some tooling that makes the communication and handoffs better. 

Runbooks are also really helpful. Each alert we get refers us to a step-by-step guide on debugging and resolving the alert. If your runbooks are clear and up to date, you can follow them easily.

Owen: We also did an internal hackathon project that focused on documenting commonly used commands for when you’re on call, because there’s such a wide variety of commands you use to resolve issues. We wanted a single repository for those commands, with descriptions and different parameters. We use that a lot.
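Editor's note: the interview does not describe how that command repository is built, so the following is only a minimal sketch of the general idea, a searchable catalog in which each command carries a description and named parameters. The `CommandEntry` class, the `CATALOG` contents, and the `search` helper are all hypothetical names chosen for illustration, and the two `kubectl` commands are stand-in examples rather than the team's actual entries.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass(frozen=True)
class CommandEntry:
    """One documented command (hypothetical structure, for illustration only)."""
    name: str
    description: str
    template: str  # command text with $placeholders for parameters
    params: dict[str, str] = field(default_factory=dict)  # parameter -> what it means

    def render(self, **values: str) -> str:
        """Fill in the template, refusing to run with missing parameters."""
        missing = sorted(set(self.params) - set(values))
        if missing:
            raise ValueError(f"missing parameters: {', '.join(missing)}")
        return Template(self.template).substitute(values)


# Illustrative entries only; a real catalog would hold the team's own commands.
CATALOG = {
    entry.name: entry
    for entry in [
        CommandEntry(
            name="pod-logs",
            description="Show the most recent log lines for one pod.",
            template="kubectl logs $pod --namespace $namespace --tail=$lines",
            params={
                "pod": "name of the pod",
                "namespace": "Kubernetes namespace the pod runs in",
                "lines": "how many trailing lines to show",
            },
        ),
        CommandEntry(
            name="rollout-status",
            description="Check whether a deployment has finished rolling out.",
            template="kubectl rollout status deployment/$deployment --namespace $namespace",
            params={
                "deployment": "name of the deployment",
                "namespace": "Kubernetes namespace the deployment runs in",
            },
        ),
    ]
}


def search(keyword: str) -> list[CommandEntry]:
    """Return entries whose name or description contains the keyword."""
    needle = keyword.lower()
    return [
        entry
        for entry in CATALOG.values()
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


if __name__ == "__main__":
    for hit in search("logs"):
        print(hit.name, "-", hit.description)
    print(CATALOG["pod-logs"].render(pod="api-0", namespace="prod", lines="100"))
```

Keeping the parameter descriptions next to the template means an engineer who has never run a command can still see what each value is for, and the `render` check fails early rather than producing a half-filled command during an incident.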

Is there anything about the on-call culture or experience at Grafana Labs, specifically, that you’ve found valuable?

Michael: We do a shadowing program, which is really helpful. There’s a primary on-call engineer and then there’s a secondary one. The secondary is like having an experienced backup; they get pinged for everything and they’re there if you need them, but they’re not the primary stakeholder. Shadowing different engineers while I was training exposed me to multiple approaches to being on call, which has helped me develop a style that I enjoy.

Owen: Our team dynamic is another thing. In the early days of Grafana Cloud, we would develop features and throw them over a fence to another team to make sure they stay running. But we’re shifting to more of a paradigm where teams own the stuff that they build, entirely. So they build it, they run it, they maintain it, and it’s their job to make sure it’s running. This means most engineering teams will have an on-call rotation eventually, if they don’t already.

We are also extremely lucky because the other half of our team is in Europe. That makes a big difference, in terms of our shifts — otherwise, we’d likely be on call 24x7 for a few days, and I imagine that would introduce way more complications.

How has being on call helped you grow as an engineer?

Michael: Every time you’re on call, you learn something new. One week, for example, you might learn how to fix rolling release channels, and the next week you learn how to scale a database. We also took a Kubernetes training course before starting on-call shifts, which has helped me with other aspects of my job, including teaching users how to deploy our software on premises. It’s almost like you’re learning more about the next level up — you already know a lot about the apps, but you’re learning more about the environment the apps run in. That makes you a better developer. 

I also want to stress that, overall, I find it fun, even with the occasional long days. Yes, it can be tiring, but it’s also really interesting and engaging. 

To learn more about our engineering and work culture here at Grafana Labs, you can check out these recent blog posts. And for more information on some of the on-call tools mentioned above, you can read our technical documentation for Grafana OnCall and Grafana Incident, which are both part of the Grafana Incident Response & Management (IRM) solution.