How we’re building a production readiness review process at Grafana Labs
Production readiness review (PRR) is a process that originated at Google, described in the company’s famous SRE book as the first step of a site reliability engineering engagement. The idea of thoroughly reviewing a product before handing over the pager is a really good one, but outside of Google-scale companies, not many organizations can afford dedicated SRE teams.
At Grafana Labs, our product teams often take on the role of SRE teams. Moreover, because our products share a lot of technical similarities, we have a single on-call rotation that’s responsible for multiple products, staffed with engineers from different product teams. Since there’s no official SRE onboarding, we’ve built production readiness review as a completely separate process that strives to add value by having the product in question reviewed by an experienced engineer, ideally one from outside the product team. The output is a list of identified issues, which in the long run should reduce toil and the risks the product might face.
I presented a talk on this topic, “Production Readiness Review: Providing a solid base for SLOs,” at SLOConf this year. In this blog post, I’ll walk through our PRR process and some best practices that we’ve developed along the way.
The checklist
A well-written checklist is a crucial ingredient of the PRR. The list doesn’t try to meticulously cover all possible ground; rather, it applies common sense and past experience to pinpoint the crucial topics to talk about. The goal is to plug gaps that would pose significant risks, not to design a bulletproof checklist.
One of the main factors to consider when designing a PRR checklist is to not get too far ahead of the current status quo. The current best practices and state of the products should always be taken into account, so that the checklist produces specific, actionable feedback with clear added value. Those two qualities, specificity and added value, are crucial here: venturing too deep into best-practices-and-recommendations territory (e.g., discussing the code design patterns used) is a slippery slope that slowly turns the PRR into an architecture review with few practical suggestions.
Once written, the checklist is far from set in stone. As the company and its tooling evolve, the checklist will need updating, not to mention that a first draft of a PRR checklist might be far from perfect.
At Grafana Labs, we’ve already made some updates to our PRR checklist. Version 0 was a raw, brainstormed list of topics and questions that we considered important enough to address. That version, however, would have been pretty hard for product stakeholders to fill in, so version 1 improved the format by adding examples and clarifications. We used it for a canary round of PRRs, which also resulted in considerable feedback and incremental checklist improvements.
You can find a snapshot of our PRR checklist here.
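To make the v0 → v1 format change concrete, here’s a minimal sketch in Python. The question, field names, and render helper are purely hypothetical, not taken from our actual checklist; the point is that every item carries enough context for a stakeholder to answer it without a meeting.

```python
# Hypothetical sketch: a checklist item evolving from version 0 (a raw
# question) to version 1 (the same question plus clarification and example).
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str
    clarification: str = ""  # added in version 1
    example: str = ""        # added in version 1


v0 = ChecklistItem(question="How does the service handle overload?")

v1 = ChecklistItem(
    question="How does the service handle overload?",
    clarification="Cover load shedding, rate limits, and autoscaling.",
    example="e.g., per-tenant rate limits that return HTTP 429 over the quota",
)


def render(item: ChecklistItem) -> str:
    """Render one item as a markdown snippet for the review document."""
    parts = [f"### {item.question}"]
    if item.clarification:
        parts.append(f"_{item.clarification}_")
    if item.example:
        parts.append(f"({item.example})")
    parts.append("**Answer:**")
    return "\n\n".join(parts)


print(render(v1))
```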
The review
The review proceeds in multiple steps:
- A member of the product team, chosen to lead the process, reaches out to the PRR team to ask for a review. The PRR team picks a reviewer from the reviewer pool, who becomes the primary point of contact for the review.
- The product stakeholder fills in the PRR checklist. If any guidance is needed (be it clarifications of the PRR checklist or process questions in general), the reviewer is there to help.
- Regular 1:1 sessions between the product stakeholder and the reviewer get scheduled. We’ve experimented with both weekly and bi-weekly 30-minute meetings. It often took 10 or more sessions to go over the checklist thoroughly, so the process took 2-3 months to complete. In our experience, timeboxing to 30-minute meetings kept a sharp focus on a smaller scope of questions, while spacing meetings more than two weeks apart usually meant a loss of context. The total duration of the PRR shouldn’t be too long, though: as the product evolves, the checklist answers might start changing rapidly. This is especially true for a pre-launch PRR.
- During the meetings, we sometimes found it useful to have the reviewer assume the role of an attacker trying to break the system under review. Any issues we identify are filed as product bugs.
- Once the whole checklist is reviewed, the focus shifts from finding issues to fixing them. There’s no strict need for a regular 1:1 anymore, but communication between the reviewer and the product stakeholder (either lower-frequency meetings or async updates) is crucial to make sure progress continues on closing the identified issues (see the sketch after this list).
- Once the issues have been fixed, the product has passed the PRR.
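Keeping the fixing phase on track is easier when the list of identified issues stays visible. Below is a minimal sketch of what that could look like, assuming the issues are filed in a GitHub repo under a per-review label; the repo name and label are hypothetical, and pagination is omitted for brevity.

```python
# Minimal sketch: count open vs. closed PRR issues so the reviewer can
# check progress between syncs. Assumes issues carry a per-review label.
import requests

REPO = "example-org/example-product"  # hypothetical repo
LABEL = "prr/example-product"         # hypothetical per-review label


def issue_counts(repo: str, label: str) -> dict[str, int]:
    counts = {}
    for state in ("open", "closed"):
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"labels": label, "state": state, "per_page": 100},
            timeout=10,
        )
        resp.raise_for_status()
        # The issues endpoint also returns pull requests; filter them out.
        counts[state] = sum(1 for i in resp.json() if "pull_request" not in i)
    return counts


if __name__ == "__main__":
    counts = issue_counts(REPO, LABEL)
    print(f"PRR issues: {counts['closed']} closed, {counts['open']} still open")
```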
Looking ahead
The PRR doesn’t necessarily end there, though. We’ve already identified some areas to look at in the future:
- We’ve only just started doing PRRs, but both the checklist and the products keep evolving, even while a PRR is running. We’re looking into periodic and/or incremental PRRs as part of our continuous product improvements.
- Once we collect enough data, cross-correlating PRR documents to see which areas are the most problematic might help us identify weaknesses in our engineering process (see the sketch after this list).
- As of now, updates to the PRR checklist happen organically (i.e., when an issue is discovered during the review process). Figuring out how to process that feedback and update the checklist more systematically could be a huge improvement.
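As a sketch of what that cross-correlation could look like: assuming each finished PRR is exported as a JSON document mapping checklist areas to the issues found there (the file layout and field names are illustrative assumptions, not our actual tooling), a few lines suffice to surface the most frequently flagged areas.

```python
# Hypothetical sketch: tally which checklist areas accumulate the most
# issues across all completed PRR documents stored as JSON files.
import json
from collections import Counter
from pathlib import Path


def problem_areas(prr_dir: str) -> Counter:
    counts: Counter = Counter()
    for path in Path(prr_dir).glob("*.json"):
        review = json.loads(path.read_text())
        # Assumed layout: {"issues_by_area": {"alerting": [...], ...}}
        for area, issues in review["issues_by_area"].items():
            counts[area] += len(issues)
    return counts


if __name__ == "__main__":
    for area, n in problem_areas("prr-reviews/").most_common(5):
        print(f"{area}: {n} issues across all reviews")
```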
Conclusion
The PRR canary at Grafana Labs was a success: we managed to identify some important issues (such as missing index files potentially taking down the service, along with the tooling to easily fix that), and we certainly see more improvements coming this way.
Does your organization have a production readiness review process? We’d love to hear from the community as we continue to work on ours. And if this work sounds interesting to you, and you’d like to help shape our PRR, we’re hiring. :-)