How Stack Overflow uses Grafana to optimize its systems
Founded in 2008, Stack Overflow is the largest, most trusted online community for developers to learn, share their knowledge, and build their careers. More than 50 million professional and aspiring programmers visit Stack Overflow each month to help solve coding problems, develop new skills, and find job opportunities. Grafana Labs spoke with Kyle Brandt, Director, Site Reliability Engineering at Stack Overflow, to learn about the various data sources and monitoring systems they rely on, metric visualization problems they were having, and how not all dashboards are created equal.
The problem
Stack Overflow’s custom monitoring tool, Opserver, provided purpose-built dashboards but lacked an easy way to create custom self-service dashboards. They needed a tool that gave their developers and SRE teams the freedom to quickly create tailored dashboards visualizing data from OpenTSDB, Elasticsearch, and their alerting tools, all in the same experience. They wanted to empower the teams that knew the data best.
Milliseconds matter
Stack Overflow’s Ad-Server team was the first to discover Grafana. They were searching for a tool to create custom server latency dashboards. Displaying ads on a website is latency-sensitive; a millisecond delay can have a huge impact on revenue. Server latency also affects which ads are displayed to whom on the site. The quicker the ad is served, the more targeted it can be for the user. Grafana’s real-time dashboards were critical in discovering where the Ad-Server team could optimize for the best possible server performance. Grafana quickly spread from the Ad-Server team to other teams at Stack Overflow, since it can visualize data from many different data sources, both open source and commercial. For Stack Overflow, this meant OpenTSDB data could be visualized alongside Elasticsearch data, which in turn could be viewed alongside their custom alerting data from Bosun.
The Bosun alerting system and a new Grafana plugin
Bosun is an open source alerting system Stack Overflow created. It has an expressive domain-specific language for evaluating alerts and creating detailed notifications. It also tests alerts against history for a faster development experience. Bosun is robust, but comes with a complex user interface and an often steep learning curve. Its visualization options are also limited. In Kyle’s recent GrafanaCon talk, “The Culture and Realities of Monitoring at Stack Overflow,” he described monitoring as “a medium for humans to communicate with other humans through machines.” The Bosun project reflects a deep understanding of the impact of alerting on culture, which Kyle is extremely sensitive to. An intuitive UI and consistent user experience are key to making complex systems easier to understand – something Grafana has always prioritized. So the team decided to build a plugin to bring the power of Bosun into Grafana’s user-friendly interface.
Grafana’s plugin architecture allowed the Stack Overflow team to create a data source plugin for Bosun to visualize Bosun alerting data directly in Grafana. “Since Grafana is more user-friendly, people tend to pick it up more naturally. We have turned Grafana users into Bosun consumers, and soon hope to turn Bosun consumers into Bosun authors.”
Kyle Brandt, Director, Site Reliability, Stack Overflow
The plugin allows teams to use the Bosun expression language inside Grafana to achieve visualizations not previously possible. Grafana can also display annotations created from Bosun, adding valuable context to metric behavior. Showing relevant alerts directly on Grafana dashboards also means teams have fewer places to look for information. This consolidation provides actionable insight at a critical moment: right when teams are viewing the data. The Bosun team didn’t keep this new plugin for themselves; it is freely available for download at Grafana Labs’ Plugin Repository.
“Stack Overflow is focused on the audience and placing yourself in their shoes. Think of what they know and what they might not know. Don’t tell them what they need to know; show them what they need to know on a Grafana dashboard.”
Kyle Brandt, Director, Site Reliability, Stack Overflow
Conclusion
Grafana allows teams across Stack Overflow to quickly and easily build custom self-service dashboards for what’s important to them, no matter where the data lives or which database it’s stored in. Because Grafana is open source and has a robust plugin architecture, the Bosun team was able to create a plugin that leverages Bosun’s powerful alerting system, and can now visualize the data in new ways. The new plugin lets users new to Bosun write queries and set alerts directly from Grafana’s UI, while retaining the flexibility of Bosun’s native expression language. Given the plugin’s popularity internally, the team shared it with the entire Grafana community, and it has been installed thousands of times by users of both projects.