How Verizon achieved automation and self-service with Grafana
Can you monitor us now?
That was the question Verizon started asking as the Fortune 500 company expanded its portfolio beyond communications services to include brands such as Yahoo! and HuffPost.
“We’re not just grandma’s landline,” Sean Thomas, Verizon Systems Engineering Manager, told the audience at GrafanaCon in L.A. “We’re not just your mobile provider. We are a media company. We have 5G solutions. We’re building technology. We’re building the future.”
By the end of 2018, Verizon employed 144,500 people to do just that. “In terms of scale, Super Bowl LIII had 70,000 people at it. That means that we filled up two stadiums at Super Bowl LIII and still had a few thousand people partying in the parking lot tailgating,” said Thomas.
But the varsity team for monitoring was the Verizon Systems Engineering team, which oversees cloud engineering, analytics, ITSM automation, and tools.
Thomas, who helps lead the full stack development division, said that as the company grew, the team strove to get a full picture of its internal systems so they could “take things into the future from a large-scale enterprise perspective.”
At the time, there were 40 servers running analytics for all of Verizon’s systems, such as change management, availability management, change tracking, and event management. Those servers ran in a SQL Server Reporting Services (SSRS) environment with SQL Server on Windows, so the licensing costs alone were a burden.
“It wasn’t efficient. It wasn’t scalable, it wasn’t modern, and it was just a pain to get anything done,” Thomas said during the GrafanaCon session. As Verizon restructured internally, “one of the hardest parts that we ran into was if the business said, ‘Hey, we’re going to make this change. This department is now called this.’ When that happened, you had to do a whole development effort just to change the name on all these reports. It was crazy.”
Grafana to the rescue
The goal for the Systems Engineering team was to bring all the different data sources into a single, easily accessible view for end users and the executive team.
After looking at the infrastructure in place, Derek Meyer, a Verizon Engineer on the Automation Tools team, started looking into open source options. “I’ve always been a person who enjoyed open source software,” he said, “and try to contribute where I can.”
Meyer started playing around with Grafana. “I tossed up my little play website and played with my own data,” he explained. After some initial experimentation with other engineers, they decided to pursue Grafana as a company directive.
While putting together a new monitoring model, the team had set up the infrastructure to run several MySQL databases with replication to replace the SQL servers that incurred licensing costs. They had also had Linux boxes set up for some time.
“We compared our old model to our new model and said, ‘Gee, this is a no brainer. Why don’t we continue down this path using Grafana?’” said Meyer.
There were a few hurdles along this path, however.
First, the team had to figure out how to handle its legacy infrastructure on premises, said Meyer. “Every time you build up on-prem stuff it’s, ‘Here’s a request; build me a server.’ Or ‘Here’s another request to get the OS on it.’”
“That’s six months to do … if you’re lucky,” added Thomas.
To improve the ease of scalability, the team came up with a hybrid solution that leverages containers. “A lot of our data remains on prem just because of the sensitivity of it,” explained Meyer.
“Security has always been the biggest issue,” said Thomas. “That’s the main reason that we’re looking at a hybrid over a full cloud approach … There is quite a bit of sensitive data that the security and governance teams are uncomfortable with having out there.”
But, Meyer said, “we can stick the front-end in a hybrid situation cloud and help reduce the time as well as increase our redundancy.”
When they shifted their attention to the old SSRS servers, engineers discovered there were more than 500,000 lines of static code for stored procedures such as change management and instances.
“The code had been around for a very long time and to make a change to it you were really hoping that what you were doing wasn’t going to break something else,” said Meyer.
Instead, the Verizon team broke down the existing code and drastically reduced it to 500 lines of dynamic code in only five stored procedures, thanks to the functions within Grafana.
“Those 500,000 lines were in 200+ different stored procedures. Lots of them were multi-thousands of lines, where everything was the same but one variable. When you want to go try and change it, it was hard,” explained Meyer. “We do all of our change metrics, our instance response, and ticket tracking off of five stored procedures now by leveraging Grafana and MySQL.”
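The pattern Meyer describes, many procedures that are identical except for one variable, is the kind of duplication a single parameterized query removes. Here is a minimal sketch of that idea, not Verizon’s actual code: the table, column names, and data are hypothetical, and Python’s built-in sqlite3 stands in for MySQL so the example runs on its own.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, department TEXT, kind TEXT);
    INSERT INTO tickets (department, kind) VALUES
        ('Network', 'change'), ('Network', 'incident'),
        ('Storage', 'change'), ('Network', 'change');
""")

def ticket_count(conn, department, kind):
    """One parameterized query instead of one copy per department/kind pair."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tickets WHERE department = ? AND kind = ?",
        (department, kind),
    ).fetchone()
    return row[0]

network_changes = ticket_count(conn, "Network", "change")
storage_changes = ticket_count(conn, "Storage", "change")
print(network_changes, storage_changes)
```

Adding a new department or ticket type then means passing a different argument rather than writing another procedure.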
Did Grafana really make a difference?
With all these large shifts in infrastructure, “the next big question is, ‘Was the change worth it?’” said Thomas.
The numbers speak for themselves: Before Grafana was implemented, Verizon operated with 1,000 times as many lines of stored procedure code (100,000%) and 40 times as many stored procedures (4,000%).
“I triple-checked these percentages,” said Thomas. “That’s actually correct.”
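Those percentages can be checked against the figures quoted earlier: 500,000 lines reduced to 500, and 200+ stored procedures reduced to five. This quick sketch treats “200+” as exactly 200 (an assumption) and expresses each old count as a percentage of the new one.

```python
def percent_of(before, after):
    """Return `before` expressed as a percentage of `after`."""
    return before / after * 100

lines_pct = percent_of(500_000, 500)  # lines of stored procedure code
procs_pct = percent_of(200, 5)        # stored procedures ("200+" taken as 200)
print(f"{lines_pct:,.0f}% and {procs_pct:,.0f}%")
```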
But here are three major improvements that Thomas and Meyer outlined to drive their point home:
1. Better use of time
One of the most positive outcomes of Grafana has been how much time is saved in managing and monitoring metrics at the company.
When a line of business changes names or a new VP joins the executive team or there’s a management reorg, “everything dynamically updates from the source data,” said Thomas. “The dashboards that previously showed the information for one person show it for the new person. I can automatically get everything I need.”
In the past, any org changes would involve multiple developers who would need to dedicate at least 30 days to complete the development effort.
“Every single one of those lines in the stored procedure had to be updated – and everybody knows what happens when that goes on,” said Thomas. “You miss one line and that, of course, is the line that one VP looks at. Another VP is looking at a completely different dashboard. The two numbers don’t gel, and your CIO gets two different stories from two different VPs. Then guess who gets the phone call at 2 AM?”
With the new system, “taking that [process] down to automated tasks and just updating 500 lines of code, that’s two [free] FTE right there,” said Thomas. “Those developers are not focusing on dashboards. Now they can focus on actual deliverables and everything that you actually have to get done through the year.”
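One way to picture the dynamic updates Thomas describes is a report that joins names from the source data instead of hard-coding them. This is only a sketch under assumptions: the tables and names are hypothetical, and sqlite3 stands in for MySQL so the example is self-contained.

```python
import sqlite3

# Hypothetical tables and names, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE changes (id INTEGER PRIMARY KEY, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Old Name');
    INSERT INTO changes (department_id) VALUES (1), (1), (1);
""")

# The report never mentions a department name; it reads it from the data.
REPORT_SQL = """
    SELECT d.name, COUNT(*)
    FROM changes AS c
    JOIN departments AS d ON d.id = c.department_id
    GROUP BY d.name
"""

before = conn.execute(REPORT_SQL).fetchall()

# A reorg renames the department: one row changes, the query does not.
conn.execute("UPDATE departments SET name = 'New Name' WHERE id = 1")
after = conn.execute(REPORT_SQL).fetchall()

print(before, after)
```

The rename is a single data update, so every report built on the same query picks it up without a development effort.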
2. Empowered end user
Prior to Grafana, reports were manually created for every request. “We had thousands of reports, a lot similar to each other,” said Meyer. “Over time they were going stale. You don’t always know if they’re all working without checking thousands of dashboards. The automation behind it was extremely difficult to do.”
Also, because there are various ways to view the same data, each view required a separate SSRS report and its own development effort.
“Now it’s a filter at the top of the page,” said Thomas. “Executives don’t have to fill out [requirements]. They get the data as they need it. It makes their ops reviews quicker to put together. It’s all at their fingertips.”
With this self-service model for metrics “you empower the end user,” said Meyer. “Anybody from a call center rep to a CIO can turn around and leverage that information and see it in a way that they want.”
Plus with some of the log-in abilities in Grafana, “if you tie it in to your LDAP ability, you can set it so that certain reports are only available to certain people,” said Meyer.
“It’s got a lot of flexibility,” Meyer added, “and just makes life so much easier.”
3. Fewer fire alarms
Thankfully there have also been fewer unwanted data charges for the engineering teams.
“One of the big things that we first noticed immediately was fewer fire alarms,” said Thomas. “When I say fire alarms, I mean late-night text messages, late-night phone calls with ‘This data’s wrong; this data’s inconsistent.’”
Thomas has also noticed his inbox is getting much less traffic. “There’s a significant reduction in emails,” he said. “If you have one dashboard wrong in a company the size of Verizon, you don’t hear about it from one person. You get 17 different emails, all from different executive directors or different management teams.”
All of these factors add up to a better quality of life for developers at the company. “How many of us in here pulled 24-hour days doing something? Or had to get up at 2 AM to try to fix something? Or you left work and you turned around and said, ‘Oh crap, I forgot I need to do this by the morning,’” said Meyer.
“I know over the last 10 years my stress level and my blood pressure have gone up,” said Meyer. “Now with the infrastructure that’s in place, I don’t necessarily have to worry as much about it.”
Focusing on the future
With the days of the “drop everything” fire alarm in the past, engineering teams can now look towards the future.
In recruiting more teams to use Grafana, “we first showed off the capabilities by decommissioning 40 licensed systems, moving it onto this singular platform,” said Thomas. “The next piece is now we’re marketing. We’ve got cloud engineering teams, our network teams, our storage teams coming on board and seeing the power that’s available within Grafana.”
As Thomas’s team shifts away from development efforts involving dashboards and metrics, “we can actually get real work done.”
That involves contributing to making Grafana even better. “This truly is that single pane of glass solution. There’s more data sources being added on a regular basis. It’s an open source solution for those data sources,” said Thomas.
And if Grafana doesn’t offer the solution Verizon needs at the moment, “maybe it doesn’t exist today, but it could exist tomorrow,” said Thomas. “We have plenty of talent within the company that could certainly contribute and create those data sources if they’re needed.”
For more from GrafanaCon 2019, check out all the talks on YouTube.