How to improve uptime with real-time monitoring, Grafana dashboards, and Grafana Loki: Inside Dish Network's observability stack
Dish Network is on a mission to connect people and things by changing the way the world communicates. Its products range from Dish and Sling TV to retail wireless services and 5G networks, so monitoring its satellite communications equipment is mission critical to maintaining extreme uptime for Dish’s 20 million customers across the United States.
In their GrafanaCONline 2022 talk titled “Grafana and Grafana Loki in space: Monitoring Earth Station Operations for Dish Network” (available to watch on demand now), Systems Administrator Engineer Ted Raymond shared how his team uses Grafana to improve their already outstanding uptime with real-time dashboards and insights on everything from uplink equipment performance to weather conditions.
Increasing uptime with automated monitoring and alerts
Raymond’s Earth Station Operations team maintains the ground segment of Dish Network’s satellite communications: antennas, antenna controls, transmitters, and up and down converters. They also work closely with team members across the country to ensure that regional and national television content remains on the air.
Although the team was already delivering 99% uptime as part of their efforts, Raymond wanted to close that 1% gap and get a better real-time view of the health and performance of their equipment.
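To put that 1% gap in perspective, a quick back-of-the-envelope calculation converts an uptime percentage into the downtime it allows per year. This is an illustrative sketch only: the function name and the 365-day year are assumptions, not figures from Dish Network.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8,760 hours


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows about {hours:.1f} hours of downtime per year")
```

At 99% uptime, the allowance works out to roughly 87.6 hours a year, which is why shaving even a fraction of that last 1% matters for a broadcast service.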
Prior to implementing Grafana dashboards, data collection and post-event analysis were inefficient and manual. “It took a long time to get data collected and presented in a way that made sense to the average user within the organization,” said Raymond. Additionally, only specialized personnel could access and share this data, so collaboration between teams often took days or weeks.
Raymond’s team quickly set up Grafana and configured Grafana Alerting to fire whenever equipment performance became an issue – for example, if an antenna or signal level wasn’t meeting requirements. With these alerts, the team could fix problems before they became a major disruption in uptime. “We set [Grafana] up in one of our control rooms and had it talking to one database. That day, we were already querying the database and building dashboards and panels and then graphs,” said Raymond.
Now they have Grafana feeding them real-time data on everything from satellite performance to antenna motion and transmitter status. Using thresholds in each dashboard and Grafana alerts, the team is able to catch HDD consumption issues and antenna operation problems much more efficiently.
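The threshold idea described above can be sketched in a few lines of Python. This is not Grafana's implementation or Dish Network's configuration (in practice, thresholds and alert rules are set up inside Grafana itself). The metric names `disk_used_percent` and `signal_level_db` and the limits used below are hypothetical, chosen only to show the logic of comparing a reading against a minimum or maximum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    metric: str                      # hypothetical metric name
    minimum: Optional[float] = None  # alert if the reading falls below this
    maximum: Optional[float] = None  # alert if the reading rises above this


def breaches(readings: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one message per reading that is outside its threshold."""
    alerts = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            continue  # no reading for this metric; skipped in this sketch
        if t.minimum is not None and value < t.minimum:
            alerts.append(f"{t.metric}={value} is below minimum {t.minimum}")
        if t.maximum is not None and value > t.maximum:
            alerts.append(f"{t.metric}={value} is above maximum {t.maximum}")
    return alerts


if __name__ == "__main__":
    rules = [
        Threshold("disk_used_percent", maximum=90.0),  # assumed limit
        Threshold("signal_level_db", minimum=-5.0),    # assumed limit
    ]
    sample = {"disk_used_percent": 95.0, "signal_level_db": -3.0}
    for message in breaches(sample, rules):
        print(message)
```

With the sample readings above, only the disk reading is out of range, so a single alert message is produced.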
They have also expanded their observability stack to include Grafana Loki for reading log files, such as debug files from software and /var/log/messages entries for server health. The team also uses Loki to read log files from servers that collect SNMP traps, and Grafana to extract that data and build charts and alerts around it.
“Instead of wondering when something happened, now we can use Grafana to send out an alert that says at this point in time this device sent out an SNMP trap,” said Raymond. “It takes the guesswork out of when the fault occurred.”
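As a rough illustration of how a trap log line can be turned into a timestamped alert like the one Raymond describes, the sketch below parses a single line and formats a message. The log format (`<UTC timestamp> device=<name> trap=<name>`), the device name, and the function names are all assumptions made for this example; the article does not show Dish Network's actual log format, and in their stack the extraction is done with Loki and Grafana rather than a standalone script.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class TrapEvent(NamedTuple):
    timestamp: datetime
    device: str
    trap: str


# Assumed line format: "2022-06-14T09:31:07Z device=antenna-ctl-01 trap=linkDown"
LINE_PATTERN = re.compile(r"^(?P<ts>\S+) device=(?P<device>\S+) trap=(?P<trap>\S+)$")


def parse_trap_line(line: str) -> Optional[TrapEvent]:
    """Parse one log line; return None if it does not match the assumed format."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None  # timestamp is not in the assumed UTC format
    return TrapEvent(ts.replace(tzinfo=timezone.utc), match["device"], match["trap"])


def alert_message(event: TrapEvent) -> str:
    """Build a human-readable alert stating when the device sent the trap."""
    return (
        f"At {event.timestamp.isoformat()} device {event.device} "
        f"sent SNMP trap {event.trap}"
    )


if __name__ == "__main__":
    event = parse_trap_line("2022-06-14T09:31:07Z device=antenna-ctl-01 trap=linkDown")
    if event is not None:
        print(alert_message(event))
```

Lines that do not match the assumed format are ignored rather than raising an error, which keeps the sketch tolerant of unrelated log entries.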
Even better, anyone can build dashboards and interpret the data: “Users with no real programming background can walk into Grafana and customize dashboards with ease,” said Raymond.
Modernizing monitoring and building mission-specific Grafana dashboards
The team expanded their use case to monitor internal server health and performance, external weather conditions at uplink sites, and even a specific mission where teams were working together to de-orbit a spacecraft.
“Pre-Grafana, working on a mission like that would have required phone calls, emails, and more to coordinate data collection. With Grafana, we were able to tie in all the necessary servers and antennas and build a dashboard within 15 minutes to monitor the movement of the spacecraft,” said Raymond.
“In the old days, we were trending data from one database at a time and running reports in a spreadsheet. With Grafana, we can query all 60 monitoring and control systems at the same time and have that data up on a dashboard live,” said Raymond, who also points out that Grafana did more than unite Dish Network’s data. “Working on this goal of modernizing and creating efficiencies within the team has brought us closer together.”
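The shift from querying one database at a time to querying all 60 systems together can be sketched with a thread pool. This is only a conceptual illustration: Grafana handles multiple data sources itself, and `query_system` below is a hypothetical stand-in that returns placeholder data instead of contacting a real monitoring and control system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable


def query_system(system_id: int) -> dict:
    """Hypothetical stand-in for querying one monitoring and control system."""
    # A real version would run a database query here; this returns placeholder data.
    return {"system": system_id, "status": "ok"}


def query_all(system_ids: Iterable[int], max_workers: int = 10) -> list[dict]:
    """Query every system concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(query_system, system_ids))


if __name__ == "__main__":
    results = query_all(range(1, 61))  # 60 systems, as mentioned in the talk
    print(f"collected {len(results)} results")
```

Because the work is mostly waiting on I/O, threads are a reasonable fit here; the worker count of 10 is an arbitrary assumption.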
Watch the full session to learn more about Dish Network’s system architecture and see how they’ve grown Grafana operations throughout the company. All our sessions from GrafanaCONline 2022 are now available on demand.