How Mux cut metrics volume by 60%, increased retention times, and improved developer productivity with Grafana Cloud
Every time the platform engineering team at San Francisco-based startup Mux deploys new software, there are two must-have components: proper access controls and observability. But until recently, their observability stack left the team frustrated, reactive, and largely in maintenance mode.
The company operates an API-first video platform designed to give development teams world-class video streaming and analytics capabilities, which means they need to scale dynamically to accommodate unpredictable usage demands for very compute-, network-, and storage-intensive workloads. Couple that with the wide variety of user-generated content on the platform and the unique troubleshooting that comes with it, and it’s easy to understand why Mux needs fine-grained visibility into their workloads and a system that properly supports it.
“All the clusters need observability; all the applications need observability. Whenever we deploy new infrastructure, the first thing we put in there is access and observability — and that creates the sprawl that multiplied as we continued to grow,” says Brian Lieberman, Senior Software Engineer at Mux.
So after struggling for years to maintain an in-house OSS stack (Elasticsearch, Kibana, Prometheus, Jaeger, Grafana), they knew it was time for a change. They went with Grafana Cloud, in part because of their existing familiarity with Grafana OSS, but the payoff has been a lot more than just an easy onramp.
They’ve cut their metrics volume by 60% while also greatly expanding their data retention time, going from seven days to 30 days for traces and from 14 days to 13 months for metrics. This has helped Mux reduce noise, improve long-term analysis, and take a more proactive approach to incident management, all while keeping their costs level and their engineers engaged with more high-impact work.
“Grafana Cloud probably saves us hundreds of engineering hours a year. Our platform engineers don’t have to manage the stack anymore, and our product engineers don’t have to work through multiple observability tools, which used to really slow down our response times,” says Ryan Grothouse, VP, Engineering at Mux.
We recently sat down with members of the Mux platform engineering team to learn more about how they’re using Grafana Cloud for metrics, logs, and traces, and how they plan to extend their stack in Grafana Cloud with Grafana Cloud k6, Grafana IRM, Grafana SLO, and Grafana Cloud Profiles.
Can you tell us about the problems you were running into with your previous observability setup?
Ryan Grothouse, VP, Engineering: Our product engineers were jumping across multiple instances to correlate and pinpoint observability across them. It was not only frustrating, but it often slowed mean time to resolution and probably led to us missing a lot of things because of context switching.
Ron Lipke, Senior Engineering Manager: Doing anything across products was nearly impossible because that data was so fragmented and we didn’t have anything that was federated for our developers to go and get that insight. We were doing stuff like exporting to CSVs and then ingesting them back into Kibana — hacky stuff that was not fun.
What were some of the immediate benefits that stood out when you switched to Grafana Cloud?
Ron: UX improvement for our developers. Some of the immediate feedback was just like, “Wow, I don’t have to go across all of these individual URLs and logins. I can just flip a cluster label and I can see all these things!” Also, the retention time was mind-blowing for everybody because it really allowed us to do more long-term trend analysis, and it even helped our incident management postmortems.
Kyle Weaver, Staff Software Engineer: Maintenance as well. It would be a lot more toil on our team to ensure the correct scale and configuration on the compute side.
And you’ve relied a lot on Adaptive Metrics, our cardinality optimization tool, right?
Kyle: Adaptive Metrics is an amazing feature. It not only saves us hundreds of thousands of dollars a year, but it’s also a forcing function for us to look closely at our metrics to find additional opportunities for time series reduction and cardinality improvements.
Ron: We used to spend a lot of time reacting to alerts, and less time being proactive and thinking about how we use our metrics. Now it’s fun to see developers being mindful when they make changes or add a new service and how that will impact metrics for our billing purposes. It’s really changed our mentality.
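For readers less familiar with cardinality, the kind of time series reduction Kyle and Ron describe can be illustrated with a small sketch. This is not Adaptive Metrics itself or anything from Mux's setup: the metric name, the labels (`cluster`, `service`, `pod`), and the helper functions are all hypothetical, chosen only to show how aggregating away one high-cardinality label shrinks the number of distinct series while keeping the totals intact.

```python
from collections import defaultdict

# Hypothetical samples: (metric name, labels, value).
# The metric and label names are illustrative assumptions, not Mux's real metrics.
samples = [
    ("http_requests_total", {"cluster": "a", "service": "api", "pod": "api-1"}, 10.0),
    ("http_requests_total", {"cluster": "a", "service": "api", "pod": "api-2"}, 15.0),
    ("http_requests_total", {"cluster": "a", "service": "api", "pod": "api-3"}, 5.0),
    ("http_requests_total", {"cluster": "a", "service": "api", "pod": "api-4"}, 20.0),
    ("http_requests_total", {"cluster": "b", "service": "api", "pod": "api-1"}, 7.0),
    ("http_requests_total", {"cluster": "b", "service": "api", "pod": "api-2"}, 3.0),
]


def series_key(name, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


def count_series(samples):
    """Count distinct time series (the cardinality) in a list of samples."""
    return len({series_key(name, labels) for name, labels, _ in samples})


def aggregate_without(samples, drop_labels):
    """Sum values across the dropped labels, keeping every other label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        kept = {k: v for k, v in labels.items() if k not in drop_labels}
        totals[series_key(name, kept)] += value
    return [(name, dict(items), total) for (name, items), total in totals.items()]


# Dropping the per-pod label collapses the series to one per cluster and service.
reduced = aggregate_without(samples, {"pod"})

print("series before:", count_series(samples))
print("series after: ", count_series(reduced))
for name, labels, total in reduced:
    print(name, labels, total)
```

In this toy data, a dashboard that only ever groups by `cluster` would show the same totals from the reduced series, which is the kind of trade-off a team has to weigh when deciding which labels are worth keeping.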
At Mux, your team builds most things in-house. How do you weigh decisions about when to invest in a hosted solution instead?
Ryan: We want to maintain flexibility within our stack so we’re not locked in to any given vendor and we can evolve and adapt with the needs of the business. But there are certain areas where that isn’t valuable, and observability is one of those. If we can free people from managing those types of problems, then they can dedicate more time to driving business and product value.
Grafana Cloud was a really natural pivot for us since we were already orchestrated with the open source version, so it was actually a pretty easy transition. We didn’t have to do a lot of refactoring. So the decision to go with a hosted solution is really about the cost of doing it ourselves and the pain that that imposes on us.
Grafana SLO is another tool in Grafana Cloud that you’re looking to adopt. Can you tell us more about that?
Ron: That’s a consolidation effort as well. We use other vendors and moving SLOs in-house to Grafana Cloud just makes sense for us — one, for the cost, and two, it’s just easier to use and manage since we’re already on the platform.
Ryan: With Grafana SLO, we can leverage the same sort of metrics that we’re leveraging elsewhere and just layer SLOs on top of that. We don’t have to wire that up in any other way; the tools just naturally perform better together. And there’s a common user experience, which makes it a lot easier on everyone involved.
Now that you’ve been working in Grafana Cloud for a while, what other benefits have you seen?
Ryan: We never really quantified MTTR before and after, but the general sentiment is that it has absolutely sped up. Still, I can share one concrete example. With our old retention times, there were many times when we would get a support case and we’d have to tell the customer that we no longer had that data to troubleshoot the issue. That wasn’t an acceptable answer for our customers, and thankfully that’s no longer a problem for us.
Brian: It’s allowed us to move away from treading water with our stack. It used to be about keeping it running. And usually we were on old versions of everything; nobody wanted to upgrade because the further you fall behind, the harder it gets and the more time it takes. By allowing developers to move faster, identify problems really quickly, and even proactively do it themselves — that shift allows us all to get to a much better place, and it just makes life better.