Kubernetes, Kepler, and carbon footprints: the latest tools and strategies to optimize observability
If you’ve never heard Kubernetes pronounced in Greek, or listened to an extended metaphor about restaurant seating and resource utilization, you don’t want to miss this week’s episode of “Grafana’s Big Tent” podcast about all things optimization.
Grafana Labs CTO Tom Wilkie is joined by Thomas Dullien, former co-founder and CEO of Optimyze; Bryan Boreham, distinguished engineer at Grafana Labs and a Prometheus maintainer; and Niki Manoledaki, senior software engineer at Grafana Labs and a contributor to the CNCF Environmental Sustainability Technical Advisory Group (TAG ENV) for an in-depth discussion on reducing resource waste and tracking carbon metrics.
From fleet-wide profiling to cutting-edge tools like Kepler and Karpenter, this episode covers all the latest strategies for optimizing observability.
Read some of the show’s highlights below, but be sure to listen in to the full episode for more specifics on how Grafana approaches carbon metrics and why Bryan recently made Grafana Labs Engineering Director Mat Ryer call him “captain.”
Note: The following are highlights from episode 8, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
Optimization starts with removing the biggest “stupid”
Tom Wilkie: Bryan, how did you halve Prometheus’ CPU and memory usage?
Bryan Boreham: Profiling is one of my favorite tools for looking at what is actually being used, either in CPU or in memory. For Prometheus, it showed that a tremendous amount of the memory was in the metadata. But it wasn’t just finding multiple copies of the same string and reducing them to one copy. The real gain came from reducing overheads in Go. It’s a grind — constantly questioning, “Why do we need this?” Eventually, small changes add up.
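Deduplicating repeated strings is one general way to attack that kind of metadata overhead in a Go heap. Here’s a minimal sketch of string interning, purely illustrative rather than the actual Prometheus change:

```go
// A sketch of string interning: keep one canonical copy of each distinct
// string so that repeated label names and values share a single allocation.
// Illustrative only; not the actual Prometheus patch.
package main

import "fmt"

type interner struct {
	pool map[string]string
}

func newInterner() *interner {
	return &interner{pool: make(map[string]string)}
}

// intern returns the canonical copy of s, storing s if it is new.
func (in *interner) intern(s string) string {
	if canonical, ok := in.pool[s]; ok {
		return canonical // reuse the existing backing memory
	}
	in.pool[s] = s
	return s
}

func main() {
	in := newInterner()
	// Metric metadata repeats heavily: the same label names and values
	// appear across huge numbers of series.
	labels := []string{"instance", "job", "instance", "job", "instance"}
	for i, s := range labels {
		labels[i] = in.intern(s)
	}
	fmt.Println(len(in.pool), "unique strings retained") // 2
}
```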
Niki Manoledaki: Which feedback loops did you use to evaluate the effectiveness of your changes?
Bryan: I have two or three, but at the end of the day, it’s going into production at Grafana Labs. That’s the ultimate feedback loop, where we’re running this stuff with trillions of data points coming in every day. But before you put it into production, you can put it into a development environment. And that’s my typical acid test.
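Before a change ever reaches a development environment, one cheap first feedback loop in Go is a microbenchmark that reports allocations. A sketch, where parseLabels is a hypothetical stand-in for whatever hot path is being optimized (this file would live in labels_test.go):

```go
// A sketch of a pre-production feedback loop: a Go microbenchmark that
// reports allocations. parseLabels is a placeholder, not real Prometheus code.
package labels

import (
	"strings"
	"testing"
)

// parseLabels stands in for whatever code is under the knife.
func parseLabels(s string) []string {
	return strings.Split(s, ",")
}

func BenchmarkParseLabels(b *testing.B) {
	b.ReportAllocs() // surface allocs/op and B/op alongside ns/op
	for i := 0; i < b.N; i++ {
		parseLabels("job=node,instance=10.0.0.1:9100,region=eu-west-1")
	}
}
```

Running it with `go test -bench . -count 10` before and after a change, then comparing the two outputs with benchstat, gives a statistically honest read long before anything hits production.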
Thomas Dullien: Optimization is an empirical science. You identify the largest inefficiency — what I call “the biggest stupid” — and fix it.
When I say “stupid,” I’m not blaming the developers. Their job is to make it run, and then afterward, make it run fast. So when you first look at a system, you find large chunks of things to optimize.
But it’s difficult to quantify the impact of a change without putting it into production, because modern systems are such complex beasts, with so many secondary effects. Once it is in production, though, optimizing is really gratifying, because it’s very measurable. You know, at Google, everybody that wanted to get promoted wanted to have a quantifiable impact. Nothing gives you quantifiable impact like fleet-wide profiling and then optimizing stuff.
Bryan: It’s really ideal to make one change, and measure it, and then one change, and measure it, and so on. And that’s how I end up with about 10 balls in the air at any one time!
Platform-level wins with Kubernetes
Tom: When we first started focusing on optimization at Grafana, we were wasting almost 40% of our resources. Niki, what did we do to improve that?
Niki: It’s been trial and error: we started with Cluster Autoscaler, then moved to Karpenter, and deployed tools like the Vertical Pod Autoscaler along the way. Karpenter’s consolidation algorithm has helped us get to an average of 20% idleness, meaning 80% of our provisioned resources are allocated.
Tom: Let’s explain the difference between allocation and utilization.
Thomas: Let’s say you go to a restaurant and reserve a table for 10 people. That’s your allocation. And then the number of people actually seated at any point in time would be the utilization.
Niki: We’ve been able to improve utilization quite a bit at the node level using the Kubernetes descheduler, which evicts workloads based on specific configurations.
Thomas: It’s interesting that the Kubernetes community has added the descheduler. I said earlier that optimizing starts with looking for the biggest “stupid,” and this just illustrates that all the big “stupids” you find were not stupid when they were made. Because no software ever runs on the hardware that it was designed for, and no configuration parameter is ever properly tuned for the machines that it runs on. Things evolve over time, and a lot of the optimization work is really, to some extent, janitorial. You’re sweeping up the crud that has accumulated over the years.
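To put rough numbers on the restaurant metaphor: the gap between allocation (the reservation) and utilization (people actually seated) is the waste that consolidation and descheduling claw back. A small sketch with invented figures:

```go
// Putting rough numbers on the metaphor. All figures are made up for the
// example: a 64-core node with 80% of cores reserved but only half of the
// reservation actually in use.
package main

import "fmt"

func main() {
	const (
		nodeCores      = 64.0 // capacity: seats in the restaurant
		requestedCores = 51.2 // allocation: the size of the reservation
		usedCores      = 25.6 // utilization: people actually seated
	)

	allocation := requestedCores / nodeCores * 100 // 80%
	idleness := 100 - allocation                   // 20%
	utilization := usedCores / nodeCores * 100     // 40%

	fmt.Printf("allocation %.0f%%, idleness %.0f%%, utilization %.0f%%\n",
		allocation, idleness, utilization)
	// The 40-point gap between allocation and utilization is what
	// consolidation (Karpenter) and eviction (descheduler) try to shrink.
}
```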
Sustainability metrics: measuring impact
Niki: Thomas, how do you approach sustainability in computing?
Thomas: We have imprecise metrics for carbon, but that’s better than no metrics. We can calculate carbon impact from profiling data, but it’s very much a ballpark metric right now.
To estimate energy consumption, we track how many CPU core-seconds each piece of software uses per day, then combine that with an approximate power draw per core under full utilization. Workloads vary, but this gives us a range for kilowatt-hour usage. Converting that to CO2 is where things get tricky, so what we do is take the average carbon intensity for a kilowatt hour in a specific region and multiply it out. Carbon intensity data is publicly available.
Of course the data we get is imprecise. But it will be a largely constant imprecision, so if I make a big optimization and it drives down my metrics significantly, I can see the impact.
I’m also hopeful that we can get cloud providers to share more accurate ways of measuring electricity consumption and set things up to give large customers a ballpark estimate for how much power they are drawing.
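That arithmetic is easy to make concrete. A back-of-the-envelope sketch of the chain from core-seconds to kilowatt-hours to CO2, where the wattage and carbon intensity are assumed placeholder values (real figures vary by CPU model and by region and hour):

```go
// A back-of-the-envelope version of the estimate: CPU core-seconds to
// kilowatt-hours to CO2. wattsPerCore and gramsCO2PerKWh are assumed
// placeholder values, not measurements.
package main

import "fmt"

func main() {
	const (
		coreSecondsPerDay = 4.0e9 // measured via fleet-wide profiling
		wattsPerCore      = 5.0   // assumed draw per core at full utilization
		gramsCO2PerKWh    = 300.0 // assumed regional grid carbon intensity
	)

	// core-seconds * watts = joules; 1 kWh = 3.6e6 joules.
	kWhPerDay := coreSecondsPerDay * wattsPerCore / 3.6e6
	kgCO2PerDay := kWhPerDay * gramsCO2PerKWh / 1000

	fmt.Printf("~%.0f kWh/day, ~%.0f kg CO2/day\n", kWhPerDay, kgCO2PerDay)
	// An optimization that cuts core-seconds by 10% cuts this estimate by
	// 10% too; the imprecision is roughly constant, so deltas are visible.
}
```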
Niki: The new CNCF project Kepler uses eBPF and machine learning to give us energy metrics for Kubernetes workloads running on cloud infrastructure, which is really exciting. We’re working closely with them in the CNCF sustainability TAG to include energy consumption metrics in benchmarking, and in profiling too.
Thomas: That’s great. Gathering power draw characteristics for new microarchitectures as they come out is something that’s better done as a community than by any individual vendor, so I’m looking forward to seeing where Kepler goes.
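For readers who want to poke at Kepler’s output themselves, its counters land in Prometheus like any other metric. Here’s a sketch using the client_golang API; the metric name kepler_container_joules_total, its container_namespace label, and the server address are assumptions to verify against your own Kepler deployment:

```go
// A sketch of querying Kepler's per-container energy counters from
// Prometheus. The metric and label names are assumptions; check what your
// Kepler deployment actually exports.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// rate() over a joules counter yields watts; sum by namespace for a
	// rough per-team power view.
	query := `sum by (container_namespace) (rate(kepler_container_joules_total[5m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result) // approximate watts per namespace
}
```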
Fleet-wide profiling: solving “murder mysteries”
Tom: When did we suddenly become able to gather CPU profiles from thousands of computers constantly, continuously?
Thomas: We launched the Optimyze product in August 2021, which made fleet-wide profiling available to everyone. But Google has had the Google-Wide Profiler (GWP) since 2010. Google only needed to measure CPU consumption for C++, but at Optimyze, we realized that eBPF could be used to do stack unwinding for high-level languages, so we added unwinders for other languages in eBPF: Java, Python, etc. It was very grimy work! But it was made available in August of 2021, and it got contributed to OpenTelemetry a few months ago.
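Fleet-wide eBPF profilers observe processes from the outside, but the same always-on idea is easy to try inside a single Go service by exposing the runtime’s built-in pprof endpoints. A minimal sketch (the port is arbitrary):

```go
// Expose Go's built-in pprof handlers so CPU (and heap) profiles can be
// pulled from a running service at any time. Minimal sketch; the port and
// address are arbitrary choices.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// A 30-second CPU profile can now be captured on demand, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```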
Tom: And what can you do with these fleet-wide continuous profiles?
Thomas: Most importantly, you can see who’s eating your CPU, and where, and how and what time. You find a whole lot of murder mysteries! It’s like switching on the light in a dark basement.
You learn about the importance of allocators, because garbage collection and allocation are a significant chunk of any fleet. You find out how important certain open source libraries are, because in a large enough infrastructure, the biggest open source libraries will eclipse even the biggest individual process.
I loved the murder mystery aspect, because you find something weird in any big infrastructure, and it’s an easy win. We actually had a product design rule — 10 minutes to dopamine. Within 10 minutes of rolling out fleet-wide profiling, an engineer can find something to optimize.
Niki: I think the thesis from this conversation is that while we can optimize at the cluster level, the node level, and the resource-request level, at the end of the day it’s really the workload level itself that should not be forgotten when we’re talking about cloud environments, because that’s where large-scale cost reduction and resource optimization can happen.
Thomas: I’m a huge fan of continuously profiling everything, because the overhead is quite manageable, and the insights you’re getting are absolutely fascinating. Something goes weird on a Sunday morning, CPU utilization spikes, and you can go back and see what’s happening there. That’s pretty great.
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.