Monitoring Kubernetes: Why traditional techniques aren't enough
Engineers and observability experts could spend hours (or at least one podcast episode) talking about the benefits of Kubernetes, and how it allows you to deploy at scale and manage a large number of applications and services. But keeping an eye on your clusters — which includes being able to track performance metrics and retrieve logs that can help you debug application crashes — isn’t something that happens without a little extra effort.
And that’s the topic of the latest episode of “Grafana’s Big Tent”: Why do we need to monitor Kubernetes? Leading the discussion are co-hosts Mat Ryer, Grafana Labs Engineering Director, and Tom Wilkie, Grafana Labs CTO. They are joined by Vasil Kaftandzhiev, Grafana Labs Staff Product Manager for Kubernetes & AWS monitoring solutions, and Deo Tsoumas, a Grafana Champion with a background in engineering, who leads the DevOps team at the consumer research company GWI.
You can read some of the show’s highlights below, but listen to the full episode to find out more about the changing landscape of observability in cloud-native architectures, the pros and cons of service meshes, the excellence of exemplars, and the one cost of Kubernetes that many people overlook.
Note: The following are highlights from episode 5, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
The shift from traditional monitoring
Mat Ryer: Why do we need to do a podcast episode on monitoring Kubernetes? Aren’t traditional techniques enough?
Deo Tsoumas: If we are engineers and we deploy services, and we own those services, monitoring is part of our job. And it should come out of the box.
Tom Wilkie: It’s a really interesting point: Who’s responsible for observability nowadays? In the initial cloud generation, the responsibility for understanding the behavior of your applications almost fell to the developers, and I think that explains a lot of why APM exists.
But do you think in the world of Kubernetes, that responsibility is shifting more to the platform, more to out-of-the-box capabilities?
Deo: It should be that, 100%. Engineers who deploy or push code should know where the dashboards are and how to set up alerts. But most of the time you just deploy something, and you get a ton of very good observability goodies out of the box. Maybe it wasn’t that easy in the past; it’s very easy to do now. The ecosystem is in a very good position to support a big engineering team out of the box with a very small DevOps team.
Tom: What is it about the infrastructure, and the runtime, and all of the goodies that come with Kubernetes that mean observability can be more of a service that a platform team offers and not something every individual engineer has to care about?
Deo: It should be owned by the team who writes this kind of stuff. Now, why Kubernetes? I think now we’re in a state where the open source community is very passionate about this. People know that you should do proactive monitoring, you should care, and Kubernetes made this easier. Auto-healing is now a possibility. So as an engineer, maybe you don’t need to care that much about what’s going on. You should, though, know how to fix it.
Cloud bills and superheroes
Vasil Kaftandzhiev: Cloud provider resource costs are a topic that comes to mind when we’re talking about monitoring Kubernetes. It is such a robust infrastructure phenomenon that it touches absolutely every part of every company. On top of everything else, developers now usually have the responsibility to think about their cloud bill as well, which is a big shift.
Deo: It’s very easy to have monitoring and observability out of the box, but cost can be a difficult pill to swallow in the long run. I’ve seen many cases where it gets very expensive, and it scales quickly.
Tom: Understanding the cost of running a Kubernetes system is an art unto itself. I will say, though, there are certain aspects of Kubernetes that make this job significantly easier. Everything lives in a namespace, and it’s relatively easy — be it via convention, or naming, or extra labeling and extra metadata — to attribute the cost of a namespace back to a service, or a team. For me, the huge unlock for Kubernetes cost observability was the fact that this kind of attribution is easier.
Deo: Cloud providers support this functionality out of the box, so you have your deployment in a namespace, and then you’re wondering, “My team owns five microservices. How much do we pay for them?” You see it costs only five pounds per month — which is very cheap — but there is an asterisk that says, “Unfortunately, this only covers your pod requests.”
Engineers need to own their services, which means they need to care about requests and limits. And if those numbers are correct — and it’s very difficult to get them right — then the cost will be right as well.
Tom: But a lot of this doesn’t rely on the cloud provider to provide it for you, right? You can deploy a lot of these tools yourself. You can go and take something like OpenCost, run it in your Kubernetes cluster, link it to a Prometheus server, and build some Grafana dashboards. You don’t have to wait for GCP or AWS to provide this report for you. That’s one of the beauties of Kubernetes, in my opinion — this ability to build on and extend the platform, and not have to wait for your service provider to do it for you.
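To make that concrete, here is a minimal sketch of the attribution Tom and Deo describe: summing pod requests per namespace from kube-state-metrics (via Prometheus) and multiplying by a price. The Prometheus URL and the hourly prices are placeholders, and the metric name assumes a recent kube-state-metrics version; a tool like OpenCost does this properly against real billing data.

```python
# Rough namespace cost attribution from kube-state-metrics request metrics.
# Assumes Prometheus scrapes kube-state-metrics; URL and prices are placeholders.
import requests

PROM_URL = "http://prometheus:9090"     # assumption: in-cluster Prometheus
CPU_PRICE_PER_CORE_HOUR = 0.03          # placeholder price, not a real rate card
MEM_PRICE_PER_GIB_HOUR = 0.004          # placeholder price

def instant_query(promql: str) -> dict:
    """Run an instant PromQL query and return {namespace: value}."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["namespace"]: float(r["value"][1]) for r in results}

# Requested CPU cores and memory GiB per namespace (the "pod requests" asterisk).
cpu_by_ns = instant_query(
    'sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})'
)
mem_by_ns = instant_query(
    'sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}) / 2^30'
)

for ns in sorted(set(cpu_by_ns) | set(mem_by_ns)):
    hourly = (cpu_by_ns.get(ns, 0) * CPU_PRICE_PER_CORE_HOUR
              + mem_by_ns.get(ns, 0) * MEM_PRICE_PER_GIB_HOUR)
    print(f"{ns}: ~${hourly * 24 * 30:.2f}/month (requests only)")
```

The namespace-level attribution is the part Kubernetes makes easy; the real unlock Tom describes is that the same labels flow through to whatever costing tool you choose.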
Deo: The technology is there, but you need to be very vocal about championing these kinds of things. It takes a lot of effort to get them right.
Vasil: I really love the effort reference. Today, if we’re talking about observability solutions, we start to blend the knowledge of the technology deeply into the observability stack and the observability products that are there. This is the only way around it. With the developers and SREs wearing so many superhero capes, the only way forward is to provide them with some kind of robust solutions to what they’re doing. I’m really amazed by the complexity and freedom and responsibilities that you people have. It’s amazing. As Peter Parker’s uncle said, “With a lot of power, there is a lot of responsibility.” So Deo, you’re a Spider-Man.
Deo: I completely agree. You’re introducing all these tools, and engineers can get a bit crazy. So it’s very nice when you hide this complexity. They don’t need to know about OpenCost, for example. They don’t need to know about dashboards with cost allocation. The only thing they need to know is that if they open a PR and they add something that will escalate cost, it will just fail.
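Here is a rough sketch of the kind of guardrail Deo is describing: fail the pipeline when the total requested CPU in a repo’s manifests exceeds a budget. The k8s/ manifest path and the budget are hypothetical, and a real setup might gate on OpenCost data or a policy engine instead.

```python
# Minimal CI guardrail: fail the build if the total requested CPU across the
# repo's manifests exceeds a (hypothetical) budget. Paths and budget are assumptions.
import glob
import sys
import yaml  # pip install pyyaml

CPU_BUDGET_CORES = 4.0  # placeholder team budget

def parse_cpu(value: str) -> float:
    """Convert Kubernetes CPU quantities like '500m' or '2' to cores."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)

total = 0.0
for path in glob.glob("k8s/**/*.yaml", recursive=True):
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc or doc.get("kind") not in ("Deployment", "StatefulSet"):
                continue
            replicas = doc["spec"].get("replicas", 1)
            for c in doc["spec"]["template"]["spec"]["containers"]:
                cpu = c.get("resources", {}).get("requests", {}).get("cpu", "0")
                total += parse_cpu(cpu) * replicas

if total > CPU_BUDGET_CORES:
    sys.exit(f"Requested {total:.2f} CPU cores, budget is {CPU_BUDGET_CORES}")
print(f"OK: {total:.2f} cores requested")
```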
Vasil: There’s an additional trend I’m observing, which is that engineers become so focused on cost that it damages the reliability and high availability, or any availability, of their products. That’s a strange shift, and it emphasizes that sometimes we neglect the right thing, which is producing good software.
Why Kubernetes deserves special attention
Mat: What are the challenges specifically when it comes to monitoring Kubernetes? What makes it different? Why is it a thing that deserves its own attention?
Tom: I would divide the problem in two: There’s monitoring the Kubernetes cluster itself — the infrastructure behind Kubernetes that’s providing you all these fabulous abstractions — and then there’s monitoring the applications running on Kubernetes.
When you’re looking at the Kubernetes cluster itself, this is often an active part of your application’s availability, especially if you’re doing things like auto-scaling and scheduling new jobs in response to customer load and customer demand. The availability of things like the Kubernetes scheduler, the API servers, the controller managers, and so on matters. You need to build robust monitoring around that; you need to actively enforce SLOs around that to make sure that you can meet your wider SLO.
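As a sketch of what enforcing that might look like, the snippet below checks an availability objective for the API server from its own apiserver_request_total metrics. The Prometheus URL and the 99.9% objective are assumptions, not recommendations.

```python
# Check an availability SLO for the Kubernetes API server from its own
# Prometheus metrics. The Prometheus URL and the objective are assumptions.
import requests

PROM_URL = "http://prometheus:9090"
OBJECTIVE = 0.999  # example SLO, not a recommendation

# Error ratio over the last hour: 5xx responses vs. all requests.
ERROR_RATIO = (
    'sum(rate(apiserver_request_total{code=~"5.."}[1h]))'
    ' / sum(rate(apiserver_request_total[1h]))'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": ERROR_RATIO}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
error_ratio = float(result[0]["value"][1]) if result else 0.0

availability = 1 - error_ratio
print(f"API server availability (1h): {availability:.5f}")
if availability < OBJECTIVE:
    print("SLO at risk: page the platform team")
```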
The second aspect is where the fun begins: the Kubernetes system tells you about your jobs and about your applications, and using that to make it easier for users to understand what’s going on with their applications is where the real payoff is.
Deo: Completely agree. If you say to engineers, “You know what? Now you can have very nice dashboards about the CPU, the nodes, throughput, and stuff like that,” they don’t care.
If you tell them, “Expose your own metrics that say ‘scale based on memory’ or ‘scale based on traffic,’” right away they become very intrigued, because they know the bottleneck of their own services. If you tell them, “You can scale based on the services — and by the way, you can have very nice dashboards that go with CPU and memory, and here is your metric as well,” this is where things become very interesting.
And then you start implementing new things like the horizontal pod autoscaler or the vertical pod autoscaler. Or you can show what the service mesh looks like, and then you can scale, and you get other metrics out of the box.
Most engineers don’t have golden metrics out of the box, and that’s a very big minus for most teams. Some teams don’t care. But golden metrics mean throughput, error rate, success rate, stuff like that. In the bigger Kubernetes ecosystem, you can have them for free, and if you scale based on those metrics, it’s an amazing superpower. You can do whatever you want as an engineer, and you don’t even need to care where those things are allocated, how they’re being stored, or how they’re being served.
You only need some nice dashboards, some basic high-level knowledge about how you can expose them or how you can use them, and then just be a bit intrigued so you can take the next step and scale your service.
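For illustration, here is a minimal sketch of pulling two of those golden signals, throughput and error rate, per service from Prometheus. The http_requests_total metric and its service and status labels are assumptions; the real names depend on your ingress, mesh, or application instrumentation.

```python
# Pull per-service golden signals (throughput and error rate) from Prometheus.
# "http_requests_total" with "service" and "status" labels is an assumption:
# the real metric depends on your ingress / mesh / app instrumentation.
import requests

PROM_URL = "http://prometheus:9090"  # assumption

def query(promql: str) -> list:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

throughput = query('sum by (service) (rate(http_requests_total[5m]))')
errors = query(
    'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum by (service) (rate(http_requests_total[5m]))'
)

rates = {r["metric"]["service"]: float(r["value"][1]) for r in throughput}
error_rates = {r["metric"]["service"]: float(r["value"][1]) for r in errors}

for svc, rps in sorted(rates.items()):
    print(f"{svc}: {rps:.1f} req/s, {error_rates.get(svc, 0.0):.2%} errors")
```

The same queries can feed a horizontal pod autoscaler through a metrics adapter, which is the “scale based on traffic” step Deo mentions.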
Kubernetes’ observability evolution
Mat: Have things changed a lot, or was good observability baked into Kubernetes from the beginning, and it’s evolving and getting better?
Tom: One of the things I’m incredibly happy about with Kubernetes is that all of the Kubernetes components, effectively from the very early days, were natively instrumented with really high-quality Prometheus metrics. That relationship between Kubernetes and Prometheus dates all the way back to their inception; both were heavily inspired by internal Google technology. They both heavily make use of this concept of a label. Prometheus was built for these dynamically scheduled orchestration systems, because it heavily relies on this kind of pull-based model and service discovery. I 100% credit the popularity of Prometheus to the popularity of Kubernetes. It’s definitely a wave we’ve been riding.
Then there’s this concept of Kubernetes having rich metadata about your application — your engineers have spent time and effort describing the application to Kubernetes in the form of YAML manifests for deployments, and stateful sets, and namespaces, and services, and all of this stuff gets described to Kubernetes. One of the things that makes monitoring Kubernetes quite unique is that the description of the service can then be effectively read back into your observability system using things like kube-state-metrics. This is an exporter for the Kubernetes API that will tell Prometheus, “This deployment is supposed to have 30 replicas. This deployment is running on this machine and is part of this namespace…” It’ll give you all of this metadata about the application as Prometheus metrics.
This is quite unique. This is incredibly powerful. And it means the subsequent dashboards and experiences that you build on top of those metrics can natively enrich things like CPU usage — really boring — but you can actually take that CPU usage and break it down by service really easily. That’s what I think gets me excited about monitoring Kubernetes.
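Here is a small sketch of what “reading the description back” looks like in practice: comparing desired and available replicas from kube-state-metrics, and breaking the “boring” CPU usage down by namespace. The metric names are the standard kube-state-metrics and cAdvisor ones; the Prometheus URL is an assumption.

```python
# Read the application "description" back out of kube-state-metrics and join it
# with usage data: desired vs. available replicas, plus CPU by namespace.
# Metric names are standard kube-state-metrics / cAdvisor; the URL is an assumption.
import requests

PROM_URL = "http://prometheus:9090"

def query(promql: str) -> list:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# "This deployment is supposed to have 30 replicas" vs. what is actually available.
drift = query(
    "kube_deployment_spec_replicas != kube_deployment_status_replicas_available"
)
for r in drift:
    m = r["metric"]
    print(f"{m['namespace']}/{m['deployment']}: spec replicas {r['value'][1]} not all available")

# The "boring" CPU usage, broken down by namespace instead of by machine.
cpu = query("sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))")
for r in cpu:
    print(f"{r['metric']['namespace']}: {float(r['value'][1]):.2f} cores")
```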
Deo: I agree 100%. And the community has stepped up a lot. I had an ex-colleague who used to say DevOps work is so easy these days because loads of people in the past made such a big effort to give the community all those nice dashboards and alerts you get out of the box.
I want to add that even though kube-state-metrics and Prometheus do a very good job of natively integrating with Kubernetes, it’s not enough in most cases. Let’s say one of the nodes goes down: you get an alert and you know that a few services are affected. Unless you give engineers the tools to easily figure out what is wrong, it’s not enough. In most cases, you need a single place where you have dashboards built from kube-state-metrics and Prometheus metrics, but also logs. You need logs. And then you need performance metrics, you need your APM metrics…
I think the Grafana ecosystem is doing a very good job. In our case, we have very good dashboards that have all the Prometheus metrics, logs from Loki, and traces, and then you can jump from one to another. We have Pyroscope as well. People can jump in and, right out of the box, find out what is wrong — it’s very powerful. They don’t need to know what Pyroscope is or what profiling is. You just need to give them the ability to explore the bottlenecks in their applications.
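As a sketch of that jump between signals, the snippet below finds recently restarted pods via kube-state-metrics and then pulls their latest logs from Loki using the same namespace and pod labels. The URLs are assumptions, and it presumes your log collector (Promtail or Alloy, typically) attaches those labels to each stream; in Grafana itself this correlation is point-and-click, the code just shows the idea.

```python
# Correlate signals by their shared Kubernetes labels: find pods that have
# restarted recently (Prometheus / kube-state-metrics), then pull their recent
# logs from Loki using the same namespace and pod labels. URLs are assumptions.
import time
import requests

PROM_URL = "http://prometheus:9090"
LOKI_URL = "http://loki:3100"

# Pods with container restarts in the last 15 minutes, per kube-state-metrics.
promql = "increase(kube_pod_container_status_restarts_total[15m]) > 0"
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
resp.raise_for_status()

now_ns = int(time.time() * 1e9)
for r in resp.json()["data"]["result"]:
    ns, pod = r["metric"]["namespace"], r["metric"]["pod"]
    print(f"Restarting: {ns}/{pod}")

    # Same labels, different signal: fetch the pod's recent logs from Loki.
    logs = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": f'{{namespace="{ns}", pod="{pod}"}}',
            "start": now_ns - 15 * 60 * 10**9,
            "end": now_ns,
            "limit": 20,
        },
        timeout=10,
    )
    logs.raise_for_status()
    for stream in logs.json()["data"]["result"]:
        for _, line in stream["values"]:
            print("  ", line)
```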
Tom: There’s a really rich set of dashboards in the Kubernetes mixin that work with pretty much any Kubernetes cluster and give you the structure of your application running in Kubernetes. You can see how much CPU each service uses, what hosts they’re running on, you can drill down into this really, really easily.
If you use the metadata that Kubernetes has about your application in your observability system, it makes it easier for developers to know what the right logs are, to know where the traces are coming from, and it gives them that mental model to help them navigate all of the different telemetry signals.
If there’s one thing you take away from this podcast, that’s the thing that makes monitoring and observing Kubernetes — and applications running in Kubernetes — easier, and special, and different, and exciting.
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.