Explore your infrastructure with Kubernetes Monitoring
Kubernetes Monitoring offers visualization and analysis tools for you to:
- Evaluate the health, efficiency, and cost of Kubernetes infrastructure components.
- Analyze historical data as well as forecasts.
- View predictions created with machine learning.
- Manage alerts.
Navigate to Kubernetes Monitoring
- Navigate to your Grafana Cloud portal.
- In the menu, select the stack you want to work with.
- Click the upper-left menu icon.
- In the main menu, expand Infrastructure, then click Kubernetes.
Explore using the Kubernetes structure
Kubernetes Monitoring pages mirror the hierarchy of Kubernetes objects, so you can begin at any level above containers. Main pages include lists of Clusters, namespaces, workloads, and Nodes.
For example, the Cluster main page shows the list of your Clusters. When you click on a Cluster in the list, it opens the Cluster detail page. That page shows the details for the Cluster along with a list of Nodes within that Cluster.
You can continue to drill into a Node and see the list of Pods for that Node, all the way to the container level.
There are also main pages for Cluster configuration as well as managing alerts, cost, and efficiency. For additional navigation tips, refer to Navigation tips for Kubernetes Monitoring.
Start with a high-level snapshot
The Kubernetes Overview page gives you a high-level view of your Clusters, usage, and alerts. This page brings to the forefront key data about your infrastructure.
Refine counts of Kubernetes objects
Adjust the time range and filter by Cluster and namespace to narrow the counts and include historical data for:
- Clusters, Nodes, namespaces, workloads, Pods, and containers
- Deployed container images
Find usage spikes
Use the time range selector to focus on a time period while looking for patterns or spikes in CPU and memory usage in your Clusters. When spikes occur:
Zoom in on the graph to narrow the time selection.
Hover over and click the peak of the spike to see the percentage of use compared to capacity. In the following example, the spike shows 46.5% of CPU usage compared to capacity.
Click the link to view the Cluster. The Cluster page shows the time range you set when zooming in on the graph. You can continue by sorting the list of Nodes in this Cluster by highest CPU usage to investigate the issue causing the spike.
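The percentage shown at the peak of a spike is usage divided by capacity. The following is a minimal sketch of that arithmetic, not the product's implementation. The function name and the input values (1.86 cores used out of 4 cores) are hypothetical, chosen only to reproduce the 46.5% figure from the example above.

```python
def usage_percent(used: float, capacity: float) -> float:
    """Return usage as a percentage of capacity, rounded to one decimal place."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return round(used / capacity * 100, 1)


# Hypothetical values: 1.86 CPU cores in use on a Cluster with 4 cores of capacity.
print(usage_percent(1.86, 4.0))  # prints 46.5
```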
Review and drill into alerts
Sort the Firing Since column of alerts to focus on either the most current or the oldest alerts that are firing.
Click the container or Pod name related to the alert to jump directly to the detail page.
Manage alerts
View and respond to all Kubernetes-related alerts from the Alerts page and the Kubernetes Overview page.
You can also:
- Manage preconfigured alerting rules
- Copy a preconfigured alert
- Create a new alert
Analyze costs
On the Cost page, use the Overview and Savings tabs to understand what Kubernetes is costing and how you can save. You can see the cost of each item in a list view as well as on the detail pages.
Understand efficiency and resource use
Optimize resource usage and efficiency by:
- Correlating average and maximum resource usage to understand performance and troubleshoot stability issues.
- Observing resource usage for each Kubernetes object.
- Discovering any stranded resources in your fleet.
Throughout Kubernetes Monitoring, resource usage statistics are available for Kubernetes objects.
Learn what’s predicted
CPU and memory prediction can help you ensure resources are available during spikes in usage, and help you reduce the amount of unused resources caused by overprovisioning. To use prediction tools, first enable the Machine Learning plugin.
The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers:
- Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
- Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.
With Grafana Play, you can explore and see how it works, learning from practical examples to accelerate your development. This feature can be seen on this Node details page.
Detect outlier Pod CPU usage
Identify any Pods that have CPU usage different from other Pods.
To do so, on a workload detail page, click the Detect Outlier CPU Usage amongst Pods button.
In Grafana Play, this feature can be seen on this namespace details page.
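To build intuition for what "CPU usage different from other Pods" can mean, the following sketch flags Pods whose usage sits far from the median of their peers. This is not the algorithm behind the Detect Outlier CPU Usage amongst Pods button, which the documentation does not describe. It is a common heuristic (a modified z-score based on the median absolute deviation), and the function name, threshold, and sample values are all assumptions.

```python
from statistics import median


def cpu_outliers(cpu_by_pod: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return names of Pods whose CPU usage deviates strongly from the median.

    Uses a modified z-score: 0.6745 * |x - median| / MAD.
    The threshold of 3.5 is a conventional default, not a product setting.
    """
    values = list(cpu_by_pod.values())
    if len(values) < 3:
        return []  # too few Pods to compare meaningfully
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # all Pods are (nearly) identical; nothing stands out
    return sorted(
        pod
        for pod, value in cpu_by_pod.items()
        if 0.6745 * abs(value - center) / mad > threshold
    )


# Hypothetical CPU usage in cores for the Pods of one workload.
sample = {"pod-a": 0.21, "pod-b": 0.19, "pod-c": 0.20, "pod-d": 0.95}
print(cpu_outliers(sample))  # prints ['pod-d']
```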
Uncover energy usage
On any detail page, click the Energy tab to view the energy usage of:
- Workloads and namespaces
- Clusters
- Nodes
- Pods
- Containers
When you configure Kubernetes Monitoring to gather energy metrics, Kepler gathers and exposes the metrics, and Alloy collects them.
Energy metrics are separated into these categories:
- Package, including CPU cores
- DRAM (memory)
- GPU
- Other
- Total (the sum of all categories)
Use Explore for troubleshooting
Click Explore this query in the Machine Learning plugin to view the raw data and troubleshoot issues. Here you can adjust parameters and see a more detailed graph of the findings.
Analyze historical data
Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range remains the same for the period you set until you change it again.
As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.
Zoom into an area of any graph on the detail pages to narrow the time range selector even further. The time range remains selected until you click Back to default.
In Grafana Play, this feature can be seen on this namespace details page set for the last 12 hours.
Find deleted Kubernetes objects
You can find deleted Clusters, namespaces, workloads, Nodes, Pods, and containers to understand what occurred in the past. To do so, set the time range selector to a past time period.
The following example sets a time range of the previous 30 days and then filters for Nodes with the condition “No data”. The Node detail page shows a graph depicting when the Node expired.
Note
Grafana Cloud has a default 30-day limit for queries. If your Kubernetes object was deleted more than 30 days ago, use the time range selector to choose a specific 30-day time frame in the past.
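If you are not sure when an object was deleted, one approach is to step back through history in consecutive windows that each fit within the 30-day limit. The following sketch only computes such windows so you can enter them in the time range selector one at a time. The function name and the dates are hypothetical, and it does not call any Grafana API.

```python
from datetime import datetime, timedelta

QUERY_LIMIT = timedelta(days=30)  # default query limit stated in the note above


def query_windows(start: datetime, end: datetime, limit: timedelta = QUERY_LIMIT):
    """Split the span from start to end into consecutive windows no longer than limit."""
    if end <= start:
        raise ValueError("end must be after start")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + limit, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


# Hypothetical search period: the first three months of 2024.
for begin, finish in query_windows(datetime(2024, 1, 1), datetime(2024, 3, 31)):
    print(begin.date(), "to", finish.date())
```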
Discover bare and unmanaged Pods
You can find unmanaged (or static) and bare Pods that have been directly created.
Navigate to the Workloads main page, and filter by the Pod type. For example, to locate unmanaged static Pods, filter for StaticPod.
View network bandwidth and saturation
Use the network panels to understand when bandwidth limits are causing network saturation, which can lead to dropped packets. On any detail page for Cluster, namespace, workload, Node, or Pod, click the Network tab to view:
- Network Bandwidth Rx/Tx: Shows the rate of received and transmitted bytes
- Network Saturation Rx/Tx dropped packets: Shows rate of received and transmitted packets dropped
- Network Bandwidth and Network Saturation by Node, workload, or Pod: Shows the bandwidth and saturation by object
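The bandwidth and saturation panels show rates, that is, how fast a cumulative counter of bytes or dropped packets grows. The following simplified sketch shows how a per-second rate can be derived from two counter samples. It is not how the panels are computed internally. The function name and sample values are hypothetical, and the reset handling assumes a counter that restarts from zero.

```python
def per_second_rate(first: tuple[float, float], second: tuple[float, float]) -> float:
    """Compute a per-second rate from two (timestamp_seconds, counter_value) samples."""
    (t1, v1), (t2, v2) = first, second
    elapsed = t2 - t1
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    # If the counter went down, assume it was reset to zero between samples.
    increase = v2 - v1 if v2 >= v1 else v2
    return increase / elapsed


# Hypothetical samples: received bytes counted 60 seconds apart.
print(per_second_rate((0, 1_000), (60, 7_000)))  # prints 100.0 (bytes per second)
```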
In Grafana Play, this feature can be seen on the Network tab of this namespace details page.
View logs and events
From any detail page, click the Logs & Events tab to view the logs and events for that Kubernetes object.
Resolve issues with built-in tools
Navigate easily from Kubernetes Monitoring to other capabilities in Grafana Cloud to analyze, troubleshoot, and solve issues.
Start an automated diagnostic
From a Pod, Cluster, namespace, or workload detail view, you can begin an automated investigation by clicking Run Sift investigation. Sift performs a set of automated system checks, and surfaces potential issues in your Kubernetes environment. It then works to identify the root cause of an incident.
Access root cause analysis tool
Note
To access root cause analysis tools, enable Asserts on your stack.
Within Kubernetes Monitoring, access Asserts Workbench to perform root cause analysis. From any list of Clusters, Nodes, workloads, namespaces, or Pods you choose, select the box to the left of the list item, and click the Compare in Asserts Workbench button. The RCA Workbench opens in a new tab.
Within any details page where the Assertions button appears, click it to continue your investigation into issues.
You can jump to the connections view in Asserts to view connections between entities.
Jump to the application layer
On the detail page for a Pod or workload, click Application Observability to navigate directly to more data, such as the service health.
To return to Kubernetes Monitoring, click the browser back button.
View queries to troubleshoot with Explore
To further query data, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). You see a view that provides additional query tools for troubleshooting.
Navigate to traces
If you enabled traces when you configured Kubernetes Monitoring, you can view them in Explore:
Click the main menu icon.
Click Explore.
Choose the Tempo data source.
With the TraceQL tab selected, enter your search query.
Click Run query.
A table of traces appears.
Click a trace to see the detail.
Manage configuration
If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:
- Data source choices
- Alerts
- Integration installations
- Optional custom log queries
- Configuration instructions for deploying, configuring, and updating the Grafana Kubernetes Monitoring Helm chart
Access more information
Click the documentation links on a page to find more information about what you’re viewing.
Navigation tips
Here are some tips and shortcuts for getting around in Kubernetes Monitoring.
In Grafana Play, this feature can be seen on the Kubernetes Monitoring Overview.
Jump between main pages
From any main page, click the icon beside the page title to see the menu of all main pages. Then click the page you want to open.
Dock the main menu
To keep the main navigation open:
- Click the main menu icon.
- Click the menu docking icon to keep the main menu open.
Filter, sort, and set the time range
Use filters and sorting, along with the time range selector, to target the data you want.
Jump to main lists
From the counts on the Kubernetes Overview home page, click All to see that component’s list of items in your Kubernetes fleet.
Control app refresh
You can control the automatic refresh interval of the GUI as well as disable the auto refresh.
Use color cues
Throughout the views in Kubernetes Monitoring, you see color used as an additional means of indicating status or condition. For example, sometimes text is a different color for Pod status:
Text | Color | Comments |
---|---|---|
Failed | Red | Failed Pod |
Running | Green | Healthy Pod |
Running | Red | Pod is failing to start |
Succeeded | Green | Job Pod successfully run |
Unknown | White | Pod status is unknown |
Waiting | Yellow | Pod is waiting because of startup, such as Pod initializing or container creating |
Waiting | Red | Pod is waiting because of a problem, such as crash loop back off or image pull back off |
For more information on Pod status, refer to the Kubernetes documentation on Pod lifecycle.
The following table describes the color indicators for resource capacity and the state of resource usage:
Usage Colors | Usage | Comments |
---|---|---|
Green | 60-90% of maximum | This is the ideal state of resource usage. |
Yellow | Below 60% | Low usage percentages indicate that the item might be overprovisioned. |
Red | 90%+ | Your resource usage is close to or above its configured capacity. |
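The usage thresholds in the preceding table can be expressed as a small function. This is a sketch for reasoning about the ranges, not the application's code. The function name is hypothetical, and because the table lists 90% in both the green and red ranges, the sketch assumes that exactly 90% counts as red.

```python
def usage_color(percent: float) -> str:
    """Map a resource usage percentage to the color cue from the table above."""
    if percent < 0:
        raise ValueError("percent must not be negative")
    if percent >= 90:  # assumption: exactly 90% is treated as red
        return "red"
    if percent >= 60:
        return "green"
    return "yellow"


# Hypothetical usage values for three workloads.
for value in (45.0, 75.0, 95.0):
    print(value, usage_color(value))
```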