Troubleshoot Kubernetes Monitoring
This section includes common errors encountered while installing and configuring Kubernetes Monitoring components.
Tips for Helm chart deployment
If you have configured Kubernetes Monitoring with the Grafana Kubernetes Monitoring Helm chart, here are some general troubleshooting techniques:
- Within Kubernetes Monitoring, view the metrics status.
- Check for any changes with the command `helm template ...`. This produces an `output.yaml` file to check the result.
- Check the configuration with the command `helm test --logs`. This provides a configuration validation, including all phases of metrics gathering through display.
- Check the `extraConfig` section of the Helm chart to ensure this section is not used for modifications. This section is only for additional configuration not already in the chart, not for modifications to the chart.
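The first two checks can be sketched as shell commands. This is a hedged sketch: the release name `grafana-k8s-monitoring`, the chart reference `grafana/k8s-monitoring`, and the `values.yaml` file are placeholders to substitute with your own.

```shell
#!/bin/sh
# Hedged sketch of the Helm checks above; the release name, chart
# reference, and values file are placeholders for your own deployment.
if command -v helm >/dev/null 2>&1; then
  # Render the chart locally and write the result to output.yaml for review.
  helm template grafana-k8s-monitoring grafana/k8s-monitoring \
    -f values.yaml > output.yaml
  # Validate the deployed release's configuration, printing test logs.
  helm test grafana-k8s-monitoring --logs
  status="helm checks attempted"
else
  status="helm not found; install Helm 3 to run these checks"
fi
echo "$status"
```

Comparing `output.yaml` between two renders (for example, with `diff`) is a quick way to see exactly what a values change would alter before applying it.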
Two common issues often occur when a Helm chart is not configured correctly:
- Duplicate metrics
- Missing metrics
Duplicate metrics
For duplicate metrics, determine whether some metric data sources (such as Node Exporter or kube-state-metrics) already exist on the Cluster. If so, remove them or adjust the Helm chart values to use the existing ones and skip deploying another instance.
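One way to spot pre-existing exporters is to filter the cluster's pod list. The snippet below simulates `kubectl get pods -A` output with placeholder pod names so the filtering logic is self-contained; against a real cluster, pipe the actual command instead.

```shell
#!/bin/sh
# Simulated "kubectl get pods -A" output with placeholder names.
# On a real cluster run:
#   kubectl get pods -A | grep -Ei 'node-exporter|kube-state-metrics'
pods="monitoring   node-exporter-abc12        Running
kube-system  kube-state-metrics-7d9f4   Running
default      my-app-5f6d8               Running"

# Any match here is a candidate duplicate if the Helm chart also deploys
# its own instance of the same exporter.
dupes=$(echo "$pods" | grep -Ei 'node-exporter|kube-state-metrics')
echo "$dupes"
```

If matches appear, either remove those deployments or point the Helm chart at the existing instances so only one copy of each metric source is scraped.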
Missing metrics
It’s helpful to keep in mind the different phases of metrics gathering when debugging:
- Discovery: Find the metric source. In this phase, find out whether the tool to gather metrics is working. For example, is Node Exporter running? Can Alloy find Node Exporter? Perhaps the configuration is incorrect because Alloy is looking in the wrong namespace or for a label that isn't set.
- Scraping: Were the metrics gathered correctly? For example, most metric sources use HTTP, but the metric source you are trying to reach may use HTTPS. Identify whether the configuration is set for scraping HTTPS.
- Processing: Were metrics correctly processed? With Kubernetes Monitoring, metrics are filtered down to a small subset of useful metrics.
- Delivering: In this phase, metrics are sent to Grafana Cloud. While this is an uncommon problem, if there is an issue, there are likely no metrics being delivered.
- Displaying: A metric is not showing up in the Kubernetes Monitoring GUI. If you’ve determined the metrics are being delivered but some are not displaying, there may be a missing or incorrect label for the metric. Each metric has a Cluster label.
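When working through the discovery and scraping phases, Alloy's own HTTP endpoints are a useful first stop. This is a hedged sketch: the label selector, service name, and namespace are assumptions, and the UI port assumes Alloy's default of 12345.

```shell
#!/bin/sh
# Hedged sketch of discovery/scraping checks; the label selector and the
# port-forward target below are assumptions to adjust for your deployment.
if command -v kubectl >/dev/null 2>&1; then
  # Discovery: is the exporter Pod running, and in the namespace you expect?
  kubectl get pods -A -l app.kubernetes.io/name=node-exporter
  # Scraping: Alloy's UI (default port 12345) lists discovered targets and
  # last-scrape errors. Port-forward it, then browse or curl it:
  #   kubectl -n <namespace> port-forward svc/<alloy-service> 12345:12345
  #   curl http://localhost:12345/-/ready
  note="kubectl available; checks attempted"
else
  note="kubectl not found; run these commands against your cluster"
fi
echo "$note"
```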
View metrics status
To view the status of metrics being collected:
- Click Configuration on the menu.
- Click the Metrics status tab.
- Filter for the Cluster or Clusters you want to see the status of.
For more information about each status, click the Docs link.
Status icons
Each panel shows an icon that indicates the status of the incoming data, based on its source and Cluster selected within the time range:
- Green circle with check mark: Data for this source is being collected. The source's version or online status also displays.
- Yellow caution with exclamation mark: Duplicate data is being collected for this source. For example, two instances may be deployed for the same metric source.
- Red circle with X: There is no data available for this item, within the time range specified.
Check initial configuration
During initial configuration, if any box shows “Offline”, the cause can be any of the following:
- The feature was not selected during Cluster configuration.
- The system is not running correctly.
- Alloy was not able to gather data correctly.
- No data was gathered during the time range specified.
View the query with Explore
If something in the metrics status looks incorrect, click the icon next to the panel title. This opens Explore where you can examine the query for any issues, such as an incorrect label.
Look at a historical time range
Use the time range selector to understand what was occurring in the past. In the following example, Cluster events were being collected but are not currently.
Resolve missing efficiency usage data
If CPU and memory usage within any table shows no data, it could be due to missing Node Exporter metrics. Navigate to Configuration in the main menu, and click the Metrics status tab to determine what is not being reported.
Resolve missing metrics
If you are missing metrics even though the Metrics status tab under the Configuration menu shows the configuration is set up as you intended, check for an incorrectly configured label on the Node Exporter instance.
Make sure the Node Exporter `instance` label is set to the Node name. The kube-state-metrics `node` label and the Node Exporter `instance` label must contain the same values.
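This requirement can be checked mechanically: the set of values of the kube-state-metrics `node` label must equal the set of values of the Node Exporter `instance` label. The snippet below simulates label values with placeholder node names; on a real stack you would fetch them from your metrics backend (for example, via its label-values API).

```shell
#!/bin/sh
# Simulated label values; the node names are placeholders. On a real stack,
# fetch the values of kube-state-metrics' "node" label and Node Exporter's
# "instance" label from your metrics backend and compare them.
ksm_nodes="node-a
node-b"
node_exporter_instances="node-a
node-b"

# The two label sets must match exactly for usage panels to join the data.
if [ "$(echo "$ksm_nodes" | sort)" = "$(echo "$node_exporter_instances" | sort)" ]; then
  result="labels match"
else
  result="label mismatch: fix the Node Exporter instance label"
fi
echo "$result"
```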
Resolve missing workload data
If you are seeing Pod resource usage but not workloads usage data:
- Navigate to the Configuration page.
- Scroll to the step for Backend installation.
- Click Install to install alert rules and recording rules.
Resolve update error
If you attempted to upgrade Kubernetes Monitoring with the Update button on the Settings tab under Configuration and received an error message, complete the following instructions.
Warning
When you uninstall Grafana Alloy, this deletes its associated alert and recording rule namespace. Alerts added to the default locations are also removed. Save a copy of any customized item if you modified the provisioned version.
- Click Uninstall.
- Click Install to reinstall.
- Complete the instructions in Configure with Grafana Kubernetes Monitoring Helm chart.
Resolve duplicate metrics
To identify why you have duplicate metrics:
- At the Metrics status page, look for the yellow caution icon, which indicates multiple metrics are being collected. Click the Explore icon next to the panel title to examine the query for correctness.
- Visit the Cardinality page in Kubernetes Monitoring to narrow down the origin of your active series.
Resolve “no data” in a panel
If a panel in Kubernetes Monitoring seems to be missing data or shows a “no data” message, open the query for the panel in Explore to determine which query is failing.
This can occur when new features are released. For example, if you see no data in the network bandwidth and saturation panels, it is likely you need to upgrade to the newest version of the Helm chart.
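Upgrading the chart can be sketched as follows; the release name `grafana-k8s-monitoring`, the chart reference, and the `monitoring` namespace are assumptions to substitute with your own.

```shell
#!/bin/sh
# Hedged sketch of a chart upgrade; release, chart, and namespace names
# are placeholders for your own deployment.
if command -v helm >/dev/null 2>&1; then
  # Refresh the local chart index so the newest chart version is visible.
  helm repo update
  # Upgrade in place, keeping the values from the current release.
  helm upgrade grafana-k8s-monitoring grafana/k8s-monitoring \
    -n monitoring --reuse-values
  note="upgrade attempted"
else
  note="helm not found; install Helm 3 to upgrade the chart"
fi
echo "$note"
```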
Resolve OpenShift errors
With OpenShift’s default `SecurityContextConstraints` (`scc`) of `restricted` (refer to the `scc` documentation for more information), you may run into the following errors while deploying Grafana Alloy using the default generated manifests:
```
msg="error creating the agent server entrypoint" err="creating HTTP listener: listen tcp 0.0.0.0:80: bind: permission denied"
```
By default, the Alloy StatefulSet container attempts to bind to port `80`, which is only allowed for the root user (UID `0`) and other privileged users. With the default `restricted` SCC on OpenShift, this results in the preceding error.
```
Events:
  Type     Reason        Age                   From                  Message
  ----     ------        ----                  ----                  -------
  Warning  FailedCreate  3m55s (x19 over 15m)  daemonset-controller  Error creating: pods "grafana-agent-logs-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.runAsUser: Invalid value: 0: must be in the ranges: [1000650000, 1000659999], spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
```
By default, the Alloy DaemonSet attempts to run as the root user and to access directories on the host (to tail logs). With the default `restricted` SCC on OpenShift, this results in the preceding error.
To solve these errors, use the `hostmount-anyuid` SCC provided by OpenShift, which allows containers to run as root and mount directories on the host.
If this does not meet your security needs, create a new SCC with tailored permissions, or investigate running Alloy as a non-root container, which goes beyond the scope of this troubleshooting guide.
To use the `hostmount-anyuid` SCC, add the following stanza to the `alloy` and `alloy-logs` ClusterRoles:
```yaml
. . .
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  verbs:
  - use
  resourceNames:
  - hostmount-anyuid
. . .
```
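As an alternative to editing the ClusterRoles by hand, OpenShift's `oc adm policy` can bind the SCC to Alloy's service accounts directly. The service-account names `alloy` and `alloy-logs` and the `monitoring` namespace below are assumptions; adjust them to your deployment.

```shell
#!/bin/sh
# Hedged sketch: grant the hostmount-anyuid SCC to Alloy's service accounts.
# Service-account and namespace names are placeholders; requires
# cluster-admin privileges on the OpenShift cluster.
if command -v oc >/dev/null 2>&1; then
  oc adm policy add-scc-to-user hostmount-anyuid -z alloy -n monitoring
  oc adm policy add-scc-to-user hostmount-anyuid -z alloy-logs -n monitoring
  note="SCC bindings attempted"
else
  note="oc CLI not found; run from a machine with cluster access"
fi
echo "$note"
```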