
Troubleshoot Kubernetes Monitoring

This section includes common errors encountered while installing and configuring Kubernetes Monitoring components.

Tips for Helm chart deployment

If you have configured Kubernetes Monitoring with the Grafana Kubernetes Monitoring Helm chart, here are some general troubleshooting techniques:

  • Within Kubernetes Monitoring, view the metrics status.
  • Check for any changes by rendering the chart with the command helm template and writing the result to a file such as output.yaml, then inspect the generated manifests (see the sketch after this list).
  • Check the configuration with the command helm test --logs. This validates the configuration across all phases of metrics gathering, from collection through display.
  • Check the extraConfig section of the Helm chart values. This section is only for additional configuration that the chart does not already generate, not for modifying configuration the chart produces.
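
The following is a minimal sketch of the first two checks. The release name grafana-k8s-monitoring, the chart reference grafana/k8s-monitoring, and the values.yaml file name are placeholders; substitute whatever you used when installing.

bash
# Render the chart locally and write the generated manifests to a file for inspection.
helm template grafana-k8s-monitoring grafana/k8s-monitoring -f values.yaml > output.yaml

# Run the chart's built-in tests against the deployed release and print their logs.
helm test grafana-k8s-monitoring --logs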

Two common issues often occur when a Helm chart is not configured correctly:

  • Duplicate metrics
  • Missing metrics

Duplicate metrics

For duplicate metrics, determine whether some metric data sources (such as Node Exporter or kube-state-metrics) already exist on the Cluster. If so, remove them or adjust the Helm chart values to use the existing ones and skip deploying another instance.
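
As a sketch, assuming the chart bundles its own kube-state-metrics and Node Exporter and exposes a deploy toggle for each, you could disable the bundled instances at upgrade time. The release name is a placeholder, and the exact value keys differ between chart versions, so confirm them against your chart's values.yaml first.

bash
# Hypothetical value keys; check your chart version's values.yaml for the exact names.
helm upgrade --install grafana-k8s-monitoring grafana/k8s-monitoring \
  -f values.yaml \
  --set kube-state-metrics.deploy=false \
  --set prometheus-node-exporter.deploy=false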

Missing metrics

It’s helpful to keep in mind the different phases of metrics gathering when debugging:

  • Discovery: Find the metric source. In this phase, find out whether the tool that gathers metrics is working. For example, is Node Exporter running? Can Alloy find Node Exporter? Perhaps the configuration is incorrect because Alloy is looking in the wrong namespace or for a label that does not exist.
  • Scraping: Were the metrics gathered correctly? For example, most metric sources serve plain HTTP, but the source you are trying to scrape might only serve HTTPS. Confirm whether the scrape configuration uses the correct scheme; the sketch after this list shows a quick way to check the endpoint.
  • Processing: Were metrics correctly processed? With Kubernetes Monitoring, metrics are filtered down to a small subset of useful metrics.
  • Delivering: In this phase, metrics are sent to Grafana Cloud. Delivery problems are uncommon, but when they occur, typically no metrics arrive at all.
  • Displaying: The metrics are delivered, but one or more do not show up in the Kubernetes Monitoring GUI. This usually points to a missing or incorrect label; for example, each metric is expected to carry a Cluster label.
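
A minimal sketch of checking an endpoint's scheme by hand, assuming you can reach the Cluster with kubectl. The pod name, namespace, and port are placeholders for the metric source you are debugging.

bash
# Forward the metric source's port to your machine (placeholder pod name and port).
kubectl port-forward -n monitoring pod/node-exporter-abc12 9100:9100 &

# If this returns metric lines, the endpoint serves plain HTTP.
curl -s http://localhost:9100/metrics | head

# If only this works, the endpoint serves HTTPS and the scrape configuration
# must use the https scheme (and the appropriate TLS settings).
curl -sk https://localhost:9100/metrics | head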

View metrics status

To view the status of metrics being collected:

  1. Click Configuration on the menu.
  2. Click the Metrics status tab.
  3. Filter for the Cluster or Clusters you want to see the status of.
Metrics status tab with status indicators for one Cluster

For more information about each status, click the Docs link.

Clicking a link to open documentation about the metric status

Status icons

Each panel shows an icon that indicates the status of the incoming data, based on the selected source and Cluster within the time range:

  • Green circle with check mark: Data for this source is being collected. The version of the source or online status also displays.
  • Yellow caution with exclamation mark: Duplicate data is being collected for this source. For example, there may be two instances deployed for the same metric source.
  • Red circle with X: There is no data available for this item, within the time range specified.

Check initial configuration

When you first complete configuration, if any panel shows “Offline”, the cause can be any of the following:

  • The feature was not selected during Cluster configuration.
  • The system is not running correctly.
  • Alloy was not able to gather data correctly (see the quick check after this list).
  • No data was gathered during the time range specified.
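
To check whether the components are running and whether Alloy is reporting errors, here is a minimal sketch. The namespace and the app.kubernetes.io/name=alloy label selector are assumptions; adjust them to your installation.

bash
# List the pods in the namespace the chart was installed into (placeholder namespace).
kubectl get pods -n monitoring
# Confirm that the Alloy pods and metric sources (for example, kube-state-metrics
# and Node Exporter) are Running and Ready.

# Inspect recent Alloy logs for scrape or discovery errors (assumed label selector).
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=50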

View the query with Explore

If something in the metrics status looks incorrect, click the icon next to the panel title. This opens Explore where you can examine the query for any issues, such as an incorrect label.

Opening Explore from Metrics status to view the query

Look at a historical time range

Use the time range selector to understand what was occurring in the past. In the following example, Cluster events were being collected but are not currently.

Changing the time range for Metrics status

Resolve missing efficiency usage data

If CPU and memory usage within any table shows no data, it could be due to missing Node Exporter metrics. Navigate to Configuration in the main menu, and click the Metrics status tab to determine what is not being reported.

Resolve missing metrics

If metrics are missing even though the Metrics status tab under the Configuration menu shows the setup you intended, check for an incorrectly configured label on the Node Exporter instance.

Make sure the Node Exporter instance label is set to the Node name. The kube-state-metrics node label and the Node Exporter instance label must contain the same values.
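
A quick way to see the Node names the instance label should match, as a sketch. The namespace and label selector are assumptions based on the standard prometheus-node-exporter chart labels; adjust them to your deployment.

bash
# The NODE column lists the node names that the Node Exporter instance label
# (and the kube-state-metrics node label) should contain.
kubectl get pods -n monitoring \
  -l app.kubernetes.io/name=prometheus-node-exporter \
  -o wide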

Resolve missing workload data

If you are seeing Pod resource usage but not workloads usage data:

  1. Navigate to the Configuration page.
  2. Scroll to the step for Backend installation.
  3. Click Install to install alert rules and recording rules.

Resolve update error

If you attempted to upgrade Kubernetes Monitoring with the Update button on the Settings tab under Configuration and received an error message, complete the following instructions.

Warning

Uninstalling Grafana Alloy deletes its associated alert and recording rule namespace. Alerts added to the default locations are also removed. If you modified any provisioned items, save a copy of your customizations before continuing.

  1. Click Uninstall.
  2. Click Install to reinstall.
  3. Complete the instructions in Configure with Grafana Kubernetes Monitoring Helm chart.

Resolve duplicate metrics

To identify why you have duplicate metrics:

  • At the Metrics status page, look for the yellow caution icon, which indicates that duplicate metrics are being collected. Click the Explore icon next to the panel title to examine the query for correctness.

  • Visit the Cardinality page in Kubernetes Monitoring to narrow down the origin of your active series.

Resolve “no data” in a panel

If a panel in Kubernetes Monitoring seems to be missing data or shows a “no data” message, open the query for the panel in Explore to determine which query is failing.

Opening Explore when a panel shows no data

This can occur when new features are released. For example, if you see no data in the network bandwidth and saturation panels, it is likely you need to upgrade to the newest version of the Helm chart.
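
As a sketch of upgrading to the newest chart version, with the release name, chart reference, and values file as placeholders for whatever you deployed with:

bash
# Refresh the local copy of the Grafana Helm repository.
helm repo update
# Upgrade the existing release to the latest chart version, keeping your values.
helm upgrade grafana-k8s-monitoring grafana/k8s-monitoring -f values.yaml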

Resolve OpenShift errors

With OpenShift’s default SecurityContextConstraints (SCC) of restricted (refer to the SCC documentation for more information), you may run into the following errors while deploying Grafana Alloy using the default generated manifests:

msg="error creating the agent server entrypoint" err="creating HTTP listener: listen tcp 0.0.0.0:80: bind: permission denied"

By default, the Alloy StatefulSet container attempts to bind to port 80, which is only allowed by the root user (0) and other privileged users. With the default restricted SCC on OpenShift, this results in the preceding error.

Events:
  Type     Reason        Age                   From                  Message
  ----     ------        ----                  ----                  -------
  Warning  FailedCreate  3m55s (x19 over 15m)  daemonset-controller  Error creating: pods "grafana-agent-logs-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.runAsUser: Invalid value: 0: must be in the ranges: [1000650000, 1000659999], spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

By default, the Alloy DaemonSet attempts to run as root user, and also attempts to access directories on the host (to tail logs). With the default restricted SCC on OpenShift, this results in the preceding error.

To solve these errors, use the hostmount-anyuid SCC provided by OpenShift, which allows containers to run as root and mount directories on the host.

If this does not meet your security needs, create a new SCC with tailored permissions, or investigate running Alloy as a non-root container, which goes beyond the scope of this troubleshooting guide.

To use the hostmount-anyuid SCC, add the following stanza to the alloy and alloy-logs ClusterRoles:

yaml
. . .
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  verbs:
  - use
  resourceNames:
  - hostmount-anyuid
. . .
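
After adding the stanza, reapply the updated ClusterRoles and restart the Alloy workloads so they are re-admitted under the new SCC. The file names, namespace, and workload names below are placeholders; use the ones from your deployment.

bash
# Apply the edited ClusterRole manifests (placeholder file names).
oc apply -f alloy-clusterrole.yaml
oc apply -f alloy-logs-clusterrole.yaml

# Restart the Alloy workloads so the pods are recreated under the new SCC
# (placeholder namespace and workload names).
oc -n monitoring rollout restart statefulset/alloy daemonset/alloy-logs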