How to use Kubernetes events for effective alerting and monitoring

• 2023-01-23 • 10 min

Kubernetes, a graduated project of the Cloud Native Computing Foundation (CNCF), is one of the most prominent and widely used container orchestration systems. It’s used to manage and deploy containers in a wide range of environments, from IoT devices based on Raspberry Pis to enterprise environments consisting of millions of services.

However, the teams that manage these clusters need to know what’s happening to the state of objects in the cluster, and this in turn introduces a requirement to gather real-time information about cluster statuses and changes. This is enabled by Kubernetes events, which give you a detailed view of the cluster and allow for effective alerting and monitoring.

In this guide, you’ll learn how Kubernetes events work, what generates them, and where they’re stored. You’ll also learn to integrate Grafana with your Kubernetes environment to effectively use the information supplied by those events to support your observability strategy.

What are Kubernetes events?

Kubernetes events provide a rich source of information. These objects can be used to monitor your application and cluster state, respond to failures, and perform diagnostics. The events are generated when the cluster’s resources — such as pods, deployments, or nodes — change state.

Whenever something happens inside your cluster, it produces an Event object that provides visibility into what changed. However, Kubernetes events don’t persist throughout your cluster life cycle, as there’s no built-in mechanism for long-term retention. They’re short-lived: by default, they’re available for only one hour after they’re generated, a window set by the API server’s --event-ttl flag. (Later in this post, we’ll show you how to get actionable insights from these events by using Grafana.)

The events can be generated for a number of reasons, but the following examples are some of the most common causes.

1. State change

Kubernetes events are automatically generated when certain actions are taken on objects in a cluster. For example, when a pod is created, a corresponding event is created. Other examples are changes in a pod’s phase to Pending, Running, Succeeded, or Failed, including changes caused by pod eviction or cluster failure.

Remember: services, PersistentVolumes, StorageClasses, and other objects also produce events; we’re using a pod as an example for simplicity.

2. Configuration changes

Events are also generated when there’s a configuration change. Configuration changes can include scaling a workload horizontally by adding replicas, or scaling vertically by increasing memory, disk input/output capacity, or processor cores.

3. Scheduling

Successful and failed scheduling attempts both generate events. A pod can fail to be scheduled because of insufficient resources. A pod that has been scheduled can still fail to start or stay healthy because of invalid container image repository access, or because the container fails a liveness or readiness probe.

This is not an exhaustive list; there are many more reasons for events to be generated.

Types of Kubernetes events

You can use Kubernetes events to monitor your system for problems by tracking the number of events and their content over time and setting up alerts for specific conditions. The alerts and frequency tracking ensure that problems are detected and dealt with in a timely manner. To understand how to track these events, you need to understand more about five common types of events. (These are groupings by cause; the TYPE column that kubectl prints for an event is either Normal or Warning.)
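As a rough illustration of tracking event counts and alerting on specific conditions, the following Python sketch tallies events by reason and flags any reason whose count reaches a threshold. The dictionaries mirror the type and reason fields of a Kubernetes Event object; the function name, the sample data, and the threshold values are assumptions made for this example, not part of any Kubernetes or Grafana API.

```python
from collections import Counter


def reasons_over_threshold(events, thresholds):
    """Return {reason: count} for every reason whose count meets its threshold.

    `events` is a list of dicts with a "reason" key, as in a Kubernetes Event.
    `thresholds` maps a reason to the minimum count that should raise an alert
    (hypothetical values chosen by the operator).
    """
    counts = Counter(event["reason"] for event in events)
    return {
        reason: counts[reason]
        for reason, limit in thresholds.items()
        if counts[reason] >= limit
    }


# Hypothetical sample data, for illustration only.
sample_events = [
    {"type": "Warning", "reason": "FailedScheduling"},
    {"type": "Warning", "reason": "FailedScheduling"},
    {"type": "Warning", "reason": "Evicted"},
    {"type": "Normal", "reason": "Pulled"},
]

alerts = reasons_over_threshold(
    sample_events, {"FailedScheduling": 2, "Evicted": 3}
)
print(alerts)
```

With these sample values, only FailedScheduling is flagged, because the single Evicted event stays below its threshold of three.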

1. Failed events

Failed events occur when there’s an error in your object manifest, or when there’s a problem pulling the container image from the repository. Image pull errors can happen due to:

The wrong credentials for a private repository
Rate limiting
A typo in the image name or tags

The best way to troubleshoot this is to check that the image address and tags are correct. To do this, pull the image with docker pull <image_name> before you add it to your manifest. If the image can’t be pulled locally, Kubernetes is unlikely to be able to pull it either, and the pod will report an ErrImagePull error followed by ImagePullBackOff. While there are many image-related errors, the most common ones involve a faulty tag or name, followed by incorrect credentials.

2. Evicted events

The kubelet is the agent responsible for running containers on each node in a Kubernetes cluster. It ensures that containers are healthy and running, and it reports to the control plane to provide an accurate view of the cluster state. When a node runs short of resources, though, the kubelet will evict pods from that node to reclaim them.

If other nodes with sufficient resources are available, the scheduler will then place your pods on one of those nodes. To reduce evictions, use taints to prevent new pods from being scheduled on nodes with high utilization.

3. Failed scheduling events

A FailedScheduling event occurs when the scheduler can’t find a suitable node for a pod. This can be caused by a taint on your nodes, insufficient resources, or nodes that don’t match selectors for your pod or deployment.

The FailedScheduling message provides details about the conditions surrounding the failure, which you can use to develop a plan for how to resolve the conflict. Use the kubectl describe object/object_name command to get details about the error and correct it.
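To show how the details in a FailedScheduling message might be turned into next steps, here is a minimal Python sketch. It assumes the message begins with a summary such as "0/3 nodes are available", which is a common shape but varies between Kubernetes versions. The function name, the keyword table, and the sample message are hypothetical.

```python
import re

# Assumed shape of the scheduler summary; the exact wording varies by version.
SUMMARY = re.compile(r"(\d+)/(\d+) nodes are available")

# Hypothetical keyword-to-hint table for this example.
HINTS = {
    "Insufficient cpu": "add CPU capacity or lower the pod's CPU request",
    "Insufficient memory": "add memory or lower the pod's memory request",
    "taint": "add a matching toleration or remove the taint",
    "node affinity": "check the pod's node selector or affinity rules",
}


def summarize_failed_scheduling(message):
    """Extract node counts and suggested follow-ups from a scheduler message."""
    match = SUMMARY.search(message)
    available = int(match.group(1)) if match else None
    total = int(match.group(2)) if match else None
    hints = [hint for keyword, hint in HINTS.items() if keyword in message]
    return {"available": available, "total": total, "hints": hints}


# Hypothetical message, for illustration only.
sample_message = (
    "0/3 nodes are available: 1 node(s) had untolerated taint "
    "{dedicated: gpu}, 2 Insufficient cpu."
)
summary = summarize_failed_scheduling(sample_message)
print(summary)
```

Keyword matching like this is deliberately loose; it is meant to point you at the relevant part of the message rather than to parse it exhaustively.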

4. Volume events

Whenever you want to persist data for your workloads, you need persistent volumes in Kubernetes. FailedMount and FailedAttachVolume are common events caused by a networking or configuration error between your persistent volumes and persistent volume claims, which prevents disks from being used by your pods. Causes of configuration error include incorrect access modes, new nodes with insufficient mount points, or new nodes with too many disks attached.

Events can provide visibility into the corrections needed for your application to run correctly and resolve errors. Troubleshooting steps can involve detaching the disk manually or using selectors, labels, and tolerations to tell the Kubernetes scheduler to start the pod on a specific node.

5. Node events

Kubernetes nodes are the machines in a cluster that run your applications and store your data. You can run Kubernetes on a number of platforms, including public clouds, private clouds, bare metal servers, and even your laptop. However, unhealthy nodes can cause 5XX errors, unhealthy deployments, or other intermittent or persistent failures.

Rebooting your node can help to resolve this, and a Rebooted event is generated when a restart occurs. You can reboot manually, or the control plane can do it automatically.

NodeNotReady is an event that occurs when your node is still in preparation mode, or has become unhealthy, and isn’t ready to have pods scheduled on it. Another node event is HostPortConflict, which is reported when a pod requests a host port that is already in use on the node, possibly due to an incorrect port setting or DaemonSet conflicts.

Accessing your Kubernetes events

Kubectl is a powerful Kubernetes utility that helps you manage your Kubernetes objects and resources. The simplest way to view your event objects is to use kubectl get events.

When you add an Nginx pod to your cluster, you’ll see output similar to this:

hrittik@Azure:~$ kubectl get events
LAST SEEN   TYPE     REASON      OBJECT             MESSAGE
5s          Normal   Scheduled   pod/nginxreplica   Successfully assigned default/nginxreplica to aks-nodepool1-26081864-vmss000004
5s          Normal   Pulling     pod/nginxreplica   Pulling image "nginx"
5s          Normal   Pulled      pod/nginxreplica   Successfully pulled image "nginx" in 272.563572ms
5s          Normal   Created     pod/nginxreplica   Created container nginxreplica
5s          Normal   Started     pod/nginxreplica   Started container nginxreplica

Here, details about container creation, start, and image pull are all shown.

To view your data in JSON, pass in the -o json flag to your events command. The new command will look like this:

kubectl get events -o json
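Once events are available as JSON, they are straightforward to process in code. The sketch below, a minimal example rather than a complete tool, parses that JSON and keeps only Warning events. It relies on the items, type, reason, involvedObject, and message fields of the Event list; the function name and the trimmed sample payload are assumptions for illustration, and real output contains many more fields.

```python
import json


def warning_events(events_json):
    """Return (reason, object, message) tuples for Warning events."""
    event_list = json.loads(events_json)
    results = []
    for item in event_list.get("items", []):
        if item.get("type") != "Warning":
            continue
        involved = item.get("involvedObject", {})
        # Mirror the OBJECT column of kubectl, e.g. "pod/nginxreplica".
        obj = f'{involved.get("kind", "").lower()}/{involved.get("name", "")}'
        results.append((item.get("reason"), obj, item.get("message")))
    return results


# Trimmed, hypothetical sample of the JSON structure.
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "type": "Normal",
            "reason": "Pulled",
            "involvedObject": {"kind": "Pod", "name": "nginxreplica"},
            "message": 'Successfully pulled image "nginx"',
        },
        {
            "type": "Warning",
            "reason": "FailedScheduling",
            "involvedObject": {"kind": "Pod", "name": "web"},
            "message": "0/3 nodes are available: 3 Insufficient cpu.",
        },
    ],
})

for row in warning_events(sample):
    print(row)
```

In practice you would pass the function the text produced by the command above instead of the sample string.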

You can also get events from a specific namespace by using the --namespace=${NAMESPACE} flag. If you need events related to a specific object, describe that object with the command kubectl describe object/object_name. You’ll find all the object-specific events in the Events section of the object description.

Taking the above example of a newly created pod, you can get the same information about the pod’s events by running kubectl describe pods/nginxreplica, then scrolling through the sections to reach Events:

hrittik@Azure:~$ kubectl describe pods/nginxreplica
Name:         nginxreplica
Namespace:    default
Priority:     0
Node:         aks-nodepool1-26081864-vmss000004/10.224.0.4
Start Time:   Wed, 17 Aug 2022 22:32:13 +0000
Labels:       run=nginxreplica
Annotations:  <none>
Status:       Running
IP:           10.244.0.10
IPs:
  IP:  10.244.0.10
Containers:
  nginxreplica:
    Container ID:   containerd://975825f21309b2d56af402018f3a1678812ec41ad287fe30bfd61f2430e236be
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:790711e34858c9b0741edffef6ed3d8199d8faa33f2870dea5db70f16384df79
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 17 Aug 2022 22:32:14 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fbhx4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-fbhx4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m28s  default-scheduler  Successfully assigned default/nginxreplica to aks-nodepool1-26081864-vmss000004
  Normal  Pulling    4m28s  kubelet            Pulling image "nginx"
  Normal  Pulled     4m28s  kubelet            Successfully pulled image "nginx" in 272.563572ms
  Normal  Created    4m28s  kubelet            Created container nginxreplica
  Normal  Started    4m28s  kubelet            Started container nginxreplica
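The same object-specific view can be approximated in code by filtering a list of events on their involvedObject field. The following Python sketch assumes the events have already been loaded as dictionaries (for example from JSON output); the function name and the sample entries are hypothetical.

```python
def events_for_object(items, kind, name, namespace=None):
    """Return events whose involvedObject matches, oldest first."""
    matches = []
    for item in items:
        involved = item.get("involvedObject", {})
        if involved.get("kind") != kind or involved.get("name") != name:
            continue
        if namespace is not None and involved.get("namespace") != namespace:
            continue
        matches.append(item)
    # RFC 3339 UTC timestamps in the same format sort correctly as strings.
    return sorted(matches, key=lambda item: item.get("lastTimestamp") or "")


# Hypothetical sample data, for illustration only.
sample_items = [
    {
        "reason": "Started",
        "lastTimestamp": "2022-08-17T22:32:14Z",
        "involvedObject": {"kind": "Pod", "name": "nginxreplica", "namespace": "default"},
    },
    {
        "reason": "Scheduled",
        "lastTimestamp": "2022-08-17T22:32:13Z",
        "involvedObject": {"kind": "Pod", "name": "nginxreplica", "namespace": "default"},
    },
    {
        "reason": "Pulled",
        "lastTimestamp": "2022-08-17T22:32:13Z",
        "involvedObject": {"kind": "Pod", "name": "other", "namespace": "default"},
    },
]

pod_events = events_for_object(sample_items, "Pod", "nginxreplica")
print([event["reason"] for event in pod_events])
```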

While events contain a great deal of insight and information, they’re inherently limited, since by default the cluster doesn’t retain them for longer than an hour and you can’t query them after that. To develop actionable insights, you have to fetch and retain the events so you can draw conclusions from an aggregate view over a period of time.
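To make the idea of an aggregate view concrete, here is a small Python sketch that groups retained events into hourly buckets per reason. It assumes each event carries a lastTimestamp in the RFC 3339 UTC form that Kubernetes uses (for example 2022-08-17T22:32:13Z); the function name and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """Count events per (hour, reason) bucket."""
    buckets = Counter()
    for event in events:
        # Parse timestamps such as "2022-08-17T22:32:13Z".
        seen = datetime.strptime(event["lastTimestamp"], "%Y-%m-%dT%H:%M:%SZ")
        hour = seen.strftime("%Y-%m-%d %H:00")
        buckets[(hour, event["reason"])] += 1
    return buckets


# Hypothetical retained events, for illustration only.
retained = [
    {"reason": "FailedScheduling", "lastTimestamp": "2022-08-17T22:32:13Z"},
    {"reason": "FailedScheduling", "lastTimestamp": "2022-08-17T22:40:00Z"},
    {"reason": "Evicted", "lastTimestamp": "2022-08-17T23:05:00Z"},
]

counts = hourly_counts(retained)
for (hour, reason), total in sorted(counts.items()):
    print(hour, reason, total)
```

A view like this only becomes useful once events are stored somewhere that outlives the cluster’s default retention window.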

To do this effectively, you need a more robust solution, like Grafana.

Grafana Agent and monitoring

The Grafana Agent is a lightweight data collector that can be installed in your Kubernetes cluster to collect telemetry data, such as metrics, logs, events, and traces. That data can then be forwarded to the Grafana LGTM Stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics), Grafana Cloud, or Grafana Enterprise. No matter what form of Grafana you deploy, you can create and share dynamic dashboards that monitor and alert on your cloud native applications, tools, and Kubernetes clusters.

With the help of the Grafana Agent, you can track all the events in your cluster, not just the ones from the last hour.

Metrics eventhandler

The eventhandler_config is a Grafana Agent integration for Kubernetes that allows your agent to fetch real-time events from the Kubernetes API and forward them to Grafana Loki, which acts as your log aggregator. You can deploy the Grafana Agent as a Kubernetes DaemonSet or a Deployment.

Depending on how you deploy the Grafana Agent, you may see duplicated log entries. The integration uses a cache file (cache_path) to keep track of the events it has already shipped. If that file doesn’t survive a restart, which can happen with a DaemonSet that has no persistent storage, duplicate entries can be shipped when a restart occurs. You can, however, de-duplicate your logs at the dashboard level.

Grafana, which has long been a part of the CNCF ecosystem as a tool for monitoring and observability, comes in handy when you’re looking for simple solutions for your Kubernetes cluster. When you use the eventhandler integration, you gain access to a wide range of features. Some of the most important include:
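The principle behind de-duplication can be sketched in a few lines of Python. This example works on event objects rather than on Loki log lines, so it illustrates the idea and not the dashboard-level mechanism itself. It treats two entries as duplicates when they share the same metadata.uid and metadata.resourceVersion, which is an assumption made for this sketch; the function name and sample data are hypothetical.

```python
def deduplicate(events):
    """Drop repeated copies of the same event revision, keeping first-seen order."""
    seen = set()
    unique = []
    for event in events:
        meta = event["metadata"]
        # The same uid with a new resourceVersion is an update, not a duplicate.
        key = (meta["uid"], meta["resourceVersion"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique


# Hypothetical entries, as they might look after an agent restart.
shipped = [
    {"metadata": {"uid": "a", "resourceVersion": "1"}, "reason": "Pulled"},
    {"metadata": {"uid": "a", "resourceVersion": "1"}, "reason": "Pulled"},
    {"metadata": {"uid": "a", "resourceVersion": "2"}, "reason": "Pulled"},
    {"metadata": {"uid": "b", "resourceVersion": "5"}, "reason": "Started"},
]

unique_events = deduplicate(shipped)
print(len(unique_events))
```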

Scalable metric collection with host_filtering and sharding.
The ability to query your events with LogQL.
Label filters, such as the cluster label set to the name of your cluster, to find events from specific clusters.
Alerting through Slack, email, or other channels when there’s a change of state, an eviction, or an error.
Aggregating events over a period of time to provide a more comprehensive view in your logs and dashboards.
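As an illustration of how the label filters and LogQL fit together, the Python helper below assembles a query string for events from one cluster, optionally narrowed to a reason and wrapped in LogQL’s count_over_time function. The cluster label comes from the list above; the default job label value and the function name are assumptions for this sketch, so check the labels your own agent configuration attaches.

```python
def eventhandler_query(cluster, reason=None, window=None,
                       job="integrations/kubernetes/eventhandler"):
    """Build a LogQL query string for Kubernetes events stored in Loki.

    The `job` default is an assumed label value, not a guaranteed one.
    """
    query = f'{{cluster="{cluster}", job="{job}"}}'
    if reason:
        # Line filter: keep only log lines that contain the reason text.
        query += f' |= "{reason}"'
    if window:
        # Count matching lines over a time window, e.g. "5m".
        query = f"count_over_time({query}[{window}])"
    return query


evictions = eventhandler_query("prod", reason="Evicted")
eviction_rate = eventhandler_query("prod", reason="Evicted", window="5m")
print(evictions)
print(eviction_rate)
```

A counting query of this kind is the sort of expression an alert rule could evaluate, for example to notify a channel when evictions appear within a five-minute window.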

Using the Grafana Agent is very simple. You need to have access to a Kubernetes cluster and a Grafana instance, which can either be SaaS-based or self-hosted. Edit the agent manifest to reflect the correct values for your Loki credentials, then apply it to your cluster and connect it to your Grafana instance.

If you’re interested in monitoring your Kubernetes clusters but don’t want to do it all on your own, we offer Kubernetes Monitoring in Grafana Cloud — the full solution for all levels of Kubernetes usage that gives you out-of-the-box access to your Kubernetes infrastructure’s metrics, logs, and Kubernetes events as well as prebuilt dashboards and alerts. Kubernetes Monitoring is available to all Grafana Cloud users, including those in our generous free tier. If you don’t already have a Grafana Cloud account, you can sign up for a free account today!
