How Grafana Labs Effectively Pairs Loki and Kubernetes Events
As we’ve rolled out Loki internally at Grafana Labs, we wanted to collect more than just application logs. In particular, while debugging outages caused by bad config, Kubernetes itself, or node restarts, we’ve found Kubernetes events to be very useful.
How It Works
The Kubernetes events feature allows you to see all of the changes in a cluster, and you can get a simple overview by just retrieving them:
$ kubectl get events
LAST SEEN   TYPE     REASON              KIND         MESSAGE
38m         Normal   Killing             Pod          Killing container with id docker://grafana:Need to kill Pod
38m         Normal   SuccessfulDelete    ReplicaSet   Deleted pod: grafana-54f599867-xqdw7
38m         Normal   Scheduled           Pod          Successfully assigned default/grafana-5c6c645897-s4c2b to gke-ops-tools1-gke-u-ops-tools1-gke-u-14d4793c-6kc4
38m         Normal   Pulling             Pod          pulling image "grafana/grafana-dev:master-d54851f8e21347da81a74b60bae0601d53184439"
38m         Normal   Pulled              Pod          Successfully pulled image "grafana/grafana-dev:master-d54851f8e21347da81a74b60bae0601d53184439"
38m         Normal   Created             Pod          Created container
38m         Normal   Started             Pod          Started container
14m         Normal   Killing             Pod          Killing container with id docker://grafana:Need to kill Pod
38m         Normal   SuccessfulCreate    ReplicaSet   Created pod: grafana-5c6c645897-s4c2b
14m         Normal   SuccessfulDelete    ReplicaSet   Deleted pod: grafana-5c6c645897-s4c2b
14m         Normal   Scheduled           Pod          Successfully assigned default/grafana-844858cf5f-fqhn6 to gke-ops-tools1-gke-u-ops-tools1-gke-u-14d4793c-ks8l
14m         Normal   Pulling             Pod          pulling image "grafana/grafana-dev:master-81c42fc912cba9c3e553d5ac433147a04638a045"
14m         Normal   Pulled              Pod          Successfully pulled image "grafana/grafana-dev:master-81c42fc912cba9c3e553d5ac433147a04638a045"
14m         Normal   Created             Pod          Created container
14m         Normal   Started             Pod          Started container
14m         Normal   SuccessfulCreate    ReplicaSet   Created pod: grafana-844858cf5f-fqhn6
38m         Normal   ScalingReplicaSet   Deployment   Scaled up replica set grafana-5c6c645897 to 1
38m         Normal   ScalingReplicaSet   Deployment   Scaled down replica set grafana-54f599867 to 0
14m         Normal   ScalingReplicaSet   Deployment   Scaled up replica set grafana-844858cf5f to 1
14m         Normal   ScalingReplicaSet   Deployment   Scaled down replica set grafana-5c6c645897 to 0
This also captures when nodes go unresponsive and when a pod is killed, along with the reason why.
How Grafana Labs Pairs Loki and Kubernetes Events
Most recently, Kubernetes events proved effective in debugging an outage:
15m 15m 1 ingester-6f9b57ccbd-rq9qs.15b2d20d55e14865 Pod Normal Preempted default-scheduler by <namespace>/querier-9467b8d85-7kwf5 on node gke-us-central1-us-central1-bigger-no-6dc155a4-jsqx
Persisting the events and being able to query them is important, but unfortunately, Kubernetes only retains events for one hour to reduce the load on etcd. Loki, however, is a good fit for storing and querying them.
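To see why Loki is a natural home for events, it helps to look at what a Loki push request actually contains: labeled streams of timestamped log lines. The sketch below builds such a payload in Python. It is purely illustrative — Promtail constructs and sends these requests for you — and the `/loki/api/v1/push` endpoint path reflects Loki's v1 HTTP API, so check it against your Loki version.

```python
import json
import time

def build_loki_payload(event_line: str, labels: dict) -> dict:
    """Build a request body in the shape Loki's push API (/loki/api/v1/push)
    expects: a list of streams, each with a label set and a list of
    [timestamp-in-nanoseconds, log-line] pairs."""
    ts_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[ts_ns, event_line]],
            }
        ]
    }

# A simplified, assumed event shape for illustration only.
event = {"verb": "ADDED", "event": {"reason": "Preempted",
                                    "metadata": {"namespace": "default"}}}
payload = build_loki_payload(json.dumps(event),
                             {"name": "eventrouter", "namespace": "default"})
print(json.dumps(payload, indent=2))
```

Everything hangs off the label set: the labels become the stream selector you later query with, while the raw event JSON stays in the log line itself.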
I started exploring different ways Grafana Labs could get the events into Loki, including adding a source to Promtail itself. Luckily, I found that Heptio, which was acquired by VMware in 2018, had already built eventrouter for this exact use case: extracting events from Kubernetes and sending them to a third-party service.
One of the good things about eventrouter is that it’s pluggable, so one could write a Loki sink. But it’s also possible to have it write the events to stdout as JSON and use Promtail to scrape them, which is the route I went with.
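For completeness, here is a rough sketch of what the Promtail scrape config for picking up eventrouter’s stdout might look like. This is an assumption, not our exact config: it presumes the eventrouter pods carry a `name: eventrouter` pod label, and it copies that label onto the stream so it can be used as a selector.

```yaml
scrape_configs:
  - job_name: eventrouter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the eventrouter pods (assumes a `name: eventrouter`
      # pod label on the deployment).
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: eventrouter
        action: keep
      # Expose the pod label as the `name` stream label.
      - source_labels: [__meta_kubernetes_pod_label_name]
        target_label: name
```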
After deploying eventrouter, seeing the Kubernetes events in Loki was as simple as running the query {name="eventrouter"}. But I soon noticed that there were so many events that we should be able to select or sort them by namespace at the very least.
To do so, I leveraged Promtail’s pipeline configuration to add namespace as an additional label to the logs exported to Loki:
- match:
    selector: '{name="eventrouter"}'
    stages:
    - json:
        expressions:
          namespace: event.metadata.namespace
    - labels:
        namespace: ""
This takes the namespace from the event’s JSON payload and adds it as a label on each line shipped to Loki.
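To make the behavior concrete, here is a rough Python simulation of what those two pipeline stages do to a single line. The event shape (a top-level `event` key with `metadata` inside) is a simplified assumption that matches the `event.metadata.namespace` expression in the config above.

```python
import json

def run_pipeline(log_line: str, labels: dict) -> dict:
    """Roughly simulate Promtail's `json` + `labels` stages for one line.

    The `json` stage pulls event.metadata.namespace into an extracted map;
    the `labels` stage then promotes that value to a stream label."""
    extracted = {}
    try:
        doc = json.loads(log_line)
        # json stage: expression `event.metadata.namespace`
        extracted["namespace"] = doc["event"]["metadata"]["namespace"]
    except (ValueError, KeyError, TypeError):
        return labels  # leave labels untouched if the line doesn't match
    # labels stage: promote the extracted value to a label
    new_labels = dict(labels)
    new_labels["namespace"] = extracted["namespace"]
    return new_labels

line = json.dumps({"verb": "ADDED",
                   "event": {"metadata": {"namespace": "grafana-com"}}})
print(run_pipeline(line, {"name": "eventrouter"}))
# → {'name': 'eventrouter', 'namespace': 'grafana-com'}
```

The real stages are more general (JMESPath expressions, multiple extracted values), but the net effect on each event line is exactly this: one extra label derived from the JSON body.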
As a result, I can query all the events from just the grafana-com namespace with {name="eventrouter", namespace="grafana-com"}:
What’s Next
While we’re quite happy that we can now store and retrieve events, we’ve found the UI is quite lacking when dealing with JSON logs. I personally find the entire JSON blob distracting when we’re only looking for the values of a few keys, so we’re looking at ways to improve how we handle JSON logs.
More About Loki
Launched at KubeCon North America last December, Loki is a Prometheus-inspired service that optimizes storage, search, and aggregation while making logs easy to explore natively in Grafana. Loki is designed to work easily both as microservices and as monoliths, and correlates logs and metrics to save users money.
Less than a year later, Loki has almost 6,500 stars on GitHub and is now quickly approaching GA.
At Grafana Labs, we’ve been working hard on developing key features to make that possible, including loki-canary for early detection of missing logs, the Docker logging driver plugin, support for systemd, and adding structure to unstructured logs with pipeline stages.
You can also read about query optimization in Loki in our three-part series, which covers the use of Go and iterators as well as ingestion, retention, and label queries.
Be sure to check back for more content about Loki.