Loki’s Path to GA: Query Optimization, Part Three
Launched at KubeCon North America last December, Loki is a Prometheus-inspired logging service that optimizes storage, search, and aggregation while making logs easy to explore natively in Grafana. Loki is designed to run easily both as a set of microservices and as a single monolithic binary, and it correlates logs and metrics to save users money.
Less than a year later, Loki has almost 6,500 stars on GitHub and is now quickly approaching GA. At Grafana Labs, we’ve been working hard on developing key features to make that possible. In the coming weeks, we’ll be highlighting some of these features.
In part one of this blog series on query optimization, we focused on benchmarks, object pools, and finding memory leaks, and part two was about iterators. This final installment will cover ingestion retention, label queries, and what’s next.
Ingestion Retention & Label Queries
Loki is built on the same architecture as Cortex. Queriers need to stream logs from ingesters and to fetch and decompress chunks from cloud storage before processing them together. We use the ingesters to compress individual writes into one big chunk, reducing writes to the backend by roughly 1000x. Once chunks are full, the ingester periodically flushes them to object storage; however, they are still retained in memory for 15 minutes to serve label queries. Label queries return all label names and values, which allows Grafana Explore to suggest streams to search for.
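To make that flow concrete, here is a minimal Go sketch of the idea, under the assumption that chunks are flushed once full but kept around for a retention window so they can still answer label queries. The type and field names are hypothetical and are not Loki's actual implementation.

```go
package ingester

import (
	"sync"
	"time"
)

// chunk is a hypothetical stand-in for an ingester chunk: many individual
// log writes compressed together into one large object-store write.
type chunk struct {
	labels  map[string]string
	full    bool      // set once the chunk has reached its target size
	flushed time.Time // zero until the chunk has been flushed to storage
}

// ingester holds in-memory chunks and flushes full ones to object storage.
type ingester struct {
	mu        sync.Mutex
	chunks    []*chunk
	retention time.Duration // e.g. 15 * time.Minute
}

// flushLoop periodically flushes full chunks, but keeps them in memory for
// the retention window so label queries can still be answered from them.
func (i *ingester) flushLoop(flush func(*chunk) error) {
	for range time.Tick(30 * time.Second) {
		i.mu.Lock()
		kept := i.chunks[:0]
		for _, c := range i.chunks {
			if c.full && c.flushed.IsZero() {
				if err := flush(c); err == nil {
					c.flushed = time.Now()
				}
			}
			// Keep unflushed chunks, and flushed ones until retention expires.
			if c.flushed.IsZero() || time.Since(c.flushed) < i.retention {
				kept = append(kept, c)
			}
		}
		i.chunks = kept
		i.mu.Unlock()
	}
}

// labelValues answers a label query from the chunks still held in memory.
func (i *ingester) labelValues(name string) []string {
	i.mu.Lock()
	defer i.mu.Unlock()
	seen := map[string]struct{}{}
	var values []string
	for _, c := range i.chunks {
		if v, ok := c.labels[name]; ok {
			if _, dup := seen[v]; !dup {
				seen[v] = struct{}{}
				values = append(values, v)
			}
		}
	}
	return values
}
```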
We realized that 15 minutes' worth of chunks, even compressed, can be quite a lot, and this was causing the single-binary version of Loki to consume a lot of memory. Since we're committed to making Loki easy to use, we decided to implement label queries using storage instead, utilizing the index in the backend database to look up label names and values. This was implemented in cortex#1346, cortex#1337, and loki#521.
This allowed us to immediately flush the chunks without holding references to them for 15 minutes. However, as we stated before, selecting label names is only possible by loading the chunks, which can still result in slow query performance and high memory usage when fetching them.
We are currently investigating adding a new type of index, similar to the one in Prometheus, that would map series IDs to all of their labels (see the design doc and PR). This way we can get label names and values directly from our indexes.
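As a rough illustration of that idea (with hypothetical types, not the design from the doc or PR), an index that maps each series ID to its full label set can answer label queries without touching chunk data:

```go
package index

// SeriesID is a hypothetical identifier for a log stream,
// e.g. a hash of its full label set.
type SeriesID uint64

// Labels maps label names to values for one series.
type Labels map[string]string

// SeriesIndex maps series IDs to all of their labels, similar in spirit
// to the Prometheus TSDB index.
type SeriesIndex struct {
	series map[SeriesID]Labels
}

// LabelNames returns every label name present in the index,
// without fetching or decompressing any chunks.
func (idx *SeriesIndex) LabelNames() []string {
	seen := map[string]struct{}{}
	var names []string
	for _, lbls := range idx.series {
		for name := range lbls {
			if _, ok := seen[name]; !ok {
				seen[name] = struct{}{}
				names = append(names, name)
			}
		}
	}
	return names
}

// LabelValues returns all values observed for a given label name.
func (idx *SeriesIndex) LabelValues(name string) []string {
	seen := map[string]struct{}{}
	var values []string
	for _, lbls := range idx.series {
		if v, ok := lbls[name]; ok {
			if _, dup := seen[v]; !dup {
				seen[v] = struct{}{}
				values = append(values, v)
			}
		}
	}
	return values
}
```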
Lastly, we replaced the official Go gzip package with the compress library, which improved our ingesters' compression speed. Sadly, decompression wasn't improved, but we have a plan to distribute decompression across our queriers, and we may even explore other compression algorithms such as snappy, which trades space for speed.
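Assuming the compress library referred to here is klauspost/compress, whose gzip package mirrors the standard library API, the swap is mostly an import change. The sketch below also shows a snappy round trip for comparison, since snappy is much cheaper to decompress at the cost of larger output; this is an illustration, not Loki's actual chunk encoding code.

```go
package compression

import (
	"bytes"
	"io"

	"github.com/golang/snappy"
	// Drop-in replacement for the standard library's compress/gzip.
	"github.com/klauspost/compress/gzip"
)

// gzipRoundTrip compresses and decompresses data using the faster gzip
// implementation; the API matches compress/gzip, so only the import changes.
func gzipRoundTrip(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	r, err := gzip.NewReader(&buf)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// snappyRoundTrip does the same with snappy: larger output,
// but much cheaper to decompress at query time.
func snappyRoundTrip(data []byte) ([]byte, error) {
	compressed := snappy.Encode(nil, data)
	return snappy.Decode(nil, compressed)
}
```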
Conclusion and Future
As of today, Loki is already fast, and we can grep 24 hours of logs across multiple Kubernetes deployments with a query like ({app="loki"} |= "error"). Most of the request time is spent decompressing chunk blocks; gzip is known to be CPU-intensive when decompressing.
In the future, we plan to improve this by implementing query parallelization (distributed grep) using a frontend in front of the queriers, and by taking advantage of Loki's index sharding so that each sub-query fetches data for a single shard in parallel. See the proposal for Cortex, which can be applied to Loki in much the same way.
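The gist of that parallelization is to split a query's time range (or its index shards) into pieces, fan the sub-queries out concurrently, and merge the results. Below is a rough Go sketch with hypothetical helper names, not the actual query-frontend code:

```go
package parallel

import (
	"context"
	"sort"
	"sync"
	"time"
)

// Entry is a hypothetical log line with its timestamp.
type Entry struct {
	Timestamp time.Time
	Line      string
}

// queryFn runs one sub-query over [from, through); in the real system this
// would be a call to a querier for a single time split or index shard.
type queryFn func(ctx context.Context, from, through time.Time) ([]Entry, error)

// parallelQuery splits [from, through) into fixed-size intervals, runs them
// concurrently, and merges the results back into timestamp order.
func parallelQuery(ctx context.Context, from, through time.Time, split time.Duration, run queryFn) ([]Entry, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		results  []Entry
		firstErr error
	)
	for start := from; start.Before(through); start = start.Add(split) {
		end := start.Add(split)
		if end.After(through) {
			end = through
		}
		wg.Add(1)
		go func(start, end time.Time) {
			defer wg.Done()
			entries, err := run(ctx, start, end)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			results = append(results, entries...)
		}(start, end)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].Timestamp.Before(results[j].Timestamp)
	})
	return results, nil
}
```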
Of course, all our work at Grafana Labs is open source, so feel free to take a look at all the PRs that we’ve made to improve Loki’s performance.
Thanks to Callum, Dieter, Ed, Goutham, and Julie for helping me make this blog live!
More about Loki
In other blog posts, we focus on key Loki features, including loki-canary for early detection of missing logs, the Docker logging driver plugin and support for systemd, and adding structure to unstructured logs with the pipeline stage.
Be sure to check back for more content about Loki.