How a regex simplification in Loki increased performance by up to 300x
When we graduated Loki into a GA release last year, there were more than 137 contributors who already made more than 1,000 contributions to the project. We also added hosted Loki to the lineup of Grafana Cloud offerings after it proved to be stable internally for our ops cluster, storing 40TB and half a trillion log lines each month.
There was, however, one persistent problem that kept surfacing, especially for developers who were writing applications in Go: The regex package was slow.
Our goal has always been to make Loki not only cost-effective. We want it to be efficient, too, so we decided to tackle this problem. In this blog post, we’re going to dissect what the root cause was behind the speed issue and how the Grafana Labs team found a workaround to improve overall performance in Loki.
The problem
It’s universally known that the regex package in Go is quite slow. This package is used by Prometheus, Thanos, and basically everyone using Go and regex. And for those who are working with Loki, it’s especially problematic because if you have to search for your logs and apply a regex for each line, one line at a time, then Loki’s performance will be less than ideal.
That’s because the Go team created the regex algorithm using linear time, which means there’s a direct correlation between the amount of data inputted and the time required to carry out a function. So the more data that’s ingested, the longer the application takes to process the workload.
But that’s not without good reason. The regex in Go was designed with safety as a priority. The algorithm specifically blocks a security attack called the regular expression denial of service (ReDoS) attack, during which an attacker provides a regex that takes an inordinate amount of time to process. That, in turn, causes the program to overload and slow down or shut down altogether.
The Go team initially had expected more engineers to use this package for production services, a use case in which the regexp performance is sufficient and the security feature was necessary. But for those who aren’t implementing it in a similar instance, the package can be sluggish. The Go authors have always been aware of the issue (and the complaints), but the trade-off has always made sense: It’s better to have security over speed.
The solution
Our workaround was inspired by the package itself. There is a syntax sub-package that allows users to parse a regex expression and access the abstract syntax tree.
The package leverages simplification, which breaks down the regular expression by replacing it with another combination of expressions. All the while, the original expression remains intact.
With that concept in mind, we applied our own regexp simplification in Loki by using byte comparisons. This is possible because we are simply filtering and we don’t need to capture matches.
So let’s take a basic example of a filtered expression in Loki such as {foo="bar"} |~ "(DEBUG|WARNING)"
.
If you use the regex package in Go, this will run slowly because you’re matching in your log, for each log line, if there is either a debug or warning label. And for a single query, Loki can traverse millions of log lines.
In Loki, a byte comparison would scan logs to see if it contains “debug” or if it contains “warning.” So we actually don’t run a regex expression. We, in fact, bypass that.
How? First we use the syntax package to parse the user regex and see if there are ways to run the regex differently. (Regular expressions can be very complex so we didn’t want to do our own parsing. Luckily, the Go engineers made this very easy for us!)
If no improvement can be made, then Loki will use the normal regex. But if we can improve upon it and avoid the regex, Loki will run a bytes.Contains
, which is less complex to run than a regex.
So in the above example, because the filtered expression can be simplified, we don’t need to run a costly regex. Instead we only run two bytes.Contains on the string.
Not only does the byte comparison improve the performance of the original package and keep costs down. We do not compromise the original security measure that the Go team implemented because we don’t have to replace the regex package in Loki.
The results
When running a 24h query range for {namespace="loki-dev"} |~ "foo|bar"
, the results went from being 11.5s to 1.5s after the regexp simplification was introduced in Loki v1.4.0.
Some regexp were slow for line filtering, specially the alternate op such as `{app="foo"} |~ "err|panic"`. I dove into the golang regexp/syntax package and find way to simplify them for line filtering. Result speak for itself: pic.twitter.com/19R0r5QLVR
— Cyril Tovena (@Kuqd) February 25, 2020
Overall, for various use cases we saw benchmark improvements ranging from 5x to more than 300x!
What’s next
One thing we want to do in the future is extract all that work in the PR and create a package that can be used by the whole ecosystem. We want to give anyone who is running a project that leverages filtering and does not require capturing values access to this work as well.
We also presented our vision for the future of Loki at GrafanaCONline 2020, highlighting some of the team’s long-term goals and non-goals, what features we would like to add, and the active discussion and plans for the LogQL query language. You can watch the recording of the talk here.