Grafana Cloud Logs: Lithic’s gateway to better observability
When consumers use credit or debit cards, it’s all about convenience. But for the engineers who build the underlying issuing infrastructure for those cards, that convenience comes with expectations.
“How many times do you have to get embarrassed by swiping a card that doesn’t work before you just don’t pull that card out of your wallet anymore? We take that availability very seriously,” said Howard Tyson, Head of Platform at Lithic, a New York-based company that builds payments infrastructure. The company’s tech enables banks and fintechs to offer debit and credit cards and to meet other payment needs. “Internally, we’re trying to hit five 9s on critical services; if we lose a couple of minutes figuring out what went wrong, we blow our error budget for the entire quarter.”
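To put Tyson’s error-budget remark in numbers, here is a minimal sketch of the arithmetic. It assumes a 90-day quarter and treats “five 9s” as 99.999% availability; the function name is illustrative and not something Lithic uses.

```python
def error_budget_seconds(availability: float, window_days: float) -> float:
    """Return the downtime, in seconds, that an availability target allows over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (1.0 - availability)


# Assumption: a quarter is 90 days and "five 9s" means 99.999% availability.
quarter_budget = error_budget_seconds(0.99999, 90)
print(f"{quarter_budget:.1f} seconds of downtime allowed per quarter")
```

Under those assumptions the quarterly budget works out to a little over a minute, which is why a couple of minutes spent diagnosing an incident would exceed it.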
However unrealistic it is, Lithic’s goal is zero downtime. On the off chance downtime does occur, Lithic’s engineers need to resolve issues quickly so end users remain blissfully unaware of them. Today, Lithic achieves that reliability with Grafana Cloud as part of its modern observability practices, but the company didn’t start there.
As Lithic grew after its founding in 2014, the company’s top engineers found themselves relying on a suboptimal Graphite backend that was collecting a huge number of time series, very few of which were actually being used. On top of that, their logging solution was essentially “one big syslog instance,” Tyson recalls.
The company began using Datadog in an attempt to update their stack and make improvements, but cost became a big concern as Lithic’s business expanded: “The cost was already getting hard to manage and the company was growing fairly fast.”
They also couldn’t give up their old system entirely, so they needed a product that could work with their Graphite stack while also offering better logging capabilities and room to further scale and modernize. And that’s when they found Grafana Cloud.
The path to Grafana Cloud
After years of hosting its own observability stack with Graphite and Grafana OSS, Lithic wanted to switch to a SaaS offering to avoid any undifferentiated heavy lifting. “I’ve run these systems before — it’s painful and everyone is doing the same thing,” Tyson says. So they began their search assuming a dedicated observability vendor could do it more efficiently “because it’s the only thing they’re dealing with instead of being one of a hundred things we have to deal with.”
The Lithic observability team already had enough on their plates. First, they needed a way to manage logs — lots and lots of them. According to Tyson, excessive logs are a common issue among young companies that lack best practices tooling and experience: “Not knowing what’s available, not knowing that other things are super powerful — and just not even knowing they exist. Logs are just the default.”
Lithic has another crucial reason for needing logs, too. Because the company works in the financial sector, it must be able to query logs for compliance purposes for at least a year, and often much longer. Additionally, the customer experience team needs to be able to easily query 30, 60, and 90 days in the past.
There was also the question of getting value from metrics. Brett Jones, Sr. Staff Site Reliability Engineer, joined Lithic in 2022 and recalls, “The first thing that I noticed was that we have metrics, but I don’t know that metrics necessarily equate to monitoring. Just because you have output doesn’t mean you have anything valuable from the output.”
Jones discovered some numbers didn’t make sense, and when he looked into the issue, he realized Lithic was dropping a significant portion of its UDP traffic. “Our system couldn’t keep up with indexing all of those time series,” Tyson explains. “We were dropping metrics on the floor. Our numbers didn’t mean what we thought.”
And since the engineers didn’t have the capacity to completely overhaul the legacy stack, Tyson wondered, “How can we pull these things out in a way that is cost-effective and just effective — full stop — where we can get the visibility and the monitoring that we need on the old stuff while we build a better world for the future? And then we found Grafana Labs, which is doing it with a personal touch, a smile, and with regard to logging, some superpowers.”
Grafana Cloud Logs at Lithic: ‘It felt like magic’
Tyson and his team were initially attracted to Grafana Cloud because it was cost-effective. (“Our Datadog bills for logs weekly was more than our annual logs bills with Grafana Cloud,” Tyson says.) But it was those logging superpowers in Grafana Cloud Logs, which is powered by the open source project Grafana Loki, that really made a huge difference.
For example, other systems make users create a specific index to filter data, he explains. “With Loki, instead of pre-filtering everything and then having like one bucket where we’ve copied a reference to every record, I can have 100 buckets, but have 100 different people rifle through those individually and look at every single one. That’s less efficient in terms of total cycles per query, but it’s infinitely flexible and we can do it all after the fact.”
He also likes that Loki makes it possible to do detailed metrics analysis after the fact. “You’d have to realize there was a thing you cared about before you cared to ask the question. We could answer it for the past quarter in a quarter, but that’s all,” he explains. “It felt like magic to be able to come up with a question about the last six months today and graph that retroactively. That’s awesome.”
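The idea of asking a new question of old logs can be illustrated with a small sketch. This is not Loki’s API or Lithic’s data: the log lines, the search term, and the helper name are all hypothetical, and a plain in-memory scan stands in for what Loki does at query time (where a comparable question would be written in LogQL, for example with count_over_time).

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw log lines as (ISO timestamp, message); nothing was indexed ahead of time.
LOGS = [
    ("2024-01-01T10:05:00", "card swipe declined code=51"),
    ("2024-01-01T10:40:00", "card swipe approved"),
    ("2024-01-01T11:15:00", "card swipe declined code=05"),
    ("2024-01-01T11:20:00", "card swipe declined code=51"),
    ("2024-01-01T11:59:00", "healthcheck ok"),
]


def count_matches_per_hour(logs, needle):
    """Scan every line at query time and count the matches in each hourly bucket."""
    counts = Counter()
    for timestamp, message in logs:
        if needle in message:
            hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
            counts[hour.isoformat()] += 1
    return dict(sorted(counts.items()))


# A question nobody planned for when the logs were written.
declines = count_matches_per_hour(LOGS, "declined")
print(declines)
```

Because the filter is applied when the question is asked, the same stored lines can answer a different question tomorrow without re-ingesting or re-indexing anything, at the cost of scanning more data per query.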
Lithic’s view into their PostgreSQL database focused on replication
Logs: the gateway to better observability
Of course, an observability platform is only really effective if everyone buys into it, and there was some initial reticence about the switch among the 20-person customer support team.
“While the rest of the company charged ahead, our internal tooling needed to catch up for customer implementations, customer success, etc.,” Tyson says. The team never had a UI built for their needs, and they had gotten used to the system in place. “Someone might dump a string into Datadog and search a billion log messages to see if they could find one customer interaction. And then what comes back is a bunch of unrelated stuff all jumbled together in a format that’s not very nice to read. But they learned to use that and they were effective with it.” As a result, the customer team didn’t want to give it up.
As Jones got familiar with how things worked at Lithic, he paid attention to how team members operated. “I noticed a lot of ways that we debug and figure things out was by scanning the logs, reading them, and looking at bits of graphs here and there.”
Before joining the company, Jones had contributed to the open source Loki project and used it heavily, and that gave him a thought: “The tools that Loki provides can give us a lot of this out of the box.” With all of their logs in the system, people would be able to continue to do their usual queries and scans, and extract data from the logs. He also knew he could show them how to write a query to get a histogram and data points off of that. “I saw our heavy use of logs as a gateway into better observability with traces, metrics, and profiling. That was my goal.”
The engineering team began talking to their counterparts to find out more about their specific use cases. They then demonstrated that rather than looking through billions of logs and learning to ignore certain information, they could see everything in one place. Jones also recorded himself using company-specific logs for what he calls “wild queries,” which instantly pulled out data that users previously would have had to scroll around to find. “If you can show people that they can do something that powerful live, without having to pre-index, configure, or really do anything, they want to start learning how to use the tool a little bit more.”
The way Tyson thought of it, the team members were under the impression they wanted a faster and better formatted way to look through 10,000 characters of JSON to find one field. He was able to say, “What if I just put that one field on a column on a table?” He showed them that in Grafana Cloud, they could put in the ID they’re looking for and they get columns for everything they care about, then click on the record to get more detail.
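The “one field on a column” idea can be sketched in a few lines. This is only an illustration, not how Grafana Cloud implements its table view: the JSON records, the field names (transaction_id, status, amount), and the helper are hypothetical.

```python
import json

# Hypothetical JSON log lines; real records would be far larger.
RAW_LINES = [
    '{"transaction_id": "txn_1", "status": "approved", "amount": 1200, "debug": {"hops": 7}}',
    '{"transaction_id": "txn_2", "status": "declined", "amount": 560, "debug": {"hops": 3}}',
    '{"transaction_id": "txn_1", "status": "settled", "amount": 1200, "debug": {"hops": 9}}',
]


def rows_for_id(lines, wanted_id, columns):
    """Keep records matching the ID and project only the chosen fields into columns."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("transaction_id") == wanted_id:
            rows.append({column: record.get(column) for column in columns})
    return rows


table = rows_for_id(RAW_LINES, "txn_1", ["status", "amount"])
for row in table:
    print(row)
```

Each row carries only the fields the support team asked for, while the full record remains available for anyone who needs the extra detail.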
And then the reason to switch to Grafana clicked. “They ended up going from being really worried they were going to give up something they loved, to something where they felt we’d made their jobs way easier,” Tyson says. “They said, ‘I was dreading this, and now I feel like this is the best improvement I’ve had to my job all year.’”
Building a partnership for the future
Grafana Cloud’s ability to work with all sorts of data sources has been a major boon for Lithic, and now the team monitors all of them in one Grafana dashboard.
“We had checks in our legacy Grafana, in Datadog, and in Icinga. They all worked a little bit differently so it was difficult to maintain. Being able to centralize those into one pane of glass was a big win,” Tyson says. “Getting that stuff into a common format and with some better supporting dashboards has really improved everybody’s situational awareness when we get into an incident. With Grafana, we can now set alerts much more effectively and cost-efficiently.”
Currently, Lithic is using Grafana Cloud for its Prometheus backend, Graphite backend, and several integrated CloudWatch backends. And while Lithic still needs Graphite, Tyson says it’s a bonus for it to be hosted by Grafana. “Running Graphite databases is surprisingly a pain at scale. Grafana Labs charges us nominally to make this their problem, which is amazing and a big win for us.”
“We have also recently added Grafana Cloud Traces to our stack and it’s instrumental for all our most critical services and we’re already seeing returns. Traces are that happy medium between the verbosity and unstructured format of logs and the terseness and the numeric values that you get from metrics,” Jones says. “I think tracing is the future. It doesn’t totally replace metrics and logs. But once you get traces, it’s the thing you just start reaching for first.”
In the year that Lithic has been a Grafana Cloud customer, Tyson says he has been thrilled with the product. “It gets way better month after month after month — and the pricing has not been ratcheting up,” he says. Moreover, they feel like they’re a valued partner in this evolving space, which gives them confidence in their continued investment.
As Jones put it: “It feels like we have a real personal relationship with people that care about what our outcomes are.” And thus far, says Tyson, “Grafana is definitely helping us continue to raise the bar.”