
Reduce metrics volume in minutes: How SailPoint manages cardinality and costs with Grafana Cloud

DevOps engineer Lydia Clarke started off 2023 with 75 million active series and one major challenge: Reduce the burgeoning metrics volume at her company, SailPoint, without wreaking havoc on the developer experience.

“The main con was that if we went over our allotted limit, our on-call engineer would get blown up with alerts,” says Clarke, who joined the observability team at the Austin-based identity security company in 2021. “Our temporary solution was to turn off observability in a few of our dev clusters, which wasn’t our favorite solution. But we didn’t have an instant fix.”

That changed with Adaptive Metrics, the metrics management feature in Grafana Cloud that enables teams to aggregate unused and partially used metrics into lower cardinality versions of themselves to reduce costs. Within a few months of applying Adaptive Metrics aggregation suggestions in combination with internal efforts within the engineering team, SailPoint managed to reduce their metrics volume by 33%.

“The main issue was that if you have 50 different services using the same metric, you have to find a way to make changes to the metric without disrupting any of the teams,” says Clarke. “The great thing with Adaptive Metrics is, instead of using my own tools to figure out who’s using the metric and following up with each team, it just tells me exactly what I’m looking for. If I want the metrics that we’re not seeing usage on, the tool gives me that information a lot quicker than having to go dig for it.”

Now, as SailPoint continues to scale and add new services to its portfolio, “Adaptive Metrics really helps us to grow efficiently from here on out, without just blowing up our metrics and our costs,” says Omar Lopez, head of the observability team. Adds Clarke: “It’s given us more control over our metrics.”

As a result, SailPoint is eager to build on their growing partnership with Grafana Labs. “The engineers are just phenomenal. It really shows through the product itself, but also in the support we get when we have questions or we need solutions,” says Lopez. “I put a lot of faith in their engineering chops.”

Migrating metrics to Grafana Cloud

As a leading provider of identity security, SailPoint is focused on delivering the next-generation identity security platform—a scalable, intelligent, extensible approach to manage and secure access to critical data and applications for the modern enterprise.

“We’re in hyper-growth mode. We’re growing not only the company, but our footprint in the cloud,” says Lopez, who joined SailPoint three years ago. Even then, “our infrastructure was growing, our metrics were growing, and everything was just growing at such a rapid pace that we were hitting our instance limits.”

As their Prometheus servers got bigger and bigger, SailPoint also started to max out its AWS instances, forcing the observability team to horizontally scale their infrastructure with Cortex, and later with Grafana Mimir, the open source TSDB launched by Grafana Labs in 2022. The team used Grafana OSS to visualize and monitor the health and performance of the infrastructure.


A look at how SailPoint tracks their active series limits

While the updated architecture was an effective solution, “it took more and more engineering power to keep that going,” says Lopez. When the team sat down to analyze the total cost of ownership for their self-hosted metrics, taking into account infrastructure costs and manpower, they realized their monthly bills were costly.

And they assumed it would be an even bigger price tag for a hosted solution — until they looked into Grafana Cloud Metrics. “When we crunched the numbers, Grafana Labs was offering to run everything for cheaper, and it would reduce the load on our engineering team. That was our ‘a-ha’ moment,” says Lopez.

Not to mention, “the product and the performance were equal,” adds Lopez. “When we talked to our engineering teams, there was no drop off in performance. Everyone was extremely pleased with the transition.”

In fact, the team was so pleased that they also migrated from Grafana OSS to Grafana Cloud for visualizations and monitoring. “We spent a lot of time managing Grafana internally. Now we have more time to innovate,” says Lopez. “And the enterprise plugins were interesting to us — for example there’s a Snowflake plugin and we run Snowflake. There were a lot of really nice-to-haves that we thought we could leverage.”

Adaptive Metrics: low risk, high rewards

One feature, however, quickly became a must-have: Adaptive Metrics. While Grafana Cloud made storing metrics easier on the team, “cardinality is such a killer,” says Lopez.

When Lopez posed the 75-million-metric question to Clarke to solve, “it was overwhelming to start,” says Clarke. “I didn’t know where I was going to be able to cut down.”

Clarke first turned to Grafana Cloud’s Cardinality Management dashboards, which display how metrics and labels are distributed across time series data sent to Grafana Cloud Metrics. The dashboards also provide usage information that shows which of the metrics in Grafana Cloud are actually being utilized.
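To make the idea of cardinality concrete, here is a minimal Python sketch of the kind of breakdown such a dashboard summarizes: how many series each metric has, and how many distinct values each label contributes. The sample series, metric names, and helper functions are hypothetical and are not taken from SailPoint's environment or from Grafana Cloud itself.

```python
from collections import defaultdict

# Hypothetical sample: each active series is a metric name plus a set of labels.
series = [
    {"__name__": "http_requests_total", "service": "auth", "pod": "auth-1"},
    {"__name__": "http_requests_total", "service": "auth", "pod": "auth-2"},
    {"__name__": "http_requests_total", "service": "search", "pod": "search-1"},
    {"__name__": "queue_depth", "service": "search", "pod": "search-1"},
]


def series_per_metric(series):
    """Count how many series exist for each metric name."""
    counts = defaultdict(int)
    for labels in series:
        counts[labels["__name__"]] += 1
    return dict(counts)


def label_value_counts(series, metric):
    """Count the distinct values each label takes for one metric."""
    values = defaultdict(set)
    for labels in series:
        if labels["__name__"] != metric:
            continue
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


per_metric = series_per_metric(series)
per_label = label_value_counts(series, "http_requests_total")
print(per_metric)
print(per_label)
```

In this toy data the pod label has the most distinct values, so it is the label that drives the series count for http_requests_total.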

While the Cardinality Management dashboards helped identify metrics usage, Clarke then had to take that information and do her own detective work within the company. “There are so many things I have to look into,” Clarke says. “Because my team provides the infrastructure for other teams to create and monitor their own metrics, we don’t always know the context of every single metric we’re supporting. In order to understand what changes could be made to a metric to reduce cardinality, I had to meet with our SRE team to look at the importance of each metric and what visibility they were providing the teams they belonged to.”

That process was often unpredictable. Once an underutilized metric was identified, Clarke would have to reach out to the respective teams that use the metric to make sure that any changes wouldn’t disrupt their services. “Let’s say you have 50 different services using the same metric, that’s a big challenge,” says Clarke. “It was a lot of following up.”

When that work was done, Clarke then had to manually deploy the changes in SailPoint’s development environment for testing. When everything worked right, she would deploy the changes to each Prometheus instance and troubleshoot as needed to ensure there were no interruptions in services. “It would take hours of my day,” says Clarke.

With Adaptive Metrics, Clarke’s job is infinitely easier. Adaptive Metrics recommends which metrics to aggregate based on the actual usage within your cloud environment. It then continuously reanalyzes your usage, adapting its recommendations to reflect changes in your observability stack.
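As a rough illustration of what aggregating a metric into a lower cardinality version means, the Python sketch below sums sample values across a label that is assumed to be unused. It is a toy model rather than Grafana's implementation: the samples, the label names, and the choice of a sum aggregation are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical samples: (labels, value) pairs for one metric at one instant.
samples = [
    ({"service": "auth", "pod": "auth-1"}, 120.0),
    ({"service": "auth", "pod": "auth-2"}, 80.0),
    ({"service": "search", "pod": "search-1"}, 40.0),
]


def aggregate_sum(samples, drop_labels):
    """Sum values over the dropped labels, keeping one series per remaining label set."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


aggregated = aggregate_sum(samples, drop_labels={"pod"})
print(len(samples), "series before,", len(aggregated), "series after")
```

Dropping the pod label here collapses three series into two, one per service, while a query that only groups by service would still return the same totals.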

When we use Adaptive Metrics, I just have to download the recommendations, run a script, apply it, and I’m done. That’s maybe five minutes that I can apply the changes compared to the few hours I was spending trying to deploy to Prometheus. It’s an instant fix that we can implement and not worry about affecting anyone.

Lydia Clarke, DevOps engineer
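The article does not show the script Clarke runs, so the following Python sketch is only a guess at what one step might look like: turning a downloaded set of recommendations into a small batch of rules, in the spirit of applying them little by little. The payload shape and the field names (metric, drop_labels, aggregations, recommended_action) are assumptions for illustration and should be checked against the Adaptive Metrics documentation. The sketch deliberately stops before uploading anything.

```python
import json

# Hypothetical recommendations payload; field names are assumptions for illustration.
recommendations_json = """
[
  {"metric": "http_requests_total", "drop_labels": ["pod"],
   "aggregations": ["sum"], "recommended_action": "add"},
  {"metric": "queue_depth", "drop_labels": ["instance"],
   "aggregations": ["sum"], "recommended_action": "add"},
  {"metric": "build_info", "recommended_action": "keep"}
]
"""

# Fields assumed to belong to a rule; everything else is treated as recommendation metadata.
RULE_FIELDS = ("metric", "drop_labels", "aggregations")


def to_rules(recommendations, batch_size):
    """Convert up to batch_size 'add' recommendations into rule dictionaries."""
    rules = []
    for rec in recommendations:
        if rec.get("recommended_action") != "add":
            continue
        rules.append({field: rec[field] for field in RULE_FIELDS if field in rec})
        if len(rules) == batch_size:
            break
    return rules


rules = to_rules(json.loads(recommendations_json), batch_size=1)
print(json.dumps(rules, indent=2))
```

Keeping the batch small mirrors the cautious rollout described in the next paragraphs: each batch can be reviewed, applied, and reverted on its own.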

In fact, when SailPoint first started testing Adaptive Metrics, no one even noticed any changes. “One of our biggest concerns was what if we dropped something that a team needs?” says Lopez. “But once we dug into it and started applying these recommendations little by little, it became very clear that engineering didn’t even realize what was happening.”

“We haven’t had any complaints about dashboards not working, or a query. Everything runs like it did before,” adds Clarke. “Adaptive Metrics has been a great way to show engineering that we can make changes to metrics, and they don’t have to worry about things breaking. And I don’t have to worry about causing outages or downtime. It is a really low-risk solution.”

And the return on investment has been high. Recently, there was an issue in the production environment where a misconfigured metric caused a spike during deployment. “We applied Adaptive Metrics, we troubleshooted it, and then we reverted the metric,” says Clarke. “Adaptive Metrics gives us more control when we want to get back under our limit if we hit our limit.”

Luckily that doesn’t happen too often anymore. Currently SailPoint manages about 50 million active series, and they have ambitions to reduce that number even more. “We’ve overachieved on our goal,” says Lopez.

More importantly, “now when I’m on call, I’m not worried about getting blown up with Prometheus and Grafana alerts,” Clarke laughs.

‘The best of the best’

From the beginning, there has been one key metric that stood out for SailPoint about working with Grafana Labs: the support.

Though they were open source users, SailPoint engineers often found Grafana Labs engineers responsive to their questions — providing technical support, hosting webinars for the company, and overall, being reliable partners from the start. “What stuck in the back of my mind was that they were just doing this to help us be successful,” says Lopez. “So when it came around to making a move to Grafana Cloud, I always thought about that fondly—that great support from the team.”

He also appreciates the technical prowess behind every Grafana Labs product. “The engineering team is super sharp. They’re experts. This is the best of the best,” says Lopez. “When they release a new tool or they have logs, tracing, or all other aspects of observability, there’s going to be some good engineering thought behind it, and it’s worth looking at and considering.”

In addition to using Grafana Cloud for visualizations and Grafana Cloud Metrics for storage and management, the SailPoint observability team is looking to extend their stack to include Grafana Cloud Frontend Observability for real user monitoring. The company has also adopted Grafana Cloud k6, the hosted performance testing solution powered by the open source Grafana k6 project. (“That is another team entirely. They did their own analysis, and k6 came out the winner,” notes Lopez. “It’s awesome that a different team came to the same conclusion” about Grafana Cloud.)

“We’re in the middle of our observability journey, but we’re just at the beginning of the relationship with Grafana Labs,” says Lopez. “We’re exploring more and more of the features and services in Grafana Cloud. It’s like being a kid in a candy store — we’re looking at everything.”

Industry: Software & Technology
Company Size: 2400+
Headquarters: Austin, TX, USA
33% reduction in metric volume by deploying Adaptive Metrics in conjunction with other internal team efforts
50 million total active series now managed within Grafana Cloud Metrics