2021: The year of Cortex for IoT?
My Grafana Labs colleague RichiH recently talked about why IoT and time series databases work so well together. It just so happens that we have a highly scalable time series database on hand. Let’s talk about that.
My name is Goutham, and I am a maintainer for Cortex. I have been working on it for nearly three years out of the four-and-a-half years the project has existed. Cortex is built to serve as a scalable, long-term store for Prometheus. It ingests data via the Prometheus remote-write protocol, and you can query the data back using PromQL and Prometheus remote-read.
I am going to make a bold prediction: Cortex will take major steps towards being a database for IoT in 2021. While I am not an expert in IoT, I’ve seen people struggle to send data from their sensors to Grafana Cloud, Grafana Labs’ managed observability stack powered by Cortex, and I want to fix that.
I plan to set up a homelab pushing sensor data to Grafana Cloud and also speak to customers who approach Grafana Labs for an IoT database solution to understand the market better. I am starting by writing this post to put my thoughts on paper and to make sure I haven't missed anything major. Please email me if you think of something I should include but didn't.
Cortex is built for Prometheus
The Cortex website describes Cortex as a “horizontally scalable, highly available, multi-tenant, long term Prometheus.” Prometheus is an open source systems monitoring and alerting toolkit. It has a query language called PromQL and a powerful data model that allows you to monitor your applications and infrastructure. At the core of Prometheus is a time series database (TSDB) that makes Prometheus very easy to use for monitoring and alerting. The PromQL query language is among the best if you want to analyze what your services are doing.
Over the past 15 months, the Cortex maintainers, along with the community, have been hard at work replacing the Cortex storage engine with one powered by Prometheus' own TSDB but backed by object storage, an approach we borrowed from our sister project Thanos.
It’s been a long journey, but we’ve just migrated all our production clusters to this new storage engine. It gives us much higher scalability at a fraction of the cost, and we’re seeing improvements in query latency across all our clusters. But what excites me most about this new engine is that you can scale to hundreds of millions of active series quite easily, with your only dependency being object storage, which is very cost-efficient given our state-of-the-art compression.
At its core, Cortex is a TSDB, just like Prometheus. But I don't think anyone is using it for anything other than monitoring yet. I think this is because the interfaces to access the DB are still lacking for use cases beyond Prometheus.
Requirements for an IoT database
I want Cortex to become a database for IoT data. For a proper IoT database, I see a few requirements:
- A solid TSDB: Every IoT database needs a solid time series database backing it. IoT systems generate lots and lots of data, most of it time series, and you need a really efficient and scalable database to store and query it.
- A simple way to ingest: IoT devices are simple devices that will likely communicate using MQTT, TCP, or HTTP, and the database needs to accept data directly over those protocols. Moreover, the payload should be simple to generate.
- Out-of-order ingestion: IoT devices will often push data over unreliable networks, or be detached from the network for minutes or weeks, and this means that the data may reach our database out of order. The database needs to be able to handle out-of-order writes.
- A query language: IoT devices usually don't hold state and instead push deltas. This means we need a query language that can handle large intervals between points as well as gauges. PromQL is powerful, but we need a few more functions like last_over_time for it to be really useful for IoT analysis.
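To make the last requirement concrete, here is a minimal Python sketch of what a function like last_over_time could compute for a sparse series: the most recent value inside a lookback window, however far apart the points are. This is only my illustration of one plausible semantic, not code from Prometheus or Cortex; the function signature and the (timestamp, value) tuple layout are assumptions.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# A series is a list of (timestamp_seconds, value) pairs sorted by timestamp.
Series = List[Tuple[int, float]]


def last_over_time(samples: Series, eval_ts: int, window: int) -> Optional[float]:
    """Return the newest value in the window (eval_ts - window, eval_ts].

    Returns None when no sample falls inside the window.
    Illustrative only: the name mirrors the PromQL function mentioned
    above, but this signature is an assumption made for the sketch.
    """
    timestamps = [ts for ts, _ in samples]
    # Index of the last sample whose timestamp is <= eval_ts.
    idx = bisect_right(timestamps, eval_ts) - 1
    if idx < 0:
        return None
    ts, value = samples[idx]
    return value if ts > eval_ts - window else None


# A sensor that reports roughly once an hour.
readings = [(0, 20.0), (3600, 21.0)]
print(last_over_time(readings, eval_ts=7000, window=7200))
print(last_over_time(readings, eval_ts=20000, window=7200))
```

With a wide enough window, a device that reports once an hour still yields a value at every evaluation step, which is what makes this kind of function useful for infrequent IoT data.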
Cortex as an IoT datastore
The best thing that Cortex has going for it today is its TSDB. We’ve built a scalable, inexpensive backend that is also fast and efficient. You can store hundreds of millions of active series and petabytes of data without breaking a sweat.
But Cortex currently only accepts the Prometheus remote-write protocol, which is Snappy-compressed protobufs and not at all suitable for IoT use cases. To this end, I think Influx built a great protocol that is both easy to generate and consume. I've quickly hacked together influx2cortex, a write proxy that accepts different protocols and writes them to Cortex. Today influx2cortex only supports the Influx line protocol, but I have plans to add other protocols in the future (JSON is the main one; I need to see how MQTT push would work). The proxy is only temporary, and we intend to merge this functionality into Cortex upstream. We are using influx2cortex as a playground to understand the use cases and to experiment with the code before merging it upstream.
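To give a feel for what such a translation involves, here is a minimal Python sketch that turns one Influx line protocol line into Prometheus-style samples. It is not the influx2cortex code. The mapping (metric name built as measurement_field, tags copied to labels) is an assumption I chose for illustration, and the parser skips escaping, string fields, and lines without a timestamp.

```python
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    labels: Dict[str, str]  # Prometheus labels, including __name__
    value: float
    timestamp_ms: int       # Prometheus timestamps are in milliseconds


def line_to_samples(line: str) -> List[Sample]:
    """Convert one Influx line protocol line into Prometheus-style samples.

    Simplifications (assumptions of this sketch): no escaped spaces or
    commas, numeric fields only, and a nanosecond timestamp is present.
    """
    head, fields, ts_ns = line.strip().split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    timestamp_ms = int(ts_ns) // 1_000_000

    samples = []
    for pair in fields.split(","):
        key, raw = pair.split("=", 1)
        # Influx marks integer fields with a trailing "i", e.g. 40i.
        value = float(raw[:-1]) if raw.endswith("i") else float(raw)
        # Hypothetical naming scheme: one series per field.
        labels = {"__name__": f"{measurement}_{key}", **tags}
        samples.append(Sample(labels, value, timestamp_ms))
    return samples


line = "weather,location=attic temperature=21.5,humidity=40i 1609459200000000000"
for sample in line_to_samples(line):
    print(sample)
```

Each field becomes its own series because a Prometheus sample carries a single float value, while one Influx line can carry several fields.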
You can now point your Telegraf or other systems pushing to Influx at the proxy, and it will write equivalent Prometheus series to Cortex. We now support it in Grafana Cloud, and you can take it for a spin.
The ability to handle out-of-order data is important for a push-based system, and it is a problem that is tricky to solve. We currently cannot do this in Prometheus or Cortex. Any solution will drive up the resource usage quite a bit, and I have a hard time imagining how to support it. Having said that, I see it as a very important feature we will have to build. We’re slowly seeing Prometheus and Cortex being used for more and more use cases, and there is consensus in Prometheus to add it upstream. Grafana Loki will be tackling out-of-order inserts soon, and given the similarity in architecture, I think we would be able to use their approach.
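To show why this bites IoT devices in particular, here is a toy Python model of an append-only series that only accepts samples newer than the last one it stored. It is a deliberate simplification and not how Prometheus or Cortex is implemented; the class and function names are made up for this sketch.

```python
from typing import List, Tuple


class OutOfOrderError(Exception):
    """Raised when a sample is not newer than the last stored sample."""


class AppendOnlySeries:
    """Toy series that accepts samples only in increasing timestamp order."""

    def __init__(self) -> None:
        self.samples: List[Tuple[int, float]] = []

    def append(self, ts: int, value: float) -> None:
        if self.samples and ts <= self.samples[-1][0]:
            raise OutOfOrderError(f"sample at {ts} is not newer than {self.samples[-1][0]}")
        self.samples.append((ts, value))


def ingest(series: AppendOnlySeries, batch: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Append each sample in the batch and return the ones that were rejected."""
    rejected = []
    for ts, value in batch:
        try:
            series.append(ts, value)
        except OutOfOrderError:
            rejected.append((ts, value))
    return rejected


series = AppendOnlySeries()
# A live reading arrives first...
ingest(series, [(30, 21.5)])
# ...then the device reconnects and flushes readings it buffered while offline.
lost = ingest(series, [(10, 20.9), (20, 21.1), (40, 21.7)])
print("rejected:", lost)
print("stored:", series.samples)
```

In this model the buffered readings at timestamps 10 and 20 are dropped because a newer sample was already stored, which is exactly the data an intermittently connected sensor would lose.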
When it comes to the query language, I know I am biased, but I think PromQL is a great language. If that is not enough, I am looking to add Flux support. There is already a PR that adds it upstream, and I plan to pick it up quite soon. But I expect the Flux interface to use remote-read, which means you'll lose out on many of the caching and parallelization features of Cortex. I envision people will use PromQL for most of their needs and resort to Flux only on a very occasional basis. Further, we are continuously adding more functions to PromQL to enable new use cases!
To reiterate, I think with influx2cortex, Cortex is a good enough IoT database already, and with the addition of Flux support, it will be even better. But I am still not sure how important out-of-order ingest support is. If you have experience with this, please reach out, as I'd love to chat with you!
Looking into the future
I think we’ve built something powerful and beautiful with Cortex, and I want to expand its use case from just Prometheus long-term storage to IoT and more. I think the current state of affairs is more than good enough for people to start experimenting and seeing if Cortex is a good datastore for their time series data. At the same time, we will be busy chipping away and improving things. The usual Cortex write path has been optimized to its peak, and we need to do the same for the other protocols.
I will be publishing a series of articles here as I experiment more and more with an IoT homelab and investigate how some of our customers on Grafana Cloud are using Cortex for IoT. If you want to play with a fully managed time series database that is also great for Prometheus data, check out Grafana Cloud. The free tier of Grafana Cloud is great for testing and seeing what it can do, and you can sign up without entering a credit card!
Finally, my goal in 2021 is to get 10 hobby users and at least one serious customer using Grafana Cloud for IoT to help me refine this use case for a much wider audience. If this sounds interesting to you, please feel free to contact me — I’d love to talk!