Chronosphere: Metrics at the Scale of Uber

3 Mar 2020 9:35am, by

As companies move to microservice architectures, they find that the sheer volume of metrics that their systems create simply explodes. That was the situation with Uber back in late 2014, when it realized its Graphite/Carbon monitoring stack just couldn’t keep up.

It later tried a mix of statsite for aggregation, Cassandra for time-series storage, and ElasticSearch for indexing, but found that inadequate as well.

That ultimately led the ride-share company to build its own metrics platform, called M3, which it open sourced in 2018. Now two of the technical leads on that project, Rob Skillington and Martin Mao, are propelling that technology into a company called Chronosphere.

“Uber was growing, and we had to scale out the infrastructure, but at the same time, the complexity of the monitoring sort of increased by orders of magnitude,” said Mao, Chronosphere’s CEO.

“What happens with these architecture changes, because there are so many smaller components now. What happens from the monitoring side is that a lot more data gets produced. And a lot of this data is high cardinality or high dimensionality data. We experienced that none of the open source tools we were using could really scale to all the new amounts of data and the high dimensionality of the data that was now being produced, and actually none of the commercial offerings” could either, Mao said.
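To make the "high cardinality" point concrete, here is a minimal sketch of how the number of distinct time series can grow when one more label is attached to a metric. The label names and value counts are invented for illustration and do not come from Uber or Chronosphere; the product is an upper bound, since not every combination of values necessarily occurs in practice.

```python
from math import prod

# Hypothetical labels on a single metric and how many distinct values each can take.
# These names and numbers are illustrative assumptions, not figures from the article.
label_values = {
    "service": 50,
    "endpoint": 20,
    "status_code": 5,
}


def series_count(labels):
    """Upper bound on distinct time series: the product of per-label value counts."""
    return prod(labels.values())


# Cardinality with the three labels above.
before = series_count(label_values)

# Adding one per-pod label with 1,000 values multiplies the bound by 1,000.
after = series_count({**label_values, "pod": 1000})

print(before, after)
```

Under these assumed numbers, a single extra label moves the bound from thousands of series to millions, which is the kind of jump Mao describes.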

Scaling Up and Up

In November, the company announced an $11 million investment led by Greylock Partners.

Greylock’s Jerry Chen told The New Stack:

“Observability, and metrics in particular, have become the necessary pulse for all companies of all sizes. Metrics for apps, infrastructure, and business operations are now essential for companies to run their business.

“Over the past few years, as companies have moved to cloud-native architectures, they have seen an explosion in the number of metrics and high dimensionality or HD metrics. Every app event now creates a ton of data like timestamp, app version, OS version, latency, etc. Existing systems can’t handle HD metrics in a cost-effective manner. The Chronosphere team solved this problem at Uber building M3, a solution that could handle scale but also at cost.

“Chronosphere is building the ability to query high-dimensionality metrics to all companies of all sizes. Using Chronosphere, customers now have visibility into app and infrastructure usage, cost consumption by app or BU, from an always-on and reliable cloud service.”

Mao has described M3DB as being purpose-built for metrics. He explained that other time-series databases (like TimescaleDB, which is built on Postgres, or Cockroach or PingCAP’s TiDB, which rely on RocksDB) have at their core databases that were not built for the volume or complexity of metrics in the cloud-native world. He mentioned InfluxDB, however, as one of the others built from the ground up to be optimized for time-series data.

He pointed to Datadog as Chronosphere’s closest competitor, though Datadog is targeting a wider swath of customers, he said, while Chronosphere is highly focused on scale and efficient monitoring for large enterprises.

Skillington, now Chronosphere’s chief technology officer, originally described M3 as a remote storage backend for Prometheus.

He said when it was open sourced:

“Released in 2015, M3 now houses over 6.6 billion time series. M3 aggregates 500 million metrics per second and persists 20 million resulting metrics-per-second to storage globally (with M3DB), using a quorum write to persist each metric to three replicas in a region. It also lets engineers author metric policies that tell M3 to store certain metrics at shorter or longer retentions (two days, one month, six months, one year, three years, five years, etc.) and at a specific granularity (one second, ten seconds, one minute, ten minutes, etc.). This allows engineers and data scientists to intelligently store time series at different retentions with both fine and coarse-grained scopes using metrics tag (label) matching to defined storage policies.”
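The tag-matching idea in that description can be sketched roughly as follows. This is not M3's actual rule format or API; the `StoragePolicy` and `Rule` classes, the glob-style filters, and the example tags are all hypothetical, chosen only to show how a metric's tags might select one or more granularity and retention pairs.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class StoragePolicy:
    """Hypothetical policy: keep one point per `resolution` for `retention`."""
    resolution: str  # e.g. "10s"
    retention: str   # e.g. "2d"


@dataclass
class Rule:
    """Hypothetical rule: a tag filter (tag name -> glob pattern) plus policies."""
    tag_filter: dict
    policies: tuple


def matches(rule, tags):
    # A rule matches when every filtered tag is present and its value fits the glob.
    return all(
        name in tags and fnmatchcase(tags[name], pattern)
        for name, pattern in rule.tag_filter.items()
    )


def policies_for(tags, rules, default):
    # Collect the policies of every matching rule; fall back to the default policy.
    matched = [p for rule in rules if matches(rule, tags) for p in rule.policies]
    return matched or [default]


# Illustrative rules: fine-grained short retention plus a coarser long retention.
rules = [
    Rule({"service": "payments*"},
         (StoragePolicy("10s", "2d"), StoragePolicy("1m", "1y"))),
    Rule({"pod": "*"}, (StoragePolicy("1s", "2d"),)),
]
default = StoragePolicy("1m", "30d")

payments = policies_for({"service": "payments-api", "region": "us"}, rules, default)
other = policies_for({"service": "maps"}, rules, default)
print(payments)
print(other)
```

In this sketch a metric tagged `service=payments-api` would be kept at two granularities, while an unmatched metric falls back to the assumed default.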

The open source software has three main components: a time-series database, an ingestion pipeline and a query engine. It doesn’t come with any dashboarding, alerting capabilities or anomaly detection capability, features that the Chronosphere team has built into its SaaS product.

M3 can be run on public clouds or on-premises. It supports metric ingestion formats and languages including SQL, Prometheus, Carbon, Graphite and more.

Controlling Costs

Chronosphere lets the customer decide how long to store different kinds of data.

Someone rolling out a new deployment of a service may want to know which particular containers or Kubernetes pods are causing issues. But after the deployment has been operational for a while, particular IDs or particular containers are less useful. Chronosphere allows the user to specify how long to retain every subset of the data.

“So for example, all of the ephemeral data, like the information about the pods and containers, you can choose to only store them for two hours if you want, for six hours if you want, depending on your use case, and then you only pay for the resource usage of that period of time,” Mao said.
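A rough back-of-the-envelope sketch shows why shorter retention for ephemeral data matters. It assumes, purely for illustration, that stored volume scales with the number of data points kept per series and ignores compression; the article does not describe how Chronosphere actually meters usage, and the 10-second granularity and 30-day comparison are assumed values.

```python
from datetime import timedelta


def points_retained(resolution, retention):
    """Data points kept per series: one point per `resolution` over `retention`."""
    return retention // resolution


resolution = timedelta(seconds=10)  # assumed granularity

# Per-pod data kept for two hours, as in Mao's example.
short_lived = points_retained(resolution, timedelta(hours=2))

# The same series kept for an assumed 30 days instead.
long_lived = points_retained(resolution, timedelta(days=30))

print(short_lived, long_lived, long_lived // short_lived)
```

Under these assumptions, keeping pod-level series for two hours rather than 30 days retains 360 times fewer points per series.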

Beyond automating dashboarding and alerting, Chronosphere has been adding enterprise features such as intelligent rate-limiting, fine-grained access controls and resource limitations, because it is focused on large organizations.

You could implement the open source M3 platform, he noted, but it would be like everyone in the company using one tool, with the ability to step on others’ toes or one person bringing down the system for everyone. Its new features and roadmap are focused on use within a multitenant organization.

In November, it added distributed tracing capability that’s deeply integrated into the existing monitoring stack. In the future, it will build on that integration, for instance by adding the ability to click on any metric data point on the dashboard and go directly to one of its underlying distributed traces.

Mao has said that scalability is the determining factor in rolling out new features.

It wants to give users more visibility into how different teams are using the product, and more capability to control their costs by providing options when they get close to maxing out their budgets.

The technology remains in private beta, though it has customers running Chronosphere in production, Mao said. It has no firm timeline for ending beta but probably will reach GA later this year. It plans to continue working directly with customers to onboard them rather than offering self-service via a webpage or other method.

Feature image: “Helm” by depo17. Licensed under CC BY-SA 2.0.
