How Datadog Monitors Scalable Systems
As application monitoring and container monitoring continue to evolve, enterprises face issues they may not have considered in the past. As systems become more complex, the solutions deployed to monitor a system’s health must at the very least be on par with that system in scaling and availability. As more companies move from on-premise solutions to working entirely in the cloud, dynamic scaling has become crucial. Whether a company relies on Kubernetes, auto-scaling, or another platform, its system monitoring platform must also scale to meet these needs.
Datadog is a monitoring platform born in the cloud. As such, it understands the challenges companies face when scaling applications and has engineered its solutions to handle them. Datadog passes this knowledge on to customers by “building the relevant monitoring and integrations directly into our platform,” said Ilan Rabinovitch, Datadog’s director of technical community. “This means offering tooling and alerting that is configured as dynamically as the infrastructure we are monitoring for our customers, collecting metrics from hundreds of cloud platforms and open source tools out of the box, and offering intelligent algorithm alerting via methods such as outlier detection.”
Operating at Scale
When first looking into application performance monitoring, companies should consider the type of data they want to collect and determine which metrics matter most to the organization. Rabinovitch notes that this should be the main focus for any organization interested in setting up a monitoring platform.
“While it might be interesting to know that CPU or memory usage might be higher than normal on a given cluster, paging your team about resource usage is the first step in the path to pager fatigue,” Rabinovitch said.
Datadog recommends that teams focus on what it has dubbed ‘work metrics,’ which indicate the top-level health of a system by measuring its useful output. What an organization defines as useful output will vary based on the use case and the customer base being served.
Rabinovitch offers an example: for an organization built around a web service, ‘useful output’ may mean the percentage of API calls that return successfully, the number of requests served per second, or similar metrics. These statistics are more useful to alert a team on than resource-level alarms, because they indicate directly whether a company is delivering its services to customers without issue.
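As a rough illustration of the work metrics described above, the following Python sketch derives a success percentage and a requests-per-second figure from a batch of request records. The `Request` record, its fields, and the `work_metrics` helper are hypothetical names chosen for this example and are not part of any Datadog API. Treating every status code below 400 as a success is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served API call."""
    timestamp: float  # seconds since the start of the window
    status: int       # HTTP status code returned to the client


def work_metrics(requests, window_seconds):
    """Return (success_percent, requests_per_second) for one time window.

    success_percent is None when no requests were served, because a
    success rate is undefined without traffic.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    requests_per_second = total / window_seconds
    if total == 0:
        return None, requests_per_second
    # Assumption: any status below 400 counts as a successful call.
    successes = sum(1 for r in requests if r.status < 400)
    return 100.0 * successes / total, requests_per_second


sample = [Request(0.1, 200), Request(0.7, 200), Request(1.2, 500), Request(1.9, 404)]
success_percent, rps = work_metrics(sample, window_seconds=2)
```

An alert keyed to `success_percent` falling below an agreed level speaks directly to what customers experience, which is the point Rabinovitch makes about paging on work metrics rather than on resource usage.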
All Systems Go
Datadog has seen adoption in organizations both large and small. It was recently picked up by Lithium to monitor OpenStack and Kubernetes clusters, along with the applications deployed on top of those clusters. Lithium uses OpenStack to piece together not only its production and development environments but also its production communities and the demo environments for its sales engineers. As such, Lithium turned to Datadog to help it monitor OpenStack around the clock. With Datadog, Lithium can see the total number of instances running, available memory, metric deltas, and more.
As it continues to be used in supporting open source projects and platforms, Datadog has also been working with the Apache Software Foundation to help the non-profit identify and resolve capacity problems at its hosting providers. With the advent of auto-scaling hosts on platforms such as AWS, Kubernetes, and Mesos, environments are in a constant state of flux.
“This constant change adds operational complexity and makes it difficult to use pre-defined thresholds to alert on abnormal, as normal changes from minute to minute,” Rabinovitch said.
To better understand how containers are used in a modern stack, Datadog recently ran a study on Docker usage and adoption, which found that most hosts tended to run roughly four containers concurrently. Each container lived less than a fourth of the lifetime of its host, which means that monitoring tools should no longer treat individual hosts as their primary unit of measurement. Rabinovitch notes that Datadog monitors across a system’s boundaries, with anomaly detection that helps users detect abnormalities algorithmically rather than relying on preset thresholds.
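To make the contrast with preset thresholds concrete, here is a minimal Python sketch of one common outlier test, based on the median absolute deviation (MAD) across a group of hosts. It is an illustration only and makes no claim about how Datadog implements its own detection. The `mad_outliers` function, the `tolerance` parameter, and the host names are all hypothetical.

```python
from statistics import median


def mad_outliers(values_by_host, tolerance=3.0):
    """Return the sorted names of hosts whose value is far from the group.

    A host is flagged when its distance from the group median exceeds
    `tolerance` times the median absolute deviation (MAD). No fixed
    threshold is needed: "normal" is derived from the peers themselves.
    """
    values = list(values_by_host.values())
    if not values:
        return []
    center = median(values)
    # MAD: the median of each value's distance from the group median.
    mad = median(abs(v - center) for v in values)
    return sorted(
        host
        for host, value in values_by_host.items()
        if abs(value - center) > tolerance * mad
    )


# Hypothetical request latencies in milliseconds for five similar hosts.
latency_ms = {"web-1": 100, "web-2": 105, "web-3": 98, "web-4": 102, "web-5": 400}
flagged = mad_outliers(latency_ms)
```

Because the baseline is recomputed from whichever hosts exist at that moment, the check keeps working as containers and hosts come and go, whereas a fixed per-host threshold would need constant retuning.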
Under the Hood
Like many other software service providers, the team at Datadog has embraced the way of the polyglot, though Rabinovitch noted, “The bulk of our backend systems are written in either GoLang or Python. That being said, we tend to pick the right tool for the job so technology may vary to some extent by team.”
At the data layer, Datadog works with stores such as Cassandra, PostgreSQL, Redis, and Kafka, among many others. Datadog also utilizes a series of homegrown proprietary systems, including a custom database developed for some of its time series data.
“We make heavy use of Chef and Consul to manage our application deployment and service configuration. The Datadog agent which runs on end users’ systems to collect metrics is Python-based. We release the agent as Open Source software via GitHub,” said Rabinovitch. Datadog collects metrics through this agent, which runs locally on a user’s instances or servers. Users can extend the agent with agent checks, which pull metrics from third-party services such as MySQL, Docker, or Cassandra.
Alternatively, developers can submit custom metrics to Datadog using the StatsD forwarder built into the Datadog agent, or via SDKs for most common languages (Python, Ruby, .NET, Java, etc.). Datadog then stores these metrics so that teams can visualize and share the data. This includes visualization dashboards, alerts on service health, and analysis with features such as outlier detection, notes Rabinovitch.
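For readers unfamiliar with StatsD, the sketch below shows roughly what submitting a custom metric looks like at the wire level, using only the Python standard library. It assumes an agent is listening for StatsD datagrams over UDP on the conventional port 8125 of the local machine, and it uses the `|#tag` suffix that Datadog's StatsD dialect adds for tags. The `format_metric` and `send_metric` helpers and the metric names are hypothetical; in practice an official client library would normally be used instead.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD-style datagram, e.g. 'page.views:1|c|#env:prod'.

    metric_type is 'c' for a counter or 'g' for a gauge.
    """
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        # Tag suffix as used by Datadog's StatsD dialect.
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP. Fire-and-forget: no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


counter = format_metric("page.views", 1, "c", tags=["env:prod"])
gauge = format_metric("queue.depth", 42, "g")
```

Calling `send_metric(counter)` would hand the datagram to a local agent, assuming one is running. Because UDP is connectionless, the application does not wait for a reply from the monitoring pipeline, which is one reason this style of submission suits high-volume services.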
Whether your team is just getting started with system monitoring or is already well ahead of the curve, Datadog offers a solid solution for gaining deeper insight into the metrics of an entire system.