Cloud monitoring service Datadog teamed up with incident management company PagerDuty to ensure the reliability of services for Airbnb, according to a just-released case study.
Airbnb, which allows travelers to book a couch or a castle, had developed its own service-based architecture for some components of its site, while other components continued to be part of its main application. Separate engineering teams were created to support the separate components and features, with each responsible for running its own software.
Though teams were responsible for contributing their own operational and business metrics to a central dashboard application, over time, teams used more open source projects with their own dashboards, making it that much more difficult to gain a broad overview of the entire infrastructure.
“When it became clear this approach would be difficult, if not impossible to scale, we decided to look for a comprehensive and more holistic operations performance solution. But we didn’t want to just add another tool that no one would use,” Dave Augustine, engineering manager at Airbnb, said.
Explained Alexis Lê-Quôc, CTO and co-founder of Datadog:
“They were looking for a real-time service — that was very important to them — and to be able to extend across as much of their infrastructure and applications as possible. It’s what you don’t know that kills you in this business.
“We can provide them with a lot of flexibility around the kind of data they’re interested in, the performance metrics, and we can do it in real time. The fact that we integrated well with PagerDuty, that worked particularly well for them.”
At Airbnb, teams can make many of their own decisions about what their software should look like and which technologies to use, but the price they pay for this freedom is that they have to run their own software, Lê-Quôc said.
With Datadog and PagerDuty, on-call engineers and support staff can customize which metrics are monitored and the thresholds that need to be crossed before being notified. PagerDuty can make sure the proper person is called, and with escalation rules, that the next person in line is notified if the original person doesn’t pick up in a timely manner.
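The threshold-and-escalation flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Datadog's or PagerDuty's actual API: the names `breaches`, `escalate` and `EscalationPolicy`, the responder labels and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EscalationPolicy:
    """Ordered on-call list (hypothetical names, for illustration only)."""
    responders: List[str]


def breaches(value: float, threshold: float) -> bool:
    """Return True when a metric value has crossed its alert threshold."""
    return value > threshold


def escalate(policy: EscalationPolicy,
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Notify responders in order and return the first who acknowledges.

    Returns None if nobody in the policy acknowledges the alert.
    """
    for person in policy.responders:
        if acknowledges(person):
            return person
    return None


policy = EscalationPolicy(responders=["primary", "secondary", "manager"])

# Simulate the primary on-call engineer not picking up in time.
handled_by = escalate(policy, lambda person: person != "primary")
```

In this sketch the alert falls through to the second responder, which mirrors the "next person in line" behavior the case study describes; a real system would add timeouts and notification channels.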
“With comprehensive incident trend data to identify critical versus nuisance issues, we’ve been able to cut out additional noise and focus on those that require immediate attention, and to route them to the right engineer so we don’t bother the wrong people,” Augustine said. “So now when engineers hear from PagerDuty, they know it’s the real thing.”
Lê-Quôc pointed to SmartStack, designed to ensure high availability of inter-component connections, as among the services that Datadog monitors for Airbnb.
“If you have a web application server with a database, what if the database goes down?” he explained. “Even if you have a standby node ready to take over, the application doesn’t know where to go, it doesn’t know there’s a standby node it can connect to and resume its work. So they built this connection, in this case, between the web application server and the database that will reroute the traffic. So we monitor the health of SmartStack to make sure it works.”
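The rerouting idea in that quote, sending traffic to a standby when the primary database is down, can be illustrated with a minimal Python sketch. It is not SmartStack's implementation; the function `pick_backend`, the backend names and the health map are assumptions made up for the example.

```python
from typing import Dict, List, Optional


def pick_backend(backends: List[str],
                 healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first backend currently marked healthy, or None.

    Backends missing from the health map are treated as unhealthy.
    """
    for name in backends:
        if healthy.get(name, False):
            return name
    return None


# Hypothetical backends, listed in order of preference.
backends = ["db-primary", "db-standby"]

# Simulate the primary database going down while the standby is up.
health = {"db-primary": False, "db-standby": True}
target = pick_backend(backends, health)
```

Here the application asks the routing layer where to connect instead of hard-coding the primary, so a failed primary results in the standby being chosen.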
Augustine said Datadog is surfacing relevant details and dependencies that Airbnb didn’t know about before. Its automated process has not only helped improve the productivity of Airbnb’s engineering teams, but has also given the company a better understanding of team and service performance.
Datadog announced in late January that it had added $31 million to its coffers and had tripled revenue and headcount in the previous year. Olivier Pomel, Datadog co-founder and CEO, said at the time that the company expected to double or triple in size again in 2015.
Two weeks after the funding announcement, Datadog announced the acquisition of New York-based big data startup Mortar Data to further integrate machine learning into its offerings.
In March, it announced a visualization tool called Host Maps, which provides a quick, easy view of users’ hosts and any problems on them.
Datadog is a sponsor of The New Stack.
Feature image via Flickr Creative Commons.