Apache Druid: A Real-Time Database for Modern Analytics

May 18th, 2022 10:00am by David Wang

David Wang
David Wang is the VP of product marketing at Imply, where he is responsible for the company’s positioning, product messaging and technical content. He is an engineer turned marketer with a career building new categories and increasing awareness across next-gen data management, virtualization and enterprise storage technologies. Prior to joining Imply, David served in leadership roles at Hewlett Packard Enterprise (HPE), Nimble Storage and GE Digital.

Analytics has become the secret sauce for every company. But while it has been useful for making decisions, it’s not just for internal stakeholders anymore. Companies like Twitter, Atlassian and Citrix are leading their industries because they are delivering insights to their customers.

For the technical leaders now charged with building an external analytics application, choosing the right database backend requires new considerations.

The easy answer is to default to a database like PostgreSQL or MySQL or even adapt a data warehouse outside of its standard BI dashboard and reporting functionality. While these options may seem to get the job done quickly at first, it’s important to remember that creating a customer-facing application can have grander implications than an internal application — a potential impact on revenue to name just one. Therefore, it’s critical that you start the job with a database that delivers the best user experience possible.

Loading… Please Wait

There’s nothing more frustrating than sitting and waiting for an application to return with results. It’s fine if your own employee has to wait a few seconds or even minutes for a query to process, but that wait time is unacceptable when it comes to external users like customers.

There are a few reasons why users run into these long wait times: the amount of data being analyzed, the database’s processing power, the number of concurrent users and API calls, and more. Ultimately, it comes down to whether your database can keep up with the demands of your application.

While it is feasible to use a generic OLAP database to create an interactive data experience when working with large amounts of data, you risk putting yourself at a costly disadvantage. If you try computing all the queries ahead of time, you wind up with an expensive and rigid architecture. Collecting all the data first can similarly limit the insights you are able to deliver. And if you only analyze data from recent events, you keep your users from seeing the whole picture.

Therefore, to create an external-facing data analytics application, you need to look for a database that has an optimized architecture and data format, is built for interactivity and will scale based on your needs. One open source database that checks off all three of these needs is Apache Druid.

With its distributed and elastic architecture, Apache Druid prefetches data from a shared data layer into a cluster of data servers that can grow almost without limit. Because there’s no need to move data at query time and the cluster can scale flexibly, this kind of architecture performs faster than a decoupled query engine such as a cloud data warehouse.

Additionally, Apache Druid can process more queries per core by leveraging automatic, multilevel indexing that is built into its data format. This includes a global index, a data dictionary and a bitmap index, which together go beyond a standard OLAP columnar format and speed up data crunching by making better use of CPU cycles.
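
To make the data dictionary and bitmap index idea concrete, here is a minimal Python sketch. It is a toy model and not Druid’s actual segment format, which is considerably more elaborate; the function names and the sample column are made up for illustration. Each distinct value gets a small integer code, and each code gets a bitmap marking the rows where it appears, so a filter can be answered by combining bitmaps instead of scanning every row.

```python
def build_index(values):
    """Dictionary-encode a string column and build one bitmap per distinct value.

    Toy model only: a Python int stands in for a (normally compressed) bitmap.
    """
    # Data dictionary: each distinct value maps to a small integer code.
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    # The column itself is stored as codes rather than repeated strings.
    encoded = [dictionary[value] for value in values]
    # Bitmap index: bit N of bitmaps[code] is set when row N holds that value.
    bitmaps = {}
    for row, code in enumerate(encoded):
        bitmaps[code] = bitmaps.get(code, 0) | (1 << row)
    return dictionary, encoded, bitmaps


def rows_matching(dictionary, bitmaps, *wanted):
    """Return row numbers whose value is any of `wanted`, using only the bitmaps."""
    mask = 0
    for value in wanted:
        code = dictionary.get(value)
        if code is not None:          # unknown values match no rows
            mask |= bitmaps[code]     # OR the bitmaps together
    return [row for row in range(mask.bit_length()) if (mask >> row) & 1]


# Hypothetical "country" column with six rows.
country = ["US", "DE", "US", "FR", "DE", "US"]
dictionary, encoded, bitmaps = build_index(country)

print(dictionary)                                       # {'DE': 0, 'FR': 1, 'US': 2}
print(encoded)                                          # [2, 0, 2, 1, 0, 2]
print(rows_matching(dictionary, bitmaps, "US"))         # [0, 2, 5]
print(rows_matching(dictionary, bitmaps, "DE", "FR"))   # [1, 3, 4]
```

The point of the sketch is that the filter never touches the raw strings: it looks up a code, combines bitmaps with a bitwise OR and reads off the matching rows, which is the kind of work that keeps CPU cycles spent on useful data.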

High Availability Is a Must

When it comes to internal operations, experiencing an outage isn’t a huge deal, especially if it only lasts a few minutes. It may be a little inconvenient, but it’s not unheard of for OLAP databases and data warehouses to see some unplanned downtime and maintenance windows where services are unavailable.

However, it’s a completely different story when it comes to customer-facing, external analytics applications. If a customer experiences an unplanned outage, they could abandon the application temporarily or indefinitely, causing an impact on revenue. That’s why it’s incredibly important to prioritize resiliency for high availability and data durability when building these kinds of applications.

In order to achieve resiliency for customer-facing, external analytics applications, there are a few questions you should ask yourself: Can I safeguard from a node or cluster-wide failure? What would the impact be if I lost data? What is needed to protect my data and application?

It’s a fact of life that servers will eventually fail. Backing up your data and replicating your nodes is the standard method to ensure resiliency, but unless you maintain a frequent backup cadence, you’ll need to do more to mitigate data loss.

Instead, you need to make sure high availability and data durability are built into your database, specifically one that includes automatic, multilevel replication with shared data in S3 or other object storage. Apache Druid provides continuous backup capabilities, which automatically protect your data and can restore the latest version of your database even if your entire cluster is impacted.
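
The combination of replication across nodes and a durable copy in object storage can be pictured with a small toy model in Python. Everything here is hypothetical (the `ToyCluster` class, its method names and the segment ID); it is a sketch of the general pattern under simplified assumptions, not a description of how Druid implements it. A plain dictionary stands in for S3 or another object store.

```python
class ToyCluster:
    """Toy model: data nodes hold replicas, object storage holds the durable copy."""

    def __init__(self, node_names, replicas=2):
        self.deep_storage = {}                        # stands in for S3/object storage
        self.nodes = {name: {} for name in node_names}
        self.replicas = replicas

    def publish(self, segment_id, rows):
        # Write the durable copy first, then load replicas onto data nodes.
        self.deep_storage[segment_id] = list(rows)
        self.rebalance()

    def fail(self, name):
        # A failed node loses its local copies; the durable copy is untouched.
        self.nodes.pop(name)

    def add_node(self, name):
        self.nodes[name] = {}

    def rebalance(self):
        # Bring every segment back up to the target replica count (or as close
        # as the surviving nodes allow) by reloading from durable storage.
        for segment_id, rows in self.deep_storage.items():
            holders = [n for n in self.nodes if segment_id in self.nodes[n]]
            missing = max(self.replicas - len(holders), 0)
            candidates = sorted(
                (n for n in self.nodes if n not in holders),
                key=lambda n: len(self.nodes[n]),     # least-loaded nodes first
            )
            for name in candidates[:missing]:
                self.nodes[name][segment_id] = rows

    def query(self, segment_id):
        for local_segments in self.nodes.values():
            if segment_id in local_segments:
                return local_segments[segment_id]
        raise LookupError(f"no live node is serving {segment_id}")


cluster = ToyCluster(["a", "b", "c"], replicas=2)
cluster.publish("2022-05-18", [1, 2, 3])

cluster.fail("a")                      # single-node failure: a replica still answers
print(cluster.query("2022-05-18"))     # [1, 2, 3]

cluster.fail("b")
cluster.fail("c")                      # cluster-wide failure: no node is left
cluster.add_node("d")                  # bring up a fresh node...
cluster.rebalance()                    # ...and reload from the durable copy
print(cluster.query("2022-05-18"))     # [1, 2, 3]
```

In this model a node failure is absorbed by the remaining replica, and even losing every node does not lose data, because the fresh node can repopulate itself from the object store.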

Add Users without Adding to Your Costs

Architecting your backend for high concurrency is important because your application must support scores of users and still provide an engaging experience. Getting this right reduces the risk of frustrating customers with an application that isn’t working properly.

It’s important to note that this isn’t the same as architecting for internal reporting since that typically has fewer regular users. Ultimately, that means you need one database for internal reporting and a different one for your highly concurrent applications.

There are three factors to consider when architecting a database for high concurrency: CPU usage, scalability and cost. Some may say that adding more hardware can fix the issue, but that’s not always the best answer. While increasing the number of CPUs allows you to run more queries, it also comes with a higher price tag.

Apache Druid provides a smarter and more economical choice because of its optimized storage and query engine that decreases CPU usage. “Optimized” is the keyword here; you want your infrastructure to serve more queries in the same amount of time rather than having your database read data it doesn’t need to.
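
A back-of-the-envelope calculation shows why CPU time per query matters more than raw core count. The sketch below is in Python and uses invented numbers purely for illustration; they are not Druid benchmarks. It assumes a steady query rate and that each query keeps one core busy for a fixed number of milliseconds.

```python
def cores_needed(queries_per_second, cpu_ms_per_query):
    """Cores required to keep up with a steady query load.

    Assumes each query keeps one core busy for `cpu_ms_per_query` milliseconds,
    and that one core supplies 1,000 ms of CPU time per second.
    """
    busy_ms_per_second = queries_per_second * cpu_ms_per_query
    return (busy_ms_per_second + 999) // 1000   # integer ceiling division


# Hypothetical workload: 200 queries per second from customer-facing dashboards.
unoptimized = cores_needed(200, cpu_ms_per_query=500)  # scans data it doesn't need
optimized = cores_needed(200, cpu_ms_per_query=50)     # index-assisted, less CPU per query

print(unoptimized)  # 100
print(optimized)    # 10
```

Under these assumptions, cutting CPU time per query by a factor of 10 cuts the cores required by the same factor, which is the economic argument for an optimized storage format and query engine over simply adding hardware.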

Building for Today and the Future

Providing an external analytics application can be part of a fantastic customer retention strategy and revenue source. That’s why it’s essential to take the time to find the database that best supports your needs and build the right data architecture that will keep your customers happy.
