MotherDuck’s Hybrid Query Execution Enhances Real-Time Data Analytics
There’s a good reason MotherDuck, a serverless analytics platform with a novel approach to real-time data analytics, has closed three rounds of funding in approximately 15 months. Since its founding in 2022, the company has raised $100 million, largely on the strength of one relatively simple value proposition.
With the assistance of DuckDB, an open source OLAP database, MotherDuck queries data in place, at the edge on users’ local machines and, if required, will even move data there to do so.
According to MotherDuck CEO Jordan Tigani, “DuckDB is a database that can give you very low latency answers in a lot of real-time situations. It doesn’t have specialized streaming mechanisms, which sometimes people talk about when they mean real time. But on the other hand, you can do lots of updates per second and get your answers in milliseconds, which is pretty real time.”
In addition to the native speed of DuckDB (which can query data stored in MotherDuck, Amazon Web Services, Google Cloud Platform, and Azure), MotherDuck’s hybrid query execution approach is responsible for its real-time analytics. This paradigm combines the end user’s machine, including laptops, with cloud resources to localize where queries run.
With an assortment of techniques for optimizing queries, MotherDuck excels in plenty of real-time analytics use cases, including e-commerce, retail, stock trading, and more.
MotherDuck was largely founded on the premise that even when users have massive quantities of big data, they tend to interact with relatively modest quantities of data at a time.
Thus, while incumbent cloud data warehouses focus on centralizing data in a single location, MotherDuck is predicated on the notion that “if you’re only actively using a small amount of data, we can move the data closer to where the user is,” Tigani maintained. Hybrid query execution rests on the fact that MotherDuck users have both a client-side instance, often accessed through their browsers, and a server-side instance that runs in the cloud.
This architecture enables MotherDuck to put data close to users to perform analytics related to real-time stock portfolio updates, for example. “If you’ve got the data distributed that way you can get incredible low latency,” Tigani mentioned.
“Laptops these days are extremely powerful and you can get answers in a handful of milliseconds, whereas if you had to ask a cloud service, the initial request wouldn’t have even gotten there in the time that you’ve gotten maybe several answers in a local environment.”
Query Planning and Execution
MotherDuck’s hybrid query execution architecture facilitates a seamless experience in which, when users have an instance of DuckDB in their browser, it appears they’re simply interacting with a table in the database. Consequently, the query planning process is primarily local.
However, if a query references additional data that isn’t available locally (for other stock portfolio questions, for example), the system will “run the portion of the query in the cloud that references that cloud data,” Tigani explained. “Then once that gets computed, those results get returned to my computer and put in a local table so next time I run a similar query, that data won’t have to talk to the cloud again.”
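Conceptually, the routing Tigani describes works like a local-first lookup: portions of a query that reference cloud-resident data run in the cloud once, and the results are materialized locally so repeat queries stay on the machine. The sketch below illustrates that idea in plain Python; every name in it is a hypothetical illustration, not MotherDuck’s actual API.

```python
# Conceptual sketch of hybrid query routing: tables already on the client
# are served in place; cloud-resident tables are computed remotely once,
# then cached in a local table so the next similar query never leaves the
# machine. All names are hypothetical, not MotherDuck's API.

local_tables = {"portfolio": [{"ticker": "ABC", "shares": 10}]}
cloud_tables = {"prices": [{"ticker": "ABC", "price": 42.0}]}
calls = {"cloud": 0}  # counts round trips to the simulated cloud

def fetch_from_cloud(table):
    """Simulate running the cloud-side portion of a query."""
    calls["cloud"] += 1
    return cloud_tables[table]

def query(table):
    # Local-first: answer from the client-side instance when possible.
    if table not in local_tables:
        # Run the cloud portion, then materialize the results locally so
        # a similar query "won't have to talk to the cloud again."
        local_tables[table] = fetch_from_cloud(table)
    return local_tables[table]

rows = query("prices")        # first call triggers a cloud round trip
rows_again = query("prices")  # second call is answered locally
```

The second call never touches the simulated cloud, which is the latency win the hybrid model is built around.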
This functionality is significant for several reasons. First, it means that when users do need to work with massive datasets, they can do so with MotherDuck’s cloud resources. It also illustrates how MotherDuck transports data to the edge.
By relying on what Tigani described as a “columnar binary format to pull the data down that we need from the cloud,” the platform eschews pipelines for replicating data at this stage of the user experience. Instead, “we can download the data to you into your browser so you can slice and dice it incredibly interactively,” Tigani said.
The server-side cloud instance of MotherDuck supports real-time use cases like recommendation engines for retail and aspects of fraud detection. In these and other use cases where the scale of the data analyzed exceeds the capacity of edge devices, MotherDuck simply “runs that in the cloud and will have it scale up to a hefty instance in the cloud to answer your OLAP questions,” Tigani remarked.
By downloading query results and the relevant portions of accessed tables to local infrastructure, MotherDuck avoids repeated round trips to the cloud in this scenario. There are also mechanisms for caching and data materialization that deliver real-time results for large datasets. “We have local caching but we can [also] do local materialized views,” Tigani commented. “So, basically, a local view of data that lives remotely. Very often, people do a bunch of similar queries that they can do over and over again. They might tweak something, or do a sub-query based on something else, or a common table expression.”
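A local materialized view of this kind can be pictured as an aggregate computed once from remote data, then reused by the “similar queries” Tigani mentions. The sketch below shows the pattern under stated assumptions; the data, names, and refresh function are all invented for illustration, not MotherDuck’s implementation.

```python
# Sketch of a "local materialized view": an aggregate over remote data is
# computed with one remote scan, stored locally, and then reused by
# repeated similar queries. Hypothetical illustration only.

remote_sales = [("2024-01", 100), ("2024-01", 250), ("2024-02", 300)]
fetch_count = 0  # counts scans of the remote data

def refresh_view():
    """Pull the remote rows once and materialize monthly totals locally."""
    global fetch_count
    fetch_count += 1
    view = {}
    for month, amount in remote_sales:
        view[month] = view.get(month, 0) + amount
    return view

monthly_totals = refresh_view()  # single remote scan

# Repeated, slightly tweaked queries hit the local view, not remote data.
jan = monthly_totals["2024-01"]
feb = monthly_totals["2024-02"]
```

Both lookups are answered from the materialized view, so only one remote scan ever happens.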
Making this data locally available through caching and materialization reduces latency for these operations. DuckDB also provides vectorized query execution — meaning it processes batches of rows and values in parallel — to facilitate real-time responses to queries.
Thus, when computing the average sales over a given period, “you’d need to be able to sum up a bunch of numbers at once,” Tigani said. “Average means you sum up the numbers and divide by the number of rows. Typically, you’d add the numbers incrementally, but special processor instructions on modern CPUs, like Intel or Arm chips, can do these sums in parallel. If you do 16 in parallel, you can do this operation in [1/16th] of the time.”
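The 16-lanes-at-a-time idea can be made concrete with a small sketch. Plain Python won’t actually emit SIMD instructions — the real speedup comes from the CPU’s vector units — but simulating the lanes shows how an average decomposes into parallel partial sums plus a scalar tail:

```python
# Conceptual sketch of lane-parallel summation: a SIMD unit can sum 16
# values per instruction; here the "lanes" are simulated with list slices.
# Real engines like DuckDB get the speedup from CPU vector instructions;
# this Python loop only illustrates the arithmetic, not the performance.

values = list(range(1, 101))  # 100 sales figures: 1, 2, ..., 100
LANES = 16

def simd_style_sum(xs, lanes=LANES):
    acc = [0] * lanes  # one partial sum per lane, like a vector register
    full = len(xs) - len(xs) % lanes
    for i in range(0, full, lanes):
        chunk = xs[i:i + lanes]
        acc = [a + c for a, c in zip(acc, chunk)]  # 16 adds "at once"
    # Sum the lane accumulators, then the tail that doesn't fill a vector.
    return sum(acc) + sum(xs[full:])

average = simd_style_sum(values) / len(values)
```

Each loop iteration stands in for one vector instruction, which is why processing 16 lanes per step cuts the instruction count to roughly a sixteenth.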
MotherDuck’s hybrid query execution architecture localizes computations, when feasible, to supply real-time analytics in edge environments close to users. Additionally, its cloud resources can scale as necessary to store and analyze data in a centralized location. Organizations can use both models at the same time, optimizing queries by running parts of them locally and parts in the cloud for maximum efficiency. The solution’s mechanisms for making relevant cloud data locally available, along with caching, materialized views, and vectorized query execution, enable users to get real-time results at enterprise scale.