Materialize: Managed Real-Time Data
The ability to harness real-time data is no longer a nice-to-have but a necessity for high-performing companies, a view that Materialize takes to heart with the announcement of early availability of its distributed streaming database.
“People are trying to use fresher, more real-time data. But the reality is, it’s been too hard with existing stream processors and real-time frameworks. We think that in the next five years, real-time data will become the default, which is why we’ve been working so hard on Materialize,” explained Jessica Laughlin, chief of staff at Materialize.
In its latest iteration, Materialize offers a simple SQL interface available as a fully managed cloud service that provides separation of storage and compute, strict serializability, active replication, horizontal scalability and workload isolation.
It presents as a Postgres database supporting full ANSI SQL, while delivering capabilities previously only found in batch-based systems.
“We’re making streaming as easy as batch, bringing batch best practices to streaming,” Laughlin said.
And because it is Postgres-compatible, as director of field engineering Seth Wiesman put it, “If you have used a database before, you have written SQL, you know how to use Materialize.” It integrates with the tools in the Postgres ecosystem.
Originally written as a single Rust binary, the initial release of Materialize ingested data from Kafka and let users query, transform, and join their streaming data in standard SQL. While it provided incremental view maintenance over fast-changing data, its biggest downside was that it relied on upstream systems to be the persistent source of truth for the data being processed.
Users who wanted to build business-critical, production-ready applications on top of Materialize kept asking the company to incorporate persistence into the offering, the company’s cofounders explained in a blog post.
In the past year, the New York-based company has focused on rearchitecting its stream processor and database into a persistent, scalable, cloud native distributed offering.
Updates as Data Changes
Frank McSherry, while at Microsoft Research in 2013, worked on a project called Naiad, which laid out the concept of Timely Dataflow, supporting real-time queries on continually updated data.
He connected with his cofounder, Arjun Narayan, in Ph.D. database circles, and they kicked off the company in 2019. Narayan is CEO, McSherry chief scientist.
Materialize’s core database engine relies on Timely Dataflow as its stream processing framework; the database is essentially a layer built around it. It ingests data from multiple sources, including relational databases, event streams and data lakes, before transforming or joining that data using the same complex SQL queries used with batch data warehouses.
Rather than static materialized views that must be refreshed manually, Materialize incrementally keeps those views up to date as the data changes, while maintaining millisecond-level latency on complex transformations, joins and aggregations.
The strength of Timely Dataflow is fine-grained progress-tracking of operations, Wiesman said, which enables Materialize to provide strict serializability.
On top of that, it uses Differential Dataflow, which not only enables efficient computation on large amounts of data but also maintains those computations as the data changes.
“Let’s say you have one record come in that affects one output row of your view. Because we’re using Differential Dataflow, we only have to recompute that one row. We do work proportional to the diff. So you could have a terabyte of data, a billion rows, but we’re only going to update what is actively changing. It makes the system efficient and cheap to run,” Wiesman said.
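Wiesman’s point about doing work proportional to the diff can be sketched in a few lines of Python. This is an illustration of the idea only, not Materialize’s API or Differential Dataflow itself (which is a Rust library); all names here are invented for the example:

```python
# Sketch of diff-proportional view maintenance, in the spirit of
# Differential Dataflow. The "view" is a running SUM of bid amounts
# grouped by auction.

from collections import defaultdict

class IncrementalSumView:
    """Maintains SELECT auction, SUM(amount) ... GROUP BY auction incrementally."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, diffs):
        """Apply a batch of (auction, amount, +1/-1) diffs.

        Work done is proportional to len(diffs), not to the number of
        rows already in the view: only affected output rows are touched.
        """
        changed = set()
        for auction, amount, mult in diffs:
            self.totals[auction] += amount * mult
            changed.add(auction)
        return {a: self.totals[a] for a in changed}

view = IncrementalSumView()
view.apply([("art-1", 100, +1), ("art-1", 250, +1), ("art-2", 75, +1)])
# Retracting a record is just a diff with multiplicity -1; only the
# one affected row ("art-1") is recomputed.
updated = view.apply([("art-1", 100, -1)])
# updated == {"art-1": 250}
```

The terabyte-of-data case works the same way: however large `totals` grows, a one-record change touches one entry.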
The software-as-a-service platform also provides:
- Separation of storage and compute, “so we have independent services that can scale and ebb and flow as your data changes and as your application requirements change,” Laughlin explained.
- Elastic storage in AWS S3 buckets, which keeps costs down.
- Multiway complex joins — stream-to-stream, stream-to-table, table-to-table and more, all in standard SQL.
- Horizontal scalability harnessing Timely Dataflow to help users handle large, fast-scaling workloads.
- Active replication, which lets users spin up multiple clusters running the same workload for high availability.
- Workload isolation, which lets users spin up multiple clusters with different workloads that share elastic storage without interfering with one another.
The company maintains that Materialize enables companies to be not just data-driven but event-driven.
Any database can give you the answer to a query — “What is the current value of X?” for example, Wiesman explained. An event-driven architecture provides information on which to take action.
“Imagine you are running an auction house, and you are tracking all the bids within your system, and you want to build a tool that will alert the winner of each auction the moment that auction closes,” he said.
“You can write SQL that asks the database who are the winners, but that is a passive action. … You might do that every hour, every 10 minutes. But we want an event-driven architecture where the database tells you, ‘Hey, this auction just closed, and Susan won. Go send her an email, push notification or phone. Actively do some real-world thing.’”
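Wiesman’s push-versus-poll distinction can be sketched as a toy simulation. Everything here, class and function names included, is hypothetical; Materialize itself delivers such changes over SQL rather than through in-process callbacks:

```python
# Event-driven sketch of the auction example: instead of polling
# "who won?", a callback fires the moment an auction closes.

notifications = []

def notify_winner(auction, winner):
    # In production this might send an email or a push notification.
    notifications.append(f"{winner} won {auction}")

class AuctionHouse:
    def __init__(self):
        self.bids = {}  # auction -> (bidder, amount) of highest bid so far

    def bid(self, auction, bidder, amount):
        best = self.bids.get(auction)
        if best is None or amount > best[1]:
            self.bids[auction] = (bidder, amount)

    def close(self, auction):
        # The close event itself triggers the action; no hourly query needed.
        winner, _ = self.bids.pop(auction)
        notify_winner(auction, winner)

house = AuctionHouse()
house.bid("vase", "Susan", 300)
house.bid("vase", "Tom", 250)
house.close("vase")   # notify_winner fires immediately
```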
‘Fresh, Up-to-Date Responses’
McSherry and Narayan named Materialize after the database concept of “materialized views,” which refers to precomputing the results for a query so that they’re instantly available when needed, rather than doing the work on demand.
The company has raised more than $100 million, most recently a $60 million Series C round announced a year ago.
Nnamdi Iregbulem of Lightspeed Venture Partners puts Materialize in a category with ClickHouse and Tinybird: tools that provide “real-time analytics that lets analysts get fresh, up-to-date responses to their business queries at low latency.” He categorizes Kafka among those that stream at high speed and volume, and Apache Flink and Apache Samza among those that filter and transform streaming data in flight.
Materialize users include Centerfield Insurance Services (formerly Datalot).
“We were in the process of rebuilding our main Datalot (now Centerfield) app, which is largely responsible for delivering analytic data,” Josh Arenberg, the company’s vice president of data, said in an email about deciding to try Materialize.
“Presenting analytic data to an app presents some challenges beyond just building reports for users. We were exploring the idea of stream processing to drive these analytics for the following reasons:
- Having up-to-the-minute data available without having to reprocess/rerun queries for individual user visits. This means if you have a lot of visitors to the app, your usage of the database doesn’t scale with the number of concurrent users.
- Building event-driven systems for alerting and automating on analytic data, which means generating a signal for analytic metrics rather than point-in-time snapshots. We want to be able to use this data to take actions when things go out of bounds, not just build point-in-time reports.
- Building real-time visualizations in the app that update without having to rerun queries or refresh.”
Centerfield was already building a streaming pipeline to feed its analytic database (Snowflake) in real time, so it had the data available. Its backup position was always to do a reverse ETL and pull data from Snowflake queries at some frequency if the nascent Materialize technology, or some other streaming solution, didn’t work out.
“But Materialize represents a much cleaner approach, so we were very happy with how well it worked, especially after some less successful experiments with other solutions,” he said.
“I think we gained a lot of confidence from the fact that the underlying tech, Timely Dataflow, has been in use for some time and has been battle tested at some large companies. Since most of the new code was in the SQL layer, we usually found that any of the early problems tended to be in the query processing and not the data flows. And honestly, there just have not been that many problems.
“…We quickly realized [the Materialize team was] building high-quality code and were very responsive to any bug reports we might have. I also love that they are building in Rust, so there are entire categories of bugs we don’t need to worry about.”
Though the Materialize technology has yet to reach general availability, he said it has surpassed expectations.
“Being this early meant that there was a certain amount of work involved in operationalizing the technology, much of which has now been solved with the Materialize Cloud offering [announced last fall]. But the core engine has always been solid. I never have seen ‘wrong’ answers from Materialize when we feed it correct source data.
“I also can’t overstate the importance of the quality of the team at Materialize. Without that strong partnership, there is no way we could have accomplished as much as we have in the past couple of years.”
In addition to the data app mentioned above, Centerfield found several other critical uses for Materialize, he said:
- “Much of the past year has been about Datalot’s acquisition by Centerfield, and Materialize has been critical in building real-time feeds of the insurance data into the Centerfield backend. It has made this process easy and reliable.
- Our analytics department has been able to use it to build some truly incredible real-time dashboards for all aspects of the business that are used to steer business decisions throughout the day.
- We have built several machine learning algorithms using Materialize as a real-time feature store. It allows us to build a stream of real-time user interactions to run our models off of.”