Presto: Data Analytics on an Open All-SQL Platform
Organizations increasingly depend on data analytics to guide their operations. But the specialized skills required to work with ever-more-complex data types and sources create roadblocks to turning that data into useful insights.
With different types of data requiring different query engines and programming languages, data engineers must learn and juggle multiple tools, and CIOs must choose between the efficiency of a closed ecosystem and the potential chaos of an open one.
The Presto Foundation open source community offers an alternative: a platform that uses SQL as the main language and interface throughout a disparate data analytics ecosystem.
The open source Presto distributed query platform is a great starting point for most companies, as its comprehensive suite of database connectors lets it run data analytics efficiently at small to medium scale. As companies’ data analytics needs grow and they specialize into streaming, real-time, batch, and other areas, Presto can continue to be the main SQL development engine for data analysts while transparently interfacing with other specialized engines via community-supported integrations.
What’s more, by extending SQL throughout the data analytics ecosystem, Presto helps to democratize the analytics process, enabling more people who rely on data for their work to query the data for answers.
My frame of reference here is as both the chair of the Presto Foundation and director of engineering for Uber, an internet-scale company with daunting data magnitude and complexity challenges. Based on my experience, I think there’s an excellent chance Presto can help your company, too.
If a business is successful, its data grows in volume and complexity over time. The data also grows in its importance as a way for the business to remain competitive in fast-moving markets.
But old-school data warehouses are too static to do effective data analytics, especially at scale. Data lakes and, more recently, data lakehouses address the scale issue for storage, but they still pose challenges for extracting insights from the data.
Currently, there are distinct data analytics engines, query languages, and interfaces for the various siloed systems. Not only is managing them all increasingly difficult, but they also pose problems with data freshness and consistency.
For example, say you want to publish a report and share it immediately, or do regulatory or compliance reporting in real time. With a data warehouse and a query engine that requires programming, it might take 24 to 48 hours to complete what seems like a simple task. By then, the question you were originally trying to answer might be irrelevant.
One example we see at Uber is restaurants that use Uber Eats wanting to know quickly how many orders, and of what kinds, they’ve received so far, or how well their digital ads are converting. If data warehouses and query engines can’t deliver that data in time for it to influence business decisions, there’s no way to adjust what’s not working or to double down on what’s proving effective.
The longer a query takes, the less fresh the data. And each time you have to join datasets or move them from one system to another, there’s a risk of introducing inconsistencies and inaccuracies into the results.
But what if extracting useful insights from data didn’t require sophisticated programming skills and expertise with disparate data systems? SQL is a simpler language that is familiar to most people who work with data. And it turns out that using SQL throughout the data analytics ecosystem, as both a query language and an interface, greatly reduces the complexity of data analytics. Using SQL in this way is precisely what the Presto platform enables.
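To make the idea of one SQL statement spanning separate systems concrete, here is a minimal sketch. It is not Presto: it uses Python's built-in sqlite3 module, with two attached in-memory databases standing in for two independent data systems. The database names (orders_db, ads_db), the tables, and the rows are all hypothetical. In Presto, tables are addressed in a similar qualified way, as catalog.schema.table, with each catalog backed by a connector.

```python
import sqlite3

# One connection plays the role of the single SQL entry point.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two independent systems.
# The names orders_db and ads_db are hypothetical, chosen for illustration.
conn.execute("ATTACH DATABASE ':memory:' AS orders_db")
conn.execute("ATTACH DATABASE ':memory:' AS ads_db")

conn.execute("CREATE TABLE orders_db.orders (restaurant TEXT, item TEXT)")
conn.execute("CREATE TABLE ads_db.clicks (restaurant TEXT, clicks INTEGER)")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO orders_db.orders VALUES (?, ?)",
    [("Taqueria", "taco"), ("Taqueria", "burrito"), ("Noodle Bar", "ramen")],
)
conn.executemany(
    "INSERT INTO ads_db.clicks VALUES (?, ?)",
    [("Taqueria", 40), ("Noodle Bar", 10)],
)

# A single SQL statement joins data that lives in both databases.
QUERY = """
SELECT o.restaurant, COUNT(*) AS order_count, c.clicks
FROM orders_db.orders AS o
JOIN ads_db.clicks AS c ON c.restaurant = o.restaurant
GROUP BY o.restaurant, c.clicks
ORDER BY o.restaurant
"""
rows = conn.execute(QUERY).fetchall()

for restaurant, order_count, clicks in rows:
    print(restaurant, order_count, clicks)
```

The point of the sketch is the shape of the query, not the engine: the person writing it needs only SQL, and the question of where each table physically lives is handled by the qualified table names.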
By using the same SQL-based platform to query streaming, interactive, real-time, and batch analytics systems, Presto expands the data engineering pipeline and allows data engineers to do more work with less effort. It also helps to democratize data analytics. In an internet-scale organization, for instance, perhaps a handful of data engineers are able to code queries for various data types, but tens of thousands of people might be SQL users.
Because SQL is so widely known, Presto makes analytics available more broadly throughout your organization. More engineers can run data queries themselves, which makes analytics insights faster and more pervasive and helps data analysts become more productive.
Making Openness Scale
Too often, people think of open source as appropriate for smaller, more contained use cases but not for large-scale efforts. CIOs, in particular, often believe that closed, proprietary systems are more predictable and therefore safer than open source.
Presto started as a project within Facebook, already an internet-scale company, before it became open source and came to operate under the auspices of the Linux Foundation. Relied upon by internet-scale companies, Presto was designed and built from the outset to be highly scalable and reliable. And as an open source platform supported by an open community, it has become even more robust over time as community members make improvements that are fed back into the platform.
When you download the latest version of Presto, you get the same version that’s used in production, in testing, and everywhere else at companies like Uber and Facebook. Even if your organization isn’t now, and might never be, an internet-scale enterprise, it matters that you’re using the same software as the industry giants, because you can be confident that it can handle anything you’re likely to throw at it, rapidly and reliably.
One aspect that doesn’t scale with Presto, and this is a good thing, is your cloud charges as you increase the number of people pulling and analyzing data. You can download Presto for free, customize it, and let as many people use it as often as they like, and your cloud infrastructure bills won’t grow until your data volume itself grows.
Open source offers far greater flexibility in planning and growing a data analytics ecosystem. Open source approaches such as Presto enable a consistent data analytics ecosystem that helps you avoid the vendor lock-in of proprietary systems.
One cautionary note: Not all open source is created equal. For reasons of governance, flexibility, scale, continued technical enhancements and “skin in the game,” it’s important to adopt open source technologies supported by truly open communities, rather than by a single corporate entity. Presto checks that all-important box.
Each month about 10,000 people at Uber use Presto. Because it’s open source, we can modify it to suit our particular needs. For instance, geospatial data is important to a company like Uber, but it’s not so important for a company like Facebook.
What we experience at Uber today is what many other companies will face in the years to come, as data continues to gain in both complexity and importance for business success. Presto is fast and easy to use, and it works from small scale on up, so it can be adopted early in a company’s development and continue to work as the business grows. Even at the scale of a company like Uber, Presto SQL gives us what we need 90% to 95% of the time.
And no matter what your size today, if you’re successful, your data needs will continue to expand. The sooner you move to an open SQL-based approach, the easier it will be to support and power your growth. And in the long run, you’ll get more out of your data if you choose a data platform that is able to grow with your business.