
Google Picking up the Pieces of Its Data Portfolio

At its Cloud Data Summit this week, Google announced the preview of BigLake, which the company characterizes as a “storage engine.”
Apr 8th, 2022 10:27am

A key draw of the cloud is that it can help enterprises simplify their IT operations, especially with SaaS (software-as-a-service) offerings that eliminate the legwork and headaches of provisioning and housekeeping. But could the proliferation of SaaS services in the data, analytics, and AI space be providing too much of a good thing?

There have been many calls for cloud providers to start untangling the mess and make their clouds easier to use: take the burden of integrating all your services off the shoulders of the customer. Google Cloud, at its data summit this week, announced some new connective tissue that should address some of this.

The Headline: BigLake

The headline was Google’s announcement of the preview of BigLake, which Google characterizes as a “storage engine.” What it really is, is an API; it generalizes the API that Google originally developed for BigQuery to project a table structure over data stored in Google Cloud (object) storage. That in turn allows analytic and query tools to access the data and provide a granular view of data on which governance and access controls can be enforced.

On one hand, BigLake is Google’s answer to data lakehouses, a rapidly emerging approach to making the distinctions between data lakes and data warehouses disappear.

BigLake supports all the popular formats of data lakes including JSON, CSV, and Parquet; these formats have emerged as de facto standards for variably structured data, an alternative to rigidly schematized SQL tables. The API makes data in these formats look like the relational and clustered tables that BigQuery reads. And, consistent with Google’s previous extension of BigQuery to foreign territory with BigQuery Omni, that means that data stored in any Amazon S3-compatible object store will register as a readable data source.
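A rough local analogue may make the idea concrete: projecting one uniform table view over files of different formats, so downstream query logic never cares what's on disk. This is a toy sketch of the concept, not Google's API; the function name and record shape are illustrative assumptions, and it handles only CSV and newline-delimited JSON.

```python
import csv
import io
import json

def read_table(fmt, raw):
    """Project a uniform list-of-dicts 'table' view over raw file
    contents, regardless of the on-disk format. A toy stand-in for
    what a storage engine like BigLake does at cloud scale."""
    if fmt == "csv":
        # First line is treated as the header row.
        return list(csv.DictReader(io.StringIO(raw)))
    if fmt == "json":
        # Newline-delimited JSON: one record per line.
        return [json.loads(line) for line in raw.splitlines() if line]
    raise ValueError(f"unsupported format: {fmt}")

# The same downstream logic now works over either source:
csv_rows = read_table("csv", "id,total\n1,9.50\n2,12.00\n")
json_rows = read_table("json", '{"id": "3", "total": "4.25"}\n')
```

The point of the sketch is the single access path: governance and access controls can then be enforced at the table view rather than per format.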

Google is not the first to get there. Amazon Redshift, Microsoft Azure Synapse Analytics and others work off polyglot data stored in object storage, while Databricks, which originated the lakehouse concept, has been steadily upping its game to get more data warehouse-like performance for its Spark data lakes. But the key to BigLake is not just that it delivers a data warehouse-like experience from cloud storage, as BigQuery was already doing that, but in extending the construct so that customers can get consistent access with other Google and open source engines such as Spark, Presto, and Trino, and hopefully in the long run, third-party BI, AI, and analytic services.

In effect, BigLake makes all data appear like it does in BigQuery but doesn’t require you to be a BigQuery user. It will support open source data lake constructs including Delta Lake and Iceberg, with Hudi on the roadmap. Google services such as Vertex AI, Dataproc, and serverless Spark will all be able to access the data.

BigLake builds on last fall’s announcement of Dataplex, which is Google’s entry for data fabric. While BigLake provides the API that exposes data in tables, Dataplex is used for logically classifying data into hierarchical zones from the data lake, which defines the pool of data; data zone, which logically groups related data assets together (which could be construed as a form of workspace); and at the most granular level, data assets (the specific entities of data). By organizing data logically, it can be more readily tagged and governed. And as part of defining the data, it harvests metadata that aids in classifying and making data discoverable to a data catalog.
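The lake/zone/asset hierarchy described above can be sketched in a few dataclasses. The class names, fields, and tag-based discovery method below are illustrative assumptions for the article's description, not Google's actual Dataplex resource model.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """A specific entity of data -- the most granular level."""
    name: str
    tags: dict = field(default_factory=dict)  # governance metadata

@dataclass
class Zone:
    """Logically groups related data assets (a kind of workspace)."""
    name: str
    assets: list = field(default_factory=list)

@dataclass
class Lake:
    """Defines the overall pool of data."""
    name: str
    zones: list = field(default_factory=list)

    def find_assets(self, tag_key, tag_value):
        # Discovery across zones via harvested tag metadata,
        # the kind of lookup a data catalog would serve.
        return [a for z in self.zones for a in z.assets
                if a.tags.get(tag_key) == tag_value]

lake = Lake("analytics", zones=[
    Zone("raw", assets=[Asset("clicks", tags={"pii": "no"})]),
    Zone("curated", assets=[Asset("customers", tags={"pii": "yes"})]),
])
pii_assets = lake.find_assets("pii", "yes")
```

Once data is organized this way, tagging and governance policies can hang off zones and assets rather than individual files.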

Connecting Operational and Analytic Data

Another piece of the connectivity puzzle is converging operational data with analytics. Among the announcements this week, Google is introducing a preview of Spanner Change Streams. It’s the second piece of the puzzle for integrating Spanner with BigQuery, as last year, Google introduced the ability for BigQuery to conduct federated query into Spanner. Where last year’s announcement focused on data at rest, this adds the ability to update BigQuery in real-time, making it a far more robust alternative to customers having to build their own code.

Of course, change data capture streams are nothing new. But in most cases, customers have to write their own integrations, such as with exporting, say, a DynamoDB stream into a relational table endpoint. Or customers have to go with mixed workload databases that conditionally replicate incoming rows into side-by-side in-memory column stores. For cloud providers like Google, which offer specialized databases, built-in change data capture stream capabilities are essential to bridging operational and analytic systems. For Google’s next act, we’d like to see similar integration with streaming by eliminating the need for Cloud Dataflow customers to write their own bridges.
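The hand-written integrations mentioned above usually boil down to an apply loop: replay a stream of change records into an analytic-side table. Here is a minimal sketch of that glue code; the record shape ({"op", "key", "row"}) is an illustrative assumption, not the Spanner Change Streams wire format.

```python
def apply_changes(table, stream):
    """Replay a change-data-capture stream into a dict-backed
    replica table, keyed by primary key. Inserts and updates
    upsert the row; deletes remove it."""
    for change in stream:
        if change["op"] in ("insert", "update"):
            table[change["key"]] = change["row"]
        elif change["op"] == "delete":
            table.pop(change["key"], None)
    return table

replica = apply_changes({}, [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "delete", "key": 2, "row": None},
])
```

Built-in change streams let the cloud provider own this loop (plus the hard parts it glosses over: ordering, retries, and schema drift) instead of the customer.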

Picking up the Pieces

With a tip of the hat to the Average White Band, we’re glad to see Google picking up the pieces from its data and analytics portfolio and starting to weave them together. In this case, it’s extending the ability to granularly view, query, and manage permissions on data that would otherwise be buried in cloud object storage, a technology designed for economical storage but not for the finer aspects of access or management. Rather than reinvent the wheel, Google leveraged the capability that it already developed for BigQuery.

Outside the data plane, Google made related announcements coupling Data Studio, Looker, and Connected Sheets. This wasn’t an example of adding new capabilities, as users could already develop data visualizations in Looker; generate visualizations from Connected Sheets in Data Studio; or import Connected Sheets data into Looker. But it added the missing links so that Data Studio users could generate visualizations from LookML data models, and Looker users could explore data from Connected Sheets. And last fall, Google announced access to serverless Spark without requiring customers to sign up for Dataproc.

Of course, the next step is expanding the reach of all of these capabilities, first within the Google Cloud portfolio, but ultimately to a third-party ecosystem as well. Today, BigLake supports just a handful of data file formats, but they are the obvious early targets, and Google’s not done yet. Likewise, Dataplex, which is designed to make data discoverable and facilitate consistent governance, supports a portion of Google’s data and analytics services including BigQuery, Dataproc, and Data Catalog; that leaves other Google data platforms and services such as Firestore and Dataflow as logical next targets.

Beyond that, there’s the need to build critical mass with third parties; obvious targets would be those databases that are part of Google’s open source data partnerships. And beyond the open database folks are third parties where Google already has skin in the game such as Collibra (in which Google Ventures has a stake) or Trifacta (which Google OEMs as Cloud Data Prep). And that’s not to mention the broader landscape of third-party BI tools.

As Andrew Brust pointed out, this is all about what he terms Analytics Stack redux. “On the BI side, bringing together native and acquired technologies is very reminiscent of what Microsoft, IBM, SAP, and Oracle did back in the 2000s.” The cloud changes many things, but one thing it doesn’t change is the need to have a body of solutions that can bolt together.
