In an attempt to make machine learning (ML) easier for both the developer and the data scientist, Google has added built-in machine learning modeling capabilities to its BigQuery serverless data warehousing service.
The service, now in beta, lets users create machine learning models directly from data stored in BigQuery, using standard SQL commands. Initially, BigQuery ML supports two types of models: linear regression, for predicting a numeric value, and binary logistic regression, for predicting one of two classes. The learning rate can be auto-tuned and the derived model weights can be inspected. In addition to SQL, user-defined functions (UDFs) are also supported.
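To make the difference between the two model types concrete, here is a minimal Python sketch of what each one computes at prediction time. It is an illustration of the general techniques only, not BigQuery ML's implementation; the function names, weights, and the 0.5 threshold are assumptions chosen for the example.

```python
import math


def linear_predict(weights, bias, features):
    """Linear regression: the prediction is a weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


def logistic_predict(weights, bias, features, threshold=0.5):
    """Binary logistic regression: turn the weighted sum into a probability
    with the sigmoid function, then pick one of two classes (0 or 1)."""
    score = linear_predict(weights, bias, features)
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability, int(probability >= threshold)


# Hypothetical weights, as if read from an already trained model.
numeric_estimate = linear_predict([2.0, -1.0], 0.5, [3.0, 4.0])
probability, predicted_class = logistic_predict([2.0, -1.0], 0.5, [3.0, 4.0])
```

The weights passed in here play the role of the "derived model weights" that the service lets users inspect after training.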
“What is great about this is that all of your data analysts know how to use SQL. Now they can create custom models, using SQL directly in BigQuery,” said Rajen Sheth, Google Cloud director of product management, in a keynote talk introducing the technology at the Google Next 2018 conference this week in San Francisco.
There are a number of other advantages to using this service, Sheth boasted. One is that since queries can be done directly against the BigQuery database, no additional extract, transform, and load (ETL) tools are required. Inspecting data inside BigQuery speeds the modeling time as well.
BigQuery is a scalable data warehousing service, one that has been holding petabytes of data on behalf of customers, who can query the data through the standard Structured Query Language (SQL) known by all database administrators and by many analysts and data scientists. BigQuery is a fully-managed serverless service, relieving users from the burdens of node management.
The traditional method of machine learning through different best-of-breed applications can be a cumbersome process, said Abhishek Kashyap, Google Cloud product manager, in a follow-up technical presentation. First, the data must be exported from its source and reformatted for analysis, a job that by itself could take a data science team (if you’re lucky enough to have one) a few weeks. The modeling process can then take months as well.
And all this assumes you have data scientists, who are quickly becoming a scarce commodity: there are 10,000 data scientists in the world, and 4 million programmers, Sheth noted earlier. Analysts working on their own, without help, can use Excel and a small sample to run a basic regression, but this is another process that can take months.
BigQuery ML was designed to automate as many of these data preparation and modeling steps as possible, Kashyap asserted. Predictive analytics is one use case. Instead of just examining data in a historical context, the service can be used to predict future patterns, often with data that a company is already storing in BigQuery.
BigQuery can be accessed through a command-line tool, a web-based user interface, an API, or a third-party analysis tool, such as a Jupyter notebook or business intelligence software like Looker or Tableau.
The media conglomerate 20th Century Fox tested an early release of the service to make sense of its movie marketing data. The marketing team had an existing SQL query for audience analysis, to which only a “create model” statement had to be added. From this, BigQuery returned a linear regression model for predicting which audiences would want to see an about-to-be-released movie, data that was used to revise media planning for that movie.
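As a rough sketch of what adding a “create model” statement to an existing query looks like, the Python helper below wraps a SELECT statement in a BigQuery ML `CREATE MODEL` statement. The helper name, the dataset, table, and column names are all hypothetical and are not taken from Fox's actual query; the `model_type` and `input_label_cols` options follow the BigQuery ML syntax as we understand it.

```python
def wrap_in_create_model(model_name, model_type, label_column, select_sql):
    """Return a CREATE MODEL statement that trains on the rows of select_sql.

    model_type is limited here to the two types the article mentions.
    """
    if model_type not in ("linear_reg", "logistic_reg"):
        raise ValueError(f"unsupported model type: {model_type}")
    # Drop surrounding whitespace and a trailing semicolon from the query.
    training_query = select_sql.strip().rstrip(";")
    return (
        f"CREATE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_column}']) AS\n"
        f"{training_query}"
    )


# Hypothetical audience-analysis query that already exists.
existing_query = """
SELECT region, age_band, trailer_views, tickets_sold
FROM `marketing.audience`;
"""

statement = wrap_in_create_model(
    "marketing.audience_model", "linear_reg", "tickets_sold", existing_query
)
print(statement)
```

The resulting string could then be submitted through any of the access paths BigQuery offers, for example the web interface or a client library.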
Newspaper giant Hearst also has been using an early version of the software. The company has already been holding data in BigQuery on subscriptions, newsletter usage, and user demographics, and so used BigQuery ML’s logistic regression capabilities to identify subscribers who were at risk of not renewing.
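Once a logistic regression model like this is trained, scoring subscribers is another SQL query. The sketch below builds one around BigQuery ML's `ML.PREDICT` table function; the helper name and the model, table, and column names are invented for illustration and do not describe Hearst's actual setup.

```python
def build_predict_query(model_name, input_sql):
    """Return a query that applies a trained model to the rows of input_sql."""
    # Drop surrounding whitespace and a trailing semicolon from the query.
    inner = input_sql.strip().rstrip(";")
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model_name}`,\n"
        f"  ({inner}))"
    )


# Hypothetical subscriber table and churn model.
query = build_predict_query(
    "subscriptions.churn_model",
    "SELECT subscriber_id, newsletter_opens, tenure_months "
    "FROM `subscriptions.current_subscribers`;",
)
print(query)
```

Each returned row would carry the input columns plus the model's prediction, which an analyst could then filter to find the subscribers most at risk of not renewing.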
Google is a sponsor of The New Stack.
Feature image: Rajen Sheth, Google Cloud director of product management. Photo by Alive Coverage, courtesy of Google.