Contributed / Top Stories / Tutorials /

Building a Machine Learning Application? Start with SQL

25 Oct 2017 6:00am, by

Development of machine learning (ML) applications has required a collection of advanced languages, different systems, and programming tools accessible only by select developers.

But now common ML functions can be accessed directly from the widely understood SQL language. This can be especially helpful for organizations facing a shortage of talent to carry out machine learning plans.

In this article, we will provide use cases and examples for how to integrate machine learning workflows with a scalable SQL database, and offer a peek into the future about how this will foster opportunities for further development.

New Database and Hardware Approaches Benefit Machine Learning

Machine learning is not new, and SQL databases are not new, so what has changed? Three things.

Adam Prout
Adam Prout oversees product architecture and development at MemSQL. He joined MemSQL as a co-founding engineer and previously served as Vice President of Engineering. Prout spent five years as a senior database engineer at Microsoft SQL Server where he led engineering efforts on kernel development. Prout holds a bachelor’s and master’s in computer science from the University of Waterloo and is an expert in distributed database systems.
  • Distributed Systems: Today, SQL databases are distributed, allowing you to use more cores and have better parallelism. Overall, this enables more hardware to address a single query all at once providing the best performance. Distributed systems can scale to hundreds of servers or cloud instances, providing a different playing field of performance than single node systems.
  • New Hardware: Modern CPU processor hardware is increasingly more sophisticated. For example, the Intel Advanced Vector Extension model, an implementation of Single Instruction, Multiple Data (SIMD) is widely available in its first iteration AVX-2 and new higher-end processors support the more advanced AVX-512. In short, SIMD allows processors to compute on multiple pieces of data simultaneously increasing parallelism at the processor level. Additionally, GPUs can achieve a similar type of parallelism by coordinating among multiple cores simultaneously.
  • Code Generation: Code generation helps optimize queries and custom functions in a database by converting the original requests into machine code, allowing it to run much faster. However, the true power of code generation comes from generating machine code optimized for a specific query, avoiding the overhead of an interpreter. An interpreter is a piece of code, which can run any query, and as such contains more generalized instructions, which is usually not optimal for the performance of a given specific query.

Historically, traditional databases have had functions that are built-in and fast. But these have been a limited set with rare new additions. With code generation, any new function that a customer deploys can benefit from conversion to machine code. Essentially a database with effective code generation is constantly optimizing on behalf of the database user.

What’s Driving the ML Hype

There is no question that people and companies want to get more out of their data, and today data is simply too large to be analyzed on a human scale. Machine processing, or machine learning, is the only way to glean insights.

Machine learning generally involves processing large amounts of data in a computationally intense manner. Having these underlying technical capabilities in software systems is a critical enabler of machine learning.

Parallelization

Machine learning is not new, but parallelization of ML is new. Historically, ML has been focused on languages such as R and Python which are:

  • Flexible.
  • Relatively easy to learn compared to lower level languages like C++ or Java, though still more difficult to learn than SQL.
  • Dominated by single-threaded processes.

On the other hand, SQL has been:

  • Historically known to be a little less flexible.
  • However, easy to learn and widely deployed in enterprises. SQL is about as close to asking a computer a precisely defined question in English as we have today.
  • Historically had multithreaded capabilities but frequently single server execution.

Today, with advances in distributed SQL Datastores, you can combine the dominance of SQL as a data access language with the performance and capabilities of a scalable parallel system.

Processing Close to the Data

Another development furthering the merging of SQL and ML is that with languages such as R and Python, the processing is happening further away from the data. Data is extracted from a data store, then processed in an external engine or at the data application layer. Spark supports distributed execution of R and Python code on data in memory, but does so on a transient basis and still requires the external persistence of data sources and results.

With a SQL database, the database itself is storing the data and processing the data, so by using SQL operations, you collapse the stack for performance and scalability.

Examples of SQL and ML

There are a few ways to take advantage of SQL and ML. In particular, these approaches focus on the operationalization of machine learning models. The approaches do not cover every implementation of ML models by a longshot, but they do provide a set of new scalability and performance capabilities for real-time applications.

Comparing Vectors with DOT_PRODUCT

The mathematical operator DOT_PRODUCT, sometimes known as SCALAR PRODUCT, compares two vectors and returns a coefficient of similarity. DOT_PRODUCT is not a new function, but implementation of this function in a distributed SQL database is new and provides the ability to co-locate the function with an operational SQL database.

Without co-locating the functionality, you would need to have an external system to match vectors, then look them up in the database. A far more performant approach places the final matching step right there next to the data.

Real-world examples of DOT_PRODUCT include comparing vectors for facial or image recognition. Similarly, the attributes of many goods and services fit perfectly into a vector. Consider all the attributes for a house or a car and they too can easily be converted into vectors.

Scoring

Developing machine learning models typically happens offline, but scoring often happens in real time, providing compelling business value to modern applications.

With the right SQL databases, ML models can be executed as data lands in the system, providing both the raw data point and the ML result point in the same row in the same table.

Using a real-time approach, scoring occurs on the way in, with no second phase needed to run and build scoring. Since the score is there all the time, application complexity is significantly reduced with fewer overall processes to manage.

When combined with a distributed SQL database, the system can easily scale to handle the largest incoming workloads from global applications. Additional benefits accrue from high-performance parallel connectors to message queues such as Kafka and execution engines, such as Spark.

Custom Functions and Procedural SQL

A further area of innovation is the addition of extensibility for distributed databases. Adding extensibility to a database is not new with Oracle supporting PL/SQL and Microsoft supporting T-SQL. However, adding custom functions and procedural SQL to a distributed database is new, providing a range of options. That is the approach with MemSQL’s Massively Parallel SQL, called MPSQL.

Adding extensibility to a database that also supports code generation, including code generation for extensibility functions, delivers the maximum performance possible from compute resources.

For example, with procedural SQL functions, you can implement the popular k-means clustering algorithm in SQL. A simple demonstration of k-means clustering in SQL appears in the second part of this presentation.

Once you build these types of functions into the SQL database, you have the advantage of the underlying programmability. Specifically, you can build custom functions to suit your application. Other databases often have custom functions built-in but without the ability to change or customize them.

Having the underlying programmability is a bigger advantage than simply having the built-in function. This is critical as all data is unique and to achieve the greatest optimizations you need the ability to tinker. Programmatic approaches such as MPSQL provide this.

Going Forward with ML and SQL

Going forward we are likely to see more machine learning used inside the database itself. Once the data is there and users are entering SQL queries, the database will naturally glean more information about each user.

This will drive advances in ease of use. Right now, you must write everything if you want to incorporate ML, but few things are easier than a SQL query.

ML and SQL will continue to overlap in many ways and with a range of frameworks. Currently, we have a proliferation of Spark, TensorFlow, Gluon and others. Eventually, the industry will centralize on fewer frameworks and they will be built into the database. This converging of functions will be easier to use and unleash architectural advances for new applications.

Feature image by Ales Krivec, via Unsplash.


A digest of the week’s most important stories & analyses.

View / Add Comments

Please stay on topic and be respectful of others. Review our Terms of Use.