
How Building the Perfect AI Pipeline Is Like Brewing the Perfect Shot of Espresso

10 Nov 2017 9:11am, by Scott Clark, co-founder and CEO of SigOpt

Scott Clark is the co-founder and CEO of SigOpt. He has been applying optimal learning techniques in industry and academia for years, from bioinformatics to production advertising systems. Before SigOpt, Scott worked on the ad targeting team at Yelp, leading the charge on academic research and outreach with projects like the Yelp Dataset Challenge and open sourcing MOE. Scott holds a PhD in Applied Mathematics and an MS in Computer Science from Cornell University and BS degrees in Mathematics, Physics, and Computational Physics from Oregon State University. Scott was chosen as one of Forbes’ 30 under 30 in 2016.

Building the best AI pipeline is strikingly similar to crafting the perfect shot of espresso. In both cases, a multitude of tunable parameters must be configured before the process even begins, and they have a huge impact on the end result.

From the water pressure and grind of an espresso bean to the learning rate and the number of hidden layers in a neural network, these configuration parameters can make or break your AI pipeline or perfect morning shot. There are many analogous components between these tunable systems, but in both cases getting to the best result has historically been more art than science. In this post, we’ll discuss the similarities of both processes as well as a better way to tune them and similar complex systems.

Training a neural network is not so different from brewing the perfect espresso (or baking the perfect batch of cookies). Water passes through a network of ground espresso beans at some temperature and pressure, absorbing flavor and caffeine as it subtly transforms the beans and eventually outputs a delicious shot of espresso on the other side. As data passes through a deep learning pipeline it transforms the weights of its neurons, absorbing information and converging to a model that can be applied to tasks as varied as natural language processing or sequence classification.

The water that passes through the espresso beans is similar to the data coursing through a neural network as it is trained. It passes through the system to become something different and desirable at the end, interacting with and changing the properties of the system itself as it goes. One can tune various aspects of how the water or data interacts with the system, like the pressure and temperature that affect how water is absorbed by the espresso beans, or the learning rate and other Stochastic Gradient Descent (SGD) parameters in a neural network. These tunable parameters affect how the network of coffee grounds or neurons is influenced by the water or data.
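To make the learning-rate analogy concrete, here is a minimal, hypothetical sketch in plain Python (the function and its toy quadratic objective are illustrative, not from any particular framework) showing how the SGD learning rate governs how quickly a weight converges:

```python
import random

def sgd(lr, epochs=100, seed=0):
    """Minimize the toy objective f(w) = (w - 3)^2 with noisy gradients;
    lr is the tunable learning rate."""
    rng = random.Random(seed)
    w = 0.0  # starting weight
    for _ in range(epochs):
        grad = 2 * (w - 3) + rng.gauss(0, 0.1)  # noisy gradient, like a minibatch
        w -= lr * grad  # the SGD update: step against the gradient
    return w

# A well-chosen learning rate ends up near the optimum w = 3;
# one that is too small barely moves from the starting point.
w_good = sgd(lr=0.1)
w_slow = sgd(lr=0.001)
```

Too high a learning rate can overshoot and diverge, just as too much water pressure ruins the shot; the right value depends on the problem, which is exactly why it is a tunable parameter.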

Water pressure and temperature affect how much flavor and caffeine are added as the water passes through the ground espresso beans.


Stochastic Gradient Descent (SGD) parameters affect how each neuron in the deep learning system converges to the optimal weight as the data passes through the neural network.

Furthermore, the espresso beans themselves are similar to the overall architecture of a deep learning pipeline. Choosing one variety of espresso bean over another can make as much difference to the shot as the choice between a convolutional or recurrent neural network makes in a deep learning pipeline. The grind of the beans and the amount used define the lattice over which the water passes, like the architecture parameters of a neural network, such as the number of hidden layers and the number of neurons per layer.

Roasting the beans in different ways changes how flavor is absorbed by each ground particle of espresso bean, just as the activation function in a neural network allows data to have different effects on individual neurons within the network. Different espresso machines have an impact on the shot even with otherwise identical configuration parameters, just as deep learning frameworks like MXNet, TensorFlow, and Caffe2 can impart subtle differences onto the trained model.

Time is analogous in both systems: the longer you pump water through the machine, the more effect it has on the eventual shot (in both size and quality); similarly, how long you train a neural network (the number of epochs you train over) affects what is learned, how far the model converges, and what the results are.

The amount of espresso beans and the coarseness of the grind change how the water picks up distinct amounts of caffeine and flavor.

The architecture of the neural network, like the number of hidden layers and the number of neurons per layer, changes what is learned from the data.
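As a sketch of how these architecture knobs translate into code, here is a toy example in plain Python (the name build_mlp and its details are hypothetical, not from any framework) where the number of hidden layers and neurons per layer directly determine the shape of the weight matrices:

```python
import random

def build_mlp(n_inputs, hidden_layers, neurons_per_layer, seed=0):
    """Return randomly initialized weight matrices for a toy MLP whose
    shape is set entirely by two tunable architecture parameters."""
    rng = random.Random(seed)
    # Layer sizes: inputs, then the hidden layers, then a single output.
    sizes = [n_inputs] + [neurons_per_layer] * hidden_layers + [1]
    # One weight matrix per pair of adjacent layers.
    return [[[rng.gauss(0, 0.1) for _ in range(sizes[i])]
             for _ in range(sizes[i + 1])]
            for i in range(len(sizes) - 1)]

# Two hidden layers of 8 neurons each: 3 weight matrices in total.
weights = build_mlp(n_inputs=4, hidden_layers=2, neurons_per_layer=8)
```

Changing either architecture parameter reshapes the whole lattice the data flows through, just as a finer grind or a bigger dose of beans reshapes the path the water takes.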

Once you’ve settled on a configuration of parameters to try and have brewed your espresso or trained your network, the next step is measuring the output to determine whether or not it was a success. The best shot of espresso makes the perfect tradeoff between flavor and caffeine: the bitterness needs to be just right. Similarly, a deep learning pipeline has specific, sometimes competing, metrics around accuracy, robustness, or speed. Furthermore, the results of brewing an espresso can be post-processed into a latte, cappuccino, or americano, each with its own tunable parameters, in the same way that a neural network’s output can be fed into a larger pipeline for applications like fraud detection or algorithmic trading. Finding the best configuration for these systems as efficiently as possible can mean boosted performance for these models, or a better caffeine boost in the morning. All of the parameters interact to influence the desired output in different, often non-intuitive ways.
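One simple way to handle such competing metrics is to scalarize them into a single objective. The sketch below is a hypothetical example (the metric names and the weight are illustrative) that rewards accuracy while penalizing prediction latency:

```python
def objective(accuracy, latency_ms, weight=0.001):
    """Hypothetical scalarization of two competing metrics:
    reward accuracy, penalize prediction latency."""
    return accuracy - weight * latency_ms

# A slightly less accurate but much faster model can win the tradeoff.
fast = objective(accuracy=0.91, latency_ms=20)
accurate = objective(accuracy=0.93, latency_ms=80)
```

Choosing the weight is itself a judgment call about how bitter the shot is allowed to be: it encodes how much speed you are willing to give up for accuracy.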


The art of creating the best espresso has been honed by masters since the 19th century. Similarly, deep learning pipelines are often tuned in practice via expert intuition, especially when popular brute-force approaches like grid and random search become intractable due to the sheer quantity of configuration parameters or the expense of training the model. Unfortunately, the optimal configuration of these pipelines can vary wildly across applications, datasets, and contexts. So the manual tuning approach often boils down to time-consuming and expensive trial-and-error optimization, often in high dimensions, which wastes precious expert and computational resources.
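To see why grid search becomes intractable, consider a hypothetical tuning grid with just three values per parameter (the parameter names and ranges below are purely illustrative):

```python
from itertools import product

# A coarse grid over five hypothetical pipeline parameters.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_layers": [1, 2, 4],
    "neurons_per_layer": [32, 64, 128],
    "batch_size": [16, 64, 256],
    "epochs": [10, 50, 100],
}

# Grid search evaluates every combination: 3**5 = 243 full training runs,
# and each additional parameter multiplies the count by another factor of 3.
configs = list(product(*grid.values()))
print(len(configs))
```

When a single training run takes hours or days, even this coarse grid is already out of reach, and a realistic grid with more parameters or finer spacing is hopeless.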

Bayesian optimization techniques, which draw from academic research in fields like Optimal Learning and Sequential Model-Based Optimization, allow for finding the best configurations of these systems in as few attempts as possible. These techniques trade off exploration (learning more about how the parameters interact and combine to influence the desired result) against exploitation (using what we already know to drive toward better performance).

The black box nature of these methods allows us to consider the espresso brewing problem identically to tuning an AI pipeline. Black box methods only observe the input to a system (the specific configuration to be evaluated) and the output (the desired objective or set of objectives to be optimized). This allows Bayesian Optimization tools to easily bolt on top of any underlying system without requiring any information about proprietary data or models. In fact, companies like MillerCoors have applied this to another type of brewing.
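As a toy illustration of the black-box setting (this is a crude explore/exploit loop, not a real Gaussian-process-based Bayesian optimizer; the hidden objective and all names are invented for the example), note that the tuner below only ever sees configurations in and objective values out:

```python
import random

def black_box(config):
    """Stand-in objective: the tuner sees only inputs and outputs.
    This one secretly peaks at temperature=93, pressure=9 (espresso-like)."""
    temperature, pressure = config
    return -((temperature - 93) ** 2 / 50 + (pressure - 9) ** 2)

def tune(n_trials=60, seed=0):
    """Toy explore/exploit loop: half the trials explore the space at
    random, half exploit by sampling near the best configuration so far."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("-inf")
    for trial in range(n_trials):
        if best_cfg is None or trial % 2 == 0:  # explore: sample anywhere
            cfg = (rng.uniform(85, 100), rng.uniform(6, 12))
        else:  # exploit: perturb the best configuration found so far
            cfg = (best_cfg[0] + rng.gauss(0, 1), best_cfg[1] + rng.gauss(0, 0.5))
        val = black_box(cfg)
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg

temperature, pressure = tune()
```

A real Bayesian optimization tool would instead fit a probabilistic model to all past observations and use it to choose each next trial, but the interface is identical: configurations in, objective values out, with no access to the internals of the system being tuned.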

By getting to the best configurations exponentially faster than common techniques like grid search, and by not relying on humans to perform high-dimensional optimization in their heads by intuition, Bayesian optimization allows you to get to the best version of your model, or the perfect espresso shot, faster and cheaper than ever before.

Feature image via Pixabay.


