
4 Challenges to Building Scalable AI/ML Pipelines in the Cloud

19 Jun 2018 7:03am, by

John Morrell, Senior Director, Product Marketing, Datameer
John Morrell is the Senior Director of Product Marketing at Datameer. John is responsible for leading the go to market efforts for the Datameer product family and understanding how customers use Datameer to solve their business problems.

Where the internet allowed computers to talk to each other over vast distances, cloud computing has allowed computers to think together, creating machines with virtually limitless storage and computing capacity. We can access this super-computing power for pennies on the dollar compared to on-premise servers. There’s almost no limit to what we can do.

Artificial intelligence and machine learning are two of the most exciting technologies to find a home on the cloud. The near-infinite scale, instant provisioning, and flexibility of cloud computing make it the optimal infrastructure for AI/ML modeling and training.

However, cloud computing doesn’t remove every obstacle to AI and ML. Data preparation and pipelining are still as time-consuming on the cloud as they are on-premises. Data scientists spend a disproportionate amount of time on these prerequisite activities when they should be using that time to analyze the data instead.

There are several challenges with data preparation and pipelining that cloud computing cannot solve on its own. In this article, we’re going to look at those challenges and explore possible solutions.

Challenges of Data Preparation and Pipelining

Data preparation is a critical precursor to effective AI/ML data analysis. Still, companies need to find a more efficient way of accomplishing these tasks so that data scientists can spend time doing what they do best: experimenting and discovering new insights.

There are four key challenges to data preparation and pipelining that we must solve:

  1. Manual Data Preparation
  2. Removing Bias from AI/ML Data Models
  3. Reusability and Reproducibility
  4. Reimplementation

Manual Data Preparation

Data scientists are often said to spend as much as 80 percent of their time cleaning and preparing data for analysis. This is because most data scientists manually write data preparation scripts in R or Python. Not only is this a slow process, but the resulting scripts are also difficult to edit and manage. Any change requires the data scientist to carefully rework the code, leaving plenty of room for error along the way.
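To make the problem concrete, here is a minimal sketch of the kind of hand-written preparation script described above. The dataset, the column names, and the cleaning rules are all invented for illustration; they are assumptions, not taken from any real project.

```python
import csv
import io

# Hypothetical raw export; column names and values are invented for illustration.
RAW = """customer_id,signup_date,monthly_spend
 101 ,2018-01-15,42.50
102,2018-02-03,
103,not a date,17.00
"""

def clean_rows(text):
    """Trim whitespace, drop rows with a malformed date, default missing spend to 0.0."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        row = {key: value.strip() for key, value in row.items()}
        parts = row["signup_date"].split("-")
        if len(parts) != 3 or not all(part.isdigit() for part in parts):
            continue  # hard-coded rule: silently drop rows with a bad date
        row["customer_id"] = int(row["customer_id"])
        row["monthly_spend"] = float(row["monthly_spend"] or 0.0)
        cleaned.append(row)
    return cleaned

rows = clean_rows(RAW)
```

Every rule here (what counts as a bad date, what a missing value becomes) is buried inside the loop, so changing any one of them means editing and retesting the script by hand.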

Removing Bias from AI/ML Data Models

Removing bias from AI and ML data models presents a catch-22 for companies. Reducing bias in a model generally means training it on more, and more varied, data, but as we described above, the preparation of that data is incredibly time-consuming. Companies are forced to trade off time, money, and accuracy when improving their AI/ML models.

Reusability and Reproducibility

To cut down on rework by data scientists, data assets (data models and pipelines) should be built in such a way that they can be reused in the future. Manually writing data preparation scripts makes reusing data assets difficult because data scientists must meticulously comb through the code to make necessary changes.
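One way to picture a reusable data asset is a pipeline assembled from small, parameterized steps rather than one monolithic script. The sketch below is only an illustration under that assumption; the step names (`strip_whitespace`, `fill_missing`) and the `run_pipeline` helper are hypothetical.

```python
def strip_whitespace(rows):
    """Trim surrounding whitespace from every string value."""
    return [
        {key: value.strip() if isinstance(value, str) else value for key, value in row.items()}
        for row in rows
    ]

def fill_missing(column, default):
    """Build a step that replaces empty or absent values in one column."""
    def step(rows):
        return [
            {**row, column: row[column] if row.get(column) not in ("", None) else default}
            for row in rows
        ]
    return step

def run_pipeline(rows, steps):
    """Apply each step in order and return the transformed rows."""
    for step in steps:
        rows = step(rows)
    return rows

# The same step list is reused, unchanged, on two unrelated datasets.
cleaning_steps = [strip_whitespace, fill_missing("region", "unknown")]
sales = run_pipeline([{"region": " west "}, {"region": ""}], cleaning_steps)
support = run_pipeline([{"region": "east", "tickets": 3}], cleaning_steps)
```

Because each rule is a named step with explicit parameters, adapting the pipeline to a new dataset means changing the step list, not combing through the code.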

Companies today also need their data assets to be reproducible for practical and compliance purposes. This means any actions taken with the data should be documented, including where it was moved, how it was transformed, and how it was blended with other datasets. With the arrival of data privacy laws like GDPR, there may come a time when you have to prove what you did with a user’s data and why.


Reimplementation

Another major bottleneck in most AI/ML programs is the reimplementation of data models by IT. After data scientists develop a new data model, they must hand it off to operations, who reimplement it for use at scale. Reimplementation creates a fractured process where no single group is in charge of the outcome, leading to delays and errors. It also requires complex coding, which leads to longer reimplementation times and unmaintainable execution models.

Benefits of Data Preparation and Pipelining Platforms

The challenges of data preparation and pipelining stem from trying to do everything manually. There are capable tools on the market today that shoulder much of the work for you, letting your data scientists get back to doing what they do best.

Many enterprises are turning to commercial, off-the-shelf data preparation and pipelining platforms to manage their cloud AI and ML programs. These platforms come equipped with a number of capabilities that mitigate the challenges we looked at above.

Let’s look at each challenge again and how data preparation and pipelining platforms solve them.

Challenge #1: Manual Data Preparation

Solution: Agile Data Preparation in Data-Centric Metaphors

Data scientists waste hundreds of hours manually writing data preparation code or scripts. Not only is this code arduous to write, it is also extremely difficult to manage, maintain, and reuse.

Modern, interactive and visual data preparation platforms take the manual work out of writing code for data preparation. These built-in capabilities not only make it easy to cleanse and blend data, but also shape data, apply algorithms, and understand how the various attributes of data impact the problem you’re trying to solve (feature engineering).

Data preparation and pipelining platforms let you prepare the data through a simple, data-centric user interface metaphor. When dealing with potentially billions of records, each with hundreds of attributes, you need an easy way to explore this data without getting lost. A visual interface for exploring data at scale lets you “fail fast”: you can follow one path of analysis, then easily work your way back and try another.

Challenge #2: Removing Bias from AI/ML Data Models

Solution: Data Blending Directly on the Data Lake

Removing bias from an AI/ML data model requires feeding the model large amounts of information. Data integration and blending are a relatively quick and straightforward way to create these new datasets for training.

Typically, data scientists would have to extract data from the data lake first and blend it in a separate database. This is an expensive use of resources, especially when your data lake lives in the cloud.

With the right data preparation and pipelining platform, you can blend data right on the data lake without moving it. This makes data blending faster and more iterative, allowing data scientists to experiment with multiple blends in the time it used to take to create just one. The less data is moved around, the faster and easier it is to feed AI and ML data models for training.
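As a rough illustration of blending data where it lives, the sketch below joins two tables inside the store that holds them instead of exporting them to a separate database first. An in-memory SQLite database stands in for the data lake's query engine, which is purely an assumption for the example; the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the data lake's query engine (an assumption).
lake = sqlite3.connect(":memory:")
lake.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (2, 12.5);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');

    -- The blend is defined as a view, so no rows are copied or moved.
    CREATE VIEW spend_by_region AS
        SELECT c.region AS region, SUM(o.amount) AS total_spend
        FROM orders AS o JOIN customers AS c USING (customer_id)
        GROUP BY c.region;
""")

blend = lake.execute(
    "SELECT region, total_spend FROM spend_by_region ORDER BY region"
).fetchall()
```

Trying a different blend is then a matter of defining another view over the same tables, which is what makes iteration cheap.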

Challenge #3: Reusability and Reproducibility

Solution: Automatic Metadata Capture

In order to create reusable and reproducible data models, data scientists must record all actions performed on the dataset. This trail of information creates a map that helps other stakeholders follow the data flow and understand the logic behind the models.

Data preparation and pipelining platforms automatically capture this information in the form of metadata. Every action, from moving to blending to applying algorithms, is recorded and copied into reports for others to view. If you backtrack and change a function in your algorithm, the platform automatically revises the metadata to reflect this. This is a far more reliable method than recording information by hand. It’s also a much better way to ensure that metadata stays updated and accurate.
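The idea of automatic metadata capture can be sketched in a few lines: every transformation goes through a wrapper that logs what was done as a side effect. This is a simplified illustration, not how any particular platform implements it; the `RecordedDataset` class, the `drop_below` step, and the `orders.csv` source name are hypothetical.

```python
import json

class RecordedDataset:
    """Holds rows plus a lineage log that grows with every applied action."""

    def __init__(self, rows, source):
        self.rows = rows
        self.lineage = [{"action": "load", "source": source}]

    def apply(self, action, func, **params):
        """Run a transformation and record its name, parameters, and output size."""
        self.rows = func(self.rows, **params)
        self.lineage.append(
            {"action": action, "params": params, "rows_out": len(self.rows)}
        )
        return self

def drop_below(rows, column, minimum):
    """Keep only rows whose value in `column` is at least `minimum`."""
    return [row for row in rows if row[column] >= minimum]

data = RecordedDataset([{"amount": 5}, {"amount": 50}], source="orders.csv")
data.apply("drop_below", drop_below, column="amount", minimum=10)

# The lineage log doubles as a shareable report.
report = json.dumps(data.lineage, indent=2)
```

Because the log is written by the same call that performs the transformation, it cannot drift out of date the way hand-kept notes can.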

Challenge #4: Reimplementation

Solution: Streamlined Operationalization

The final must-have capability is streamlined operationalization. Instead of reimplementation, data preparation and pipelining platforms let data scientists plug in the AI/ML models they produce and run them directly on the data lake at scale. They can operationalize the data models themselves without custom coding or even needing to involve IT.

Streamlined operationalization adds another level of transparency to the data flow as well. By keeping the entire process on one platform, companies can audit and effectively govern their data to stay in compliance with new data privacy laws. It is becoming increasingly valuable to have a data platform that tracks everything in one place.

Building Agile, Scalable Pipelines Starts with the Platform

Data preparation is always a time-consuming task, but the problem is exacerbated when running AI/ML initiatives. Companies can’t afford to have their best data scientists spend 80 percent of their time on preparation, on top of the cost of moving data to and from the cloud. The amount of data ingested continues to balloon, and the old way of preparing and pipelining is becoming obsolete.

Companies with aspirations to build robust AI/ML programs need to consider using a commercial, off-the-shelf data preparation and pipelining platform. Not only do these platforms make preparation faster and easier, they also keep the company safe by documenting the use of data for compliance. Platforms like Datameer streamline the entire process from preparation to operationalization while saving you time, money, and your best human resources.

Feature image by Drew Farwell on Unsplash.
