Intel oneAPI’s Unified Programming Model for Python Machine Learning

Let's take a closer look at Intel Distribution of Modin and Intel Extension for Scikit-learn.
Feb 4th, 2022 10:45am

The popular Scikit-learn Python machine learning toolkit is a simple and powerful framework for classical machine learning. If you are training models based on linear regression, logistic regression, decision tree, or random forest algorithms, Scikit-learn is the first choice.

In classical machine learning, one is expected to perform feature engineering (identifying the right attributes) and to handpick the algorithms that fit the business problem. This is the right approach for most problems based on structured data stored in relational databases, spreadsheets, and flat files.

On the other hand, deep learning is a subset of machine learning that relies on large datasets and massive computational power to identify high-level features and hidden patterns in the data. When training models based on unstructured data such as images, video, and audio, ML engineers and researchers prefer deep learning techniques based on well-defined neural network architectures.

In addition to Scikit-learn, frameworks such as TensorFlow, PyTorch, Apache MXNet, and XGBoost may be used to train models on structured or unstructured datasets, covering the wide variety of algorithms used in deep learning and classical machine learning workflows. ML researchers and engineers prefer versions of these frameworks that have been optimized for accelerated performance. This AI acceleration is delivered by a combination of hardware and software.

Deep learning frameworks such as Apache MXNet, TensorFlow, and PyTorch take advantage of acceleration software based on NVIDIA CUDA and cuDNN, which provide interfaces to the underlying NVIDIA GPUs. AMD provides a similar combination through the Heterogeneous-computing Interface for Portability (HIP) and ROCm, which provide access to AMD GPUs. AI acceleration in these cases focuses squarely on the GPU and its software drivers, runtime, and libraries. Deep learning frameworks are tightly integrated with this acceleration software to speed up the training and inference of deep learning models on the respective GPUs.

While GPUs are used extensively in deep learning training, CPUs are more ubiquitous in the full end-to-end AI workflow: data preprocessing and analytics, plus machine learning and deep learning modeling and deployment. In fact, you might be surprised to learn that Intel Xeon Scalable processors are the most widely used server platform from the cloud to the edge for AI.

Intel has been at the forefront of an initiative called oneAPI, a cross-industry, open, standards-based unified programming model targeting multiple architectures, including the aforementioned CPUs and GPUs as well as FPGAs and other AI accelerators. oneAPI is available to developers as a set of toolkits aligned with HPC, AI, IoT, and ray tracing use cases.

Intel oneAPI AI Analytics Toolkit (AI Kit) targets data scientists and AI engineers through familiar Python tools and frameworks. It is part of Intel’s end-to-end suite of AI developer tools and comes with optimizations for Scikit-learn, XGBoost, TensorFlow, and PyTorch.

The most interesting components of the AI Kit for developers and data scientists using a machine learning workflow are Intel Distribution of Modin and Intel Extension for Scikit-learn, which are highly optimized for the CPU and promise a 10-100X performance boost. The best thing about them is that they are drop-in replacements, fully compatible with Pandas and stock Scikit-learn respectively.

Let’s take a closer look at Intel Distribution of Modin and Intel Extension for Scikit-learn.

Intel Distribution of Modin

Intel Distribution of Modin is a performant, parallel, distributed, Pandas-compatible DataFrame acceleration system designed to make data scientists more productive. This library is fully compatible with the Pandas API. It is powered by OmniSci in the back end and provides accelerated analytics on Intel platforms.

Modin is compatible with Pandas while enabling distributed data processing through Ray and Dask. It is a drop-in replacement that turns single-threaded Pandas operations into parallel ones, using all the CPU cores and immediately speeding up data processing workflows. Modin is especially good on large datasets, where Pandas will either run out of memory or become extremely slow.

Modin also has a rich frontend supporting SQL, spreadsheets, and Jupyter notebooks.

Data scientists can easily switch to Modin to take advantage of parallelized data processing capabilities while using the familiar Pandas API.

Installing Modin is simple. It’s available through the Conda package manager of Intel oneAPI AI Analytics Toolkit.

The code snippet below shows how simple it is to use Modin:
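What follows is a minimal sketch of that switch, not the article's original snippet. It assumes Modin is installed with a supported engine such as Ray or Dask. The sample data, the column names, and the fallback to stock Pandas are illustrative assumptions added so the example runs either way.

```python
# Swap the usual "import pandas as pd" for Modin's Pandas-compatible module.
# Assumption: if Modin is not installed, fall back to stock Pandas so the
# sketch still runs; the rest of the code is identical in both cases.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical sample data; a real workload would typically use pd.read_csv(...).
df = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Boston", "Boston"],
        "temp_c": [30.0, 32.0, 20.0, 24.0],
    }
)

# Ordinary Pandas API calls; with Modin they are dispatched to the parallel engine.
mean_by_city = df.groupby("city")["temp_c"].mean()

# Convert to plain Python values for printing or further use.
result = {city: float(temp) for city, temp in mean_by_city.to_dict().items()}
print(result)
```

Only the import line changes; the DataFrame code itself stays as it would be written for Pandas.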

Intel Extension for Scikit-learn

The Intel Extension for Scikit-learn provides optimized implementations of many Scikit-learn algorithms, which are consistent with the original versions and produce results faster. The package simply reverts to Scikit-learn’s original behavior when you use algorithms or parameters not supported by the extension, which delivers a seamless experience to developers. With no need to rewrite existing code, your ML application works as before, only faster where the extension applies.

Intel Extension for Scikit-learn supports popular classical machine learning algorithms often used by Scikit-learn developers.

The acceleration in learning speed is achieved by patching, which replaces the stock scikit-learn algorithms with their optimized versions provided by the extension. Installation of the module can be done through conda or pip.

As a drop-in replacement, using Intel Extension for Scikit-learn is straightforward. Just add the below lines to your code:
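As a hedged sketch of the patching step, assuming the extension is installed (it is published as the `scikit-learn-intelex` package and imported as `sklearnex`): the patch call goes before any Scikit-learn estimator imports. The toy data and the fallback to stock Scikit-learn are illustrative assumptions, not part of the original article.

```python
# Patch Scikit-learn before importing any estimators, so the optimized
# implementations are picked up.
# Assumption: if the extension is missing, continue with stock Scikit-learn.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# The estimator is used exactly as with stock Scikit-learn.
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)
```

The rest of the application code is unchanged, which is what makes the extension a drop-in replacement.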

In an upcoming tutorial, I will demonstrate how to install and use Intel oneAPI AI Analytics Toolkit for training a linear regression model. Stay tuned.
