Tutorial: Speed ML Training with the Intel oneAPI AI Analytics Toolkit

In the last post, I introduced Intel Distribution of Modin and Intel Extension for Scikit-learn, integral parts of the Intel oneAPI AI Analytics Toolkit, and the overall Intel AI Software suite.
Let’s take a closer look at Modin and the Scikit-learn extension through this tutorial. The objective of this guide is to show how they serve as drop-in replacements for the stock Pandas and Scikit-learn libraries. You can try this tutorial either in Intel DevCloud or on your own workstation.
For this tutorial, I provisioned an e2-standard-4 VM on Google Compute Engine with 4 vCPUs and 16GB RAM based on the Intel Broadwell platform. It comes with Python 3.8 preinstalled, which I used as the runtime for this project.
We will train a model to detect fraudulent transactions based on the Fraud Transaction Detection dataset from Kaggle. It’s a ~500MB CSV file with over 6 million rows of data, making it an ideal candidate for Modin. This gives us a chance to compare the load times of Modin vs. Pandas. Before starting the project, download the dataset and copy it to the training environment.
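The training algorithm is based on Nearest Neighbors, a technique that underlies both classification and regression models. The NearestNeighbors estimator used in this tutorial is Scikit-learn's unsupervised neighbor-search learner, so fitting it builds the neighbor index that such models query. We will fit the model twice, once with stock Scikit-learn and once with Intel Extension for Scikit-learn, and compare how long each takes.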
Step 1: Configuring the Environment
Let’s start by installing pip and the required modules.

    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
    python get-pip.py

Now, install Intel Distribution of Modin, Intel Extension for Scikit-learn, and Jupyter.

    pip install scikit-learn-intelex
    pip install modin[all]
    pip install jupyter

Launch Jupyter Notebook and access it from the browser.

    jupyter notebook --ip=0.0.0.0 --port=80

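Before moving on, it can help to confirm that the packages actually landed in the Python environment that Jupyter will use. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is my own, and the distribution names are taken from the pip commands above.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
REQUIRED = ["scikit-learn-intelex", "modin", "jupyter"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED probably went into a different interpreter than the one running this check.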
Step 2: Loading the Dataset and Measuring Performance
With the CSV file uploaded to your training environment, let’s load it into Modin and Pandas.

    csv = 'PS_20174392719_1491204439457_log.csv'

    # Baseline: stock Pandas
    import pandas as pd
    %timeit pd.read_csv(csv)

    # Select the Dask engine before importing Modin so the setting takes effect
    import os
    os.environ["MODIN_ENGINE"] = "dask"

    from distributed import Client
    client = Client()

    # From here on, pd refers to Modin's drop-in Pandas API
    import modin.pandas as pd
    %timeit pd.read_csv(csv)

As we load the dataset, we also measure the time taken by prefixing each read_csv call with the %timeit magic function.
In my environment, Pandas took ~12 seconds while Modin loaded the same dataset in ~6 seconds.
In this run, Intel Distribution of Modin loaded the dataset roughly 2x faster than Pandas. With larger datasets, Modin can deliver even more significant performance improvements.
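The %timeit magic only works inside IPython or Jupyter. If you want to repeat the comparison from a plain Python script, a small timing helper along these lines would do. This is a sketch, not what I used for the numbers above; `best_of` is a hypothetical helper name, and the Pandas and Modin calls are left commented out so the snippet runs without the CSV file.

```python
import time


def best_of(func, *args, repeat=3, **kwargs):
    """Call func(*args, **kwargs) `repeat` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Intended usage with the tutorial's dataset (requires pandas, modin and the CSV):
# import pandas
# import modin.pandas as mpd
# pandas_s, _ = best_of(pandas.read_csv, csv)
# modin_s, _ = best_of(mpd.read_csv, csv)
# print(f"pandas {pandas_s:.1f}s, Modin {modin_s:.1f}s, ratio {pandas_s / modin_s:.1f}x")

seconds, total = best_of(sum, range(1_000_000))
print(f"sum over one million integers: best of 3 runs = {seconds:.4f}s")
```

Taking the best of several runs reduces noise from disk caching and background processes, which matters when the first read of a large file is much slower than later ones.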
Step 3: Preparing and Preprocessing the Dataset
Irrespective of how we loaded the dataset, we need to prepare and preprocess it to make it useful for the training.
First, we will drop the columns that are not relevant or useful.

    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv(csv)
    df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)

The type column in the dataset has five categories:
● CASH_IN
● CASH_OUT
● DEBIT
● PAYMENT
● TRANSFER
Let’s encode them into integers.

    df['type'] = df['type'].astype('category')
    type_encode = LabelEncoder()
    df['type'] = type_encode.fit_transform(df.type)

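To make the effect of label encoding concrete, here is a pure-Python sketch of the mapping it produces: the unique values are sorted and each one is assigned its position in that sorted list. The function `label_encode` and the sample values are illustrative assumptions of mine, not part of the dataset code, and the category names are spelled with underscores as I understand the CSV spells them.

```python
def label_encode(values):
    """Return (codes, classes), where classes are the sorted unique values."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


# A handful of made-up rows from the 'type' column.
types = ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN", "PAYMENT"]
codes, classes = label_encode(types)

print(classes)  # ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
print(codes)    # [3, 4, 1, 2, 0, 3]
```

The integers carry no numeric meaning: TRANSFER is not "greater than" PAYMENT. That is why a distance-based model such as Nearest Neighbors needs a further encoding step.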
Finally, we will perform one-hot encoding to convert the integer codes into one indicator column per category, append those columns to the dataset, and delete the original column.

    type_one_hot = OneHotEncoder()
    type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1, 1)).toarray()
    ohe_variable = pd.DataFrame(type_one_hot_encode, columns=["type_" + str(int(i)) for i in range(type_one_hot_encode.shape[1])])
    df = pd.concat([df, ohe_variable], axis=1)
    df = df.drop('type', axis=1)

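The shape of the result is easier to see on a tiny example. The sketch below is a pure-Python illustration of one-hot encoding, not the OneHotEncoder implementation; the helper `one_hot` and the sample codes are assumptions for illustration, and the column names follow the type_0 ... type_4 pattern built above.

```python
def one_hot(codes, n_classes):
    """Turn each integer code into a row with a single 1.0 at the code's position."""
    return [
        [1.0 if col == code else 0.0 for col in range(n_classes)]
        for code in codes
    ]


codes = [3, 4, 1, 2, 0, 3]                  # made-up encoded 'type' values
columns = [f"type_{i}" for i in range(5)]   # type_0 ... type_4
rows = one_hot(codes, n_classes=5)

print(columns)
for row in rows:
    print(row)
```

Every row contains exactly one 1.0, so any two different transaction types are equally far apart once encoded, which avoids the false ordering implied by the integer codes.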
Since some of the values in the dataset are null, we will perform data imputation by replacing them with zeros.

    df = df.fillna(0)

The dataset is now ready for training.
Step 4: Training the Model and Measuring the Performance
Before kicking off the training process, let’s separate the features and labels and then split the data into train and test datasets.

    features = df.drop('isFraud', axis=1).values
    target = df['isFraud'].values

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42, stratify=target)

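This creates a test dataset with 30% of the data and keeps the remaining 70% for training. Passing the target to the stratify argument preserves the ratio of fraudulent to legitimate transactions in both splits.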
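Stratification matters here because fraud cases are rare, and a purely random split could leave the test set with too few of them. The following pure-Python sketch shows the idea on made-up labels. It is not how train_test_split is implemented; `stratified_split` is a hypothetical helper, and the 990/10 class balance is invented for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

fraud_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), fraud_in_test)  # 700 300 3
```

Both splits keep the original 1% fraud rate, so evaluation on the test set reflects the same class balance the model saw during training.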
First, let’s train the model with Scikit-learn and measure the performance.

    from sklearn.neighbors import NearestNeighbors

    knn_classifier = NearestNeighbors(n_neighbors=3)
    # NearestNeighbors is unsupervised: fit() accepts y_train but ignores it
    %timeit knn_classifier.fit(X_train, y_train)

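For readers new to the algorithm, here is a brute-force sketch of what a nearest-neighbors model does conceptually: fitting stores the points, and a query returns the closest stored points by Euclidean distance. This is an illustration under simplifying assumptions (a single query, plain Python lists), not Scikit-learn's implementation, which can use tree-based indexes and handles batches of queries. The class name `BruteForceNeighbors` and the sample points are made up.

```python
import heapq
import math


class BruteForceNeighbors:
    """Toy nearest-neighbor search: store points, scan them all per query."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self._points = []

    def fit(self, X, y=None):
        # Labels are accepted for API symmetry but ignored, as in unsupervised search.
        self._points = [tuple(row) for row in X]
        return self

    def kneighbors(self, query):
        """Return (distances, indices) of the closest stored points to one query."""
        scored = ((math.dist(query, p), i) for i, p in enumerate(self._points))
        nearest = heapq.nsmallest(self.n_neighbors, scored)
        return [d for d, _ in nearest], [i for _, i in nearest]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
model = BruteForceNeighbors(n_neighbors=3).fit(points)
distances, indices = model.kneighbors((0.0, 0.5))
print(indices)  # [0, 1, 3]
```

A classifier built on top of this would look up the labels of the returned indices and take a majority vote, which is why the quality of the distance computation and the speed of the search dominate the cost.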
Once it is done, we will repeat the step with Intel Extension for Scikit-learn. Notice that we are explicitly loading the sklearnex module and importing NearestNeighbors from it.

    from sklearnex.neighbors import NearestNeighbors

    knn_classifier = NearestNeighbors(n_neighbors=3)
    %timeit knn_classifier.fit(X_train, y_train)

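Importing from sklearnex directly is one option. The extension also provides patch_sklearn(), which swaps in the accelerated implementations so that existing `from sklearn...` imports stay unchanged. The sketch below wraps that call in a hypothetical helper, `enable_intel_acceleration`, that falls back gracefully when the extension is not installed; the helper and its return convention are my own, not part of the library.

```python
def enable_intel_acceleration():
    """Patch scikit-learn in place if sklearnex is installed; return True on success."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # Extension not installed: stock scikit-learn will be used unchanged.
        return False
    patch_sklearn()
    return True


accelerated = enable_intel_acceleration()

# Import estimators only after patching so the accelerated versions are picked up:
# from sklearn.neighbors import NearestNeighbors

print("Intel Extension for Scikit-learn active:", accelerated)
```

This pattern keeps a single code path for machines with and without the extension, which fits the drop-in replacement idea this tutorial is about.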
In my environment, stock scikit-learn took 23.8 seconds while Intel Extension for Scikit-learn finished training in only 5.72 seconds, a speedup of over 4X. Though the results may vary on your machine, it is evident that Intel Extension for Scikit-learn is significantly faster than stock Scikit-learn. It accelerates training on general-purpose x86 CPUs without the need for expensive AI accelerators such as GPUs and FPGAs.
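The speedup figures quoted in this tutorial follow directly from the measured timings. As a quick check, using the numbers reported above (the Modin and Pandas load times are approximate):

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time; above 1 means faster."""
    return baseline_seconds / accelerated_seconds


fit_speedup = speedup(23.8, 5.72)   # stock Scikit-learn vs. Intel Extension
load_speedup = speedup(12, 6)       # Pandas vs. Modin, approximate timings

print(f"fit: {fit_speedup:.2f}x")    # fit: 4.16x
print(f"load: {load_speedup:.1f}x")  # load: 2.0x
```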