
A Tutorial Introduction to Google Vertex AI AutoML: Data Preparation

18 Jun 2021 8:25am

This post is the first in a two-part series exploring Google’s newly launched Vertex AI, a unified machine learning and deep learning platform. This post delves into data preparation. Check back on Monday for the second installment on the training and inference process.

Google’s Vertex AI is a unified machine learning and deep learning platform that supports AutoML models and custom models. In this tutorial, we will train an image classification model to detect face masks with Vertex AI AutoML. For an introduction to Vertex AI, read this article I published last week at The New Stack.

To complete this tutorial, you need an active Google Cloud subscription and the Google Cloud SDK installed on your workstation.
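Before going further, it can help to confirm that the SDK's command-line tools are reachable. The following is a minimal sketch, not part of the original walkthrough; the helper name `check_tools` is hypothetical, and it only checks that the named commands exist on your PATH, not that you are authenticated.

```shell
# Report the first command that is missing from PATH.
# Returns 0 when every named tool is found, 1 otherwise.
check_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      return 1
    fi
  done
  return 0
}
```

Running `check_tools gcloud gsutil` prints nothing when both tools from the Google Cloud SDK are installed.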

There are three steps involved in training this model: dataset creation, training, and inference.

Dataset creation involves uploading the images and labeling them. Since we are using AutoML, training needs minimal intervention. We don’t need to write code or perform steps like hyperparameter tuning. Once training is done, we can download the model for deployment on edge devices or host it to perform inference.

This first part of the tutorial focuses on creating the dataset. We will use the raw dataset of faces with and without masks created by Prajna Bhandary.

She used image augmentation techniques to generate 600+ images for each class.

While this is not the most comprehensive dataset, it is a good fit for AutoML, which can train models with fewer images.

We will upload these images to a Google Cloud Storage bucket with two folders: mask and no-mask. A CSV file containing the path and label of each image will be uploaded to the same bucket, and it becomes the input for Vertex AI.

Let’s create the Google Cloud Storage bucket.
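The original command listing is not reproduced here. As a minimal sketch, assuming `gsutil` from the Google Cloud SDK is installed and authenticated, bucket creation might look like the following. The bucket name `my-face-mask-dataset`, the default region `us-central1`, and the helper name `create_bucket` are placeholders chosen for illustration.

```shell
# Hypothetical values: replace with your own bucket name and region.
BUCKET="${BUCKET:-my-face-mask-dataset}"
REGION="${REGION:-us-central1}"

# Create a bucket in a given location.
# "gsutil mb" makes a bucket; -l sets the bucket location.
create_bucket() {
  # $1 = bucket name, $2 = region
  gsutil mb -l "$2" "gs://$1"
}
```

Calling `create_bucket "$BUCKET" "$REGION"` then creates the bucket. Bucket names are globally unique, so the placeholder name will almost certainly need changing.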

Feel free to change the values to reflect your bucket name and region. At the time of launch, Vertex AI AutoML is available only in the US-CENTRAL1 (Iowa) and EUROPE-WEST4 (Netherlands) regions.

We will now start uploading the images to the above bucket.

Clone the GitHub repository on your local machine.

Navigate to the data directory and run the following commands:
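The upload commands themselves are not reproduced here. A minimal sketch, assuming the bucket folders are named mask and no-mask as described earlier, could look like this. The helper `upload_images` and the default bucket name are hypothetical, and the local folder names you pass in depend on how the cloned repository lays out its images.

```shell
BUCKET="${BUCKET:-my-face-mask-dataset}"   # hypothetical bucket name

# Copy every file in a local folder into a folder of the bucket.
# The -m option lets gsutil perform the copy in parallel.
upload_images() {
  # $1 = local directory, $2 = destination folder inside the bucket
  gsutil -m cp "$1"/* "gs://${BUCKET}/$2/"
}
```

For example, `upload_images with_mask mask` followed by `upload_images without_mask no-mask` would fill both folders, assuming the local directories are called `with_mask` and `without_mask`.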

To upload images simultaneously from both directories, run the commands in two different terminal windows.

Check the Google Cloud Console and browse the folders.

Once the images are uploaded, we need to generate a CSV file with the path and label of each image.

We will run a simple Bash script for this task.
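The script is not reproduced here, so the following is only a sketch of one way to do it: list the objects in a bucket folder with `gsutil ls` and append the folder name as the label. The helper names `label_paths` and `append_folder` and the default bucket name are hypothetical, and the sketch assumes labels contain no `/` or `&` characters, which would confuse `sed`.

```shell
BUCKET="${BUCKET:-my-face-mask-dataset}"   # hypothetical bucket name

# Read object paths on stdin and emit "path,label" rows.
# Blank lines and lines ending in "/" (folder entries) are skipped.
label_paths() {
  # $1 = label to attach to every path
  grep -v -e '/$' -e '^$' | sed "s/\$/,$1/"
}

# List one bucket folder and append its labeled rows to a CSV file.
append_folder() {
  # $1 = folder name (also used as the label), $2 = output CSV file
  gsutil ls "gs://${BUCKET}/$1/" | label_paths "$1" >> "$2"
}
```

With these helpers, `append_folder mask mask-ds.csv` writes one row per image in the mask folder.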

This populates the file mask-ds.csv with entries that look like this:
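The sample output is not reproduced here. Given the path-and-label layout described above, each row is expected to hold a Cloud Storage path followed by a label. The sketch below shows an illustrative row in a comment (the bucket and file names are made up) and a hypothetical helper, `find_bad_rows`, that prints any row not matching that two-column shape.

```shell
# Illustrative row (hypothetical bucket and file names):
#   gs://my-face-mask-dataset/mask/image-001.jpg,mask

# Print every row that is NOT of the form gs://<path>,<label>.
# No output means every row has the expected two-column shape.
find_bad_rows() {
  # $1 = CSV file to check
  grep -v -E '^gs://[^,]+,[^,]+$' "$1"
}
```

Running `find_bad_rows mask-ds.csv` should print nothing if the file was generated cleanly.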

Let’s repeat this for the second folder to generate the path and label for no-mask.

This will append lines to the CSV file with the path of images with no mask.
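Once both folders have been processed, a quick sanity check is to count the rows per label. This is a small sketch rather than a step from the original walkthrough; `count_labels` is a hypothetical helper that treats the last comma-separated field of each row as the label.

```shell
# Count rows per label, where the label is the last comma-separated field.
# Output is one "label count" pair per line, sorted by label.
count_labels() {
  # $1 = CSV file
  awk -F, '{ n[$NF]++ } END { for (l in n) print l, n[l] }' "$1" | sort
}
```

Since the dataset holds 600+ images per class, `count_labels mask-ds.csv` should report a count in that range for both mask and no-mask if every image was uploaded and listed.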

Finally, we need to upload the CSV file to the bucket.

The CSV file becomes the critical input to Vertex AI AutoML to create the final dataset.

Running gsutil ls gs://$BUCKET confirms that the CSV file was successfully uploaded to the Google Cloud Storage bucket.
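The upload and the check can be sketched together as follows. The helper names `upload_csv` and `csv_in_bucket` and the default bucket name are hypothetical; the underlying commands are `gsutil cp` and `gsutil ls`.

```shell
BUCKET="${BUCKET:-my-face-mask-dataset}"   # hypothetical bucket name

# Copy a local CSV file to the top level of the bucket.
upload_csv() {
  # $1 = local CSV file
  gsutil cp "$1" "gs://${BUCKET}/"
}

# Succeed only if the bucket listing contains the named file.
# -F matches the name literally; -x requires the whole line to match.
csv_in_bucket() {
  # $1 = file name expected at the top level of the bucket
  gsutil ls "gs://${BUCKET}" | grep -q -F -x "gs://${BUCKET}/$1"
}
```

A typical sequence would be `upload_csv mask-ds.csv` followed by `csv_in_bucket mask-ds.csv && echo "CSV uploaded"`.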

With the data uploaded to cloud storage, let’s turn that into a Vertex AI dataset.

Access the Vertex AI Dashboard in the Google Cloud Console and enable the API. Choose the region and click on create dataset:

Give the dataset a name, choose image classification with a single label, and click on create:

In the next section, choose the option to select import files from Cloud Storage:

Browse the Cloud Storage bucket and select the CSV file uploaded earlier, and click on continue:

The import process takes a few minutes. When it completes, you are taken to the next page, which shows all of the images identified from the dataset, both labeled and unlabeled:

You may see some warnings and errors during the import process due to duplicate images found by Vertex AI. They can be safely ignored.
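If you would like to see which local files might trigger those warnings, one optional approach is to compare checksums before uploading. This is only a sketch: `find_duplicates` is a hypothetical helper, it uses the POSIX `cksum` utility (a CRC plus the file size), and a match is a hint of identical content rather than proof. How Vertex AI itself decides that two images are duplicates is not covered here.

```shell
# Print files in a directory whose CRC and size match an earlier file.
# cksum prints one line per file: <checksum> <size> <filename>.
find_duplicates() {
  # $1 = directory containing the images
  cksum "$1"/* | awk '{
    key = $1 " " $2                       # checksum plus size
    name = $0
    sub(/^[0-9]+ [0-9]+ /, "", name)      # strip the two numeric columns
    if (seen[key]++) print name           # report the second and later copies
  }'
}
```

For instance, `find_duplicates with_mask` lists the later copies of any repeated image in a local folder named `with_mask` (an assumed name).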

We are now ready to kick off the training. Stay tuned for the next part of the tutorial for a walkthrough of the training and inference process.

Image by Sasin Tipchai from Pixabay.
