Amazon SageMaker Studio Lab from the Eyes of an MLOps Engineer

At the Amazon Web Services' re:Invent 2021 conference, an announcement from Swami Sivasubramanian, Vice President for Amazon AI, caught my attention: the launch of Amazon SageMaker Studio Lab.
Since the initial launch of the Amazon Machine Learning service at re:Invent 2015, AWS has constantly improved its managed ML platform and tools. The launch of the SageMaker platform, followed by the addition of services such as Autopilot and JumpStart, made AWS the preferred cloud for building and deploying machine learning and deep learning models. The APIs have matured, and so have the infrastructure and platform services.
Amazon SageMaker Studio Lab is unique in many aspects. Firstly, it is a standalone service with no dependency on an AWS account; anyone with an email address can sign up. Secondly, it's completely free. Amazon has opened up an IDE and environment for building machine learning models with no strings attached. This may be the first AWS service that lives outside the IAM realm and offers an unlimited number of free tier hours.
Except for the branding, the service has almost nothing to do with SageMaker. The environment is based on the popular and familiar JupyterLab notebooks, and JupyterLab is the only commonality between Studio Lab and the SageMaker Studio available from the AWS Console.
During my exploration, I came across a splash screen that hinted at Amazon's attempt at branding it as SageMaker Lite, which honestly is not a bad idea.
The service is available for anyone to sign up for, but you will be put on a waitlist before gaining access. Once approved, you can log in and start training models. I applied immediately after the announcement, and my request was approved within hours.
Compared to Google Colab, SageMaker Studio Lab feels much more like home. That's largely due to the native JupyterLab notebooks and, more importantly, the dedicated storage for your datasets, notebooks, and models.
Having spent a long time provisioning environments for training, experimentation, and deployment of ML models, I was curious to see what goes on behind the scenes of SageMaker Studio Lab.
Under the Hood of SageMaker Studio Lab
Simply put, SageMaker Studio Lab is a JupyterLab application running in a pre-provisioned Amazon EC2 instance. Imagine you have an EC2 instance that's stopped but starts each time you want to experiment with your machine learning project. The only difference is that SageMaker Studio Lab forcibly stops the session after 12 hours on a CPU runtime or after 4 hours on a GPU runtime.
At the time of writing, the service provisions all the resources within the Ohio (us-east-2) region.
When you start a CPU-based session, JupyterLab runs within a T3.xlarge EC2 instance. According to the official EC2 specifications, this instance type comes with four vCPUs and 16GB RAM.
After logging into the environment, I checked the number of CPUs and the available memory, and it’s indeed a T3.xlarge machine.
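You can run the same check yourself from the JupyterLab terminal with a couple of standard Linux commands; the exact output varies from session to session, but the CPU count and memory should line up with the T3.xlarge specifications.

# Count the vCPUs visible to the session
nproc

# Show total and available memory in human-readable form
free -h

# CPU model details, for the curious
lscpu | grep "Model name"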
In the case of a GPU session, you get a G4dn.xlarge instance powered by an NVIDIA T4 Tensor Core GPU, which comes with 2,560 CUDA cores, 320 Tensor Cores, and 16GB of GPU memory.
Running the nvidia-smi command confirms the presence of a T4 GPU.
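Beyond the default summary, nvidia-smi supports query flags to print just the fields you care about; these are standard nvidia-smi options, nothing specific to Studio Lab.

# Full summary of the GPU, driver version, and running processes
nvidia-smi

# Just the GPU name and total memory; this should report a Tesla T4 with roughly 16GB
nvidia-smi --query-gpu=name,memory.total --format=csv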
Interestingly, NVIDIA positions the T4 GPU primarily for inference rather than training. But I'm not complaining when it comes for free.
The GPU and CPU runtimes get 15GB of persistent storage, good enough to store large datasets, notebooks, and models.
Does this mean every user of SageMaker Studio Lab gets a dedicated T3.xlarge/G4dn.xlarge? No chance. AWS is obviously putting Docker to good use to pack multiple containers into one EC2 instance. Essentially, each session is mapped to a container.
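If you want to sanity-check the container theory from inside a session, the cgroup entries of PID 1 are usually revealing; on a containerized host they tend to carry Docker or ECS identifiers, though the exact paths depend on how AWS has wired things up, so treat this as a rough probe rather than proof.

# The cgroup paths of PID 1 often expose Docker/ECS identifiers inside a container
cat /proc/1/cgroup

# The presence of /.dockerenv is another common, though not guaranteed, hint
[ -f /.dockerenv ] && echo "looks like a Docker container"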
In terms of the operating system, the host is an Amazon Linux 2 AMI running the JupyterLab container image.
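Because containers share the host's kernel, comparing the kernel release with what the environment reports about itself is a quick way to tell the two layers apart; on Amazon Linux 2 hosts the kernel string typically carries an amzn2 suffix.

# Kernel release comes from the underlying host AMI
uname -r

# OS details of the environment the session itself runs in
cat /etc/os-release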
To ensure that the JupyterLab process is running all the time, it's wrapped inside supervisord.
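You can see the process hierarchy for yourself with ps; supervisord should appear as the parent of the Jupyter processes. supervisorctl may not be accessible to a non-root user, so plain ps is the safer bet.

# List the supervisor and Jupyter processes along with their parent PIDs
ps -eo pid,ppid,user,cmd | grep -E "supervisord|jupyter" | grep -v grep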
When you push the Start Runtime button, SageMaker Studio Lab picks an EC2 instance from a warm pool, schedules the container, and attaches the storage to it. I am not sure if AWS is leveraging Spot instances for the service.
Talking about the storage, the environment is backed by NVMe disks. The device /dev/nvme2n1 is mapped to the home directory, /home/studio-lab-user.
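The device-to-directory mapping is easy to confirm with df and lsblk; the device name (nvme2n1) is simply what I happened to see and may differ between sessions.

# Show which device backs the home directory and how much of the 15GB is used
df -h /home/studio-lab-user

# List the block devices attached to the instance
lsblk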
I have a feeling that the service runs on top of Fargate, with Amazon ECS orchestrating the lifecycle of the containers. It won't be surprising if this use case is presented at a future re:Invent conference as a case study for Fargate Spot.
The shell runs as studio-lab-user with no root access, so you obviously cannot use sudo. Package managers such as yum are disabled. The only tools you find are pip, conda, and the AWS CLI.
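In practice, conda and pip cover the day-to-day work. A typical flow, sketched below with an arbitrary environment name and package list, is to create a dedicated conda environment and register it as a Jupyter kernel:

# Create an isolated environment (the name and Python version are just examples)
conda create -y -n my-experiment python=3.9

# Activate it and install what the project needs
conda activate my-experiment
pip install ipykernel pandas scikit-learn

# Register the environment as a selectable kernel in JupyterLab
python -m ipykernel install --user --name my-experiment --display-name "my-experiment"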
Of course, these details don't matter much to a data scientist or an ML engineer using SageMaker Studio Lab for training and experimentation. With built-in support for Git and the availability of the AWS CLI, it's easy to connect the platform with external environments, including Amazon SageMaker Studio.
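For example, pulling a project from GitHub and copying a dataset from Amazon S3 takes only a few commands; the repository URL and bucket name below are placeholders, and the S3 copy assumes you have configured credentials for your own AWS account.

# Clone an existing project into the persistent home directory (placeholder URL)
git clone https://github.com/your-user/your-ml-project.git

# Provide credentials for your own AWS account, then pull a dataset from S3 (placeholder bucket)
aws configure
mkdir -p data
aws s3 cp s3://your-bucket/datasets/train.csv data/train.csv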
Having built such environments from the ground up, I know the pain involved in configuring a multi-tenant, GPU-based ML testbed. Thank you, AWS, for giving us a data science environment that's usable, complete, powerful, free, and, most importantly, accessible to everyone.
I trained a Convolutional Neural Network (CNN) for image classification and then deployed the model in SageMaker, exposing it through the recently announced Serverless Inference endpoint. In the next part of this series, I will walk you through all the steps involved. Stay tuned.