Meet FloydHub: The Heroku of Data Science
This article is a part of the series where we explore cloud-based machine learning services. After covering Azure ML Services, Google Cloud ML Engine, Amazon SageMaker, IBM Watson Studio Cloud and Paperspace Gradient, we will take a closer look at FloydHub, another popular ML as a Service platform.
Ask any data science team what frustrates them the most, and the common answer is environment configuration and management.
The best thing about the data science and machine learning domain is that the tools, frameworks, and libraries are 100 percent open source. While this is great for the community, the flipside is fragmentation. Developers and data scientists often get overwhelmed by the choice of tools. Add dependencies, GPUs and high-end CPUs, and possible conflicts between versions of each tool, and it quickly turns into a configuration nightmare.
Contrary to the general perception, dealing with machine learning and artificial intelligence is not just about choosing and optimizing the most sophisticated algorithm. Interestingly, that takes only 20 percent of the overall effort required for a successful implementation. The remaining 80 percent deals with data engineering, data preparation, infrastructure provisioning, configuration management, environment management, artifact versioning, model deployment, and lifecycle management.
For a successful implementation of a machine learning project, an organization should hire data engineers, DevOps professionals, data scientists, and developers. Their collaboration is critical to the long-term success of the organization.
Not every company can afford to invest in these teams. There may be just a couple of data scientists and traditional developers to tackle the entire lifecycle of machine learning models. The new breed of ML PaaS offerings addresses precisely this gap by exposing pre-configured and customizable environments, automated model training, and scalable model hosting. They let the team focus on the core business problem instead of getting lost in the myriad choices of tools and frameworks.
FloydHub, a young startup from the Bay Area, has set out to solve the problems and challenges faced by data scientists. The founders call the platform the Heroku for data science, which is certainly an apt description.
The platform is an abstraction over Docker containers running on Amazon EC2 instances, and it exposes a simple API to perform most of the tasks involved in developing machine learning models. Similar to Heroku, where developers upload the code and leave the rest to the PaaS, FloydHub expects users to upload datasets, code for training the model, and code for exposing the trained model for inference.
Users can quickly initiate a training job with a pre-defined configuration setting or choose to provide a YAML file with custom configuration. FloydHub will take over the responsibility of creating the custom environment and running the code within it.
Let’s look at the core components of FloydHub:
A FloydHub Project acts as the boundary for all the assets that belong to a specific project. It holds the code, versioned experiments, output files, logs, and the complete history of each job. It may also contain Jupyter Notebooks created in Workspaces.
The very first step in using FloydHub is creating a project.
FloydHub provides strong isolation between data and code. Since datasets are reused across multiple projects, it makes sense to keep them in a separate but centrally accessible location. Any project can access the datasets uploaded to the same account. There are also some public datasets, such as MNIST and VOC, for common ML experiments.
The best thing about datasets in FloydHub is that they are versioned. This is very helpful during the data preparation and data engineering phases, where the original dataset goes through multiple transformations. Developers can easily access historical datasets by referring to the version.
The FloydHub CLI can be used to upload datasets from local workstations. I couldn’t find tools for bulk upload or the ability to directly import datasets from publicly accessible locations such as S3 buckets.
Once a project is created and the dataset is uploaded, the obvious next step is to kick off the training job. FloydHub has a simple and intuitive workflow to initiate a training job.
The Job is where the rubber meets the road. Developers are expected to write and test the Python code on their local machine before creating a Job. When it’s time to run the training job at scale, they simply choose a predefined environment such as TensorFlow, Caffe, or PyTorch, along with a CPU-based or GPU-based instance type. They can also point to the location of a dataset that is already uploaded.
FloydHub uploads the code, injects it into a pre-configured container image, and launches the container in the target environment. GPU-based jobs are packaged as nvidia-docker containers that can take advantage of NVIDIA K80 or P100 GPUs. The Job can be monitored through the streaming logs emitted by the local CLI or from the web interface. As soon as the Job is done, FloydHub automatically terminates the container and stops the Job. The files generated by the Job can be downloaded from the web portal.
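To make the launch step concrete, here is a rough sketch in Python of how such a command line might be assembled. It assumes the CLI's floyd run subcommand and its --gpu, --env, --data and --mode switches as I understand them. The helper name build_floyd_run, the environment name, the dataset path, and the train.py script are placeholders of mine, not values taken from FloydHub.

```python
import shlex


def build_floyd_run(command=None, env=None, gpu=False, data_mounts=None, mode=None):
    """Assemble the argument list for a `floyd run` invocation.

    Hypothetical helper: it only builds the list and never executes it.
    """
    argv = ["floyd", "run"]
    if gpu:
        argv.append("--gpu")  # request a GPU instance instead of the CPU default
    if env:
        argv += ["--env", env]  # predefined framework environment
    for source, mount in (data_mounts or {}).items():
        # Assumed syntax: <dataset>:<directory name visible to the job>
        argv += ["--data", f"{source}:{mount}"]
    if mode:
        argv += ["--mode", mode]  # e.g. "serve" for inference hosting
    if command:
        argv.append(command)  # the command to run inside the container
    return argv


# Placeholder values for illustration only.
training_job = build_floyd_run(
    "python train.py",
    env="tensorflow-1.5",
    gpu=True,
    data_mounts={"alice/datasets/mnist/1": "mnist"},
)
print(shlex.join(training_job))
```

The list can be handed to subprocess.run once the project has been initialized locally. Building the arguments as a list rather than a single string avoids quoting problems when the training command itself contains spaces.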
If a Job needs custom packages and dependencies, they can be added to a file called floyd_requirements.txt that follows the same format as pip’s requirements.txt. This enables users to precisely define the packages and versions needed by the Job.
A Job can run in one of two modes: training or serving. The default mode is training, but a Job can also run in serving mode to host a trained model for inference. FloydHub expects you to include a file called app.py that contains the boilerplate code to deserialize the model and expose it as a REST endpoint. When a Job is launched with the switch --mode serve, it runs continuously until it is manually terminated.
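As an illustration of what such boilerplate might look like, here is a minimal sketch of an app.py built only on the Python standard library. It is not FloydHub's reference implementation, and real examples may well use a web framework instead. The model path, the port, the JSON shape with a "features" key, and the assumption that the model is a pickled object with a predict method are all mine.

```python
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_PATH = "model.pkl"  # hypothetical location of the serialized model
PORT = 5000               # assumed port; check what the platform expects


def load_model(path):
    """Deserialize a pickled model object (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def handle_prediction(model, body):
    """Turn a JSON request body into a (status_code, response_dict) pair."""
    try:
        features = json.loads(body)["features"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'features' key"}
    prediction = model.predict(features)
    if hasattr(prediction, "tolist"):  # e.g. NumPy arrays are not JSON serializable
        prediction = prediction.tolist()
    return 200, {"prediction": list(prediction)}


def make_handler(model):
    """Create a request handler class bound to an already loaded model."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, result = handle_prediction(model, self.rfile.read(length))
            data = json.dumps(result).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return Handler


def serve():
    """Load the model once, then answer POST requests until terminated."""
    model = load_model(MODEL_PATH)
    HTTPServer(("0.0.0.0", PORT), make_handler(model)).serve_forever()

# A real app.py would call serve() at the bottom of the file.
```

Keeping the request logic in handle_prediction, separate from the HTTP plumbing, makes it possible to test the serving code locally before launching the Job.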
The screenshot below shows how simple it is to launch a Job in serving mode and access it for inference. The only gap that I see is that the REST endpoint is not secured. FloydHub should include at least an API key that can be used with HTTP basic authentication.
I really liked the simplicity of the FloydHub CLI for running a Job. It has just enough switches to define what you want the Job to run.
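Until something like that exists on the platform, the check could live in the serving code itself. The sketch below shows one possible way to validate an API key sent through HTTP basic authentication. It is my own illustration, not a FloydHub feature: the function name is_authorized and the convention of sending the key as the username with an empty password are assumptions.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_key):
    """Return True if a Basic auth header carries the expected API key.

    Assumed convention: the API key is the username and the password is ignored.
    """
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    username, _, _password = decoded.partition(":")
    # Constant-time comparison avoids leaking the key through timing differences.
    return hmac.compare_digest(username.encode("utf-8"), expected_key.encode("utf-8"))
```

A request handler would read the Authorization header, call is_authorized with a key loaded from configuration, and reply with status 401 when the check fails. Basic authentication only encodes the credentials, so this is meaningful only when the endpoint is reached over HTTPS.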
With Jupyter Notebooks becoming the gold standard IDE for data science, FloydHub added support for them through Workspaces. Apart from running a Notebook, a Workspace is almost like a virtual machine where users can access the shell.
Existing dataset locations can be mounted as directories that become visible to the Notebook. Files from the local machine can be uploaded directly to a Workspace and accessed from a Jupyter Notebook.
In my evaluation, I found that Workspaces are not thoroughly integrated with Jobs. For example, Jobs and Workspaces don’t share common storage, which makes it difficult to move and reuse model artifacts such as checkpoint files.
Ideally, Workspaces should become an alternate input to the Job. Users should either initiate a Job from the local machine through the CLI or click a button in the Workspace UI. Irrespective of the location, Jobs should be treated the same. This integration will bring a smoother workflow along with a consistent experience.
FloydHub scores high for its minimalistic and simple approach to ML experimentation and model management. It’s a powerful PaaS for running simple, classical ML models based on Scikit-learn or complex deep learning models based on TensorFlow or Caffe.