How We Built an MLOps Platform and the Lessons We Learned
Sometimes the best solutions are the ones that come together unexpectedly. That has always been true in the open source community, where some of the most innovative technologies emerge from simple projects that were originally intended to do one thing but ended up doing so much more.
Open Data Hub is an example of a project that started off as a simple data storage solution but has since evolved into something far more ambitious — a commercialized offering called Red Hat OpenShift Data Science.
When Red Hat first created Open Data Hub, we were using it internally to store large amounts of data so engineers could understand data from container image build logs. We then opened the project up to data scientists to create models from the data, detect anomalies in builds and identify the root causes of build failures. Soon, though, we began to understand that the potential for Open Data Hub was far bigger than we had thought.
This is the story of how Open Data Hub went from being an internal storage project to a commercial MLOps platform, and the lessons we learned along the way.
In the Beginning, There Was Data
We created Open Data Hub in early 2018 as a reference architecture for data scientists to access data, computational resources and modeling frameworks. We started with Ceph for object storage, Kafka for streaming data, Spark running on Kubernetes and Jupyter Notebooks to provide machine learning on top of the data platform for analyzing internal engineering builds.
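The kind of analysis that early stack enabled can be illustrated with a simplified sketch. The names and data below are hypothetical, and the real platform pulled logs from Ceph object storage and processed them with Spark on Kubernetes; this standard-library version only shows the underlying idea of flagging anomalous builds from their logged durations:

```python
from statistics import mean, stdev

# Hypothetical build-log records: (build_id, duration in seconds).
# In the actual platform these came from container image build logs
# stored in Ceph and were analyzed at scale with Spark; here we use
# an in-memory list to keep the sketch self-contained.
builds = [
    ("build-101", 312), ("build-102", 298), ("build-103", 305),
    ("build-104", 991),  # an unusually slow build
    ("build-105", 320), ("build-106", 289),
]

def anomalous_builds(records, threshold=2.0):
    """Flag builds whose duration is more than `threshold` standard
    deviations from the mean duration (a simple z-score test)."""
    durations = [d for _, d in records]
    mu, sigma = mean(durations), stdev(durations)
    return [bid for bid, d in records if abs(d - mu) / sigma > threshold]

print(anomalous_builds(builds))  # flags "build-104"
```

A data scientist would typically run this sort of exploration interactively in a Jupyter notebook, swapping the z-score test for a proper model once the pattern of interest was understood.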
The project worked great — so great that when some of our customers learned about it, they began to wonder why we weren’t offering it to them. So, we made the decision to commercialize Open Data Hub.
As more models were developed, operationalizing them became increasingly important. That meant adding new features.
“Jupyter and the rest are nice,” our customers said, “but what we really need is a true MLOps platform to help us bring models into production.”
They needed more tools — not just ones for data scientists, but tools that developers and operators could use too. And they needed those tools to be accessible to everyone.
It was early days, but our customers were already pushing us to get creative and think beyond what we had originally intended. They wanted us to iterate and innovate, to embody what open source is and always will be about.
Betting on Collaboration
Once the decision was made to expand Open Data Hub, the questions started flowing. One of the first ones was, “What type of platform should we use as our foundation?”
We knew we wanted to build the project with technologies that would provide the most flexibility. The concept of a hybrid cloud had been around since about 2011, but it really started to gain traction in enterprises in 2018. We understood that the platform needed to be deployable on and across any cloud environment, not tied to any proprietary cloud.
But we also needed a system that would make it easier for different teams to automate application development, deployment and management. The foundation we were using for Open Data Hub internally was Red Hat OpenShift, our application platform based on Kubernetes. That worked well, but we didn't know if we could successfully leverage the platform as a single source of collaboration between data scientists, operations managers and developers, since it was not initially intended for that purpose.
We made a bet that we could pull it off. But we knew almost immediately that we couldn’t do it ourselves.
Learning We Couldn’t Build It Alone
As we started to build out the platform, we began to realize some of the things we needed but did not yet have. Turns out, it was quite the list.
Most notably, we did not have many curated upstream components that could pull data from SQL databases, model registries and other sources. And since Open Data Hub was originally conceived as a storage solution, it didn't include many of the tools that developers and operations professionals need. It was still too small; we needed to expand it.
We identified the types of tools we needed to meet our goal of creating a collaborative ML platform and began making them available through Open Data Hub. Many of these were open source projects themselves (TensorFlow, PyTorch, KServe, etc.); others were the result of partnerships (Starburst’s commercial products based on Trino, Pachyderm and Intel’s OpenVINO, for example).
The structure of Red Hat OpenShift Data Science was beginning to take shape. We started to fill it in with other features, including the ability to enable custom notebooks and model-serving engines. It started as a managed cloud service, but we quickly learned that many customers wanted a traditional software option too. Thanks to the agnostic nature of OpenShift we were able to pivot and provide them with a hybrid platform that runs on-premises and across all clouds.
Finally, running Red Hat OpenShift Data Science on Red Hat OpenShift enabled customers to add their own tools into the same cluster. They could essentially color outside the lines by bringing in home-grown applications, independent software vendor (ISV)-supported software, MLOps-related tooling and thousands of other open source solutions.
In short, our customers didn’t limit us. We didn’t want to limit them, either.
Not All Open Source Projects Are Created Equal
Not every organization lives and breathes open source. Most companies don’t have the time to worry about whether a certain open source technology is enterprise-ready or has the appropriate security protocols in place.
When we created Red Hat OpenShift Data Science, we made a commitment to unburden customers of those challenges. The tools we included in the platform are Red Hat-tested versions of open source technologies. Customers can still use whatever open source tools they want, but they know that the ones integrated into Red Hat OpenShift Data Science have been well-vetted and curated.
Kubeflow is a great example. Three years ago, we believed Kubeflow wasn't standardized or mature enough to be a good enterprise option. So rather than adopt the whole project, we incorporated individual Kubeflow components into Red Hat OpenShift Data Science and built enterprise-grade model-serving and pipeline features on top of them.
All of this was more work for us than we ever intended when we first conceived of Open Data Hub. But we felt it was our responsibility to be good stewards of both the open source community and our customers.
It’s incredible to think of how Red Hat OpenShift Data Science has grown since that “small” internal project in 2018. Recently, we announced new ways to use the power of AI in the MLOps platform, and we are continuing to explore other innovative uses of AI, including how the technology can be used to translate human text into IT automation playbooks.
True to our roots, we will always follow an open source development model in which future technologies are put into the upstream Open Data Hub before being incorporated into Red Hat OpenShift Data Science. A good example is the infrastructure stack being developed to run foundation models using technologies like Ray and CodeFlare.
Most importantly, we will continue to learn and experiment. Our customers taught us that an open source project can be more than what it was originally intended to be. There’s no reason to think that Red Hat OpenShift Data Science can’t be more than it is today. We’ll build it out, push it forward and discover new capabilities and uses.