Today on The New Stack Context we talk with Garima Kapoor, COO and co-founder of MinIO, about using Spark at scale for Artificial Intelligence and Machine Learning (AI/ML) workloads on Kubernetes.
The Apache Hadoop ecosystem hasn’t had much overlap with Kubernetes in the past, but as we learned at KubeCon in Seattle last November, that is quickly changing. As Iguazio’s Yaron Haviv wrote in a contributed article on TNS titled “Will Kubernetes Sink the Hadoop Ship?”:
“Early adopters are realizing that they can run their big data stack (Spark, Presto, Kafka, etc.) on Kubernetes in a much simpler manner. Furthermore, they can run all of the cool post-Hadoop AI and data science tools like Jupyter, TensorFlow, PyTorch or custom Docker containers on the same cluster.”
Fast forward to now: the Spark + AI Summit, which Databricks is putting on in San Francisco next week, is fast approaching, and we are curious. How is Spark being used in cloud native architectures these days, with the likes of MinIO (the open source, container-native object store), to, say, create machine learning data pipelines on Kubernetes? What is driving this trend toward high-performance object stores? Kapoor breaks down the trends in the first half of the show.
Later in the show, Joab Jackson, The New Stack’s managing editor, gives us the highlights from the O’Reilly AI Conference in New York this week. Now that machine learning has firmly entered the corporate world, we need to find ways of making it robust, reliable and secure, a number of speakers at the event argued.
At the conference, it was Massachusetts Institute of Technology faculty member Aleksander Madry who first called for AI 2.0 (though the term is probably inevitable in this industry, we suppose). Today’s AI is not nearly robust enough, not secure enough, and still far too unpredictable. The next generation of the technology must be “much more aligned with what we humans see as significant,” he said during his keynote.
And indeed, many of the talks, presentations and sponsor booths were centered around the idea of making AI more mature. In one presentation, Microsoft data scientists Fidan Boylu Uz and Mathew Salvaris demonstrated three ways to do Kubernetes-based deep learning in a production setting. One approach uses kubectl as a launching point; it offers the most flexibility for those who know how to manage K8s. Another is to use Kubeflow, a Google project that packages the whole AI pipeline. This approach would be best for a research scientist who just wants to use their favorite libraries, such as TensorFlow and PyTorch. The last, of course, was the Microsoft AzureML service, which was the easiest to deploy, as it does a lot of the configuration and build work itself, though, unlike the other approaches, it limits you to Microsoft Azure as your cloud.
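To give a rough feel for what the kubectl route involves, here is a minimal sketch of our own, not code from the Microsoft presentation. It builds a standard Kubernetes `batch/v1` Job manifest for a training run and prints it as JSON. The function name `training_job_manifest`, the job name, the image URL and the `train.py` command are all hypothetical placeholders, and the GPU resource name `nvidia.com/gpu` assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def training_job_manifest(name, image, command, gpus=0):
    """Build a Kubernetes batch/v1 Job manifest as a plain dict.

    All argument values are supplied by the caller; nothing here is
    taken from the presentation described in the article.
    """
    container = {
        "name": name,
        "image": image,
        "command": list(command),
    }
    if gpus > 0:
        # Assumption: the cluster exposes GPUs under the resource name
        # used by the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry a failed training run automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = training_job_manifest(
        name="train-demo",                    # hypothetical job name
        image="example.com/ml/train:latest",  # hypothetical image
        command=["python", "train.py"],       # hypothetical entry point
        gpus=1,
    )
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output could be piped straight into `kubectl apply -f -`. The flexibility the speakers mentioned comes from the fact that every field in the manifest is yours to set, which is also why this route suits people already comfortable managing K8s.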
Still, social media companies are getting immense criticism for how their AI algorithms tend to surface more extreme, and outright toxic, content. There still needs to be a reconciliation between what the machines suggest as the best answers and what we humans consider acceptable. One factor could be a lack of diversity in the workforce. The less inclusive and more homogeneous the team building the AI, the more likely the AI will contain unintentional biases, Dataiku’s Kurt Muehmel pointed out in his own talk on AI ethics.