Elyra, the artificial intelligence (AI) toolkit first released by IBM in early 2020, helps data scientists with the often difficult process of building AI pipelines. As they wrote in the tool’s introductory post, “Building an AI pipeline for a model is hard. Breaking down and modularizing a pipeline is harder.” A data pipeline can include a number of steps, some relying on others, and creating this pipeline can lie outside the core skills needed for data science. Elyra solves this by offering a visual interface that turns creating and altering data pipelines into a familiar experience.
Patrick Titzler, a developer advocate at the Center for Open-Source Data and AI Technologies at IBM, explained that Elyra lets users assemble basic building blocks — Jupyter notebooks, Python scripts, and R scripts — into a pipeline that lets them perform tasks in sequence, in parallel, or otherwise.
“If you go through a machine learning workflow, you might have to load the data, analyze the data, cleanse it, then build the model, train the model, tune the model. And then you might have to go back when the results don’t really meet your expectations,” said Titzler. “With a pipeline editor, you can create those pipelines using simple drag and drop and then configure the nodes in that pipeline. So it speeds up your development because you don’t have to write any custom code to run all of those components or nodes in the pipeline. Plus, it enables people to actually do these things without necessarily having a deep domain expertise.”
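The generic building blocks Titzler describes are ordinary notebooks and scripts, so a pipeline node needs no Elyra-specific code. As a rough illustration, a data-cleansing node might be nothing more than a plain Python script like the following sketch (the file names and cleansing logic are hypothetical, not taken from the article):

```python
# Hypothetical data-cleansing step for a generic Elyra pipeline node.
# Elyra runs a generic node as a plain Python script, so the node only
# needs to read its input files and write its outputs; no Elyra-specific
# imports are required.
import csv


def cleanse(rows):
    """Drop rows in which any field is empty or whitespace-only."""
    return [row for row in rows if all(field.strip() for field in row)]


def run(in_path="raw_data.csv", out_path="clean_data.csv"):
    # File names are placeholders; in a real pipeline they would be the
    # input and output files declared in the node's configuration.
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(cleanse(rows))
```

In the Visual Pipeline Editor, a script like this would be dragged onto the canvas as a node and wired to the steps that precede and follow it, with no orchestration code written by hand.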
This has arguably been the project’s most important feature, but the building blocks were limited to those three types. With the recently released Elyra 3.3, however, users can create pipelines using custom components, a feature that Titzler wrote was “a major milestone on our roadmap.”
Previously, Elyra users could string together their own Jupyter notebooks or scripts, but they didn’t have access to external components, such as those available in Kubeflow Pipelines or Apache Airflow, the two platforms for running pipelines currently supported by Elyra. For example, the Kubeflow Pipelines components now available in Elyra include tasks like creating a dataset volume or counting rows.
Another example of a component set that these changes bring to Elyra is the Machine Learning Exchange, an open source catalog and execution engine for Data and AI assets built on Kubeflow Pipelines. Titzler also points to the Component Library for AI, Machine Learning, ETL, and Data Science (CLAIMED), a set of Jupyter notebooks that implement tasks such as data loading, data transformation, and model training. As of this latest release, CLAIMED can be used by simply cloning its repository; the pipelines it contains can then be opened in the pipeline editor for immediate use.
Titzler cautions that custom components differ from generic components in Elyra in a few ways. First, they are runtime specific and often use runtime-specific mechanisms to exchange data with other components, instead of the S3-compatible storage used by generic components. They also need to be managed separately, and they are black boxes: while the Visual Pipeline Editor can expose their inputs and outputs, it does not necessarily have access to the functionality itself.
In terms of pipeline orchestration, Elyra currently supports only local execution, Kubeflow Pipelines, and Apache Airflow, but Titzler says the community at large is in the process of adding others. He has heard of interest in both Ray and Argo, though any movement in those directions currently depends on the efforts of the community.
Looking forward, Titzler says that the project has “a big list of wishes that have come from various sources,” but that improving the visual editor and increasing usability are among the current aspirations.
IBM is a sponsor of The New Stack.