How Apache Airflow Better Manages Machine Learning Pipelines
VANCOUVER — Apache Airflow, the open source project widely used for building machine learning pipelines, is getting easier to work with, as illustrated in a discussion on The New Stack Makers with three technologists from Amazon Web Services.
Apache Airflow is a Python-based platform to programmatically author, schedule and monitor workflows. It is well-suited to machine learning for building pipelines, managing data, training models, and deploying them.
Airflow is generic enough to cover the whole machine learning pipeline: it fetches data and performs extraction, transformation and loading (ETL); tags the data; runs the training; then deploys the model, tests it and sends it to production.
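That end-to-end flow can be sketched as a minimal Airflow DAG. This is a hypothetical skeleton, not a workload described by the guests: the DAG id, task names and empty callables are placeholders, and it assumes Airflow 2.x is installed (older 2.x releases use `schedule_interval` instead of `schedule`).

```python
# Hypothetical ML pipeline skeleton; task ids and callables are
# illustrative placeholders, not a real AWS workload.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Placeholder: pull raw data and run the ETL step."""


def train():
    """Placeholder: train the model on the prepared data."""


def deploy():
    """Placeholder: test the model and push it to production."""


with DAG(
    dag_id="ml_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually for this sketch
    catchup=False,
):
    # Each PythonOperator is one building block; >> chains them in order.
    (
        PythonOperator(task_id="extract", python_callable=extract)
        >> PythonOperator(task_id="train", python_callable=train)
        >> PythonOperator(task_id="deploy", python_callable=deploy)
    )
```

A file like this, dropped into an Airflow environment's DAGs folder, is all the scheduler needs to pick up and run the pipeline.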
In an On the Road episode of Makers recorded at the Linux Foundation’s Open Source Summit North America, our guests, who all work with the AWS Managed Service for Airflow team, reflected on the work on Apache Airflow to improve the overall experience:
Dennis Ferruzzi, a software developer at AWS, is an Airflow contributor working on AIP-49 (Airflow Improvement Proposal), which will update Airflow’s logging and metrics backend to the OpenTelemetry standard. The change will allow for more granular metrics and better visibility into Airflow environments.
Niko Oliveira, a senior software development engineer at AWS, is a committer/maintainer for Apache Airflow. He spends much of his time reviewing, approving and merging pull requests. A recent project included writing and implementing AIP-51 (Airflow Improvement Proposal), which modifies and updates the Executor interface in Airflow. It gives Airflow a more pluggable architecture, making it easier for users to build and write their own Airflow Executors.
Raphaël Vandon, a senior software engineer at AWS, is an Apache Airflow contributor working on performance improvements for Airflow and leveraging async capabilities in AWS Operators, the part of Airflow that allows for seamless interactions with AWS.
“The beautiful thing about Airflow that has made it so popular is that it’s so easy,” Oliveira said. “For one, it’s Python. Python is easy to learn and pick up. And two, we have this operator ecosystem. So companies like AWS, Google and Databricks are all contributing these operators, which really wrap their underlying SDK.”
‘That Blueprint Exists for Everyone’
Operators are like generic building blocks. Each operator does one specific task, Ferruzzi said.
“You just chain them together in different ways,” he said. “So, for example, there’s an operator to write data to [Amazon Simple Storage Service]. And then there’s an operator that will send the data to an SQL server or something like that. And basically, the community develops and contributes to these operators so that the users, in the end, are basically saying the task I want to do is pull data from here. So I’m going to use that operator, and then I want to send the data somewhere else.
“So I’m going to go and look at, say, the Google Cloud operators and find one that fits what I want to do there. It’s cross-cloud. You can interact with so many different services and cloud providers. And it’s just growing. We’re at 2,500 contributors now, I believe. And it’s just like people find a need, and they contribute it back. And now that block, that blueprint exists for everyone.”
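The cross-cloud pattern Ferruzzi describes might look like the following sketch. It assumes the Amazon provider package (`apache-airflow-providers-amazon`) is installed; the bucket names, keys and data are made up, and the `GCSToS3Operator` parameter is named `bucket` in older provider releases and `gcs_bucket` in newer ones.

```python
# Hypothetical use of community-contributed provider operators:
# copy objects out of a Google Cloud Storage bucket into Amazon S3,
# then write an object to S3 directly. All names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator

with DAG(
    dag_id="cross_cloud_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Pull data from a (hypothetical) GCS bucket into S3.
    copy_from_gcs = GCSToS3Operator(
        task_id="copy_from_gcs",
        bucket="example-gcs-bucket",  # `gcs_bucket` in newer providers
        dest_s3_key="s3://example-bucket/from-gcs/",
    )

    # Write a small object straight to S3.
    write_to_s3 = S3CreateObjectOperator(
        task_id="write_to_s3",
        s3_bucket="example-bucket",
        s3_key="raw/data.csv",
        data="col_a,col_b\n1,2\n",
    )

    copy_from_gcs >> write_to_s3
```

The user never touches the boto3 or Google Cloud SDK calls underneath; the operators wrap them, which is the "blueprint" Ferruzzi is pointing at.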
Airflow 2.6 ships notifiers as an alpha feature, Vandon said. Sensors are operators that wait for something to happen. Notifiers get placed at the end of the workflow and act depending on the success (or not) of the workflow.
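Both pieces can be sketched together. This is a hypothetical example, not one from the discussion: it assumes Airflow 2.6+ (for `BaseNotifier`), and the notifier class, file path and DAG id are invented for illustration.

```python
# Hypothetical sensor + notifier sketch; requires Airflow 2.6+.
from datetime import datetime

from airflow import DAG
from airflow.notifications.basenotifier import BaseNotifier
from airflow.sensors.filesystem import FileSensor


class PrintNotifier(BaseNotifier):
    """Hypothetical notifier: runs when the DAG finishes, success or failure."""

    def notify(self, context):
        # Real notifiers would post to chat, page on-call, etc.
        print(f"DAG {context['dag'].dag_id} finished")


with DAG(
    dag_id="sensor_notifier_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    on_success_callback=PrintNotifier(),
    on_failure_callback=PrintNotifier(),
):
    # A sensor is an operator that waits: here, for a file to appear.
    FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv")
```

The same notifier instance is attached to both the success and failure callbacks, so it fires either way and can branch on the outcome via the context.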
As Vandon said, “It’s just making things simpler for users.”