Airflow, a Workflow Orchestrator for Big Data
The Apache Software Foundation’s latest top-level project, Airflow, a workflow automation and scheduling system for Big Data processing pipelines, is already in use at more than 200 organizations, including Adobe, Airbnb, PayPal, Square, Twitter and United Airlines.
“Apache Airflow has quickly become the de facto standard for workflow orchestration,” said Bolke de Bruin, vice president of Apache Airflow. “Airflow has gained adoption among developers and data scientists alike thanks to its focus on configuration as code.”
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative, according to the project’s GitHub page. Airflow provides smart scheduling, database and dependency management, error handling and logging. It touts command-line utilities for performing complex surgeries on DAGs and a user interface that provides visibility into pipelines running in production, making it easy to monitor progress and troubleshoot issues.
Maxime Beauchemin created Airflow in 2014 at Airbnb. It entered the ASF incubator in March 2016. It’s designed to be dynamic, extensible, lean and explicit, and scalable for processing pipelines of hundreds of petabytes.
With Airflow, users can create workflows as directed acyclic graphs (DAGs) to automate scripts to perform tasks. Though based in Python, it can execute programs in other languages as well. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies.
Operators define the individual tasks to be performed in a DAG; beyond the built-in ones, custom operators can be created.
The three main types of operators are:
- Those that perform an action, or tell another system to perform an action
- Transfer operators that move data from one system to another
- Sensors that keep running until a certain criterion is met, such as a specific file landing in HDFS, a partition appearing in Hive, or a specific time of the day.
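The sensor idea in the last item can be illustrated with a minimal polling loop. This is a sketch in plain Python, not Airflow's sensor implementation: the function `wait_for`, its parameters `poke_interval` and `timeout`, and the stand-in check `file_has_landed` are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_for(criterion: Callable[[], bool],
             poke_interval: float = 1.0,
             timeout: float = 10.0) -> bool:
    """Poll `criterion` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if criterion():
            return True   # criterion met: downstream work may proceed
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        time.sleep(poke_interval)


# Hypothetical stand-in for "a specific file has landed":
# it reports success on the third check.
checks = {"count": 0}


def file_has_landed() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


succeeded = wait_for(file_has_landed, poke_interval=0.01, timeout=1.0)
print(succeeded, checks["count"])
```

In a real deployment the criterion would query an external system such as HDFS or Hive rather than a counter.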
In an introduction to the technology, Matt Davis, a senior software engineer at Clover Health, explains that it enables multisystem workflows to be executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others.
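To make the dependency idea concrete, here is a small sketch of how declared dependencies determine what can run in parallel. It uses Python's standard-library `graphlib` (Python 3.9 or later) rather than Airflow itself, and the task names and the helper `parallel_batches` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_logs": set(),                    # e.g. a bash step
    "extract_users": set(),                   # e.g. a SQL step
    "clean_logs": {"extract_logs"},           # e.g. a Python step
    "join": {"clean_logs", "extract_users"},
    "report": {"join"},
}


def parallel_batches(deps):
    """Group tasks into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unlocking successors
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Here the two extract tasks land in the same batch because neither depends on the other, while `join` must wait for both of its upstream tasks. This batch-by-batch grouping is a simplification for illustration; a production scheduler would not necessarily wait for a whole batch to finish before starting newly unblocked tasks.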
Its ability to work in languages other than Python makes it easy to integrate with other systems including AWS S3, Docker, Apache Hadoop HDFS, Apache Hive, Kubernetes, MySQL, Postgres, Apache Zeppelin, and more.
“Airflow has been a part of all our Data pipelines created in past two years acting as the ringmaster and taming our Machine Learning and ETL Pipelines. It has helped us create a single view for our client’s entire data ecosystem. Airflow’s Data-aware scheduling and error-handling helped automate entire report-generation processes reliably without any human intervention,” said Kaxil Naik, data engineer at Data Reply, who pointed out that its configuration-as-code paradigm makes it easy for non-technical people to use without a steep learning curve.
However, Airflow is not a data-streaming solution such as Spark Streaming or Storm, the documentation notes. It is more comparable to Oozie, Azkaban, Pinball, or Luigi.
Workflows are expected to be mostly static or slowly changing. They should look similar from one run to the next, only slightly more dynamic than a database structure.
It comes out of the box with an SQLite database that helps users get up and running quickly and take a tour of the UI and command-line utilities.
“At Qubole, not only are we a provider, but also a big consumer of Airflow as well,” said engineering manager Sumit Maheshwari. Qubole offers Airflow as a managed service.
The company’s “Insight and Recommendations” platform is built around Airflow. It processes billions of events each month from hundreds of enterprises and generates insights on Big Data systems such as Apache Hadoop, Apache Spark, and Presto.
“We are very impressed by the simplicity of Airflow and ease at which it can be integrated with other solutions like clouds, monitoring systems or various data sources.”
Cincinnati-based Astronomer built its platform on top of Airflow. In addition, Google launched Cloud Composer, a managed Airflow service, in beta last May. And Amazon has integrated its managed machine-learning-workflow service SageMaker with Airflow.