Data / Technology

DolphinScheduler Tames Complex Data Workflows

20 Apr 2021 8:53am, by

Like many IT projects, a new Apache Software Foundation top-level project, DolphinScheduler, grew out of frustration.

“We found it is very hard for data scientists and data developers to create a data-workflow job by using code. We had more than 30,000 jobs running in the multi data center in one night, and one master architect. Often something went wrong due to network jitter or server workload, [and] we had to wake up at night to solve the problem,” wrote Lidong Dai and William Guo of the Apache DolphinScheduler Project Management Committee, in an email. “We tried many data workflow projects, but none of them could solve our problem.”

It was designed to be:

  • Cloud native — support multicloud/data center workflow management, Kubernetes and Docker deployment and custom task types, distributed scheduling, with overall scheduling capability increased linearly with the scale of the cluster.
  • Highly reliable — with decentralized multimaster and multiworker, high availability, supported by itself and overload processing.
  • User friendly — all process definition operations are visualized, with key information defined at a glance, one-click deployment.
  • Supporting rich scenarios — including streaming, pause, recover operation, multitenant, and additional task types such as Spark, Hive, MapReduce, shell, Python, Flink, sub-process and more.

“We have a slogan for Apache DolphinScheduler: ‘More efficient for data workflow development in daylight, and less effort for maintenance at night.’ When we will put the project online, it really improved the ETL and data scientist’s team efficiency, and we can sleep tight at night,” they wrote.

Largely based in China, DolphinScheduler is used by Budweiser, China Unicom, IDG Capital, IBM China, Lenovo, Nokia China and others.

Non-Central Design

The project started at Analysys Mason in December 2017. It entered the Apache Incubator in August 2019.

It enables users to associate tasks according to their dependencies in a directed acyclic graph (DAG) to visualize the running state of the task in real-time. They can set the priority of tasks, including task failover and task timeout alarm or failure.

It employs a master/worker approach with a distributed, non-central design. Both use Apache ZooKeeper for cluster management, fault tolerance, event monitoring and distributed locking.

Users can just drag and drop to create a complex data workflow by using the DAG user interface to set trigger conditions and scheduler time. They also can preset several solutions for error code, and DolphinScheduler will automatically run it if some error occurs. There’s also a sub-workflow to support complex workflow.

Multimaster architects can support multicloud or multi data centers but also capability increased linearly. This design increases concurrency dramatically. It supports multitenancy and multiple data sources. In users’ performance tests, DolphinScheduler can support the triggering of 100,000 jobs, they wrote.

It enables many-to-one or one-to-one mapping relationships through tenants and Hadoop users to support scheduling large data jobs. Supporting distributed scheduling, the overall scheduling capability will increase linearly with the scale of the cluster.

The task queue allows the number of tasks scheduled on a single machine to be flexibly configured. High tolerance for the number of tasks cached in the task queue can prevent machine jam.

Growing Project

DolphinScheduler competes with the likes of Apache Oozie, a workflow scheduler for Hadoop; open source Azkaban; and Apache Airflow. It touts high scalability, deep integration with Hadoop and low cost.

T3-Travel choose DolphinScheduler as its big data infrastructure for its multimaster and DAG UI design, they said. Likewise, China Unicom, with a data platform team supporting more than 300,000 jobs and more than 500 data developers and data scientists, migrated to the technology for its stability and scalability.

“JD Logistics uses Apache DolphinScheduler as a stable and powerful platform to connect and control the data flow from various data sources in JDL, such as SAP Hana and Hadoop. It offers open API, easy plug-in and stable data flow development and scheduler environment,” said Xide Gu, architect at JD Logistics.

While in the Apache Incubator, the number of repository code contributors grew to 197, with more than 4,000 users around the world and more than 400 enterprises using Apache DolphinScheduler in production environments.

Dai and Guo outlined the road forward for the project in this way:

1: Moving to a microkernel plug-in architecture

The kernel is only responsible for managing the lifecycle of the plug-ins and should not be constantly modified due to the expansion of the system functionality. So this is a project for the future. The plug-ins contain specific functions or can expand the functionality of the core system, so users only need to select the plug-in they need. This could improve the scalability, ease of expansion, stability and reduce testing costs of the whole system.

The DolphinScheduler community has many contributors from other communities, including SkyWalking, ShardingSphere, Dubbo, and TubeMq. The scheduling system is closely integrated with other big data ecologies, and the project team hopes that by plugging in the microkernel, experts in various fields can contribute at the lowest cost.

2: Core link optimization

The team wants to introduce a lightweight scheduler to reduce the dependency of external systems on the core link, reducing the strong dependency of components other than the database, and improve the stability of the system. By optimizing the core link execution process, the core link throughput would be improved, performance-wise.

3:  Provide lightweight deployment solutions

The software provides a variety of deployment solutions: standalone, cluster, Docker, Kubernetes, and to facilitate user deployment, it also provides one-click deployment to minimize user time on deployment. It also supports dynamic and fast expansion, so it is easy and convenient for users to expand the capacity.

Considering the cost of server resources for small companies, the team is also planning to provide corresponding solutions. Users can choose the form of embedded services according to the actual resource utilization of other non-core services (API, LOG, etc.), and can deploy LoggerServer and ApiServer together as one service through simple configuration. The core resources will be placed on core services to improve the overall machine utilization. This would be applicable only in the case of small task volume, not recommended for large data volume, which can be judged according to the actual service resource utilization.

Image by Pexels from Pixabay.

A newsletter digest of the week’s most important stories & analyses.