Apache SeaTunnel Integrates Masses of Divergent Data Faster
The latest project to reach top-level status with the Apache Software Foundation (ASF) was designed to solve common problems in data integration. Apache SeaTunnel can ingest and synchronize massive amounts of data from disparate sources faster, greatly reducing the cost of data transfer.
“Currently, the big data ecosystem consists of various data engines, including Hadoop, Hive, Kudu, Kafka and HDFS in the big data ecosystem; MongoDB, Redis, ClickHouse and Doris in the broader database ecosystem; AWS S3, Redshift, BigQuery and Snowflake in the cloud; and various data systems like MySQL, PostgreSQL, IoTDB, TDengine, Salesforce, Workday, etc.,” Debra Chen, community manager for SeaTunnel, wrote in an email message to The New Stack.
“We need a tool to connect these data sources. Apache SeaTunnel serves as a bridge for integrating these complex data sources accurately, in real time and with simplicity. It becomes the ‘highway’ for data flow in the big data landscape.”
The open source tool is described as an “ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data.” We’re talking tens of billions of data points a day.
Efficient and Rapid Data Delivery
Begun in 2017 and originally called Waterdrop, the project was renamed in October 2021 and entered the ASF incubator in December of the same year. Created by a small group in China, SeaTunnel has since grown to more than 180 contributors around the world.
Built primarily in Java, SeaTunnel consists of three main components: source connectors, transfer compute engines and sink connectors. The source connectors read data from the source end (which could be JDBC, binlog, unstructured Kafka, a Software as a Service API or AI data models) and transform it into a standard format that SeaTunnel understands.
The transfer compute engines then process and distribute the data (handling tasks such as data format conversion and tokenization). Finally, the sink connectors transform the SeaTunnel data format into the format required by the target database for storage.
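In practice, this source-transform-sink pipeline is expressed as a single job configuration file. The sketch below is illustrative only: it follows the HOCON-style layout shown in the SeaTunnel documentation, but connector names and options vary by version, and the table names and fields here are made up for the example.

```hocon
# Minimal batch job: one source, one transform, one sink.
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  # FakeSource generates test rows; a real job would use JDBC, Kafka, etc.
  FakeSource {
    result_table_name = "users"
    schema = {
      fields {
        name = "string"
        age  = "int"
      }
    }
  }
}

transform {
  # Keep only the columns needed downstream before writing out.
  Filter {
    source_table_name = "users"
    result_table_name = "users_filtered"
    fields = [name]
  }
}

sink {
  # Console prints rows to stdout; swap in ClickHouse, Doris, S3, etc.
  Console {
    source_table_name = "users_filtered"
  }
}
```

Because each stage only names the connector and its options, swapping the target database means changing the sink block rather than rewriting the pipeline.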
“Of course, there are also complex high-performance data transfer mechanisms, distributed snapshots, global checkpoints, two-phase commits, etc., to ensure efficient and rapid data delivery to the target end,” Chen said.
SeaTunnel provides a connector API that does not depend on a specific execution engine. While it uses its own SeaTunnel Engine for data synchronization by default, it also supports multiple versions of Spark and Flink. The plug-in design allows users to easily develop their own connector and integrate it into the SeaTunnel project. It currently supports more than 100 connectors.
It supports various synchronization scenarios, such as offline-full synchronization, offline-incremental synchronization, change data capture (CDC), real-time synchronization and full database synchronization.
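As a sketch of the CDC scenario, a streaming job can point a CDC source connector at a database and replay its change log into any sink. The configuration below is an assumption-laden example: the MySQL-CDC connector is real, but the connection details, database and table names are placeholders, and option names may differ between SeaTunnel versions.

```hocon
env {
  parallelism = 1
  # STREAMING mode keeps the job running and tailing the binlog.
  job.mode = "STREAMING"
}

source {
  MySQL-CDC {
    # Hypothetical connection details for illustration only.
    base-url = "jdbc:mysql://localhost:3306/shop"
    username = "st_user"
    password = "st_password"
    table-names = ["shop.orders"]
  }
}

sink {
  # Console echoes each change event; a real job would target a warehouse.
  Console {}
}
```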
Enterprises use a variety of technology components and must develop corresponding synchronization programs for each to complete data integration. Existing data integration and synchronization tools often require vast computing resources or Java Database Connectivity (JDBC) connection resources to achieve real-time synchronization. SeaTunnel aims to ease these burdens, making data transfer faster, less expensive and more efficient.
New Developments in the Project
In October 2022, SeaTunnel released its major version 2.2.0, introducing the SeaTunnel Zeta engine, a computing engine purpose-built for data integration, and enabling cross-engine connector support.
Last December it added support for CDC synchronization, and earlier this year added support for Flink 1.15 and Spark 3. The Zeta engine was enhanced to support CDC full-database synchronization, multi-table synchronization, schema evolution and automatic table creation.
The community also recently introduced SeaTunnel-Web, which lets users not only use SQL-like languages for transformations but also connect different data sources directly through a drag-and-drop interface.
“Any open source user can easily extend their own connector for their data source, submit it to the Apache community, and enable more people to use it,” Chen said. “At the same time, you can quickly solve the data integration issues between your enterprise data sources by using connectors contributed by others.”
What’s Ahead for SeaTunnel?
Chen laid out these plans for the project going forward:
- SeaTunnel will further improve the performance and stability of the Zeta engine and fulfill the previously planned features such as data definition language change synchronization, error data handling, flow rate control and multi-table synchronization.
- SeaTunnel-Web will transition from the alpha stage to the release stage, allowing users to define and control the entire synchronization process directly from the interface.
- Cooperation with artificial intelligence components will be strengthened. In addition to using ChatGPT to automatically generate connectors, the plan is to deepen the integration of vector databases and plugins for large models, enabling seamless integration of more than 100 data sources.
- The relationship with the upstream and downstream ecosystems will be enhanced, integrating and connecting with other Apache ecosystems such as Apache DolphinScheduler and Apache Airflow. Regular communication occurs through emails and issue discussions, and major progress and plans of the project and community are announced through community media channels to maintain openness and transparency.
- After supporting Google Sheets, Feishu (Lark), and Tencent Docs, it will focus on constructing SaaS connectors, such as ChatGPT, Salesforce and Workday.