LinkedIn Unifies Stream and Batch Processing with Apache Beam
By migrating to Apache Beam, social networking service LinkedIn unified its streaming and batch source code files, and reduced data processing time by 94%.
The job of refreshing data sets, known as “backfilling,” was originally run as a set of stream processing jobs, but as the jobs grew more complex, more problems arose, explained a multi-author LinkedIn blog entry posted Thursday.
Backfills were then processed as batch jobs via a lambda architecture, which brought on a new set of problems: there were now two different code bases, with all the challenges of owning and maintaining two sets of source code. The lambda architecture was eventually replaced with the Beam API, which required only one source code file for both batch and streaming. The project was a success, and overall resource use dropped by 50%.
Thought leaders and streaming software companies are engaged in a debate over real time vs. batch processing. One side is firmly planted in the idea that software must become more accessible to developers of all skill levels before streaming will truly hit the mainstream. The counterargument says developers must rise to the higher skill level requirements of the inconsistent tech stacks and languages that make up current streaming systems.
LinkedIn’s recent 94% reduction in data processing time, achieved by unifying its streaming and batch pipelines with Apache Beam, is a big win for the simplification argument.
The Challenge with Backfills
LinkedIn’s standardization process maps user-input strings (job titles, skills, education history) to internal IDs. The standardized data is required for search indexing and recommendation models. More advanced AI models within the pipelines also join complex data (job types and work experience) to standardize it for further use.
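At its simplest, standardization of this kind is a normalize-then-look-up step. The sketch below illustrates the idea in plain Python; the table contents and ID values are hypothetical, not LinkedIn’s actual mappings.

```python
from typing import Optional

# Hypothetical internal ID table for job titles.
TITLE_IDS = {
    "software engineer": 1001,
    "data scientist": 1002,
}

def normalize(raw_title: str) -> str:
    """Canonicalize a raw input string before lookup."""
    return " ".join(raw_title.lower().split())

def standardize(raw_title: str) -> Optional[int]:
    """Map a free-text job title to an internal ID, or None if unknown."""
    return TITLE_IDS.get(normalize(raw_title))
```

In production this lookup is backed by learned models rather than a static dictionary, but the contract is the same: messy input strings in, stable internal IDs out.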
Standardization requires two methods of data processing: real-time computation to reflect immediate updates, and periodic backfills to refresh data when new models are introduced. When both real-time processing and backfills were handled as stream processing, the jobs were executed via the Apache Samza Runner, which runs Beam pipelines. This worked until the following problems became insurmountable:
- Real-time jobs failed to meet time and resource requirements during backfill processing.
- The target of processing 900 million profiles at a rate of 40,000 profiles/sec for each backfill job became unreachable as the training models grew in complexity.
- Streaming clusters weren’t optimized for the backfill’s spiky resource footprint.
The first optimization moved backfills to batch processing and executed the logic via a lambda architecture. This was operational but not optimal, because the lambda architecture brought with it a Matryoshka doll of challenges, starting with a second code base. Developers now had to build, learn, and maintain two codebases in two different languages and stacks.
The next iteration introduced the Apache Beam API, which let developers go back to working on one source code file.
The Solution: Apache Beam
Apache Beam is an open source, unified model for defining batch and streaming data-parallel processing pipelines. Developers build a program that defines the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of Beam’s distributed processing backends, of which there are several options, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.
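The core of the model is the separation between the pipeline definition and the backend that executes it. The stdlib-only sketch below imitates that separation without using the real Beam SDK (its names and structure are illustrative, not Beam’s API): one pipeline definition runs unchanged on two interchangeable “runners.”

```python
from typing import Callable, Iterable, Iterator, List

# One pipeline definition: an ordered list of element-wise transforms.
Pipeline = List[Callable[[str], str]]

def build_pipeline() -> Pipeline:
    return [str.strip, str.lower]

def run_eager(pipeline: Pipeline, source: Iterable[str]) -> List[str]:
    """Materialize every element, the way a batch backend would."""
    out = list(source)
    for step in pipeline:
        out = [step(x) for x in out]
    return out

def run_lazy(pipeline: Pipeline, source: Iterable[str]) -> Iterator[str]:
    """Process elements one at a time, the way a streaming backend would."""
    for x in source:
        for step in pipeline:
            x = step(x)
        yield x
```

The point of the pattern is that `build_pipeline()` is written once; swapping the execution strategy requires no change to the pipeline itself, which is what Beam’s runner abstraction provides at production scale.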
In this particular use case, the unified pipeline is powered by Beam’s Samza and Spark backends. Samza processes two trillion messages daily with large states and fault tolerance; the Beam Samza Runner executes the Beam pipeline locally as a Samza application. The Spark backend processes petabytes of data with LinkedIn’s external shuffling service and schema metadata store; the Beam Spark Runner executes Beam pipelines using Spark, just as a native Spark application would.
How It Works
The Beam pipeline manages a directed acyclic graph of processing logic. The pipeline in the diagram below reads ProfileData, joins the table with a sideTable, applies a user-defined function called Standardizer(), and completes by writing the standardized result to the database. The same code snippet is executed by both the Samza cluster and the Spark cluster.
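The described flow can be sketched in plain Python. This is not LinkedIn’s Beam code; the record fields, the `standardizer` logic, and the dictionary-backed “database” are all hypothetical stand-ins for the stages named above.

```python
# Sketch of the pipeline stages: read ProfileData, join with sideTable,
# apply Standardizer(), write to the database. All names are illustrative.

def standardizer(record: dict) -> dict:
    """Hypothetical user-defined standardization step."""
    record["title"] = record["title"].lower()
    return record

def run_pipeline(profile_data, side_table, sink):
    for record in profile_data:                      # read ProfileData
        extra = side_table.get(record["id"], {})     # join with sideTable
        record = standardizer({**record, **extra})   # apply Standardizer()
        sink[record["id"]] = record                  # write to the database

database = {}
run_pipeline(
    profile_data=[{"id": 1, "title": "Software Engineer"}],
    side_table={1: {"region": "US"}},
    sink=database,
)
```

In the real pipeline each stage is a Beam transform in the DAG rather than a line in a loop, which is what lets the runners parallelize and distribute the work.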
Even with Beam, where the source code is the same, batch and stream processing jobs accept different inputs and return different outputs. Streaming inputs originate from unbounded sources such as Kafka, and their outputs update the database; batch inputs come from bounded sources such as HDFS and produce datasets as outputs.
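The bounded/unbounded distinction can be made concrete with a stdlib sketch: the same transform runs over a finite list (batch, producing a finished dataset) and over an endless generator (streaming, incrementally updating a database stand-in). The transform and the dictionary “database” are illustrative assumptions.

```python
import itertools

def transform(x: int) -> int:
    """A placeholder per-element transform shared by both modes."""
    return x * 2

# Bounded source (batch): a finite dataset, processed to completion,
# producing a dataset as output.
bounded = [1, 2, 3]
batch_output = [transform(x) for x in bounded]

# Unbounded source (streaming): an endless generator; elements are
# consumed incrementally and each result updates the "database".
def unbounded():
    yield from itertools.count(1)

database = {}
for x in itertools.islice(unbounded(), 3):  # consume a few elements
    database[x] = transform(x)
```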
A PTransform is the out-of-the-box processing step in the Beam workflow: it takes input from either kind of source, performs a processing function, and produces zero or more outputs. LinkedIn added functionality to further streamline the Beam API with its unified PTransforms, which provide two expand() functions, one for streaming and one for batch. The pipeline type is detected at runtime, and the appropriate expand() is called accordingly.
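A minimal sketch of that dispatch pattern, in plain Python rather than the actual Beam SDK (the class name, method names, and the string-based pipeline type are all assumptions for illustration):

```python
# One transform class exposes separate batch and streaming expansions;
# the right one is selected at runtime from the pipeline type.

class UnifiedTransform:
    def expand_batch(self, elements):
        # Bounded input: process everything and return a dataset.
        return [e.upper() for e in elements]

    def expand_streaming(self, elements):
        # Unbounded input: yield results one element at a time.
        for e in elements:
            yield e.upper()

    def expand(self, pipeline_type, elements):
        if pipeline_type == "batch":
            return self.expand_batch(elements)
        return self.expand_streaming(elements)
```

The benefit is that callers build the pipeline against one transform and never branch on the mode themselves; the mode-specific logic lives in exactly one class.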
The original method of performing backfills as stream processing required over 5,000 GB-hours of memory and nearly 4,000 hours of CPU time. Both figures were roughly cut in half after the migration to Beam, and the seven hours it took to complete the jobs dropped to a mere 25 minutes.
Overall, this translates to a 94% savings in processing time and a 50% savings in overall resource usage. Based on the cost-to-serve analysis, operating costs were reduced elevenfold.
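The headline numbers are consistent with one another: going from seven hours to 25 minutes is a roughly 94% reduction.

```python
# Sanity-check the reported savings: 7 hours down to 25 minutes.
before_min = 7 * 60   # 420 minutes
after_min = 25
reduction = 1 - after_min / before_min   # about 0.94
```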
This is but a first step toward a truly end-to-end convergence solution, and LinkedIn continues to work on easing the complexity of combining streaming and batch. Though there is only one source code file, complexity remains because stream and batch run on different runtime binary stacks (the Beam Samza runner for streaming, the Beam Spark runner for batch): engineers must learn how to run, tune, and debug both clusters; bear the operational and maintenance costs of running on two engines; and maintain the two runner codebases.
LinkedIn Senior Software Engineer Yuhong Cheng was the lead author of the LinkedIn post, with Shangjin Zhang, Xinyu Liu, and Yi Pan serving as co-authors.