How LinkedIn, PayPal Each Beat Database Lag with Home-Built Open Source
At QCon New York last month, PayPal and LinkedIn each unveiled new software projects designed to speed and simplify the movement of data. Both were designed internally to better manage the crushing amount of data and transactions each internet service gets, and both are being released as open source packages.
PayPal’s Hera, released recently as open source, is a database proxy built specifically for pooling and managing database connections for microservices. LinkedIn’s Brooklin, which the company plans to release as opens source shortly, serves as a central data bus for the company, connecting the many back-end databases and data stores with user-facing applications.
Both of these projects should be examined by organizations looking to simplify and speed the data movement between back-end data sources and user-facing applications.
The online payments company Paypal relies heavily on databases: At present, the company has 2,000 database instances holding a total of 74PB of data and the busiest ones field nearly 1 billion calls an hour. As with any online service, response time is crucial, and one of the biggest culprits are database connection times, which can range up to 100ms each. Worse, each database instance has a set number of connections it can handle at once. The more connections a database instance, the worse its performance gets.
One possible way of smoothing over performance is to establish a connection pool, which fields and queues the requests for multiple microservices. This is what PayPal’s Hera does, sitting between the microservices and the databases, pooling the connections, and providing additional services, according to a presentation by PayPal staff engineers Petrica Voicu and Kenneth Kang.
Hera has been running in production for a year, and sits in front of hundreds of database instances.
Hera manages both reads and writes. It picks a particular node to send all the writes to, thereby cutting minimizing a lot of back-and-forth handshaking, thereby speeding traffic for all the other instances that can be focused on fulfilling reads requests. Although the applications themselves don’t need to be modified, some additional functionality is needed to communicate with Hera, to specify which databases shards to communicate. So the team created libraries for C++, Java, Python, Node.js and Go calls.
Hera also comes with some built-in resiliency and recovery safeguards: It can balance connections across the remaining nodes when one goes out. In times of overload, it will immediately terminate the SSL connection, so the app can immediately launch into failure mode, and not leave the user hanging when there is no clear path to the data. It also offers migration capability, facilitating the database shards safely from one node to another, even while transactions are still being processed.
The code for Hera is available on GitHub.
No Sleep Til Brooklin
LinkedIn is no stranger to data streaming latencies either: Eight years ago, the social media giant released as open source Kafka, a streaming data platform that serves as a centralized pub-sub mechanism for creating data streams. Now, a new project from the company, Brooklin, is about connecting multiple data sources to multiple destinations, said Celia Kung, the LinkedIn engineering manager in charge of the pipelines group at the company.
Brooklin addresses the challenge of how to do data streaming to multiple end-nodes when the data is coming from multiple sources. Today, LinkedIn uses Brooklin to support over 200 applications. It passes over 2 trillion messages a day from Kafka alone, spanning tens of thousands of topics. These capabilities can be configured individually by the developer and dynamically deployed — no need to manually change configuration files or manually deploying to the cluster.
First a bit about the name: Brooklin is not misnamed after the New York borough. Instead, it is an aggregation of two words: “brook” (like the stream) and a shortened “LinkedIn.” The software acts like a brook of gently-flowing data, supplying near-real-time apps with material from backend databases. For example, LinkedIn users will automatically see on their own timelines when someone they are linked to gets a new job, or a promotion. Brooklin moves that piece of data from the back-end database to the app that produces each user’s timeline.
While initially LinkedIn stored most all of its user data in Oracle, it since adopted a wider variety of data storage systems, especially since Microsoft purchased LinkedIn in 2006. Now user and operational data is captured in Kafka, MySQL, Microsoft Event Hubs, the company’s own Espresso data store, and Amazon Kinesis, among others. Initially, Kung’s team created a data bus to capture any changes made in the Oracle database, but that code was specific to this one connection. They wanted to build a data bus that could ingest data from multiple sources, and also be consumed by multiple near-real-time applications.
LinkedIn uses Brooklin for a variety of use cases, including change data capture (propagating changes in the database to applications), eliminating the overhead to run queries against the data source itself. Brooklin can also refresh caches and build search indices. It can work as s streaming bridge, able to move data from one source to another, such as from AWS to Microsoft Azure. It can be also be used to re-partition data. It can be used to set up a data warehouse, eliminating the need for extract transform and load (ETL) tools, by shipping data to a Hadoop File System (HDFS).