Streaming Data at LinkedIn: Apache Kafka Reaches 1.1 Trillion Messages Per Day
LinkedIn’s deployment of Apache Kafka has surpassed over 1.1 trillion messages per day, a milestone which presents just another example of how Internet-scale services are changing perspectives about enterprise-level message brokers.
Apache Kafka is a low-latency distributed messaging system. Earlier this year, LinkedIn’s Navina Ramesh wrote for The New Stack about Apache Samza, LinkedIn’s framework for stream processing. She described how, back in 2006, LinkedIn started using Hadoop for batch processing. She wrote that unlike Hadoop, which is optimized for throughput, Kafka is optimized for low-latency messaging:
We built a processing system on top of Kafka, allowing us to react to the messages — to join, filter, and count the messages. The new processing system, Apache Samza, solved our batch processing latency problem and has allowed us to process data in near real time.
The Origin of Apache Kafka
The story behind Kafka and Samaza also speaks to the roots of Confluent, one of the fastest growing startups in the business. In a post yesterday, Confluent co-founder Neha Narkhede wrote about how she, Jay Kreps and Jun Rao created the system which is now Kafka. Prior to Kafka, LinkedIn ran with legacy technology, which made data collection and research difficult. By observing the limitations that arose when working with older machines, Narkhede, Rao and Kreps were able to build Kafka to adjust to today’s technology demands.
Today, Kafka is used in data centers across the world, including those at Netflix, where the ability to sync and access data worldwide is crucial. When Kafka was put into production in July 2010, it was originally used to power user activity data. By 2011, Kafka handled 1 billion messages per day. As time went on, Kafka was used to capture all of LinkedIn’s data. This began as handling small changes, such as when a user updated their profile or added a work connection, eventually growing to encompass segments such as search graphs, LinkedIn’s Hadoop clusters, and its data warehouse.
The Future of Kafka
Eventually, Narkhede, Kreps and Rao departed LinkedIn to found Confluent. According to Narkhede’s post on Confluent, since they began the transition, Kafka downloads have increased by a multiple of seven in the last year. Confluent offers developers a tiered subscription package to deploy Kafka at scale for enterprise-level tasks, though the project itself is open source.
Both LinkedIn and Confluent are active within the open source community, committed to supporting and developing Kafka longterm. LinkedIn’s use of Kafka far surpassed the expectations of its creators, having been the largest project at the time of its inception to offer companies the ability to harness data at-scale reliably.
LinkedIn’s original legacy platform did not allow for the collection of user-activity data or log data, which is crucial to the longterm success of any modern business. Since implementing Kafka, these data sources can be collected, researched and analyzed. Kafka makes data available in real time, allowing for stream processing and analytics to be compiled with ease. LinkedIn continues to develop Kafka to suit its needs, setting custom libraries which layer on top of the original Kafka libraries. Understanding how users navigate, consume, populate and curate content on a platform such as LinkedIn is crucial.
Running Kafka as an alternative to a standard message broker allows for businesses to handle their data infrastructure with a focus on scalability and security. As more applications move into working in the cloud or with containers, harnessing logs and user data is vital to retaining users longterm. Kafka presents itself as a standout choice for enterprises or small businesses to manage their stream data.
Feature image: “LinkedIn Chocolates” by Nan Palmero is licensed under CC BY 2.0.