LinkedIn’s Gobblin: An Open Source Framework for Gobbling Big Data with Ease
The engineering team for social media service LinkedIn first launched Gobblin in 2014 as a universal data ingestion framework for offline big data, running on Hadoop in MapReduce mode. As new capabilities were added to enable the framework to support a spectrum of execution environments and scale to handle a broad range of data velocities, Gobblin quickly evolved from a single-purpose data ingestion tool into a robust data management ecosystem. Gobblin was made open source in mid-2015, and has grown into a distributed data integration framework that simplifies the common aspects of big data, from ingestion to replication to organization, for complete lifecycle management across both streaming and batch environments.
Shortly after Gobblin’s second birthday, the team felt it was ready for the big time: joining other LinkedIn open source projects contributed to the Apache Software Foundation, including the Helix cluster management framework and Kafka distributed streaming platform. Gobblin was accepted into the Apache Incubator Project in February 2017, and has spent the year since then completing the internal process. The final step, contributing the actual code, was recently completed, and Gobblin has now become an official Apache entity.
Prior to incubation, Gobblin was already being embraced beyond LinkedIn by companies like Apple and PayPal. Organizations like CERN and Sandia National Laboratories that consume and crunch simply unimaginable amounts of data (1PB per second, in CERN’s case) also adopted Gobblin to help conduct their research.
The New Stack spoke with Abhishek Tiwari, Staff Software Engineer at LinkedIn, about Gobblin’s journey.
Why Apache, and what exactly is involved in the incubation process?
Although Gobblin was already finding success with outside organizations, we believed that becoming an official project at one of the most influential open source organizations on the planet would ensure durability and self-sustenance, as well as access to a broader community that could help continue the evolution. Since starting the Apache Incubation process early last year, we have already seen good progress on this front. Apache Gobblin community members have proposed, built, and started to spearhead a few critical developments, including Amazon Web Services mode enhancements and auto-scalability.
The first step in the process was the Gobblin incubation proposal, which was unanimously accepted by Apache. Next came working with mentors and champions to set up the code donation and licenses, in coordination with the Microsoft legal team, and setting up the Apache infrastructure, all before officially getting incubated.
What factors drove the evolution of Gobblin?
The original idea for Gobblin was to reduce the amount of repeated engineering and operations work in data ingestion across the company, work that came from building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. At one point, we were running more than 15 different kinds of pipelines, each with its own idiosyncrasies around error modes, data quality capabilities, scaling, and performance characteristics.
Our guiding vision for Gobblin has been to build a framework that can support data movement across streaming and batch sources and sinks without requiring a specific persistence or storage technology. At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Oracle, LinkedIn Espresso, MySQL, SQL Server, Apache Kafka, patent and publication sources, etc.
Over the years, we’ve made strides in fulfilling that vision, but have also grown into adjacent capabilities like end-to-end data management: from raw ingestion to multi-format storage optimizations, fine-grained deletions for compliance, config management, and more. The key aspect differentiating Gobblin from other technologies in the space is that it is intentionally agnostic to the compute and storage environment, yet it can execute natively in many environments. So, you don’t HAVE to run Hadoop or Kafka to be able to use Gobblin, though it can take advantage of these technologies if they are deployed in your company.
Is Gobblin mainly applicable to very large organizations munching serious data, or are there other use cases for different/smaller entities?
We see adoption across the board. Smaller entities often are not very vocal about their adoption because they’re too busy getting their startup off the ground. No matter what the size, the common theme is that the business is data-driven, has multiple data sources and sinks, and has a Lambda architecture, with both streaming and batch ecosystems. Some examples of small and medium-size companies using Gobblin are Prezi, Trivago and NerdWallet.
One recent development is Gobblin-as-a-Service. How does that work?
The aim was to build a PaaS (Platform-as-a-Service) for data management that encapsulates and unifies heterogeneous data movement and processing deployments (Gobblin or non-Gobblin) behind a service.
As more and more pieces of the data management Swiss army knife came together in Gobblin, the challenge shifted to long lead times due to human involvement in deploying, managing, and operating these pipelines across multi-cluster, multi-region deployments. This led to the central Gobblin development and operations team becoming the bottleneck in rolling out new pipelines. We built Gobblin-as-a-Service (GaaS) to solve these problems by offering a self-serve, programmatic way to develop, deploy and operate data integration pipelines.
For illustration, in the figure below, the component within the red circle is GaaS, whereas the rest are the different deployments that GaaS coordinates with to execute jobs:
How do microservices and containerization dovetail with Gobblin?
Gobblin-as-a-Service takes advantage of the containerization trend by allowing Gobblin jobs to be containerized and run in isolation from other jobs. Similarly, Gobblin’s core engine is built for ingestion in a microservice-based world, with optimizations like pipelining of remote service calls for latency-hiding as well as throttling capabilities in connectors to prevent ingestion traffic from causing a DDoS of online services.
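To make the throttling idea concrete, here is a minimal sketch of a token-bucket rate limiter, one common way a connector can cap the request rate it sends to an online service. It is an illustration only: the `TokenBucket` class, its parameters, and the injectable `clock` are hypothetical names chosen for this example and are not taken from Gobblin's API, and the interview does not say which throttling algorithm Gobblin's connectors use.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket throttle (illustrative; not Gobblin's API)."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        """Take n tokens if available; return False when the caller must back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A connector built this way would call `try_acquire()` before each remote request and wait or retry later when it returns `False`, so a burst of ingestion work cannot exceed the configured rate.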
What are some real-world use cases for Gobblin?
- Self-serve: Users can create jobs programmatically through REST APIs or via the UI on any Gobblin deployment, leaving the operations team to focus only on deployment and upgrades.
- Optimal resource usage: Users can submit jobs and leave it to Gobblin-as-a-Service to choose the best executor instance and compile the logical job into physical jobs, which run either as part of a bin-packed multi-tenant job or as a single-tenant job, depending on resource and SLA constraints.
- Failover and upgrades: The technology executing the job behind GaaS can be transparently swapped out in case of failover or upgrades without affecting the user and without their intervention (e.g., the operations team might replace a Camus deployment with Gobblin without affecting the user).
- Global state: Because GaaS unifies hybrid technology deployments, the operations team can easily monitor and manage the global state of the data landscape and lineage in their organization.
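The "optimal resource usage" point above can be pictured with a small placement sketch. The code below uses first-fit-decreasing bin packing, a standard heuristic swapped in here for illustration; the interview does not describe the algorithm GaaS actually uses. The names `Job`, `Executor`, `place`, the `cores` resource unit, and the `strict_sla` flag are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    cores: int                 # assumed resource unit for this sketch
    strict_sla: bool = False   # assumed flag: job needs its own executor


@dataclass
class Executor:
    capacity: int
    jobs: List[Job] = field(default_factory=list)
    dedicated: bool = False    # True for single-tenant executors

    @property
    def free(self) -> int:
        return self.capacity - sum(j.cores for j in self.jobs)


def place(jobs: List[Job], capacity: int) -> List[Executor]:
    """First-fit-decreasing placement; strict-SLA jobs run single-tenant."""
    executors: List[Executor] = []
    # Placing the largest jobs first tends to leave fewer unusable gaps.
    for job in sorted(jobs, key=lambda j: j.cores, reverse=True):
        if job.cores > capacity:
            raise ValueError(f"{job.name} does not fit on any executor")
        if job.strict_sla:
            executors.append(Executor(capacity, [job], dedicated=True))
            continue
        # Reuse the first shared executor with enough room, else open a new one.
        target: Optional[Executor] = next(
            (e for e in executors
             if not e.dedicated and e.free >= job.cores), None)
        if target is None:
            target = Executor(capacity)
            executors.append(target)
        target.jobs.append(job)
    return executors
```

Under these assumptions, jobs without a strict SLA share executors where they fit, while a strict-SLA job always gets an executor to itself, mirroring the multi-tenant versus single-tenant choice described in the list.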
How can people join in the community that is growing up around Gobblin?
There are all kinds of ways to get involved in the Apache Gobblin community, including contributing features, testing and bug fixes, evangelizing ideas, or simply helping update the documentation. Also, there will be a Big Data Meetup on January 25 at LinkedIn’s offices in San Francisco with speakers from LinkedIn, Prezi, and other Bay Area companies to talk about the exciting new developments and challenges in the data management and data integration space. Please join us there!