Distributed data systems are used in a variety of settings like online serving, offline analytics, data transport, and search, among other use cases. These start off with a single node solution that provides the core functionality e.g. it can be a database, messaging, search index etc. Building from a single node, requires not just a distributed system but also the different requirements that go with it. With these issues related to scaling in mind, we built Helix, a generic framework for developing distributed systems.
For a long time, a single node solution was sufficient for most use cases but with the advent of big data thats no longer true. Most applications need to run in a distributed setting and with distributed computing come challenges of scalability, partition and fault-tolerance. The evolution of such a system looks like below.
Adding these capabilities is non trivial, error prone and time consuming. Each feature exponentially increases the complexity of the system but these capabilities are critical for production ready systems expected to operate at a large scale. Most systems attempt to go through this evolution but very few end up to the right end of the spectrum. In fact, many systems go through multiple rewrites every time they add a new capability.
At LinkedIn, we were about to embark on a journey to build many such distributed systems, the first of which was Espresso, a NoSQL storage system. While designing the system we observed the need to support common patterns such as tolerance to partition, hardware and software failures, tasks such as bootstrapping, operational issues, load balancing and scaling. All of these were the motivation behind building a generic framework for developing distributed systems which we called Helix.
Distributed System Challenges
Lets go over some of the challenges faced with building a distributed system by taking an example distributed system – search system (Solr or Elasticsearch). The core indexing technology used by these systems is Apache Lucene. Apache Lucene is full text search engine library that provides high-performance indexing and efficient search capabilities. Let’s consider the following example spec.
- 1 Billion documents to be indexed.
- Servers with 48 gigabytes memory.
- Index is too large to fit in a single box.
For the system to be production ready, apart from the core functionality of processing search requests, it must address the following key requirements:
- Partition Management: A common strategy is to split the indexes into multiple partitions and distribute them over several servers to manage workload. As the partitions grow so does the challenge of managing them.
- Fault tolerance: As the number of partitions and servers grow so does the probability of partition load failing or software and hardware failure thus affecting uptime. Building a solution which is fault tolerant soon becomes a critical requirement.
- Scalability: As the number of documents increases over time, the per partition index size also increases. This means that number of partitions that can fit on a given machine reduces over time, it requires one to add more servers and re-distributed the indexes.
- While these requirements must be met another critical part of a software delivery process is configuration management, operability and multitenancy.
Often times a simpler approaches like using static configuration files leads to a manual and error prone approach. Alternatives of centrally managed configurations provide some respite however any inter-dependencies in distribution of the configuration soon render the solution lacking.
As systems scale there arises a need to orchestrate the startup and shutdown of these servers so as to reduce contention and enable a controlled start and bootstrap of servers. Any manual effort there is just a nightmare and demands automated management. As servers work in collaboration it becomes paramount that as they scale to handle requests any failures hardware (disk failures) or software (failure to load a partition) are identified and remedied ASAP to reduce impact on capacity and uptime. An improved MTTR (Mean time to Recovery) can be the difference between a good quarter v/s a bad.
As the number of tenants increase, overhead of running multiple processes for each client becomes substantial. Multitenancy reduces this over head and allows efficient resource utilization by running multiple tenant within a single process. One of the basic requirement to achieve multitenancy is to have the ability to dynamically configure and reconfigure tenants.
In order to solve the above requirements in a generic way, Helix introduced the concept of Augmented Finite State Machine (AFSM). A Finite State Machine (FSM) through states and transitions provides sufficient expressiveness to describe the behavior of the system.
We augmented the state machine with an added dimension of a ‘constraint’ for both states and transitions. Constraints can be specified at the granularity of partition, resource, instance and cluster.
A search system requirement can be described using this AFSM.
- States – These are the set of states the replica(s) of a partition can move through
- OFFLINE – Start State
- ONLINE – End State where a replica is ready to serve requests
- Transitions – An ordered set of valid transitions through various states before serving requests
- OFFLINE – BOOTSTRAP – in this transition the server loads the index for the partition
- BOOTSTRAP – ONLINE
- Constraints – Recall that constraints can be defined at states and transitions
- State Constraint (REPLICA-> MAX PER PARTITION: 3) → indicates that for every partition we need at most 3 replicas.
- Transition Constraint (MAX PER NODE OFFLINE→ BOOTSTRAP:1) → Limits the number of partitions that simultaneously load index on any given server.
- Transition Constraint (MAX PER CLUSTER OFFLINE→ BOOTSTRAP:10) → Limits the number of partitions that simultaneously load index across the entire cluster.
Objectives are used to control the placement of resource partition among the nodes. The objective can include, but not limited to the following:
- Evenly distribute the partitions and replicas among the nodes.
- Don’t place multiple replicas of a partition on the same node.
Helix divides distributed system components into 3 logical roles as shown in the above figure. It’s important to note that these are just logical components and can be physically co-located within the same process. Each of these logical components have an embedded agent that interacts with other components via Zookeeper.
Controller: The controller is the brain of the system and hosts the state machine engine, runs the execution algorithm, and issues transitions against the distributed system.
Participant: A participant executes state transitions and triggers callbacks implemented by the distributed system when the controller initiates a state transition. So for example: In case of search system, the system implements the callbacks such as offlineToBootstrap, BootstrapToOnline and are invoked when controller fires respective transitions.
Spectator: A spectator is an observer of the system state and gets notified on state changes. A spectator may interact with the appropriate participant upon such notifications. For example, a routing/proxy component (a Spectator) can be notified when the routing table has changed, such as when a cluster is expanded or a node fails.
A more detailed discussion on these components can be found here
Usage of Helix in LinkedIn Ecosystem
- Espresso: Espresso is a distributed, timeline consistent, scalable, document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned into a number of partitions, with each partition having a certain number of replicas distributed across the storage nodes.
- Databus: Databus provides a common pipeline for transporting events from LinkedIn’s primary databases (Espresso and Oracle Database) to downstream applications like caches and search systems. Databus maintains a buffer on per partition basis. Databus constantly monitors the primary databases for changes in the topology and automatically creates new buffers as new tables are created on the primary database.
- Databus Consumers: Each Databus partition is assigned to a consumer such that partitions are evenly distributed across consumers and each partition is assigned to exactly one consumer at a time. The set of consumers may grow over time, and consumers may leave the group due to planned or unplanned outages. In these cases, partitions must be reassigned, while maintaining balance and the single consumer-per-partition invariant. Data replicator that replicate data to another data center and search indexer nodes that generate search indices are some applications that follow this pattern.
- Seas: (Search As A Service): LinkedIn’s SEAS lets internal customers define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster. The index is broken into partitions and each partition has a configured number of replicas.
- Pinot (Analytics Platform): Pinot is an online analytics engine used for arbitrary rollup and drill down.
- ETL: We have multiple task execution frameworks built on top of Helix. These are customized as per the use case. For instance, the snapshot generation service converts Espresso data into Avro format and loads it into Hadoop for further analysis.
Helix usage outside of LinkedIn
Since open sourcing Helix several companies outside of LinkedIn are using it
- Instagram in IGDM (Instagram Direct Messaging)
- Box in Box Notes.
- Turn for internal storage.
- jBPM: using Helix to achieve clustering in jBPM.
We have been working on some key features that will allow Helix to be used in systems beyond resource management.
- Task Framework: Ability to schedule ad-hoc tasks and monitor the progress.
- Provisioning: Integration with other resource managers such as YARN/Mesos to support automatic provisioning.
- Helix-IPC: Ability for partitions to communicate amongst each other. This can form the basis to build various replication schemes such as synchronous/asynchronous replication, chained replication and consensus (raft).
Building distributed systems from scratch in non trivial, error prone and time consuming. Its important to build the right set of building blocks that can be leveraged across various systems. At LinkedIn, development of Helix started with Espresso and as we have continued to build variety of distributed systems, that investment has continued to pay off, allowing us to build reliable and operational systems e.g. Pinot leveraged Helix successfully with little effort to reduce the operational overhead.