How Twitter Supersized Search

Anyone who’s lived through the past few years knows how Twitter’s usage can spike in an instant.
The site’s search feature also gets hit hard. For these times of heightened interest, Twitter’s Search Infrastructure engineers have added a proxy, an ingestion layer, and a backfill layer to the search system’s architecture to reduce latency on the real-time platform.
The Search Infrastructure team is the engineering muscle behind the real-time search platform for Tweets, Users, Direct Messages, and all other searchable content on Twitter. A recent blog post by Shelby Cohen, Twitter Senior Software Engineer, and Jesse Akes, Twitter Software Engineer, goes into great detail on how an update to the system’s architecture lowers latency under the most demanding traffic conditions.
Twitter implemented three custom solutions on top of Elasticsearch to bridge the gap between the search engine and the unique needs that come with high-volume, unpredictable traffic.
A reverse proxy now sits between client services and Elasticsearch to alleviate bottlenecks, the new Ingestion Service adds a layer of reliability by managing traffic-spike overflow, and a backfill service provides a safe way to add missing data to new and existing Elasticsearch clusters.
What Is Elasticsearch?
Elasticsearch is an open source search engine widely used in the industry and known for its distributed nature, speed, scalability and simple REST APIs. It’s based on the Lucene library.
The Search Infrastructure team exposes all Elasticsearch APIs because they’re “powerful, flexible, and well documented.” Plugins and tooling are then provided to ensure compliance and easy integration with Twitter services.
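To make that building block concrete, here is a minimal sketch (not Twitter’s tooling) of the kind of Elasticsearch REST interaction described above, using the official Python client with its 8.x-style API. The index name, document fields, and local cluster address are illustrative assumptions, not Twitter’s actual schema.

```python
# Minimal sketch: index a document and search for it through the
# Elasticsearch REST API via the official Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local dev cluster

# Index a document (index name and fields are illustrative only).
es.index(index="tweets", id="1", document={
    "user": "jack",
    "text": "just setting up my twttr",
})

# Run a simple full-text query against the same index.
resp = es.search(index="tweets", query={"match": {"text": "twttr"}})
print(resp["hits"]["hits"])
```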
The New, Layered Architecture
The diagram below illustrates the new architecture and shows clearly how adding layers around the cluster helps alleviate the challenges of clients interacting directly with Elasticsearch.
The Proxy
Before any of the new components were added, querying and indexing, request monitoring, and metrics collection all went directly from the client to Elasticsearch. Twitter had implemented a custom plugin for additional security, privacy, and metrics collection, but it didn’t have any bottleneck-mitigation capabilities.
To alleviate the bottlenecks caused by Twitter-scale traffic, Twitter introduced a reverse proxy to act as a front end to the Ingestion Service. This simple HTTP service adds an extra layer between Twitter’s customers and their Elasticsearch clusters. The design creates a single entry point for all requests, handles all client authentication, improves connectivity and observability, and separates reads from writes.
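Twitter doesn’t publish the proxy’s code, but a minimal sketch of the idea (one authenticated entry point that splits reads from writes) might look like the following. The backend hostnames, the auth header, and the routing rule are assumptions made purely for illustration.

```python
# Minimal sketch of a read/write-splitting reverse proxy in front of
# Elasticsearch. Hostnames, ports, and auth are illustrative assumptions.
from flask import Flask, Response, request
import requests

app = Flask(__name__)

READ_BACKEND = "http://es-query.internal:9200"    # hypothetical query endpoint
WRITE_BACKEND = "http://ingestion.internal:8080"  # hypothetical ingestion service

def authenticate(req):
    # Single place to check client credentials for every request.
    return req.headers.get("X-Client-Id") is not None

@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def proxy(path):
    if not authenticate(request):
        return Response("unauthenticated", status=401)
    # Separate reads from writes: searches go to the query backend,
    # indexing and deletes go through the ingestion path.
    is_read = request.method == "GET" or path.endswith("_search")
    backend = READ_BACKEND if is_read else WRITE_BACKEND
    upstream = requests.request(
        request.method,
        f"{backend}/{path}",
        params=request.args,
        data=request.get_data(),
        headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
        timeout=10,
    )
    return Response(upstream.content, status=upstream.status_code)
```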
The Ingestion Service
The Elasticsearch pipeline just doesn’t have the built-in auto-throttling or degradation mechanisms to handle the traffic spikes that come with high-profile Twitter drama. By default, the common safety mechanisms for large traffic spikes (queuing, retrying, and backing off) are left to each individual client service to implement. At best, the result is increased indexing and query latency; at worst, total index or cluster loss.
To replace that unreliable approach, the Search Infrastructure team created the Ingestion Service, which provides a dedicated Kafka topic per Elasticsearch cluster with the built-in capability to handle these large spikes. Worker clients read requests from the topic and send them on to the cluster. The Ingestion Service batches requests, listens for back-pressure, auto-throttles, and retries with backoff, which smooths traffic spikes, avoids overload, and keeps latency low.
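The post doesn’t include the Ingestion Service’s code, but a stripped-down worker following the same pattern (consume indexing requests from a per-cluster Kafka topic, batch them, bulk-index, and back off on failure) could look roughly like this sketch. The topic, broker, and index names are assumptions, and the exponential backoff is a simple stand-in for Twitter’s auto-throttling.

```python
# Minimal sketch of an ingestion worker: read indexing requests from a
# per-cluster Kafka topic, batch them, bulk-index into Elasticsearch, and
# back off on failure. Topic, broker, and index names are assumptions.
import json
import time

from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer  # kafka-python

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "es-cluster-tweets-ingest",          # hypothetical per-cluster topic
    bootstrap_servers="kafka.internal:9092",
    value_deserializer=lambda v: json.loads(v),
    enable_auto_commit=False,
)

BATCH_SIZE = 500
batch = []

for message in consumer:
    batch.append({"_index": "tweets", "_source": message.value})
    if len(batch) < BATCH_SIZE:
        continue
    # Retry the whole batch with exponential backoff when the cluster
    # signals back-pressure (rejections, timeouts, etc.).
    delay = 1
    while True:
        try:
            helpers.bulk(es, batch)
            consumer.commit()          # commit offsets only after a successful write
            batch = []
            break
        except Exception:
            time.sleep(delay)          # throttle instead of hammering the cluster
            delay = min(delay * 2, 60)
```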
The Backfill Service
Anytime missing data needs to be added, Twitter considers it a “backfill.” That includes populating an empty Elasticsearch cluster, adding or updating schema fields, and handling late-arriving data. Backfills can reach hundreds of terabytes of data and millions, at times billions, of documents. The throughput is much higher than the typical ingestion pipeline’s, which means backfills take a very long time, and the landscape is rich with opportunities for things to go wrong.
Twitter’s old workflow had no built-in guardrails: data flowed through as quickly as it could. Because of this, Twitter was unable to safely backfill a live cluster, as an index would either fall behind or simply get crushed under the pressure of all the connections and incoming data. In the best case there were query performance issues; in the worst case the entire cluster would die. These failures also took a long time to surface.
The backfill service addresses this by loading large amounts of data safely and efficiently into an Elasticsearch index in three stages. The sink (an entry point similar to the old service’s) takes in a stream of data to index, converts it into indexing requests, partitions and stages those requests in temporary storage, and then initiates a call to the backfill orchestrator.
The backfill orchestrator, the brains of the backfill service, connects information from the sink to the internal Twitter service environments where the Elasticsearch clusters are hosted. Using that information, the orchestrator launches a dynamic number of workers to begin the backfill. The backfill workers are small distributed applications that read the indexing requests from storage and index the data into the cluster.
The sink partitions the index data and saves it in temporary storage so backfills can be re-run or resumed; if the cluster fails, Twitter doesn’t need to re-run jobs to prepare the data again. The workers automatically respond to back-pressure by rate limiting and by retrying individual document failures inside a bulk request. This greatly reduces the impact on query performance for a live cluster and the chance of a total index failure.
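A backfill worker following this description (read staged indexing requests from temporary storage, then index them with rate limiting and per-document retries) might look roughly like the sketch below. The file layout, index name, and rate limit are illustrative assumptions; elasticsearch-py’s streaming_bulk helper stands in for Twitter’s internal client.

```python
# Minimal sketch of a backfill worker: read staged indexing requests from
# a partition file in temporary storage and push them into the cluster
# with rate limiting and per-document retries.
import json
import time

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def staged_actions(partition_path):
    # One JSON indexing request per line, as staged by a hypothetical sink.
    with open(partition_path) as f:
        for line in f:
            doc = json.loads(line)
            yield {"_index": "tweets", "_id": doc["id"], "_source": doc}

MAX_DOCS_PER_SEC = 2000  # assumed throughput cap for a live cluster

def run_backfill(partition_path):
    indexed = 0
    start = time.monotonic()
    # streaming_bulk retries individual failed documents with backoff
    # instead of failing the whole batch, protecting the live cluster.
    for ok, item in helpers.streaming_bulk(
        es,
        staged_actions(partition_path),
        chunk_size=500,
        max_retries=5,
        initial_backoff=2,
        raise_on_error=False,
    ):
        indexed += 1
        # Crude rate limit: never exceed MAX_DOCS_PER_SEC on average.
        expected = indexed / MAX_DOCS_PER_SEC
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)

run_backfill("/tmp/backfill/partition-0000.jsonl")
```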
The backfill can happen across all of Twitter’s data centers at once, which means less overhead and less effort to accomplish one backfill in multiple regions.
Conclusion
Twitter’s scale can push Elasticsearch to its limits. By adding custom solutions that address each challenge and target specific needs, the team refined the data flow into something better suited to Elasticsearch’s capacity.