Twitter Turns to Google Cloud for Real-Time Data Stream Analysis
Anytime someone takes an action on Twitter, from something as small as a click or a scroll to something as large as signing up or tweeting, Twitter logs it.
As a result, the social media service generates an enormous amount of data — “Tens of thousands of nodes, hundreds of petabytes, and trillions of events per day,” noted Daniel Templeton in his recent blog post. Engineers and data scientists look to turn each piece of user data into insight that could potentially make the “platform a better place for end customers,” says Templeton.
Surprisingly, the company has until now relied mainly on a batch processing system to take a first pass at the data. But a new company initiative born from an internal Hack Week project, Twitter Sparrow, has reduced the time it takes for user data to reach Twitter’s data science engineers from hours to minutes, and even seconds, allowing real-time data delivery for study and evaluation. The previous batch system prioritized throughput at the expense of latency.
The supporting streaming infrastructure was built in large part on Google Cloud. “The streaming first architecture on a scale of this magnitude pushed the boundaries of what was possible on Google Cloud,” Templeton said.
Before Twitter Sparrow
Before Twitter Sparrow, the log ingestion pipeline relied on a batching system that left data science engineers waiting several hours for fresh customer event data. This model was optimized for throughput rather than latency.
The pipeline itself carries interactions from hundreds of millions of customers who connect on Twitter through a variety of services, including web and mobile. This is where interactions between users, new signups, tweets, and everything in between are sent to data scientists for evaluation and study so they can “better understand our customers and see the public conversation,” says Templeton.
As the needs of the company and the technological landscape around Twitter kept changing, real-time data delivery became a priority in order to “move at the speed of Twitter. We want Twitter products to develop fast, understand things fast, fail fast, if something doesn’t work, and reiterate fast,” said Praveen Killamsetti, a staff software engineer at Twitter.
A visual representation of the pipeline before Sparrow
Sparrow started as a Twitter Hack Week project. The challenge Sparrow’s engineers initially set out to solve was twofold: (1) how to shorten the time it took to record and deliver data in the log pipeline, and (2) how to deliver the log data generated by tens of thousands of microservices spread across different data centers to analysts and engineers more efficiently.
Their approach was to redesign the batch ingestion pipeline around streaming technologies. Streaming data pipelines give data scientists access to fresh data in real time, while batch processing can be simpler and more efficient for historical data (e.g., running reports over a longer period of time). To meet both needs, the new pipeline was designed to include both batch and real-time streaming.
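To make the trade-off concrete, here is a minimal Python sketch of how long an event waits before delivery under each model. The function names, the one-hour batch window, and the two-second streaming overhead are illustrative assumptions, not figures from Twitter.

```python
def batch_delays(arrival_times, window):
    """Seconds each event waits when data ships only at the end of its window."""
    delays = []
    for t in arrival_times:
        # The event is delivered at the close of the window it arrived in.
        window_end = (t // window + 1) * window
        delays.append(window_end - t)
    return delays


def streaming_delays(arrival_times, per_event_overhead):
    """Seconds each event waits when it is forwarded as soon as it arrives."""
    return [per_event_overhead for _ in arrival_times]


# Hypothetical arrival times (seconds) and an assumed one-hour batch window.
arrivals = [0, 600, 3000]
batch = batch_delays(arrivals, window=3600)
stream = streaming_delays(arrivals, per_event_overhead=2)
```

In this toy model, an event that arrives at the start of a batch window waits the full hour, while every streamed event waits only the fixed per-event overhead. Batching amortizes work across many events, which is where its throughput advantage comes from.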
Twitter engineers developed a Streaming Event Aggregator that collects log events from services and passes them to a message queue such as Apache Kafka or Google Pub/Sub. A Streaming Event Processor then uses Apache Beam and Google Dataflow to read events from the message queue, apply transformations or format conversions, and stream the events into downstream storage systems such as Google BigQuery and Google Cloud Storage.
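The shape of that flow can be sketched with a toy, in-process model. This is not Twitter's code and does not use Apache Beam, Dataflow, Kafka, or Pub/Sub. As stated assumptions, Python's standard `queue.Queue` stands in for the message queue, plain lists stand in for the storage sinks, and the `transform` step (lowercasing the action and serializing to JSON) is a hypothetical format conversion.

```python
import json
import queue


def aggregate(raw_events, message_queue):
    """Stand-in for the aggregator: collect log events and enqueue them."""
    for event in raw_events:
        message_queue.put(event)


def transform(event):
    """Hypothetical format conversion: normalize the action, serialize to JSON."""
    normalized = {"user": event["user"], "action": event["action"].lower()}
    return json.dumps(normalized, sort_keys=True)


def process(message_queue, sinks):
    """Stand-in for the processor: drain the queue, transform, write to each sink."""
    while not message_queue.empty():
        record = transform(message_queue.get())
        for sink in sinks:
            sink.append(record)


# Hypothetical events; the two lists play the role of a warehouse and an object store.
events = [{"user": "a", "action": "CLICK"}, {"user": "b", "action": "Tweet"}]
mq = queue.Queue()
warehouse, object_store = [], []
aggregate(events, mq)
process(mq, [warehouse, object_store])
```

The queue separates the producers of events from the consumers, so each side can scale independently. In the real pipeline that role is played by a distributed message queue rather than an in-memory one.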
The new pipeline is now optimized for both throughput and latency and delivers data in real time, within seconds in some cases, to Twitter’s data science engineers for evaluation. Scaling the new pipeline and all of its supporting infrastructure took an enormous effort, but it appears to have been worth it: Twitter Sparrow met the objectives it originally set out to achieve.
Twitter’s data science community can now answer questions it previously could not. One specific example from Templeton’s article is analyzing user behavior on Twitter in real time, which Templeton said is now absolutely necessary for the company.