Going Real Time in AdTech: A Batch-to-Streaming Journey
Data collaboration platform LiveRamp helps companies build enduring brand and business value with technology solutions for customer intelligence, identity enhancement, cross-screen measurement, media networks and more.
I first joined LiveRamp 15 years ago as an intern. Today, I am responsible for LiveRamp’s data platform architecture, with particular emphasis on ingestion and activation — in other words, all the data coming into and out of LiveRamp. This includes customer data that is processed, masked and made available for hundreds of downstream systems.
Part of the value LiveRamp adds is connecting consumer data with durable and privacy-conscious identifiers. There is a lot of complexity in this: You might have offline attributes like names, addresses and phone numbers, and these can change over time. The same is true for online identifiers like IP addresses and device IDs. LiveRamp replaces these changing and inconsistent attribute sets with durable, secure and pseudonymized identifiers that are sustainable against evolving regulations and privacy policies.
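To make the idea concrete, here is a minimal sketch of one common pseudonymization approach: keyed hashing (HMAC) over a normalized attribute. This is an illustration of the general technique, not LiveRamp's actual method; the key name and normalization rules are assumptions.

```python
import hashlib
import hmac

def pseudonymize(attribute: str, secret_key: bytes) -> str:
    """Map a normalized PII attribute to a stable, non-reversible identifier."""
    # Normalization makes the identifier durable across formatting differences.
    normalized = attribute.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"example-rotation-key"  # hypothetical; real keys would be managed and rotated securely
id_a = pseudonymize("Jane.Doe@Example.com ", key)
id_b = pseudonymize("jane.doe@example.com", key)
print(id_a == id_b)  # → True: both spellings resolve to the same durable identifier
```

Because the output depends on a secret key, the identifier is consistent for matching across systems but cannot be reversed back to the raw attribute without that key.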
The LiveRamp data collaboration platform enables privacy-centric customer engagement.
From Batch to Streaming
While real-time data streaming with gigabytes per second throughput is the direction all our workloads are headed, currently we still manage a large number of batch processing systems built on Hadoop and Spark. That’s because the AdTech ecosystem in which a large portion of our business operates has traditionally been oriented around encrypted batch files, where systems read one record at a time.
However, the industry is changing rapidly. Our customers increasingly need real-time solutions for time-sensitive problems like cart abandonment and ad suppression. They also need to prepare for industry headwinds such as Google Chrome’s phasing out of third-party cookies in 2024, which will only increase demand for optimized identifiers.
So we embarked on a journey to modernize our data infrastructure from batch systems to streaming data systems. We knew it was going to be a gradual process, and we started with a newer use case, our pixel traffic application — a system that helps customers understand users’ web and mobile traffic trends. We also recognized that we would need to select the streaming data platform that would drive our real-time transformation.
Our Streaming Data Platform of Choice: Redpanda
LiveRamp developed our own batch systems, but in our migration to real-time data pipelines we wanted to enable faster development and more efficient collaboration with partners. This led us to seek a partner with a robust streaming data ecosystem around its APIs.
Redpanda fulfilled this need with a Kafka API-compatible platform offering everything needed to stream data — brokers, HTTP proxy, schema registry, Raft consensus and cluster balancing. And because Redpanda’s lean design consumes about one-third of the compute resources of comparable platforms, it’s much more cost-efficient.
In addition to simplicity and cost-efficiency, Redpanda brought other benefits to the table including:
- Platform neutrality: Everyone in our industry is trying to minimize data movement, so we need to be able to deploy solutions in all the major clouds to be where our customers are. Redpanda offers platform neutrality and deployment flexibility: We can run it self-managed in VMs or containers on our cloud of choice, or we can use it as a fully managed cloud service with Redpanda Cloud.
- Data privacy: Keeping data within our network boundaries is a key requirement, but we also wanted a fully managed experience. Redpanda’s Bring Your Own Cloud (BYOC) model helps us do that. With Redpanda BYOC, LiveRamp’s clusters remain in our own Virtual Private Cloud (VPC), while Redpanda manages the provisioning, monitoring and maintenance via their secure agent. Data never leaves our VPC.
- Performance: Our industry is extremely latency-sensitive, so we wanted a solution with performance at its core. Redpanda is built in C++ on the Seastar framework, with a thread-per-core implementation that ekes optimal performance out of modern hardware. It minimizes thread switching, bypasses the Linux page cache, maximizes parallel processing and uses direct memory access to make asynchronous disk IO more efficient. In our benchmarking, Redpanda loaded 2 billion messages in 50 minutes at a rate of 750,000 unique messages per second on a six-node n1-standard-16 cluster. We achieved a cluster-wide read throughput of 2.9 GBps and an average write throughput of 1.2 GBps, well over our requirement of 386 MBps. We were blown away by the results.
- Partnership: With our steep data privacy and performance requirements, we needed a provider who was also a close partner. The Redpanda team was willing to work closely with us, and as a result, a lot of our feedback and workarounds made it into the product itself. For example, we built a Terraform wrapper layer to automate the deployment of Redpanda in our environment, and our work was later incorporated by the Redpanda team into its own BYOC deployment architecture.
How It’s Going
We started with a system that collects pixel traffic for mobile applications. When that implementation proved successful, we next adopted it for our application monitoring tooling and saw improved reliability. We are now expanding our use of Redpanda across the organization to gradually support all incoming and outgoing data across the more than five hundred different platform integrations we manage.
Redpanda currently sits on the edges of LiveRamp. The producers read from files and put messages on Redpanda, and consumers read the data and make outbound calls. Consumers can be simple Go or Java processes that can massage data for specific platforms like Facebook. In this way, we store data from the edges and use it where it makes sense. We also push data from Redpanda to our internal warehouse built on SingleStore, where it becomes available for analytical and measurement use cases.
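The consumer-side "massaging" step described above amounts to reshaping an internal record into the payload a specific platform's API expects. Our consumers are Go or Java processes; the sketch below shows only that transform step in Python for brevity, and all field names are hypothetical, not LiveRamp's actual schema.

```python
import json

def massage_for_platform(record: dict) -> str:
    """Reshape an internal event into the JSON body a hypothetical outbound platform call expects."""
    payload = {
        "user_id": record["pseudonymous_id"],          # durable identifier, never raw PII
        "event": record.get("event_type", "unknown"),  # tolerate missing optional fields
        "ts_ms": int(record["timestamp_s"] * 1000),    # convert seconds to the platform's milliseconds
    }
    return json.dumps(payload, sort_keys=True)

msg = {"pseudonymous_id": "abc123", "event_type": "page_view", "timestamp_s": 1700000000.5}
print(massage_for_platform(msg))
# → {"event": "page_view", "ts_ms": 1700000000500, "user_id": "abc123"}
```

In production, a function like this would sit inside a consumer loop reading from a Redpanda topic via any Kafka-compatible client, with the returned payload handed to the platform's HTTP API.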
This event-driven architecture eliminates the complexity of our legacy batch system, which required an intricate architecture for parallel processing. As a result, we’ve boosted engineer productivity and made our simplified codebase easier to maintain. We’ve also significantly lowered our infrastructure costs and reduced our carbon footprint thanks to Redpanda’s hardware-efficient design.
What’s In Store: A World with No Cookies, Wasm, Simpler Streaming
If there’s a constant in our industry, it’s change. It’s been widely reported that Chrome is phasing out third-party cookies in 2024. LiveRamp has been preparing for this future for more than five years by helping publishers, marketers and the ecosystem at large transition to addressable audiences without relying on third-party cookies or mobile identifiers. Redpanda will remain a key partner for safely syncing data as LiveRamp advances its offline reference graph, which has thousands of data sources and over a billion syncs a day.
As streaming data continues to become the heart of our infrastructure, we’re looking to further simplify operations with WebAssembly (Wasm). With Wasm transformations running directly in Redpanda, we’ll be able to read data, prepare messages and make API calls without the “data ping pong” of shuttling records out to external processing systems and back.
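At its core, an in-broker transform is a pure function over a message value that runs where the data already lives. The sketch below shows only that per-record logic (Redpanda's actual transform SDK and registration API are not shown); it assumes JSON-encoded values, and the field names are hypothetical.

```python
import json

# Fields that may leave the broker; everything else (e.g., raw emails) is dropped in place.
ALLOWED_FIELDS = {"pseudonymous_id", "event_type", "timestamp_s"}

def transform(value: bytes) -> bytes:
    """Strip raw identifiers from a record before it is republished to an output topic."""
    record = json.loads(value)
    cleaned = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    return json.dumps(cleaned, sort_keys=True).encode("utf-8")

raw = b'{"pseudonymous_id": "abc123", "email": "jane@example.com", "event_type": "page_view"}'
print(transform(raw))  # the raw email never leaves the broker
```

Running this kind of function broker-side means the cleaning step no longer requires a round trip through a separate consumer and producer.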
In the meantime, we’re excited to continue on our batch-to-streaming journey and help pioneer the real-time evolution of the AdTech industry.