
Rudderstack’s Smart Data Pipeline Could Help Move, Transform, and Store Customer Data

16 Feb 2021 12:50pm

Building applications in the cloud native paradigm implies many things, with development agility, high availability, and scalable compute and storage core among them. Cloud native applications are also API-driven, with microservices interacting via APIs. Modern web applications take a similar form: individual software-as-a-service (SaaS) applications act like microservices, communicating with each other and sharing information via APIs. Like cloud native applications, they can also separate storage from compute, which can be a defining feature. A modern organization providing a service on the internet might have data spread across numerous disparate services, often siloed and inaccessible. Rather than dedicating developer effort to uncovering and connecting that data, organizations can use data pipelines, which not only collect the data but also transform it as it moves from one cloud-based application to another on its way to a data warehouse, where it can be put to further use.

Rudderstack provides a “smart customer data pipeline” that looks to relieve developers of this effort, offering software development kits (SDKs) for nearly a dozen languages and frameworks, and plugins for more than 70 data sources and destinations, including data lakes, analytics tools, CRMs, and advertising platforms. With Rudderstack, users take event stream data from various sources to build identity graphs of customers, deliver personalized experiences, and track users through sales funnels.

Gavin Johnson, a product marketer with Rudderstack, explained in an interview how Rudderstack helps users move and transform data in order to build a customer identity graph.

“We help people build real-time streaming and ELT [extract, load, and transform] pipelines for their customer data. We store their customer data, if they choose to. We build their identity graph, so they can do identity resolution in their data warehouse as well, and then we give them the tools to be able to take their analysis that they do in their data warehouse and send that analysis and the insights from it to any of the other tools so they can make smart actions based on it,” said Johnson. “The general use case that people use this for are data streaming and warehousing — taking the event data from their website, using our instrumentation on their website in their applications, sending that event data to their warehouse, but also sending it to other applications like Google Analytics, Firebase, and Salesforce.”

Johnson further described how data pipelines replace developer effort and take advantage of this separation of storage and compute:

“Data Engineers are being asked to build integrations out of their data warehouses to many of the same tools (e.g. warehouse -> CRM, warehouse -> Marketing Automation), because the ability to data warehouse and then apply and automate data modeling has expanded greatly. Snowflake and, generally, the separation of compute and storage in cloud data warehouses has made it extremely inexpensive to store a lot of data that you don’t access frequently,” Johnson wrote in an email.

“Tools built on top of them like dbt make it easy to consistently, repeatedly apply a non-destructive data model to the data in your warehouse. So you can automate things like audience segmentation or lead scoring, and the Marketing and Sales teams that could really use that modeled data want it in their systems. Generally, these integrations are called data pipelines.”
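
To make the warehouse-to-CRM idea concrete, here is a minimal TypeScript sketch of lead scoring and segmentation over modeled rows. It is illustrative only: in practice a tool like dbt would express the model in SQL inside the warehouse, and the row shape, scoring rule, thresholds, and function names below are all hypothetical rather than part of Rudderstack or dbt.

```typescript
// Hypothetical row from a modeled warehouse table; the column names are assumptions.
interface LeadRow {
  email: string;
  pageViews: number;
  trialStarted: boolean;
}

// Hypothetical payload that a CRM integration might accept.
interface CrmUpdate {
  email: string;
  leadScore: number;
  segment: "hot" | "warm" | "cold";
}

// Toy scoring rule: one point per page view (capped at 50), plus 50 for starting a trial.
function scoreLead(row: LeadRow): number {
  const viewPoints = Math.min(row.pageViews, 50);
  return viewPoints + (row.trialStarted ? 50 : 0);
}

// Maps a score to an audience segment using arbitrary example thresholds.
function toSegment(score: number): "hot" | "warm" | "cold" {
  if (score >= 70) return "hot";
  if (score >= 30) return "warm";
  return "cold";
}

// Turns modeled rows into the records a pipeline would push to the CRM.
function buildCrmUpdates(rows: LeadRow[]): CrmUpdate[] {
  return rows.map((row) => {
    const leadScore = scoreLead(row);
    return { email: row.email, leadScore, segment: toSegment(leadScore) };
  });
}
```

Because the scoring runs over data already in the warehouse, the same rows can be rescored whenever the rule changes, which is the repeatable, non-destructive modeling Johnson describes.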

Built to run natively on Kubernetes and written in Go, Rudderstack is, at its core, an open source project licensed under the AGPL. For those who don’t want to install and manage the tool, Rudderstack is also available as a SaaS offering, still delivering the scalability of Kubernetes on the backend but providing a web-based GUI on the frontend. Rudderstack consists of several core components that together provide its data pipeline functionality: instrumentation to gather event stream data, ELT pipelines to gather non-event data at scheduled intervals, and the ability to transform data, whether at the data warehouse or in transit.

On this last point, the company offers two distinct features. First, transformations are JavaScript functions that manipulate live event stream data as it flows through Rudderstack, enabling users to mask personally identifiable information (PII), filter out certain events, and otherwise change data before it arrives at its destination. Second, the recently introduced Warehouse Actions feature allows data engineers to use their warehouse as a data source for their whole customer data stack, analyzing and manipulating data at the warehouse before sending it out to other destinations.
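
As a rough illustration of what an in-transit transformation does, the TypeScript sketch below masks PII fields and drops unwanted events from a batch. It does not use Rudderstack's actual transformation API: the event shape, the list of PII keys, the blocked event name, and the function names are all assumptions made for the example.

```typescript
// Hypothetical event shape; real event payloads carry more fields.
interface TrackEvent {
  event: string;
  userId: string;
  properties: Record<string, unknown>;
}

// Property keys treated as PII in this sketch (an assumption, not a product default).
const PII_KEYS: string[] = ["email", "phone"];

// Event names to drop before they reach any destination (also an assumption).
const BLOCKED_EVENTS: string[] = ["Heartbeat"];

// Returns a masked copy of the event, or null to signal that it should be dropped.
function maskAndFilter(input: TrackEvent): TrackEvent | null {
  if (BLOCKED_EVENTS.includes(input.event)) {
    return null;
  }
  // Copy the properties so the original event is left untouched.
  const properties: Record<string, unknown> = { ...input.properties };
  for (const key of PII_KEYS) {
    if (key in properties) {
      properties[key] = "[REDACTED]";
    }
  }
  return { ...input, properties };
}

// Applies the transformation to a batch, discarding dropped events.
function transformBatch(events: TrackEvent[]): TrackEvent[] {
  const kept: TrackEvent[] = [];
  for (const e of events) {
    const out = maskAndFilter(e);
    if (out !== null) {
      kept.push(out);
    }
  }
  return kept;
}
```

Returning a copy rather than mutating the input keeps the step free of side effects, so the same raw event could still be delivered unmasked to a destination that is permitted to see it.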

Regarding the warehousing of data, Johnson points out that Rudderstack does not host the data itself, instead allowing users to choose their existing data warehouse. This, he says, helps with both the security and the regulatory compliance issues that might otherwise surface.

“Our opinion is that if it’s your customer data, it should be yours. You shouldn’t have to have that live in somebody else’s infrastructure. That opens you up to security risks. And frankly, because it’s customer data, there are a lot of people that are very, very sensitive to that living in vendor’s infrastructure. So instead of building your customer data warehouse or your identity graph on our infrastructure, we integrate with cloud data warehouses and noncloud data warehouses like Postgres, Snowflake, BigQuery, and Amazon Redshift,” said Johnson. “We don’t need to sit in the middle like that. You’re unnecessarily replicating data and opening up more exposure to risk for your customers’ data. It’s unnecessary and frankly, it’s just gonna cost you more money in the long run.”

For LoveHolidays.com, a travel booking site and Rudderstack customer of six months, Rudderstack serves to replace more manual processes previously based on Google Analytics. According to David Annez, head of engineering at LoveHolidays, the company uses Rudderstack to analyze every aspect of user flow and metrics, including A/B testing, conversions, and any other metric they may want to examine.

“We use Rudderstack as the collection of all of the events of a customer that lands on our site. That’s how we identify whether you have come from Google Ads, or come direct, and all of that data Rudderstack then pumps that into our data store, which is Google Cloud BigQuery, in this instance. We use models and machine learning as well on top of that to then essentially output our attribution model,” explained Annez.

Not only did Rudderstack help automate a more cumbersome process, said Annez, but it also provided finer-grained customer information. Using the ELT pipeline, raw Google Analytics data could be gathered far more frequently than the daily cadence the company had previously relied on.

“Google would take too long actually dumping all of that data to us. Rudderstack just completely changed that for us,” said Annez. “When we changed over to Rudderstack, we can enhance the data with a lot more information. And, additionally, it’s available to us to analyze within fifteen minutes of a user having done that thing on a site, which gives you a much, much greater granularity, and then starts letting you do intra-day reporting versus daily reporting.”

While LoveHolidays toyed with the idea of running Rudderstack natively on its own Kubernetes cluster, it decided to outsource that aspect and go with the SaaS version instead. Running Rudderstack on Kubernetes involved only installing the application with a Helm chart, said Annez. For now, LoveHolidays is using Rudderstack primarily for these sorts of user tracking features, but Annez said further personalization efforts were in the works, with the company collaborating with Rudderstack to make that happen. Regarding potential use cases, Johnson commented that there are many beyond those the company already sees and knows of.

“When people ask about use cases, we have a list of use cases that can walk them through, but we don’t even know all the use cases. There’s just so many things that these data scientists and data engineers are able to do that they can’t effectively use other tools right now without it being a very manual process,” said Johnson.
