Redefine Customer Data Analytics Using an Open Source Stack

6 Jan 2021 10:39am, by
Developer Advocate
Nica Fee helps teams adopt serverless and optimize their costs on AWS. She is a Serverless Developer Advocate for New Relic.

In this post, we will talk about how you can build your entire customer data stack using open source tools, without compromising on the security of your data or on the time it takes to produce effective analytics from it.

Today, data is the fuel that drives key operational decisions in an organization. As your data volume grows, however, managing it becomes increasingly tricky. It becomes equally challenging to retrieve insights from all the incoming data; often only part of it is analyzed, which leaves the analysis incomplete. Having a robust data infrastructure with tools that let you easily manage data at scale and use it for efficient analytics is more important now than ever. This is why more and more companies are turning to an analytics stack.

A data analytics stack enables teams across an organization to look at important metrics and make data-driven decisions. It integrates different technologies needed to efficiently collect, store, transform, and analyze your data to derive critical insights from it.

When it comes to using an analytics stack, businesses are often faced with two choices: buy a proprietary tool, or build an open source analytics stack from scratch. While proprietary tools offer best-in-class analytics and data management services, they also have some major downsides, including premium pricing plans, vendor lock-in, and limited flexibility.

For these reasons, many companies prefer to build an open source analytics stack that caters to their specific business needs.

Why an Open Source Analytics Stack?

An open source analytics stack offers several important advantages over proprietary analytics tools.

Businesses often work with tight budgets, and open source solutions let them start small and scale while exploring other open source tools. The enterprise versions of these open source products are also reasonably priced compared with proprietary solutions.

Open source products offer more flexibility in the tools you use to build your stack. This encourages teams to innovate and gives them the freedom to use features that would otherwise be available only in paid enterprise versions. Also, because your open source product runs within your own cloud or on-prem environment, you keep full control of your data. You can implement access policies that decide who can access this data and when.

Proprietary tools make you heavily dependent on the vendor for updates, bug fixes, and more. An open source product in the analytics stack, on the other hand, is maintained by a community of developers, so updates and bug fixes can often be rolled out faster without relying on a single individual or a small group of developers.

As we've seen, an open source analytics stack can be a better option for working with your customer data, and it lets the engineering team focus on building better products.

What Does a Great Open Source Analytics Stack Look Like?

A great analytics stack should be able to:

  • Integrate data (in different formats) sitting within multiple platforms
  • Ingest data into a storage system (a data warehouse)
  • Clean and transform data for different use cases
  • Use transformed data for analytics like visualization or machine learning

Here's what an ideal open source analytics stack would look like:

Our goal is to help you understand how replacing your entire data analytics stack with completely open source solutions can help your business scale with minimal costs and a high level of security.

What Is an Open Source Analytics Stack Made of?

Almost all data analytics systems follow the same basic stages when setting up their analytics stack: data collection, data processing, and data analytics. The tools used to perform each of these stages form the analytics stack. An open source analytics stack is no different; it simply uses open source tools to obtain the same results that proprietary tools offer, in some cases with better functionality.

Let's look at each of these stages in detail and at how open source tools contribute to each one.

Data Ingestion and Transformation

The first step in collecting data for analytics is to ingest it from all your sources, including in-house applications, SaaS tools, and IoT devices. Various tools are available to make this process seamless.

ETL vs ELT

Until recently, data ingestion followed a simple ETL (Extract, Transform, and Load) process in which data was collected from a source, reshaped to fit the properties of a destination system or business requirements, and then loaded into that system. Building ETL tools in-house means taking developers away from user-facing products, and it can put the accuracy, availability, and consistency of the analytics environment at risk. While commercially packaged ETL solutions are available, an open source alternative is a great option. One such example is Singer, an open source ETL tool used to write connectors that send data between custom sources and targets such as web APIs and files.
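
To make the connector idea concrete, here is a minimal sketch of a Singer-style source (a "tap"). It assumes the message shapes described in the Singer specification: SCHEMA, RECORD, and STATE messages, each written as one JSON object per line. The `customers` stream, its fields, and the `last_id` bookmark are hypothetical, and in practice you would more likely reuse an existing tap than hand-roll the JSON.

```python
import json
import sys

# Hypothetical source rows; a real tap would read these from an API or database.
CUSTOMERS = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "lin@example.com"},
]


def build_messages(rows, stream="customers"):
    """Return Singer-style messages for one stream: SCHEMA, RECORDs, then STATE."""
    schema = {
        "type": "SCHEMA",
        "stream": stream,
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": "string"},
            },
        },
    }
    records = [{"type": "RECORD", "stream": stream, "record": row} for row in rows]
    # The STATE value is free-form; here it bookmarks the highest id emitted.
    last_id = max((row["id"] for row in rows), default=None)
    state = {"type": "STATE", "value": {stream: {"last_id": last_id}}}
    return [schema, *records, state]


def emit(messages, out=sys.stdout):
    """Write one JSON message per line, the format a target reads from stdin."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    emit(build_messages(CUSTOMERS))
```

Because the source only writes lines of JSON, it can be piped into any destination program that understands the same messages, which is what keeps sources and targets interchangeable.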

With the rise of cloud-based data warehouses, businesses can load all their raw data directly into the data warehouse without prior transformations. This process is known as ELT (Extract, Load, Transform), and it gives data and analytics teams the freedom to develop ad-hoc transformations based on their particular needs. ELT became popular because the cloud's processing power and scale could be used to transform the data. dbt is a popular open source tool for the transformation step of ELT and allows businesses to transform data in their warehouses more effectively.
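
The ELT pattern can be illustrated with a small sketch in which Python's built-in SQLite stands in for a cloud warehouse (an assumption made only to keep the example self-contained). Raw events are loaded unchanged, and the transformation is then expressed as SQL that runs inside the database, which is the kind of model a tool like dbt manages for you. The table and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw events: (user_id, event, amount)
RAW_EVENTS = [
    ("u1", "purchase", 30.0),
    ("u1", "purchase", 12.5),
    ("u2", "page_view", 0.0),
    ("u2", "purchase", 8.0),
]


def load_raw(conn, rows):
    """Extract + Load: store events as-is, with no transformation."""
    conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: build a derived table inside the warehouse using SQL."""
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT user_id, COUNT(*) AS purchases, SUM(amount) AS revenue
        FROM raw_events
        WHERE event = 'purchase'
        GROUP BY user_id
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_EVENTS)
transform(conn)
rows = conn.execute(
    "SELECT user_id, purchases, revenue FROM customer_revenue ORDER BY user_id"
).fetchall()
print(rows)  # [('u1', 2, 42.5), ('u2', 1, 8.0)]
```

Since `raw_events` is kept untouched, a new transformation can be added later without re-ingesting anything from the source.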

Real-time Data Streams

With the increase in real-time data streams and event streams, certain use cases, such as risk reporting in financial services or detecting credit card fraud, require access to real-time data. Real-time streams can be handled with an event streaming platform like Apache Kafka. The goal is to direct the stream of data from various sources into reliable queues where it can be automatically transformed, stored, analyzed, and reported on concurrently.
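
The queue-based flow described here can be sketched with nothing but the Python standard library. In this sketch an in-process `queue.Queue` stands in for a Kafka topic, so it does not use or imitate Kafka's client API, and the fraud rule (a fixed amount threshold) is a hypothetical placeholder for a real detection model.

```python
import queue
import threading

FRAUD_THRESHOLD = 1000.0  # hypothetical rule: flag unusually large transactions


def produce(topic, transactions):
    """Push events onto the queue, then a sentinel marking the end of the stream."""
    for tx in transactions:
        topic.put(tx)
    topic.put(None)


def consume(topic, alerts):
    """Process each event as it arrives rather than waiting for a batch."""
    while True:
        tx = topic.get()
        if tx is None:
            break
        if tx["amount"] > FRAUD_THRESHOLD:
            alerts.append(tx["card"])


def run(transactions):
    """Run a producer and a consumer concurrently and return the flagged cards."""
    topic = queue.Queue()
    alerts = []
    consumer = threading.Thread(target=consume, args=(topic, alerts))
    consumer.start()
    produce(topic, transactions)
    consumer.join()
    return alerts
```

Calling `run` with a list of dictionaries that have `card` and `amount` keys returns the cards whose transactions exceeded the threshold, in the order they were seen.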

Customer Data Platform (CDP)

Among data ingestion tools, businesses increasingly rely on Customer Data Platforms (CDPs), which track, collect, and ingest data from multiple sources and systems into a single platform to provide a unified customer view. Apache Unomi is an example of an open source CDP that ingests data and collects it in one place.

CDPs have since evolved and are now designed for the needs of today's marketers. Modern CDPs like Snowplow and RudderStack ingest data from a multitude of sources and also route it to databases or other preferred destinations for your activation use cases.
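
The two CDP jobs mentioned in this section, building a unified customer view and routing events to destinations, can be sketched in a few lines. This is an illustration under our own assumptions and not how Apache Unomi, Snowplow, or RudderStack are implemented; the event fields (`user_id`, `source`, `traits`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical events arriving from different sources for the same customers.
EVENTS = [
    {"user_id": "u1", "source": "web", "traits": {"email": "ada@example.com"}},
    {"user_id": "u1", "source": "mobile", "traits": {"plan": "pro"}},
    {"user_id": "u2", "source": "web"},
]


def unify(events):
    """Merge events from several sources into one profile per customer id."""
    profiles = defaultdict(lambda: {"traits": {}, "sources": set(), "events": 0})
    for event in events:
        profile = profiles[event["user_id"]]
        profile["traits"].update(event.get("traits", {}))
        profile["sources"].add(event["source"])
        profile["events"] += 1
    return dict(profiles)


def route(events, destinations):
    """Fan each event out to every destination; a destination is any callable."""
    for event in events:
        for send in destinations:
            send(event)


profiles = unify(EVENTS)
warehouse, crm = [], []  # stand-ins for real destinations
route(EVENTS, [warehouse.append, crm.append])
```

Here `profiles["u1"]` combines the traits seen on web and mobile, while both stand-in destinations receive every event.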

Data Warehouses

This is the next important piece of the analytics stack. Data warehouses act as a common repository where companies store data collected from different sources, and where it can be transformed or combined for different use cases. Data warehouses store both raw and transformed data and can be made easily accessible to employees across an organization. Traditional databases were designed to store data for specific domains like finance, human resources, and so on, which resulted in huge data silos and disconnected data. Over the years, as cloud data warehousing has taken root, more and more companies have been migrating from on-premises systems to modern data warehouses.

Moreover, open source warehouse tools can unlock additional insights from your data in real time and at lower cost. PostgreSQL is a popular example of an efficient, low-cost data warehousing solution. Another example is ClickHouse, which allows you to generate analytical reports from data in real time.

Data Consumers

After your data is ingested and transformed, it is sent to different platforms for analytics, so you can get more out of your data. There are various tools available for different analytics needs. Proprietary tools often do not let you make full use of your data unless you buy their enterprise version. We have curated a few open source tools that suit different kinds of analytics on your data.

Matomo is an open source web analytics tool that positions itself as a Google Analytics alternative. Matomo gives you valuable insights into your website's visitors, marketing campaigns, and more, making it easier to optimize your strategy and the online experience of your visitors.

The self-hosted PostHog is an excellent open source alternative for product analytics and can be easily integrated into your infrastructure. You can easily analyze how customers interact with your product, the user traffic, and ways to improve your user retention.

Countly is also an open source product analytics platform that heavily targets marketing organizations. It helps marketers track website information (website transactions, campaigns and sources that led visitors to the website, etc.). Countly also collects real-time mobile analytics metrics like active users, time spent in-app, customer location, etc. in a unified view on your dashboard.

Business Intelligence

Business intelligence has become prevalent in nearly every organization to get a regular health check on their business operations. BI provides businesses with excellent ways to analyze their historical data, apply learnings to their current operations, and make better-informed business decisions for their future. Every business is different with different goals, so choosing a BI tool that exactly fits the use case is essential.

With self-service dashboards, business leaders can fully leverage BI tools to understand the impact of their decisions on the business. BI tools also support ad-hoc analysis with customizable features such as data filters and grouping to find interesting trends. Open source BI platforms such as Apache Superset and Metabase are easy to deploy without IT involvement. Metabase allows you to ask questions about your data and returns data visualizations as output. Similarly, Apache Superset helps businesses explore and visualize data with anything from simple line charts to detailed geospatial charts. Businesses can easily connect these tools to any set of transformed data within the warehouse to obtain the desired results.
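
The filter-and-group style of ad-hoc analysis mentioned above boils down to a very small operation. The sketch below shows it in plain Python over hypothetical order rows; a BI tool would typically generate an equivalent SQL query against the warehouse rather than loop over rows like this.

```python
from collections import defaultdict

# Hypothetical transformed rows, as they might sit in a warehouse table.
ORDERS = [
    {"region": "EU", "channel": "web", "revenue": 120.0},
    {"region": "EU", "channel": "mobile", "revenue": 80.0},
    {"region": "US", "channel": "web", "revenue": 200.0},
    {"region": "US", "channel": "web", "revenue": 50.0},
]


def filter_and_group(rows, predicate, key, value):
    """Keep rows matching `predicate`, then total `value` per distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        if predicate(row):
            totals[row[key]] += row[value]
    return dict(totals)


# "How much web revenue did each region bring in?"
web_revenue_by_region = filter_and_group(
    ORDERS, lambda row: row["channel"] == "web", "region", "revenue"
)
print(web_revenue_by_region)  # {'EU': 120.0, 'US': 250.0}
```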

Using Machine Learning for Analytics

Many companies have not yet fully implemented this more advanced form of analytics, but when it is used, it can add value to your data. Machine Learning (ML) lets you feed transformed or modeled data into platforms such as KNIME, which works with open source tools like R and Python, to train, evaluate, and deploy models. These models can then be integrated with the company's existing products for customer-facing features such as a recommendation engine and other ML/AI use cases.

Conclusion

Migrating from tools you have worked with to a completely open source stack can be challenging. However, as data evolves, businesses evolve and their needs change, and at some point you will have to look for new tools to scale and grow. We recommend trying open source tools: they are reliable, and they bring the added advantages of cost, flexibility, and control over your data discussed earlier in this post.

Feature image by David Mark from Pixabay

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Metabase, Real.
