
Explore and Visualize Data the Apache Superset Way

11 Feb 2021 12:00pm

For those working on Superset, the Apache Software Foundation’s new top-level project, graduating from the Incubator wasn’t foremost in their minds.

“That was quite a long road for us,” said Maxime Beauchemin, the project’s vice president, of the business intelligence tool, which entered the Apache Incubator in 2017. “Graduating was not necessarily a priority for the members in the project. The goal was really to push the project forward. We also wanted it to stay in this pre-1.0 phase where we’re still actively developing the software.”

“I think we’ve really set a really solid foundation for the software and for the community and for the governance of the community,” Beauchemin added.

Superset enables users to explore data and build visualizations using a no-code visualization builder and SQL editor. It competes with tools like Tableau, Looker, Chartio and others.

Superset was born at Airbnb at a three-day hackathon in 2016 as a challenge to build a front end for the Apache Druid database. Druid is a real-time, in-memory database that’s super-fast, Beauchemin said, but he quickly realized Superset needed to speak SQL. Today, Superset supports a range of databases, including MySQL, Presto, Hive, Postgres, Dremio, Snowflake, Teradata and other data sources at petabyte scale.

Beauchemin, then working at Airbnb, said he extended the SQL support over a weekend.

“That opened up all sorts of new connectors,” he said. “And since the Superset back end is written in Python — it’s essentially a Flask application to the Python back end … there’s a lot of driver support for all databases within Python. So we took advantage of that.”
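The driver point is easy to illustrate. Most Python database drivers implement the same DB-API 2.0 interface (PEP 249), so one query function can serve many back ends. The sketch below is not Superset code: `run_query` is a hypothetical helper, the `bookings` table is made-up sample data, and the standard library's `sqlite3` module stands in for drivers for MySQL, Postgres and the rest.

```python
import sqlite3


def run_query(connection, sql, params=()):
    """Run a query on any DB-API 2.0 connection; return column names and rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql, params)
        # cursor.description holds one 7-item sequence per result column;
        # the first item is the column name.
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows


# In-memory SQLite database used purely as a stand-in back end (an assumption
# for this example, not a database named in the article).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, nights INTEGER)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?)",
    [("Paris", 3), ("Lisbon", 5), ("Paris", 2)],
)

columns, rows = run_query(
    conn,
    "SELECT city, SUM(nights) AS total_nights "
    "FROM bookings GROUP BY city ORDER BY city",
)
```

Swapping in another driver would mostly mean changing how `conn` is created, although placeholder styles for query parameters do differ between drivers.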

Superset features include:

  • An intuitive interface for visualizing datasets and crafting interactive dashboards.
  • A SQL IDE for preparing data for visualization, including a rich metadata browser.
  • A lightweight semantic layer enabling data analysts to quickly define custom dimensions and metrics.
  • Seamless, in-memory asynchronous caching and queries.
  • An extensible security model that allows configuration of very intricate rules on who can access which product features and datasets.
  • An API for programmatic customization.
  • Cloud native architecture specifically designed for scale.
  • Notification alerts and scheduled reports.

It includes a visualization picker, which allows the user to click on one type of visualization, then easily switch to a different one simply by clicking on it.

Each chart links to the dataset underlying that particular visualization, and users can add different columns and metrics to it. In the query panel, they can define the search parameters, breaking data down by different metrics or a different time granularity.
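Conceptually, picking a metric, a breakdown column and a time granularity amounts to assembling a grouped SQL query. A minimal sketch of that idea follows. It is not how Superset generates SQL: `build_query`, `TIME_GRAINS` and the `sales` table are hypothetical, and SQLite's `strftime` is assumed for time bucketing.

```python
import sqlite3

# Hypothetical mapping from a time granularity to a SQLite strftime format.
TIME_GRAINS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def build_query(table, time_column, grain, metric, group_by=None):
    """Assemble a grouped query from a metric, a time grain and an optional dimension.

    Identifiers are interpolated directly into the SQL text, so they must come
    from trusted code, never from user input.
    """
    bucket = f"strftime('{TIME_GRAINS[grain]}', {time_column})"
    select = [f"{bucket} AS period"]
    groups = ["period"]
    if group_by:
        select.append(group_by)
        groups.append(group_by)
    select.append(f"{metric} AS metric_value")
    return (
        f"SELECT {', '.join(select)} FROM {table} "
        f"GROUP BY {', '.join(groups)} ORDER BY {', '.join(groups)}"
    )


# Made-up sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2021-01-05", "EU", 10.0),
        ("2021-01-20", "EU", 5.0),
        ("2021-02-03", "US", 7.5),
    ],
)

# Same metric, two views: by month only, then by month and region.
monthly = conn.execute(
    build_query("sales", "sold_at", "month", "SUM(amount)")
).fetchall()
by_region = conn.execute(
    build_query("sales", "sold_at", "month", "SUM(amount)", group_by="region")
).fetchall()
```

Changing the `grain` argument from `"month"` to `"day"` or `"year"` re-buckets the same rows without touching the metric definition, which is the kind of switch the query panel exposes through the interface.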

“We’re interested in making it easy for people to get to a dashboard quickly. So we really care about metrics, like how long it takes for people to ramp up on product and to create their first dashboard,” Beauchemin said. “So we’re spending a lot of cycles thinking about time to value and how to deliver a maximum of the common features without confusing people with the long tail of all the features that may exist in something like Tableau.”

The project is also focused on the tool’s different users.

“I think from a product standpoint, part of our philosophy is to offer a full range to cater to all levels of sophistication as data teams. So if you’re a business, if you’re an executive, maybe you’re just looking at the dashboard. If you’re a business analyst, maybe you’re interested in using the slice and dice and a code-free Data Explorer. If you’re a data scientist or a data analyst, maybe you write a little bit of SQL, and you’re interested to use the SQL IDE. So we tried to cater to the entire data team and provide a comprehensive set of tools across the board,” he said.

The project has quite a few subprojects. Of its roadmap, Beauchemin said, “Really, the idea is to push the different verticals forward. So the SQL IDE, for instance, we want to make it easier for people to visualize from within that context. We’re adding a lot of features around drag and drop, and usability in the slice-and-dice explorer that we have. And around the dashboard, just smoothing out a lot of the common user flows, making it easier for people to create and update dashboards.”

Superset’s users include American Express, Dropbox, Lyft, Netflix, Nielsen, Twitter, and Udemy, among others.

Beauchemin said he believes being open source is the better way to develop and distribute software, which he touts as its main advantage over competitors.

“Before Superset, we were paying for a patchwork of proprietary tools, and we kept running into limitations when it came to customizing charts and dashboards,” said Amit Miran, software team lead for the Media Application Framework group at Nielsen. “Once the Superset project supported adding custom visualizations, that was the turning point for us at Nielsen to start adopting Superset in large projects. We’re very excited about native dashboard filters and future support for cross-filtering, which will make our viz plugins even more powerful.”

“Apache Superset helps Airbnb democratize data insights and make data-informed decisions,” said Jeff Feng, product lead at Airbnb and member of the Apache Superset Project Management Committee. “Superset uniquely connects SQL analysis with data exploration for thousands of employees each week. It also serves as a flexible and reliable platform for visualizing metrics, helping executives and knowledge workers see and understand data.”

 

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: Udemy, Dremio, Real, Bit.
