Open Source Leaders: Matei Zaharia, Apache Spark

13 Jun 2017 6:00am

The following interview is the first in a series, called Open Source Leaders, where we profile project leaders in the open source IT community, to learn more about how they developed their software as well as the challenges and benefits that come with running an open source project.  

When the Apache Hadoop Project began making waves back in 2011, the world congregated at the altar of MapReduce as the newest, best way to deal with large batch processing jobs across large data sets. Over time, however, the bloom has fallen off the MapReduce rose, as Java developers were sucked up by the Hadoop team in order to turn data scientist requests into workloads.

Enter the Apache Spark Project. Originated at the University of California, Berkeley, by then-graduate student Matei Zaharia, Spark has grown to become the new Hadoop, replacing a great many MapReduce use cases in only three short years. The reasons are obvious: Spark is simpler, faster, and makes it easier to get to those higher-level goals of big data projects: machine learning, SQL support, and stream processing.

At the Spark Summit in San Francisco the second week of June, Zaharia, now CTO of Databricks, the company behind Spark, took to the stage to show off some interesting use cases, and to help announce a new machine learning library and a serverless option for running Spark. We caught up with him at the show to find out about the current market, and to see just how the big data world was realizing its potential.

You discuss the usage of Spark in genomics in your keynote. How has it changed genetic decoding work for the better?

People have gotten better at analysis. There’s this group called the Broad Institute. It has the biggest gene sequencing center in the US. It writes the standard tools and pipeline for that, and all of that runs on Spark.

What are the biggest changes for them?

I think the scale and the cost of sequencing, which is going down a lot. Before, when I was doing this in 2013, there weren’t any real cases where medical outcomes were changed due to sequencing. Now there are many of those. One place where that’s successful is cancer, where each cancer develops different mutations. If they know which one it is, they can choose a drug to target that cancer.

The other one is where you’ve got unknown pathogens, like Zika or SARS. There are these labs that work very quickly to ID these and see what DNA is there, and to reconstruct those pieces like a puzzle.

Is Spark replacing MapReduce when it’s being adopted by your customers?

I think that’s true. It’s replacing MapReduce. New workloads just aren’t happening in MapReduce. It’s too complex and it’s a hassle, and often performance is not so good. I think the other thing that happened is that, because Apache Spark is easy to use and plugs into Python, you get brand new users who wouldn’t have used it before. They’re not Java engineers, but they’re able to use it. In all these companies, even the most hardcore Hadoop vendors, they’ll say, ‘For new projects, we recommend not using MapReduce.’ At the high level, it’s still letting you do the same kinds of things.

Are more data scientists coming into the pipeline? It’s been tough to find them for a long time.

I think there are a lot more data scientists coming in. There are some really interesting efforts: Most major universities have now started a data science program, so people coming out of colleges will know these basic tools. There are also a lot of training courses and programs after the fact where you go on a six-month training course and learn enough statistics to be a data scientist.

At Databricks, we’ve made a big effort to educate people and develop these skills. We ran three massive online courses in the last two years. We had over 100,000 people sign up, and around 20,000 people finished them. That’s a very high completion rate for online courses.

What are still the big challenges to becoming a data scientist?

I think the biggest challenge is that people come from many different backgrounds. The material that’s right for one person is wrong for someone else. We’ve been trying to structure our courses based on your background. Are you coming from a data background or a software engineering background? The other thing we’ve seen is people coming from a SQL background, and we want to expose these advanced analytics in SQL so that people who are not machine learning experts can use them.

SQL on Hadoop is standard now. Is that the case for your customers?

We developed the original SQL engine on Spark, then we started Spark SQL and integrated that into the project in 2014. It’s become the most actively developed open source SQL engine with over 2,000 contributors, and it’s still developing very fast. This year a bunch of companies got together and contributed cost-based optimization to it. Another thing that we’ve been working on is streaming SQL through Structured Streaming, so the same engine can now do streaming. We added that last year, and it’s now generally available. Apart from that, we announced some performance benchmarks: we’re five to ten times faster than other engines like Flink and Kafka Streams.

Those other Apache streaming projects have been growing. How do you align with them in Spark so there’s less overlap?

There are a few types of projects there. Kafka started out as just a message bus. That’s very useful, but it’s basically a storage system. In the cloud, there is Amazon Kinesis. Then there are compute frameworks. Kafka has added a compute framework called Kafka Streams.

The thing that’s unique about Spark is we tried to have this unified engine and analytics platform where you can in one place manage your application, and not have to manage all these systems.

Do you see anyone still using Apache Storm?

There are people using Storm. I think it’s just not developed as actively. This may change at some point. Twitter moved on to Heron, and Hortonworks used to drive a lot of Storm stuff. In general, there are a lot of systems in this space that are complex to use. We want to consolidate on one that’s easy to use.

Are artificial intelligence (AI) and machine learning (ML) the big drivers for long term usage of big data processing platforms?

I think big data, advanced analytics, machine learning, and artificial intelligence are definitely here to stay, and they will continue to grow. The use cases of AI are only possible because you have large data sets, but that’s different from the Hadoop vendors and Hadoop as a platform. The main value is as a storage system and data lake. It’s a much less expensive way to store data, and once you’ve stored it you do analytics.

That is being completely disrupted by the public cloud, where it’s often an order of magnitude cheaper to store stuff in and it’s geo-distributed, and you only pay for the storage when you’re storing stuff, and only pay for compute when you’re computing. It’s operated by someone else. I think because of that, the value of those (on-premise) deployments decreases. They’re still useful if you can’t use the public cloud.

Those vendors are working hard to provide isolated clouds for government. The way we set up our company, we are agnostic in storage. We aim to provide analytics. For our use cases, we see all of them are asking, “How can I do more analytics without the operational complexity of Hadoop?”

What does “serverless” mean in the context of your new offering?

What it means is basically for the users, they don’t need to configure resource allocations or Spark config settings at all when running a computation. The way it worked before today, each user as they want to run some analysis had to say here’s how much memory, and how many nodes I want. Here’s how many CPUs I want, and so on. That means every user had to be an expert in configuring Spark. Data scientists are often not experts in Spark configuration. It also means poor cost efficiency and poor cost management.

With serverless, we turn the concept around. The administrator of the account decides an overall size minimum and maximum, and we scale up a pool of machines, and users connect to that in queries. We extended Spark to understand how many resources to allocate to each job. It can greatly reduce cost. It’s kind of the same thing as [Google’s] BigQuery: you submit queries and get answers. Maybe there’s an administrator who controls the overall spend on the system. We think, again, much like SQL, data science is an extremely interactive type of application.
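
To make the pooling idea above concrete, here is a toy sketch in Python. It is not Databricks’ implementation and does not use any Spark API: the `Pool` class, its `submit`, `finish`, `size` and `shares` methods, and the proportional split are all assumptions made for illustration. It only shows the shape of the idea Zaharia describes, where an administrator sets a minimum and maximum once and individual users never pick memory, nodes or CPUs themselves.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Pool:
    """Hypothetical shared worker pool; bounds are set once by an administrator."""
    min_workers: int
    max_workers: int
    # job name -> number of workers the job could use (assumed to be estimated
    # by the system, not typed in by the user)
    demand: Dict[str, int] = field(default_factory=dict)

    def submit(self, job: str, wanted: int) -> None:
        """Register a running job and its estimated demand."""
        self.demand[job] = wanted

    def finish(self, job: str) -> None:
        """Remove a job so its workers can be released."""
        self.demand.pop(job, None)

    def size(self) -> int:
        """Pool size: total demand, clamped to the administrator's bounds."""
        total = sum(self.demand.values())
        return max(self.min_workers, min(self.max_workers, total))

    def shares(self) -> Dict[str, int]:
        """Workers per job: full demand if it fits, else a proportional split.

        The split uses integer floor division, so a very small job can round
        down to zero workers in this toy version.
        """
        total = sum(self.demand.values())
        size = self.size()
        if total <= size:
            return dict(self.demand)
        return {job: wanted * size // total for job, wanted in self.demand.items()}


if __name__ == "__main__":
    pool = Pool(min_workers=2, max_workers=10)
    pool.submit("dashboard-query", 4)
    pool.submit("model-training", 16)
    print(pool.size(), pool.shares())
```

The administrator’s two numbers are the only sizing decision in this sketch, which is the cost-control point made in the answer: spend is capped by `max_workers`, and the pool shrinks back toward `min_workers` as jobs finish.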

How many people are contributing to the Apache Spark Project these days?

It’s a lot of people on Apache Spark. It’s over 1,000 people and might be around 1,200 or 1,300 now. It’s still growing. A lot of it is in the libraries: machine learning libraries, Python and R support, and so on. There isn’t that much action in just the core engine. There is value in having this unified set of libraries: you can combine them in any way and find the algorithm you’re looking for. The most active component is the Spark SQL Engine. It’s easy to extend and optimize for a specific data source, and a specific query. A lot of people who build key-value stores are building connectors.

