
The 2020s Will Be Defined by Scale-Out Data

22 Apr 2020 5:00pm

If the 2000s were when networking evolved on the internet (I called this the read/write era; others named it ‘Web 2.0’), and the 2010s were all about the compute layer, then the 2020s will see a revolution in the data layer.

That’s according to DataStax Chief Strategy Officer Sam Ramji, who outlined his vision at last week’s The New Stack Virtual Pancake Breakfast webinar.

Specifically, Ramji was talking about how these trends “scaled out” on the internet. As The New Stack readers will know, scale-out architecture refers to adding more power to an application by adding more machines — rather than the “scale-up” approach, which relies on upgrading a machine by adding a faster CPU or more memory. Scale-out is, of course, a cornerstone of the cloud native world we now live in.
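To make the scale-out idea concrete, here is a toy Python sketch (node names and key counts are invented for illustration): a router hashes each request key to one machine in a pool, so capacity grows by appending nodes to the pool rather than by upgrading any single node.

```python
# Toy illustration of scale-out: capacity grows by adding machines
# (nodes), not by upgrading one. Node names here are hypothetical.
import hashlib

def pick_node(key: str, nodes: list) -> str:
    """Route a request key to one node via a stable hash."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
keys = [f"user-{i}" for i in range(10_000)]

# Load spreads across all nodes; each handles roughly 1/len(nodes).
load = {n: 0 for n in nodes}
for k in keys:
    load[pick_node(k, nodes)] += 1

# "Scaling out" = appending a machine; the same router now spreads
# the same traffic across four nodes instead of three.
nodes.append("node-4")
```

(Real systems use consistent hashing rather than a plain modulo, precisely so that adding a node doesn't reshuffle most keys — more on that below with Cassandra.)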


Before we dig into Ramji’s predictions, it’s worth quickly reviewing how we got here.

The 2010s saw the maturation of several major internet platforms: social, mobile and cloud. From an infrastructure perspective, cloud was by far the most important. All the big players now have substantial cloud computing infrastructures — Amazon, Google, Microsoft, Apple and Facebook. With the emergence of containers in the middle of the decade (which The New Stack founder Alex Williams was among the first to cover), a more efficient and scalable way of managing applications on the cloud was discovered.

The next revolution was the open source container orchestration platform Kubernetes, which enabled even more “scale-out” of the compute platform. Kubernetes, which evolved out of an in-house Google platform called “Borg”, has experienced rapid growth over the past couple of years. According to the most recent survey of The Cloud Native Computing Foundation (CNCF), 78% of respondents are now using Kubernetes in production — a leap from 58% last year.

Now that Kubernetes is so prevalent among cloud native companies, attention is starting to focus on the data layer. Apache Cassandra has become the open source database of choice in the cloud native world, and DataStax is among a cadre of startups offering commercial solutions on top of Cassandra.

What is Apache Cassandra? It’s a highly scalable distributed open source database, first developed at Facebook and released as an open source project in July 2008. It’s a so-called NoSQL database, a type of non-relational database “built specifically for scalable applications.” Nowadays, Cassandra is used by corporations like Netflix, Comcast, eBay, Hulu and Intuit. One of its biggest users is Apple, which runs 150,000 Cassandra instances and stores hundreds of petabytes of data.
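The mechanism that lets Cassandra scale this way is a token ring: each row's partition key is hashed onto a ring of tokens, and each node owns ranges of that ring, so adding nodes adds both storage and throughput. The sketch below is a simplified illustration of that idea, not Cassandra's actual code — Cassandra defaults to a Murmur3 partitioner, while md5 here is just a convenient stand-in, and the node addresses are made up.

```python
# Minimal sketch of token-ring data distribution, the idea behind
# Cassandra's scale-out design. md5 stands in for Cassandra's real
# Murmur3 partitioner; node IPs are hypothetical.
import bisect
import hashlib

class TokenRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each node claims several positions ("virtual nodes") on the
        # ring, which evens out ownership when nodes join or leave.
        self.ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _token(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, partition_key: str) -> str:
        """Walk clockwise from the key's token to the next vnode."""
        idx = bisect.bisect(self.tokens, self._token(partition_key))
        return self.ring[idx % len(self.ring)][1]

ring = TokenRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
owner = ring.node_for("customer:42")  # same key always lands on the same node
```

Because only the token ranges adjacent to a new node change hands, joining a node migrates a small slice of data rather than reshuffling the whole cluster.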

The idea that Ramji and others are pushing is that Cassandra (the data plane) is a natural complement to Kubernetes (the control plane). Both are open source, both are distributed, and both are highly scalable. As Ramji put it in another interview, “Cassandra and Kube is like peanut butter and chocolate […] kind of a perfect pairing of data and compute for a cloud native world.”

If anyone has insight into how Kubernetes and Cassandra can be used together, it’s Ramji. He led the Kubernetes team at Google during his time there (late 2016 to mid-2018) and now he’s leading strategy for the Cassandra-focused startup DataStax.

“Apache Cassandra has got over a decade of hard-won battle-tested code improvement,” Ramji said on the Virtual Pancake webinar. So it’s ready, he believes, to be the distributed database of choice for major cloud projects.

It’s worth noting, though, that Cassandra will need to be further adapted to scale on Kubernetes, since it isn’t native to that platform. To that end, DataStax launched its open source Kubernetes operator last month. An “operator” is a tool that makes deploying and managing an application on Kubernetes easier.
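In practice, an operator lets you declare a whole Cassandra datacenter as a single Kubernetes resource and leave the pod-by-pod orchestration to the operator. The manifest below is a rough sketch of what that looks like with DataStax's operator at the time of writing; the cluster name, size and storage values are placeholders, and the exact fields may differ across operator versions.

```yaml
# Hypothetical example of declaring a Cassandra datacenter as a
# Kubernetes custom resource; values are illustrative placeholders.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: example-cluster
  serverType: cassandra
  serverVersion: "3.11.6"
  size: 3                    # number of Cassandra nodes (pods) to run
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
```

Scaling out then becomes a one-line change: bump `size` and the operator adds nodes and rebalances, rather than an administrator hand-configuring each new Cassandra instance.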

There are other Kubernetes operators for Cassandra available on the Web, not to mention plenty of competition for DataStax in the scale-out architecture market. Cockroach Labs, Redis Labs and MongoDB all have cloud native database products.

It’s interesting to ponder what future applications the pairing of Kubernetes with Cassandra (or an alternative scalable database) could lead to. Ramji is keeping an eye on artificial intelligence and machine learning apps. Now that the networking and compute layers are solved, he thinks that over the next ten years “there’s an opportunity to make data really easy, really manageable, and create a playground for apps of the future, which will include AI and ML apps.”

That’s because to create truly effective AI and ML apps, you need a database that can scale aggressively.

“You look at the kind of loads that those systems put on modern infrastructures,” Ramji said, “just doing a training set on a set of static image data, you could be looking at sustained demand of many gigabytes a second — let alone image recognition overlaid with video, audio, anything else that you might want to do.”

If you add Kubernetes to the mix, it’s a recipe for the future of AI and ML applications.

“So the demands on the system for raw throughput, times the ability to scale as wide as you might scale a cloud infrastructure like Kubernetes,” said Ramji, “kind of does give us a little peek ahead of time, right? What’s the old, most excellent quote: the future is already here, it’s just unevenly distributed.”

Look no further than Google for an example of AI and ML apps built on cloud native technology. In Google’s case, the control plane was Borg (the mother of Kubernetes) and the data plane was its own massively scalable database, Spanner (since offered commercially as Cloud Spanner).

“So when you look at why Google was able to build a modern AI and machine learning business,” said Ramji, referring to Google search, Ads, Gmail and other products, “it was because it had this Borg control plane, and you had Spanner as your data plane. So the marriage of those two things made compute and data so universally addressable, so easy to access, that you could do just about anything that you could imagine.”

The intriguing thing is what tens of thousands of other businesses and startups could do with the same technology (only this time open source). In other words, there’s a good chance the leading AI and ML apps of the coming decade will be built on Kubernetes and Cassandra.

DataStax, CNCF, Redis Labs and MongoDB are sponsors of The New Stack.

Image via Pixabay.
