Microsoft Revamps HDInsight, Its Stalwart Big Data Service
Microsoft is launching in public preview today a new, fully revamped version of its HDInsight (HDI) cloud-based modern data stack service. The new version, dubbed HDI on AKS (Azure Kubernetes Service), will initially support three cluster types: one based on Apache Spark, the analytics, data engineering and machine learning platform; one on Apache Flink, the platform for streaming and batch data processing; and one on Trino, the query engine for data lake and federated query analytics.

Check out: Data 2023 Outlook: Rethink the Modern Data Stack

As its name would imply, the new service is premised on Kubernetes-based container technology, rather than virtual machines. This means that once an HDI on AKS "cluster pool" is stood up, provisioning and deprovisioning individual clusters running within that pool becomes a much more straightforward and agile process than it is for clusters in the original version of HDI.
What’s Wrong with the Old One
Speaking of the original version, a crash course in open source analytics is in order. At the beginning of the "big data" era, the open source Apache Hadoop project reigned supreme. Around that time, Microsoft launched HDInsight as its cloud-based Hadoop service. The offering, built in collaboration with the erstwhile Hortonworks, eventually expanded to include cluster types optimized for other open source engines, including Apache Spark, Kafka, HBase and Hive LLAP.

Also read: Will Kubernetes Sink the Hadoop Ship?

The original HDI service still runs today but has receded in relevance for a number of reasons. These include the release of Azure Databricks, Azure Synapse Analytics and, most recently, the public preview of Microsoft Fabric, all of which leverage Apache Spark (far better suited to interactive analytics than Hadoop was) and all of which offer more sophisticated tooling and greater ease of use. Another issue is that Hortonworks' 2019 merger with Cloudera led to the deprecation of its Hortonworks Data Platform (HDP) Hadoop distribution. While Microsoft built its own clone of that distro based entirely on Apache open source bits, the technology has still suffered from the stigma of obsolescence.

Background on Microsoft Fabric: Microsoft Fabric Defragments Analytics, Enters Public Preview

Another factor is the growing popularity of newer open source frameworks that HDI never onboarded. These include two of the new HDI's components: Trino, the data lake query engine that evolved from the original Presto engine developed at Facebook, and Apache Flink, the streaming data processing platform. There's also the matter that even though the original HDI offers a Spark cluster type, it's still largely based on Hadoop's resource manager, known as YARN. And finally, the virtual machine-based architecture of the original HDI made the deployment of clusters slow and otherwise unwieldy.
Cooperative Coexistence, not Creative Destruction
Despite the fact that both HDInsight versions offer Spark cluster types, the two services are largely complementary, so their coexistence can be easily rationalized. Furthermore, the presence of Trino, which supports connectivity to, and federated queries between, a large array of back-end data sources, means that HDI on AKS can connect to, and integrate, a number of Azure's other data services. These include Azure SQL Database, Synapse Analytics and even Azure Database for MySQL, PostgreSQL and MariaDB. And since Power BI can, in turn, connect to Trino, the end-to-end integration story is a good one.

While the complementary nature of HDI on AKS, relative to the original HDI, is reassuring, the bigger question is how to rationalize the coexistence of the new service with Synapse Analytics and, perhaps more importantly, the emerging Microsoft Fabric. If the same workloads can be accommodated on both, and Microsoft is making both available, it leaves open the question for Microsoft/Azure customers of which to use when.
DIY vs. IDE
The answer lies in the modalities of usage: if you want to work with HDI on AKS, be ready to work with code, command line interfaces (CLIs) and external tools. For example, if you build a Trino cluster, you'll need to do one of three things to work with it:

1. Download the Trino CLI to your own computer.
2. Establish a Secure Shell (SSH) session with the cluster and run the CLI from there (as shown below, you can do this in a web browser, from the Azure Portal).
3. Connect to Trino from an external query tool (e.g. DBeaver) using a JDBC driver designed specifically for Trino on HDI on AKS.
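For readers who prefer scripting to a GUI query tool, a rough sketch of the client-side plumbing may help. The helper below assembles the kind of JDBC URL a tool like DBeaver expects (the `SSL=true` property and `jdbc:trino://` scheme are standard Trino JDBC conventions), and shows how a programmatic connection might look using the open source `trino` package from PyPI. The hostname, user name and helper function are illustrative placeholders, not HDI-specific APIs; in practice, Azure authentication would also be required.

```python
# Sketch: two ways to point a client at a Trino cluster.
# Hostname and user are placeholders -- substitute your own endpoint.

def trino_jdbc_url(host: str, port: int = 443) -> str:
    """Assemble the JDBC URL a tool like DBeaver expects.

    SSL=true is a standard Trino JDBC driver property; port 443 assumes
    the cluster endpoint is served over HTTPS.
    """
    return f"jdbc:trino://{host}:{port}?SSL=true"


def connect_with_python_client(host: str):
    """Connect using the open source 'trino' package (pip install trino).

    Auth is omitted here for brevity; a real HDI on AKS cluster would
    require an Azure AD-backed authentication mechanism as well.
    """
    import trino  # deferred import so the URL helper works without the package
    return trino.dbapi.connect(host=host, port=443,
                               http_scheme="https", user="hdi-user")


print(trino_jdbc_url("mycluster.example.net"))
# prints: jdbc:trino://mycluster.example.net:443?SSL=true
```

Either route ends at the same place: a DB-API-style cursor (or JDBC connection) against which you can run `SHOW CATALOGS` or federated SQL queries.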
Flink also has a SQL client accessible via SSH. Spark clusters are more flexible, as they can be accessed in code from Jupyter notebooks available directly from the cluster's Overview blade in the Azure Portal, or from a variety of external tools. While Trino, Flink and Spark do have their own user interfaces, all of which are easily accessed from the Azure Portal, those UIs serve as tools for managing the cluster infrastructure and for monitoring jobs or queries (as shown below, for Flink), rather than for developing and submitting them.
This is a very different experience from working with Azure Synapse or Microsoft Fabric, each of which features its own web-based user interface, and each of which can also be accessed from popular Microsoft ecosystem tools like SQL Server Management Studio, Azure Data Studio and Visual Studio Code. Analytics teams who don’t want to muck with IT will prefer this approach. IT teams who don’t themselves work with analytics, but want to manage services for those who do, may well prefer to have fewer layers between them and the software.
Stick Shift or Automatic?
By the same token, the HDI and Fabric services themselves — even if they have broad capabilities and certain technologies in common — are very different. HDI on AKS is a Platform as a Service (PaaS) offering that provides very fine-grained control over each of its components, for customers who want that control and possess the talent and skill sets needed to manage it. These are the customers who might run the same software completely on their own, but would rather do it in the cloud, without having to worry about the hardware or virtual machine infrastructure, or the care and feeding of the underlying Kubernetes pods, clusters and nodes.

Fabric, meanwhile, is a Software as a Service (SaaS) offering, based on a unified compute capacity model, which provides a user interface and abstraction layer over the underlying technologies. Resources are provisioned and deprovisioned automatically and elastically, without the customer needing to micromanage those details. It's a much more turn-key service that lets customers focus on the business requirements to which the technology is being applied, rather than the management and configuration of the infrastructure.

Delegating that responsibility to the service will be a great convenience for some customers, but an obstacle for others. That's why offering both services makes sense: Microsoft can accommodate the full range of customer preferences, in-house talent and approaches to cloud cost management. Some will want a highly engineered, refined platform, while others will prefer to work with "commodity" open source data and analytics technologies and add their own engineering value on top of them.
Hands-on, with Sleeves Rolled up
Want to get to work with Trino, Flink and/or Spark? If you're willing to get your hands dirty and read through some documentation to climb the learning curve, HDI on AKS is a very nice way to get your analytics work done. Though I had some direct support from the product team, I was able to work productively with HDI on AKS while the service was still in private preview and the documentation was still pretty austere. (Both screenshots in this post came from my own work with the service.) If I can do it, so can any good data engineer or cloud services-savvy analytics pro. And with HDI on AKS now in preview, it makes sense for such users to give it a look, put it through its paces, and consider how it might fit with the rest of their organization's modern data stack components and tools.

Customers who would like to try out HDI on AKS can create a new cluster at https://aka.ms/starthdionaks and peruse the documentation at https://aka.ms/hdionaks-docs. Subscription prerequisites and one-time setup steps can be found at https://learn.microsoft.com/azure/hdinsight-aks/prerequisites-subscription.