MongoDB Partners with Databricks for App-Driven Analytics
Last month, modern operational database player MongoDB announced a close partnership with Databricks, a company whose platform focuses on data engineering, AI and analytics, and which coined the term “data lakehouse.” The motivation for the partnership is both customer- and technology-driven, and it should go a long way toward easing a platform integration that many customers had been forging on their own.
Speaking of their own company and Databricks in a blog post, MongoDB’s Matt Asay, Dana Groce, and Ariel Amster said “we observed that a large and growing population of joint customers has for years enabled the flow of data between our two platforms to run real-time businesses and enable a world of application-driven analytics, using MongoDB Connector for Apache Spark. So we asked ourselves: How could we make that a more seamless and elegant experience for these customers?”
The New Stack spoke with several MongoDB executives about the partnership, including Alan Chhabra, EVP, Worldwide Partners; Andrew Davidson, SVP, Products; and Jeff Sposetti, VP Product, Analytics & Enterprise Tools. Together, the three provided in-depth context on the partnership, MongoDB’s approach to analytics workloads, and its integrations with other platforms in the data/analytics space.
As you might expect of someone in his position, Mr. Chhabra espouses a very strong ethos around partnering, including with companies that might seem competitive: “I believe many customers and analysts see us all competing with each other, and I don’t think that’s a prudent strategy. I feel companies like MongoDB and Databricks should play well together… and the other players that try to build everything on their own will offer poor products and customers will suffer.”
Chhabra says that Databricks is the kind of partner with whom MongoDB customers wanted to see the company pursue greater technical integration, and that’s what led his team to pursue the relationship. “What you see us do is go after certain technology partners that our customers want us to work better together with,” Chhabra said. “You saw us earlier this year launch an alliance in the streaming area with Confluent. And for the last six months, we’ve been excited about what can we do better with AI/ML and data warehousing. Databricks, we feel, is a very complementary company to MongoDB.”
Chhabra says customers want these two domains to move closer together: “If I’m servicing lots and lots of data scientists for data warehousing, or I need to go make sure I’m running analytics for certain AI use cases, I also want to be able to give my service to real-time applications. So customers want those two streams, which have historically been kept separate, to converge together. So that’s why you see an operationalizing of both use cases where analytics and operation data become one. So the leader, at least what we think, in modern application development for databases is MongoDB, and we feel like the leader for analytics, data warehousing, AI/ML is Databricks. This is why you saw us launch this strategic partnership…”
MongoDB’s Davidson sees the two companies as having overlapping customer lists on the one hand and complementary platforms, in terms of capabilities, on the other. “We see Databricks is just highly synergistic with what we’re doing in a few ways,” he said. “Both [companies are] targeting highly technical personas. But you know, slightly less overlap in terms of the day-to-day practitioner. We see those practitioners increasingly working closely together, so we want to make sure they have a great joint solution.”
How Does It Work?
The integration is based on MongoDB’s Apache Spark connector, which allows MongoDB to be queried using Spark SQL and the Spark DataFrame APIs. But the Databricks integration goes further, extending all the way into the UI: Simply by clicking on the “Data” left navbar item and the “Add Data” button that results, Databricks developers can pick MongoDB as a data source. These steps are shown in the figures below.
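For readers curious what the connector usage looks like in code, here is a minimal sketch, assuming the v10+ MongoDB Spark Connector (which registers the short-name “mongodb” data source); the connection URI, database, and collection names are placeholders:

```python
def read_mongodb_collection(spark, connection_uri, database, collection):
    """Load a MongoDB collection as a Spark DataFrame via the
    MongoDB Spark Connector (v10+ registers the "mongodb" source)."""
    return (
        spark.read.format("mongodb")
        .option("connection.uri", connection_uri)  # e.g. "mongodb+srv://user:pass@cluster.example.net"
        .option("database", database)
        .option("collection", collection)
        .load()
    )

# Once loaded, the DataFrame can be queried with plain Spark SQL:
# df = read_mongodb_collection(spark, uri, "sample_db", "orders")
# df.createOrReplaceTempView("orders")
# spark.sql("SELECT status, COUNT(*) FROM orders GROUP BY status").show()
```

The point of the connector is exactly this: once the `load()` call succeeds, the MongoDB collection behaves like any other Spark data source, with no MQL required.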
Upon clicking the MongoDB data source option, Databricks developers are immediately placed into a sample notebook with code pre-written to query MongoDB. I was able to get all of this working in my own Azure Databricks environment, and all the figures in this post are screenshots I took myself, from my own machine. So I can validate that once connectivity is established, users are able to query data in MongoDB using just their Spark SQL and DataFrame skills, then visualize that data in the notebook.
Special features in MongoDB make even more powerful analyses possible. For example, aggregation pipelines and their $unwind and $project stages can together flatten the hierarchical data in a MongoDB document into a tabular structure more familiar to data analysts. This technique is explicitly demonstrated in the sample notebook and shown in the figure below.
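To make the flattening concrete, here is a small, self-contained Python sketch of what $unwind followed by $project does. It simulates the two stages on plain dictionaries rather than calling a live database, and the sample document shape is invented for illustration:

```python
# $unwind emits one output document per element of an array field;
# $project keeps (or reshapes) only the listed fields.

order = {
    "_id": 1,
    "customer": "Acme",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

def unwind(doc, array_field):
    """Emit one copy of the document per element of array_field."""
    for element in doc[array_field]:
        flat = dict(doc)
        flat[array_field] = element
        yield flat

def project(doc, fields):
    """Keep only the requested (possibly dotted) fields."""
    out = {}
    for f in fields:
        value = doc
        for part in f.split("."):
            value = value[part]
        out[f] = value
    return out

rows = [project(d, ["customer", "items.sku", "items.qty"])
        for d in unwind(order, "items")]
# rows is now tabular: one flat row per line item.
```

The nested `items` array becomes two flat rows, which is precisely the shape an analyst expects to see in a DataFrame or SQL result set.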
The notebook also shows how the aggregated data can be written to managed Spark tables in formats like Delta Lake and even written back to new collections in a MongoDB database.
Where Are the Lines Drawn?
The ability to query data in MongoDB raises the question of whether SQL can serve as a complete substitute for MongoDB’s own API and query language, MQL. The availability of MongoDB’s analytics nodes might suggest a data warehouse or lakehouse (including Databricks’) isn’t necessary. These overlaps with dedicated analytics platforms make it necessary to iron out where to use what.
Here’s Andrew Davidson’s take: “We definitely don’t see developers moving away from MongoDB’s query language for a variety of reasons. But we think of all of our companion personas…professionals of data, data engineers or data scientists using Spark, some of those people we expect to just run Spark directly against MongoDB using the Spark connector. And some of those people we expect to feel less intimidated, essentially, if they could take advantage of SQL. And we’re not opposed to that.”
Mongo sees the world of analytics on operational data as falling into a few key scenarios. The first, which the company dubs “In-App Analytics,” involves providing analytics capabilities in the app, sometimes referred to as embedded analytics. The idea here is to provide contextual analytics to application users without forcing them to move to dedicated BI platforms. MongoDB’s native APIs are most germane here.
The second area, which the company calls “Real-Time Business Visibility,” entails exposing data in the operational database to analysts for external analytics. This is where SQL access and analytics nodes work best, as their combined use optimizes access for analysts without burdening the operational database infrastructure.
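In MongoDB Atlas, analytics nodes are typically targeted through read-preference tags in the connection string, so analytical queries never land on the operational (electable) nodes. Here is a hedged sketch of how such a connection string can be assembled; the cluster hostname and database name are placeholders:

```python
def analytics_connection_string(host, database):
    """Build a MongoDB connection string that routes reads to Atlas
    analytics nodes via read-preference tags, keeping analytical
    workloads off the operational nodes. Host/db are placeholders."""
    return (
        f"mongodb+srv://{host}/{database}"
        "?readPreference=secondary"
        "&readPreferenceTags=nodeType:ANALYTICS"
    )

uri = analytics_connection_string("cluster0.example.mongodb.net", "sales")
```

Any driver or BI connector handed this URI will direct its reads at the analytics nodes, which is what makes the “Real-Time Business Visibility” scenario safe for the production workload.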
It’s this second scenario (exposing data in the operational database to analysts) that the Databricks notebook integration addresses. Here, the data volumes are likely on the smaller side, but the data is fresh and, because of this, the analytics may be quite valuable. The approach avoids data movement, minimizes requirements around data transformation, and gets developers and analysts working more closely together.
The third area, “centralized analytics,” is where it makes the most sense to use external analytical data stores (data warehouses, data lakes and data lakehouses) and their query agents.
The Wider Ecosystem
Databricks isn’t the only partner here. Mongo has deep integrations with other analytics platforms. With that in mind, during our discussion, Chhabra posited the question “how do we better integrate with the cloud providers, whether it’s…[Azure] Synapse, or a Power BI with Microsoft or a BigQuery with Google or with AWS’ features like Kinesis, Redshift … ?”
With respect to Synapse Analytics, the platform facilitates the creation of integration datasets based on data in MongoDB, and even provides separate connectors for Atlas and customer-managed MongoDB, as shown in the figure below. All of this fits nicely into the third (“centralized analytics”) scenario for analytics on operational data.
MongoDB has been investing a lot in its coverage of multiple use cases. In addition to providing separate clusters of analytics nodes, MongoDB has added time series data capabilities, announced plans to add column store index capabilities, and more.
Mongo’s Sposetti explains it this way: “So we’ve built… a lot over the past three years… we said more and more developers are dealing with time series data, and so we added capabilities specifically into our core database around time series data.” He added “that’s just one of those examples where we saw a need that developers were solving by bolting point solutions into their applications. And we said, ‘look, developers have already committed to MongoDB… as their core database. Can we help them solve this problem too?'” He said MongoDB responded similarly to requirements around search, by creating Atlas Search on top of Apache Lucene.
Sposetti added: “A developer shouldn’t look at MongoDB and say ‘this is awesome for document modeling and awesome for modeling the business domain I have in my app, but as soon as I need to do an aggregation across a bunch of documents, I have to go figure out a different solution.’ We want to make sure that’s good for them, and they can solve this inside of MongoDB.”
The timing of the MongoDB-Databricks partnership announcement was auspicious, coming just the week before Amazon Web Services’ re:Invent event, as both companies are highly cloud-driven and both are longtime close partners with AWS, with both Databricks and MongoDB Atlas first launching on Amazon’s cloud.
MongoDB’s Davidson explained the historical precedent: “AWS actually launched the same year that MongoDB launched, and it’s been like two peas in a pod ever since. And the reason for that is that if you think about what happened in those early days, that the origin, and moment of cloud commodity hardware being available as a service for the first time. And, in that moment, developers needed distributed system software that made that IaaS useful. So MongoDB provides this… [our] document data model… [that] enabled you to abstract… on top of a bunch of distributed IaaS infrastructure from day one… allowed us, in my view, to be a developer data platform that turned that bunch of computers in the cloud into something useful for developers, and that was very strategic to AWS… Many of the largest workloads on AWS from day one were built on MongoDB.”
Of course, with offerings from AWS like Apache Spark on Elastic MapReduce (EMR) and MongoDB compatibility in DocumentDB, Databricks and MongoDB each face some competitive pressure from AWS, too. All the more reason, then, for the two companies to implement tight integrations, and to teach customers how to use them.
Teach It; Don’t Just Build It
On that question of customer education, MongoDB has invested significantly in its MongoDB University online training platform over the last decade and has recently revamped it. “We’ve had over a million people go through MongoDB University over the years,” Chhabra said. “I have so many SIs [systems integrators] that want certifications, training on MongoDB to help them with modernization for customers. So we’ve revamped MongoDB U over the last year, and the team has relaunched it on a new LMS [learning management system].”
In November, the company announced enhancements to the platform, including an enhanced experience for learners, an expanded catalog of courses, streamlined developer certifications, 24/7 exam access, hands-on Atlas labs, and foreign language support. In addition, new “Learning Bytes” video tutorials, each 20 minutes or less in length, will cover the latest updates for MongoDB features, as they are released.
This is a smart investment for the company, as is making integrations with other platforms as friction-free as possible. Platform integration can deliver a ton of value, but it can also be hard for customers to do on their own. If MongoDB can build the alliances, produce the samples, and develop the learning materials needed to enable professionals with MongoDB skill sets to do more, it will continue to build its folk hero-like image with the developer community and, inevitably, be used for an increasing number of new application implementations.