ArangoDB: Three Databases in One
ArangoDB, a German database company expanding its business in the United States, has released new capabilities in version 3.5 of its eponymous database management software to make it easier to query and search growing data sets across multiple data models.
Multimodel databases address two problems: making effective use of data stored in different ways, and avoiding the need to manage multiple databases, each with its own storage and operational requirements, including data consistency.
With ArangoDB, data can be stored as key-value pairs, graphs or documents and accessed with one declarative query language. A single query can also combine models, for example a document query and a graph query. The combination offers flexibility and performance advantages, explained Claudius Weinberger, CEO.
In the database market, it’s not all about scale, he said.
“There are graph databases, there’s full-text search engines like Elasticsearch, key-value stores like Redis. In practice, unfortunately, the use cases you typically have, they don’t simply map to one of those data stores,” said Jörg Schad, head of engineering and machine learning.
“So most often, you don’t simply have a graph problem. But you, for example, might have a document store problem and a graph problem at the same time.”
So developers typically take MongoDB for the document part and something like Neo4j for the graph part, then build an application layer on top, he said.
“The core idea is actually offering all of this from the same engine and all the same transactional guarantees as from one database engine. You don’t have to write custom code to merge data on top or provide transactional guarantees across different data stores,” he said.
ArangoDB also offers full-text search, along the lines of Elasticsearch, for indexing text data.
Among the new features in version 3.5:
- SmartJoins, available only in ArangoDB’s Enterprise version, lets you shard the data and run joins on different machines. This makes it more efficient to run complex queries on large graphs that don’t fit on a single node.
- Configurable analyzers and sorted indexes for ArangoSearch. These can speed up query performance by 1,500 times, according to the company, and simplify management in a multi-tenant environment.
- k Shortest Path and PRUNE: k Shortest Path enables users to query for the k shortest paths between two vertices and analyze result sets by path length or weight. PRUNE lets users add a condition to their queries that stops deeper searching once it is met, which helps reduce overhead and speed results.
- Data masking anonymizes sensitive data like credit card and Social Security numbers, enabling the use of production data in testing environments. There are predefined masking components, and you can also define your own.
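To make the data masking idea concrete, here is a minimal Python sketch of masking a JSON-like document before it is copied into a test environment. It is not ArangoDB's masking tool or its configuration format. The field names (`credit_card`, `ssn`, `email`), the `RULES` table and the `mask_document` helper are all assumptions made up for illustration.

```python
import hashlib
import re


def mask_card(value: str) -> str:
    """Keep only the last four digits of a card number (hypothetical rule)."""
    digits = re.sub(r"\D", "", value)
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_ssn(value: str) -> str:
    """Keep only the last four digits of a Social Security number."""
    digits = re.sub(r"\D", "", value)
    return "***-**-" + digits[-4:]


def pseudonymize(value: str) -> str:
    """Replace a value with a short stable hash, so equal inputs stay equal."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical rule table: attribute name -> masking function.
RULES = {"credit_card": mask_card, "ssn": mask_ssn, "email": pseudonymize}


def mask_document(doc, rules=RULES):
    """Return a masked copy of a document, recursing into dicts and lists."""
    if isinstance(doc, dict):
        return {
            key: rules[key](value)
            if key in rules and isinstance(value, str)
            else mask_document(value, rules)
            for key, value in doc.items()
        }
    if isinstance(doc, list):
        return [mask_document(item, rules) for item in doc]
    return doc


customer = {
    "name": "Ada",
    "credit_card": "4111 1111 1111 1111",
    "ssn": "123-45-6789",
    "orders": [{"email": "ada@example.com", "total": 42}],
}
masked = mask_document(customer)
```

The original document is left untouched and non-sensitive fields such as `total` pass through unchanged, which is what keeps the masked copy useful for testing.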
“Over here in Europe, [GDPR] is a huge topic,” Schad said. “You have to be careful which data you might return to certain groups in your organization. So we came up with a way to … easily to mask out, for example, private data, but still return meaningful results.”
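The k Shortest Path and PRUNE features listed above can also be illustrated outside the database. The following Python sketch enumerates loop-free paths in cost order with a priority queue and accepts an optional prune condition that stops a path from being extended. It is a simple best-first search written for illustration, not ArangoDB's algorithm and not AQL syntax. The graph, the function name `k_shortest_paths` and the `prune` callback signature are assumptions.

```python
import heapq
from itertools import count


def k_shortest_paths(graph, start, goal, k, prune=None):
    """Return up to k loop-free paths from start to goal, cheapest first.

    graph maps a vertex to {neighbor: non-negative edge weight}.
    prune, if given, is called as prune(path, cost); when it returns True
    the path is not explored any deeper.
    """
    tie = count()  # unique tie-breaker so the heap never compares paths
    heap = [(0, next(tie), [start])]
    found = []
    while heap and len(found) < k:
        cost, _, path = heapq.heappop(heap)
        vertex = path[-1]
        if vertex == goal:
            found.append((cost, path))
            continue
        if prune is not None and prune(path, cost):
            continue  # stop searching deeper along this path
        for neighbor, weight in graph.get(vertex, {}).items():
            if neighbor not in path:  # keep paths loop-free
                heapq.heappush(
                    heap, (cost + weight, next(tie), path + [neighbor])
                )
    return found


# Hypothetical weighted graph.
GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}

two_best = k_shortest_paths(GRAPH, "A", "D", k=2)
# Prune any partial path that already has more than two vertices.
shallow = k_shortest_paths(
    GRAPH, "A", "D", k=5, prune=lambda path, cost: len(path) > 2
)
```

Because edge weights are non-negative, partial paths leave the queue in non-decreasing cost order, so the first k that reach the goal are the k cheapest. This approach can be slow on large graphs; it is meant only to show what the two features compute.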
ArangoDB was founded in 2014 in Cologne, Germany, and more recently set up headquarters in San Francisco. It secured $10 million in series A funding in March led by Bow Capital, bringing its total funding to $19.2 million.
Founders Weinberger and Frank Celler have worked together for more than 20 years. In 2004, they started the database consulting company triAgens, focused on NoSQL solutions.
ArangoDB offers an open source Community version and its commercial Enterprise version. It has a managed service in beta. Its customers include Barclays, Thomson Reuters and Cisco.
It has plenty of company in the market. Seven of the top 10 most popular databases in the DB-Engines ranking are listed as multimodel. It also competes with other multimodel open source databases including OrientDB, Cassandra, Redis, Couchbase and YugaByte.
The difference, according to Weinberger, is that many of these have a graph layer built on top of a document or key-value store rather than a fully native model encompassing all three.
Three in One
At its core, the database software, written in C++, combines several data stores in one. All of them are accessed with a single declarative query language, the ArangoDB Query Language (AQL), which is similar to SQL. AQL supports joins and traversals that can be combined in various ways, and a query can address the different data stores simultaneously. It’s designed to be client independent: the language and syntax are the same for all clients, no matter which programming language is used.
It stores data from all three model types as JSON documents. With graphs, it stores a JSON document for each vertex and another for each edge. Special edge collections ensure that every edge has _from and _to attributes referring to the starting and ending vertices of the edge, enabling the use of one query language for all three model types.
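The storage layout described above can be mimicked in a few lines of Python: vertices are plain documents, edges are documents that carry `_from` and `_to`, and one function combines a graph traversal with a document-style filter. This is only an in-memory sketch, not how ArangoDB executes queries. The `people` and `knows` collections, their contents and the helper names are invented for illustration.

```python
from collections import deque

# Vertex documents, keyed by an _id of the form "collection/key".
people = {
    "people/alice": {"_id": "people/alice", "name": "Alice", "city": "Cologne"},
    "people/bob": {"_id": "people/bob", "name": "Bob", "city": "San Francisco"},
    "people/carol": {"_id": "people/carol", "name": "Carol", "city": "Cologne"},
}

# Edge documents: ordinary JSON documents plus _from and _to attributes.
knows = [
    {"_from": "people/alice", "_to": "people/bob", "since": 2014},
    {"_from": "people/bob", "_to": "people/carol", "since": 2019},
]


def outbound(start_id, edges, max_depth):
    """Yield (depth, vertex_id) for vertices reached by following _from -> _to."""
    seen = {start_id}
    queue = deque([(0, start_id)])
    while queue:
        depth, vertex_id = queue.popleft()
        if depth > 0:
            yield depth, vertex_id
        if depth == max_depth:
            continue
        for edge in edges:
            if edge["_from"] == vertex_id and edge["_to"] not in seen:
                seen.add(edge["_to"])
                queue.append((depth + 1, edge["_to"]))


def reachable_in_city(start_id, city, max_depth=2):
    """Graph traversal plus a filter on a document attribute, in one pass."""
    return [
        people[vertex_id]["name"]
        for _, vertex_id in outbound(start_id, knows, max_depth)
        if people[vertex_id]["city"] == city
    ]


names = reachable_in_city("people/alice", "Cologne")
```

Because edges are themselves documents, attributes such as `since` could be filtered the same way as vertex attributes, which is the point of keeping all three models in one JSON representation.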
On the back end, it uses RocksDB for scale.
Going forward, the company will be focused on optimizations for running larger clusters, Schad said. How should data be replicated between nodes? How can administrators get the metrics they need to debug issues at scale? What about effective backup and fault tolerance for graphs that don’t fit on a single node?
It’s also delving into machine learning and making machine learning pipelines more efficient.
Feature Image: “Three” by Randen Pederson. Licensed under CC BY-SA 2.0.