While git has revolutionized the way teams collaborate on code, that same level of collaboration on data remains a work in progress. The Irish startup TerminusDB is just one of the projects working on creating a “git for data” system.
“What happens with machine learning teams and data science teams, even within the most sophisticated organizations, like Amazon, for example, is they go to the cloud data warehouse, they extract a bunch of csv [files] that are interesting to them, and they go and they work in them, and they never go back,” said Luke Feeney, commercial and operations lead at TerminusDB. “And because it’s high-velocity data (changes to that data are very rapid), what happens is you have a bunch of csvs that are floating around … and nobody knows exactly what the latest thing is.”
“We want to provide a flexible data hub for people to store and distribute those, and then work on them collaboratively.”
From Big EU Project
TerminusDB grew out of work at Trinity College Dublin. Originally called DataChemist and led by Kevin Feeney and Gavin Mendel-Gleason, the team won a 4 million euro grant from the European Commission to build the technical architecture behind the Global History Databank. This multidisciplinary project involved a consortium of research institutions working to compile all of the political and social data sets from all of human history and then provide them in machine-readable formats to enable advanced analytics, historical analysis, and other projects.
“We started as a kind of data quality layer, to try and make sure that good data was going in, but there’s just an enormous amount of conflict in that data,” said Luke Feeney. “You know, one person says that this polity disappeared in 600 BC, and another one says 632 BC. So being able to represent uncertainty within the data set was very important for us. … We have kind of uncertainty properties based on the data types, to allow them ranges of uncertainty, so we can represent everything within the database.”
TerminusDB is an open source in-memory graph database that stores data like git. It allows different people to work on different versions of the same project at the same time, using a whole suite of revision control features: branch, merge, squash, rollback, blame, and time-travel. Clone/fork operations allow data to be moved to your cloud or servers. Merge and branch operations let you mix and match data sources, significantly simplifying integration tasks. Users can later re-sync their work back into the central TerminusHub.
A colleague can then work on the merged project without having to reconstruct the environment to run the same queries.
“We started off building that data quality layer, and then we did a big proof-of-concept project where we tried to put the entire Polish economy onto a single graph so we could run high-quality queries against it looking for … conflicts within commercial data,” explained Luke Feeney. “We tried to build a kind of data versioning layer on the back of Postgres, and we found it had enormous performance issues.”
“So we prototyped the database at that point. The database server is actually written in Prolog, which is an unusual choice. It is a logical programming language and comes at things from a different perspective. Then we implemented on the back of a C++ library called HDT. And, again, we found we had a lot of problems with HDT. We had seg faults and stuff, and we reimplemented the storage layer in Rust. And so we have a Rust-based storage layer for the bit manipulation. And we have a Prolog server, which runs the queries and the query management.” It uses the Web Ontology Language (OWL) for schema design.
Coming out of the university, they wanted to sell large-scale graph systems to enterprises, but saw that data management was a big issue for all the teams that they were working with. So they pivoted toward the problem of version control.
“What we have now is an append-only data structure that uses some novel approaches to memory management so that we can share deltas very easily,” he said. “So basically, we have a distributed version-control data store that is linked together via a hub — It’s a GitHub for Data.”
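The append-only idea Feeney describes can be sketched in a few lines of Python. This is only an illustration of the general technique, not TerminusDB's Rust storage layer or its API: the `Layer` class, its `commit` and `triples` methods, and the sample data are all hypothetical names chosen for this example. Each commit adds a new immutable layer holding only its delta and a pointer to its parent, so branches share every layer they have in common.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Layer:
    """One immutable commit: a delta plus a pointer to the parent layer.

    Hypothetical structure for illustration only.
    """
    parent: Optional["Layer"]
    added: FrozenSet[Triple] = frozenset()
    removed: FrozenSet[Triple] = frozenset()

    def commit(self, added: Iterable[Triple] = (),
               removed: Iterable[Triple] = ()) -> "Layer":
        # Nothing is overwritten: a commit only appends a new layer on top.
        return Layer(self, frozenset(added), frozenset(removed))

    def triples(self) -> FrozenSet[Triple]:
        # Walk back to the root, then replay the deltas oldest-first.
        chain = []
        layer: Optional[Layer] = self
        while layer is not None:
            chain.append(layer)
            layer = layer.parent
        state = set()
        for layer in reversed(chain):
            state -= layer.removed
            state |= layer.added
        return frozenset(state)


root = Layer(parent=None)
main = root.commit(added=[("sku:1", "price", "10")])

# Two branches start from the same layer and share it rather than copying it.
experiment = main.commit(added=[("sku:1", "price", "12")],
                         removed=[("sku:1", "price", "10")])
main_next = main.commit(added=[("sku:2", "price", "5")])

print(sorted(experiment.triples()))
print(sorted(main_next.triples()))
print(sorted(main.triples()))  # the earlier state is still readable
```

Because older layers are never modified, reading an earlier version is just a matter of holding a reference to the older layer, and sending a colleague a change means sending only the layers they do not already have.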
Tim Sehn of Dolt describes how Dolt and TerminusDB intersect:
“Dolt and TerminusDB are both versioned databases you can branch, merge, and diff. The main difference is that Dolt is a standard relational database and TerminusDB is a graph database. With Dolt, you interact with tables and SQL whereas with TerminusDB, you have nodes and edges, and you interact with those using a custom query language. SQL databases are far more widely deployed than graph databases. Dolt can act as a drop-in replacement for most MySQL databases and works with MySQL clients,” he explained. “We love what TerminusDB is building, but we’re not really competitive. We offer the same set of features on two different types of databases.”
TerminusDB and Pachyderm recently joined a group called the AI Infrastructure Alliance focused on building out the Canonical stack for machine learning.
DVC (Data Version Control) uses git directly to version data, according to Luke Feeney. “That means that you can version your data and your code together, which is good in some circumstances, but sometimes you just want to have a database because you want the flexibility to be able to query deltas and query different bits,” he said.
He said DVC and Pachyderm seem more tightly focused, with Pachyderm concentrating on versioning entire pipelines via containers.
Said Joey Zwicker, co-founder of Pachyderm:
“It’s a very different product than Pachyderm, but we do overlap around working to version data. Terminus being a database (it reminds me of NomsDB by Attic Labs) restricts its use cases significantly in exchange for likely much higher performance and more complex relationship tracking than a tool like Pachyderm. Terminus seems to be great for graph data that needs to be queried and storing complex relationships, whereas a tool like Pachyderm is a much more generic data versioning system more similar to Git/Github.
“The main challenge to Pachyderm is that for specific workloads, it would be faster and better. But it is less general-purpose because not all data is so highly interrelated. It also requires a specific language to query it that programmers/data scientists must learn instead of just programming in whatever language they feel like.”
As for Dolt, Luke Feeney said, “We don’t think SQL is the right way to go about it. While it makes it more comfortable for coders initially, the price you pay down the line is quite considerable. When you’re staying in relational, describing the difference between two states of the database is very, very complicated and so you have to have a very complicated meta language to govern the differences between those two states. In our world, because at our base, it’s all RDF triples, and all we’re saying when we’re saying the difference, the delta or the diff between two states, is these triples were taken away, and these triples were added. And that makes everything very simple and lower cost.”
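Feeney's point about diffs can be made concrete with a small sketch. It assumes each database state is simply a set of (subject, predicate, object) triples; the `Delta` type, the `diff` and `apply_delta` functions, and the sample identifiers are hypothetical and are not part of TerminusDB's query language or client libraries.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Delta(NamedTuple):
    """The whole difference between two states: triples added and removed."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def diff(old: Iterable[Triple], new: Iterable[Triple]) -> Delta:
    # With states modeled as sets, a diff is two set differences.
    old_set, new_set = frozenset(old), frozenset(new)
    return Delta(added=new_set - old_set, removed=old_set - new_set)


def apply_delta(state: Iterable[Triple], delta: Delta) -> FrozenSet[Triple]:
    # Replaying a delta: drop the removed triples, then add the new ones.
    return (frozenset(state) - delta.removed) | delta.added


# Illustrative data: one source revises the year a polity ended.
before = {
    ("polity:example", "rdf:type", "Polity"),
    ("polity:example", "ended", "600 BC"),
}
after = {
    ("polity:example", "rdf:type", "Polity"),
    ("polity:example", "ended", "632 BC"),
}

delta = diff(before, after)
print("added:  ", sorted(delta.added))
print("removed:", sorted(delta.removed))
```

In this model a changed value shows up as one triple removed and one triple added, and the unchanged triple does not appear in the delta at all.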
There have been 110,000 downloads of the database since it was open sourced in January. Organizations such as GitHub are trialing it, and giant Irish grocery wholesaler Musgrave Group is using it to run machine learning pipelines, he said.
To date, the team’s work has focused largely on Python users. Going forward, it wants to integrate with other enterprise software, notably for Excel users, whom Luke Feeney describes as largely overlooked.
“We want to allow Excel users to remain within their environment where they can do all of their work … So the GUI remains the same, but behind the scenes, they get modern workflow capabilities, where they can push their changes out to their colleagues, or pull the latest version of the Excel worksheet from a central repository, and then be able to go back to earlier versions very, very easily, and have all the versions saved in memory.”