TileDB: Managing Big Data Storage in Multiple Dimensions
Increasingly, the world’s data doesn’t fit neatly into the rows and columns of a relational database. Genomic and geospatial data, for instance, involves massive datasets represented in two or three dimensions. “As applications grow, you have to rethink storage management because not all the data fits in memory,” said Stavros Papadopoulos, the CEO of TileDB, a storage manager for multidimensional data.
He describes TileDB as a library for storing multidimensional data on any back end using a familiar interface.
“If you have a fancy geospatial application that focuses on visualization, you don’t have to create your own storage manager; it stores three-dimensional objects very efficiently on your laptop… S3 or Azure. All the complexity is taken away. You can be creative, build your own applications on top or do analytics on top. … All the magic happens behind the scenes.”
TileDB can handle both dense and sparse arrays. The format of multidimensional data is quite different from that in a relational database, he explained.
A dense array, such as a photograph, is composed of multiple cells, or pixels, each of which contains information about its location as well as the amount of red, blue and green in it. In a sparse array, on the other hand, such as a point cloud, many cells contain no information. Sparse arrays cannot be treated the same as dense arrays.
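The difference is easiest to see side by side. The sketch below uses plain Python containers as stand-ins; the names `dense_image` and `sparse_points` are hypothetical and nothing here reflects TileDB's actual on-disk format.

```python
# Illustrative only: plain-Python stand-ins, not TileDB's storage format.

# Dense 2x3 "image": every cell holds an (R, G, B) value, so the position
# of a value in the nested list already implies its coordinates.
dense_image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(10, 10, 10), (20, 20, 20), (30, 30, 30)],
]

# Sparse 1000x1000 "point cloud": only three cells are non-empty, so each
# stored value has to carry its coordinates explicitly.
sparse_points = {
    (3, 7): 1.5,
    (500, 20): 2.25,
    (999, 999): 0.75,
}


def dense_cell_count(rows):
    """Number of cells stored by the dense layout."""
    return sum(len(row) for row in rows)


def sparse_fill_ratio(points, shape):
    """Fraction of the logical domain that is actually populated."""
    n_rows, n_cols = shape
    return len(points) / (n_rows * n_cols)


print(dense_cell_count(dense_image))                    # 6
print(sparse_fill_ratio(sparse_points, (1000, 1000)))   # 3e-06
```

Storing the sparse example densely would mean materializing a million cells to hold three values, which is why the two kinds of array call for different handling.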
Using a relational database for multidimensional arrays brings huge storage and computational costs, while you can achieve huge gains if you can retrieve this data in its native format, he said.
In machine learning, the analytics stack relies on linear algebra, which works with matrices, that is, two-dimensional arrays. High-performance linear algebra libraries expect data in this two-dimensional format, so serving them data in its native array format makes things much, much faster, he said.
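As a rough illustration of that point, the sketch below contrasts data held as relational-style `(row, col, value)` records with the same data already held as a matrix. The helpers `rows_to_matrix` and `matvec` are hypothetical and written in plain Python; a real stack would hand the matrix to an optimized linear algebra library.

```python
def rows_to_matrix(rows, n_rows, n_cols):
    """Pivot (row, col, value) records into a 2D matrix (list of lists)."""
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, value in rows:
        matrix[r][c] = value
    return matrix


def matvec(matrix, vector):
    """Multiply a 2D matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# The same 2x2 data, once as relational records and once as a native matrix.
relational_rows = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
native_matrix = [[1.0, 2.0], [3.0, 4.0]]
vector = [1.0, 1.0]

# Relational data needs an extra pivot step before any linear algebra.
pivoted = rows_to_matrix(relational_rows, 2, 2)
print(matvec(pivoted, vector))        # [3.0, 7.0]

# Data already in array form can be used directly.
print(matvec(native_matrix, vector))  # [3.0, 7.0]
```

The pivot is trivial at this size, but it is work that has to be repeated on every read when the data lives in rows and the consumer wants a matrix.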
TileDB is not a database like MySQL or SQL Server. It is essentially a library that can be embedded in the programs data scientists write. TileDB interfaces with the languages they use, such as Python or R, and supports multiple storage back ends, such as HDFS or S3-compliant object stores like AWS S3, Minio or Ceph. Your data might be in a supercomputing facility, where a distributed file system spreads it across multiple machines, or in the cloud on AWS, Azure or Google Cloud.
“It’s very important for this library to integrate very efficiently with all these back ends so the user does not have to think about it. In TileDB, all these details are abstracted,” Papadopoulos said.
Last year, Cambridge, Mass.-based TileDB was spun out of a big data collaboration between Intel Labs and the Massachusetts Institute of Technology. It announced a $1 million seed round of funding in October, led by Intel Capital and Nexus Venture Partners.
Edmon Begoli, chief data architect at the Oak Ridge National Laboratory, said at the time: “I consider TileDB one of the most sophisticated, best-written solutions for high-performance scientific data management that I have seen in years.”
TileDB earlier collaborated with the Broad Institute on creating a version of TileDB called GenomicsDB. The Broad Institute stores terabytes of genomics data modeled as a huge sparse 2D array.
In a research paper, TileDB touts being faster than the HDF5 dense array storage manager, the SciDB array database system for both dense and sparse arrays, and the Vertica relational column-store for dense arrays, and at least as fast as Vertica for sparse arrays.
An array in TileDB is physically stored as a directory in the underlying file system.
Its key idea is to organize array elements into ordered collections called fragments. Each fragment is dense or sparse, and it groups related array elements into regular-sized chunks of fixed capacity, which it calls data tiles. Cells that are accessed together are co-located on the disk and in memory to minimize disk seeks, page reads, and cache misses.
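To make the tiling idea concrete, here is a minimal sketch for a dense 2D array, assuming fixed-size rectangular tiles. The functions `tile_of` and `tiles_for_range` are hypothetical names chosen for illustration, not TileDB calls.

```python
def tile_of(cell, tile_extent):
    """Return the (tile_row, tile_col) of the tile containing a 2D cell."""
    row, col = cell
    tile_rows, tile_cols = tile_extent
    return (row // tile_rows, col // tile_cols)


def tiles_for_range(row_range, col_range, tile_extent):
    """List the tiles a rectangular read must touch (ranges are inclusive)."""
    tile_rows, tile_cols = tile_extent
    first_tr, last_tr = row_range[0] // tile_rows, row_range[1] // tile_rows
    first_tc, last_tc = col_range[0] // tile_cols, col_range[1] // tile_cols
    return [(tr, tc)
            for tr in range(first_tr, last_tr + 1)
            for tc in range(first_tc, last_tc + 1)]


# An 8x8 array split into 4x4 tiles has four tiles in total.
print(tile_of((5, 2), (4, 4)))                  # (1, 0)
print(tiles_for_range((0, 3), (0, 3), (4, 4)))  # [(0, 0)]
print(tiles_for_range((2, 5), (2, 5), (4, 4)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A read aligned with a tile touches one chunk, while a read of the same size that straddles tile boundaries touches four. That is the cost co-locating cells that are accessed together is meant to avoid.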
This organization turns random writes into sequential writes and boosts read efficiency with its own algorithm.
Application needs determine the choice of global cell order: For example, if an application reads data a row at a time, data should be laid out in rows rather than in columns.
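The effect of cell order on a row-at-a-time read can be sketched with simple offset arithmetic. The helper names below are hypothetical, and the example considers only row-major and column-major layouts of a small 2D array.

```python
def row_major_offset(row, col, n_rows, n_cols):
    """Position of a cell when the array is laid out one row after another."""
    return row * n_cols + col


def col_major_offset(row, col, n_rows, n_cols):
    """Position of a cell when the array is laid out one column after another."""
    return col * n_rows + row


def is_contiguous(offsets):
    """True if the offsets form one unbroken run on disk."""
    ordered = sorted(offsets)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


N_ROWS, N_COLS = 4, 6

# Offsets touched when reading all of row 2 under each layout.
row_2_in_row_major = [row_major_offset(2, c, N_ROWS, N_COLS) for c in range(N_COLS)]
row_2_in_col_major = [col_major_offset(2, c, N_ROWS, N_COLS) for c in range(N_COLS)]

print(row_2_in_row_major, is_contiguous(row_2_in_row_major))
# [12, 13, 14, 15, 16, 17] True
print(row_2_in_col_major, is_contiguous(row_2_in_col_major))
# [2, 6, 10, 14, 18, 22] False
```

With a row-major layout the read is one sequential scan; with a column-major layout the same read jumps across the file.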
In sparse arrays, the user specifies the data tile capacity, and TileDB then creates the data tiles so that they all hold the same number of non-empty cells, equal to the capacity.
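A minimal sketch of capacity-based tiling follows, under two assumptions that are not taken from the article: cells are sorted in row-major order, and each tile records a bounding box of its cells. The function `build_sparse_tiles` is a hypothetical name, and in this simplified version the last tile may hold fewer cells than the capacity.

```python
def build_sparse_tiles(cells, capacity):
    """Group non-empty cells into tiles of `capacity` cells each.

    `cells` maps (row, col) -> value. Cells are sorted in row-major order
    (an assumption for this sketch), then cut into fixed-size chunks.
    """
    ordered = sorted(cells.items())  # sort by (row, col)
    tiles = []
    for start in range(0, len(ordered), capacity):
        chunk = ordered[start:start + capacity]
        rows = [coord[0] for coord, _ in chunk]
        cols = [coord[1] for coord, _ in chunk]
        tiles.append({
            "cells": chunk,
            # Bounding box of the chunk: (top-left, bottom-right) corners.
            "bounding_box": ((min(rows), min(cols)), (max(rows), max(cols))),
        })
    return tiles


points = {(0, 1): "a", (0, 5): "b", (2, 2): "c",
          (3, 0): "d", (7, 7): "e", (7, 9): "f"}

for tile in build_sparse_tiles(points, capacity=2):
    print(len(tile["cells"]), tile["bounding_box"])
# 2 ((0, 1), (0, 5))
# 2 ((2, 0), (3, 2))
# 2 ((7, 7), (7, 9))
```

Because every tile holds the same number of non-empty cells, tiles covering empty regions are never created, and the cost of reading a tile stays predictable no matter how unevenly the data is spread.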
Writes are performed in batches, which improves performance, and each batch is written sequentially to a separate fragment. Sparse fragments can be used to speed up random writes even in dense arrays.
The TileDB read algorithm efficiently finds the most recently updated fragment and avoids unnecessary tile reads when a portion of a fragment is totally covered by a newer fragment. Because read performance degrades as the number of fragments grows, a consolidation algorithm works in the background, merging fragments while concurrent reads and writes continue.
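The fragment model can be sketched in a few lines. This is a toy, in-memory approximation: the functions `write_batch`, `read_cell` and `consolidate` are hypothetical names, fragments are plain dictionaries, and the background concurrency described above is not modeled.

```python
def write_batch(fragments, batch):
    """Append one batch of {coord: value} writes as a new fragment."""
    fragments.append(dict(batch))


def read_cell(fragments, coord):
    """Return the newest value for `coord`, checking newest fragments first."""
    for fragment in reversed(fragments):
        if coord in fragment:
            return fragment[coord]
    return None  # the cell was never written


def consolidate(fragments):
    """Merge all fragments into one; later writes override earlier ones."""
    merged = {}
    for fragment in fragments:  # oldest to newest
        merged.update(fragment)
    return [merged]


fragments = []
write_batch(fragments, {(0, 0): 1, (0, 1): 2})
write_batch(fragments, {(0, 1): 20, (5, 5): 3})   # overwrites cell (0, 1)

print(len(fragments), read_cell(fragments, (0, 1)))   # 2 20

fragments = consolidate(fragments)
print(len(fragments), read_cell(fragments, (0, 1)))   # 1 20
```

Each write only appends, which is what keeps writes sequential; the price is that a read may have to look through several fragments until consolidation folds them together.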
Each tile, or chunk of data, is compressed, with the compressor chosen from several options depending on the nature of the data, which reduces storage costs, Papadopoulos said.
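One way to picture per-tile compression is to try several codecs on each tile and keep the smallest result. That selection rule is an assumption made purely for illustration, as is the choice of Python's standard-library `zlib`, `bz2` and `lzma` modules; the article does not say which compressors TileDB uses or how one is picked.

```python
import bz2
import lzma
import zlib

# Candidate codecs: name -> (compress, decompress). The set is an assumption.
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_tile(raw):
    """Compress one tile with every codec and keep the smallest output."""
    candidates = [(name, compress(raw))
                  for name, (compress, _) in COMPRESSORS.items()]
    return min(candidates, key=lambda pair: len(pair[1]))


def decompress_tile(name, blob):
    """Restore a tile using the codec recorded at compression time."""
    _, decompress = COMPRESSORS[name]
    return decompress(blob)


# Two tiles with different content: all zeros vs. a repeating byte ramp.
tiles = [bytes(1000), bytes(range(256)) * 4]

for raw in tiles:
    name, blob = compress_tile(raw)
    assert decompress_tile(name, blob) == raw  # lossless round trip
    print(name, len(raw), "->", len(blob))
```

Recording the codec name alongside each compressed tile is what allows different tiles in the same array to use different compressors.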
TileDB is a C++ library that offers APIs in C, C++ and Python. APIs in R, Java and other languages are in the works.
“[They’re] trying to come up with a Python solution and the R world is trying to come up with an R solution. What we’re saying is that we should be doing a universal solution to work out all the details – have an API for you, for Python, for R, for Java. You don’t have to worry about this, just use us and you can build your own fast analytics on top,” Papadopoulos said.
Beyond its work in genomics, TileDB is looking to branch out into other fields, such as Lidar data — three-dimensional points in space — and time-series data, used heavily in financial services.
So far, it’s all still open source. TileDB, the company, is working toward building an enterprise-ready commercial offering, the timing of which depends on its next round of funding, Papadopoulos said.