San Francisco start-up MapD has released a database system, ParallelDB, built to run on GPUs (graphics processing units), which can be used to explore multi-billion-row datasets in milliseconds, according to the company.
The idea of using GPUs for database work may initially seem unusual, but after you think about it for a bit, you start to wonder why no one has commercialized the idea before.
“Imagine an SQL query. Or any kind of relational operator, doing the same thing over every row of data. That lends itself really well to the vector model of GPUs,” said Todd Mostak, founder and CEO of MapD.
GPUs offer massive parallelism, or the ability to carry out a computation task across a large number of vectors simultaneously, a vital operation for rendering graphics across a computer screen. There is no reason why this parallelism couldn’t also be used for data analysis; a database row is, after all, nothing more than a single vector. And visualizing the data directly from the GPUs would, of course, dramatically reduce the amount of data shuffling that typically takes place to create such graphics.
Today, the largest bottlenecks for database systems are CPU and memory. As it turns out, GPUs have both in spades. Mostak designed a GPU-based database architecture that could offer 100x speedups over traditional CPU-based database systems (read: pretty much all database systems), offering the capability of executing a query in milliseconds rather than minutes.
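To make the row-as-vector idea concrete, here is a minimal sketch in plain Python. It is not MapD's code: the column names (`price`, `quantity`) and the `revenue_kernel` and `launch` functions are hypothetical. The point is only that the same small function runs independently for every row index, which is the shape of work a GPU spreads across its cores.

```python
# Illustrative only: a per-row "kernel" applied to every row of a tiny table.
# On a GPU, each row index would be handled by its own thread; here a plain
# loop stands in for that parallel launch.

# Hypothetical columns, stored as parallel lists (one value per row).
price = [9.99, 24.50, 3.25, 102.00]
quantity = [3, 12, 40, 1]


def revenue_kernel(i, price, quantity, out):
    """Compute one row's result; reads and writes only index i."""
    # Similar in spirit to: SELECT price * quantity ... WHERE quantity > 10
    out[i] = price[i] * quantity[i] if quantity[i] > 10 else 0.0


def launch(kernel, n_rows, *args):
    """Stand-in for a GPU kernel launch: run the kernel once per row index."""
    for i in range(n_rows):  # iterations are independent, so order is irrelevant
        kernel(i, *args)


out = [0.0] * len(price)
launch(revenue_kernel, len(price), price, quantity, out)
total = sum(out)  # on a GPU this step would be a parallel reduction
```

Because no row's result depends on any other row, the loop in `launch` could be split across any number of workers without changing the answer.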
MapD could, for instance, be set up on a single server with eight GPU cards, a setup that could offer a throughput of 3TB per second across 40,000 GPU cores.
Initially, MapD would be most attractive to big data projects involving log analytics, geographical information systems, business intelligence, and social media analytics.
The technology has already been tested by a number of large companies in telecommunications, retail, finance, and advertising. Digital advertising company Simulmedia has been testing MapD to match inventory availability against ad units. Facebook, Nike, and Verizon are kicking the tires, as is MIT Lincoln Laboratory.
The company has raised $10 million in Series A funding from a consortium of investors, including Google Ventures, Verizon Ventures, and, naturally, GPU maker Nvidia.
The Inevitable Twitter Challenge
Mostak developed the idea for a GPU-powered database system while he was a student doing research at the MIT Computer Science and Artificial Intelligence Laboratory, working under database luminaries Sam Madden and Mike Stonebraker.
Mostak wasn’t even majoring in computer science. Living in Egypt and Syria, Mostak was pursuing Middle Eastern studies at Harvard University. His final thesis project involved analyzing a lot of tweets, and initially Mostak was using PostgreSQL along with Python and C code.
“Everything was just taking too long,” he said, noting that he had to run analysis jobs overnight. Mostak was taking computer science as an elective, so at the time he was enrolled in a GPU programming class, where the idea for a GPU database system germinated.
The first prototypes didn’t yet challenge the sizes of in-memory database systems, MapD’s chief competitors. Harvard deployed an instance that ran 16GB across four GPUs. However, the major strides that GPU builders are making, spurred on by 4K gaming and deep learning analysis, ensure successive generations of ever more powerful cards.
Now a MapD database running on a single server, installed with eight Nvidia Tesla K80s, could be as large as 192GB. Nvidia’s next-generation Pascal architecture-based cards, high-performance SKUs of which will hold 32GB of VRAM, will set the stage for 500GB databases rivaling the performance of in-memory databases.
Let’s stop for a second and reflect on this: MapD is promising a half-terabyte database running at transactional speeds on a single server.
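The capacity figures follow from simple multiplication. The sketch below assumes 24GB of memory per Tesla K80 card (Nvidia's published specification, not a number stated in this article) and takes the 32GB Pascal figure from the text; the function names are hypothetical.

```python
import math


def server_capacity_gb(cards, gb_per_card):
    """Total GPU memory in one server, in GB."""
    return cards * gb_per_card


def cards_needed(target_gb, gb_per_card):
    """Smallest whole number of cards whose combined memory reaches target_gb."""
    return math.ceil(target_gb / gb_per_card)


# Assumption: 24GB per Tesla K80 card.
k80_capacity_gb = server_capacity_gb(cards=8, gb_per_card=24)

# 32GB per high-end Pascal card is the figure quoted in the article.
pascal_cards_for_500gb = cards_needed(target_gb=500, gb_per_card=32)
```

Under these assumptions eight K80s give the quoted 192GB, and a 500GB database would need 16 of the 32GB cards; how many cards a given chassis can hold is not something the article states.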
MapD is not the first party to investigate the use of GPUs for database systems. The idea has been kicking around academia for a while. GPUdb, out of Arlington, Virginia, offers what it claims is the first GPU-accelerated database system.
Most of the approaches to date use the GPU as an accelerator. The problem with this approach is that any gains achieved from greater computational efficiencies are squandered by the time it takes to pass data over the PCI bus, Mostak argued. MapD’s approach is just to make the GPUs the computational elements themselves (you can run ParallelDB on regular CPUs, though this approach offers no particular speed advantage).
ParallelDB is a column store database. The system takes incoming vanilla SQL queries and, using the Apple-championed open source LLVM (Low Level Virtual Machine) compiler, reduces them to IR (intermediate representation), which it then compiles to GPU code, with an emphasis on vectorizing the execution of the various SQL operators. The company has some patent-pending technology for caching hot data in each GPU’s RAM to add extra pep.
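Mostak's argument can be checked with back-of-envelope arithmetic. The sketch below is a rough model, not a measurement: the 16 GB/s bus figure is an assumed nominal rate for a PCIe 3.0 x16 link, the per-card memory bandwidth is derived from the article's 3TB per second across eight cards, and the 10GB column is a hypothetical workload.

```python
# Assumption: nominal bandwidth of a PCIe 3.0 x16 link, in GB/s.
PCIE_GB_PER_S = 16.0

# Derived from the article: 3TB/s aggregate over eight cards is 375 GB/s per card.
GPU_MEM_GB_PER_S = 3000.0 / 8


def seconds_to_move(gigabytes, gb_per_s):
    """Time to stream a given amount of data at a given bandwidth."""
    return gigabytes / gb_per_s


column_gb = 10.0  # hypothetical column to be scanned

transfer_s = seconds_to_move(column_gb, PCIE_GB_PER_S)   # copy over the bus
scan_s = seconds_to_move(column_gb, GPU_MEM_GB_PER_S)    # scan already-resident data
bus_penalty = transfer_s / scan_s                        # how much slower the copy is
```

With these numbers, shipping the column over the bus takes roughly 23 times longer than scanning it once it is resident on the card, which is why keeping data in GPU memory matters more than raw compute.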
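The query-compilation step can be illustrated by analogy. The sketch below is not MapD's implementation and does not use LLVM: generated Python source stands in for the intermediate representation, and Python's built-in `compile` stands in for the code generator. The function `compile_sum_query`, the table, and its column names are all hypothetical.

```python
# Illustrative analogy only: generate specialized code for one query, then run it.
# Python source text plays the role that LLVM IR plays in the article.

# Whitelist of comparison operators the toy generator will emit.
ALLOWED_OPS = {"<", "<=", ">", ">=", "==", "!="}


def compile_sum_query(sum_col, filter_col, op, constant):
    """Build a function for: SELECT SUM(sum_col) WHERE filter_col <op> constant."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    if not (sum_col.isidentifier() and filter_col.isidentifier()):
        raise ValueError("column names must be simple identifiers")
    if not isinstance(constant, (int, float)):
        raise ValueError("constant must be numeric")

    # The stand-in "IR": a fused filter-and-sum loop with the column names,
    # operator, and constant baked in, so nothing is interpreted per row.
    source = (
        "def query(columns):\n"
        f"    values = columns[{sum_col!r}]\n"
        f"    keys = columns[{filter_col!r}]\n"
        "    total = 0\n"
        "    for i in range(len(keys)):\n"
        f"        if keys[i] {op} {constant!r}:\n"
        "            total += values[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated-query>", "exec"), namespace)
    return namespace["query"], source


# Hypothetical column-store table: one list per column, not one record per row.
table = {
    "amount": [120, 40, 310, 75],
    "region_id": [1, 2, 1, 3],
}

query, generated = compile_sum_query("amount", "region_id", "==", 1)
result = query(table)  # sums amount over rows where region_id == 1
```

The columnar layout means the generated loop touches only the two columns the query names, and the loop body is the kind of uniform per-row work that a real code generator could map onto GPU threads.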
The beauty of GPUs is that they have hella cores. A server can run about 10 to 30 CPU cores, but about 40,000 GPU cores. Granted, GPU cores are pretty dumb compared to CPUs, “but you can process a lot with them,” Mostak said.
But the maximum core count is not the only advantage GPUs bring.
“People think GPUs are great because they have so much computational power but we think that you really win because GPUs have so much memory bandwidth,” Mostak said. The Pascal cards will be able to scan data at a rate of 8TB per second, a huge jump over CPU capabilities.
The accompanying visualization software can pull the computations directly from the GPUs into OpenGL for rendering. “We can place output of the SQL queries into the rendering pipeline,” Mostak said. This could be useful for, say, displaying a million points on a map, or creating unusually dense scatterplots or network graphs. In addition to working with its own visualization software, ParallelDB can also work with other ODBC (Open Database Connectivity)-fluent business intelligence suites such as Tableau.
What is the advantage of all this power? Reduced costs and performance improvements.
“MapD has come up with a unique solution for being able to analyze and query massive amounts of data in real-time using GPU technology,” said James E. Curtis, senior analyst of data platforms and analytics for 451 Research, in a statement. “They are dramatically reducing querying times in a very cost effective way which makes MapD a very disruptive force in the big data market.”
In one test, Verizon benchmarked MapD against a set of 20 Apache Impala servers churning through 3 billion rows. It took the Impala kit 15-20 seconds, whereas it took a single MapD server around 160 milliseconds.
As a result, MapD could pose a lower-cost alternative to columnar stores such as Vertica and Amazon’s Redshift. MapD’s ParallelDB and its Immerse visualization software can be procured as software for on-site deployment, or as a service from either IBM SoftLayer or Amazon Web Services.
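The size of that gap is easy to work out from the quoted figures. The sketch below uses only the numbers reported above; the per-server comparison is a naive normalization that assumes the Impala cluster's time would scale linearly with server count, which the article does not claim.

```python
# Figures as reported for the Verizon test.
IMPALA_SECONDS = (15.0, 20.0)   # reported range for the 20-server cluster
MAPD_SECONDS = 0.160            # single MapD server
IMPALA_SERVERS = 20
ROWS = 3_000_000_000

# Wall-clock speedup of one MapD server over the whole Impala cluster.
speedup_low = IMPALA_SECONDS[0] / MAPD_SECONDS
speedup_high = IMPALA_SECONDS[1] / MAPD_SECONDS

# Naive per-server comparison (assumes linear scaling of the cluster).
per_server_low = speedup_low * IMPALA_SERVERS
per_server_high = speedup_high * IMPALA_SERVERS

# Effective scan rate of the MapD server, in rows per second.
rows_per_second = ROWS / MAPD_SECONDS
```

That works out to a wall-clock speedup of roughly 94x to 125x, in line with the 100x figure cited earlier, and an effective rate of close to 19 billion rows per second.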
IBM is a sponsor of The New Stack.
Feature image: Nvidia’s newly released GTX 1080, the first card based on the company’s Pascal 16 nm fabrication technology.