Deep Information Sciences Offers a Self-Tuning Database System Built on Machine Learning
In the ever-more-crowded world of databases, Deep Information Sciences has reimagined the database for big data and the cloud with its latest release of deepSQL.
The package combines machine learning with a relational database designed to run data streaming, transactions and analytics at cloud scale concurrently. It’s based on founder and Chief Technology Officer Thomas Hazel’s ten years of work on the basic underlying science behind databases.
“Database science has not advanced since the late 1970s at the algorithm level,” says chief strategy officer Chad Jones, referring to B-trees, B+ trees and other algorithms. “One’s good at writing, one’s good at reading, but they can’t do both, and you have to constantly optimize them.”
“In the 21st century, we’ve wrung out how far these algorithms can go, and they’re really hampering us from doing a lot of things, like elastic scale and things we want to do inside the cloud,” Jones said.
The Boston-based company was founded in 2010. It brought deepSQL to market last April, then added Docker support and joined the AWS Marketplace in September. Its latest release, which it calls an “adaptive database,” came last month. It has raised $18 million in investment from Sigma Prime Ventures, Stage 1 Ventures, AlphaPrime Ventures and others.
The company aims to address problems associated with traditional databases, including concurrency, caching and calibration, that limit flexibility, performance and scale. As a plus, it is touted as being “self-tuning,” with no human optimization required.
The company previously offered an adaptive engine that sat under MySQL or Percona databases; it’s now gone one step further and integrated its open source database distribution while remaining compatible with the MySQL ecosystem, according to analyst Jason Stamper at 451 Research.
The company’s answer to the problems of traditional databases is called CASSI (Continuously Adapting Sequential Summarization of Information), based on a self-tuning loop encompassing four stages: observe and analyze; predict and adapt; orchestrate and optimize; and close the loop.
With a traditional database, a DBA has to optimize the different parameters to tune the performance for an application.
“The problem with that is if I’m going to do all this [stuff] that requires me to restart the database — or in some cases to rebuild the database — I’m doing that offline, and it’s trial and error. And they’re optimizing for a write case or a read case, not both,” Jones explained.
“We’re able to separate memory from the disk,” Jones said. “We’ve separated those things out so they can be configured independently. We’ve implemented machine learning, and it is able to model where resources are on the machine, the workload types on the machine and their requirements, and the information inside the machine, its structure, its metadata, those types of things.”
As workloads are interacting with the system, and while remaining online, machine learning will predict what needs to be organized into memory to handle the concurrent running of workloads.
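The observe, predict and adapt cycle described above can be illustrated with a toy feedback loop. This is a minimal sketch and not deepSQL's actual implementation: the class name `SelfTuningLoop`, the sliding-window read ratio used as the "prediction," and the `read_cache_fraction` setting are all hypothetical stand-ins for whatever models and parameters the real system uses.

```python
from collections import deque


class SelfTuningLoop:
    """Toy observe -> predict -> adapt loop (hypothetical, for illustration only)."""

    def __init__(self, window=100):
        # Observe: remember only the most recent operations ("read" or "write").
        self.ops = deque(maxlen=window)
        # The single knob being tuned: share of memory reserved for read caching.
        self.read_cache_fraction = 0.5

    def observe(self, op):
        self.ops.append(op)

    def predict_read_ratio(self):
        # Predict: assume the near future looks like the recent past.
        if not self.ops:
            return 0.5
        reads = sum(1 for op in self.ops if op == "read")
        return reads / len(self.ops)

    def adapt(self, step=0.5):
        # Adapt: move the knob part of the way toward the predicted need,
        # without restarting anything. Calling this repeatedly closes the loop.
        target = self.predict_read_ratio()
        self.read_cache_fraction += step * (target - self.read_cache_fraction)
        return self.read_cache_fraction


# A write-heavy burst shifts memory away from read caching while "online".
loop = SelfTuningLoop(window=10)
for op in ["write"] * 10:
    loop.observe(op)
print(loop.adapt())
```

The point of the sketch is only the shape of the loop: the setting changes gradually in response to the observed workload, rather than being fixed by a DBA ahead of time.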
“Say I have a database, and I have a snowstorm and shovels are selling like crazy, and I want to see how many bags of rock salt are selling so I can offer ‘buy a shovel and get 20 percent off a bag of rock salt,’” Jones said.
“Normally, I’d have to take that information, do ETL [Extract, Transform and Load] into a separate database, analyze it, and that takes a lot of time. That decreases the depth of insight into events while they’re occurring,” he said.
“What we’re able to do is as events come in, machine learning will watch what’s happening; it will watch transactions, and now I’m getting a workload optimized for analytics. I’m going to, on the fly, reorganize how information is presented in memory so I can handle transactions quickly still but also handle analytics asking for information from that data.”
DeepSQL writes data in an appendable manner.
“I don’t have to sit there and say, ‘Where’s an open block?’ And then write, because that takes IOPS—that takes a lot of overhead from an IOP perspective on the system. Most systems out there become IOPS-bound in scalability before they become CPU-bound or memory-bound. So throwing a larger machine at a database doesn’t necessarily get you more scalability because once IOPS runs out, more CPU and memory won’t help,” he said.
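The idea of appending rather than hunting for an open block can be sketched in a few lines. This is an illustrative assumption about append-only storage in general, not a description of deepSQL's on-disk format: the `AppendOnlyStore` class, its in-memory `index`, and the file name `data.log` are all hypothetical.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy append-only key-value store (hypothetical, for illustration only)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length) of the latest value
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, key, value):
        # Every write goes to the end of the file: no search for a free block
        # and no in-place overwrite of the old value.
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(value)
        # Only the in-memory index changes to point at the newest copy.
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    store = AppendOnlyStore(os.path.join(tmp, "data.log"))
    store.put("shovel", b"10")
    store.put("salt", b"25")
    store.put("shovel", b"12")  # an update is just another append
    print(store.get("shovel"), os.path.getsize(store.path))
```

The tradeoff in a design like this is that stale values stay in the file until something compacts it; the sketch leaves compaction out entirely.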
He says deepSQL reduces IOPS by 80 percent; then, when you add CPU and memory, scalability is linear and predictable.
“Our kernel for the database understands when I add those resources, so if I have a virtualization platform, I can add CPUs and memory on the fly, and it will recognize it and start using it without human intervention. So a DBA doesn’t have to sit there and optimize it. The machine learning handles all that based on the conditions in the ever-changing set of workloads, so the system can read, write and handle queries at a performance level that we’ve seen all the way up to 64X some of our competitors,” he said.
He says most database performance improvements come simply from moving to SSDs, while memory is faster still.
For in-memory systems, “my database can only be as big as the memory I have, and it’s not changing the science. It’s a faster medium, but the tradeoff is incredible. If I kick the cord out, I lose my database in memory. What we’re able to do is the best of in-memory systems with ACID compliance on disk. Some in-memory systems write to disk in the back, but we say all the data you need in memory to answer the questions coming in and do the writes and all that, we’ll predict what you need, but we’ll write it to disk as fast as it comes in, so you get the best of both worlds.”
He said the company just ran an example with 1.2 trillion rows in a single indexed table and was returning 1,200 records out of those rows in 0.3 seconds.
Stamper says he’s seeing growing interest in companies being able to handle both analytical and transactional workloads on the same database, as they seek faster answers to questions and try to avoid the ETL (extract, transform and load) step required when the two workloads are handled separately.
However, that might no longer be deepSQL’s biggest strength, he says.
“That’s because we believe its self-tuning capabilities, combined with its scalability, make it an attractive proposition for a wide number of industries that are starting to find their databases expensive or labor-intensive to scale up or out,” Stamper says.
The company has a range of customers, including Croatia’s Ruder Boskovic Institute for genomic research and WordPress-managed hosting provider GEMServers.
Its speed and self-tuning capabilities were attractive to Florida State University’s Research Computing Center (RCC), a High-Performance Computing/supercomputing shop without dedicated DBA staff.
RCC interim director and operations manager Paul Van Der Mark said deepSQL is ideal for researchers who work with complex data sets but who aren’t database experts.
“We had a few faculty who needed to run queries on ‘large’ databases, between 20 and 40 GB, and we even had one researcher who had never written SQL queries himself,” he said, adding that its genomic and meteorological data sets are much larger. “We tried the Deep engine and found that it gave very good performance compared with the standard MySQL storage engines without going through the effort of manually tuning variables.”
DeepSQL is integrated with popular cloud automation platforms including BOSH, Cloud Foundry, Chef, vCloud Director, Vagrant, Compose, OpenStack and virtualization platforms including VMware, KVM and others.
Feature Image: “Down” by Tobias Weyand, licensed under CC BY-SA 2.0.