How MemSQL Drives Real World Real-Time Analytics
The MemSQL database system is all about “capturing data quickly and allowing people to analyze it immediately so they can make decisions as fast as their business needs them to be,” as the company’s chief marketing officer Gary Orenstein put it.
MemSQL customers showcased how they’re harnessing these real-time data capabilities of the database at the recent Strata+Hadoop conference in San Jose, Calif.
Among those putting MemSQL to work:
Mobile advertising monetization platform TapJoy talked about how it runs queries more than 200,000 times a minute to determine how it should bid on advertising to drive a set number of application downloads for which developers get paid.
Uber explained how, using MemSQL’s geospatial functions, it can build internal dashboards that drill down to individual city-specific views. In its internal Apollo project, it implemented a custom query language for real-time geospatial analytics, allowing local managers to run ad-hoc queries on their own locales.
However, the project didn’t start with a mandate from above, Macy’s principal engineer Chandan Joarder, and Raj Sriram, Macy’s digital data engineering manager, explained. They had no budget for the project, so they asked for volunteers and tried to make it fun — and it has had some unexpected results.
For instance, the dashboard keeps a top-10 list of best-selling items, including the perennial favorite, Michael Kors handbags. When someone noticed that wasn’t among the top 10 on the dashboard, a look at the website indicated that item was unavailable. When the e-commerce team was consulted about a potential problem with the site, the response was, “How did you know?”
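The check in that anecdote, noticing that an expected best-seller has dropped out of the top 10, can be sketched in a few lines. This is only an illustration: the function names, item names and sales figures below are hypothetical and are not taken from the retailer's dashboard.

```python
from collections import Counter


def top_sellers(sales, n=10):
    """Return the n best-selling item names, highest unit count first."""
    totals = Counter()
    for item, units in sales:
        totals[item] += units
    return [item for item, _ in totals.most_common(n)]


def missing_favorites(sales, expected, n=10):
    """List expected best-sellers that are absent from the current top n."""
    current = set(top_sellers(sales, n))
    return [item for item in expected if item not in current]


# Hypothetical sales events: (item name, units sold)
sales = [("boots", 5), ("scarf", 3), ("boots", 2), ("watch", 4)]

print(top_sellers(sales, n=2))
print(missing_favorites(sales, expected=["handbag", "boots"], n=2))
```

An item flagged by `missing_favorites` is a prompt to look at the site, not proof of an outage; a favorite can also fall out of the list for ordinary reasons.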
The system consumes data over an API; the data goes into Kafka, then into the database. MemSQL’s ingestion engine means it can consume data from Kafka without any code, Joarder said. Data arrives in 10- to 15-minute batches.
The company previously did batch processing overnight that managers used the next day, Joarder explained, but the dashboard provides more-frequent visibility into problems that can greatly affect the bottom lines of struggling retailers.
The company was consuming about 50 million records per day, but that will grow to about 100 million records per day this year, Joarder said.
Also at the conference, MemSQL CEO and co-founder Eric Frenkiel highlighted the company’s work with Thorn, a nonprofit using facial recognition and image mapping in the fight against sexual exploitation of children. In 2016, the organization identified and returned 2,000 missing children to their parents, Orenstein said.
Similar to the work that Giant Oak does to fight human trafficking, Thorn analyzes 100,000 escort-service ads posted on the web each day. Thorn’s big challenge was to match the fingerprint of one photo against thousands of others while a fire hose of new photos was coming into its database, Eric Boutin, engineering team lead for MemSQL in Seattle, explained in a blog post. He added database operators that perform linear algebra operations on vectors, focusing on machine-learning scoring, to enable image matching in real time and reduce latency.
By adding a vector dot product operation to the database, image-matching time has been cut from 20 minutes to 200 milliseconds, Frenkiel said.
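To make the idea concrete, here is a minimal sketch of dot-product scoring: one photo fingerprint is compared against a set of stored fingerprints, and the closest ones are kept. It is plain Python, not MemSQL's operator, and the three-element vectors, photo names and 0.9 threshold are assumptions chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products act as similarity scores."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def best_matches(query, stored, threshold=0.9):
    """Score the query against every stored vector; keep scores >= threshold."""
    q = normalize(query)
    scored = [(name, dot(q, normalize(vec))) for name, vec in stored.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints; real ones would have many more dimensions.
stored = {
    "photo_a": [1.0, 0.0, 0.0],
    "photo_b": [0.9, 0.1, 0.0],
    "photo_c": [0.0, 1.0, 0.0],
}
matches = best_matches([1.0, 0.0, 0.0], stored)
print([name for name, _ in matches])
```

Running the scoring inside the database, as the article describes, avoids moving the stored vectors out to an application for a loop like this one.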
MemSQL, launched in 2011, embraced SQL at a time when the rest of the world seemed enamored with NoSQL alternatives such as MongoDB, Cassandra and HBase in the pursuit of scale.
Evan Weaver, CEO at FaunaDB, another new SQL-based startup, says companies have been “badly burned” and are as frustrated with their databases as he and the rest of the team at Twitter were a decade ago.
“…For the most part [NoSQL users have] fled back to legacy relational databases because it’s the devil they know. But it only scales so far and has all kinds of isolation and availability problems. It’s painful to operate, but at least they know where the bodies are buried. Or they’ve fled to cloud where it’s somebody else’s problem,” he said recently.
Google’s Cloud Spanner is another newly announced SQL-based database.
MemSQL bills itself as a real-time data warehouse. Its three parts (an ingestion engine, memory-optimized tables for live data and disk tables for historical data) mean customers can build data pipelines without integration work and use both incoming and historical data for transactions and analytics.
Real-time can mean different things to different users, Frenkiel pointed out. Some customers want updates hourly or every few minutes, while others seek continuous live updating.
He pointed to one social network customer that’s ingesting one gigabyte of data per second into MemSQL for real-time processing and to a content delivery network customer that’s processing nearly 10 million updates per second.
Speaking at Spark Summit East in February, product manager Steven Camiña outlined the requisites for building a data pipeline: a message queue such as Kafka for data coming in different formats; a transformation tier such as Spark; a data persistence layer; and real-time visualization.
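The four stages in that list can be sketched end to end with stand-ins from the Python standard library. Everything here is an assumption made for illustration: `queue.Queue` stands in for Kafka, a plain function for the Spark transformation tier, in-memory SQLite for the persistence layer, and an aggregate query for the dashboard. The message fields are hypothetical.

```python
import json
import queue
import sqlite3


def transform(raw):
    """Transformation tier: parse a JSON message into a clean (city, amount) row."""
    event = json.loads(raw)
    return event["city"].strip().title(), float(event["amount"])


def run_pipeline(messages):
    # Message queue stand-in (Kafka in the talk).
    q = queue.Queue()
    for message in messages:
        q.put(message)

    # Persistence layer stand-in (in-memory SQLite).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (city TEXT, amount REAL)")
    while not q.empty():
        db.execute("INSERT INTO sales VALUES (?, ?)", transform(q.get()))
    db.commit()

    # Visualization stand-in: the aggregate a dashboard might poll.
    query = "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
    return db.execute(query).fetchall()


# Hypothetical messages arriving in slightly different formats.
messages = [
    '{"city": "san jose", "amount": "10.5"}',
    '{"city": "seattle", "amount": 4}',
    '{"city": "San Jose ", "amount": 2}',
]
print(run_pipeline(messages))
```

The transformation step is where inconsistent formats get reconciled, which is why the two spellings of the same city end up in one row of the result.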
MemSQL’s launch last fall of version 5.5 added MemSQL Pipelines, which introduced the SQL “create pipeline” syntax for building streaming pipelines from the command line, so that data streamed into message brokers such as Kafka can be used immediately.
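As a rough idea of what such a statement looks like, the helper below assembles one as a string. The general shape (CREATE PIPELINE ... AS LOAD DATA KAFKA ... INTO TABLE ...) is recalled from MemSQL's documentation and should be checked against the docs for the version in use; the helper itself and the pipeline, broker, topic and table names are hypothetical.

```python
def create_pipeline_sql(pipeline, broker, topic, table):
    """Build a CREATE PIPELINE statement for a Kafka source.

    Illustration only: identifiers are inserted as-is, with no escaping,
    so this should not be fed untrusted input.
    """
    return (
        f"CREATE PIPELINE {pipeline} AS\n"
        f"LOAD DATA KAFKA '{broker}/{topic}'\n"
        f"INTO TABLE {table};"
    )


# Hypothetical names for the pipeline, Kafka broker, topic and target table.
statement = create_pipeline_sql(
    "sales_pipeline", "kafka.example.com", "sales-events", "sales"
)
print(statement)
```

The statement would then be run from a SQL client connected to the database, which is the point of the feature: no separate ingestion program has to be written.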