
With a Full-Stack Columnar Data Store, Interana Can Answer Your Questions Quickly

6 Mar 2017 9:34am, by

It all starts with curiosity, said Bobby Johnson, Chief Technology Officer and co-founder of Interana, a company providing Interactive Analytics to clients with big data. Curiosity and a belief that companies run better when more people have access to data.

“Our mission is to make the data accessible to everyone,” Johnson said in an interview in San Francisco last week. “You make a thousand little decisions every day and those decisions will be better if you’re more informed.” Johnson, who helped run backend operations of Facebook for six years, wrote Scribe and was on the team that built Hive and Cassandra, left to start Interana with his wife Ann Johnson in 2012.

When he started at Facebook in 2006, it was not a data-driven company, he remembered. He argued with the other thirty developers about which features were most important. He built a log data aggregation tool at Facebook called Scribe so the team could see which features were being used and make decisions based on actual data.

“It changed the culture,” he said. “People went from arguing points of fact to arguing about more interesting things that mattered.”

He took that experience with him when they formed Interana. The idea behind the company is to give everybody access to all of the data (security permitting, of course), through what would turn out to be full-stack analytics and visualization software built on a columnar data store. Having everybody be able to look at the data and run queries makes everything more efficient.

“I never bought into this idea that there are priests of data,” he said. “A select few who wield the power and the rest are just dumb and have to have it packaged up and handed to them on a plate.”

“I spent his Facebook years at Splunk, where I was dealing with exhaustive log data,” said Christina Noren, Interana chief product officer. That was back when developers didn’t care how they logged.

Scribe, she said, is the first instance of developers logging with intent. They would say “I want to write this message knowing that machines will be consuming this down the line. Which was revolutionary.”

With Interana, Johnson and his co-founder Ann Johnson looked at several options before deciding to build their own system from scratch. Why not just use Hadoop?

That product is not optimized for any particular set of questions, said Noren, and assumes that you want to ask big questions that can run as batch jobs and can run over a period of time.

Asking a Different Question

“We ask a different question, based both on time and the fact that there are actors that are behaving,” she said. “We’re looking to answer questions; the core questions revolve around answering ‘what is happening?’”

Interana’s Visual Explorer Page

The users have a much more sophisticated knowledge of how the data needs to be put together to fit their needs, explained Johnson. For example, the most powerful thing is that the people who are asking questions don’t know the right question to ask until they start asking questions and getting responses.

This is familiar to all developers, who design systems to answer questions the business people say they want, only to have them ask for more data, or different angles when they see the results.

“It’s about quickly asking lots of imprecise questions to get better and better questions,” he said, “and that’s what Interana provides.”

For example, music systems provider and Interana customer Sonos collects data from its speakers and its iPhone app to give the company insight into what the customer experience is like from the time the speakers are plugged in until they play their first song. The company captures data at every event, allowing analysts to ask questions: Were there glitches? If so, where did they happen? Was it in the app, in the hardware or in confusing instructions?

“Anytime someone clicks a mouse or touches the screen, or touches a database or gets a bounce back, you want to capture that data,” said Johnson.

Across the board, he said, Interana customers ask: What are my customers doing and when are they doing it?

We consume the data in a data store pre-optimized for a new class of questions, explained Noren. “The questions that people want to ask imprecisely and imperfectly and iteratively and fast are questions about behavior.”

Questions like: How many devices that had more than five minutes between installation and first song play were part of households that churned within a month, versus those that were retained? This is the data that drives business decisions.
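A question of that shape can be sketched over a flat log of events keyed by actor and time. The example below is only an illustration, not Interana's query engine: the event names (`install`, `first_play`, `churn`), the field names, and the sample rows are all hypothetical, and the "within a month" window is assumed to mean 30 days from installation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event log: one row per event, keyed by actor (device) and time.
EVENTS = [
    {"device": "d1", "household": "h1", "event": "install", "ts": datetime(2017, 1, 1, 9, 0)},
    {"device": "d1", "household": "h1", "event": "first_play", "ts": datetime(2017, 1, 1, 9, 12)},
    {"device": "d2", "household": "h2", "event": "install", "ts": datetime(2017, 1, 1, 10, 0)},
    {"device": "d2", "household": "h2", "event": "first_play", "ts": datetime(2017, 1, 1, 10, 2)},
    {"device": "d3", "household": "h3", "event": "install", "ts": datetime(2017, 1, 2, 8, 0)},
    {"device": "d3", "household": "h3", "event": "first_play", "ts": datetime(2017, 1, 2, 8, 30)},
    {"device": "d1", "household": "h1", "event": "churn", "ts": datetime(2017, 1, 20, 0, 0)},
]


def slow_setup_outcomes(events, threshold=timedelta(minutes=5),
                        churn_window=timedelta(days=30)):
    """Count slow-setup devices by whether their household churned or was retained."""
    install, first_play, household, churn = {}, {}, {}, {}
    # Walk the log in time order and keep the earliest timestamp of each kind.
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["event"] == "install":
            install.setdefault(e["device"], e["ts"])
            household[e["device"]] = e["household"]
        elif e["event"] == "first_play":
            first_play.setdefault(e["device"], e["ts"])
        elif e["event"] == "churn":
            churn.setdefault(e["household"], e["ts"])

    outcomes = Counter()
    for device, installed_at in install.items():
        played_at = first_play.get(device)
        # Keep only devices whose first song came more than `threshold` after install.
        if played_at is None or played_at - installed_at <= threshold:
            continue
        churned_at = churn.get(household[device])
        churned = churned_at is not None and churned_at - installed_at <= churn_window
        outcomes["churned" if churned else "retained"] += 1
    return outcomes


result = slow_setup_outcomes(EVENTS)
```

With the sample rows, two devices took longer than five minutes to play a first song; one belongs to a household that churned and one to a household that stayed. Changing the threshold or the window and rerunning is the kind of quick iteration the article describes.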

Starting from Scratch

Having worked on answering similar questions at Facebook, Johnson realized the existing tools never really worked for analytics. “We spent years trying to get them to work,” he said. They were good for batch work, extract-transform-load (ETL) jobs and building indexes, but not for analytics. “Hadoop is cumbersome, and SQL was not set up to answer the questions that people care about.”

“Business users still don’t want to write SQL code 25 years later,” Noren said. “Go figure.”

So Interana built its own scale-out columnar database, with an eye to making it easy for everyone to ask questions that drive business.

“We’ve written a system that’s the best place for logs to end up so developers don’t have to be data butlers or analytics butlers for everyone else,” said Johnson. “We’re setting it up for all the other humans in your company.”

The Geeky Details

Johnson laid out the stack: The system runs on commodity Linux boxes. The code that actually reads and writes to disk and crunches through the data billions of rows at a time is C++. The goal, he explained, is to keep that code really simple and straightforward for the number crunching.

Next is a layer of Python for management of distributed systems, the management of queries, and a lot of the logic about the sampling. The database system employs a REST interface, JavaScript and React for the front end.

The Interana columnar database system has been benchmarked at 100 million rows per core per second, with the speed coming from executing queries directly on raw data, and running queries lock-free. Columns not referenced in a query get passed over, saving cache, CPU cycles, and RAM that would otherwise be wasted.

“We have customers with thousands of columns of data. Wide, flat, sparse,” said Noren.
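The benefit of passing over unreferenced columns can be illustrated with a toy column store. This is a minimal sketch, not Interana's C++ engine: the `ColumnStore` class, its method names, and the sample columns are hypothetical, and a real system would read columns from disk rather than hold them in Python arrays.

```python
from array import array


class ColumnStore:
    """Toy column-oriented table: each column lives in its own contiguous array."""

    def __init__(self):
        self.columns = {}
        self.columns_read = set()  # records which columns a query actually touched

    def load(self, name, typecode, values):
        self.columns[name] = array(typecode, values)

    def scan(self, name):
        # Only columns requested here are ever read; all others stay untouched.
        self.columns_read.add(name)
        return self.columns[name]

    def sum_where(self, value_col, filter_col, predicate):
        """Sum value_col over the rows whose filter_col value satisfies predicate."""
        values = self.scan(value_col)
        filters = self.scan(filter_col)
        return sum(v for v, f in zip(values, filters) if predicate(f))


store = ColumnStore()
store.load("ts", "q", [100, 200, 300])
store.load("duration_ms", "q", [5, 7, 11])
store.load("user_id", "q", [1, 2, 3])

total = store.sum_where("duration_ms", "ts", lambda ts: ts >= 200)
```

After the query, `store.columns_read` holds only `ts` and `duration_ms`; the `user_id` column was never scanned. On a table with thousands of sparse columns, a row-oriented layout would have to step over every field of every row to answer the same question.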

The system has no constraints, assuming you want to ask questions about actors and time. And not bothering with traditional business intelligence notions like dimensions and measures that drive most dashboards saves time as well.

But the company is not in the market for offering in-memory databases, as has been reported elsewhere, she clarified. Instead, the technology stores the data raw and optimizes for scan.

Interana works on data at rest, not live streams. Typically clients drop log events into S3 buckets. Interana monitors the buckets and pulls data in, making it available to query. Data is picked up as soon as a log is written, but the system is not real time.

“A minute is usually ‘real time enough’ for humans making decisions,” Johnson said.
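The drop-and-pick-up flow can be sketched as a poller that remembers which objects it has already ingested. The article does not say how Interana watches buckets, so polling is an assumption here, as are the `BucketPoller` class, the newline-delimited JSON log format, and the in-memory dictionary that stands in for an S3 bucket so the example runs without any cloud client.

```python
import json


class BucketPoller:
    """Ingest newly written log objects from a bucket-like store, once each."""

    def __init__(self, list_keys, read_object):
        # list_keys() returns the current object keys; read_object(key) returns text.
        # Both are injected so a real object-store client could be swapped in.
        self._list_keys = list_keys
        self._read_object = read_object
        self._seen = set()
        self.rows = []  # ingested events, available to query

    def poll_once(self):
        """Pull in any objects not seen before and return their keys."""
        new_keys = sorted(k for k in self._list_keys() if k not in self._seen)
        for key in new_keys:
            for line in self._read_object(key).splitlines():
                if line.strip():
                    self.rows.append(json.loads(line))
            self._seen.add(key)
        return new_keys


# Stand-in for a bucket: key -> newline-delimited JSON log file (hypothetical data).
bucket = {"logs/0001.json": '{"actor": "u1", "ts": 1}\n{"actor": "u2", "ts": 2}\n'}
poller = BucketPoller(lambda: list(bucket), bucket.__getitem__)

first = poller.poll_once()   # picks up logs/0001.json
bucket["logs/0002.json"] = '{"actor": "u1", "ts": 3}\n'
second = poller.poll_once()  # picks up only the newly written object
```

Running `poll_once` on a timer of a few seconds would give the roughly one-minute freshness Johnson describes, without the complexity of a streaming pipeline.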

The company is also very focused on the action of the interface and making that useful (but not necessarily beautiful). Dashboards are launching pads for asking other questions, Noren said. Every day new stuff happens, and every day new questions can be answered, all without waiting for developers to code a specific data retrieval.

Up Next: Community

Building community is one of the company’s top projects this year. It’s been on the table, but the growing customer base has kept the engineers busy. The company plans to have free versions available for developers and product managers to see how it works and play around with.

“But we’re victims of our own success,” Noren said. One of its first customers was the Bing search service, which runs billions of events an hour, and required a lot of attention to fine-tuning the product.

Although the company has fewer than fifty clients, each one is a data-intensive culture. The goal, Noren said, was to start with the top clients and work its way down to become a de facto industry standard.

So it’s been scrambling to serve these large customers that are making the internal culture change to view data as immediately accessible. The relatively smaller companies, like Reddit and Imgur, are amongst its most active users. Over 12 percent of Reddit employees log into Interana several times a day. For example, during Pride Week last month, the LGBT subreddits were oversubscribed, so Reddit’s ad sales people were running queries to find other subreddits that had an affinity with the LGBT community to suggest things advertisers wouldn’t necessarily expect, like restaurants or bars.

“This is about individuals,” she said, leaning in. Although their paychecks come from the companies, there are thousands of people who are devoting a portion of their workday to using Interana effectively for their jobs. That community is important for those users to network and look for other career opportunities.

“I feel a greater loyalty to those users than I do to the companies who write the checks,” she said.
