How Periscope Uses Kubernetes to Power Data Science Services

Jan 29th, 2019 3:00pm by

Tom O’Neill of Periscope Data on What Data Scientists Do and Why You Care

In this episode of The New Stack Makers podcast, we spoke with Tom O’Neill, co-founder and Chief Technology Officer of Periscope Data, who is responsible for overseeing the technology vision for the company. Periscope Data is a platform for modern data teams.

With the explosion of data, O’Neill is tasked with finding the fastest way to make terabytes of data useful to Periscope’s customers, many of whom are data scientists, data analysts and data engineers who use SQL, Python or R.

Data scientists are gaining in importance because of the explosion of available data. Airbnb is not a hospitality company. It’s a data company masquerading as a hospitality company. But the data is only as good as the company’s ability to mine it. It’s the data engineers that are making decisions in this environment.

It’s data scientists who are answering questions like “how many tokens to put in the treasure chest at the end of level 4 to optimize lifetime value for each of your demographics,” O’Neill explained. They are answering far more sophisticated questions today than was previously possible.

There are two kinds of data teams, he said.

One is the inward-facing data team, the business intelligence (BI) partners. These teams work to understand sales and marketing, determine which parts of the product are the most valuable, and determine what leads to churn or an upsell.

Then there are the outward-facing data teams, which are part of engineering teams. “These are the data scientists at Tinder who help optimize the swipe algorithm to determine ranking order,” he said. “At its core, Tinder is a ranking problem.”

The Tinder algorithm is trying to maximize connections. Data scientists look at what has historically worked and what has not, and from that determine precision and recall curves. They answer questions like “How many people should we show this user? What kind of people should we show this user? What will most likely lead to a successful outcome? Then adjust the algorithms to support that,” O’Neill said. “It could be based on machine learning, it could be based on rules, or on statistics, or on a combination of all three.”
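To make the idea of precision and recall curves concrete, here is a minimal sketch in Python. It is not Tinder's or Periscope's method; the function name, the scores and the labels are hypothetical. It assumes each historical recommendation has a model score and a known outcome (whether it led to a connection), and shows how precision and recall trade off as the score threshold moves.

```python
from typing import List, Sequence, Tuple


def precision_recall_curve(
    scores: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each threshold.

    scores: hypothetical model scores, one per historical recommendation.
    labels: True if that recommendation led to a successful outcome.
    """
    curve = []
    for t in thresholds:
        # A recommendation is "shown" when its score clears the threshold.
        shown = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(shown, labels) if p and y)
        fp = sum(1 for p, y in zip(shown, labels) if p and not y)
        fn = sum(1 for p, y in zip(shown, labels) if not p and y)
        # Convention: precision is 1.0 when nothing is shown at all.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, precision, recall))
    return curve


if __name__ == "__main__":
    # Made-up history: five past recommendations and their outcomes.
    scores = [0.9, 0.8, 0.6, 0.4, 0.2]
    labels = [True, True, False, True, False]
    for t, p, r in precision_recall_curve(scores, labels, [0.5, 0.85]):
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold shows fewer candidates and tends to raise precision while lowering recall, which is one way to frame the question of how many people to show a user.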

How Periscope Works

The average data analyst spends 20 hours a week on Periscope’s platform. The scale is impressive. Every 24 hours, analysts on the platform write 300,000 new lines of SQL, in addition to the millions of queries already on the platform that are processed. The company runs 30 TB of working memory (RAM) across its fleet, just for performing the analysis. It manages more than 20 million queries per day.

So how does Periscope manage this scale? First, said O’Neill, you must put all of the data in a single source of truth, then perform analysis on that. So the company uses ETL (extract, transform and load) tools to copy client data into the Periscope platform, where complicated connections can be made across different dimensions. It also brings data from third-party sources, including Marketo and Google Analytics, into Redshift and Snowflake.
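As a rough illustration of the extract, transform and load pattern described above, the following sketch copies rows from two stand-in sources into one database and then joins across them. Everything in it is an assumption for illustration: the sample rows, the table and column names, and the use of an in-memory SQLite database in place of a warehouse such as Redshift or Snowflake. It does not reflect Periscope's actual tooling.

```python
import sqlite3
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def extract() -> Tuple[List[Row], List[Row]]:
    """Stand-in for pulling exports from two third-party sources."""
    leads = [
        {"email": "A@Example.com ", "campaign": "spring"},
        {"email": "b@example.com", "campaign": "spring"},
        {"email": "c@example.com", "campaign": "summer"},
    ]
    visits = [
        {"email": "a@example.com", "visits": 12},
        {"email": "B@example.com", "visits": 3},
        {"email": "c@example.com", "visits": 5},
    ]
    return leads, visits


def transform(rows: List[Row]) -> List[Row]:
    """Normalize the join key so rows from different sources line up."""
    return [{**r, "email": str(r["email"]).strip().lower()} for r in rows]


def load(conn: sqlite3.Connection, table: str, rows: List[Row],
         columns: Sequence[str]) -> None:
    """Create the table if needed and insert the rows.

    Table and column names are trusted constants in this sketch,
    so building the SQL with an f-string is acceptable here.
    """
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.executemany(
        f"INSERT INTO {table} ({cols}) VALUES ({marks})",
        [tuple(r[c] for c in columns) for r in rows],
    )


def run_pipeline() -> List[Tuple[str, int]]:
    conn = sqlite3.connect(":memory:")
    leads, visits = extract()
    load(conn, "leads", transform(leads), ["email", "campaign"])
    load(conn, "web_visits", transform(visits), ["email", "visits"])
    # With both sources in one place, a single query can span them.
    return conn.execute(
        "SELECT l.campaign, SUM(v.visits) "
        "FROM leads l JOIN web_visits v ON l.email = v.email "
        "GROUP BY l.campaign ORDER BY l.campaign"
    ).fetchall()


if __name__ == "__main__":
    print(run_pipeline())  # [('spring', 15), ('summer', 5)]
```

The join only works because the transform step normalizes the email key first, which is the kind of cleanup that a single source of truth depends on.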

On top of that, the company built the Periscope Data platform, which helps data scientists model their data and perform the analysis using SQL, Python and R.

This is all orchestrated with Kubernetes.

O’Neill said the company uses Kubernetes in a number of ways. One is to manage servers that maintain a persistent network state. It also uses Kubernetes to shard many different deployments of the same service. The company’s BI product connects to 5,000 to 6,000 active databases at any given time, each with a dozen or more connections that need to always be up. “It was a fun challenge getting redundancy and scalability without duplication of those connection pools,” said O’Neill.
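One common way to avoid duplicating connection pools across shards is to give each database exactly one owning shard, chosen by a stable hash of its identifier. The sketch below illustrates that idea only. It is an assumption, not a description of how Periscope does it; the function names and database identifiers are hypothetical, and it does not cover redundancy or rebalancing when the shard count changes.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(database_id: str, shard_count: int) -> int:
    """Map a database to one shard using a stable hash.

    hashlib is used instead of the built-in hash() because the
    built-in is randomized per process, and every process must
    agree on which shard owns a given database.
    """
    digest = hashlib.sha256(database_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def assign_databases(database_ids: Iterable[str],
                     shard_count: int) -> Dict[int, List[str]]:
    """Group databases by owning shard; each appears exactly once."""
    assignment: Dict[int, List[str]] = defaultdict(list)
    for db in database_ids:
        assignment[shard_for(db, shard_count)].append(db)
    return dict(assignment)


if __name__ == "__main__":
    # Hypothetical identifiers standing in for customer databases.
    dbs = [f"customer-db-{i}" for i in range(12)]
    for shard, owned in sorted(assign_databases(dbs, 3).items()):
        print(shard, owned)
```

Each shard would then open pools only for the databases it owns, so no pool exists in two places. In a Kubernetes setting, each shard might correspond to one deployment of the same service, but that mapping is likewise an assumption here.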

In this Edition:

2:10: Tools.
5:00: The stack.
11:01: Using R and Python in Periscope Data and its impact on security.
16:24: Persistent in-memory network state.
17:15: What to think about and look for when creating a CI/CD pipeline.
18:55: What data teams do and what services they provide to people who aren’t engineers.

Raygun sponsored this podcast, which was produced independently by The New Stack.

Feature image via Pixabay.
