How Rubikloud Uses Spark to Bring Data-Driven Analysis to Retail

3 May 2017 3:00am, by

Retail organizations have made plenty of marketing advances, but their IT departments have not always kept pace with those innovations. With growing demand for better analysis of customer behavior, retailers are finding that traditional tools aren’t always up to the task.

Toronto-based Rubikloud has been looking at ways to improve data analysis, using a system based on Apache Spark. The company’s RubiOne hosted platform allows retailers to sidestep existing tools such as SQL and Excel. We spoke to Rubikloud’s vice president of technology, Adrian Petrescu, to find out what makes Rubikloud different.

What are the limitations in IT that have held back marketing innovations? Are these limitations due to the technology or the nature of the IT departments?

Retail organizations are worried about inventories. They’re running systems that aren’t able to handle the complexities if there’s a need to make any sort of change. We even see situations where a website is handling hundreds of dollars a minute, but the firm is unable to change a link color by itself.

What this has meant is that when business analysts try to run a model, they can’t do it easily. They’re trying to pull together data from complex systems where point-of-sale data is in one place and CRM data is in another, and they can’t get them working together.

A typical example is an analyst who grabs a data dump at the beginning of the year and finds they can’t run the transactions on a laptop, so they end up working with only one percent of the data.

Describe your platforms: How do they differ and how are they used?

The problem that all retailers are faced with is trying to forecast on data from multiple sources, not just from sales data but other personal activity such as loyalty cards. We look at the customer lifecycle. RubiCore is aimed at the developers while RubiOne is aimed at those guys who are running products, who are programming in R or Excel.

Why use Spark as the building block? What other tools did you look at?

We’ve done a lot of window shopping to decide what to use. Spark had the best compromise between power, flexibility and ease of use. Hadoop was a massive pain (I don’t know anyone who enjoys programming in that unless they work in Hive, and that’s a limited number of people). Apache Storm is another option, but that’s real time and no one’s acting in real time.

We’ve integrated Spark in a particular way, making it easier to get that data. We’ve created an ephemeral interactive environment. What this means is that there’s no need to access the data directly — which would be occupying computer memory. So you don’t start a Spark session from your notebook but call up an API.

You’ve mentioned the limitations of SQL and Excel, why are these tools inadequate?

SQL is less limiting than Excel. But the trouble with SQL is that you’re querying databases that, as previously mentioned, are not all in one place. You can’t join a table across three different places but you can join multiple tools in RubiOne.

One of the problems is the sheer size: three to five years of transactional data is a terabyte of data, and there’s not the disk space on a laptop for that. RubiOne runs in the cloud: Amazon Web Services or Azure by preference, but we can do private cloud if required.

Can you describe how RubiOne actually helps retailers to garner more accurate information: what exactly are the processes?

It’s important to note that we can’t just have a plug-in; our data architecture team has to talk to different stakeholders to understand our customer. It’s not uncommon for our data architects to know more about the data model than any individual working for a retailer: they’re all experts on their own silos but don’t see the bigger picture.

What’s made things easier is that there’s an acceptance of cloud. We’re in the period where cloud has stopped being a scary technology; all of their concerns have melted away and they realize that it’s safer being managed by companies who understand security.

Can you go into more detail about that API and how RubiOne has to be tailored for a vertical?

As I mentioned, one of our core theses as a company is that data science/machine learning products can only really exist if they’re vertically integrated. There are all sorts of ambitious “general-purpose machine-learning” projects out there (some from very powerful players, like Google), but those have proven to only be useful as developer tools, not as end-user products. The overhead of data extraction, feature engineering, etc., is simply too great for a general business user.

On the other end of that spectrum, you have finished tools that simply use machine learning “under the covers” to accomplish some other goal (and Rubikloud certainly does play in that space as well). But RubiOne is somewhere in the middle — it’s exposing machine learning libraries directly, but at a sufficiently high layer of abstraction that you can interact with things like “stores” and “customers” rather than just “feature vectors” and “data frames.”

You spoke about the three stages of RubiOne: Explore, Develop and Automate. Can you go into more detail?

There are three stages: Explore is the fun one, providing retailers with ad hoc exploratory analysis.

Develop is the set of APIs that lets you execute “production-ready” machine-learning models as part of a pipeline that includes all of the infrastructure we use to run our own models (things like feature extraction, data sanitation, the outputs of other models, etc.). By coding to that API, your model automatically gets linked into that batch job, has its inputs/outputs wired up for you, etc.
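As a rough illustration of what "coding to an API" so that inputs and outputs get wired up might look like, here is a minimal Python sketch. It is not RubiOne's actual interface: the `step` decorator, the step names and the threshold rule are all hypothetical, and the "model" is a trivial stand-in for real machine learning.

```python
from typing import Callable, Dict, List

# Hypothetical registry of pipeline steps; illustrative only, not RubiOne's API.
_STEPS: List[dict] = []


def step(inputs: List[str], output: str) -> Callable:
    """Register a function as a pipeline step with named inputs and one named output."""
    def decorator(func: Callable) -> Callable:
        _STEPS.append({"func": func, "inputs": inputs, "output": output})
        return func
    return decorator


@step(inputs=["raw_sales"], output="clean_sales")
def sanitize(raw_sales):
    # Data sanitation: drop records with a missing or negative amount.
    return [r for r in raw_sales if r.get("amount") is not None and r["amount"] >= 0]


@step(inputs=["clean_sales"], output="spend_per_customer")
def extract_features(clean_sales):
    # Feature extraction: total spend per customer.
    totals: Dict[str, float] = {}
    for r in clean_sales:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


@step(inputs=["spend_per_customer"], output="high_value_customers")
def my_model(spend_per_customer):
    # The user's "model": a threshold rule standing in for real machine learning.
    return sorted(c for c, total in spend_per_customer.items() if total >= 100)


def run_pipeline(initial: dict) -> dict:
    """Run steps in registration order, feeding each step from earlier outputs."""
    data = dict(initial)
    for s in _STEPS:
        args = [data[name] for name in s["inputs"]]
        data[s["output"]] = s["func"](*args)
    return data


results = run_pipeline({
    "raw_sales": [
        {"customer": "a", "amount": 80.0},
        {"customer": "a", "amount": 40.0},
        {"customer": "b", "amount": 30.0},
        {"customer": "c", "amount": None},
    ]
})
print(results["high_value_customers"])
```

The point of the design is that `my_model` never loads or cleans data itself; it only declares the named input it needs, and the runner supplies it.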

And finally, the automate piece has to do with executing the pipeline from develop in a production-grade deployment environment, which is really the “killer app” for a lot of these traditional retail businesses.

Automate lets them scale workloads that used to have to run on a one percent sample of the data (so that it would fit on the analyst’s laptop) into a Spark cluster connecting to a highly-replicated data warehouse, on a batch schedule.

These three pieces form a cycle: you explore to find a shortcoming in the existing model, you develop that model to plug into the current flow, you then automate that model to put it into production; then once the results of that deployment start rolling in, you go back to explore to measure its effect and find the next optimization.

You mentioned some neat pieces of engineering — can you go into a bit more detail about some of the developments?

There are too many to list exhaustively, but I’ll pick three examples that are kind of representative.

One of the key benefits of using RDM (which is our name for the standardized Retail Data Model, a database schema tailored for retail industry best-practices) is that we can understand the actual semantics of each client’s data automatically. This led to the development of a RubiOne ORM, which lets us wrap native Python objects around RDM tables. Then we can define a library of functions and visualizations around those objects, covering the majority of day-to-day use, without making the user think in terms of rows, columns, joins, etc. You can just write code of the form:

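    # Store, plot and promo_products are assumed to be provided by the RubiOne environment.
    flagship_store = Store('123')
    for customer in flagship_store.top_customers:
        # Plot only customers who bought at least one promoted product.
        if any(p in customer.transaction_history['product_id'] for p in promo_products):
            plot(customer.category_breakdown)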

In native SQL, or even a tool like Tableau, this would be a massive query across multiple tables. In RubiOne, it of course still is, under the covers, but the user was able to express that complexity with familiar programmatic constructs.
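To make the idea of wrapping Python objects around tables concrete, here is a small self-contained sketch using an in-memory SQLite database. The single `transactions` table, its columns and the two classes are assumptions made for illustration; they are not the real RDM schema or the RubiOne ORM.

```python
import sqlite3

# Hypothetical one-table schema standing in for a retail data model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (store_id TEXT, customer_id TEXT, product_id TEXT, amount REAL);
    INSERT INTO transactions VALUES
        ('123', 'alice', 'p1', 50.0),
        ('123', 'alice', 'p2', 25.0),
        ('123', 'bob',   'p3', 10.0),
        ('456', 'carol', 'p1', 99.0);
""")


class Customer:
    def __init__(self, customer_id):
        self.customer_id = customer_id

    @property
    def product_ids(self):
        # Hides the SELECT behind an ordinary attribute.
        rows = conn.execute(
            "SELECT DISTINCT product_id FROM transactions WHERE customer_id = ?",
            (self.customer_id,),
        )
        return {row[0] for row in rows}


class Store:
    def __init__(self, store_id):
        self.store_id = store_id

    @property
    def top_customers(self):
        # Hides the GROUP BY / ORDER BY behind an ordinary attribute.
        rows = conn.execute(
            "SELECT customer_id FROM transactions WHERE store_id = ? "
            "GROUP BY customer_id ORDER BY SUM(amount) DESC",
            (self.store_id,),
        )
        return [Customer(row[0]) for row in rows]


promo_products = {"p2"}
flagship_store = Store("123")
matches = [c.customer_id for c in flagship_store.top_customers
           if c.product_ids & promo_products]
print(matches)
```

The queries still run underneath, but the calling code only touches attributes such as `top_customers` and `product_ids`.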

We decided very early on to use Spark for the production batch jobs we’d be running for ourselves, but exposing Spark in the same way for the explore piece, on infrastructure that is potentially outside of our direct control in the case of on-premise deployments, was an engineering challenge in itself.

The first step was getting a battle-hardened Spark installation running inside a Docker Swarm, in such a way that we could dynamically resize the YARN cluster and have service discovery etc. happen automatically without needing us or our client to intervene. But even once we had that, it turned out that marrying the interactivity of explore with the batch-job-centricity of Spark was a challenge in itself.

Spark wasn’t really designed for a multi-user environment with connections coming and going over time and expecting persistent access to old RDDs and DataFrames. We had to implement a souped-up version of the Livy API that can hold a persistent client connection and federate access to it from multiple notebooks owned by multiple users. It was a lot of work, but it enabled a whole new interaction model that was previously unattainable for most business analysts.
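The interaction model described here, where notebooks come and go while earlier results stay reachable, can be sketched without Spark at all. The class below is a hypothetical stand-in, not Rubikloud's Livy extension: it keeps plain Python objects where a real deployment would keep handles to RDDs and DataFrames, and it assumes one namespace per user.

```python
import threading


class SharedSessionBroker:
    """Keeps one long-lived namespace per user and lets many notebooks attach to it.

    Hypothetical sketch: a real system would hold a Spark context behind this;
    plain Python objects are stored here so the example runs anywhere.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}   # user -> {name: cached object}
        self._attached = {}   # user -> set of notebook ids

    def attach(self, user, notebook_id):
        with self._lock:
            self._sessions.setdefault(user, {})
            self._attached.setdefault(user, set()).add(notebook_id)

    def detach(self, user, notebook_id):
        # Detaching a notebook does not discard the user's cached results.
        with self._lock:
            self._attached.get(user, set()).discard(notebook_id)

    def put(self, user, name, value):
        with self._lock:
            self._sessions[user][name] = value

    def get(self, user, name):
        with self._lock:
            return self._sessions[user][name]


broker = SharedSessionBroker()
broker.attach("analyst", "notebook-1")
broker.put("analyst", "sales_2016", [10, 20, 30])
broker.detach("analyst", "notebook-1")

# A second notebook owned by the same user picks up the earlier result.
broker.attach("analyst", "notebook-2")
restored = broker.get("analyst", "sales_2016")
print(restored)
```

Because state lives in the broker rather than in any one notebook, closing a notebook costs nothing, and a different user asking for the same name gets a `KeyError` instead of someone else's data.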

Our machine learning models can get really big. After sharding, a single batch job could easily equate to hundreds of individual tasks, in a complex web of dependencies. It looks something like this:

To orchestrate this massive workflow, we adapted a little-known open-source tool put out by Spotify called Luigi. Luigi is a vital part of what makes automate tick; it lets us (and by extension, RubiOne users) express the dependency graph declaratively rather than imperatively, which drastically reduces the complexity of writing any individual task. RubiOne users can just plug their model into one layer of this graph and let the rest of the behemoth be scheduled automatically.
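The difference between declaring a dependency graph and scripting an execution order can be shown with Python's standard library alone. This sketch does not use Luigi (where tasks declare upstream work through a `requires()` method); it uses `graphlib.TopologicalSorter` as a stand-in, and every task name in it is invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task names only the tasks it depends on; nothing here says when to run.
dependencies = {
    "load_sales": set(),
    "load_crm": set(),
    "sanitize_sales": {"load_sales"},
    "sanitize_crm": {"load_crm"},
    "extract_features": {"sanitize_sales", "sanitize_crm"},
    "user_model": {"extract_features"},   # the layer a user would plug into
    "publish_scores": {"user_model"},
}

# The scheduler derives a valid execution order from the declarations.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Adding a new model means adding one entry that names its upstream tasks; the ordering of everything else is recomputed rather than rewritten by hand.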

You’re a big user of Jupyter notebooks: Is it fundamental to your products or was there another option?

Jupyter is absolutely central to explore. There were other options we investigated, most notably Apache Zeppelin, but in the end, Jupyter was a very clear winner for our requirements.

The whole Jupyter ecosystem has actually been incredibly useful; in particular, JupyterHub allowing us to create multitenant on-demand servers in a Docker Swarm is a killer feature that none of the competitors had. The integration with multiple kernels (in particular, R, which is a very common request in the business analysis community) was also more mature than others’. It was also the easiest to extend with our own plugins, branding, and authenticators.

One of your key differentiators is in credential management. Can you go into more detail as to how that’s handled within RubiOne?

Credential management happens at the JupyterHub layer; when a user logs into the Explore dashboard, we verify them against an LDAP directory which assigns them roles and privileges. Each resource in RubiOne (whether it be hardware resources like Spark clusters or data resources like tables and rows in RDM) is then scoped in that LDAP directory, and the JupyterHub authenticator dynamically generates low-privilege time-limited authentication tokens on the fly.

The end result is that the user doesn’t have to worry about managing their own credentials and secrets (something that people are notoriously bad at), and the site administrator can still do granular grants and revocations of access to resources instantaneously. This is especially useful when RubiOne is being used at multiple levels of the organization — a category manager might only see the rows of the transaction table that pertain to her market segment in her country, while her CMO would see the entire country, all from the same notebook.
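A low-privilege, time-limited token of the kind described can be sketched with the standard library. This is an assumption-laden illustration, not RubiOne's authenticator: the signing key, the scope strings and both function names are hypothetical, and the LDAP lookup that would decide a user's scopes is left out.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-signing-key"  # hypothetical; a real key would come from a secret store


def issue_token(user, scopes, ttl_seconds, now=None):
    """Return a signed token carrying the user's scopes and an expiry time."""
    now = time.time() if now is None else now
    payload = json.dumps({"user": user, "scopes": sorted(scopes), "exp": now + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig


def check_token(token, required_scope, now=None):
    """True only if the signature is valid, the token is unexpired, and the scope was granted."""
    now = time.time() if now is None else now
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    return now < claims["exp"] and required_scope in claims["scopes"]


# A category manager gets a five-minute token for one market segment only.
token = issue_token("category_manager", {"transactions:ca:snacks"}, ttl_seconds=300, now=1000)
in_scope = check_token(token, "transactions:ca:snacks", now=1100)
out_of_scope = check_token(token, "transactions:ca", now=1100)
expired = check_token(token, "transactions:ca:snacks", now=1400)
print(in_scope, out_of_scope, expired)
```

Short lifetimes are what make revocation feel instantaneous in a scheme like this: once the administrator changes a grant, the next token issued simply omits the scope.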

And what of the future: what plans do you have to enhance RubiOne?

It opens up whole new areas for retailers. Right now, its use is confined to the online world. You can track the purchase as long as it’s through the website. But we’re looking at joining the information up so you can tag any browsing offline to an online purchase. Offline buying is not something we should be ignoring; it’s still 85 percent of total retail revenue. But most retailers aren’t doing enough to join offline and online. In the future, that’s going to get a lot deeper.

Feature image: Rubikloud’s vice president of technology, Adrian Petrescu (Rubikloud.)

