
SimilarWeb’s Cloud Infrastructure for Large-Scale Data Analysis

11 Nov 2016 8:57am, by

We often hear it said that every company is becoming a software company. That every company is now a tech company. But with microservices and API-based architectures exposing data assets, and with the increasing use of machine learning across all industry sectors, it is just as true to say that every company is now a data company.

Around the globe, businesses are identifying unique data sets that they have been collecting and building and are now opening up those data sets as new product lines.

In California, data management company VIMOC Technologies is working with parking garage companies that want to make real-time parking space data available as a data set for inputting into GPS systems. While this can help optimize parking availability assets and increase revenue, the parking garages themselves are seeing opportunities to make that data available for other uses: to better understand traffic flow in a city and business center, for example.

In the U.K., a media company offers a data-mining license so that partners can integrate news and financial content as data sources into their products and algorithms. Some even posited that a popular news report on French President Hollande’s stance on Brexit had been ingested by a trading algorithm, which then caused traders to rapidly sell pounds sterling, causing Great Britain’s currency to drop. While the timing doesn’t quite add up in this case, it does point to how media content is already being used as raw data for machine learning algorithms.

In Spain, the bank BBVA has anonymized and aggregated monthly transaction data on its customers’ credit card purchasing behavior. This data can now be packaged up as a commercial product and is already being used by market feasibility and urban developers to identify the demographics of consumers in specific neighborhoods so they can assess the likelihood of new business success in that zone.

Companies are leveraging their composable, microservices architecture to operate like data companies: enabling data assets to be collected, stored, normalized, and packaged up in such a way as to be made available to both external customers (with adequate security controls), and fed back into internal machine learning workflows. And as they are building those data sets in cloud environments, real decisions need to be made about which vendor can provide not just data analysis capabilities, but all of the distributed application requirements like authentication and authorization and data recovery management services at the same time.

So what can any business learn from companies whose core business focus is data?

Ingesting Data, Applying Algorithms that Learn, and Data as a Product

SimilarWeb is a data company that has seen massive growth in its global footprint over the past few years. Initially a startup with ambitions to become a competitor to Quantcast and Compete, SimilarWeb now sees much greater web traffic than either of those providers and offers a range of products aimed at helping businesses analyze and act on digital insights gleaned from website visits and engagement, digital market share, mobile app rankings, advertising spends, and more.

To provide its services, SimilarWeb collects data from three main sources: from applications it gives away for free to hundreds of millions of users; from its own data collection, in which hundreds of thousands of businesses share their internal analytics with SimilarWeb; and from its relationships with third-party providers (such as ISPs) to ingest data from them.

“Data arrives in real time: we receive 60,000 requests per second, and we then use that as the basis of our learning set and analysis,” said Einat Or, executive vice president of research and development at SimilarWeb. Its main production pipeline is batch-based, as Or explained, “for the use cases we see, this data is for analysis, it is actionable from a strategic position. Our data helps companies understand how to invest, rather than informing real-time decisions on what to do with the next dollar.”

During its startup days, SimilarWeb used private hosting for its architecture. Recently, however, the data company reached a tipping point where it would need to start working with two hosting facilities, at which point moving to the cloud became a cost-effective option.

Like companies across a range of industry sectors, SimilarWeb is now in the process of moving to a cloud-based infrastructure. To do that, the company had to decide which vendor could support its level of data ingestion, secure access for customers and partners within permission policy boundaries, link to its machine learning processes, and push analyzed data out to customers.

The Google vs. AWS Bakeoff

When working with private hosting, SimilarWeb’s data pipeline directed all data from its three source channels into the company’s own gateway application, called TAPI. From there, data went into Kafka, was parsed, and was written to HDFS, with a second copy sent to Amazon S3 for backup storage.
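
The flow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `Pipeline` class, its method names, and the sample events are hypothetical, and plain in-memory structures stand in for TAPI, Kafka, HDFS, and S3. It is not SimilarWeb’s implementation.

```python
import json
from collections import deque


class Pipeline:
    """Toy stand-in for the gateway -> queue -> parse -> store flow."""

    def __init__(self):
        self.queue = deque()  # stands in for Kafka
        self.primary = []     # stands in for HDFS
        self.backup = []      # stands in for the S3 backup copy

    def ingest(self, raw_event: str) -> None:
        """Gateway step: accept a raw event from any source channel."""
        self.queue.append(raw_event)

    def drain(self) -> int:
        """Parse every queued event and write it to both stores."""
        written = 0
        while self.queue:
            record = json.loads(self.queue.popleft())
            self.primary.append(record)
            self.backup.append(dict(record))  # independent second copy
            written += 1
        return written


pipeline = Pipeline()
# Hypothetical labels for the three source channels named in the article.
for source in ("free_apps", "shared_analytics", "third_party"):
    pipeline.ingest(json.dumps({"source": source, "visits": 1}))
drained = pipeline.drain()
```

The point of the sketch is the shape of the flow: every parsed record lands in the primary store used for analysis and in a separate backup copy.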

While Or credits Kafka as an excellent technology, the search for a cloud hosting provider meant more options were on the table. “Once it is in HDFS, then we run hundreds of thousands of jobs on the data using Hadoop or Spark depending on the type of algorithm we are running. This is our core strength: we take all that data we collect, then we use a lot of learning techniques so that we can deduce an accurate model of the digital world.”

Moving to a cloud hosting provider meant SimilarWeb could assess the service management components as well as review the tools available from each vendor for data analysis, and support for their machine learning data science techniques. “We did a bakeoff between Google and AWS. On the data analysis side, Google is superior, but the entire maturity of their cloud offering is actually very low,” said Or.

Or says that overall, if a company is looking at operating from the cloud, it is difficult to separate requirements for web-based application architecture from things like data analytics tooling. “We already have a pretty complex architecture and we wanted to make sure data recovery is provided in each region. Using another provider for some of the technologies would have been a much higher risk for migration. Maybe later on, if we see a reason we can use another provider for a particular aspect of technology. But cloud providers put very high prices on egress, so if we need to pipeline all our data to a different vendor and back, then it is costly. They make it not worthy from a cost perspective, especially if you are highly data-based operation like ours.”
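
A rough sketch of the arithmetic behind that egress concern follows. The function name, the data volume, and the per-GB price are all hypothetical placeholders chosen for illustration; they are not figures from SimilarWeb or from any vendor’s price list.

```python
def round_trip_egress_cost(data_tb: float, price_per_gb: float) -> float:
    """Estimate the cost of piping data to another vendor and back.

    Assumes each vendor charges the same per-GB egress rate on the way
    out, that the same volume returns, and that 1 TB = 1,000 GB.
    """
    gigabytes = data_tb * 1000
    return 2 * gigabytes * price_per_gb


# Placeholder numbers, not published rates or real volumes.
HYPOTHETICAL_PRICE_PER_GB = 0.10
HYPOTHETICAL_VOLUME_TB = 50

cost = round_trip_egress_cost(HYPOTHETICAL_VOLUME_TB, HYPOTHETICAL_PRICE_PER_GB)
print(f"Round trip for {HYPOTHETICAL_VOLUME_TB} TB: ${cost:,.2f}")
```

Because the cost scales linearly with volume, a data-heavy operation pays this charge again every time the pipeline runs across vendors.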

Kinesis vs Pub/Sub

Or points to one example of the many technologies within each vendor that her team evaluated. AWS Kinesis is a data streaming platform within Amazon’s cloud that lets users ingest and analyze large streams of data. Google Cloud Platform’s Pub/Sub is an asynchronous messaging service for ingesting data (while analyzing data is done with the separate Google Cloud Dataflow).

“Pub/Sub is not as mature as Kinesis,” said Or. “Actually, it has real implementation problems. The challenges become very immediate.” Or says this isn’t always the case, though. SimilarWeb uses HBase, which can be difficult to work with, and her team is impressed with Google’s alternative, Dataproc. She also singled out the capabilities of Google’s Bigtable as “very strong”.

Or concurred, “AWS does have a gap to fill.” But for a production-level data service such as theirs that already has a user base that relies on their product, there were too many issues with managing their business in Google Cloud Platform. “Knowing you can manage it properly, and that users can only access the data that they should, all of those requirements are weak in Google. Even with all of the advantages of Google, it is still a risk. If I was part of a startup with 10 people, I might do that, but with thousands of customers and millions of free users it is too much of a risk.”

Other Cloud Vendor Considerations

Another big deciding factor for Or was the level of service and transparency that AWS offered over Google. Like Bustle, which had chosen to work with AWS on early iterations of both Lambda and Kinesis because it was invited to provide feedback on product usage and see features and bugs resolved often overnight, Or was impressed with the AWS approach.

“The attention to service with AWS was much higher. When you look at the overall company, then AWS was better than that. AWS are running a very good developer program, they are very generous about the way they do their work, the entire customer service is superior and you have the feeling you are in good hands. They had a very clear view of what is coming up and they gave us access to some new tools, so we are part of their beta. This is their strength, I think,” said Or.

Managing Quantum Leaps in Accuracy for a Data Company

SimilarWeb’s use of machine learning algorithms, alongside a learning set that analyzes and improves those algorithms on a regular basis, means that at some stage a statistically significant improvement in the model will occur. In SimilarWeb’s short history, that has already happened. In March this year, the data provider started distinguishing between website estimates from the previous algorithm and those calculated by its more accurate modeling.

“We constantly improve. But our customers are relying on the consistency of our data because they are looking at trends. So as long as the improvements are under a certain level of error, we implement the improvements and you get slightly improved data. When there is a quantum leap, there is the question: if customers rely on data in the past, how do we improve accuracy and provide consistent data?”

Or said her team spoke with product managers and interviewed customers. They found that customers wanted all the historical data adjusted to match the new accuracy model. It took several months to recalculate all of the data, mostly to test the consistency of the new accuracy models in batches. The historical data was created when the company was much smaller and less sophisticated, so each batch analysis turned up multiple improvements to be made. Or said another key benefit of moving to cloud infrastructure is that it would dramatically reduce the time needed to change such large historical data sets. After all, in a cloud environment, you can throw more hardware at the problem rather than needing more time.
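
The policy Or describes, shipping small improvements quietly and treating a large jump as a trigger for reworking history, can be sketched as a simple threshold check. The function name, the return labels, and the 5 percent cutoff are hypothetical; the article does not say what error level SimilarWeb actually uses.

```python
def rollout_decision(old_estimate: float, new_estimate: float,
                     max_silent_shift: float = 0.05) -> str:
    """Decide how to ship an improved estimate.

    Small relative shifts are applied in place so trend lines stay
    consistent. Larger shifts are flagged so that historical data can be
    recalculated under the new model. The 0.05 default is an assumed
    placeholder, not a value from the article.
    """
    if old_estimate == 0:
        # No baseline to compare against: any non-zero value is a big shift.
        return "apply_in_place" if new_estimate == 0 else "recalculate_history"
    shift = abs(new_estimate - old_estimate) / abs(old_estimate)
    if shift <= max_silent_shift:
        return "apply_in_place"
    return "recalculate_history"


small_change = rollout_decision(1000.0, 1020.0)  # 2 percent shift
large_change = rollout_decision(1000.0, 1400.0)  # 40 percent shift
```

In this sketch the threshold is a single number, but the same check could be applied per site or per metric before deciding whether a historical recalculation is warranted.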

Decisions All (Data) Companies are Making

SimilarWeb’s current migration from hosting provider to cloud infrastructure demonstrates an increasing reliance on providers that can incorporate user authentication and authorization, large-scale data ingestion, algorithms that can constantly learn, compute power for analytics, storage, and data recovery. These are the deciding factors facing any business that is beginning to assess how to open up, expose, and capitalize on its data assets as products in new partnerships, using new business models.

Feature image: By Gabriel Santiago. Licensed under CC0 1.0
