As stores shut down during the global pandemic, many retailers experienced a surge in traffic to their e-commerce sites. On the particularly panic-filled days of March 12 and 13, online sales at full-assortment grocery stores increased by a whopping 325%. Whenever this level of unexpected traffic hits retailers’ websites, it is imperative that they can maintain application uptime and keep customers satisfied. If not, customers will quickly pivot to competitors’ sites.
It can be difficult, if not impossible, to predict events like the current pandemic. However, by training your infrastructure to recognize anomalous activity and making the appropriate adjustments, you can keep your customers satisfied. In our experience, a particularly effective way to maximize online sales and ensure customer satisfaction is to use an ML-powered predictive analytics tool. With such a tool, you can identify issues, lower your mean time to repair (MTTR), and ensure your system stays up and running.
Using Machine Learning to Shorten MTTR
As a quick caveat, your ML tool will only be as good as the data that you feed it; generally speaking, the more data you provide your tool, the better it will perform. It is important that your data is annotated to show the target or the question you are looking to answer. Additionally, you want to periodically check the results of your ML algorithm to ensure accuracy. Maintaining high-quality data is essential when it comes to algorithmic decision-making. If your data is not properly labeled and vetted before your model is trained, you will run into problems down the line.
An effective ML-powered tool will learn what your baseline activity looks like, and then it will use this baseline to flag any anomalous activity. In the case of the aforementioned overworked websites, the machine learning tool would have sent the system administrators alerts on March 12; that way, even if the retailers’ site did crash, the MTTR would be shorter.
Anomaly detection is obviously not limited to calamitous events like global pandemics. Machine learning-enabled predictive analytics tools should be used every day of the year; such tools constantly monitor the number of people who access your websites, while accounting for where these customers are located, and what time of day they access the website. As an example, if North American customers usually access a given website from 9 a.m.-5 p.m., and there is a surge in activity at 6 a.m. one day, this behavior would be flagged. Likewise, if website traffic usually takes place on weekdays, a surge in traffic on a Saturday would be flagged.
While training your ML algorithm, it is important to create a benchmark that accounts for seasonality. For example, you will want to add every Monday morning’s data to the training data set so that it can be compared against the Monday morning values from the last four weeks; then, you will want to smooth out the data by removing the spikes, generally the top 5% of the values. After you compare the hourly 95th percentile values against your training data, you can generate events and see which ones have the highest chance of being anomalous.
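To make the seasonality benchmark concrete, here is a minimal Python sketch of one way it could work. It is an illustration under stated assumptions, not the method of any particular product: the function names, the nearest-rank percentile, and the sample response times are all hypothetical. Taking the 95th percentile of each hour is what discards the top 5% of values as spikes.

```python
from statistics import mean


def percentile_95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def seasonal_baseline(previous_weeks):
    """Average the 95th percentiles of the same hour in prior weeks."""
    return mean(percentile_95(week) for week in previous_weeks)


def deviation_ratio(current_samples, previous_weeks):
    """Ratio of the current hour's 95th percentile to its seasonal baseline."""
    return percentile_95(current_samples) / seasonal_baseline(previous_weeks)


# Hypothetical response times (ms) for one Monday morning hour,
# 20 samples per week. Each week contains a single spike, which the
# 95th percentile ignores.
last_four_mondays = [[100] * 19 + [1000] for _ in range(4)]
this_monday = [170] * 19 + [5000]

ratio = deviation_ratio(this_monday, last_four_mondays)
print(f"Current hour is {ratio:.2f}x the seasonal baseline")
```

A ratio well above 1.0 marks the hour as a candidate event; how large the ratio must be before an event is raised is a tuning decision that this sketch leaves open.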
Putting Your Algorithms to Work
After pulling in data from a variety of sources, an AI-powered anomaly detection engine equipped with single-variate and multivariate analysis algorithms will extract aberrations, offer explanations, and score anomalies based on their perceived severity.
Depending on how much an event has deviated from the previous weeks’ activity, the event can be classified as “a confirmed anomaly,” “a likely anomaly,” or “information,” which is essentially a notification that you may need to keep an eye on something in case there is an issue down the road. Anomalies can be scored on a scale of 0-10.
If an event is seen across a given monitored group, the anomaly score will be even higher. For example, if your infrastructure is using five different servers, you will want to group them all together, and if there is a problem on more than one server in this group, then your anomaly score will increase. Of course, the IT team responds to alerts according to the severity of the corresponding anomaly.
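The scoring and classification described above could be sketched as follows. This is a hypothetical illustration only: the scaling factor, the group boost, and the cut-off scores of 5 and 8 are assumptions chosen for the example, and the input is assumed to be the ratio of an observed metric to its baseline from previous weeks.

```python
def anomaly_score(deviation_ratio, affected_members=1, group_size=1):
    """Map a deviation ratio to a 0-10 score, boosted for group-wide events."""
    # Assumed scaling: 1.0x baseline scores 0, 2.0x baseline or more scores 10.
    score = max(0.0, (deviation_ratio - 1.0) * 10.0)
    # Assumed boost: applies only when more than one group member is affected.
    if affected_members > 1:
        score *= 1.0 + affected_members / group_size
    return min(10.0, score)


def classify(score):
    """Translate a 0-10 score into one of the three event classes."""
    if score >= 8.0:  # assumed cut-off
        return "confirmed anomaly"
    if score >= 5.0:  # assumed cut-off
        return "likely anomaly"
    return "information"


# One server at 1.7x its baseline, then the same deviation seen on
# three servers of a five-server monitored group.
single = anomaly_score(1.7)
grouped = anomaly_score(1.7, affected_members=3, group_size=5)
print(classify(single), "/", classify(grouped))
```

Under these assumed numbers, the same deviation is escalated once it shows up on several servers in the group, which is the behavior the grouping is meant to produce.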
Anomaly Detection without Thresholds
With an ML-based predictive tool, perhaps one that uses both single-variate and multivariate algorithms, you can detect anomalies without thresholds. There is no need to establish static thresholds across the board, such as an alert every time CPU usage rises above 75%. In fact, you will want to remove all static thresholds so that the tool can account for seasonality and more nuanced activity.
As an overly simple example, your tool will ensure that there are sufficient servers available during Black Friday sales. During these periods of increased volume, the backend infrastructure will be auto-scaled, and the ML-powered analytics tool will auto-align the thresholds, helping to avoid unnecessary false alerts. Importantly, this will cut down on unnecessary tickets that would otherwise take up technicians’ valuable time.
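The difference between a static threshold and a self-adjusting one can be shown with a short Python sketch. It is a simplified stand-in for what an ML-powered tool does, not a description of any real product: the band of three standard deviations, the minimum spread of one percentage point, and the CPU readings are all assumptions made for illustration.

```python
from statistics import mean, pstdev

STATIC_CPU_THRESHOLD = 75.0  # the fixed rule mentioned in the text


def static_alert(cpu_percent):
    """Fire whenever CPU usage exceeds the fixed threshold."""
    return cpu_percent > STATIC_CPU_THRESHOLD


def adaptive_alert(cpu_percent, same_hour_history, k=3.0):
    """Fire only when usage leaves the band learned for this hour."""
    center = mean(same_hour_history)
    # Assumed floor of 1.0 keeps the band from collapsing on flat history.
    spread = max(pstdev(same_hour_history), 1.0)
    return abs(cpu_percent - center) > k * spread


# Hypothetical history for a peak sales hour, when high CPU is normal.
peak_history = [78, 82, 80, 79, 81]
# Hypothetical history for a quiet overnight hour.
quiet_history = [20, 22, 21, 19, 18]

print("Peak hour, 83% CPU:", static_alert(83), adaptive_alert(83, peak_history))
print("Quiet hour, 60% CPU:", static_alert(60), adaptive_alert(60, quiet_history))
```

In this example the static rule raises a false alert during the busy hour and stays silent during the quiet hour, where a jump to 60% is the more suspicious reading; the learned band behaves the other way around.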
Facilitating Event Correlation
Given the intense competition that all e-commerce companies face, it is inevitable that DevOps teams will eventually update their websites to account for features that their competitors have put in place. Sometimes, DevOps personnel will employ containers and microservices in an effort to speed up their deployments.
As it powers your predictive analytics tool, the machine learning component constantly studies the infrastructure; it perpetually assesses which server connects to which microservice, and which container belongs to which application. In complex environments, it can be difficult to ascertain where an issue is occurring. Which application, microservice, container, or server is causing the site to run slowly? Any given application may be composed of several different microservices, and an outage in one of them could cause a cascading effect.
With continuous, automated monitoring, it’s easier to discover the root cause of a problem, which shortens your MTTR and makes your applications more responsive. An effective tool should constantly learn the infrastructure, including its applications and dependencies, and keep up with the changes introduced through continuous integration and continuous deployment (CI/CD). Such a tool helps IT personnel quickly diagnose and trace the root cause of an issue.
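One simple way to reason about cascading failures is to walk a dependency map. The Python sketch below assumes the map has already been discovered and that unhealthy components have already been flagged; the component names and the `root_causes` function are hypothetical and exist only for this example.

```python
def root_causes(depends_on, unhealthy):
    """Return unhealthy components whose own dependencies are all healthy.

    Anything unhealthy that sits on top of another unhealthy component is
    treated as a symptom of the cascade rather than its origin.
    """
    return sorted(
        node
        for node in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(node, ()))
    )


# Hypothetical dependency map: application -> microservice -> container -> server.
depends_on = {
    "storefront": ["cart-service", "search-service"],
    "cart-service": ["cart-container"],
    "cart-container": ["server-2"],
    "search-service": ["search-container"],
    "search-container": ["server-1"],
}

# Components currently flagged as anomalous.
unhealthy = {"storefront", "cart-service", "cart-container", "server-2"}

print(root_causes(depends_on, unhealthy))
```

Here four components look unhealthy, but only one of them has no unhealthy dependency beneath it, so that is where a technician would start. A real tool would also weigh timing and metric correlation, which this sketch ignores.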
A Recent Use Case: A Florida-Based Grocery Chain’s E-Commerce Website
Before the coronavirus pandemic took the world by storm, online grocery stores’ sales were already increasing. In 2019, U.S. online grocery sales were up 22% from the prior year. Now, in the aftermath of COVID, U.S. online grocery store sales are expected to increase by another 40%. Many grocers did not anticipate such a drastic boom in e-commerce traffic, and many websites were unprepared. A Tampa, Florida-based grocery chain was one such grocer.
This grocery chain was able to find the source of its problems through an AI-powered engine.
Their business has a data center with hundreds of servers. Years ago, alerts would be sent whenever there was an issue; however, the IT personnel were inundated with alerts, which created alert fatigue and caused important alerts to go ignored and unresolved.
After installing an ML-based tool, the grocer was able to filter and consolidate alerts. They have reduced alert noise, so IT personnel are only alerted when there is actually an issue at hand.
With an anomaly detection engine collecting metrics every 15 minutes from data collection agents, the tool captures overall trends, accounts for seasonality, and is immune to insignificant spikes and aberrations. As an example, in October of 2019, the grocery chain received a series of alerts, one of which stated that the response time of their URL monitor had increased; this was flagged as a “likely anomaly,” as the monitor’s response time had surged by 1.7 times the baseline average.
In this particular case, there were two issues occurring. Firstly, there was a regional response time degradation issue, which turned out to be an issue with the ISP rather than a DoS attack. Secondly, there were memory spikes, which were due to an increase of traffic to the server.
By distinguishing between abnormal and normal trends, IT personnel were able to trace the anomalies back to the dependent resources that caused them.
The algorithm also provided the grocer with forecasting; in fact, the business knew immediately that the influx in traffic was not likely to subside soon.
Disk usage on the company’s servers normally increased by 1 GB/day during periods of heavy traffic; now, however, it was increasing by 5 GB/day. Disk usage shot up to 79%, and it was predicted to reach almost 87% after seven days. Anomaly detection was triggered, an alert was sent, and the IT admin was able to plan accordingly.
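The arithmetic behind such a forecast can be approximated with a straight-line projection. The sketch below is an assumption-laden illustration, not the grocer's actual algorithm: the article does not state the disk capacity, so the 450 GB figure is a hypothetical value picked because it is consistent with the percentages quoted, and the function name is invented for the example.

```python
def forecast_disk_percent(current_percent, growth_gb_per_day, capacity_gb, days):
    """Project disk usage (as a percentage) assuming constant daily growth."""
    added_percent = 100.0 * growth_gb_per_day * days / capacity_gb
    return current_percent + added_percent


CAPACITY_GB = 450.0  # hypothetical capacity, not given in the article

usual = forecast_disk_percent(79.0, 1.0, CAPACITY_GB, days=7)
surge = forecast_disk_percent(79.0, 5.0, CAPACITY_GB, days=7)

print(f"At 1 GB/day: {usual:.1f}% after seven days")
print(f"At 5 GB/day: {surge:.1f}% after seven days")
```

With the assumed capacity, growth of 5 GB/day takes usage from 79% to just under 87% in a week, whereas the usual 1 GB/day would add less than two percentage points. A production tool would fit the trend from observed samples rather than take the growth rate as given.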
Despite the Florida grocer’s complex infrastructure, their algorithms provided alerts, helped with event correlation, and ultimately reduced the IT team’s MTTR.
More recently, during COVID, the team at the grocery chain could rest assured knowing that their ML-powered predictive analytics solution would quickly identify whether there had been a payment failure or a website crash. During such stressful times, with bottlenecked supply chains and panicked customers, their machine learning tool not only helped keep their website up and lower MTTR, but it also helped to maintain a positive customer experience.
It is vital to maintain good customer experiences on your e-commerce sites. Although conventional threshold configuration techniques can warn you about performance hiccups, an ML-powered tool that enables event correlation and issue diagnosis will significantly reduce troubleshooting time. Machine learning in a comprehensive monitoring solution should not only facilitate automated application discovery and dependency mapping, but it should also suggest possible remedies after proactively diagnosing and prioritizing anomalies.
Put simply, in a complex business ecosystem, it can be time-consuming to identify the root cause of an issue. However, with an AI-powered predictive analytics tool, you can identify and fix issues without compromising on IT response time, which is crucial to maintaining a positive customer experience.