
COVID-19 and Digital Services: An Action Plan for the Unexpected

22 Apr 2020 9:49am, by Alois Reitbauer

Dynatrace sponsored this post.

Alois Reitbauer
Alois is the Vice President and Chief Technology Strategist at Dynatrace. With years of expertise in application performance management, technology alliances, SaaS and product management, Alois is a passionate evangelist for Dynatrace technology, services and solutions. He specializes in enabling early-stage product innovation, business validation of new technology solutions and strategizing with application architects, engineers and customers.

One of the impacts of the COVID-19 pandemic is a move towards digital services at an unprecedented scale. Some businesses are attempting to replace lost revenue streams through a shift to online activity. Other organizations are scrambling to support significant growth in online users. All of this puts a lot of pressure on IT systems and applications.

While most government agencies and commercial enterprises have digital services in place, the current volume of usage — including traffic to critical employment, health and retail/eCommerce services — has reached levels that many organizations have never seen before or tested against.

Organizations need to prepare for both expected and unexpected demand, not only for the services that their customers and users rely on today, but for the services being developed for tomorrow.

There are proven strategies for handling this. In this article, I will share some best practices to help you understand and survive the current situation, and to future-proof your applications and infrastructure for similar situations that might occur in the months and years to come.

Step 1: Understand Traffic Patterns and Potential Spikes; Remove Team Silos

The impact of traffic spikes is illustrated by the load that eCommerce websites typically see during Black Friday. A massive rush of users over a very short period makes systems begin to slow, and then potentially return errors. Retail sites are usually well prepared for these spikes, as they know when they are coming. Other sites, including eLearning services, also experience seasonal or time-of-day-based usage patterns that they can prepare for (see figure 1).

However, the situation many websites recently experienced at the onset of COVID-19 was unprecedented — large, sudden traffic bursts with no clear pattern or knowledge of when the next burst would happen. For example, traffic spikes in government employment portals sometimes resulted from COVID-related news announcements (figure 2).

Figure 1: Spikes in total load time for a web page on an education site correlating to user activities increasing at regular intervals.

Figure 2: Impact of a massive spike in traffic on response times in a US employment portal.
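The difference between an expected pattern and an unexpected burst can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_spikes`, the sample numbers and the threshold of three standard deviations are all hypothetical, and a real monitoring tool would use a far more robust baseline.

```python
from statistics import mean, stdev

def find_spikes(history, current, threshold=3.0):
    """Return the interval indices where today's load is far above the baseline.

    history: list of past days, each a list of request counts per interval
    current: today's request counts, one per interval (same length as each day)
    threshold: how many standard deviations above the mean counts as a spike
    """
    spikes = []
    for i, value in enumerate(current):
        past = [day[i] for day in history]
        baseline = mean(past)
        # stdev needs at least two samples; use a floor of 1.0 so that
        # perfectly flat history does not flag every tiny change
        spread = stdev(past) if len(past) > 1 else 0.0
        if value > baseline + threshold * max(spread, 1.0):
            spikes.append(i)
    return spikes

# Hypothetical request counts for four intervals over three past days.
history = [
    [100, 120, 300, 110],
    [90, 130, 310, 100],
    [110, 125, 290, 105],
]
today = [105, 128, 305, 900]
print(find_spikes(history, today))  # prints [3]
```

The regular peak in the third interval is not flagged, because it is part of the known pattern. Only the burst in the last interval stands out.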

The best way for organizations to get ahead of the curve, so they aren’t caught off guard by sudden and unexpected spikes in online activity, is to rethink the structure of their IT teams. Learning from the past and incorporating data about future events helps to ensure your team is not hit by a surprise like this again. Achieving this requires business and technical teams in an organization to be in lockstep: communicating, aligning and preparing for what might happen. We refer to this as a BizDevOps strategy.

To support BizDevOps, organizations must create tighter collaboration among teams. They must establish an integrated communications approach centered on end-user data. As needed, teams should mandate daily meetings or standups to review what happened the day before, plan for the current day and look ahead to the days that follow. With everyone looking at the same data, it’s easier to work towards a common goal. And with everyone on the same page, organizations are ready to act quickly when surprises happen — like, say, a sudden rush of online activity prompted by a pandemic.

Step 2: Understand What to Get Ready for

Just seeing or predicting a spike in traffic will not solve the problem. Your systems will still be overloaded, and problems will continue to impact your users. The next step is to understand when your system is going to break.

As the dashboard example in figure 3 illustrates, historical usage patterns will not serve this purpose. This dashboard reflects traffic to the Austrian Economic Chamber website in late March 2020, starting when people began to submit requests for government financial support related to COVID-19. The site managers knew this event was coming, but they were not prepared for the massive amount of traffic that followed.

Figure 3: Increase in total load time for the Austrian Economic Chamber and worker association websites due to a surge in traffic.

So how do you know what to prepare for?

As no situation from the digital era compares to the current pandemic, this might be hard to assess. Often the best strategy is to conduct a stress test. To do this, hit your infrastructure with a steadily increasing amount of traffic until response times degrade or errors begin to appear.

Ideally, you’ll have a dedicated environment for this. If that isn’t the case, then test against your live system when the impact on users is minimal.
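The ramp-up logic of such a stress test might look like the following sketch. It assumes a `measure` callback that applies a given load and reports latency and error rate. Here that callback is a simulated stand-in (`simulated_system`, with a made-up capacity of 500 requests per second) so the example runs on its own. In practice it would wrap whatever load generator and monitoring you use, and the latency and error limits are placeholders to tune.

```python
def find_breaking_point(measure, start_rps, step_rps, max_rps,
                        max_latency_ms=800.0, max_error_rate=0.01):
    """Raise the load step by step and report where a limit is first violated.

    measure: callable taking requests per second, returning (latency_ms, error_rate)
    Returns a dict with the first failing load level (or None if none failed)
    and the last level that stayed within the limits.
    """
    last_healthy = None
    rps = start_rps
    while rps <= max_rps:
        latency_ms, error_rate = measure(rps)
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            return {"breaking_rps": rps, "last_healthy_rps": last_healthy}
        last_healthy = rps
        rps += step_rps
    return {"breaking_rps": None, "last_healthy_rps": last_healthy}


def simulated_system(rps, capacity_rps=500, base_latency_ms=100.0):
    """Hypothetical stand-in for a real system: latency grows as load nears capacity."""
    if rps >= capacity_rps:
        return 5000.0, 0.2  # saturated: very slow and returning errors
    latency_ms = base_latency_ms * capacity_rps / (capacity_rps - rps)
    return latency_ms, 0.0


result = find_breaking_point(simulated_system, start_rps=100, step_rps=50, max_rps=1000)
print(result)  # prints {'breaking_rps': 450, 'last_healthy_rps': 400}
```

Note that the simulated system starts to violate the latency limit before it is fully saturated, which is typical: response times usually degrade before errors appear.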

Once you know the limits of what your system can handle, you’re ready for the next step.

Step 3: Understand Why Your Systems Break

Simply knowing when your systems break won’t help you deliver better service to your users. Depending on the type of system you are running, the fix may not be obvious. It might be as simple as adding new servers, or as complicated as changing specific application behavior, like switching from dynamic to static content or disabling certain functionalities under high load.

Once you start to hit the breaking point identified by your stress test (step 2), use automated root-cause analysis (see figure 4) to identify which components broke and exactly why they broke. This will enable your teams to find ways to fix the problems.

Figure 4: Automated, AI-based analysis of the root cause of a problem.

Step 4: Validate Fixes in Real-Time

Once you’ve understood the root cause of problems, you’ll need your teams to work on fixes that can be deployed rapidly into your live environment. As such, it’s imperative to monitor the impact of these fixes on the overall health of your system in real time. Sometimes, fixing things in one place can result in problems arising somewhere else, which might mean you need to either roll back the fix or repair whatever new problem it has caused.

It’s important, though, to fix one problem at a time and to validate the impact of each fix before moving on. Teams should fix small, atomic problems and combine these fixes as needed, rather than attempting to fix complex processes involving many interdependencies.
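A validation check for a single fix could be sketched as follows. The service names, the sample response times and the 10% regression tolerance are hypothetical. The idea is simply to compare median response times before and after the fix, for the service you fixed and for everything around it.

```python
from statistics import median

def evaluate_fix(before, after, target, tolerance=0.10):
    """Decide whether a fix helped its target without hurting other services.

    before, after: dict mapping service name to a list of response times (ms)
    target: the service the fix was meant to improve
    tolerance: allowed relative slowdown for services other than the target
    """
    regressions = []
    for service in before:
        old = median(before[service])
        new = median(after[service])
        if service != target and new > old * (1 + tolerance):
            regressions.append(service)
    improved = median(after[target]) < median(before[target])
    verdict = "keep" if improved and not regressions else "roll back or repair"
    return verdict, regressions

# Hypothetical measurements around a fix to the "checkout" service.
before = {"checkout": [900, 950, 1000], "search": [200, 210, 220]}
after_good = {"checkout": [400, 420, 450], "search": [205, 210, 215]}
after_bad = {"checkout": [400, 420, 450], "search": [600, 640, 700]}

print(evaluate_fix(before, after_good, "checkout"))  # prints ('keep', [])
print(evaluate_fix(before, after_bad, "checkout"))   # prints ('roll back or repair', ['search'])
```

The second case is the one described above: the fix worked where it was applied but caused a new problem somewhere else.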

Step 5: Automate the Fix and Make It Repeatable

After completing step 4, you will have a list of small fixes that can be applied separately. You might be tempted to write down how to execute them. Documentation is good. A repeatable implementation is, however, much better.

Making fixes available as scripts has clear benefits, especially if you need to react quickly to unexpected situations. First, anybody can run them instantly without having to learn any specifics. Second, running an automated script is much faster than performing the same steps by hand. It is also much less error-prone.
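As a minimal sketch of what a repeatable fix can look like, the following function switches a site from dynamic to static content, one of the mitigations mentioned in step 3. The `content_mode` setting and the configuration dictionary are hypothetical. The point is the shape of the script: it can be previewed with a dry run, and running it twice does no harm.

```python
def apply_static_content_fix(config, dry_run=False):
    """Switch a hypothetical 'content_mode' setting from dynamic to static.

    Returns the (possibly updated) configuration and a short status message.
    The input dictionary is never modified in place.
    """
    if config.get("content_mode") == "static":
        return config, "already applied, nothing to do"
    if dry_run:
        return config, "would switch content_mode to static"
    updated = dict(config)
    updated["content_mode"] = "static"
    return updated, "switched content_mode to static"


site_config = {"content_mode": "dynamic", "cache_seconds": 30}

_, preview = apply_static_content_fix(site_config, dry_run=True)
fixed_config, first_run = apply_static_content_fix(site_config)
_, second_run = apply_static_content_fix(fixed_config)

print(preview)     # prints: would switch content_mode to static
print(first_run)   # prints: switched content_mode to static
print(second_run)  # prints: already applied, nothing to do
```

Because the script checks the current state before acting, it is safe to hand to anyone on the team, and safe for an automated workflow to call repeatedly.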

Step 6: Automate the Workflow

Up to now, all steps have required someone to actively drive the process. While this might work fine during business hours, it does not prepare you for unexpected events outside of usual business hours.

Luckily, at this point, you have all the ingredients you need to automate the process and move towards what we call NoOps. Essentially, this means that no manual operations tasks are needed to perform well-defined operational steps.

In this step, you link root-cause analysis to the proper remediation action (see figure 5). Having this in place allows actions to be triggered automatically as needed.

Figure 5: Remediation actions called automatically in response to detected problems.
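Conceptually, the link between a detected root cause and its remediation is a lookup table. The sketch below is not the API of any particular monitoring product. The root-cause labels, the problem format and the action functions are all hypothetical, and the actions only return a description of what they would do.

```python
def add_servers(problem):
    """Hypothetical action: scale out the affected component."""
    return f"scale out {problem['component']}"

def serve_static_content(problem):
    """Hypothetical action: fall back to static content for the component."""
    return f"serve static content for {problem['component']}"

def page_on_call(problem):
    """Fallback when no automated fix is known: hand over to a human."""
    return f"page on-call engineer about {problem['component']}"

# Map each known root cause to the scripted fix from step 5.
REMEDIATIONS = {
    "cpu_saturation": add_servers,
    "slow_dynamic_rendering": serve_static_content,
}

def handle_problem(problem):
    """Pick the remediation for the detected root cause and run it."""
    action = REMEDIATIONS.get(problem["root_cause"], page_on_call)
    return action(problem)


print(handle_problem({"root_cause": "cpu_saturation", "component": "web-frontend"}))
# prints: scale out web-frontend
print(handle_problem({"root_cause": "disk_full", "component": "database"}))
# prints: page on-call engineer about database
```

Keeping a human fallback for unknown root causes means automation only ever runs the well-defined operational steps, which matches the NoOps definition given above.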

You might wonder why you should not simply always have your systems at maximum capacity, or with all remediation features and steps enabled.

First, running infrastructure costs money. As shown above (figure 5), many sites experience only very short capacity shortages, usually lasting between half an hour and one hour. Running at full capacity all the time would result in unreasonably high costs.
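A rough back-of-the-envelope calculation shows the size of the gap. All numbers here are invented for illustration: 10 servers for normal load, 40 for peak load, one spike hour per day and a price of 0.50 per server-hour.

```python
def monthly_cost(baseline_servers, peak_servers, spike_hours_per_day,
                 price_per_server_hour, days=30):
    """Compare running at peak capacity all month with scaling up only for spikes."""
    always_on = peak_servers * 24 * days * price_per_server_hour
    baseline = baseline_servers * 24 * days * price_per_server_hour
    extra = ((peak_servers - baseline_servers) * spike_hours_per_day
             * days * price_per_server_hour)
    return always_on, baseline + extra


always_on, on_demand = monthly_cost(
    baseline_servers=10, peak_servers=40,
    spike_hours_per_day=1, price_per_server_hour=0.5,
)
print(always_on, on_demand)  # prints 14400.0 4050.0
```

Under these assumptions, permanently provisioning for the peak costs roughly three and a half times as much as scaling up only while the spike lasts.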

Second, mitigation actions do not come free. Some mitigation actions will have an impact on your business. Let’s look at news/media sites, for example (figure 6). Third-party services, such as advertisements that are served up to the news site, could be one reason why the system slows down during times of peak traffic. A mitigation action could be to remove these components temporarily from the site.

Obviously, this would result in a loss of advertising revenue. It might also turn out that removing the third-party advertising components doesn’t improve user experience at all. Even if it does, the advertising revenue might be better spent on scaling up infrastructure, which could have a greater impact on user experience.

Figure 6: Spikes in total load time caused by third-party components on media websites.

Help to Get Started

We understand that organizations that traditionally did not have to deal with traffic surges affecting user experience might not be prepared to implement all these measures. At Dynatrace, we are helping by providing a few free services to help your organization respond to COVID-19. A great way to begin is with our free trial.

