When you are launching a new application, or want to confirm that an existing one works well in production, it is always good to test it in a pre-production environment first. However, running some specific tests in the production environment itself can prove invaluable for keeping your system safe and improving the performance of your service. This is where a process called gameday system testing can prove extremely fruitful.
Gameday system testing, though the name may sound like simply a game, is actually a method of testing your apps and systems in your production environment in order to ensure everything works as expected, and to test your team’s responses to failures in all or part of the systems being tested.
Typically, a specific window of time (such as a few days to a week) is scheduled to conduct vulnerability tests and to log the performance of the systems. When all tests are complete, the results can be evaluated to determine whether any improvements to the system or to failure procedures need to be implemented.
How Can Gameday System Testing Help?
While testing outside of production is a very proper approach, it’s incomplete because some behaviors can be seen only in production, no matter how identical a staging environment can be made. — John Allspaw, Etsy
While staging environments are certainly an important part of the process, there is always the possibility of the production environment having differences of some sort. Notably, production is where actual users of your services will be interacting with your applications. Thus, it can be difficult to replicate this exact environment, no matter how close you are able to make the staging environment to production.
For a typical gameday testing scenario, there are some things you will want to do in order to help ensure you get the best information from the tests. It is usually a good idea to create both automated and manual tests, as each type of scenario can produce different results.
Also, it can be helpful to have people from multiple teams bring ideas for different testing scenarios that can be used for testing the system. Managers, developers, support personnel, and other teams all think differently as to what might break the system.
Of course, it is vitally important to notify everyone affected that gameday testing is going on and that one or more systems could experience an outage during that time. This helps ensure that your teams are prepared to act appropriately if or when the usual services are interrupted.
Finally, be sure to document everything in detail, especially anything that didn’t go according to plan and any bugs or response issues. For example, if there was a bug that caused a database crash, then it should be documented to be fixed. If the response in reporting or recovering the database isn’t optimal, this is a good time to address what could have been done better (and a far better time than after an unexpected outage).
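One way to keep that documentation consistent is to record each observation in the same shape. The following is only a minimal sketch: the `Finding` fields and the `open_action_items` helper are hypothetical names chosen for illustration, not part of any gameday tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Finding:
    """One observation recorded during a gameday exercise (hypothetical schema)."""
    scenario: str        # what was deliberately broken or tested
    expected: str        # what the team thought would happen
    observed: str        # what actually happened
    went_to_plan: bool   # judged by the person recording the finding
    follow_up: str = ""  # action item; empty if nothing needs fixing
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def open_action_items(findings):
    """Return the follow-ups for findings that did not go according to plan."""
    return [f.follow_up for f in findings if not f.went_to_plan and f.follow_up]
```

Reviewing the output of `open_action_items` at the end of the gameday gives the team a short list of bugs and response issues to address, which is the part of the record that matters most.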
The worst-case scenario with a GameDay exercise is that something will go wrong during the exercise. In that case, an entire team of engineers is ready to respond to the surprises, and the system will become stronger as a result … The worst-case scenario in the absence of a GameDay exercise is that something in production will fail that wasn’t anticipated or prepared for, and it will happen when the team isn’t expecting or watching closely for it. — John Allspaw, Etsy
Some notable examples of organizations that have taken advantage of gameday testing include Stripe and Gov.uk.
Stripe recently ran a gameday test, and it proved helpful in finding an issue that could have been far worse had it struck unexpectedly. The gameday was designed to test the failover of a Redis cluster by running kill -9 on its primary node.
As it turned out, all data in the cluster was lost! Fortunately, since they were ready to respond, they were able to restore the data quickly with a saved backup. As you can see, having such a thing happen randomly could have had far worse consequences, so it was certainly helpful to them to catch the issue during a planned test.
If there's anything to learn from this Redis problem, even a simple kill -9 test needs to happen more often in our industry.
— Kelly Sommers (@kellabyte) October 22, 2014
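A failover drill like the Redis one described above is easier to evaluate if you can tell immediately whether data survived. The sketch below shows one possible approach, not what Stripe did: write a few known "canary" values before the failure is injected, then check for them afterwards. The function names and the `InMemoryStore` stand-in are assumptions made for illustration; the only requirement on the store is that it offers `set` and `get`.

```python
import uuid


def seed_canaries(store, count=5):
    """Write known key/value pairs before the failure is injected."""
    canaries = {
        f"gameday:canary:{uuid.uuid4().hex}": str(i) for i in range(count)
    }
    for key, value in canaries.items():
        store.set(key, value)
    return canaries


def missing_canaries(store, canaries):
    """Return the canary keys whose values did not survive the failover."""
    return sorted(
        key for key, value in canaries.items() if store.get(key) != value
    )


class InMemoryStore:
    """Stand-in for a real client, used here so the sketch runs anywhere."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def wipe(self):
        # Simulates the total data loss seen in the drill.
        self._data.clear()
```

In a real drill you would seed the canaries, kill the primary node, wait for the failover, and then call `missing_canaries`; an empty list suggests the data came through intact. A real client such as redis-py also exposes `set` and `get`, though it would need `decode_responses=True` so that returned values compare equal to the stored strings.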
Gov.uk also had a recent gameday test, and their testing focused on emergency publishing and data loss. These tests helped to uncover some minor documentation issues which, once fixed, will help them respond even more quickly to the scenarios they were testing.
With the emergency publishing, their team was able to work around a documentation issue, find a server where they had credentials, and post the necessary message relatively quickly. With the data loss, the team was again able to find a way past a documentation issue and bring the database back up within a matter of hours. In the end, they found the exercise helpful: once the documentation was patched up, they would be able to respond even faster if one of these events were to occur.
Finding the Right Tools
Moving to the cloud can help make your gameday testing even easier. However, a recent poll I ran among prospects and clients found that more than 40 percent of IT professionals are afraid of cloud migrations because they don’t have the right tech.
If you are looking to move to the cloud, consider something like Morpheus, which makes provisioning, scaling, and maintenance of your apps and servers a much simpler process. Cloud management tools are not all created equal. Find one that can get you a point of contact right away, not three months from now. A best-in-class cloud management platform will allow you to provision databases and servers quickly, and have your app up and running in a fraction of the time. A good tool should also be able to add load balancing or IPAM to the mix to help improve the speed and reporting for your website or app.
Trust me, finding the right tool can be the difference between a successful migration and losing your job, so choose wisely. Hit me up if you want some advice. I’d be happy to help.