As we all know by now, Amazon Web Services had a significant disruption to its Simple Storage Service (S3) on Tuesday. Given S3's popularity and extensive usage across the internet, this was a major issue for many companies.
AWS published a detailed explanation of what caused the service disruption. In it, the engineers stated that a member of the S3 team made a mistake that removed an incorrect and inappropriately large number of servers from service, with a cascading impact that was felt throughout the industry.
Any DevOps professional who has ever supported a production service can relate to this sort of problem. We've all typed the wrong command and caused a problem we did not expect. There's even a term for it: "fat fingering," referring to the typing mistakes that happen when a finger hits the wrong key on a keyboard. We can all relate to this S3 team member, and we can all feel their pain. We've all been there before; this incident is seemingly no different. It's just on a much larger stage than what most of us experience.
Of course, you can hear the tweets coming a mile away. How could AWS allow a simple fat-fingering mistake to bring a system as large and as critical as S3 to its knees? How could AWS not anticipate this sort of problem?
You see, AWS did anticipate this problem, and like all good service providers, it has processes and procedures in place for dealing with problems like this when they occur. You simply don't provide a service with S3's history of reliability and availability without basic fat-finger protections in place.
Let’s take a step back and look at fat fingering problems in general. To do this, let’s step away from AWS for a moment…
Manual Changes vs. Tooled Changes
You see, many newly created production systems, when they first come online, focus more on functionality and less on supportability. They are built to provide a specific capability. Of course, the services must be maintained, but maintaining them often involves manual tasks performed by operations individuals: logging on to servers and rebooting them, terminating processes, or updating a configuration file. For many production systems, this is how they are maintained. Individuals log in to the production servers as highly privileged users, or "superusers," and manually perform the necessary actions. This is, unfortunately, a very common process in the industry.
This is, of course, a bad practice when it comes to maintaining highly available services. Once you allow people to log in to production servers, you open those servers to the mistakes that people sometimes make. It is *very* easy to accidentally type the wrong thing and delete an important data file, shut down a server inappropriately, or terminate important processes.
With Amazon S3 Down We are all old men pic.twitter.com/aT5EXe9ieM
— Such _Politics (@SuchPolitics) February 28, 2017
That's why architects and systems engineers have long encouraged a specific production best practice: don't allow anyone to log in to a production server for any reason. Instead, build tools that perform the exact functions people typically need to perform on production servers. This might include rebooting servers, trimming log files, changing a configuration setting, or restarting a service. All of these tasks should be done through tools that automate the dangerous operations required on your production servers.
Why is this important? Because automation tools give you many advantages over manual operations. Some advantages include:
- The tools can validate input given to them and make sure that the input makes sense. If the input is incorrect in some way, they can report the problem and fail rather than using the incorrect input and doing something bad to the servers.
- You only give individuals the permissions they need. If restarting a process on a server is an easy first step in solving a problem, a tool can make that job easy. You can create a tool that allows a first-line support individual to perform that task, without giving everyone full access to your servers.
- From a security standpoint, limiting what individuals can do to only the tasks they need to do to perform their jobs helps considerably in preventing bad actors from breaking into your systems. If nobody can log in as superuser to your production servers, then hackers can’t either.
- You can log operations activities, which makes post-mortem problem analysis easier and more accurate, and provides a trail of actions for security purposes.
By limiting the allowed tasks to only those the tool will execute, and validating that the input is correct and that the user has appropriate permissions, you go a long way toward preventing fat-finger mistakes from bringing down a large system.
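The pattern above can be sketched in a few lines. The following is a minimal illustration, not AWS's actual tooling; every name in it (the service whitelist, the operator grants, the audit log) is a hypothetical stand-in for whatever orchestration system a real shop would use:

```python
# Hypothetical maintenance tool: restart a named service for an operator.
# All names are illustrative; a real tool would call orchestration APIs.

ALLOWED_SERVICES = {"web", "cache", "indexer"}            # whitelist of restartable services
OPERATOR_PERMISSIONS = {"alice": {"restart"}, "bob": set()}  # least-privilege grants
AUDIT_LOG = []  # in a real tool, an append-only log for post-mortem analysis

def restart_service(operator: str, service: str) -> str:
    # Check the operator's permission first: no blanket superuser access.
    if "restart" not in OPERATOR_PERMISSIONS.get(operator, set()):
        raise PermissionError(f"{operator} may not restart services")
    # Validate the input: fail loudly on anything unexpected.
    if service not in ALLOWED_SERVICES:
        raise ValueError(f"unknown service {service!r}; refusing to act")
    AUDIT_LOG.append(f"{operator} restarted {service}")  # trail for security and post-mortems
    return f"restarted {service}"

print(restart_service("alice", "cache"))  # permitted operator, valid input
```

Note that a typo in the service name, or a request from someone without the grant, fails the request instead of touching a server, which is exactly the property the bullets above describe.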
The idea that you can replace manual tasks with automated, scripted tooling is great in theory; in theory, it can completely avoid problems such as a fat-finger mistake causing a production outage.
However, the reality is a bit different. There are many realities that can cause problems:
- It is nearly impossible to predict, ahead of the need, all of the tooling you might need during a crisis situation to resolve a problem. So, undoubtedly, you need a “back door” that gives your most trusted operations individuals full access to everything. This is especially true with newer services and becomes less of a need as time goes on and a system becomes more mature.
- The tooling still requires input, and the best tooling validates that input. But what is the correct range of values to allow? For example, a tool that reboots X percent of the servers in your fleet probably should not allow 100 percent of servers to be rebooted at once. But how many is acceptable? 10 percent? 20 percent? 90 percent? What's the right answer that prevents problems but still meets the need?
- Even with tooling and validation, it's possible to miss a necessary check, and the tool can still do something wrong simply because it was never programmed to reject an unanticipated request.
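The second point above can be made concrete. A hypothetical capacity-removal tool (the function name and the 10 percent cap are assumptions chosen for illustration; there is no universally right threshold) might refuse any request above a configured safety limit:

```python
# Hypothetical capacity tool: compute how many servers to take offline.
# The 10% cap is an illustrative safety threshold, not a universal answer.

MAX_REMOVAL_FRACTION = 0.10  # never remove more than 10% of the fleet at once

def servers_to_remove(fleet_size: int, requested_fraction: float) -> int:
    if fleet_size <= 0:
        raise ValueError("fleet size must be positive")
    if not 0 < requested_fraction <= MAX_REMOVAL_FRACTION:
        # Reject rather than guess at the operator's intent: a typo like
        # 0.9 instead of 0.09 fails here instead of in production.
        raise ValueError(
            f"requested fraction {requested_fraction} outside "
            f"allowed range (0, {MAX_REMOVAL_FRACTION}]"
        )
    return int(fleet_size * requested_fraction)

print(servers_to_remove(1000, 0.05))  # a 5% request passes validation
```

The hard part, as the list above notes, is not writing this check but choosing the cap, and remembering to write the check for the input you didn't anticipate.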
These realities impact all operations teams. So what can you do? Build the best tooling you can, with as few bugs as possible, and fix and improve it regularly as necessary. In other words, treat your tooling like production software and maintain it in the same manner.
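Treating tooling like production software means, among other things, giving its safety checks their own regression tests. A minimal sketch, with an assumed validator and an assumed 10 percent cap, might look like this:

```python
# Hypothetical sketch: the ops tool's validation logic gets regression
# tests, just like any other production code. Names and the cap are
# illustrative assumptions.

def validate_removal_fraction(fraction: float, cap: float = 0.10) -> None:
    """Raise ValueError unless 0 < fraction <= cap."""
    if not 0 < fraction <= cap:
        raise ValueError(f"fraction {fraction} outside allowed range (0, {cap}]")

def test_accepts_reasonable_request():
    validate_removal_fraction(0.05)  # should not raise

def test_rejects_fat_fingered_request():
    # A typo like 0.9 instead of 0.09 must fail in the test suite,
    # not in production.
    try:
        validate_removal_fraction(0.9)
    except ValueError:
        return
    raise AssertionError("oversized request was accepted")

test_accepts_reasonable_request()
test_rejects_fat_fingered_request()
```

A shop that runs such tests on every change to its tooling catches the missing-validation bugs described above the same way it catches bugs in the service itself.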
Back to AWS
And this is exactly what AWS did. The company's engineers know that fat-fingering happens. They know they can't give everyone full access to all servers; at the scale AWS operates, that would be very dangerous indeed. Instead, they follow the best practice described above. They've built tooling to perform all the maintenance tasks they need performed, built tools to keep people from making mistakes while maintaining the systems, and invested heavily in those tools.
But how successful were they at building these tools for the S3 service? Extremely successful. S3 is, by virtually every measurable standard, a stellar example of a highly available and reliable service. S3, quite frankly, just doesn't break. Even though it has experienced enormous growth over the years, it does not have major outages. Everyone has depended on S3 being there, because S3 has proven to be very dependable.
Then what happened this week? Well, someone typed the wrong thing into a tool. This happens all the time. The tool, however, rather than rejecting the bad input and failing the request, decided to implement the invalid request anyway. And the rest is history.
So, this problem wasn’t a fat-finger problem by a poor operations engineer. This problem wasn’t a failure of following best practices for building high availability services. This problem wasn’t an unanticipated scaling problem. What caused the problem at AWS the other day? A simple software bug in an important tool. Something that happens all the time.
What should be done about this? The bug should be fixed (AWS says it has done this). Procedures should be reviewed to see if more problems can be uncovered (AWS says it is doing this).
And, more importantly, everyone else in the industry should learn from this. Best practices such as those AWS used here should be more widespread. If you still have production servers that engineers routinely log in to, re-evaluate that process. If your engineers can terminate EC2 instances at will, re-evaluate that process. If you can delete arbitrary buckets or data tables from a command line or console, re-evaluate. Your systems will be more resilient and more stable, which helps everyone.
Feature image: New Old Stock.