Why Atlassian Failed So Hard
On April 4, Atlassian’s web services went down for about 400 customers, or somewhere between 50,000 and 400,000 users. So far, so what? Web and cloud services go down all the time. It’s never any fun, but it happens. This time, though, Atlassian’s woes kept going and going. Ten days later, some users still don’t have access to Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie Cloud, Statuspage, and Atlassian Access.
Since Atlassian’s Jira holds just over 84% of the bug-and-issue-tracking market, this is truly annoying. I mean, Atlassian’s business is all about bug tracking! And, for the longest time, Atlassian barely had a thing to say about the problem. The first comment came two days later and said little: “While running a maintenance script, a small number of sites were disabled unintentionally. We’re sorry for the frustration this incident is causing, and we are continuing to move through the various stages for restoration.”
Not a Cyberattack
Then, there was largely silence. A purported note from Atlassian CEO Scott Farquhar said, “On Tuesday morning (April 5 PDT), we conducted a maintenance procedure designed to clean up old data from legacy capabilities. As a result, some sites were unintentionally deactivated, which removed access to our products for you and a small subset of our customers. We can confirm this incident was not the result of a cyberattack and there has been no unauthorized access to your data.”
That, and the obligatory “This is our top priority and we have mobilized hundreds of engineers across the organization to work around the clock to rectify the incident,” was it.
Why the silent treatment? Part of the reason, developer and writer Gergely Orosz suggested, is that “Atlassian staff and customers turned their attention to Atlassian’s flagship annual event, Team 22. Held in Las Vegas, many company employees, much of the leadership team, and many Atlassian partners traveled to attend the event in person.” Despite the system failure, Atlassian appears to have stayed focused on Team 22.
The affected Atlassian customers were not happy. As one tweeted, “What happened there? It’s not a small hiccup with a few minutes or an hour-long downtime, Confluence and Jira are literally down all day.”
What Went Wrong
The company had deactivated its standalone legacy app, “Insight — Asset Management,” for Jira Service Management and Jira Software on customer sites. The process went badly wrong because of two critical problems:
- Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.
- Faulty script. Second, the script Atlassian used provided both the “mark for deletion” capability used in normal day-to-day operations (where recoverability is desirable) and the “permanently delete” capability needed to remove data for good when compliance requires it. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.
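To make those two failure modes concrete, here is a minimal Python sketch of the kind of guard rails that could catch both mistakes. It is not Atlassian’s script. Every name in it (`AppId`, `SiteId`, `Mode`, `deactivate_apps`, `confirm_permanent`) is hypothetical and exists only for illustration. The idea is simply that app IDs and site IDs are distinct types, the recoverable mode is the default, and permanent deletion needs an explicit extra flag.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"      # recoverable, day-to-day mode
    PERMANENT_DELETE = "permanent"  # irreversible, compliance-only mode


@dataclass(frozen=True)
class AppId:
    """Hypothetical wrapper for the ID of a single app."""
    value: str


@dataclass(frozen=True)
class SiteId:
    """Hypothetical wrapper for the ID of a whole cloud site."""
    value: str


def deactivate_apps(ids, mode=Mode.MARK_FOR_DELETION, confirm_permanent=False):
    """Return a plan of (id, action) pairs; nothing is deleted in this sketch."""
    # Refuse anything that is not an app ID, such as a whole-site ID.
    wrong = [i for i in ids if not isinstance(i, AppId)]
    if wrong:
        raise TypeError(f"expected AppId values, got {len(wrong)} other ID(s)")
    # Permanent deletion must be requested twice: by mode and by flag.
    if mode is Mode.PERMANENT_DELETE and not confirm_permanent:
        raise PermissionError("permanent delete needs explicit confirmation")
    return [(i.value, mode.value) for i in ids]
```

With this shape, handing the function a list of `SiteId` values fails immediately instead of deleting sites, and a caller who forgets the confirmation flag gets an error rather than an irreversible delete.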
But didn’t Atlassian have backups? Well, yes, they did. They maintained both a synchronous standby replica in multiple AWS Availability Zones (AZ) and separate immutable backups designed to enable recovery to a previous point in time.
So, what’s the problem? The backups are, in a word, messy.
If Atlassian restored everything from the checkpoint, the 400 affected customers would get their data back, but everyone else would lose all the data they had created since the backup was made. So, Atlassian had to manually pull the affected customers’ data from the backups.
Atlassian CTO Sri Viswanath explained, “What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers.”
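The difference between a full rollback and a selective restore is easier to see in a toy model. The Python sketch below treats the shared environment as a plain dictionary keyed by tenant. That is an assumption made purely for illustration, as are the names `restore_tenants`, `acme`, and `globex`. Atlassian’s real storage layout is not described here and is certainly far more involved.

```python
import copy


def restore_tenants(live, backup, tenant_ids):
    """Copy only the named tenants from the backup into a new live state."""
    restored = dict(live)  # every other tenant keeps its current data
    for tid in tenant_ids:
        if tid not in backup:
            raise KeyError(f"tenant {tid!r} is not in the backup")
        restored[tid] = copy.deepcopy(backup[tid])
    return restored


# Hypothetical data: "acme" was deleted, "globex" kept working after the backup.
backup = {"acme": ["ticket-1"], "globex": ["page-1"]}
live = {"globex": ["page-1", "page-2"]}

full_rollback = copy.deepcopy(backup)               # globex loses "page-2"
selective = restore_tenants(live, backup, ["acme"])  # globex keeps "page-2"
```

In the full rollback, the unaffected tenant silently loses everything written since the backup. In the selective restore, only the deleted tenant is touched, which is the harder operation Viswanath says had not yet been automated.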
They’re now automating it, but even so, it’s slow. “Currently, we are restoring customers in batches of up to 60 tenants at a time. End-to-end, it takes between four and five elapsed days to hand a site back to a customer. Our teams have now developed the capability to run multiple batches in parallel, which has helped to reduce our overall restore time.”
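Some back-of-the-envelope arithmetic shows why parallel batches matter. The Python sketch below assumes, purely for illustration, that each of the roughly 400 customers counts as one tenant and that three batches can run at once. Neither figure comes from Atlassian. Only the batch size of 60 and the four-to-five-day turnaround come from the quote above.

```python
import math


def restore_estimate(tenants, batch_size, days_per_batch, parallel_batches):
    """Return (number of batches, total elapsed days) for a batched restore."""
    batches = math.ceil(tenants / batch_size)
    # Batches that run side by side finish together, as one "wave".
    waves = math.ceil(batches / parallel_batches)
    return batches, waves * days_per_batch


# One batch at a time, at the slow end of the quoted range: 7 batches, 35 days.
print(restore_estimate(400, 60, 5, 1))
# Three batches in parallel (a hypothetical figure): 7 batches, 15 days.
print(restore_estimate(400, 60, 5, 3))
```

Under those assumptions, strictly sequential restores would take about a month, so any overall timeline measured in a couple of weeks depends on running batches side by side.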
Eventually, Viswanath promises, “we will conduct and share a post-incident review with our findings and next steps. This report will be public.” In the meantime, the companies still out of service are unhappy with both the failure and Atlassian’s poor communication.