The Atlassian Outage Just Keeps Going and Going and…
Usually, when there’s a web service outage, there’s much angst and crying, but within a few hours, or at most a day, all’s right with the world again. Not this time. Not with Atlassian. This time, we’re going on a week of Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie Cloud, Statuspage, and Atlassian Access all being out of service for at least some users.
As Atlassian tweeted on April 6, “While running a maintenance script, a small number of sites were disabled unintentionally. We’re sorry for the frustration this incident is causing and we are continuing to move through the various stages for restoration.”
No Other Explanation
Officially, Atlassian hasn’t offered any other explanation for the service failure. In a purported note to a Jira user, however, Atlassian co-CEO Scott Farquhar said, “On Tuesday morning (April 5 PDT), we conducted a maintenance procedure designed to clean up old data from legacy capabilities. As a result, some sites were unintentionally deactivated, which removed access to our products for you and a small subset of our customers. We can confirm this incident was not the result of a cyberattack and there has been no unauthorized access to your data.”
The company said the right things: “This is our top priority and we have mobilized hundreds of engineers across the organization to work around the clock to rectify the incident.” But days into the outage, users are getting sick and tired of waiting.
As developer Morten Linderud tweeted, “Our project managers are climbing the walls and our service request queues are all Slack threads. I’m currently doing the on-call rotation checklist in a badly formatted Word document someone had managed to copy from Confluence.”
Small Number Affected
The only consolation is that the outage appears to be affecting only a small number of customers: 400. That amounts to a tiny fraction, less than 0.2%, of Atlassian’s customer base of approximately 226,000. Of course, if you’re one of those impacted, you’re hating life right about now.
As one user reported on Reddit, Atlassian just sent an update that read: “We were unable to confirm a more firm ETA until now due to the complexity of the rebuild process for your site. While we are beginning to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks.”
His reaction? “Thoughts and prayers to my sanity, fellow sys admins.”
In its most recent status report, April 11, 15:34 UTC, Atlassian said, “A small number of Atlassian customers continue to experience service outages and are unable to access their sites. Our global engineering teams are working 24/7 to make progress on this incident. At this time, we have rebuilt functionality for over 35% of the users who are impacted by the service outage, with no reported data loss. The rebuild stage is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time but are critical to ensuring the integrity of rebuilt sites. We apologize for the length and severity of this incident and have taken steps to avoid a recurrence in the future.”
Farquhar also said, “I want to personally apologize for the Atlassian outage that you are experiencing. We understand how mission-critical our products are to your business and want to make sure you know we are doing everything we can to resolve this. We hold ourselves to the highest standards in dependability, transparency, and customer service, and over the past few days, we have failed to live up to that standard.”
The chief concern of many of those who’ve been hit by this problem is that they’ll end up losing data. To date, though, no one has reported losing their data. Hopefully, no one will, and all will, eventually, be well for all of Atlassian’s users.