Confluence of Events Causes Atlassian Outage
Atlassian, the maker of the popular developer tools Jira, Confluence, and Bitbucket, suffered a multi-week outage of its cloud service beginning on April 5th. This blog offers some background on the company and the outage and previews upcoming research on the topic.
Who Is Atlassian?
Atlassian is a multi-billion-dollar software company founded in 2002 in Australia by two University of New South Wales graduates. These two founders, Mike Cannon-Brookes and Scott Farquhar, initially funded the company with $10,000 of credit card debt. Since then, they have grown Atlassian to over 7,000 employees around the world, providing a suite of products targeted at the development and project management communities. Atlassian has been very successful at providing software products used by organizations ranging from start-ups to some of the largest enterprises in the world, boasting over 200,000 customers.
Atlassian Doubles Down on Cloud-First
Atlassian began as an on-premise enterprise software company and has spent the past decade pushing its on-premise customers toward its cloud offerings, which run in AWS. In its latest push, Atlassian stopped selling its popular on-premise Server edition in February 2021, leaving the feature-rich but pricey Data Center edition as the only remaining on-premise option for its customers. To provide additional incentive to move to Atlassian Cloud, it simultaneously announced a doubling of the price of the Data Center edition.
Atlassian Cloud Outage
On April 5, 2022, Atlassian suffered a self-inflicted outage of its cloud services. The outage was initially reported to have affected around 400 customers, a figure later revised to 775. This is a relatively small subset of Atlassian's overall customer base; however, the outage wasn't fully resolved for almost two weeks. Even two days without an issue tracking system (Jira), a knowledge base (Confluence), or a source code control and build orchestration system (Bitbucket) can cause major disruptions to an organization; having them down for two weeks is almost unimaginable.
What Happens in Vegas Stays in Vegas
On the same day the outage began, Atlassian was kicking off its Atlassian TEAM 2022 conference at the Venetian in Las Vegas. This is Atlassian's big event of the year, a multi-day conference full of product and company announcements. Unfortunately for Atlassian, all the splash and coverage focused on the outage rather than the event. There's never a good time for a service outage, but some times are far worse than others.
So What Actually Happened?
Atlassian cloud operators were executing a script to clean up and remove accounts for obsolete application software for a subset of users in their cloud. This is routine work for anyone who has run a cloud service; in this case, however, two things went wrong at once, combining to make the situation very bad. First, the script was given the wrong IDs, so active accounts were deleted instead of the obsolete ones. Second, the script was configured to perform a permanent delete rather than a 'soft' delete, which would have been relatively easy to recover from.
As a result, the accounts, and all associated data for 775 of Atlassian’s customers, were permanently deleted from the production system.
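The difference between the two delete modes is worth spelling out. The sketch below is purely illustrative (the class and method names are hypothetical, not Atlassian's actual tooling): a soft delete only stamps a record as deactivated, so restoring it is a one-line operation, while a hard delete destroys the record outright.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical sketch, not Atlassian's real system: a soft delete marks a
# record as deactivated (recoverable), a hard delete destroys it.

@dataclass
class Account:
    account_id: str
    deactivated_at: Optional[datetime] = None

class AccountStore:
    def __init__(self, accounts):
        self._accounts = {a.account_id: a for a in accounts}

    def soft_delete(self, account_id: str) -> None:
        # Recoverable: the row survives, carrying a deactivation timestamp.
        self._accounts[account_id].deactivated_at = datetime.now(timezone.utc)

    def restore(self, account_id: str) -> None:
        # Undoing a soft delete is trivial.
        self._accounts[account_id].deactivated_at = None

    def hard_delete(self, account_id: str) -> None:
        # Irreversible: the record and its data are gone.
        del self._accounts[account_id]

    def exists(self, account_id: str) -> bool:
        return account_id in self._accounts

store = AccountStore([Account("acct-1"), Account("acct-2")])
store.soft_delete("acct-1")
store.restore("acct-1")      # fully recovered
store.hard_delete("acct-2")  # unrecoverable without backups
print(store.exists("acct-1"), store.exists("acct-2"))  # True False
```

Had the cleanup run in soft-delete mode, recovery would have amounted to flipping the deactivation flag back; because it ran in hard-delete mode, Atlassian had to rebuild each customer from backups instead.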
The Long Road to Recovery
Once the damage was done, Atlassian began a painfully manual recovery process that, even with all hands on deck, was restoring only around 5% of the affected customers per day. Atlassian engineering removed some of the roadblocks through automation and parallelization, raising the rate to around 20% of the affected customers per day, and finally resolved the issue on April 18th, a full two weeks after it began.
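The throughput jump from roughly 5% to 20% per day is the kind of gain you get by running independent per-customer restores concurrently instead of one at a time. A minimal sketch of that idea follows; the function and customer names are hypothetical stand-ins, not Atlassian's actual recovery tooling.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical illustration: restore_customer stands in for a multi-step,
# largely I/O-bound restore (rebuild site, re-import data, verify).
def restore_customer(customer_id: str) -> str:
    return f"{customer_id}: restored"

customers = [f"customer-{i}" for i in range(775)]

# Serial approach: one customer at a time.
serial_results = [restore_customer(c) for c in customers]

# Parallel approach: several restores in flight at once. For an I/O-bound
# pipeline, four workers could roughly quadruple daily throughput, in line
# with the reported jump from ~5% to ~20% of customers per day.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(restore_customer, customers))

# Same work, same order of results, higher throughput.
assert serial_results == parallel_results
```

`ThreadPoolExecutor.map` preserves input order, so the parallel run yields the same results as the serial one while overlapping the waiting time of each restore.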
Adding Insult to Injury
This outage started out bad and only got worse due to the lack of clear and timely communication. Atlassian's service status page did indicate an outage, but both the frequency and the content of its updates were woefully inadequate.
It’s too easy for operations teams to get immersed in solving the problem at hand, especially during a high-pressure outage. However, it’s critical that the team keep the user community informed about the status and the estimated time to recovery so everyone impacted can plan the next steps as well.
This outage will undoubtedly have a negative impact on Atlassian's "Cloud-First" initiative, as customers will use this experience to question a move to its cloud. It will take time, a clean track record, and transparent communication to rebuild the user community's faith and trust.
Aragon will be publishing a Research Event Note on this outage where we’ll go into detail about what went wrong and provide some best practices that might have prevented it from happening in the first place.
We'll also include guidance for enterprises on treating this outage as a "wake-up call" to review their own disaster recovery and business continuity plans. Stay tuned.