3 Key Factors for Future-Proofing SaaS Cloud Platforms

As the head of platform engineering at Atlassian, I led our team of engineers in rearchitecting our cloud infrastructure from scratch. Here are the three key factors for building future-proof cloud platforms to ensure scalability and reliability.
Mar 9th, 2021 1:56pm by
Mike Tria
Mike Tria is the Head of Engineering for Platform at Atlassian. Mike oversees Atlassian's global cloud infrastructure, identity and frontend platforms, enterprise offerings, and our third-party developer ecosystem. Mike has 15+ years of experience as a software engineer and leader. As a former comedian, Mike also brings high energy and a sense of humor to the tough challenges he faces.

A company’s IT infrastructure can become a mystery even to the people who run it: very large cloud systems, a growing number of microservices and, on top of that, home offices that add dozens of new interfaces to secure. All of this means that software-as-a-service (SaaS) providers need to rethink how they deliver reliable and secure cloud infrastructure to their customers.

Companies must therefore set a new standard for themselves to meet their customers’ expectations, because only reliable systems make successful service delivery possible. That, in turn, asks even more of developers.

1. Rely on Microservices

At the heart of any SaaS cloud platform is complexity. Complexity can be managed in two ways: centralized or distributed. A centralized system is a monolith, where all the complexity lives in a single system that presents a single interface to the outside world. A distributed system is often built from microservices, where the complexity is broken down into individual services that communicate with each other. At first glance, the microservices-based system seems more fragile because it has more interfaces and thus more opportunities for errors. However, the opposite is true: attaching Service Level Objectives (SLOs) to each microservice, with alerts routed to the relevant teams, creates a “deep defense” in the system. Some companies already use SLOs, but only in the outer layers of their system. Atlassian, for example, relies on this deep defense across more than 1,400 microservices. It is these microservices, each with attached SLOs linked to alerts, that are the key to a reliable large-scale system.
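To make the idea of one SLO per microservice concrete, here is a minimal Python sketch. It assumes a request-based objective (a target fraction of "good" requests); the `SLO` class, the service and team names, and the numbers are all hypothetical and are not taken from Atlassian's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical per-service objective: a fraction of requests must be 'good'."""
    service: str
    owner_team: str  # the team that would be alerted on a breach
    target: float    # e.g. 0.9999 means 99.99% of requests must be good


def compliance(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that met the objective (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def is_breached(slo: SLO, good_requests: int, total_requests: int) -> bool:
    """True when measured compliance falls below the SLO target."""
    return compliance(good_requests, total_requests) < slo.target


# Illustrative values only.
search_slo = SLO(service="search", owner_team="search-platform", target=0.9999)
healthy = is_breached(search_slo, good_requests=999_950, total_requests=1_000_000)
degraded = is_breached(search_slo, good_requests=999_800, total_requests=1_000_000)
print(healthy, degraded)
```

Because every service carries its own objective and its own owning team, a breach deep inside the system can be detected where it happens rather than only at the outer layer.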

2. Use Automation Everywhere Possible

If just 5% of these services stopped doing their job, reliability would be gone. Automating wherever possible, however, can keep reliability close to 100%. A separate tool monitors all existing microservices, links alerts to anomalies in individual SLOs and forwards them to the responsible teams. This tool should also be coupled with an incident management system, so that any breach is reported directly and can be fixed immediately. Automation also lets a microservice system scale effectively, which ultimately frees up engineering resources for more demanding work.
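The monitoring flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any real tool: `Measurement`, `route_breaches` and the `open_incident` callback are hypothetical names, and the callback stands in for whatever incident management system is in use.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class Measurement(NamedTuple):
    """Hypothetical snapshot of one service's SLO compliance."""
    service: str
    owner_team: str
    target: float    # required fraction of good requests
    measured: float  # observed fraction of good requests


def route_breaches(
    measurements: Iterable[Measurement],
    open_incident: Callable[[str, str], None],
) -> List[str]:
    """Open an incident with the owning team for every service below its target."""
    breached = []
    for m in measurements:
        if m.measured < m.target:
            message = f"{m.service}: measured {m.measured:.4%}, target {m.target:.4%}"
            open_incident(m.owner_team, message)
            breached.append(m.service)
    return breached


# Stand-in for an incident management system: incidents grouped by team.
incidents: Dict[str, List[str]] = {}


def record_incident(team: str, message: str) -> None:
    incidents.setdefault(team, []).append(message)


breached = route_breaches(
    [
        Measurement("authorization", "identity", target=0.9999, measured=0.99995),
        Measurement("search", "search-platform", target=0.999, measured=0.9971),
    ],
    record_incident,
)
print(breached)
print(incidents)
```

Passing the incident system in as a callback keeps the check itself independent of any particular alerting product.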

3. Introduce an Error Budget

To assess the reliability of services properly, teams should introduce an internal “error budget.” Modern software companies have already been using this concept for two to three years.

A powerful benefit of error budgets is that they give teams room in which to operate. As long as the service is within its error budget, the team can operate business-as-usual. Only when the error budget is breached should the team change to emergency tactics.

As an example, say the authorization team is tasked with making its service respond within one second 99.99% of the time. The number of minutes in which this target is not met is the error budget. At 99.99%, that works out to roughly 52 minutes per year. If something goes wrong once and the team takes two minutes to react, that is within the budget and requires no further action.
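The arithmetic behind that figure is short enough to sketch in Python. The function names are hypothetical, and the calculation assumes a 365-day year and a time-based budget measured in minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def error_budget_minutes(target: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes per period in which the service may miss its objective."""
    return (1.0 - target) * period_minutes


def remaining_budget(target: float, bad_minutes: float,
                     period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Budget left after subtracting the minutes already spent out of objective."""
    return error_budget_minutes(target, period_minutes) - bad_minutes


def within_budget(target: float, bad_minutes: float) -> bool:
    """True while the team can keep operating business-as-usual."""
    return remaining_budget(target, bad_minutes) >= 0


budget = error_budget_minutes(0.9999)
left = remaining_budget(0.9999, bad_minutes=2)
print(round(budget, 2))  # 52.56 minutes per year at 99.99%
print(round(left, 2))    # 50.56 minutes left after one two-minute incident
print(within_budget(0.9999, bad_minutes=2))
```

A two-minute incident leaves most of the yearly budget intact, which is why it requires no further action; only when the remaining budget drops below zero would the team switch to emergency tactics.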

Previously at Atlassian, all errors were treated the same. With an error budget and SLOs, this is different: errors are assessed individually, and the approach scales. At Atlassian, all services receive this classification by default.

Conclusion: Reliability Is the Foundation for Successful Service Delivery

Without a customer, there is no service, and without reliable service, there is no customer. But to establish such services, most companies cannot avoid revising their internal service structure. By switching to microservices, managing them through a central platform and using an error budget, they lay the foundation for offering customers reliable and scalable services on an ongoing basis.

Feature image via Pixabay.
