‘Hey, someone, somewhere mentioned AWS S3 had an outage?’ On the last day of February, a member of the Amazon Simple Storage Service team accidentally executed a command that knocked out several important servers, which led to the S3 outage. We would have missed this event if it hadn’t been for the mass hysteria in the media!
It isn’t the first service issue that AWS has had, and AWS is not alone. All Cloud service providers experience outages, whether that’s AWS, Google, Azure, Force.com or any other organisation in the sector. It is a fact that if something can fail then at some point it will. With the recent S3 issue there were two types of AWS customers: those who had their own service outages as a result and those who didn’t. (If you’re interested in reading more about the February/March problem, head over to AWS’s statement on the outage.)
The difference between these two types of AWS customer?
Simply put, these two very different outcomes come down to each organisation’s application architecture. Those AWS customers that designed single points of failure into their services were impacted and, furthermore, they knew they would be affected, as AWS has never claimed 100 per cent service availability and officially advises customers to architect accordingly. When you build Cloud solutions you have to accept shared responsibility for risk.
In the February 2017 S3 incident, the issue was with a single AWS region. Those customers who were impacted had obviously made the business decision that the risk of an outage at AWS was acceptable when compared to the cost of architecting their application to work across regions or across Cloud vendors, or of using an alternative such as Google App Engine.
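To make the cross-region option concrete, here is a minimal sketch of a read path that falls back to a second region when the first one fails. It is an illustration only, not AWS's recommended design: the region labels, the `read_with_failover` helper and the two stand-in reader functions are all hypothetical, and in a real service each reader would wrap a storage client pointed at a replicated copy of the data.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when none of the configured regions could serve the read."""


def read_with_failover(
    key: str,
    readers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (region, reader) pair in order and return the first success.

    Hypothetical helper: each reader is any callable that fetches `key`
    from one region and raises an exception if that region is unavailable.
    """
    errors: List[str] = []
    for region, read in readers:
        try:
            return region, read(key)
        except Exception as exc:  # real code should catch only the client's error types
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Stand-in readers (assumptions for the demo, not real storage clients).
def primary_down(key: str) -> bytes:
    raise ConnectionError("service unavailable")


def secondary_ok(key: str) -> bytes:
    return b"contents of " + key.encode()


region, body = read_with_failover(
    "report.csv",
    [("primary-region", primary_down), ("secondary-region", secondary_ok)],
)
print(region, body)
```

The sketch only covers reads. Keeping a second copy of the data in step with the first, and deciding where writes go during an outage, is where most of the cost mentioned above tends to sit.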
The knock-on effect on SaaS Vendors
The AWS incident wasn’t restricted to just those AWS customers that had a single point of failure either. A number of SaaS vendors’ platforms had varying degrees of issues, the knock-on effect being that their customers were also dragged into the incident.
For these SaaS vendor customers, there is no easy way of knowing whether their chosen SaaS vendor will have these issues due to Cloud outages. However, the devil is in the detail, and a delve into a SaaS vendor’s terms of service will reveal that they, just like their Cloud provider, won’t offer a 100 per cent service guarantee. Customers of SaaS vendors therefore also have to assume that they will suffer from unplanned outages and plan accordingly. In this particularly large S3 incident, problems were compounded for some SaaS customers as they were relying on those solutions to deal with the initial AWS issue.
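A little arithmetic shows why a dependency chain matters. The sketch below multiplies the availability of each link in a chain, which holds only under the simplifying assumption that the links fail independently. The 99.9 per cent figures are illustrative placeholders, not numbers taken from any vendor’s terms of service, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of one another.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative figures only (assumptions, not published SLAs).
cloud_provider = 0.999  # the platform the SaaS vendor runs on
saas_vendor = 0.999     # the vendor's own software and operations

combined = serial_availability(cloud_provider, saas_vendor)
print(f"combined availability: {combined:.6f}")
print(f"expected downtime: {expected_downtime_hours(combined):.1f} hours/year")
```

With these placeholder figures the chain comes out at roughly 99.8 per cent, or about 17.5 hours of expected downtime a year, against about 8.8 hours for either link on its own. Every extra dependency that must be up lowers the figure a customer can realistically plan around.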
Lessons to learn from the AWS S3 outage incident
Outages like the one experienced by AWS should not be seen as a reason to dismiss the Cloud, as the positives generally outweigh the negatives, but they do mean that careful and considered appreciation is needed when utilising it.
- Make sure you understand the implications of your design decisions and have assessed the business risks of those decisions.
- Make sure you understand the concept of shared responsibility and that you architect your services in line with your risk decisions.
- Make sure you understand the implications of choosing a specific Cloud vendor.
The Cloud is your friend but utilise it correctly
If you are planning to build new services on the Cloud or migrate existing ones to it, then take some time to understand and assess the risk. It is vital to have an appreciation of how your services will be deployed on the Cloud and what that means for your organisation, then architect your solution appropriately.
If you already have your services deployed on the Cloud, then make sure you have completed the risk assessment and that your current solution meets your needs. If that hasn’t happened, then make it a top priority. If your organisation has changed significantly since you deployed services to the Cloud, then re-assess the risks to your organisation and make plans accordingly.
If you have completed your assessment and found that your current solution doesn’t meet your needs, then as a matter of urgency you need to look at what can be done to improve your deployment. This may mean that you need to re-architect and refactor elements of your service. Use this as an opportunity to make sure that your services are fully Cloud native.
In addition to thinking about service availability, look at how you are using the elastic nature of Cloud computing. If one aspect of your application architecture is compromised, there may well be other aspects that can be improved. A sub-optimal Cloud architecture will result in poor service availability and higher than necessary operating costs.