
The recent #AWS and #Azure outages over the past two weeks are a good reminder of how seemingly simple problems (failure of power source or incorrect script parameter) can have a wide impact on application availability.
Look, the cloud debate is largely over and customers (commercial, government agencies, and startups) are moving the majority of their systems to the cloud. These recent outages are not going to slow that momentum down.
That said, all the talk of 3-4-5 9s of availability and financial-backed SLAs has lulled many customers into expecting a utility-grade availability for their cloud-hosted applications out of the box. This expectation is unrealistic given the complexity of the ever-growing moving parts in a connected global infrastructure, dependence on third-party applications, multi-tenancy, commodity hardware, transient faults due to a shared infrastructure, and so on.
Unfortunately, we cannot eliminate such cloud failures. So what can we do to protect our apps from failures? The answer is to conduct a systematic analysis of the different failure modes, and have a recovery action for each failure type. This is exactly the technique (FMEA) that other engineering disciplines (like civil engineering) have used to deal with failure planning. FMEA is a systematic, proactive method for evaluating a process to identify where and how it might fail and to assess the relative impact of different failures, in order to identify the parts of the process that are most in need of change. Read More…