The data industry borrowed a new word from society: resiliency. It introduced the possibility that software failure is a spectrum, not an absolute event. Artificial intelligence is becoming increasingly human-like: it can now withstand, or recover quickly from, difficult conditions. Picture an auto factory where the paint shop is down but the assembly line keeps working. Let’s consider resilience first; then we can move on to how software resilience is changing the way data centers work.
Let’s Talk About Software Resilience …
The user wants only one thing from their software: it must let them do their work in the way that suits them best. Service providers want their customers to have that experience, and for that they need applications that are both reliable and resilient. These are no longer buzzwords; they are the new data reality.
The Institute of Electrical and Electronics Engineers (IEEE) has a reliability society to shepherd these things along and help make them happen. It defines system resiliency as ‘the ability of a storage system to continue operating even when there has been an equipment failure, power outage or other disruption’. This is a deep topic in its own right.
Reliability, on the other hand, means perfect functionality at all times, which we know by now has become an impossible dream. Let’s contrast reliability and resilience using an email service provider as an example. If the address book function crashes, a resilient email service keeps working, although users will have to type email addresses in full.
The designer of that email software compartmentalized the address book, so if it collapses it does not cause a chain reaction throughout the system. A smart designer would also ensure the system automatically repaired and restarted the address book. This is what messages like ‘we are having problems with certain components but are working on them’ mean.
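That compartmentalize-and-restart pattern can be sketched in a few lines. This is a minimal illustration, not any real email provider’s code; the `AddressBook` and `EmailService` classes and their methods are hypothetical:

```python
class AddressBook:
    """Non-critical component that may fail independently of the core service."""
    def __init__(self):
        self.healthy = True
        self.contacts = {"alice": "alice@example.com"}

    def lookup(self, name):
        if not self.healthy:
            raise RuntimeError("address book unavailable")
        return self.contacts[name]

    def restart(self):
        # Hypothetical repair step: reload state and mark the component healthy.
        self.healthy = True


class EmailService:
    """Core send path keeps working even while the address book is down."""
    def __init__(self):
        self.address_book = AddressBook()

    def resolve(self, name_or_address):
        try:
            return self.address_book.lookup(name_or_address)
        except Exception:
            # Degrade gracefully: the failure stays inside this compartment.
            # Repair the component, and meanwhile ask the user to type
            # the full address instead of crashing the whole service.
            self.address_book.restart()
            return None  # caller must supply the full address


service = EmailService()
service.address_book.healthy = False   # simulate the address book crashing
print(service.resolve("alice"))        # degraded: None, but the service is still up
print(service.resolve("alice"))        # after auto-restart: alice@example.com
```

The key design point is that the `try/except` boundary sits around the non-critical component only, so its failure can never take the send path down with it.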
Software is therefore resilient if its mission-critical features are able to recover from failure. This shift from absolute reliability towards sufficient resiliency to keep going is changing the way data centers and their clients work. They are now far less likely to fall over completely; the new definition of ‘down’ is ‘slowing down’ temporarily. Let’s talk about what this means in practice.
Has Data Center Redundancy Morphed into Software Resilience?
It sure has, but this is not an admission of failure. The reality is we are getting smarter at what we do. Our systems no longer fall over completely because we compartmentalize them in silos. The remaining challenge is to continuously improve the reliability of each individual component.
Let’s take an air-conditioning system as an example. A modern data center could have three plants cooling it continuously. If one fails, the other two are sufficiently resilient to increase their output while the team quickly fixes the troublesome one. This situation is a learning opportunity to improve the reliability of all three plants.
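The arithmetic behind that failover is simple N+1 thinking: the surviving plants share the load as long as the total stays within their combined capacity. A minimal sketch, where the kilowatt figures are illustrative assumptions rather than real data-center specifications:

```python
def redistribute(total_load_kw, plants_online, plant_capacity_kw):
    """Split the cooling load evenly across the plants still running,
    failing loudly if the survivors cannot carry it."""
    if plants_online == 0:
        raise RuntimeError("total cooling failure")
    per_plant = total_load_kw / plants_online
    if per_plant > plant_capacity_kw:
        raise RuntimeError("surviving plants overloaded")
    return per_plant

# Three plants rated at 500 kW each normally share a 900 kW heat load.
print(redistribute(900, 3, 500))  # 300.0 kW per plant in normal operation
print(redistribute(900, 2, 500))  # 450.0 kW each: the two survivors ramp up
```

The same check also shows why capacity planning matters: with only one plant left, 900 kW exceeds the 500 kW rating and the function raises, which is exactly the point at which resilience alone no longer saves you.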
Resilient software works the same way, in principle at least. Redundant backup systems are no longer the only way to strive for 100% availability, the holy grail of software designers. Resilient systems that bounce back quickly have taken over. That matters, because in an era of cyber attacks software is never going to be foolproof.
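‘Bouncing back quickly’ usually means retrying a failed operation instead of giving up on the first error. A minimal sketch of retry with exponential backoff, assuming a hypothetical `flaky_call` that stands in for any operation prone to transient faults:

```python
import time

def with_retries(operation, attempts=3, base_delay=0.01):
    """Run an operation, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, ...

calls = {"count": 0}
def flaky_call():
    # Fails twice, then succeeds: a stand-in for a transient fault.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(with_retries(flaky_call))  # "ok", after two silent retries
```

Real systems layer circuit breakers and health checks on top of this, but the principle is the same: the failure is absorbed and retried rather than propagated to the user.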
So That Is the End of Data Centers, I’ll Be Okay on My Own?
Not quite. We have economies of scale that outperform private data storage. What’s more, we relieve business owners of the burden of data management so they can get on with making money. And the chances are we will become even more affordable in the future thanks to new, nimble software. We no longer have to build two, or more, of everything, because resilience is the new redundancy. It makes sense to outsource when you can get a less expensive service.
Resiliency Is the New Redundancy: Where to From Here?
If an online financial service has a major system fault, it directly affects their customers, their bottom line, and their sustainability. They therefore need to build resiliency into everything they do. They can then rely less on internal system redundancy, because their reliability imperative diminishes: they can tolerate non-essential subsystems falling over and only need backup for core systems.
According to the industry group Infrastructure Masons, “Buying more systems is cheaper than building a more redundant facility, assuming the software is aligned to do it. In other words, it only works if resiliency is built into the software.” This raises the question: what level of redundancy and resiliency is appropriate?
The Workload of the System is the Key
Clearly, a manufacturing plant using robotics cannot afford to rely on resiliency alone. It also needs 100% redundant backup so it can keep supplying its business-to-business customers. The same is not true of all its ancillary systems, however. The risk attached to its ordering system falling over is low, although the same cannot be said of its logistics system, which draws in raw materials around the clock. The workload of the system is the key.
Determining Appropriate Levels of Resiliency and Redundancy
“It depends on what the application is,” comments Infrastructure Masons. A machine-learning workload, for example, may tolerate interruptions with little resiliency at all, whereas public clouds serving humans need to recover quickly while backup tides users over.
Data criticality adds another layer of criteria. Mission-critical data needs military-grade availability plus the ability to self-repair; this is the direction data centers should be moving in. Cloud is only good enough for day-to-day storage, and even then it assumes a level of support that customers cannot easily verify.
In conclusion, customers can tolerate the new normal of non-essential systems sometimes slowing down. But if core systems go down completely, they will vote with their feet, or should we say with a few mouse clicks. Availability is still the big deal Tier IV data centers deliver.