
MTBF (mean time between failures) is a very misleading guide for companies that want maximum uptime for servers, storage and network connections.
Most companies focus on MTBF exclusively, but I think they’re wrong. The problem is not the measurement but the lessons people learn from it.
They think that if only they make everything redundant they can make everything failsafe. Not so. By making things more complex, they make the problem worse.
For example, if you use multipath networking, you get a lot of complex wiring. Sure, the aggregate MTBF may be better, but if there’s a problem it’s much harder to resolve. In other words, raising MTBF actually increases repair time.
Instead, in the cloud especially, they need to think about simplicity first and foremost. The more you keep things simple, the more you keep uptime high. If something does go wrong, you have to have procedures in place to identify problems very quickly and replace faulty parts. But the job is easier.
For example, instead of a multipath, multi-redundant network in my data centre, I can have a primary switch and if there’s a problem, I can just cut over to my standby backup switch and carry on. It takes a few minutes and then I can focus on diagnosing the original problem.
It’s the difference between trying to repair a jet engine on the ground and in the middle of a flight.
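To put rough numbers on that trade-off, here is a small sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The figures are made up purely for illustration, not measurements from any real data centre: I assume the complex multipath design fails half as often but takes a working day to untangle, while the simple design fails twice as often but is back up after a few minutes' cutover to the standby switch.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for a given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures, chosen only to illustrate the argument.
# Complex multipath design: fails less often, but takes 8 hours to diagnose and fix.
complex_design = annual_downtime_hours(mtbf_hours=20_000, mttr_hours=8)

# Simple design with a standby switch: fails twice as often,
# but service is restored after a 6-minute cutover (0.1 hours).
simple_design = annual_downtime_hours(mtbf_hours=10_000, mttr_hours=0.1)

print(f"Complex design: {complex_design:.2f} hours of downtime per year")
print(f"Simple design:  {simple_design:.2f} hours of downtime per year")
```

With these assumed inputs the simple design comes out well ahead despite its worse MTBF, because repair time dominates the result. Plug in your own numbers; the point is that MTBF on its own tells you very little.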
Of course, the big vendors prefer to make systems more complex because they’re more expensive. But sshhh! They don’t want you to know that.