Tagged: Availability

  • greenclouddc 6:55 pm on December 13, 2011
    Tags: Availability

    R.A.I.D. is B.A.D. 

    If you think your data is protected against bit rot because it’s on a RAID array, think again.

    Bit rot is a real problem. All storage media are at risk. Magnetic media such as disks and tapes can lose data integrity over time. Even RAM is susceptible to cosmic rays. You may not even know it has happened, because the disk keeps working even though a bit or a byte here and there has changed.

    A typical home user with a 2TB disk has an almost 100% chance of suffering some bit rot. If it just changes a pixel in a video, it’s not going to be a problem. But if it hits your database or your operating system, that failure could be much more significant.
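
    To get a feel for how a figure like “almost 100%” can arise, here is a rough sketch of the arithmetic. It is only an illustration: the error rate of one bad bit per 10^14 bits read is an assumed value, not a number from this post, and the calculation treats every bit as independent, which real drives do not guarantee.

```python
import math

# Assumed figures, not from the post: a 2 TB disk (decimal terabytes) and a
# hypothetical error rate of one bad bit per 1e14 bits read.
DISK_BYTES = 2 * 10**12
ERRORS_PER_BIT = 1e-14


def p_at_least_one_error(full_reads: int) -> float:
    """Chance of at least one bad bit after reading the whole disk `full_reads` times."""
    bits_read = DISK_BYTES * 8 * full_reads
    # 1 - (1 - p)^n, computed in a numerically stable way for tiny p and huge n.
    return -math.expm1(bits_read * math.log1p(-ERRORS_PER_BIT))


for reads in (1, 10, 30):
    print(f"{reads:>2} full reads: {p_at_least_one_error(reads):.1%}")
```

    Under these assumptions a single pass over the disk is already a noticeable risk, and the probability climbs towards certainty as the amount of data read grows.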

    Today most systems have no protection against this. The corruption goes undetected, and by the time anyone notices there is no means of recovery.

    You might think that RAID is the answer but it’s not. An Amplidata white paper, The RAID Catastrophe, explains why:

    • It doesn’t do hashing or CRC to verify the data.
    • It doesn’t track bit rot or correct it (nor do most operating systems).
    • There are no ‘previous versions’ to restore; just the current version of each file.
    • Storage volumes are increasing dramatically so a rebuild in the event of a total drive failure can be very lengthy (even a couple of days on 2TB HDD arrays).
    • Increasing the number of drives or the size of each drive increases the overall risk of failure, even if the MTBF of any individual drive seems high.
    • If one drive in an array fails, it is quite likely that a second one will also fail soon, because RAID writes to all drives in the array at the same time and so they wear at a similar rate.
    • RAID-6 double parity is a partial answer to this but it imposes a big performance overhead.
    • Forensic data recovery from RAIDs is very, very difficult, time-consuming and expensive. It’s like trying to turn an omelette back into the original eggs.
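
    The first two points in the list are about verification that RAID does not do. As a minimal sketch of what hashing buys you (this is not Amplidata’s method, and the function names are made up for illustration), a checksum manifest can at least detect that a file’s bytes have changed since it was recorded:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every regular file under `root`."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose checksum differs from the recorded one."""
    return [
        name
        for name, recorded in manifest.items()
        if not (root / name).is_file() or sha256_of(root / name) != recorded
    ]
```

    A mismatch only tells you that something changed. It cannot distinguish rot from a legitimate edit, and repairing the file still needs a known-good copy from somewhere else.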

    There is an additional problem with RAID storage. If a drive develops a problem, you replace it, and the rebuild then fails for some reason, you have no way of recovering the data on the array. At least with a single disk you can recover all the data except what was on the failed sector(s). This is another example of the IT industry focusing on MTBF rather than on complexity or recovery time.
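
    The earlier point that more drives mean more overall risk is simple probability. A rough sketch, assuming drives fail independently and using a hypothetical 3% chance of failure per drive per year (an illustrative figure, not one from the white paper):

```python
def p_any_drive_fails(drives: int, annual_failure_prob: float) -> float:
    """Chance that at least one of `drives` independent drives fails within a year."""
    return 1.0 - (1.0 - annual_failure_prob) ** drives


# Hypothetical 3% annual failure probability per drive (an assumption).
for n in (1, 4, 12, 24):
    print(f"{n:>2} drives: {p_any_drive_fails(n, 0.03):.1%}")
```

    Real arrays are probably worse than this suggests, since drives bought together and written to together are unlikely to fail independently.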

    The IT industry needs a radical rethink about storage. The old orthodoxy isn’t good enough. It may be heresy to admit it but perhaps we need to move beyond RAID and towards new, unbreakable storage technology like Amplidata’s.

     
  • greenclouddc 6:47 pm on November 22, 2011
    Tags: Availability, Unbreakable

    MTBF is the wrong thing to measure 

    MTBF (mean time between failures) is a very misleading guide for companies that want maximum uptime for servers, storage and network connections.

    Most companies focus on MTBF exclusively but I think they’re wrong. The problem is not the measurement but the lessons people learn from it.

    They think that if only they make everything redundant they can make everything failsafe. Not so. By making things more complex, they make the problem worse.

    For example, if you use multipath networking, you get a lot of complex wiring. Sure, the aggregate MTBF may be better, but if there’s a problem it’s much harder to resolve. In other words, improving MTBF this way actually increases repair time.

    Instead, in the cloud especially, they need to think about simplicity first and foremost. The more you keep things simple, the more you keep uptime high. If something does go wrong, you have to have procedures in place to identify problems very quickly and replace faulty parts. But the job is easier.

    For example, instead of a multipath, multi-redundant network in my data centre, I can have a primary switch and if there’s a problem, I can just cut over to my standby backup switch and carry on. It takes a few minutes and then I can focus on diagnosing the original problem.
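
    One way to put numbers on this trade-off is the standard steady-state availability formula, MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The sketch below uses made-up figures for both designs, purely to show how a short repair time can outweigh a better MTBF:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


HOURS_PER_YEAR = 24 * 365

# Hypothetical figures for illustration only: (MTBF in hours, MTTR in hours).
designs = {
    "complex multipath": (20_000, 12.0),  # fails less often, slow to diagnose
    "simple + standby": (10_000, 0.1),    # fails more often, minutes to cut over
}

for name, (mtbf, mttr) in designs.items():
    a = availability(mtbf, mttr)
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"{name}: {a:.5%} up, about {downtime:.2f} hours down per year")
```

    With these assumed numbers the simpler design fails twice as often yet spends far less time down, because each outage is over in minutes.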

    It’s the difference between repairing a jet engine on the ground and repairing it in the middle of a flight.

    Of course, the big vendors prefer to make systems more complex because they’re more expensive. But sshhh! They don’t want you to know that.

     