Is RAID dead? A quick Google search suggests it might be, at least for Big Storage. Why? Lots of reasons: tightly coupled RAID systems, the bit error rates of dense drives, and the length of the repair/rebuild process. Basically, while RAID was a good tool for yesterday, today’s data sizes require new data protection methods.
Which brings us to erasure codes. Most implementations use Reed-Solomon error correction, which recalculates ‘erased’ data from the data that remains. The details are highly mathematical and make good bedtime reading, but they’re beyond the scope of this post. What you need to know, however, is that erasure codes require CPU horsepower to crunch those equations on the front end (when storing data) and again on the back end to get your data back. More on that in a minute.
In general, erasure codes make sense in the world of dense drives. Here’s why: All you do is take a stripe of data, add some erasure codes, and send it out to a bunch of separate disk drives. We can tune up the erasure bits to protect from 1 to 8 independent failures and still recover the data and be more efficient than 3x mirroring. Protect the data, not the drives.
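The efficiency argument is easy to make concrete with a little arithmetic. A minimal sketch (the N+M pairs below are just the illustrative layouts mentioned in this post):

```python
# Storage efficiency of N+M erasure coding vs. 3x mirroring.
# N data blocks + M erasure blocks tolerate up to M failures;
# 3x mirroring stores three full copies to tolerate two failures.
def ec_efficiency(n: int, m: int) -> float:
    """Fraction of raw capacity that holds user data."""
    return n / (n + m)

for n, m in [(10, 4), (16, 4), (20, 8)]:
    print(f"{n}+{m}: {ec_efficiency(n, m):.0%} usable, tolerates {m} failures")

print(f"3x mirror: {1/3:.0%} usable, tolerates 2 failures")
```

Even the heavily protected 20+8 layout keeps roughly 71% of raw capacity usable, versus 33% for triple mirroring.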
Realizing density was going to be a problem, vendors like Isilon, Cleversafe, Amplidata, Atmos, and Quantum StorNext were early implementers of erasure codes. The N+M nomenclature became part of our vocabulary: they proclaimed that 10+4, 16+4, or 20+8 gives you the disk efficiency plus all the protection needed for bulk data systems, without having to 3x mirror your data.
Okay, but what actually happens to application performance in the real world when a drive in the system fails? What about two drives? How about three? Because of the CPU horsepower needed to crunch the equations and the I/O needed to get the data back (I told you we would return to that), application performance can really bog down during data recovery. In fact, you may even have to throttle the rebuild process until late at night so as not to disturb your users.
Let’s look at this problem in more detail. Most of these systems implement erasure codes with a layout like this: 16 data blocks and 4 erasure code blocks.
We would call this a wide-stripe erasure encoding. If you lose a disk, you end up reading data from all the non-failed data disks (plus an erasure code disk) to be able to reconstruct the lost disk’s data.
That is significant I/O expansion for just a single disk failure: the system is now pulling 16 drives’ worth of I/O to fix one failure, not to mention the high CPU load of all the calculations needed to regenerate the missing data. This is a problem, since most bulk-capacity systems try to minimize cost with relatively slow CPU complexes.
There is an interesting read on Facebook’s warm BLOB storage architecture, which uses low-power CPUs to store data but offloads reconstruction to a small, high-power CPU complex. This rack/row-scale architecture segments data access from data recovery, and it gets at the crux of the problem: impact to the application is minimized through dedicated infrastructure. Enterprise storage arrays don’t have this flexibility, so they default to one CPU complex for all tasks. First-generation systems that use the 16+4 erasure codes above could be reading from 64 drives for 4 disk failures… ouch.
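The arithmetic behind that “ouch” can be sketched quickly. This assumes, as a rough worst case, that each failed drive’s rebuild reads N surviving drives and that concurrent rebuilds share no reads:

```python
# Wide-stripe N+M rebuild cost: Reed-Solomon needs any N surviving
# blocks per stripe, so rebuilding one failed drive reads ~N drives.
# Worst case assumed: rebuilds share no reads, so costs simply add.
def rebuild_reads(n: int, failed: int) -> int:
    return n * failed

print(rebuild_reads(16, 1))   # one failure in a 16+4 layout -> 16 drive reads
print(rebuild_reads(16, 4))   # four failures -> 64 drive reads
```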
This problem got us thinking about a way to reduce the disk I/O explosion and CPU requirements for failures in an erasure-code-protected storage system. That thinking led Igneous to invent DEEPR (Distributed Erasure Encoding with Prioritized Repair). In this overview we are only talking about the DEE part; PR is another conversation – sneak peek here.
DEEPR instead distributes the EC codes into smaller, localized groups. For example, rather than 16 data drives followed by 4 EC drives, you might have 4 groups of 5 drives, each with a single local EC drive, plus 4 global EC drives, for reasons we’ll explain in a minute.
So, you end up with a few more drives. Why? What’s the benefit?
Let’s take a look at DEEPR in practice. Here is an example of a 4 x (4+1) + 4 configuration (4 groups of 4 data drives plus 1 local EC drive, plus 4 global EC drives at the end).
The payoff occurs when you lose a drive:
The system only has to read from 4 drives to reconstruct the data: less I/O to rebuild the missing data, less CPU (the local repair block needs only an XOR calculation), and less impact on the system. This is called localized repair, and it’s a 4x reduction in the I/O needed to reconstruct the data.
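A minimal sketch of that localized XOR repair (the byte values are made up for illustration):

```python
# Localized repair in a (4+1) group: the local EC block is the XOR of
# the 4 data blocks, so a single lost block is rebuilt by XOR-ing the
# 4 surviving blocks in that group -- no wide-stripe read required.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

group = [b"\x10\x20", b"\x03\x04", b"\x55\xaa", b"\x0f\xf0"]  # 4 data blocks
parity = xor_blocks(group)                                    # local EC block

lost = group[2]                                               # drive 2 fails
survivors = [group[0], group[1], group[3], parity]            # only 4 reads
assert xor_blocks(survivors) == lost                          # data recovered
```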
What happens if you have two or more drives fail? In our example, once you have lost a single drive in group one, you have 15 data drives left that could fail, 12 of which are in other local groups. So the odds are high that the second failure lands in a different local group (4 to 1 odds, in fact), which still gives a 2x reduction in the I/O needed to recover the data. In that case you again recover using just 4 drives (the three remaining good data drives and the local EC drive):
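The odds quoted above can be checked with a two-line count, considering only the remaining data drives as in the example:

```python
# After one failure in a 4 x (4+1) layout: where does the next failure land?
remaining = 15                       # healthy data drives left
same_group = 3                       # survivors in the already-degraded group
other_groups = remaining - same_group

print(f"same group:  {same_group}/{remaining} = {same_group/remaining:.0%}")
print(f"other group: {other_groups}/{remaining} = {other_groups/remaining:.0%}"
      " (4-to-1 odds)")
```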
BUT … what happens if a second failure occurs in the same local group:
This is a problem because now you don’t have enough information in the local group to recover. This is where the Global Erasure Codes come into play. You use those 4 EC codes at the end to help recover the localized failure.
In all fairness, we didn’t invent local repair; many of the large-scale cloud vendors, including Google, Microsoft Azure, and Facebook (with Xorbas), have come to the same conclusions about I/O reduction, faster repair, and efficient storage infrastructure. What we did was add an extra erasure code with zero space overhead (Patent: 9116833): we created an algorithm that XORs all the local erasure codes to generate another, virtual global erasure code.
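The zero-overhead trick rests on a simple identity: because each local parity is the XOR of its group’s data, XOR-ing the local parities together yields a parity over all the data blocks, one more recovery equation stored nowhere. A sketch with small integers standing in for blocks (the patented algorithm itself surely involves more than this identity):

```python
# XOR of local parities == XOR of all data blocks, since XOR is
# associative and commutative. That sum acts as a "virtual" global
# erasure code that costs no extra storage.
import functools
import operator

def xor_all(blocks):
    return functools.reduce(operator.xor, blocks)

groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
local_parities = [xor_all(g) for g in groups]          # one per group

virtual_global = xor_all(local_parities)               # derived, not stored
flat_parity = xor_all([b for g in groups for b in g])  # parity over all data
assert virtual_global == flat_parity                   # identical
```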
Voila – more protection with less overhead.
So why should you care about your erasure codes? All along, the industry wanted the efficiency that erasure codes bring to bulk storage, but it didn’t fully understand the I/O ramifications of wide-stripe erasure codes – and we now live in a world where “failure is the new normal.” DEEPR is the strategy for the new generation of bulk storage systems: a best-of-both-worlds solution that minimizes I/O expansion AND stores data efficiently.
Tell us about your experience with data protection and rebuilding drives in this short survey: