Sunday, February 5, 2012

Data integrity: ZFS

Introduction

If you store a lot of data, or if you store data for a long time, disks are going to break. That, in itself, is nothing new, and the various RAID systems all address the problem of maintaining availability of the data store when disks break. The fact that RAID is not a backup is nothing new either: you can have an 8-way RAID1 mirror; if you corrupt your important file due to an application bug, you corrupt it on all 8 drives with no hope of recovery. Hence we do backups.

An issue that is properly taken into account by neither traditional RAID systems nor backup systems is data integrity: if your high-availability storage or backup system happily gives you data that differs from the data you wrote there some time before, you are still out of luck. This post deals with the data integrity problem, initially for the storage array only.

Data integrity issues caused by non-atomic writes

Any redundant storage system has more than one disk -- by definition. This is obvious, but it introduces a rather nasty little detail: although it is possible to atomically overwrite a single block on a single disk, there is no way to atomically do so on more than one disk at a time. This may seem an academic issue, but it is not, as the following example will show:

Suppose I have a 3-disk RAID5 setup, and that I am about to write to the, say, 1000th block on the RAID device. I will write the (symbolic) value "CD" there, where beforehand there was the value "AB".

On the underlying devices, the values "A", "B", and "P" were first stored, where P equals A XOR B. Also, A XOR B XOR P is zero, by virtue of the logic used. Under normal circumstances, I would now write values "C", "D", and "Q" (where Q = C XOR D) to the underlying devices. Again, C XOR D XOR Q = 0 would hold.
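To make the XOR bookkeeping concrete, here is a minimal Python sketch. The helper name xor_blocks and the one-byte "blocks" are my own illustration, not any RAID implementation's API; real blocks would be a few kilobytes, but the arithmetic is the same.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks byte by byte (hypothetical helper)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Symbolic one-byte blocks from the example above.
A, B = b"A", b"B"
P = xor_blocks(A, B)  # parity block for the old stripe

# A stripe is consistent when all three blocks XOR to zero.
stripe_ok = xor_blocks(A, B, P) == b"\x00"

# That same property lets us rebuild any single lost block from the other two.
rebuilt_A = xor_blocks(B, P)

print(stripe_ok, rebuilt_A == A)
```

The same check holds for the new stripe "C", "D", "Q", as long as all three writes actually make it to disk.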

Suppose, now, that the system crashes after writing "C". We now have the situation where the three underlying devices contain "C", "B", and "P". However, C XOR B XOR P is not zero, and C XOR B does not equal P. We have a problem now, because although we can detect that something is amiss, we have no way to detect which values are the "right" ones. Even worse, an optimized RAID implementation may choose to read from only two disks, thus returning the incorrect values "CB", or even "CE" (if it reads C and P, and constructs E such that C XOR E equals P).
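The crash scenario is easy to replay in a few lines of Python. This is only a toy model under my own assumptions: each "disk" holds a single integer, and read_pair is a hypothetical function that mimics an implementation consulting only two of the three disks.

```python
from functools import reduce
from operator import xor


def parity(*values: int) -> int:
    """XOR all values together."""
    return reduce(xor, values, 0)


A, B, C, D = (ord(ch) for ch in "ABCD")
P = parity(A, B)

disks = [A, B, P]  # consistent stripe before the write
disks[0] = C       # the first write lands, then the system crashes

stripe_consistent = parity(*disks) == 0  # the stripe no longer XORs to zero


def read_pair(d0: int, d1: int, p: int, skip: str) -> tuple:
    """Return both data values while ignoring one of the three disks."""
    if skip == "parity":
        return d0, d1
    if skip == "disk0":
        return parity(d1, p), d1  # first value reconstructed from d1 and p
    return d0, parity(d0, p)      # second value reconstructed from d0 and p


results = {skip: read_pair(*disks, skip) for skip in ("parity", "disk0", "disk1")}
for skip, pair in results.items():
    print(skip, "".join(chr(v) for v in pair))
```

Depending on which disk gets skipped, the very same stripe reads back as three different values, and nothing on disk says which one to believe.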

The situation described here is known as the RAID5 write hole. It is a serious problem in software RAID implementations, and it is often used as an argument to sell hardware RAID solutions. Good hardware RAID solutions go to great lengths to avoid the non-atomicity problem described above. The system as a whole will usually be on an uninterruptible power supply, and the RAID controller usually has a battery backup unit. The former should usually prevent the power from suddenly failing, but even if it does, the battery backup unit saves the day: it will store this "CD" value that we were about to write in battery-backed memory, and even if the power fails after writing "C" to the first underlying device, it will correctly write the "D" and "Q" values to the second and third underlying devices at power up, so that the RAID array is consistent.
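The battery-backed trick boils down to "record the intent somewhere that survives the crash, then replay it". Below is a rough Python sketch of that idea. The class JournaledStripe, its nvram field, and the crash_after parameter are all made up for illustration; this is not how any particular controller's firmware works.

```python
class JournaledStripe:
    """Toy 3-disk stripe with a non-volatile intent record (hypothetical)."""

    def __init__(self, a: int, b: int):
        self.disks = [a, b, a ^ b]
        self.nvram = None  # stands in for battery-backed controller memory

    def write(self, c: int, d: int, crash_after=None):
        self.nvram = (c, d)                    # 1. record the intent first
        for i, value in enumerate([c, d, c ^ d]):
            if crash_after is not None and i >= crash_after:
                return                         # simulated power loss
            self.disks[i] = value              # 2. update the disks one by one
        self.nvram = None                      # 3. clear the record when done

    def recover(self):
        """At power up, replay a pending write if there is one."""
        if self.nvram is not None:
            c, d = self.nvram
            self.disks = [c, d, c ^ d]
            self.nvram = None

    def consistent(self) -> bool:
        return self.disks[0] ^ self.disks[1] ^ self.disks[2] == 0


stripe = JournaledStripe(ord("A"), ord("B"))
stripe.write(ord("C"), ord("D"), crash_after=1)  # only "C" reaches a disk
broken = not stripe.consistent()
stripe.recover()
print(broken, stripe.consistent())
```

Note that the whole scheme hinges on the intent record itself surviving intact, which is exactly what the battery is for.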

That indeed sounds very nice, but even a hardware RAID solution does not save us from unrecoverable read errors (data corrupting itself on the device), or broken disk controllers that return incorrect data, as we will see:

Data integrity issues caused by incorrect data being returned from the disk or the controller

Even if our expensive hardware RAID controller made sure that all values were written to the disk, that does not mean that the data actually ended up as-is on the platter:
  • The disk controller may claim that the data was written, but in fact, it may be lying about its cache being flushed; the data may not actually be on the platter yet.
  • The disk controller may claim that the data was written, but the data may in fact have been corrupted in a bad section of disk-controller memory; the wrong value was written in that case.
  • There may have been a corruption in the RAID-controller's memory, causing it to communicate the wrong data to the underlying drives.
  • ...
All these cases cause an inconsistent situation on the underlying devices. But it gets worse. Even if the correct values were written, that does not mean that they will actually be returned:
  • Disks are so large nowadays that unrecoverable read errors are not unlikely to occur at least once on a disk array. Bit rot on the underlying device will usually be caught by CRC codes on the platter, but if it is not, the wrong data will be read off the platter.
  • Even if the disk returns the correct data to the disk controller, that does not mean that the disk controller will return the correct value to the RAID controller; the same issues (bad buffer chips, communication errors) apply on the way back.
  • ...
Oh, and don't be fooled into believing that RAID1 is any better; mathematically, it is just parity RAID / RAID5 on two devices (both devices contain "A", and A XOR A is indeed zero).
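A quick Python sketch of the RAID1 case, assuming a two-way mirror and a single silently flipped bit (the data and the helper mirror_parity are invented for the example):

```python
good = b"important data"
mirror = [bytearray(good), bytearray(good)]  # two identical copies


def mirror_parity(copies) -> bytes:
    """XOR the two copies; all zero bytes means they agree."""
    return bytes(x ^ y for x, y in zip(*copies))


in_sync_before = not any(mirror_parity(mirror))

mirror[1][0] ^= 0x01  # silent single-bit flip on the second disk

mismatch = any(mirror_parity(mirror))

# The evidence is symmetric: swapping the two copies gives the same XOR,
# so the mirror alone cannot tell which copy is the bad one.
symmetric = mirror_parity(mirror) == mirror_parity(mirror[::-1])

print(in_sync_before, mismatch, symmetric)
```

So a mirror can tell you that the copies disagree, but without some extra piece of information it cannot tell you which copy to trust.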

So, what now?

All of the above forces us to face the sad truth that we cannot trust the data that we get back from traditional RAID solutions, neither software RAID nor hardware RAID. The main problem in all of these RAID solutions is that they trust devices to return the correct data if they indicate that they read the data successfully. We now know that this is a dangerous principle. So what now?

Enter ZFS.

ZFS is a combined volume and filesystem manager that is built on the principle of not trusting any piece of hardware, except the system's RAM. The latter is still problematic, but since all of our current computer systems are von Neumann machines, we must at least use the system's memory. If we use ECC RAM, we can at least put a high trust in the RAM, though.

ZFS uses a combination of techniques to avoid the write hole and data integrity problems:
  • A strong data checksum is written for any block, all the way up to the root using a Merkle tree. This implies that if we read a block from an underlying device, we know whether it is correct or not.
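  • The whole system is based on copy-on-write semantics: live blocks are never overwritten in place, so a crash leaves either the old or the new version of the data on disk. This makes writing quite a bit slower, but you never end up in an inconsistent state.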
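To show why storing the checksum in the parent block matters, here is a small Python sketch of a two-level checksum tree on top of a mirror. It is only an illustration of the principle: the layout, the names (read_leaf, parent, root), and the choice of SHA-256 are my assumptions, not ZFS's actual on-disk format.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # 32-byte digest


# Two leaf data blocks, each stored on both halves of a mirror.
leaves = [b"block zero", b"block one"]
mirror = [list(leaves), list(leaves)]

# The parent block stores the checksum of each child; the root checksum
# covers the parent, so it transitively covers everything below it.
parent = b"".join(checksum(leaf) for leaf in leaves)
root = checksum(parent)


def read_leaf(index: int) -> bytes:
    """Return the first mirror copy whose checksum matches the parent's record."""
    if checksum(parent) != root:
        raise IOError("parent block does not match the root checksum")
    expected = parent[32 * index: 32 * (index + 1)]
    for side in mirror:
        if checksum(side[index]) == expected:
            return side[index]
    raise IOError("no copy of the block matches its checksum")


mirror[0][1] = b"block 0ne"  # silent corruption on the first disk
recovered = read_leaf(1)     # falls through to the intact second copy
print(recovered)
```

Unlike the plain mirror, the reader now has an independent piece of evidence, so it can tell which copy is wrong and return the other one.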
ZFS is available on Solaris, FreeBSD, and on Linux (both as a FUSE plugin and as a native kernel module).