RAIDZ

tl;dr: RAIDZ is effective for large block sizes and sequential workloads.

Introduction

RAIDZ is a variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5 “write hole” (in which data and parity become inconsistent after a power loss). Data and parity are striped across all disks within a raidz group.

A raidz group can have single, double, or triple parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data. The raidz1 vdev type specifies a single-parity raidz group; the raidz2 vdev type specifies a double-parity raidz group; and the raidz3 vdev type specifies a triple-parity raidz group. The raidz vdev type is an alias for raidz1.

A raidz group of N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P devices failing without losing data. The minimum number of devices in a raidz group is one more than the number of parity disks. The recommended number is between 3 and 9 to help increase performance.
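As a quick sanity check of this formula, here is a minimal Python sketch; the function name and the 6 × 4 TB raidz2 layout are made-up values for illustration:

    # Approximate usable capacity of a raidz group: (N - P) * X,
    # where N = number of disks, P = parity level, X = size of one disk.
    def raidz_capacity(n_disks, parity, disk_size_bytes):
        # the minimum device count is one more than the number of parity disks
        assert n_disks >= parity + 1
        return (n_disks - parity) * disk_size_bytes

    # Example: 6 disks of 4 TB in raidz2 -> roughly 16 TB usable,
    # and the group survives any 2 devices failing.
    print(raidz_capacity(6, 2, 4 * 10**12))  # 16000000000000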

Space efficiency

Actual used space for a block in RAIDZ depends on several factors (a short sketch of this arithmetic follows the list):

  • the minimal write size is the disk sector size (it can be set via the ashift vdev parameter)

  • stripe width in RAIDZ is dynamic: a stripe holds at least one sector-sized part of a data block, and at most (number of disks minus number of parity devices) parts of the data block

  • one block of data of recordsize bytes is split into equal sector-sized parts, which are written across the stripes of the RAIDZ vdev

  • each data stripe holds a part of the block

  • in addition to the data, one, two, or three parity sectors are written per stripe, one per disk; so, for a raidz2 of 5 disks, a full stripe holds 3 sectors of data and 2 sectors of parity
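Below is a minimal Python sketch of this accounting. The function name raidz_stripe_layout and its exact form are illustrative assumptions; it only follows the rules listed above and ignores further allocator details.

    # Simplified model: a block of `recordsize` bytes is split into
    # sector-sized parts; each stripe holds up to (n_disks - parity)
    # data parts plus `parity` parity sectors.
    def raidz_stripe_layout(n_disks, parity, ashift, recordsize):
        sector = 1 << ashift                     # minimal write size
        data_per_stripe = n_disks - parity       # data parts per full stripe
        data_sectors = -(-recordsize // sector)  # ceil(recordsize / sector)
        stripes = -(-data_sectors // data_per_stripe)
        parity_sectors = stripes * parity        # parity sectors per stripe
        return data_sectors, parity_sectors, sector

    # Full stripe on a raidz2 of 5 disks: 3 sectors of data, 2 of parity.
    data, par, sector = raidz_stripe_layout(5, 2, 12, 3 * 4096)
    print(data, par)  # 3 2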

Given these factors, if recordsize is less than or equal to the sector size, then RAIDZ’s parity overhead will be effectively the same as that of a mirror with the same redundancy. For example, for a raidz1 of 3 disks with ashift=12 and recordsize=4K we will allocate on disk:

  • one 4K block of data

  • one 4K parity block

and the usable space ratio will be 50%, the same as with a two-way mirror.
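A quick numeric check of this example (values copied from the list above):

    sector = 4096                  # ashift=12
    data = 4096                    # one 4K block of data (recordsize=4K)
    parity = 4096                  # one 4K parity block (raidz1)
    print(data / (data + parity))  # 0.5 -> 50% usable space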

Another example, for ashift=12 and recordsize=128K on a raidz1 of 3 disks:

  • total stripe width is 3

  • one stripe can hold up to 2 data parts of 4K each, because 1 part per stripe is taken by parity

  • we will have 128K/8K = 16 stripes, each with 8K of data and 4K of parity

  • 16 stripes of 12K each means we write 192K to store 128K of data

so the usable space ratio in this case will be about 66%.
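The same kind of check for this example:

    sector = 4096                           # ashift=12
    stripes = (128 * 1024) // (2 * sector)  # 16 stripes, 2 x 4K of data each
    data = stripes * 2 * sector             # 128K of data
    parity = stripes * sector               # 64K of parity, 4K per stripe
    print(data / (data + parity))           # 0.666... -> about 66% usable space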

The more disks a RAIDZ vdev has, the wider its stripes can be and the greater the space efficiency.

You can find the actual parity cost per RAIDZ size here: (source)

Performance considerations

Write

A stripe spans all drives in the vdev, so a single-block write puts one stripe part onto each disk. In the worst case, a RAIDZ vdev has the write IOPS of its slowest disk, because the write is not complete until the stripe part on every disk has been written.
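A trivial way to express that bound (the per-disk IOPS figures are hypothetical):

    # Worst case: every block write must put a stripe part on every disk,
    # so the vdev's write IOPS is bounded by its slowest member.
    disk_write_iops = [250, 240, 180]   # hypothetical per-disk write IOPS
    vdev_write_iops = min(disk_write_iops)
    print(vdev_write_iops)              # 180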