System Configuration

  1. Optional: wipe solid-state drives with the generic tool blkdiscard, to clean previous partition tables and improve performance.

    All content will be irrevocably destroyed:

    for i in ${DISK}; do
    blkdiscard -f $i &
    done
    wait
    

    This is a quick operation and should be completed under one minute.

    For other device specific methods, see Memory cell clearing

  2. Partition the disks. See Overview for details:

    for i in ${DISK}; do
    sgdisk --zap-all $i
    sgdisk -n1:1M:+${INST_PARTSIZE_ESP}G -t1:EF00 $i
    sgdisk -n2:0:+${INST_PARTSIZE_BPOOL}G -t2:BE00 $i
    if [ "${INST_PARTSIZE_SWAP}" != "" ]; then
        sgdisk -n4:0:+${INST_PARTSIZE_SWAP}G -t4:8200 $i
    fi
    if [ "${INST_PARTSIZE_RPOOL}" = "" ]; then
        sgdisk -n3:0:0   -t3:BF00 $i
    else
        sgdisk -n3:0:+${INST_PARTSIZE_RPOOL}G -t3:BF00 $i
    fi
    sgdisk -a1 -n5:24K:+1000K -t5:EF02 $i
    done
    
  3. Create boot pool:

    zpool create \
    -d -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
        -o ashift=12 \
        -o autotrim=on \
        -O acltype=posixacl \
        -O canmount=off \
        -O compression=lz4 \
        -O devices=off \
        -O normalization=formD \
        -O relatime=on \
        -O xattr=sa \
        -O mountpoint=/boot \
        -R /mnt \
        bpool_$INST_UUID \
        $INST_VDEV \
        $(for i in ${DISK}; do
           printf "$i-part2 ";
          done)
    

    You should not need to customize any of the options for the boot pool.

    GRUB does not support all of the zpool features. See spa_feature_names in grub-core/fs/zfs/zfs.c. This step creates a separate boot pool for /boot with the features limited to only those that GRUB supports, allowing the root pool to use any/all features.

    Features enabled with -o compatibility=grub2 can be seen here.

  4. Create root pool:

    zpool create \
        -o ashift=12 \
        -o autotrim=on \
        -R /mnt \
        -O acltype=posixacl \
        -O canmount=off \
        -O compression=zstd \
        -O dnodesize=auto \
        -O normalization=formD \
        -O relatime=on \
        -O xattr=sa \
        -O mountpoint=/ \
        rpool_$INST_UUID \
        $INST_VDEV \
       $(for i in ${DISK}; do
          printf "$i-part3 ";
         done)
    

    Notes:

    • The use of ashift=12 is recommended here because many drives today have 4 KiB (or larger) physical sectors, even though they present 512 B logical sectors. Also, a future replacement drive may have 4 KiB physical sectors (in which case ashift=12 is desirable) or 4 KiB logical sectors (in which case ashift=12 is required).

    • Setting -O acltype=posixacl enables POSIX ACLs globally. If you do not want this, remove that option, but later add -o acltype=posixacl (note: lowercase “o”) to the zfs create for /var/log, as journald requires ACLs

    • Setting normalization=formD eliminates some corner cases relating to UTF-8 filename normalization. It also implies utf8only=on, which means that only UTF-8 filenames are allowed. If you care to support non-UTF-8 filenames, do not use this option. For a discussion of why requiring UTF-8 filenames may be a bad idea, see The problems with enforced UTF-8 only filenames.

    • recordsize is unset (leaving it at the default of 128 KiB). If you want to tune it (e.g. -o recordsize=1M), see these various blog posts.

    • Setting relatime=on is a middle ground between classic POSIX atime behavior (with its significant performance impact) and atime=off (which provides the best performance by completely disabling atime updates). Since Linux 2.6.30, relatime has been the default for other filesystems. See RedHat’s documentation for further information.

    • Setting xattr=sa vastly improves the performance of extended attributes. Inside ZFS, extended attributes are used to implement POSIX ACLs. Extended attributes can also be used by user-space applications. They are used by some desktop GUI applications. They can be used by Samba to store Windows ACLs and DOS attributes; they are required for a Samba Active Directory domain controller. Note that xattr=sa is Linux-specific. If you move your xattr=sa pool to another OpenZFS implementation besides ZFS-on-Linux, extended attributes will not be readable (though your data will be). If portability of extended attributes is important to you, omit the -O xattr=sa above. Even if you do not want xattr=sa for the whole pool, it is probably fine to use it for /var/log.

    • Make sure to include the -part3 portion of the drive path. If you forget that, you are specifying the whole disk, which ZFS will then re-partition, and you will lose the bootloader partition(s).

  5. This section implements dataset layout as described in overview.

    Create root system container:

    • Unencrypted:

      zfs create \
       -o canmount=off \
       -o mountpoint=none \
       rpool_$INST_UUID/$INST_ID
      
    • Encrypted:

      Pick a strong password. Once compromised, changing password will not keep your data safe. See zfs-change-key(8) for more info:

      zfs create \
       -o canmount=off \
       -o mountpoint=none \
       -o encryption=aes-256-gcm \
       -o keylocation=prompt \
       -o keyformat=passphrase \
       rpool_$INST_UUID/$INST_ID
      

    Create other system datasets:

    zfs create -o canmount=off -o mountpoint=none bpool_$INST_UUID/$INST_ID
    zfs create -o canmount=off -o mountpoint=none bpool_$INST_UUID/$INST_ID/BOOT
    zfs create -o canmount=off -o mountpoint=none rpool_$INST_UUID/$INST_ID/ROOT
    zfs create -o canmount=off -o mountpoint=none rpool_$INST_UUID/$INST_ID/DATA
    zfs create -o mountpoint=/boot -o canmount=noauto bpool_$INST_UUID/$INST_ID/BOOT/default
    zfs create -o mountpoint=/ -o canmount=off    rpool_$INST_UUID/$INST_ID/DATA/default
    zfs create -o mountpoint=/ -o canmount=off    rpool_$INST_UUID/$INST_ID/DATA/local
    zfs create -o mountpoint=/ -o canmount=noauto rpool_$INST_UUID/$INST_ID/ROOT/default
    zfs mount rpool_$INST_UUID/$INST_ID/ROOT/default
    zfs mount bpool_$INST_UUID/$INST_ID/BOOT/default
    for i in {usr,var,var/lib};
    do
        zfs create -o canmount=off rpool_$INST_UUID/$INST_ID/DATA/default/$i
    done
    for i in {home,root,srv,usr/local,var/log,var/spool};
    do
        zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/$i
    done
    chmod 750 /mnt/root
    for i in {nix,}; do
        zfs create -o canmount=on -o mountpoint=/$i rpool_$INST_UUID/$INST_ID/DATA/local/$i
    done
    

    Datasets for immutable root filesystem:

    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/state
    for i in {/etc/nixos,/etc/cryptkey.d}; do
      mkdir -p /mnt/state/$i /mnt/$i
      mount -o bind /mnt/state/$i /mnt/$i
    done
    zfs create -o mountpoint=/ -o canmount=noauto rpool_$INST_UUID/$INST_ID/ROOT/empty
    zfs snapshot rpool_$INST_UUID/$INST_ID/ROOT/empty@start
    
  6. Format and mount ESP:

    for i in ${DISK}; do
     mkfs.vfat -n EFI ${i}-part1
     mkdir -p /mnt/boot/efis/${i##*/}-part1
     mount -t vfat ${i}-part1 /mnt/boot/efis/${i##*/}-part1
    done
    
  7. Create optional user data datasets to omit data from rollback:

    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/games
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/www
    # for GNOME
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/AccountsService
    # for Docker
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/docker
    # for NFS
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/nfs
    # for LXC
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/lxc
    # for LibVirt
    zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/libvirt
    ##other application
    # zfs create -o canmount=on rpool_$INST_UUID/$INST_ID/DATA/default/var/lib/$name
    

    Add other datasets when needed, such as PostgreSQL.

  8. Generate initial NixOS system configuration:

    nixos-generate-config --root /mnt
    

    This command will generate two files, configuration.nix and hardware-configuration-zfs.nix, which will be the starting point of configuring the system.

  9. Edit config file to import ZFS options:

    sed -i "s|./hardware-configuration.nix|./hardware-configuration-zfs.nix ./${INST_CONFIG_FILE}|g" /mnt/etc/nixos/configuration.nix
    # backup, prevent being overwritten by nixos-generate-config
    mv /mnt/etc/nixos/hardware-configuration.nix /mnt/etc/nixos/hardware-configuration-zfs.nix
    
  10. ZFS options:

    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
    { config, pkgs, ... }:
    
    { boot.supportedFilesystems = [ "zfs" ];
      networking.hostId = "$(head -c 8 /etc/machine-id)";
      boot.zfs.devNodes = "${INST_PRIMARY_DISK%/*}";
    EOF
    

    ZFS datasets should be mounted with -o zfsutil option:

    sed -i 's|fsType = "zfs";|fsType = "zfs"; options = [ "zfsutil" ];|g' \
    /mnt/etc/nixos/hardware-configuration-zfs.nix
    

    Allow EFI system partition mounting to fail at boot:

    sed -i 's|fsType = "vfat";|fsType = "vfat"; options = [ "x-systemd.idle-timeout=1min" "x-systemd.automount" "noauto" ];|g' \
    /mnt/etc/nixos/hardware-configuration-zfs.nix
    

    Disable cache:

    mkdir -p /mnt/state/etc/zfs/
    rm -f /mnt/state/etc/zfs/zpool.cache
    touch /mnt/state/etc/zfs/zpool.cache
    chmod a-w /mnt/state/etc/zfs/zpool.cache
    chattr +i /mnt/state/etc/zfs/zpool.cache
    
  11. If swap is enabled:

    if [ "${INST_PARTSIZE_SWAP}" != "" ]; then
    sed -i '/swapDevices/d' /mnt/etc/nixos/hardware-configuration-zfs.nix
    
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
      swapDevices = [
    EOF
    for i in $DISK; do
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
        { device = "$i-part4"; randomEncryption.enable = true; }
    EOF
    done
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
      ];
    EOF
    fi
    
  12. For immutable root file system, save machine-id and other files:

    mkdir -p /mnt/state/etc/{ssh,zfs}
    systemd-machine-id-setup --print > /mnt/state/etc/machine-id
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
      systemd.services.zfs-mount.enable = false;
      environment.etc."machine-id".source = "/state/etc/machine-id";
      environment.etc."zfs/zpool.cache".source
        = "/state/etc/zfs/zpool.cache";
    EOF
    
  13. Configure GRUB boot loader for both legacy boot and UEFI:

    sed -i '/boot.loader/d' /mnt/etc/nixos/configuration.nix
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
      boot.loader = {
        generationsDir.copyKernels = true;
        ##for problematic UEFI firmware
        grub.efiInstallAsRemovable = true;
        efi.canTouchEfiVariables = false;
        ##if UEFI firmware can detect entries
        #efi.canTouchEfiVariables = true;
        efi.efiSysMountPoint = "/boot/efis/${INST_PRIMARY_DISK##*/}-part1";
        grub.enable = true;
        grub.version = 2;
        grub.copyKernels = true;
        grub.efiSupport = true;
        grub.zfsSupport = true;
        # for systemd-autofs
        grub.extraPrepareConfig = ''
          mkdir -p /boot/efis
          for i in  /boot/efis/*; do mount \$i ; done
        '';
        grub.devices = [
    EOF
    for i in $DISK; do
      printf "      \"$i\"\n" >>/mnt/etc/nixos/${INST_CONFIG_FILE}
    done
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
        ];
        grub.mirroredBoots = [
    EOF
    for i in $DISK; do
      printf "      { devices = [ \"$i\" ] ; efiSysMountPoint = \"/boot/efis/${i##*/}-part1\"; path = \"/boot\"; }\n" \
      >>/mnt/etc/nixos/${INST_CONFIG_FILE}
    done
    tee -a /mnt/etc/nixos/${INST_CONFIG_FILE} <<EOF
        ];
      };
    EOF