Options for ZFS on RAID
Comparison
| | ZFS with RAIDZ | ZFS on HW RAID | Commercial RAID iSCSI solution |
|---|---|---|---|
| Levels | Mirror/1, RAIDZ1/5, RAIDZ2/6, RAIDZ3, dRAID | Mirror/1, RAID5, RAID6 | Mirror/1, RAID5, RAID6 |
| Data checksums in FS | Yes | Yes | probably not |
ZFS specific terms
- Resilvering: the ZFS term for a RAID rebuild or drive initialization.
- RAIDZx: x is the number of drives that can fail, up to three at the moment.
- dRAID: RAID stripes are distributed across all drives, including the hot spares, which leads to faster repair after a disk swap.
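As a rough illustration (the pool name "tank" and the disk paths are placeholders, not part of this setup), these levels map onto zpool vdev types like this:
# Mirror / "RAID1" and RAIDZ2 / "RAID6" are alternatives -- pick one per pool
sudo zpool create tank mirror /dev/sdb /dev/sdc
sudo zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde   # survives two failed drives
# dRAID pools use the "draid" vdev type instead; see zpoolconcepts(7) for its layout syntax
sudo zpool status tank   # shows the resulting layout and any resilver progress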
Why ZFS should not run on top of hardware RAID
OpenZFS did a writeup on this.
- Hardware RAID hurts ZFS's ability to self-heal small errors: there are not enough "copies" of the data to determine which are correct and which need to be replaced (illustrated below).
- The sector size of RAID devices cannot be determined with certainty, so atomic writes are virtually impossible (a battery-backed cache on the HW RAID side is the usual mitigation).
- HW RAID is a prime example of vendor lock-in, whereas a ZFS pool can be read on any recent computer system with enough storage interfaces.
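To make the first point concrete: when ZFS owns the individual disks, a scrub both detects and repairs silent corruption from redundancy, and the damage is reported per device (again, "tank" is only a placeholder):
sudo zpool scrub tank        # read every block and verify it against its checksum
sudo zpool status -v tank    # the READ/WRITE/CKSUM counters show what was found and repaired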
ZFS active/passive failover with Pacemaker
On Fedora, the resource-agents package contains an agent for use with ZFS. The same version is available on CentOS Stream 10, but there it does not contain the ZFS agent. The Fedora 42 package can be rebuilt and works on Rocky Linux 9.
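A minimal sketch of such a rebuild on a Rocky Linux 9 build host (the exact SRPM file name depends on the current Fedora 42 build and is not fixed here):
sudo dnf install -y rpm-build 'dnf-command(builddep)'
# download the current resource-agents-*.fc42.src.rpm from Fedora, then:
sudo dnf builddep -y resource-agents-*.fc42.src.rpm
rpmbuild --rebuild resource-agents-*.fc42.src.rpm
# the rebuilt packages end up under ~/rpmbuild/RPMS/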
There is an official repository with ZFS packages built for EL9 (and EL8). The "stable" option is the ZFS 2.1.x series, which is still "supported" at 2.1.16 as of Dec 2024. TrueNAS, however, is on the 2.2.x series, which sits in the testing branch at 2.2.6 as of Dec 2024. There is also an option to use a prebuilt kernel module with EL9, but it is not signed for use with EFI Secure Boot. The DKMS build path signs its artifacts automatically, but of course pulls in a lot of dependencies.
To make use of the snapshot and send/receive features for backup, another piece of software is needed. There are some options, but I will use the part of TrueNAS that is available separately as zettarepl and can be rebuilt for EL9. Note, however, that its dependencies don't translate correctly, which is why they are installed by hand below.
Install instructions:
sudo dnf install -y https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
sudo dnf config-manager --enable zfs-testing
sudo dnf install -y kernel-devel zfs
sudo mokutil --import /var/lib/dkms/mok.pub
# choose a temporary password
sudo reboot; exit
In the MOK manager that appears during the reboot, enroll the key using the password you just set.
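After the reboot it is worth checking, for example like this, that the key really got enrolled and that the DKMS-built module carries a signature:
mokutil --test-key /var/lib/dkms/mok.pub   # should report that the key is already enrolled
modinfo zfs | grep -Ei 'signer|sig_key'    # signature fields appear once the module is signed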
sudo modprobe zfs # this will not work on an EFI Secure Boot system without the MOK enrollment
# Disable services meant for automatic import on startup -> we do this with pcs
sudo systemctl disable --now zfs-share.service zfs-import-cache.service zfs-mount.service
sudo pcs resource create shared-ZFS-1 ocf:heartbeat:ZFS pool=shared-ZFS-1 op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="5s"
sudo pcs resource create zfs-scrub-monthly-shared-ZFS-1 systemd\:zfs-scrub-monthly@shared-ZFS-1.timer
sudo pcs resource create shared-ZFS-2 ocf:heartbeat:ZFS pool=shared-ZFS-2 op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="5s"
sudo pcs resource create zfs-scrub-monthly-shared-ZFS-2 systemd\:zfs-scrub-monthly@shared-ZFS-2.timer
sudo pcs resource group add nfsgroup shared-ZFS-1 --before clustered-nfs
sudo pcs resource group add nfsgroup zfs-scrub-monthly-shared-ZFS-1 --after shared-ZFS-1
sudo pcs resource group add nfsgroup shared-ZFS-2 --before clustered-nfs
sudo pcs resource group add nfsgroup zfs-scrub-monthly-shared-ZFS-2 --after shared-ZFS-2
sudo dnf install python3-coloredlogs python3-jsonschema python3-isodate python3-croniter python3-paramiko
sudo dnf install python3-zettarepl-24.10.1-2.noarch.rpm
sudo nano /etc/systemd/system/zettarepl.service
[Unit]
Description=zettarepl TrueNAS snapshot and replication tool

[Service]
Environment=PYTHONPATH=/usr/lib/python3/dist-packages/
ExecStart=/usr/bin/zettarepl run /shared-ZFS-1/config/zettarepl.yaml
Note: there is no [Install] section because the service will be launched by Pacemaker.
sudo mkdir /shared-ZFS-1/config
sudo nano /shared-ZFS-1/config/zettarepl.yaml
periodic-snapshot-tasks:
  # Each task in zettarepl must have a unique id so that it can be referenced
  shared-ZFS-1-qh:
    # Dataset to make snapshots of
    dataset: shared-ZFS-1
    # You must explicitly specify if you want recursive or non-recursive
    # snapshots
    recursive: true
    # You can exclude certain datasets from recursive snapshots.
    # Please note that you won't be able to use such snapshots with recursive
    # functions of ZFS (e.g. zfs rollback -r) as it would fail with
    # "data/src/excluded@snapshot: snapshot does not exist"
    # They are still consistent with each other, i.e. they are not created
    # independently but in one transaction.
    #exclude:
    #  - data/src/excluded
    #  - data/*/excluded
    # You can specify a lifetime for snapshots so they get automatically
    # deleted after a certain amount of time.
    # Lifetime is specified in ISO8601 duration format:
    # "P365D" means "365 days", "PT12H" means "12 hours" and "P30DT12H" means
    # "30 days and 12 hours".
    # When this is not specified, snapshots are not deleted automatically.
    lifetime: PT1H
    # If set to false, snapshots without any changes in them will be deleted
    # and not replicated.
    #allow-empty: false
    # This is a very important parameter that defines how your snapshots are
    # named depending on their creation date.
    # zettarepl does not read the snapshot creation date from metadata (this
    # can be very slow for a reasonably big amount of snapshots); instead it
    # relies solely on snapshot names to parse their creation date.
    # Due to this optimization, the naming schema must contain all of "%Y",
    # "%m", "%d", "%H" and "%M" format strings to allow unambiguous parsing
    # of the string to a date and time accurate to the minute.
    # Do not create two periodic snapshot tasks for the same dataset with
    # naming schemas that can be mixed up, e.g. "snap-%Y-%m-%d-%H-%M" and
    # "snap-%Y-%d-%m-%H-%M". zettarepl won't be able to check for this at an
    # early stage and will get confused.
    naming-schema: qh-%Y-%m-%d-%H-%M
    # Crontab-like schedule for when this snapshot task runs
    # (the default schedule is * * * * *, i.e. every minute)
    schedule:
      minute: "*/15" # Every 15 minutes

  shared-ZFS-1-hour:
    dataset: shared-ZFS-1
    recursive: true
    #exclude:
    #  - shared-ZFS-1/xyz
    lifetime: P1D
    #allow-empty: false
    naming-schema: hour-%Y-%m-%d-%H-%M
    schedule:
      minute: "0"
      hour: "*"

replication-tasks:
  shared-ZFS-1-shared-ZFS-2:
    # Either push or pull
    direction: push
    # The transport option defines the remote host to send/receive snapshots.
    # You can also just specify "local" to send/receive snapshots on localhost.
    transport:
      type: local
    # Source dataset
    source-dataset: shared-ZFS-1
    # Target dataset
    target-dataset: shared-ZFS-2
    # "recursive" and "exclude" work exactly like they do for periodic
    # snapshot tasks
    recursive: true
    #exclude:
    #  - data/src/excluded
    # Send dataset properties along with snapshots. Enabled by default.
    # Disable this if you use custom mountpoints and don't want them to be
    # replicated to the remote system.
    properties: true
    # Send a replication stream package, which will replicate the specified
    # filesystem and all descendent file systems.
    # When received, all properties, snapshots, descendent file systems, and
    # clones are preserved.
    # You must have recursive set to true, exclude set to an empty list and
    # properties set to true. Disabled by default.
    replicate: false
    # List of periodic snapshot task ids that are used as snapshot sources
    # for this replication task.
    # "recursive" and "exclude" fields must match between the replication
    # task and all periodic snapshot tasks bound to it, i.e. you can't do
    # recursive replication of non-recursive snapshots and you must exclude
    # all child snapshots that your periodic snapshot tasks exclude.
    periodic-snapshot-tasks:
      - shared-ZFS-1-qh
      - shared-ZFS-1-hour
    # If true, the replication task will run automatically, either after a
    # bound periodic snapshot task or on its own schedule
    auto: true
    # How to delete snapshots on the target. "source" means "delete snapshots
    # that are no longer present on the source"; more policies are documented
    # in the upstream zettarepl example configuration.
    retention-policy: source
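Before wiring zettarepl into Pacemaker, the configuration can be sanity-checked by running it once in the foreground (the same invocation as in the unit file above; stop it with Ctrl+C):
sudo env PYTHONPATH=/usr/lib/python3/dist-packages/ /usr/bin/zettarepl run /shared-ZFS-1/config/zettarepl.yaml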
sudo systemctl daemon-reload
sudo pcs resource create zettarepl systemd\:zettarepl.service
sudo pcs resource group add nfsgroup zettarepl --after shared-ZFS-2
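A quick way to check that the resources came up in the intended order and that snapshots are actually being created:
sudo pcs status                                    # all nfsgroup members should be Started on the active node
sudo zfs list -t snapshot -o name,creation | head  # qh-* and hour-* snapshots should start to appear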
Incompatibility between ZFS's handling of NFS and resource-agents
The ocf:heartbeat:exportfs agent configures NFS exports but does not write anything to disk. There are good reasons for this, especially with an active/passive failover setup. ZFS can also carry NFS export settings (the sharenfs property) and uses exportfs to reconfigure NFS. It does, however, write its exports to disk and assumes that every export has an on-disk configuration: it calls exportfs -ra, which "re-exports" according to the on-disk state.
This combination kills the NFS connections for a short period of time during certain ZFS operations, such as taking snapshots.
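On an affected setup the effect can be observed by watching the export list while taking a snapshot by hand (the dataset and snapshot names are just examples):
watch -n1 'exportfs -v'                      # terminal 1: shows the exports set up by ocf:heartbeat:exportfs
sudo zfs snapshot shared-ZFS-1@manual-test   # terminal 2: triggers the re-export, exports briefly drop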
The quick and dirty fix is to make ocf:heartbeat:exportfs persist its exports to disk. Hack:
--- /usr/lib/ocf/resource.d/heartbeat/exportfs.old	2024-12-20 23:46:19.223840200 +0100
+++ /usr/lib/ocf/resource.d/heartbeat/exportfs	2024-12-20 23:49:52.546388172 +0100
@@ -339,6 +339,7 @@
 	fi
 	ocf_log info "directory $dir exported"
+	cp /var/lib/nfs/etab /etc/exports.d/heartbeat.exports
 	return $OCF_SUCCESS
 }
 exportfs_start ()
@@ -403,6 +404,7 @@
 unexport_one() {
 	local dir=$1
 	ocf_run exportfs -v -u ${OCF_RESKEY_clientspec}:$dir
+	cp /var/lib/nfs/etab /etc/exports.d/heartbeat.exports
 }
 exportfs_stop ()
 {
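One way to apply it (the diff file name is arbitrary); keep in mind that the change is lost whenever the resource-agents package gets updated:
sudo cp /usr/lib/ocf/resource.d/heartbeat/exportfs /usr/lib/ocf/resource.d/heartbeat/exportfs.old
sudo mkdir -p /etc/exports.d   # target directory for the persisted exports
sudo patch /usr/lib/ocf/resource.d/heartbeat/exportfs exportfs-persist.diff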