Options for ZFS on RAID
Comparison
| | ZFS with RAIDZ | ZFS on HW RAID | Commercial RAID iSCSI solution |
|---|---|---|---|
| Levels | Mirror/1, RAIDZ1/5, RAIDZ2/6, RAIDZ3, dRAID | Mirror/1, RAID5, RAID6 | Mirror/1, RAID5, RAID6 |
| Data checksums in FS | Yes | Yes | probably not |
ZFS specific terms
- Resilvering: the ZFS term for a RAID rebuild or drive initialization.
- RAIDZx: x is the number of drives that can fail, up to three at the moment.
- dRAID: RAID stripes are distributed across all drives, including the hot spares, which leads to faster repair after a disk swap.
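As a rough illustration (the pool name "tank" and the disk paths are placeholders, not part of this setup), these levels map onto zpool vdev types like this:
# Mirror / "RAID1" and RAIDZ2 / "RAID6" are alternatives -- pick one per pool
sudo zpool create tank mirror /dev/sdb /dev/sdc
sudo zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde   # survives two failed drives
# dRAID pools use the "draid" vdev type instead; see zpoolconcepts(7) for its layout syntax
sudo zpool status tank   # shows the resulting layout and any resilver progress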
Why ZFS should not run on top of hardware RAID
OpenZFS did a writeup on this.
- Hardware RAID hurts ZFS's ability to self-heal small errors: there are not enough "copies" of the data to determine which are correct and which need to be replaced (illustrated below).
- The sector size of RAID devices cannot be determined with certainty, so atomic writes are virtually impossible (a battery-backed cache on the HW RAID side is the usual mitigation).
- HW RAID is a prime example of vendor lock-in, whereas a ZFS pool can be read on any recent computer system with enough storage interfaces.
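To make the first point concrete: when ZFS owns the individual disks, a scrub both detects and repairs silent corruption from redundancy, and the damage is reported per device (again, "tank" is only a placeholder):
sudo zpool scrub tank        # read every block and verify it against its checksum
sudo zpool status -v tank    # the READ/WRITE/CKSUM counters show what was found and repaired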
ZFS active/passive failover with Pacemaker
On Fedora, the resource-agents package contains an agent for use with ZFS. The same version is available on CentOS Stream 10, but there it does not contain the ZFS agent. The Fedora 42 package can be rebuilt and works on Rocky Linux 9.
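A minimal sketch of such a rebuild on a Rocky Linux 9 build host (the exact SRPM file name depends on the current Fedora 42 build and is not fixed here):
sudo dnf install -y rpm-build 'dnf-command(builddep)'
# download the current resource-agents-*.fc42.src.rpm from Fedora, then:
sudo dnf builddep -y resource-agents-*.fc42.src.rpm
rpmbuild --rebuild resource-agents-*.fc42.src.rpm
# the rebuilt packages end up under ~/rpmbuild/RPMS/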
There is an official repository with ZFS packages built for EL9 (and EL8). The "stable" option is the ZFS 2.1.x series, which is still "supported" at 2.1.16 as of Dec 2024. TrueNAS, however, is on the 2.2.x series, which sits in the testing branch at 2.2.6 as of Dec 2024. There is also an option to use a prebuilt kernel module with EL9, but it is not signed for use with EFI Secure Boot. The DKMS build path signs its artifacts automatically, but of course pulls in a lot of dependencies.
To make use of the snapshot and send/receive features for backup, another piece of software is needed. There are some options, but I will use the part of TrueNAS that is available separately as zettarepl and can be rebuilt for EL9. Note, however, that its dependencies don't translate correctly, which is why they are installed by hand below.
Install instructions:
sudo dnf install -y https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm
sudo dnf config-manager --enable zfs-testing
sudo dnf install -y kernel-devel zfs
sudo mokutil --import /var/lib/dkms/mok.pub
# choose a temporary password
sudo reboot; exit
In the MOK manager that appears during the reboot, enroll the key using the password you just set.
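After the reboot it is worth checking, for example like this, that the key really got enrolled and that the DKMS-built module carries a signature:
mokutil --test-key /var/lib/dkms/mok.pub   # should report that the key is already enrolled
modinfo zfs | grep -Ei 'signer|sig_key'    # signature fields appear once the module is signed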
sudo modprobe zfs # this will not work on an EFI Secure Boot system without the MOK enrollment
# Disable services meant for automatic import on startup -> we do this with pcs
sudo systemctl disable --now zfs-share.service zfs-import-cache.service zfs-mount.service
sudo pcs resource create shared-ZFS-1 ocf:heartbeat:ZFS pool=shared-ZFS-1 op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="5s"
sudo pcs resource create zfs-scrub-monthly-shared-ZFS-1 systemd\:zfs-scrub-monthly@shared-ZFS-1.timer
sudo pcs resource create shared-ZFS-2 ocf:heartbeat:ZFS pool=shared-ZFS-2 op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="5s"
sudo pcs resource create zfs-scrub-monthly-shared-ZFS-2 systemd\:zfs-scrub-monthly@shared-ZFS-2.timer
sudo pcs resource group add nfsgroup shared-ZFS-1 --before clustered-nfs
sudo pcs resource group add nfsgroup zfs-scrub-monthly-shared-ZFS-1 --after shared-ZFS-1
sudo pcs resource group add nfsgroup shared-ZFS-2 --before clustered-nfs
sudo pcs resource group add nfsgroup zfs-scrub-monthly-shared-ZFS-2 --after shared-ZFS-2
sudo dnf install python3-coloredlogs python3-jsonschema python3-isodate python3-croniter python3-paramiko
sudo dnf install python3-zettarepl-24.10.1-2.noarch.rpm
sudo nano /etc/systemd/system/zettarepl.service
[Unit]
Description=zettarepl TrueNAS snapshot and replication tool

[Service]
Environment=PYTHONPATH=/usr/lib/python3/dist-packages/
ExecStart=/usr/bin/zettarepl run /shared-ZFS-1/config/zettarepl.yaml
Note: there is no [Install] section because the service will be launched by Pacemaker.
sudo mkdir /shared-ZFS-1/config
sudo nano /shared-ZFS-1/config/zettarepl.yaml
periodic-snapshot-tasks:
  # Each task in zettarepl must have a unique id so that it can be referenced
  shared-ZFS-1-qh:
    # Dataset to make snapshots of
    dataset: shared-ZFS-1
    # You must explicitly specify if you want recursive or non-recursive
    # snapshots
    recursive: true
    # You can exclude certain datasets from recursive snapshots.
    # Please note that you won't be able to use such snapshots with recursive
    # functions of ZFS (e.g. zfs rollback -r) as it would fail with
    # "data/src/excluded@snapshot: snapshot does not exist"
    # They are still consistent with each other, i.e. they are not created
    # independently but in one transaction.
    #exclude:
    #  - data/src/excluded
    #  - data/*/excluded
    # You can specify a lifetime for snapshots so they get automatically
    # deleted after a certain amount of time.
    # Lifetime is specified in ISO8601 duration format:
    # "P365D" means "365 days", "PT12H" means "12 hours" and "P30DT12H" means
    # "30 days and 12 hours".
    # When this is not specified, snapshots are not deleted automatically.
    lifetime: PT1H
    # If set to false, snapshots without any changes in them will be deleted
    # and not replicated.
    #allow-empty: false
    # This is a very important parameter that defines how your snapshots are
    # named depending on their creation date.
    # zettarepl does not read the snapshot creation date from metadata (this
    # can be very slow for a reasonably big amount of snapshots); instead it
    # relies solely on snapshot names to parse their creation date.
    # Due to this optimization, the naming schema must contain all of "%Y",
    # "%m", "%d", "%H" and "%M" format strings to allow unambiguous parsing
    # of the string to a date and time accurate to the minute.
    # Do not create two periodic snapshot tasks for the same dataset with
    # naming schemas that can be mixed up, e.g. "snap-%Y-%m-%d-%H-%M" and
    # "snap-%Y-%d-%m-%H-%M". zettarepl won't be able to check for this at an
    # early stage and will get confused.
    naming-schema: qh-%Y-%m-%d-%H-%M
    # Crontab-like schedule for when this snapshot task runs
    # (the default schedule is * * * * *, i.e. every minute)
    schedule:
      minute: "*/15" # Every 15 minutes

  shared-ZFS-1-hour:
    dataset: shared-ZFS-1
    recursive: true
    #exclude:
    #  - shared-ZFS-1/xyz
    lifetime: P1D
    #allow-empty: false
    naming-schema: hour-%Y-%m-%d-%H-%M
    schedule:
      minute: "0"
      hour: "*"

replication-tasks:
  shared-ZFS-1-shared-ZFS-2:
    # Either push or pull
    direction: push
    # The transport option defines the remote host to send/receive snapshots.
    # You can also just specify "local" to send/receive snapshots on localhost.
    transport:
      type: local
    # Source dataset
    source-dataset: shared-ZFS-1
    # Target dataset
    target-dataset: shared-ZFS-2
    # "recursive" and "exclude" work exactly like they do for periodic
    # snapshot tasks
    recursive: true
    #exclude:
    #  - data/src/excluded
    # Send dataset properties along with snapshots. Enabled by default.
    # Disable this if you use custom mountpoints and don't want them to be
    # replicated to the remote system.
    properties: true
    # Send a replication stream package, which will replicate the specified
    # filesystem and all descendent file systems.
    # When received, all properties, snapshots, descendent file systems, and
    # clones are preserved.
    # You must have recursive set to true, exclude set to an empty list and
    # properties set to true. Disabled by default.
    replicate: false
    # List of periodic snapshot task ids that are used as snapshot sources
    # for this replication task.
    # "recursive" and "exclude" fields must match between the replication
    # task and all periodic snapshot tasks bound to it, i.e. you can't do
    # recursive replication of non-recursive snapshots and you must exclude
    # all child snapshots that your periodic snapshot tasks exclude.
    periodic-snapshot-tasks:
      - shared-ZFS-1-qh
      - shared-ZFS-1-hour
    # If true, the replication task will run automatically, either after a
    # bound periodic snapshot task or on its own schedule
    auto: true
    # How to delete snapshots on the target. "source" means "delete snapshots
    # that are no longer present on the source"; more policies are documented
    # in the upstream zettarepl example configuration.
    retention-policy: source
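Before wiring zettarepl into Pacemaker, the configuration can be sanity-checked by running it once in the foreground (the same invocation as in the unit file above; stop it with Ctrl+C):
sudo env PYTHONPATH=/usr/lib/python3/dist-packages/ /usr/bin/zettarepl run /shared-ZFS-1/config/zettarepl.yaml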
sudo systemctl daemon-reload
sudo pcs resource create zettarepl systemd\:zettarepl.service
sudo pcs resource group add nfsgroup zettarepl --after shared-ZFS-2
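A quick way to check that the resources came up in the intended order and that snapshots are actually being created:
sudo pcs status                                    # all nfsgroup members should be Started on the active node
sudo zfs list -t snapshot -o name,creation | head  # qh-* and hour-* snapshots should start to appear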
Incompatibility between ZFS's handling of NFS and resource-agents
The ocf:heartbeat:exportfs agent configures NFS exports but does not write anything to disk. There are good reasons for this, especially with an active/passive failover setup. ZFS can also carry NFS export settings (the sharenfs property) and uses exportfs to reconfigure NFS. It does, however, write its exports to disk and assumes that every export has an on-disk configuration: it calls exportfs -ra, which "re-exports" according to the on-disk state.
This combination kills the NFS connections for a short period of time during certain ZFS operations, such as taking snapshots.
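On an affected setup the effect can be observed by watching the export list while taking a snapshot by hand (the dataset and snapshot names are just examples):
watch -n1 'exportfs -v'                      # terminal 1: shows the exports set up by ocf:heartbeat:exportfs
sudo zfs snapshot shared-ZFS-1@manual-test   # terminal 2: triggers the re-export, exports briefly drop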
The quick and dirty fix is to make ocf:heartbeat:exportfs persist its exports to disk. Hack:
--- /usr/lib/ocf/resource.d/heartbeat/exportfs.old	2024-12-20 23:46:19.223840200 +0100
+++ /usr/lib/ocf/resource.d/heartbeat/exportfs	2024-12-20 23:49:52.546388172 +0100
@@ -339,6 +339,7 @@
 	fi
 	ocf_log info "directory $dir exported"
+	cp /var/lib/nfs/etab /etc/exports.d/heartbeat.exports
 	return $OCF_SUCCESS
 }
 exportfs_start ()
@@ -403,6 +404,7 @@
 unexport_one() {
 	local dir=$1
 	ocf_run exportfs -v -u ${OCF_RESKEY_clientspec}:$dir
+	cp /var/lib/nfs/etab /etc/exports.d/heartbeat.exports
 }
 exportfs_stop ()
 {
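One way to apply it (the diff file name is arbitrary); keep in mind that the change is lost whenever the resource-agents package gets updated:
sudo cp /usr/lib/ocf/resource.d/heartbeat/exportfs /usr/lib/ocf/resource.d/heartbeat/exportfs.old
sudo mkdir -p /etc/exports.d   # target directory for the persisted exports
sudo patch /usr/lib/ocf/resource.d/heartbeat/exportfs exportfs-persist.diff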