BTRFS / ZFS, Swap, Swap with EXT4 journaling
BTRFS
Create RAID1 (with self-healing features): mkfs.btrfs -m raid1 -d raid1 /dev/sdXX /dev/sdYY --label <name>
Mount one disk and it mounts the whole RAID (do not mount more than one): mount /dev/sdXX /mnt/<name>
Add a disk:
mkfs.btrfs /dev/sdZZ
btrfs device add /dev/sdZZ /mnt/<name> -f
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/<name>
Check balancing status: btrfs balance status -v /mnt/<name>
Check file system status: btrfs fi show
To auto mount at boot, run btrfs fi show and copy the UUID into /etc/fstab: UUID=<uuid> /mnt/<name> btrfs defaults 0 0
Run a scrub to check and repair potential errors: btrfs scrub start /mnt/<name>
Check scrub status, cancel and resume:
btrfs scrub status /mnt/<name> (-d to show per disk)
btrfs scrub cancel /mnt/<name>
btrfs scrub resume /mnt/<name>
If the scrub does not fix the errors, the file system may be reverted to a working B-tree: mount -o recovery /dev/sdXX /mnt/<name>
A degraded one-disk BTRFS RAID1 results in a generic error: mount: wrong fs type, bad option, bad superblock on /dev/sdXX (Swedish locale output: "mount: fel filsystemstyp, felaktig flagga, felaktigt superblock på /dev/sdXX").
But this can be overridden and the RAID1 can run with only one disk: mount -t btrfs -o degraded /dev/sdXX /mnt/<name>
df reports misleading free space for BTRFS. Check btrfs fi df /mnt/<name>:
- High DUP value but not much free: balance the metadata by running btrfs balance start -m /mnt/<name>
- Out of metadata space (used close to total): rebalance data blocks that are at most 5% used with btrfs balance start -dusage=5 /mnt/<name>
Beware of btrfs check --repair, ask the developers instead.
http://www.beginninglinux.com/btrfs
https://www.thegeekdiary.com/how-to-use-btrfs-scrub-command-to-manage-scrubbing-on-btrfs-file-systems/
https://askubuntu.com/questions/464074/ubuntu-thinks-btrfs-disk-is-full-but-its-not
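BTRFS does not schedule periodic scrubs by itself, so unless something like the btrfsmaintenance package is installed it has to be set up manually. A minimal sketch of a monthly cron entry, assuming the hypothetical file /etc/cron.d/btrfs-scrub and the mount point /mnt/<name> (cron's limited PATH may require the full path to the btrfs binary):
0 3 1 * * root btrfs scrub start -Bq /mnt/<name>
This starts a scrub at 03:00 on the first day of every month, in the foreground (-B) and quietly (-q).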
ZFS (OpenZFS)
Create a RAID1 pool with self-healing features and 4096-byte sectors (ashift=12): zpool create -f -o ashift=12 -m /mnt/<name> <pool-name> mirror /dev/disk/by-id/<disk or partition 1> /dev/disk/by-id/<disk or partition 2>
Note 1, <pool-name> cannot be changed afterwards but the mount point can.
Note 2, it is recommended to use by-id names and not /dev/sdX(X) because of possible name changes.
Note 3, not using ashift=12 results in 512-byte sectors.
Note 4, not using -m results in the pool being mounted at /<pool-name>.
Note 5, it seems at least 2 disks or partitions are required when creating the RAID1 (-f does not work around this), but one disk can be offlined and removed afterwards, although the RAID1 then runs in a degraded state.
Note 6, pools cannot be shrunk, only grown in size, so make sure the size is correct - add sufficient space for swap partitions and such.
Using whole disks results in them being GPT partitioned with the following partitions:
/dev/sdX1 - Solaris /usr & Apple ZFS - linux_raid_member, the data partition
/dev/sdX9 - Solaris reserved 1, an empty partition only there to fill the disk out for sector alignment
List pools: zpool list
Check for errors: zpool scrub <pool-name>
Remove a pool: zpool destroy <pool-name>
Mount and unmount: zfs mount/umount <pool-name>
Check the mount point: zfs get mountpoint <pool-name>
Change the mount point: zfs set mountpoint=/mnt/<name> <pool-name>
Check pool status - list disks and so on: zpool status <pool-name>
Check pool status - list disks with full paths: zpool status -P <pool-name>
Offline a disk or partition (before disconnecting it): zpool offline <pool-name> <disk or partition name from zpool status <pool-name>>
Online a disk or partition (to re-add it after it has been disconnected and reconnected): zpool online <pool-name> <disk or partition name from zpool status <pool-name>>
Detach a disk: zpool detach <pool-name> <disk or partition name from zpool status <pool-name>>
Attach a disk - requires another disk already in the pool, with the data on it, as a reference (see the replacement example below): zpool attach <pool-name> <already-attached-disk> <disk-to-attach>
Bringing a disk or partition online triggers a resilvering process, followed by a mail when completed. This goes quite fast if there are few differences.
Expand the pool size:
1. Re-create the partitions and make them bigger, or use larger disks
2. Run zpool set autoexpand=on <pool-name>
3. Run zpool online -e <pool-name> <disk-name> for each disk
The available disk space increased after step 3.
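A sketch of replacing a failed mirror disk with the commands above; the by-id names are placeholders and the real ones must be taken from zpool status -P and ls -l /dev/disk/by-id/:
Identify the failed device: zpool status -P <pool-name>
Detach the failed disk from the mirror: zpool detach <pool-name> /dev/disk/by-id/<failed-disk>
Attach the new disk, using a surviving disk as the reference: zpool attach <pool-name> /dev/disk/by-id/<surviving-disk> /dev/disk/by-id/<new-disk>
Resilvering starts automatically; follow the progress with zpool status <pool-name>.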
ZFS - memory usage
There are a lot of recommendations to have enormous quantities of RAM when using ZFS, but there is seldom any explanation of why, or of what it is in ZFS that consumes it. I found a resource explaining it: https://www.zfsbuild.com/2010/04/15/explanation-of-arc-and-l2arc/. It turns out that ZFS caches the most frequently used data in RAM; this is the so-called ARC cache. By default it can chew up all RAM except 1 GB. It is also possible to have a second cache based on an SSD, the L2ARC cache. The ARC is reduced if other applications need the memory. This can however be troublesome if some application needs the memory immediately in order to start. There are ways to limit the caches. An idle but online ZFS partition does not seem to eat much at all, just a few hundred MB.
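As a sketch of one such limit: the ZFS kernel module exposes a zfs_arc_max parameter (value in bytes); the 2 GiB figure below is just an example, and whether the initramfs step is needed depends on how the module is loaded.
Limit the ARC at runtime (takes effect immediately, not persistent across reboots): echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
Make it persistent by putting this line in /etc/modprobe.d/zfs.conf: options zfs zfs_arc_max=2147483648
On Debian the initramfs may also need to be rebuilt so the option applies at boot: update-initramfs -u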
BTRFS / ZFS - Test self healing features
I tested the self-healing features in both BTRFS and ZFS by overwriting part of one of the disks with random data and then asking the file system to scrub itself. Both setups had 2 disks and ran in a RAID1 configuration.
To test, create a RAID1 setup with 2 small disks (I used 512 MB).
Mount the storage, BTRFS: mount /dev/sdXX /mnt/somewhere, ZFS: zfs mount <pool-name>.
Go to the root folder: cd /mnt/somewhere
Create test data: dd if=/dev/urandom of=<testfile> bs=1M count=<size>
Create an MD5 sum file for the contents: md5deep -r -e -l -of * > /somewhere/files.md5
Mess one of the underlying disks up a bit: dd if=/dev/urandom of=/dev/sdXX bs=1024 seek=15000 count=15000 (make sure to set the correct device when doing this, of course).
Run the scrub process, BTRFS: btrfs scrub start /mnt/<name>, ZFS: zpool scrub <pool-name>.
Check the status, BTRFS: btrfs scrub status /mnt/<name>, ZFS: zpool status <pool-name>.
This is where the magic appears - both file systems healed themselves.
Verify the MD5 sums: md5deep -m /somewhere/files.md5 *
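A condensed sketch of the BTRFS variant of the test; /dev/sdb and /dev/sdc are hypothetical 512 MB test disks, the sizes are only examples, and the corruption step destroys data on the chosen device:
mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc --label selfhealtest
mount /dev/sdb /mnt/selfhealtest
cd /mnt/selfhealtest
dd if=/dev/urandom of=testdata.bin bs=1M count=300
md5deep -r -e -l -of * > /root/files.md5
dd if=/dev/urandom of=/dev/sdc bs=1024 seek=15000 count=15000
btrfs scrub start /mnt/selfhealtest
btrfs scrub status /mnt/selfhealtest
md5deep -m /root/files.md5 *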
ZFS scrub schedule
One common criticism of users running ZFS is that they have forgotten to set up a scrub schedule - a periodic check of the mirror for errors so they can be corrected. It turns out Debian already schedules one when the standard ZFS utilities are installed, located at /etc/cron.d/zfsutils-linux. It reads:
# Scrub the second Sunday of every month
24 0 8-14 * * root [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ] && /usr/lib/zfs-linux/scrub
So it runs at 00:24 on days 8-14 of the month, and the date test only lets it proceed when that day is a Sunday - in other words, on the second Sunday of every month.
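To check that the scheduled scrub actually ran, the scan line in zpool status shows when the last scrub finished (the exact wording varies a bit between OpenZFS versions): zpool status <pool-name> | grep scan
It typically prints something like "scrub repaired 0B in ... with 0 errors on Sun ...".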
Size, read speed and write speed comparisons
A 512 MB drive resulted in 447 MB usable space with BTRFS and 464 MB with ZFS (I checked the actually occupied space, not what df reported) - about 4% more storage with ZFS.
Read tests:
BTRFS: 373030912 bytes (373 MB, 356 MiB) copied, 6.8928 s, 54.1 MB/s
ZFS: 373030912 bytes (373 MB, 356 MiB) copied, 8.63009 s, 43.2 MB/s
Write tests:
ZFS: 40960000 bytes (41 MB, 39 MiB) copied, 3.26608 s, 12.5 MB/s*
BTRFS: 40960000 bytes (41 MB, 39 MiB) copied, 0.322734 s, 127 MB/s
*Note, the above read and write figures are not accurate for ZFS - it turns out ZFS is much slower on degraded RAID1 arrays, so these values are too low.
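The exact dd invocations are not recorded above, so the following is only a plausible reconstruction of this kind of test; the file names are assumptions, and conv=fdatasync plus the cache drop are there so caching does not skew the numbers:
Write test: dd if=/dev/zero of=/mnt/<name>/writetest bs=8k count=5000 conv=fdatasync
Drop the page cache before reading: echo 3 > /proc/sys/vm/drop_caches
Read test: dd if=/mnt/<name>/testdata.bin of=/dev/null bs=1M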
Swap
According to https://unix.stackexchange.com/questions/269098/silent-disk-errors-and-reliability-of-linux-swap, Linux does not appear to have any kind of checksumming on swap; it just assumes that the underlying storage hardware takes care of this. The simplest way to get some kind of error checking beyond the hardware is to make an EXT4 partition and put a swap file on it.
Swap with EXT4 journaling
cfdisk /dev/sdX
Make a new partition and set its type to 83 Linux (not 82 Linux Swap): go to New and create the partition, then go to Type and set it to 83 Linux, and finally go to Write.
Format it as EXT4: mkfs.ext4 /dev/sdXY
Set the reserved blocks to none: tune2fs -m 0 /dev/sdXY
Check the UUID of the partition (note the text UUID="<this text>"): blkid
Make a swap mount location: mkdir -p /mnt/swap0
Mount the swap partition there: mount /dev/sdXY /mnt/swap0
Make a swap file that fills the whole partition with zeros: dd if=/dev/zero of=/mnt/swap0/swapfile0 bs=16M
Prepare the swap: mkswap /mnt/swap0/swapfile0
Edit /etc/fstab and add these lines:
UUID=<UUID from above> /mnt/swap0 ext4 defaults 0 0
/mnt/swap0/swapfile0 none swap sw 0 0
Use the swap: swapon /mnt/swap0/swapfile0
Or parse the fstab and use the definitions there: mount -a for the EXT4 mount and swapon -a for the swap entry.
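Two extra steps that are commonly recommended but not part of the list above:
Restrict the swap file permissions (newer mkswap/swapon versions warn about world-readable swap files): chmod 600 /mnt/swap0/swapfile0
Verify that the swap is active and has the expected size: swapon --show and free -h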
This is a personal note. Last updated: 2024-03-29 21:10:25.