Zero-assumptions ZFS, part 1
This is the first in a series of articles about ZFS, and is part of what I hope becomes an ongoing series here: the zero-assumptions write-up. This article assumes you know nothing about ZFS.
I’ve been interested in ZFS for a while now, but didn’t have a good reason to use it for anything. Last time I looked at it, ZFS on Linux was still a bit immature, but in the past few years Linux support for ZFS has really stepped up, so I decided to give it another go. ZFS is a bit of a world unto itself though: most resources walk you through some quick commands without explaining the concepts underlying ZFS, or assume the user is very familiar with traditional RAID terminology.
I keep one desktop machine in my house (running Bedrock Linux with an Ubuntu base and an Arch Linux stratum on top) that acts, among other things, as a storage/media server. I keep photos and other digital detritus I’ve collected over the years there, and would be very sad if they were to disappear. I back everything up nightly via the excellent restic to the also excellent Backblaze B2, but since I have terabytes of data stored there I haven’t followed the cardinal rule of backing up: make sure that you can actually restore from your backups. Since testing that on my internet connection would take months, and I’m afraid of accidentally deleting data or of a drive failing, I decided to add a bit more redundancy.
My server has 3 drives in it right now: one 4TB spinning-disk drive, one 2TB spinning-disk drive, and one 500GB SSD that holds the root filesystem. The majority of the data I want to keep is on the 2TB drive, and the 4TB drive is mostly empty. After doing some research (read: browsing posts on /r/datahoarder), it seems the two most common tools people use to add transparent redundancy are a snapraid + mergerfs combo, or the old standby, ZFS.
Installing ZFS on Linux
Getting ZFS installed on Linux (assuming you don’t try to use it as the root filesystem) is almost comically easy these days. On Ubuntu 16.04+ (and probably recent Debian releases too), this should be as straightforward as:
sudo apt install zfs-dkms zfs-fuse zfs-initramfs zfsutils-linux
For simplicity, the above command installs more than is strictly needed:
zfs-dkms and zfs-fuse are different implementations of ZFS for Linux, and either should be enough to use ZFS on its own. There are multiple implementations because Linux offers more than one way to add a filesystem. zfs-dkms uses a technology (unsurprisingly) called DKMS, while zfs-fuse uses (even less surprisingly) a technology called FUSE. FUSE makes it easier for developers to implement filesystems at the cost of a bit of performance. DKMS stands for Dynamic Kernel Module Support, and is a means by which you can install the source code for a module and let the Linux distro itself take care of compiling that source to match the running Linux kernel.
On Arch Linux:
sudo pacman -Syu zfs-linux
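If you want to confirm that the install actually worked, a quick read-only check along these lines should do. This is a minimal sketch rather than an official procedure: the check_zfs function name is my own, and it only looks for the zpool and zfs commands and for the loaded kernel module.

```shell
# Report whether the ZFS userland tools and kernel module appear to be present.
# Read-only: nothing is installed or loaded.
check_zfs() {
    for tool in zpool zfs; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found"
        fi
    done
    # /sys/module/zfs exists only while the kernel module is loaded.
    if [ -d /sys/module/zfs ]; then
        echo "kernel module: loaded"
    else
        echo "kernel module: not loaded (the FUSE implementation does not need it)"
    fi
}

check_zfs
```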
Planning your drives
The first step to getting started with ZFS was to figure out how I wanted to use my drives. Most people who use ZFS for these purposes seem to go out and buy multiple big hard drives, and then use ZFS to mirror them. I just wanted more data redundancy on the drives I already had, so I decided to partition my drives.
Since I have one 2TB drive that I want backed up, I first partitioned my 4TB drive into two 2TB partitions using gparted. I then created an ext4 filesystem on the second partition.
Then I used blkid and lsblk to check my handiwork. These two tools print lists of all the “block devices” (read: hard disks) in my system and show different ways to refer to them in Linux:
$ blkid
/dev/sda1: UUID="7600-739F" TYPE="vfat" PARTUUID="ded30b23-f318-433c-bfb2-15738d42cc01"
/dev/sda2: LABEL="500gb-ssd-root" UUID="906bd064-2156-4a88-8d88-8940af7c5a34" TYPE="ext4" PARTLABEL="500gb-ssd-root" PARTUUID="cc6695ed-1a2b-4cb1-b302-37614cf07bf7"
/dev/sdc1: LABEL="zstore" UUID="5303013864921755800" UUID_SUB="17834655468516818280" TYPE="ext4" PARTUUID="072d0dd9-a1bf-4c67-b9b3-046f37c48846"
/dev/sdc2: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="266677788785228698" TYPE="ext4" PARTLABEL="extra2tb" PARTUUID="1f9e7fd1-1da6-4dbd-9302-95f6ea62fff0"
/dev/sdb1: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="89185545293388421" TYPE="zfs_member" PARTUUID="5626d9ea-01"
/dev/sde1: UUID="acd97a41-df27-4b69-924c-9290470b735d" TYPE="ext4" PARTLABEL="wd2tb" PARTUUID="6ca94069-5fc8-4466-bba2-e5b6237a19b7"
$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb      8:16   0   1.8T  0 disk
└─sdb1   8:17   0   1.8T  0 part
sde      8:64   0   1.8T  0 disk
└─sde1   8:65   0   1.8T  0 part
sdc      8:32   0   3.7T  0 disk
├─sdc2   8:34   0   1.8T  0 part
└─sdc1   8:33   0   1.8T  0 part
sda      8:0    0   477G  0 disk
├─sda2   8:2    0 476.4G  0 part
└─sda1   8:1    0   512M  0 part
If you’re not familiar with how Linux handles hard disks, Linux refers to hard disks as “block devices.” Linux provides access to physical hardware through a virtual filesystem it mounts at /dev, and depending on what type of hard drive you have, hard disks will generally be of the format /dev/sdX, where X is a letter from a-z that Linux assigns to the drive. Partitions on each disk are then assigned a number, so in the lsblk output above, you can see that disk sdc has two partitions, which show up in the output as sdc1 and sdc2. The blkid command shows the traditional /dev/sdX names, but also adds UUIDs, which you can think of as random IDs that always refer to the same filesystem or partition. These are useful because if you were to unplug one of your drives and plug it into a different port, Linux may give it a different /dev/sdX name, e.g. if you unplugged the /dev/sdc drive and plugged it into another port it may become /dev/sda, but it would keep the same UUID.
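To see the UUID-to-device mapping for yourself, you can walk a directory of UUID symlinks. The sketch below assumes your distro populates /dev/disk/by-uuid (most udev-based ones do); the list_uuids helper is my own name for illustration, and it takes an optional directory argument so you can point it somewhere else.

```shell
# Print "UUID -> device" for every symlink in a by-uuid style directory.
# The directory defaults to /dev/disk/by-uuid (an assumption about your system);
# pass another path to inspect a different directory.
list_uuids() {
    dir=${1:-/dev/disk/by-uuid}
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    for link in "$dir"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        echo "$(basename "$link") -> $(readlink -f "$link")"
    done
}

list_uuids || true
```

Each output line pairs a stable UUID with whatever /dev/sdX name the kernel happens to have assigned on this boot.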
I wanted to convert my 2TB drive to ZFS, but since my precious data is all currently located on that drive (/dev/sdb1 above), I decided to pull a swaparoo: first copy everything onto the second partition of my 4TB drive (/dev/sdc2 above), then let ZFS take over the original partition and copy the data back onto that drive.
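Before handing the original partition over to ZFS, it is worth checking that the temporary copy is complete. The helper below is a hypothetical sketch of that step using only cp and diff; copy_and_verify is a name made up for this example, and any mount points you pass it are your own.

```shell
# Copy a directory tree and confirm the copy matches before the source is touched.
# Hypothetical helper: not necessarily the tool used for the original migration.
copy_and_verify() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # "src/." copies the contents of src (including dotfiles) into dst;
    # -a preserves permissions, timestamps and symlinks (and ownership when run as root).
    cp -a "$src/." "$dst/" || return 1
    # diff -r compares every file byte for byte and exits non-zero on any difference.
    if diff -r "$src" "$dst" >/dev/null; then
        echo "verified: $dst matches $src"
    else
        echo "MISMATCH between $src and $dst" >&2
        return 1
    fi
}

# Example call (paths are placeholders):
# copy_and_verify /path/to/old-2tb-mount /path/to/spare-partition-mount
```

A byte-for-byte diff over terabytes is slow, but it only has to run once, and it is the cheap moment to find out a copy went wrong.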
The end result I’m looking for is a layout with two “pools” (ZFS-speak for sets of drives, more on this later). One pool should consist of my original 2TB drive, mirrored to one of the 2TB partitions on my 4TB drive. The extra 2TB partition available on the 4TB drive will act as a second pool, which gives me nice ZFS benefits like checksumming and the ability to take snapshots of the drive, as well as the option to add another 2TB drive/partition later and mirror the data.
If you’re already familiar with zpool, this is what the finished setup looks like:
$ sudo zpool status
  pool: longterm
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        longterm    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc2    ONLINE       0     0     0
            sdb1    ONLINE       0     0     0

  pool: zstore
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        zstore      ONLINE       0     0     0
          sdc1      ONLINE       0     0     0
ZFS terminology and concepts: mirrors, stripes, pools, vdevs and parity
ZFS introduces a fair number of new concepts and terms, which can take some getting used to. The first bit to understand is what ZFS actually does. ZFS usually works with pools of drives (hence the name of the zpool command) and allows you to do things like mirroring or striping the drives.
And what does it mean to mirror or stripe a drive, you ask? When two drives are mirrored they do everything in unison, so any data written to one drive is also written to the other drive at the same time. This way, if one of your drives were to fail, your data would still be safe and sound on the other drive. If you then installed a new hard drive to replace the failed one, ZFS would take care of syncing all your data back onto it through a process it calls “resilvering.”
Striping is a different beast. Mirroring drives is great for redundancy, but has the obvious drawback that you only get to use half the disk space you have available. Sometimes the situation calls for the opposite trade-off: if you bought two 2TB drives and you wanted to be able to use all 4TB of available storage, striping would let you do that. In striped setups ZFS writes “stripes” of data to each drive. This means that if you write a single file ZFS may actually store part of the file on one drive and part of the file on another.
This has another advantage beyond the extra space: it speeds up your reads and writes by making them concurrent. Since it’s storing pieces of one file on each drive, both drives can be writing at the same time, so your write speed could theoretically double. Read speed also gets a boost since you can also read from both drives at the same time. The downside to all this speed and space is that your data is less safe. Since your data is split between two drives, if one of the hard drives dies you will probably lose all your data – no one file will be complete because while your good drive might have half the file on it, the other half is gone with your dead hard disk. So in effect you’re trading close to double the speed and space for close to double the risk of losing all your data. Depending on what you’re doing that might be a good choice to make, but I wouldn’t put any data I didn’t want to lose into a striped setup.
There’s a third type, a sort of compromise solution, which is to use parity. This type of setup is frequently referred to as RAIDZ (or RAIDZ2 or RAIDZ3) and is somewhere between a full-on striped setup and a mirrored setup. This approach sets aside one drive’s worth of space for parity data, which acts as a kind of semi-backup (RAIDZ actually spreads the parity across all the drives rather than keeping it on one dedicated disk, but the space cost is the same). This is backed by a lot of complicated math that I don’t pretend to understand, but the take-home message is that it provides a way to restore your data if a drive fails. So if you have three 2TB drives, you can choose to stripe them but dedicate one drive’s worth of space to parity. In this setup, you’d have 4TB of available storage, but if a drive were to fail you wouldn’t lose any data (although performance would probably be pretty horrible until you replaced the failed disk). Think of it as a kind of half backup. You can tweak the ratio as well: if you dedicate more space to parity you can survive more failing drives without losing data. This is what the 2 and 3 in RAIDZ2 and RAIDZ3 mean.
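The space trade-offs above boil down to simple arithmetic, which the sketch below spells out. It assumes equally sized drives and ignores the space ZFS itself uses for metadata, so treat the numbers as upper bounds; usable_tb is just a helper name I made up.

```shell
# Usable capacity for N equally sized drives under each layout described above.
# Ignores filesystem overhead, so real-world numbers will be a little lower.
# usage: usable_tb LAYOUT DRIVES SIZE_TB
usable_tb() {
    layout=$1 drives=$2 size=$3
    case $layout in
        stripe) echo $(( drives * size )) ;;          # every drive holds unique data
        mirror) echo "$size" ;;                       # every drive holds the same data
        raidz1) echo $(( (drives - 1) * size )) ;;    # one drive's worth of parity
        raidz2) echo $(( (drives - 2) * size )) ;;    # two drives' worth of parity
        raidz3) echo $(( (drives - 3) * size )) ;;    # three drives' worth of parity
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

usable_tb stripe 2 2   # the two 2TB drives from the striping example
usable_tb mirror 2 2   # the same two drives, mirrored
usable_tb raidz1 3 2   # the three 2TB drives from the parity example
```

The three calls print 4, 2 and 4, matching the figures in the text: striping two 2TB drives gives 4TB, mirroring them gives 2TB, and three 2TB drives with one drive’s worth of parity give 4TB.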
Now that we’ve gone over the high-level concepts of drive arrays and RAID, we can dive into the more ZFS-specific aspects. The first item to go over is the concept of a vdev. A vdev is a “virtual device,” and when zpool pools drives it is really pooling these virtual devices, striping data across all of the vdevs in the pool. What makes vdevs useful is that you can put more than one physical drive (or partition) into a single vdev: the pool stripes over its vdevs, while each vdev can mirror data (or use RAIDZ parity) across its own set of drives. This is part of what makes ZFS so flexible. For example, you could get the speed benefits of a striped setup with the redundancy benefits of a mirrored setup by creating two mirror vdevs, each of which is configured to mirror data across two physical drives. You could then add both vdevs to a single pool to get the fast reads and writes that striping allows without running the risk of losing your data if a single drive were to fail (this is actually a fairly popular setup and is known as RAID10 outside of ZFS-land).
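The RAID10-style layout described above maps onto a single zpool create command, with the word mirror starting each new vdev. The sketch below only builds and prints that command so you can read it before running anything; the pool name tank and the /dev/sdw through /dev/sdz device names are placeholders, not drives from my machine.

```shell
# Build (but do not run) a "zpool create" command that pairs devices into mirror vdevs.
# usage: striped_mirror_cmd POOL DEV1 DEV2 [DEV3 DEV4 ...]
striped_mirror_cmd() {
    pool=$1
    shift
    if [ $# -lt 2 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "need an even number of devices (at least 2)" >&2
        return 1
    fi
    cmd="zpool create $pool"
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"   # each pair becomes one mirror vdev
        shift 2
    done
    echo "$cmd"
}

# Placeholder pool and device names, for illustration only.
striped_mirror_cmd tank /dev/sdw /dev/sdx /dev/sdy /dev/sdz
```

If you do want ZFS itself to preview the layout, zpool create accepts a -n flag that shows the configuration it would use without creating the pool.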
This can get quite complicated quite quickly, but this article (backup link here since original was down at the time of writing) does a nice job walking through the various permutations of vdevs and zpools that are possible.
ZFS can also be used with plain files standing in for drives (much like loopback devices), which is a nice way to play with ZFS without having to invest in lots of hard drives. Let’s run through one of the possibilities with some file-backed devices so you can get a feeling for how ZFS works.
When ZFS uses files on another filesystem instead of accessing devices directly, it requires that the files be allocated first. We can do that with a shell for loop by using the dd command to copy 1GB of zeros into each file (you should make sure you have at least 4GB of available disk space before running this):
for i in 1 2 3 4; do dd if=/dev/zero of=zfs$i bs=1024M count=1; done
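If 4GB of zeros feels wasteful, sparse files are a possible alternative. This is a sketch under the assumption that sparse files are good enough for a throwaway practice pool, which I haven’t verified beyond experiments like this one; make_backing_files is a made-up helper, and the 64M size in the demo is only there to keep it quick.

```shell
# Create N sparse backing files of a given size without writing zeros to disk.
# usage: make_backing_files DIR COUNT SIZE   (SIZE as accepted by truncate, e.g. 1G)
make_backing_files() {
    dir=$1 count=$2 size=$3
    i=1
    while [ "$i" -le "$count" ]; do
        truncate -s "$size" "$dir/zfs$i" || return 1
        i=$(( i + 1 ))
    done
    ls -ls "$dir"   # first column = blocks actually allocated on disk
}

# Demo in a temporary directory that is removed straight away.
demo_dir=$(mktemp -d)
make_backing_files "$demo_dir" 4 64M
rm -rf "$demo_dir"
```

The files report their full size but take up almost no space until something is written to them, so the disk fills up only as you copy data into the pool.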
Now that we have our empty files we can put them into a ZFS pool:
sudo zpool create testpool mirror $PWD/zfs1 $PWD/zfs2 mirror $PWD/zfs3 $PWD/zfs4
Note that the $PWD above is important: ZFS requires absolute paths when using files.
You should now have a new zpool mounted at /testpool. Check on it with zpool status:
$ zpool status
  pool: testpool
 state: ONLINE
  scan: none requested
config:
        NAME                STATE     READ WRITE CKSUM
        testpool            ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /home/nik/zfs1  ONLINE       0     0     0
            /home/nik/zfs2  ONLINE       0     0     0
          mirror-1          ONLINE       0     0     0
            /home/nik/zfs3  ONLINE       0     0     0
            /home/nik/zfs4  ONLINE       0     0     0
errors: No known data errors
Your new ZFS filesystem is now live: you can change into /testpool and copy some files into it.
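When you’re done experimenting, the practice pool and its backing files can be removed again. The sketch below is cautious by design: with DRY_RUN left at its default of 1 it only prints what it would do. The run and cleanup_testpool helpers are names I invented for this example, and it assumes the zfs1 to zfs4 files live in the directory you pass in.

```shell
# Tear down the practice pool and remove its backing files.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

cleanup_testpool() {
    dir=$1
    run sudo zpool destroy testpool     # destroys the pool and everything in it
    for i in 1 2 3 4; do
        run rm -f "$dir/zfs$i"          # then delete the backing files
    done
}

cleanup_testpool "$PWD"
```

Set DRY_RUN=0 before calling cleanup_testpool once the printed commands look right. Keep in mind that zpool destroy is permanent, so double-check the pool name first.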
We’ve gone over the basics of ZFS. In the next post we’ll go on to some of the more powerful and advanced features ZFS offers, like compression, snapshots, the zfs send and zfs receive commands, and the secret