Zero-assumptions ZFS, part 1
This is the first in a series of articles about ZFS, and part of what I hope becomes an ongoing feature here: the zero-assumptions write-up. This article is written assuming you know nothing about ZFS.
I’ve been interested in ZFS for a while now, but didn’t have a good reason to use it for anything. The last time I looked, ZFS on Linux was still a bit immature, but Linux support for ZFS has really stepped up in the past few years, so I decided to give it another go. ZFS is a bit of a world unto itself, though: most resources either walk you through a few quick commands without explaining the concepts underlying ZFS, or assume the reader is already very familiar with traditional RAID terminology.
Background
I keep one desktop machine in my house (running Bedrock Linux with an Ubuntu base and an Arch Linux stratum on top) that acts, among other things, as a storage/media server. I keep photos and other digital detritus I’ve collected over the years there, and would be very sad if they were to disappear. I back everything up nightly via the excellent restic to the also excellent Backblaze B2, but since I have terabytes of data stored there I haven’t followed the cardinal rule of backups: make sure that you can actually restore from them. Since testing that over my internet connection would take months, and I’m afraid of accidental deletion or drive failure, I decided to add a bit more redundancy.
My server currently has three hard drives in it: one 4TB spinning disk, one 2TB spinning disk, and one 500GB SSD that holds the root filesystem. The majority of the data I want to keep is on the 2TB drive, and the 4TB drive is mostly empty. After doing some research (read: browsing posts on /r/datahoarder), it seems the two most common tools people use to add transparent redundancy are a snapraid + mergerfs combo, or the old standby, ZFS.
Installing ZFS on Linux
Getting ZFS installed on Linux (assuming you don’t try to use it as the root filesystem) is almost comically easy these days. On Ubuntu 16.04+ (and probably recent Debian releases too), this should be as straightforward as:
sudo apt install zfs-dkms zfs-fuse zfs-initramfs zfsutils-linux
Explanation:
For simplicity, the above command installs more than is strictly needed: zfs-dkms and zfs-fuse are different implementations of ZFS for Linux, and either should be enough to use ZFS on its own. The reason there are multiple implementations comes down to how Linux handles filesystem drivers: zfs-dkms uses a technology (unsurprisingly) called DKMS, while zfs-fuse uses (even less surprisingly) a technology called FUSE. FUSE makes it easier for developers to implement filesystems at the cost of a bit of performance. DKMS stands for Dynamic Kernel Module Support, and is a means by which you install the source code for a kernel module and let the Linux distribution take care of compiling it to match the running kernel.
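If you go the DKMS route, a quick sanity check (output will vary by distro and kernel version) is to ask DKMS what it has built and confirm the module actually loads:

# List the kernel modules DKMS has built for the running kernel
dkms status
# Load the ZFS module and confirm it shows up
sudo modprobe zfs
lsmod | grep zfs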
For Arch Linux you’ll need to use the AUR and install zfs-linux. Check the Arch wiki’s article on ZFS for more detailed instructions, but for most systems this should suffice:
sudo pacman -Syu zfs-linux
Planning your drives
The first step to getting started with ZFS was to figure out how I wanted to use my drives. Most people who use ZFS for these purposes seem to go out and buy multiple big hard drives, and then use ZFS to mirror them. I just wanted more data redundancy on the drives I already had, so I decided to partition my drives.
Since I have one 2TB drive that I want backed up, I first partitioned my 4TB drive into two 2TB partitions using gparted, and then created an ext4 filesystem on the second of those partitions to hold my data temporarily.
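If you’d rather do the same partitioning from the command line, a rough sketch with parted looks like the following (I’m assuming the 4TB disk shows up as /dev/sdc, as it does on my machine; these commands wipe the disk, so double-check with lsblk first):

# Create a fresh GPT partition table on the 4TB disk (destroys existing data)
sudo parted /dev/sdc mklabel gpt
# Split the disk into two roughly equal 2TB partitions
sudo parted /dev/sdc mkpart primary 0% 50%
sudo parted /dev/sdc mkpart primary 50% 100%
# Put an ext4 filesystem on the second partition
sudo mkfs.ext4 /dev/sdc2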
Then I used blkid
and lsblk
to check my handiwork. These two tools print
lists of all the “block devices” (read: hard disks) in my system and show
different ways to refer to them in Linux:
$ blkid
/dev/sda1: UUID="7600-739F" TYPE="vfat" PARTUUID="ded30b23-f318-433c-bfb2-15738d42cc01"
/dev/sda2: LABEL="500gb-ssd-root" UUID="906bd064-2156-4a88-8d88-8940af7c5a34" TYPE="ext4" PARTLABEL="500gb-ssd-root" PARTUUID="cc6695ed-1a2b-4cb1-b302-37614cf07bf7"
/dev/sdc1: LABEL="zstore" UUID="5303013864921755800" UUID_SUB="17834655468516818280" TYPE="ext4" PARTUUID="072d0dd9-a1bf-4c67-b9b3-046f37c48846"
/dev/sdc2: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="266677788785228698" TYPE="ext4" PARTLABEL="extra2tb" PARTUUID="1f9e7fd1-1da6-4dbd-9302-95f6ea62fff0"
/dev/sdb1: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="89185545293388421" TYPE="zfs_member" PARTUUID="5626d9ea-01"
/dev/sde1: UUID="acd97a41-df27-4b69-924c-9290470b735d" TYPE="ext4" PARTLABEL="wd2tb" PARTUUID="6ca94069-5fc8-4466-bba2-e5b6237a19b7"
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 1.8T 0 disk
└─sdb1 8:17 0 1.8T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
sdc 8:32 0 3.7T 0 disk
├─sdc2 8:34 0 1.8T 0 part
└─sdc1 8:33 0 1.8T 0 part
sda 8:0 0 477G 0 disk
├─sda2 8:2 0 476.4G 0 part
└─sda1 8:1 0 512M 0 part
Explanation:
If you’re not familiar with how Linux handles hard disks: Linux refers to hard disks as “block devices,” and provides access to physical hardware through a virtual filesystem it mounts at /dev. Depending on what type of hard drive you have, disks will generally show up as /dev/sdX, where the X is a letter from a-z that Linux assigns to the drive. Partitions on each disk are then assigned a number, so in the lsblk output above you can see that disk sdc has two partitions, which show up in the output as sdc1 and sdc2.
The blkid command shows the traditional /dev/sdX labels, but also adds UUIDs, which you can think of as random IDs that will always refer to that particular disk or partition. The reason for this is that if you were to unplug one of your drives and plug it into a different port, Linux might give it a different /dev/sdX name (e.g. if you unplugged the /dev/sdc drive and plugged it into another port it might become /dev/sda), but it would keep the same UUID.
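You can see these stable names for yourself: Linux exposes them as symlinks under /dev/disk, each pointing at whichever /dev/sdX device currently holds that UUID.

# Stable identifiers, each symlinked to the current /dev/sdX device
ls -l /dev/disk/by-uuid/
ls -l /dev/disk/by-id/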
I wanted to convert my 2TB drive to ZFS, but since my precious data is all currently located on that drive (/dev/sdb1 above), I decided to pull a swaparoo: first copy everything onto the second partition of my 4TB drive (/dev/sdc2 above), then let ZFS take over the original partition (/dev/sdb1) and copy the data back onto that drive.
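The copy itself has nothing to do with ZFS; as a sketch (the mount points here are made up for illustration), rsync does the job and can safely be re-run if it gets interrupted:

# Mount the temporary ext4 partition
sudo mkdir -p /mnt/extra2tb
sudo mount /dev/sdc2 /mnt/extra2tb
# Copy everything from wherever the old 2TB drive is mounted,
# preserving permissions, hard links, ACLs and extended attributes
sudo rsync -aHAX --info=progress2 /mnt/longterm/ /mnt/extra2tb/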
The end result I’m looking for is a layout with two “pools” (ZFS-speak for sets of drives; more on this later). One pool should consist of my original 2TB drive, replicated to one of the 2TB partitions on my 4TB drive. The extra 2TB partition available on the 4TB drive will act as a second pool, which gives me nice ZFS benefits like checksumming and the ability to take snapshots of the drive, as well as the option to add another 2TB drive/partition later and mirror the data.
If you’re already familiar with zpool
, this is what the finished setup looks
like:
$ sudo zpool status
pool: longterm
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
longterm ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc2 ONLINE 0 0 0
sdb1 ONLINE 0 0 0
pool: zstore
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
zstore ONLINE 0 0 0
sdc1 ONLINE 0 0 0
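For reference, a layout like this could be created directly with something along these lines if both partitions were empty (don’t run this against partitions holding data you care about; in my case the data shuffle described above means getting there in stages):

# Mirrored pool across the old 2TB partition and one 2TB partition of the 4TB drive
sudo zpool create longterm mirror /dev/sdb1 /dev/sdc2
# Single-device pool on the remaining 2TB partition
sudo zpool create zstore /dev/sdc1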
ZFS terminology and concepts: mirrors, stripes, pools, vdevs and parity
ZFS introduces quite a few new concepts and terms, which can take
some getting used to. The first bit to understand is what ZFS actually does.
ZFS usually works with pools of drives (hence the name of the zpool
command),
and allows you to do things like mirroring or striping the drives.
And what does it mean to mirror or stripe a drive, you ask? When two drives are mirrored they do everything in unison: any data written to one drive is also written to the other at the same time. This way, if one of your drives were to fail, your data would still be safe and sound on the other drive. And if you installed a new hard drive to replace the failed one, ZFS would automatically take care of syncing all your data back onto it, through a process ZFS calls “resilvering.”
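In zpool terms a two-way mirror and the replacement of a failed drive look roughly like this (device names are placeholders); the resilver kicks off automatically once the new drive is attached:

# Create a pool made of a single mirror vdev spanning two drives
sudo zpool create tank mirror /dev/sdX /dev/sdY
# If sdY later dies, swap in a new drive and ZFS resilvers onto it
sudo zpool replace tank /dev/sdY /dev/sdZ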
Striping is a different beast. Mirroring drives is great for redundancy, but has the obvious drawback that you only get to use half the disk space you have available. Sometimes the situation calls for the opposite trade-off: if you bought two 2TB drives and you wanted to be able to use all 4TB of available storage, striping would let you do that. In striped setups ZFS writes “stripes” of data to each drive. This means that if you write a single file ZFS may actually store part of the file on one drive and part of the file on another.
This has many advantages: it speeds up your reads and writes by making them concurrent. Since it’s storing pieces of one file on each drive, both drives can be writing at the same time, so your write speed could theoretically double. Read speed also gets a boost since you can also read from both drives at the same time. The downside to all this speed and space is that your data is less safe. Since your data is split between two drives, if one of the hard drives dies you will probably lose all your data – no one file will be complete because while your good drive might have half the file on it, the other half is gone with your dead hard disk. So in effect you’re trading close to double the speed and space for close to double the risk of losing all your data. Depending what you’re doing that might be a good choice to make, but I wouldn’t put any data I didn’t want to lose into a striped setup.
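In zpool terms, striping is simply what you get when you list devices without a mirror (or raidz) keyword; a placeholder sketch:

# Two single-drive vdevs: data is striped across them, so you get
# the combined capacity and speed but no redundancy at all
sudo zpool create scratch /dev/sdX /dev/sdY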
There’s a third type, a sort of compromise solution, which is to use parity. This type of setup is frequently referred to as RAIDZ (or RAIDZ2 or RAIDZ3) and sits somewhere between a full-on striped setup and a mirrored setup. It uses what’s called parity data, spread across the drives, to act as a kind of semi-backup. This is backed by a lot of complicated math that I don’t pretend to understand, but the take-home message is that it provides a way to restore your data if a drive fails. So if you have three 2TB drives, you can stripe them but dedicate one drive’s worth of space to parity. In this setup you’d have 4TB of available storage, but if a single drive were to fail you wouldn’t lose any data (although performance would probably be pretty horrible until you replaced the failed disk). Think of it as a kind of half backup. You can tweak the ratio as well: dedicating more drives’ worth of space to parity lets you survive more failing drives without losing data, and this is what the 2 and 3 in RAIDZ2 and RAIDZ3 mean.
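The corresponding zpool commands use the raidz keywords; again a sketch with placeholder device names:

# Single parity: roughly two drives' worth of usable space,
# and the pool survives one drive failure
sudo zpool create tank raidz /dev/sdX /dev/sdY /dev/sdZ
# Double parity (raidz2) survives two failures; you'd normally use more drives
# sudo zpool create tank raidz2 /dev/sdW /dev/sdX /dev/sdY /dev/sdZ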
There’s more info on the different RAID levels you can use with ZFS here and here.
Now that we’ve gone over the high-level concepts of drive arrays and RAID, we
can dive into the more ZFS-specific aspects. The first item to go over is the
concept of a vdev. A vdev is a “virtual device,” and when zpool
pools drives
it pools collections of these virtual devices using one of the RAID approaches
(striped or mirrored) we discussed above. However what makes vdevs useful is
that you can put more than one physical drive (or partition) into a single
vdev.
While the pool itself stripes data across its vdevs, each individual vdev can be a mirror (or a RAIDZ group) built from several drives. This is part of what makes ZFS so flexible. For example, you could get the speed benefits of a striped setup with the redundancy benefits of a mirrored setup by creating two mirror vdevs, each of which mirrors data across two physical drives. You could then add both vdevs into a single pool, which stripes data across them, to get the fast reads and writes that striping allows without running the risk of losing your data if a single drive were to fail (this is actually a fairly popular setup and is known as RAID10 outside of ZFS-land).
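On real drives that RAID10-style setup is a one-liner (placeholder names again), and we’ll build exactly the same shape out of plain files in the Experimenting section below:

# Two mirror vdevs; the pool stripes data across them
sudo zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd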
This can get quite complicated quite quickly, but this article (backup link here since original was down at the time of writing) does a nice job walking through the various permutations of vdevs and zpools that are possible.
Experimenting
ZFS can also use ordinary files as backing devices, which is a nice way to play with ZFS without having to invest in lots of hard drives. Let’s run through a few of the possibilities with some file-backed devices so you can get a feeling for how ZFS works.
When ZFS uses files on another filesystem instead of accessing devices directly
it requires that the files be allocated first. We can do that with a shell
for
loop by using the dd
command to copy 1GB of zeros into each file (you
should make sure you have at least 4GB of available disk space before running
this command):
for i in 1 2 3 4; do dd if=/dev/zero of=zfs$i bs=1024M count=1; done
Now that we have our empty files we can put them into a ZFS pool:
sudo zpool create testpool mirror $PWD/zfs1 $PWD/zfs2 mirror $PWD/zfs3 $PWD/zfs4
NOTE: The $PWD above is important: ZFS requires absolute paths when using files.
You should now have a new zpool mounted at /testpool
. Check on it with zpool status
:
$ zpool status
pool: testpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
testpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/home/nik/zfs1 ONLINE 0 0 0
/home/nik/zfs2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/home/nik/zfs3 ONLINE 0 0 0
/home/nik/zfs4 ONLINE 0 0 0
errors: No known data errors
Your new ZFS filesystem is now live; you can cd to /testpool and copy some files into it.
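When you’re done experimenting you can tear everything down and reclaim the space:

# Destroy the pool (this unmounts /testpool and discards its contents)
sudo zpool destroy testpool
# Remove the backing files
rm zfs1 zfs2 zfs3 zfs4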
Next steps
We’ve gone over the basics of ZFS. In the next post we’ll move on to some of the more powerful and advanced features ZFS offers, like compression, snapshots, the zfs send and zfs receive commands, and the secret .zfs directory.