
FreeBSD ZFS Vol0 Basic Concepts

 ·  🎃 kr0m

ZFS is an advanced file system whose design solves most of the problems present in traditional file systems. It was originally developed by Sun Microsystems and is currently maintained by the open source community under the name OpenZFS. The roles of volume manager and file system are unified, allowing the pool's storage space to be shared among several file systems/datasets and providing a more coherent view of the storage.

ZFS has three main objectives: data integrity, pooled storage, and performance (both read and write):

  • Data integrity: When data is written to the file system, a checksum is calculated and stored with it. When the data is read back, the checksum is recalculated and compared against the stored one. If they do not match, data corruption has been detected. If the damaged data has redundancy, the problem is corrected automatically; otherwise, only detection is possible. The redundancy depends on the type of pool and on the copies parameter of the dataset.

  • Pooled storage: Storage devices are assigned to a pool, and the space used by our data is obtained from this shared pool. The pool can grow in size simply by adding more storage devices to it.

  • Read performance: By caching the most frequently used data, ZFS achieves superior performance in read operations. The cache consists of two levels:

    • ARC: RAM cache of the most frequently used data.
    • L2ARC: Second-level cache on disk, usually a fast device such as an SSD.
  • Write performance: To increase write performance, ZFS provides the ZIL/SLOG:

    • ZIL: Part of the pool used when the OS issues a synchronous write request, that is, when it wants the data committed to disk immediately rather than held in RAM. The data is written to the ZIL and is flushed to the pool when ZFS deems it appropriate. If the system shuts down abruptly, the ZIL is replayed and the data is transferred to the pool, so no data is lost. Most applications will not benefit from the ZIL; only those that perform synchronous writes, such as databases or remote storage systems like NFS or VMware datastores, will. Access to the ZIL is faster than access to the pool itself because in this area of the disk no inode table needs to be consulted, no free space needs to be found for the data, and no internal ZFS structures need to be rewritten: the data is simply written sequentially and later flushed to the pool, where all these factors are taken into account. Every pool has a ZIL.
    • SLOG: Functionally the same as the ZIL, but moved to a dedicated device to relieve the pool of this burden and distribute the IOPS. It is usually a faster device, although even if it is not, there is still a slight improvement, since in any case the IOPS load is being distributed. If we assign several SLOG devices to the same pool, the IOPS are balanced even further. SLOG devices must be mirrored: if they are not and an unexpected shutdown coincides with the failure of the SLOG disk, the synchronous writes not yet flushed to the pool are lost. As a general rule, the SLOG is sized at 16GB for every 64GB of data, but on servers with heavy write traffic the formula usually applied is: maximum write traffic per second x 15. An example of attaching SLOG and L2ARC devices is shown after this list.
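
A minimal sketch, assuming a pool named mypool and fast devices nvd0, nvd1, and nvd2 (hypothetical NVMe device names), of how the SLOG and L2ARC could be attached:

```
# Attach a mirrored SLOG (two fast devices) to the pool:
zpool add mypool log mirror nvd0 nvd1

# Attach a second-level read cache (L2ARC) device:
zpool add mypool cache nvd2

# Verify the new pool layout:
zpool status mypool
```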

The types of vdev are:

| RAID Type | Description | Required Disks | Parity Disks | Fault Tolerance |
|---|---|---|---|---|
| Normal (stripe with 1 disk) | Equivalent to a single disk, but with the capabilities of ZFS and the possibility of later expanding it into a mirror. | 1 | 0 | 0 |
| Striping | Equivalent to RAID0: data is distributed among all disks and written and read in parallel across all of them. | 2+ | 0 | 0 |
| Mirroring | Equivalent to RAID1: all data is written to every disk, and reads are served in parallel from all of them. | 2+ | 0 | N-1 |
| 2-vdev Mirror Pool | Equivalent to RAID10: a pool striped across two mirror (RAID1) vdevs. | 4+ | 0 | 1 per mirror vdev |
| RAID-Z1 | Equivalent to RAID5: the capacity of N-1 disks holds data and one disk's worth holds parity; data is spread across all disks and written and read in parallel. | 3+ | 1 | 1 |
| RAID-Z2 | Equivalent to RAID6: the same as RAID-Z1 but with two parity blocks per stripe. | 4+ | 2 | 2 |
| RAID-Z3 | Sometimes called RAID7: the same as RAID-Z2 but with three parity blocks per stripe. | 5+ | 3 | 3 |
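
As a sketch, assuming hypothetical disks ada1-ada4 and a pool named tank, each layout would be created as follows:

```
# Each command creates a new pool; they are alternatives, not a sequence.
zpool create tank ada1                               # single disk (stripe of 1)
zpool create tank ada1 ada2                          # stripe (RAID0)
zpool create tank mirror ada1 ada2                   # mirror (RAID1)
zpool create tank mirror ada1 ada2 mirror ada3 ada4  # 2-vdev mirror pool (RAID10)
zpool create tank raidz1 ada1 ada2 ada3              # RAID-Z1
zpool create tank raidz2 ada1 ada2 ada3 ada4         # RAID-Z2
```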

When a pool enters degraded mode, its information is still available, but performance is reduced (except in mirrors), since the missing information must be recalculated from the redundancy data. When the defective disk is replaced, it is filled (resilvered) with the missing data, recalculated from the redundancy data. Until the resilvering completes, performance remains affected.

It is not recommended to set up a RAID-Z1, because when a faulty disk is replaced the resilvering process begins, stressing the remaining disks (which are likely the same age as the failed one) with an approximate failure probability of 8%. If a second failure occurs, the data is irretrievably lost. Therefore, a RAID level with a fault tolerance of more than one disk should always be chosen. A sketch of the disk-replacement procedure is shown below.
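
A minimal sketch, assuming a pool named tank in which ada2 has failed and ada5 is the replacement disk (all names hypothetical):

```
# Check which vdev is degraded:
zpool status tank

# Replace the failed disk; resilvering starts automatically:
zpool replace tank ada2 ada5

# Monitor the resilver progress:
zpool status tank
```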

Before starting to configure any ZFS pool, we must clarify some ZFS terminology and functionality:

  • Vdev: Physical disks can be grouped in different ways. For example, we can create a mirror with two disks, a stripe with two more, and a raidz with another three, and form a pool from all of them. Each grouping (the mirror, the stripe, and the raidz in this case) is what is called a vdev.
  • Pool: The union of vdevs into a storage unit. The pool is then used to create file systems (datasets) or block devices (volumes). All datasets and volumes share the pool's space, and the features available to them are determined by the ZFS version of the pool where they were created.
  • Copy-On-Write: When a file is updated under ZFS, the old data is not overwritten. Instead, the new data is written to another location on disk, and only then is the file system metadata updated. In case of a system crash or abrupt shutdown, the original file still exists, since the metadata was not yet updated and still points to the original position, so ZFS never needs to run fsck; at most, only the newest data is lost.
  • Dataset: A generic term for a file system, volume, snapshot, or clone. Each dataset has a unique name in the style poolname/path@snapshot. Child datasets are named hierarchically and inherit the properties of their parents; over several levels of hierarchy, the inherited properties accumulate.
  • File system: Most datasets are used as file systems. These can be mounted and contain files and directories with their permissions, flags, and metadata.
  • Volume: A block device, useful for running other file systems such as UFS on top of ZFS or as backing storage in virtualization systems.
  • Snapshot: Since ZFS is a COW file system, snapshots are instantaneous and initially occupy no additional space. We can revert to a specific snapshot or mount it read-only to recover a specific file. It is also possible to take snapshots of volumes, but not to mount them; they can only be cloned or rolled back.
  • Clone: A writable version of a snapshot, allowing a fork of the file system. A clone occupies no additional space as long as no changes are made to it. A snapshot cannot be deleted while clones depend on it: the snapshot is the parent and the clone its child, but this relationship can be reversed by promoting the clone, after which the snapshot depends on the clone and can be deleted if we wish. Clones are very useful for giving two machines access to common data that both need to modify, since the shared data does not take up double the space. (See the dataset examples after this list.)
  • Checksum: ZFS supports several checksum algorithms: fletcher2, fletcher4, and sha256. All of them can suffer collisions, but sha256 is the most reliable, at the cost of a higher computational load.
  • Compression: Besides reducing the space required to store files, compression brings a derived benefit: reduced I/O, since reading or writing a file touches fewer blocks. Each dataset has its own compression setting (see the property examples after this list), and the available algorithms are:
    • LZ4: The recommended algorithm in ZFS. Compared to LZJB, it compresses compressible data 50% faster, is 3x faster on non-compressible data, and decompresses 80% faster.
    • LZJB: Offers good compression without consuming as much CPU as GZIP.
    • GZIP: Its main advantage is a configurable compression level, letting the administrator tune how much CPU to devote to the compression task.
    • ZLE: Only compresses sequences of zeros; useful only if the dataset will contain long runs of zeros.
  • Copies: The number of copies to keep of each block of data can be specified. This way, even with a single disk, damaged data can be recovered, since it is stored in several areas of the disk.
  • Deduplication: Deduplication stores the hash of every existing block in RAM; when a block is written, it is checked against this table. If the block already exists, a reference to it is created; if not, it is written as a new block. It requires a considerable amount of RAM, about 5-6GB per terabyte of data. If there is not enough RAM, L2ARC can help alleviate the problem, but it is probably better to use compression instead, since compression consumes no extra RAM and also saves space. Deduplication is only useful for identical files (TXT, HTML, PDF, DOC); if even the slightest change is made to a file, it is no longer deduplicated. For virtualization it will probably only pay off with jails; with bhyve, a large amount of RAM is consumed without any benefit. There are two checking modes:
    • on: Only the hash stored in RAM is checked; collisions may occur.
    • verify: The hash stored in RAM is checked, and the existing block is also compared byte by byte with the new one.
  • Scrub: The checksums of all blocks can be verified manually; blocks with redundancy are repaired along the way (see the maintenance examples after this list).
  • Quota: ZFS allows quotas on datasets, users, or system groups, but not on volumes, since those are already limited by their own nature.
  • Reservation: By reserving space, we ensure a specific dataset always has a certain amount of space available. This can be very useful for system logs or any critical service on the machine.
  • Resilver: The process of filling a new disk that has replaced a failed one.
  • Vdev/pool status:
    • Online: The vdev is connected and fully operational.
    • Offline: The vdev was disabled by the administrator to perform a maintenance task or a replacement.
    • Degraded: The pool has a failing or offline vdev that needs to be replaced or reincorporated. The pool keeps working, but in degraded mode.
    • Faulted: The pool is inoperative, probably because more vdevs/disks have failed than the redundancy can tolerate.
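
A minimal sketch of the dataset lifecycle, assuming a pool named tank (all dataset names are hypothetical):

```
zfs create tank/home                       # create a file system dataset
zfs create -V 10G tank/vm0                 # create a 10GB volume (zvol)
zfs snapshot tank/home@backup              # instantaneous snapshot
zfs rollback tank/home@backup              # revert the dataset to the snapshot
zfs clone tank/home@backup tank/home-fork  # writable clone of the snapshot
zfs promote tank/home-fork                 # reverse the snapshot/clone dependency
```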
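
The per-dataset properties described above (compression, copies, checksum, deduplication) are set with zfs set; for example, on the hypothetical tank/home dataset:

```
zfs set compression=lz4 tank/home   # recommended compression algorithm
zfs set copies=2 tank/home          # keep two copies of every block
zfs set checksum=sha256 tank/home   # most reliable checksum, more CPU
zfs set dedup=verify tank/home      # hash check plus byte-by-byte comparison
```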
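
Finally, a sketch of the maintenance operations mentioned above, again assuming a pool named tank and a hypothetical user kr0m:

```
zpool scrub tank                      # verify all checksums, repairing redundant blocks
zpool status tank                     # show vdev states and scrub/resilver progress
zfs set quota=50G tank/home           # cap the dataset at 50GB
zfs set userquota@kr0m=10G tank/home  # per-user quota on the dataset
zfs set reservation=5G tank/home      # guarantee 5GB for this dataset
```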

We can see a more detailed list at the following link: https://www.freebsd.org/doc/handbook/zfs-term.html#zfs-term-dataset

We must keep in mind that in FreeBSD there is no performance penalty for using partitions instead of whole disks for pools, which differs from the recommendations in the Solaris documentation. In fact, in FreeBSD we should NOT use whole disks in any case: if the OS boots from such a pool and there are no partitions, there is nowhere to store the boot code, making the system unbootable; and even for a data pool it can cause problems, since the size of an unpartitioned disk cannot be determined reliably. A partitioning sketch is shown below.
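
A minimal partitioning sketch with gpart, assuming a blank disk ada1 destined for a data pool (device and label names are hypothetical):

```
gpart create -s gpt ada1                       # create a GPT partition table
gpart add -t freebsd-zfs -a 1m -l zdisk1 ada1  # add a 1MB-aligned, labeled ZFS partition
zpool create tank gpt/zdisk1                   # build the pool on the labeled partition
```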

Regarding RAM usage, the usual recommendation is 1GB per terabyte of data; with deduplication enabled, 5-6GB per terabyte. If we do not have enough RAM, we can use L2ARC to mitigate the problem, but it is probably better to use compression instead of deduplication, since compression does not consume RAM and also saves space.

If you liked the article, you can treat me to a RedBull here