##The Z File System (ZFS)
The Z file system, originally developed by Sun Microsystems, is designed to future-proof the file system by removing many of the arbitrary limits imposed on previous file systems. ZFS allows the storage pool to grow continuously as additional devices are added over time. Each top level device is called a **VDEV**, which can be a simple disk or a RAID transformation such as a mirror or RAID-Z array. ZFS combines the roles of file system and volume manager, enabling the addition of storage devices to a live system and making the new space immediately available to all of the existing file systems in the pool. ZFS file systems (called datasets) each have access to the combined free space of the entire pool. As blocks are allocated, the free space available to each file system decreases. This approach avoids the common pitfall of extensive partitioning, free space fragmentation, as all of the free space remains available to every file system.

###ZFS Features and Components

####ARC (Adaptive Replacement Cache)

####L2ARC

####Copy-On-Write
Unlike a traditional file system, when data is overwritten on ZFS the new data is written to a different block rather than overwriting the old data in place. Only once this write is complete is the metadata updated to point to the new location of the data. This means that in the event of a shorn write (a system crash or power loss in the middle of writing a file), the entire original contents of the file are still available and the incomplete write is discarded. It also means that ZFS does not require a fsck after an unexpected shutdown.

####Checksums
Every block that is allocated is also checksummed (the algorithm is a per-dataset tunable property, see: zfs set). ZFS transparently validates the checksum of each block as it is read, allowing ZFS to detect silent corruption. If the data that is read does not match the expected checksum, ZFS will attempt to recover the data from any available redundancy (mirrors, RAID-Z). In place of something like fsck, ZFS has the 'scrub' command, which reads all of the data stored on the pool and verifies it against the known good checksums. This allows for a periodic check of all the data stored on the pool, and the recovery of any corrupted blocks before they are needed. The available checksum algorithms include fletcher2, fletcher4 and sha256. The fletcher algorithms are faster, but sha256 is a strong cryptographic hash with a much lower chance of collisions, at the cost of some performance. Checksums can be disabled, but this is strongly discouraged.

####Snapshots
The copy-on-write design of ZFS allows for nearly instantaneous, consistent snapshots with arbitrary names. After taking a snapshot of a dataset (or a recursive snapshot of a parent dataset that will include all child datasets), new data is written to new blocks (as described above), but the old blocks are not reclaimed as free space. There are then two versions of the file system: the snapshot (what the file system looked like before) and the live file system, yet no additional space is used. As new data is written to the live file system, new blocks are allocated to store it. The apparent size of the snapshot grows as blocks stop being used by the live file system and are only used by the snapshot. These snapshots can be mounted (read only) to allow for the recovery of previous versions of files.
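For example, assuming a hypothetical dataset named mypool/data mounted at /mypool/data, a snapshot can be created with an arbitrary name and its read-only contents browsed through the dataset's hidden .zfs/snapshot directory:

# zfs snapshot mypool/data@before_edit
# zfs list -t snapshot
# ls /mypool/data/.zfs/snapshot/before_edit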
It is also possible to 'rollback' a live file system to a specific snapshot, undoing any changes that took place after the snapshot was taken. Each block in the zpool has a reference counter which indicates how many snapshots, clones, datasets or volumes make use of that block. As files and snapshots are deleted, the reference count is decremented; once a block is no longer referenced, it is reclaimed as free space. Destroying one of several snapshots therefore only frees the blocks referenced exclusively by that snapshot; blocks still referenced by another snapshot, a clone or the live file system remain allocated. Snapshots can also be marked as 'held', using the 'zfs hold' and 'zfs release' commands. If a snapshot is held, any attempt to destroy it will return an EBUSY error. Each snapshot can have multiple holds, each with a unique name. Snapshots can also be taken of volumes, however they can only be cloned or rolled back, not mounted independently.

####Clones
Snapshots can also be cloned; a clone is a writable version of a snapshot, allowing the file system to be 'forked' as a new dataset. As with a snapshot, a clone initially consumes no additional space; only as new data is written to the clone and new blocks are allocated does the apparent size of the clone grow. As blocks are overwritten in the cloned file system or volume, the reference count on the previous block is decremented. The snapshot upon which a clone is based cannot be deleted because the clone depends on it (the snapshot is the parent, and the clone is the child). Clones can be 'promoted', reversing this dependency and making the clone the parent and the previous parent the child. This operation requires no additional space, but it will change the way the used space is accounted.

####Volumes
In addition to regular file systems (datasets), ZFS can also create volumes, which are block devices. Volumes have many of the same features, including copy-on-write, snapshots, clones and checksumming.

####Compression
Each dataset in ZFS has a compression property, which defaults to off. This property can be set to one of a number of compression algorithms, causing all new data written to the dataset to be compressed as it is written. In addition to the reduction in disk usage, this can also increase read and write throughput, as only the smaller compressed version of the file needs to be read or written.

####Deduplication
ZFS can detect duplicate blocks of data as they are written (thanks to the checksumming mentioned above). If deduplication is enabled, instead of writing the block a second time, the reference count of the existing block is increased, saving storage space. To do this, ZFS keeps a deduplication table (DDT) in memory, containing the list of unique checksums, the location of each block and a reference count. When new data is written, its checksum is calculated and compared to the list; if a match is found, the data is considered to be a duplicate. When deduplication is enabled, the checksum algorithm is changed to SHA256 to provide a secure cryptographic hash. ZFS deduplication is tunable: if dedup is 'on', then a matching checksum is assumed to mean that the data is identical. If dedup is set to 'verify', then the data in the two blocks is compared to ensure it is actually identical; if it is not, the hash collision is noted by ZFS and the two blocks are stored separately.
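Deduplication is enabled per dataset by setting the dedup property (mypool/data is a hypothetical dataset; dedup=verify requests the byte-for-byte comparison described above):

# zfs set dedup=on mypool/data

The pool-wide deduplication ratio is reported in the DEDUP column of zpool list:

# zpool list mypool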
Because the DDT must store the hash of each unique block, it consumes a very large amount of memory (a general rule of thumb is 5-6 GB of RAM per TB of deduplicated data). In situations where it is not practical to have enough RAM to keep the entire DDT in memory, performance will suffer greatly as the DDT must be read from disk before each new block is written. Dedup can make use of the L2ARC to store the DDT, providing a middle ground between fast system memory and slower disks. It may be advisable to use ZFS compression instead, which often provides nearly as much reduction in disk usage without the additional memory requirement. In some cases, recovering a deduplicated dataset may not even be possible without enough memory to hold the DDT.

####Accounting and Quotas
ZFS provides very fast and accurate user and group space accounting, in addition to quotas and space reservations. Without traversing the entire directory tree, the zfs command can report how much of the pool is used by each user:

# zfs userspace mypool
TYPE        NAME    USED  QUOTA
POSIX User  root   4.17G   none
POSIX User  allan   618M   none
POSIX User  stefan  117K   none

Note: the userspace command only shows data for the specified dataset. It does not include child datasets; each dataset or snapshot needs to be interrogated individually.

ZFS has two types of quotas: dataset quotas and user/group quotas. The former applies to the entire dataset, regardless of which user owns the data.

# zfs set quota=100G mypool/parent

Note: Quotas on a dataset apply to all descendants and snapshots. If an additional quota is set on a child of a parent with a quota, both will be enforced on the child.

It is also possible to limit a dataset based on its 'referenced' size, which does NOT include the space used by descendants and snapshots:

# zfs set refquota=200G mypool

User and group quotas allow limiting the space consumed by specific users and groups. The space calculation does NOT include space used by descendant datasets such as snapshots and clones.

# zfs set userquota@myuser=10G mypool/parent
# zfs set groupquota@mygroup=25G mypool/parent/child

Note: Enforcement of user and group quotas may be delayed by several seconds, meaning a user may exceed their quota before the system starts refusing additional writes. User and group quotas do not apply to volumes.

In addition to limiting the space that each dataset can use with quotas, specific datasets can be guaranteed a minimum amount of the pool's available space by setting a reservation.

# zfs set reservation=1T mypool/parent/child

When a reservation is created and the dataset is using less space than the reservation, the dataset is treated as if it were using all of the reserved space, removing that space from the free space available to other datasets. Reserved space is accounted for in the used space of the parent dataset, and counts against the quota of the parent dataset. A **refreservation** is an alternate version of a reservation that only applies to referenced space and does not include snapshots and descendant datasets.

# zfs set refreservation=100G mypool
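To review how these limits affect space accounting, the configured values and a per-dataset breakdown of used space can be displayed at any time (the dataset names continue the hypothetical examples above):

# zfs get quota,refquota,reservation,refreservation mypool/parent
# zfs list -o space -r mypool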
####Delegation

####Replication

####VDEV Types

#####Disk
The most basic type of VDEV is a block device. This can be an entire disk (such as /dev/ada0 or /dev/da0) or a partition (/dev/ada0p3). Contrary to the Solaris documentation, on FreeBSD there is no performance penalty for using a partition rather than an entire disk.

#####File
In addition to disks, ZFS pools can be backed by regular files; this is especially useful for testing and experimentation. Use the full path to the file as the device path in the zpool create command.

#####Mirror
When creating a mirror, specify the 'mirror' keyword followed by the list of member devices for the mirror. A mirror consists of two or more devices, and data will be written to all members in an identical fashion. A mirror VDEV will only hold as much data as its smallest member. A mirror VDEV can withstand the failure of all but one of its members without losing any data.

Note: A regular single disk VDEV can be upgraded to a mirror VDEV using the zpool attach command.

#####RAID-Z
ZFS implements RAID-Z, a variation on RAID-5 that offers better distribution of parity and eliminates the "RAID-5 write hole" in which the data and parity information become inconsistent after an unexpected restart. ZFS provides 3 levels of RAID-Z, which offer varying levels of redundancy in exchange for different amounts of usable storage. The types are named RAID-Z1 through RAID-Z3 based on the amount of parity in the array and the number of disks the pool can operate without. In a RAID-Z1 configuration with 4 disks of 1TB each, usable storage will be 3TB, and the pool will still operate with one faulted disk. If an additional disk goes offline before the faulted disk is replaced and resilvered, all data in the pool can be lost. In a RAID-Z3 configuration with 8 disks of 1TB, the volume will provide 5TB of usable space and still be able to operate with three faulted disks.

Sun recommends no more than 9 disks in a single VDEV. If the configuration has more disks, it is recommended to divide them into separate **VDEV**s, and the pool data will be striped across them. A configuration of 2 RAID-Z2 **VDEV**s consisting of 8 disks each would create something similar to a RAID 60 array. A RAID-Z group's storage capacity is approximately the size of the smallest disk multiplied by the number of non-parity disks: 4x 1TB disks in RAID-Z1 have an effective size of approximately 3TB, and an 8x 1TB array in RAID-Z3 will yield 5TB of usable space.

#####Spare
ZFS has a special pseudo-VDEV type for keeping track of available hot spares. Note that hot spares are not used automatically; they must be manually configured to replace the failed device, using the zpool replace command.

#####Log
ZFS log devices, also known as the ZFS Intent Log (ZIL), move the intent log to a dedicated device. The ZIL accelerates synchronous transactions by using faster storage devices (such as SSDs) than those used for the main pool. When data is being written and the application requests a guarantee that the data has been safely stored, the data is written to the faster ZIL storage and later flushed out to the regular disks, greatly reducing the latency of synchronous writes. Log devices can be mirrored, but RAID-Z is not supported. When specifying multiple log devices, writes will be load balanced across all of them.

#####Cache
Adding a cache VDEV to a zpool will add the storage of the cache device to the L2ARC described previously. Cache devices cannot be mirrored; since a cache device only stores additional copies of existing data, there is no risk of data loss if it fails.
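Log and cache devices are added to an existing pool with the zpool add command (the device names here are hypothetical); a mirrored log is specified with the mirror keyword:

# zpool add mypool log mirror /dev/ada4 /dev/ada5
# zpool add mypool cache /dev/ada6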
##Using ZFS
ZFS is started as a service during system boot. To have ZFS datasets mounted automatically at boot time, set zfs_enable to YES in /etc/rc.conf:

# echo 'zfs_enable="YES"' >> /etc/rc.conf

Once it is enabled, it is also possible to start ZFS manually:

# service zfs start

###Creating a ZPool

####Single Disk Pool
To create a simple, non-redundant ZFS pool using a single disk device (where poolname is the name of the pool and /dev/ada0 is the disk to use):

# zpool create poolname /dev/ada0

The new pool will automatically be mounted as /poolname.

####Mirrored Disk Pool
To create a mirrored pool, providing redundancy at the cost of half of the available disk space:

# zpool create mypool mirror /dev/ada0 /dev/ada1

To create multiple VDEVs at once, the following will yield something similar to RAID 1+0:

# zpool create mypool mirror /dev/ada0 /dev/ada1 mirror /dev/ada2 /dev/ada3

####RAID-Z Pool
RAID-Z pools require 3 or more disks, but yield more usable space than mirrored pools:

# zpool create mypool raidz1 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3
# zpool create mypool raidz3 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 /dev/ada5 /dev/ada6 /dev/ada7

To complete the RAID-Z configuration, enable status updates about the file systems created during the nightly periodic(8) runs:

# echo 'daily_status_zfs_enable="YES"' >> /etc/periodic.conf

###Recovering a Faulted Pool

###ZFS Datasets
Once a pool has been created, additional datasets (file systems) can be created, each with its own ZFS options. Datasets are created as descendants (children) of the root pool or of another existing dataset.

# zfs create mypool/parent
# zfs create mypool/parent/child

It is possible to adjust the 'mountpoint' property to mount a child dataset at a different location, rather than under the parent dataset:

# zfs set mountpoint=/usr/local/child mypool/parent/child

###ZFS Dataset Properties
Each ZFS dataset has a number of properties that control and tune it. Each dataset inherits its properties from its parent, unless a property is specifically set on the dataset. Some dataset properties are read only, providing useful information. For example, compression can be enabled on a dataset, after which all new files written to the dataset will be compressed. Files written previously will not be compressed until they are rewritten.

# zfs set compression=on mypool/parent

To check the efficiency of compression on a dataset, use the compressratio property:

# zfs get compressratio mypool/parent

To view all of the properties of a dataset:

# zfs get all mypool/parent/child

The child dataset inherits the 'compression' property from its parent, unless it has its own compression property:

# zfs set compression=gzip-9 mypool/parent/child

###Using Snapshots
To create a snapshot of any dataset with the zfs snapshot command, reference the dataset name followed by the @ symbol and an arbitrary name for the snapshot:

# zfs snapshot mypool/parent@2013-05-14_15.27

A recursive snapshot creates a consistent snapshot of the dataset and all of its child datasets. To create a snapshot of the root dataset (mypool) as well as its child datasets:

# zfs snapshot -r mypool@before_upgrade
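Existing snapshots can be listed recursively, and a dataset can later be rolled back to one of its snapshots, discarding everything written since the snapshot was taken:

# zfs list -r -t snapshot mypool
# zfs rollback mypool/parent@2013-05-14_15.27

Note: zfs rollback only reverts to the most recent snapshot of a dataset; rolling back to an earlier snapshot requires the -r flag, which destroys the intervening snapshots.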