20.3. zpool Administration

The administration of ZFS is divided between two main utilities: the zpool utility, which controls the operation of the pool and deals with adding, removing, replacing, and managing disks, and the zfs utility, which deals with creating, destroying, and managing datasets (both filesystems and volumes).

20.3.1. Creating & Destroying Storage Pools

Creating a ZFS Storage Pool (zpool) involves making a number of decisions that are relatively permanent because the structure of the pool cannot be changed after the pool has been created. The most important decision is which types of vdevs to group the physical disks into. See the list of vdev types for details about the possible options. After the pool has been created, most vdev types do not allow additional disks to be added to the vdev. The exceptions are mirrors, which allow additional disks to be added to the vdev, and stripes, which can be upgraded to mirrors by attaching an additional disk to the vdev. Although additional vdevs can be added to a pool, the layout of the pool cannot be changed once the pool has been created; instead, the data must be backed up and the pool recreated.
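
As a hypothetical example, a RAID-Z2 pool laid out like the mypool pool shown in the status output later in this section could be created with a command similar to the following. The pool name and the adaXp3 partition names are placeholders and must be replaced with the devices on the actual system:

# zpool create mypool raidz2 /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3 /dev/ada4p3 /dev/ada5p3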

A ZFS pool that is no longer needed can be destroyed so that the disks making up the pool can be reused in another pool or for other purposes. Destroying a pool involves unmounting all of the datasets in that pool. If the datasets are in use, the unmount operation will fail and the pool will not be destroyed. The destruction of the pool can be forced with -f, but this can cause undefined behavior in applications which had open files on those datasets.
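
For example, assuming a pool named mypool that is no longer needed, it would be destroyed with:

# zpool destroy mypool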

20.3.2. Adding and Removing Devices

Adding disks to a zpool can be broken down into two separate cases: attaching a disk to an existing vdev with zpool attach, or adding vdevs to the pool with zpool add. Only some vdev types allow disks to be added to the vdev after creation.
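
For example, assuming a hypothetical pool named mypool with a single-disk (stripe) vdev containing ada0p3, a second disk could be attached to turn that vdev into a mirror. The pool and device names here are placeholders:

# zpool attach mypool ada0p3 ada1p3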

When adding disks to the existing vdev is not an option, as in the case of RAID-Z, the other option is to add a vdev to the pool. It is possible, but discouraged, to mix vdev types. ZFS stripes data across each of the vdevs. For example, if there are two mirror vdevs, then this is effectively a RAID 10, striping the writes across the two sets of mirrors. Because ZFS allocates space so that each vdev reaches 100% full at the same time, there is a performance penalty if the vdevs have different amounts of free space.
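
As a hypothetical illustration, an additional mirror vdev could be added to a pool named mypool like this (the pool name and the ada6p3 and ada7p3 device names are placeholders):

# zpool add mypool mirror ada6p3 ada7p3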

Currently, vdevs cannot be removed from a zpool, and disks can only be removed from a mirror if there is enough remaining redundancy.

20.3.3. Checking the Status of a Pool

Pool status is important. If a drive goes offline or a read, write, or checksum error is detected, the error counter in status is incremented. The status output shows the configuration and status of each device in the pool, in addition to the status of the entire pool. Actions that need to be taken and details about the last scrub are also shown.

# zpool status
  pool: mypool
 state: ONLINE
  scan: scrub repaired 0 in 2h25m with 0 errors on Sat Sep 14 04:25:50 2013
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada4p3  ONLINE       0     0     0
            ada5p3  ONLINE       0     0     0

errors: No known data errors

20.3.4. Clearing Errors

When an error is detected, the read, write, or checksum counts are incremented. The error message can be cleared and the counts reset with zpool clear mypool. Clearing the error state can be important for automated scripts that alert the administrator when the pool encounters an error. Further errors may not be reported if the old errors are not cleared.
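
For example, the counters can be reset for an entire hypothetical pool named mypool, or for a single device within it; both names here are placeholders:

# zpool clear mypool
# zpool clear mypool ada1p3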

20.3.5. Replacing a Functioning Device

There are a number of situations in which it may be desirable to replace a disk with a different disk. This process requires connecting the new disk at the same time as the disk to be replaced. zpool replace will copy all of the data from the old disk to the new one. After this operation completes, the old disk is disconnected from the vdev. If the new disk is larger than the old disk, it may be possible to grow the zpool, using the new space. See Growing a Pool.
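
For example, with both disks connected, a working ada1p3 could be replaced by a new ada6p3 in a hypothetical pool named mypool (all of these names are placeholders for the actual devices):

# zpool replace mypool ada1p3 ada6p3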

20.3.6. Dealing with Failed Devices

When a disk in a ZFS pool fails, the vdev that the disk belongs to will enter the Degraded state. In this state, all of the data stored on the vdev is still available, but performance may be impacted because missing data will need to be calculated from the available redundancy. To restore the vdev to a fully functional state, the failed physical device must be replaced, and ZFS must be instructed to begin the resilver operation, where data that was on the failed device will be recalculated from available redundancy and written to the replacement device. After the process has completed, the vdev will return to Online status. If the vdev does not have any redundancy, or if multiple devices have failed and there is not enough redundancy to compensate, the pool will enter the Faulted state. If a sufficient number of devices cannot be reconnected to the pool then the pool will be inoperative, and data must be restored from backups.
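
As a sketch, assuming the failed disk has been physically swapped for a new one in the same location, naming only the old device starts the replacement and the resilver begins automatically. The pool name mypool and the device name ada2p3 are placeholders:

# zpool replace mypool ada2p3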

20.3.7. Scrubbing a Pool

Pools should be scrubbed regularly, ideally at least once every three months. The scrub operation is very disk-intensive and will reduce performance while running. Avoid high-demand periods when scheduling scrub or use vfs.zfs.scrub_delay to adjust the relative priority of the scrub to prevent it from interfering with other workloads.

# zpool scrub mypool
# zpool status
  pool: mypool
 state: ONLINE
  scan: scrub in progress since Wed Feb 19 20:52:54 2014
        116G scanned out of 8.60T at 649M/s, 3h48m to go
        0 repaired, 1.32% done
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada4p3  ONLINE       0     0     0
            ada5p3  ONLINE       0     0     0

errors: No known data errors

20.3.8. Self-Healing

The checksums stored with data blocks enable the file system to self-heal. This feature will automatically repair data whose checksum does not match the one recorded on another device that is part of the storage pool. For example, consider a mirror with two disks where one drive is starting to malfunction and cannot properly store the data any more. This is even worse when the data has not been accessed for a long time, as in long-term archive storage. Traditional file systems need to run algorithms that check and repair the data like the fsck(8) program. These commands take time, and in severe cases, an administrator has to manually decide which repair operation has to be performed. When ZFS detects a data block being read whose checksum does not match, it will try to read the data from the mirror disk. If that disk can provide the correct data, it will not only give that data to the application requesting it, but also correct the wrong data on the disk that had the bad checksum. This happens without any interaction from a system administrator during normal pool operation.

The following example will demonstrate this self-healing behavior in ZFS. First, a mirrored pool of two disks /dev/ada0 and /dev/ada1 is created.

# zpool create healer mirror /dev/ada0 /dev/ada1
# zpool status healer
  pool: healer
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    healer      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
       ada0     ONLINE       0     0     0
       ada1     ONLINE       0     0     0

errors: No known data errors
# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
healer   960M  92.5K   960M     0%  1.00x  ONLINE  -

Now, some important data that is to be protected from data errors using the self-healing feature is copied to the pool. A checksum of the data is then created so it can be compared against the pool later on.

# cp /some/important/data /healer
# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
healer   960M  67.7M   892M     7%  1.00x  ONLINE  -
# sha1 /healer > checksum.txt
# cat checksum.txt
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f

Next, data corruption is simulated by writing random data to the beginning of one of the disks that make up the mirror. To prevent ZFS from healing the data as soon as it detects it, we export the pool first and import it again afterwards.

Warning:

This is a dangerous operation that can destroy vital data. It is shown here for demonstration purposes only and should not be attempted during normal operation of a ZFS storage pool. Nor should this dd example be run on a disk with a different filesystem on it. Do not use any disk device names other than the ones that are part of the ZFS pool. Make sure that proper backups of the pool are created before running the command!

# zpool export healer
# dd if=/dev/random of=/dev/ada1 bs=1m count=200
200+0 records in
200+0 records out
209715200 bytes transferred in 62.992162 secs (3329227 bytes/sec)
# zpool import healer

The ZFS pool status shows that one device has experienced an error. It is important to note that applications reading data from the pool did not receive any data with a wrong checksum. ZFS provided the application with the data from the ada0 device, which has the correct checksums. The device with the wrong checksum can be found easily, as the CKSUM column contains a value greater than zero.

# zpool status healer
    pool: healer
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
     see: http://www.sun.com/msg/ZFS-8000-9P
    scan: none requested
  config:

      NAME        STATE     READ WRITE CKSUM
      healer      ONLINE       0     0     0
        mirror-0  ONLINE       0     0     0
         ada0     ONLINE       0     0     0
         ada1     ONLINE       0     0     1

errors: No known data errors

ZFS detected the error and took care of it by using the redundancy present in the unaffected ada0 mirror disk. A checksum comparison with the original one should reveal whether the pool is consistent again.

# sha1 /healer >> checksum.txt
# cat checksum.txt
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f

The two checksums that were generated before and after the intentional tampering with the pool data still match. This shows how ZFS is capable of detecting and correcting errors automatically when the checksums do not match any more. Note that this is only possible when there is enough redundancy present in the pool. A pool consisting of a single device has no self-healing capabilities. That is also the reason why checksums are so important in ZFS and should not be disabled for any reason. No fsck(8) or similar filesystem consistency check program is required to detect and correct this, and the pool was available the whole time. A scrub operation is now required to overwrite the corrupted data on ada1.

# zpool scrub healer
# zpool status healer
  pool: healer
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Dec 10 12:23:30 2012
        10.4M scanned out of 67.0M at 267K/s, 0h3m to go
        9.63M repaired, 15.56% done
config:

    NAME        STATE     READ WRITE CKSUM
    healer      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
       ada0     ONLINE       0     0     0
       ada1     ONLINE       0     0   627  (repairing)

errors: No known data errors

The scrub operation reads the data from ada0 and rewrites all data that has a wrong checksum on ada1. This is indicated by the (repairing) output from zpool status. After the operation is complete, the pool status changes to the following:

# zpool status healer
  pool: healer
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
config:

    NAME        STATE     READ WRITE CKSUM
    healer      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
       ada0     ONLINE       0     0     0
       ada1     ONLINE       0     0 2.72K

errors: No known data errors

After the scrub operation has completed and all the data has been synchronized from ada0 to ada1, the error messages can be cleared from the pool status by running zpool clear.

# zpool clear healer
# zpool status healer
  pool: healer
 state: ONLINE
  scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
config:

    NAME        STATE     READ WRITE CKSUM
    healer      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
       ada0     ONLINE       0     0     0
       ada1     ONLINE       0     0     0

errors: No known data errors

Our pool is now back to a fully working state and all the errors have been cleared.

20.3.9. Growing a Pool

The usable size of a redundant ZFS pool is limited by the size of the smallest device in the vdev. If each device in the vdev is replaced sequentially, after the smallest device has completed the replace or resilver operation, the pool can grow based on the size of the new smallest device. This expansion can be triggered by using zpool online with -e on each device. After expansion of all devices, the additional space will become available to the pool.
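
As a hypothetical example, after all devices in a pool named mypool have been replaced with larger ones, the expansion could be triggered on each device in turn. The pool and device names shown here are placeholders:

# zpool online -e mypool ada0p3
# zpool online -e mypool ada1p3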

20.3.10. Importing & Exporting Pools

Pools can be exported in preparation for moving them to another system. All datasets are unmounted, and each device is marked as exported but still locked so it cannot be used by other disk subsystems. This allows pools to be imported on other machines, other operating systems that support ZFS, and even different hardware architectures (with some caveats, see zpool(8)). When a dataset has open files, -f can be used to force the export of a pool. -f causes the datasets to be forcibly unmounted, which can cause undefined behavior in the applications which had open files on those datasets.
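
For example, a hypothetical pool named mypool would be exported with the first command below; the second forces the export when datasets still have open files:

# zpool export mypool
# zpool export -f mypool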

Importing a pool automatically mounts the datasets. This may not be the desired behavior, and can be prevented with -N. -o sets temporary properties for this import only. altroot= allows importing a zpool with a base mount point instead of the root of the file system. If the pool was last used on a different system and was not properly exported, an import might have to be forced with -f. -a imports all pools that do not appear to be in use by another system.
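
As a sketch, the first command lists the pools that are available for import, and the second imports a hypothetical pool named mypool under a temporary mount point using altroot. The pool name and the /mnt mount point are placeholders:

# zpool import
# zpool import -o altroot=/mnt mypool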

20.3.11. Upgrading a Storage Pool

After upgrading FreeBSD, or if a pool has been imported from a system using an older version of ZFS, the pool can be manually upgraded to the latest version of ZFS. Consider whether the pool may ever need to be imported on an older system before upgrading. The upgrade process is irreversible and cannot be undone.

# zpool status
  pool: mypool
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support feature
        flags.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0

errors: No known data errors

The newer features of ZFS will not be available until zpool upgrade has completed. -v can be used to see what new features will be provided by upgrading, as well as which features are already supported.
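
For example, the features that an upgrade would enable can be reviewed first, and then the pool itself upgraded (mypool is a placeholder pool name):

# zpool upgrade -v
# zpool upgrade mypool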

Warning:

Systems that boot from a pool must have their boot code updated to support the new pool version. Run gpart bootcode on the partition that contains the boot code. See gpart(8) for more information.
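
For a system booting from a GPT-partitioned disk, the update might look like the following sketch. The disk name ada0 and the partition index 1 are assumptions and must be adjusted to match the actual boot partition; see gpart(8):

# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0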

20.3.12. Displaying Recorded Pool History

ZFS records all the commands that were issued to administer the pool. These include the creation of datasets, changing properties, or when a disk has been replaced in the pool. This history is useful for reviewing how a pool was created and which user performed a specific action and when. History is not kept in a log file, but is part of the pool itself. Because of that, history cannot be altered after the fact unless the pool is destroyed. The command to review this history is aptly named zpool history:

# zpool history
History for 'tank':
2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1
2013-02-27.18:50:58 zfs set atime=off tank
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank
2013-02-27.18:51:18 zfs create tank/backup

The output shows zpool and zfs commands that were executed on the pool along with a timestamp. Only commands that alter the pool in some way are recorded. Commands like zfs list are not included. When no pool name is given to zpool history, the history of all pools is displayed.

zpool history can show even more information when the options -i or -l are provided. The option -i displays user initiated events as well as internally logged ZFS events.

# zpool history -i
History for 'tank':
2013-02-26.23:02:35 [internal pool create txg:5] pool spa 28; zfs spa 28; zpl 5;uts  9.1-RELEASE 901000 amd64
2013-02-27.18:50:53 [internal property set txg:50] atime=0 dataset = 21
2013-02-27.18:50:58 zfs set atime=off tank
2013-02-27.18:51:04 [internal property set txg:53] checksum=7 dataset = 21
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank
2013-02-27.18:51:13 [internal create txg:55] dataset = 39
2013-02-27.18:51:18 zfs create tank/backup

More details can be shown by adding -l. History records are shown in a long format, including information like the name of the user who issued the command and the hostname on which the change was made.

# zpool history -l
History for 'tank':
2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 [user 0 (root) on :global]
2013-02-27.18:50:58 zfs set atime=off tank [user 0 (root) on myzfsbox:global]
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank [user 0 (root) on myzfsbox:global]
2013-02-27.18:51:18 zfs create tank/backup [user 0 (root) on myzfsbox:global]

This output clearly shows that the root user created the mirrored pool (consisting of /dev/ada0 and /dev/ada1). The hostname (myzfsbox) is also shown in the commands following the pool's creation. The hostname display becomes important when the pool is exported from the current system and imported on another one. The commands that are issued on the other system can clearly be distinguished by the hostname that is recorded for each command.

Both options to zpool history can be combined to give the most detailed information possible for any given pool. Pool history provides valuable information when tracking down what actions were performed or when more detailed output is needed for debugging.

20.3.13. Performance Monitoring

A built-in monitoring system can display statistics about I/O on the pool in real-time. It shows the amount of free and used space on the pool, how many read and write operations are being performed per second, and how much I/O bandwidth is currently being utilized. By default, all pools in the system are monitored and displayed. A pool name can be provided to limit monitoring to just that pool. A basic example:

# zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         288G  1.53T      2     11  11.3K  57.1K

To continuously monitor I/O activity on the pool, a number can be specified as the last parameter, indicating the frequency in seconds to wait between updates. The next statistic line is printed after each interval. Press Ctrl+C to stop this continuous monitoring. Alternatively, give a second number on the command line after the interval to specify the total number of statistics to display.
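
For example, the following hypothetical invocation prints a new statistics line for a pool named data every five seconds and stops after three lines; the pool name, interval, and count are placeholders:

# zpool iostat data 5 3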

Even more detailed pool I/O statistics can be displayed with -v. Each device in the pool is shown with a statistics line. This is useful in seeing how many read and write operations are being performed on each device, and can help determine if any individual device is slowing down the pool. This example shows a mirrored pool consisting of two devices:

# zpool iostat -v 
                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
data                      288G  1.53T      2     12  9.23K  61.5K
  mirror                  288G  1.53T      2     12  9.23K  61.5K
    ada1                     -      -      0      4  5.61K  61.7K
    ada2                     -      -      1      4  5.04K  61.7K
-----------------------  -----  -----  -----  -----  -----  -----

20.3.14. Splitting a Storage Pool

A pool consisting of one or more mirror vdevs can be split into a second pool. The last member of each mirror (unless otherwise specified) is detached and used to create a new pool containing the same data. It is recommended that the operation first be attempted with the -n parameter. The details of the proposed operation are displayed without actually performing it. This helps ensure the operation will happen as expected.
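
As a sketch, splitting a hypothetical mirrored pool named mypool into a new pool named newpool could first be previewed with -n and then performed; both pool names are placeholders:

# zpool split -n mypool newpool
# zpool split mypool newpool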
