Troubleshooting
Todo
This page is a draft.
This page contains tips for troubleshooting ZFS on Linux and what info developers might want for bug triage.
About Log Files
Log files can be very useful for troubleshooting. In some cases, interesting information is stored in multiple log files that are correlated to system events.
Pro tip: logging infrastructure tools like Elasticsearch, Fluentd, InfluxDB, or Splunk can simplify log analysis and event correlation.
Generic Kernel Log
Typically, Linux kernel log messages are available from dmesg -T, /var/log/syslog, or wherever kernel log messages are sent (e.g. by rsyslogd).
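As a rough sketch, the kernel log can be narrowed to lines that are likely to matter for ZFS triage. The helper name filter_zfs_kernel_msgs and the search patterns are assumptions chosen for illustration, not an official list of messages.

```shell
# Hypothetical helper: keep kernel log lines that mention ZFS or SPL, or
# that look like hung-task reports. The patterns are illustrative guesses.
filter_zfs_kernel_msgs() {
    grep -i -e 'zfs' -e 'spl:' -e 'blocked for more than'
}

# Typical use (dmesg may need root if kernel.dmesg_restrict is set):
#   dmesg -T | filter_zfs_kernel_msgs
#   filter_zfs_kernel_msgs < /var/log/syslog
```

Because the helper reads standard input, the same filter works on a saved log file attached to a bug report.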
ZFS Kernel Module Debug Messages
The ZFS kernel modules use an internal log buffer for detailed logging
information. This log information is available in the pseudo file
/proc/spl/kstat/zfs/dbgmsg for ZFS builds where the ZFS module parameter
zfs_dbgmsg_enable = 1.
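A minimal sketch of checking the parameter before reading the buffer follows. It assumes the module parameter is exposed at the usual sysfs location, /sys/module/zfs/parameters/zfs_dbgmsg_enable; the function name dump_zfs_dbgmsg is hypothetical.

```shell
# Print the ZFS debug message buffer if logging is enabled.
# Both paths are arguments only so the function can be tried against
# ordinary files; the defaults are the usual Linux locations (assumed).
dump_zfs_dbgmsg() {
    param_file="${1:-/sys/module/zfs/parameters/zfs_dbgmsg_enable}"
    log_file="${2:-/proc/spl/kstat/zfs/dbgmsg}"

    if [ "$(cat "$param_file" 2>/dev/null)" != "1" ]; then
        echo "zfs_dbgmsg_enable is not 1; debug messages are not being collected" >&2
        return 1
    fi
    cat "$log_file"
}

# To enable logging at runtime (as root):
#   echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
# Then save the buffer for a bug report:
#   dump_zfs_dbgmsg > zfs-dbgmsg.txt
```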
Unkillable Process
Symptom: a zfs or zpool command appears hung, does not return, and is not killable
Likely cause: kernel thread hung or panic
Log files of interest: Generic Kernel Log, ZFS Kernel Module Debug Messages
Important information: if a kernel thread is stuck, then a backtrace of the stuck thread can be in the logs. In some cases, the stuck thread is not logged until the deadman timer expires. See also debug tunables
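Unkillable processes on Linux are usually in uninterruptible sleep (state D), so listing those is one way to find the stuck command before looking for its backtrace. This is a sketch only: the helper name filter_uninterruptible is hypothetical, and it assumes the process state is the second column of the input.

```shell
# Keep the header line plus any process whose state starts with "D"
# (uninterruptible sleep). Expects input shaped like the output of
# "ps -eo pid,stat,wchan:32,comm", where the state is the second column.
filter_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use:
#   ps -eo pid,stat,wchan:32,comm | filter_uninterruptible
# With a PID from that list, root can read the kernel stack on kernels
# that expose it:
#   cat /proc/<pid>/stack
```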
ZFS Events
ZFS uses an event-based messaging interface for communication of
important events to other consumers running on the system. The ZFS Event
Daemon (zed) is a userland daemon that listens for these events and
processes them. zed is extensible so you can write shell scripts or
other programs that subscribe to events and take action. For example,
the script usually installed at /etc/zfs/zed.d/all-syslog.sh writes a
formatted event message to syslog. See the man page for zed(8) for more
information.
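To illustrate how such a script can look, here is a hypothetical zedlet sketch. zed passes event details to its scripts as ZEVENT_* environment variables; this example uses ZEVENT_EID, ZEVENT_CLASS, and ZEVENT_POOL. The function name and the log path are made up for illustration; see zed(8) for the variables and naming rules that apply to your version.

```shell
#!/bin/sh
# Hypothetical zedlet sketch: format one line per event from the
# ZEVENT_* environment variables that zed sets for its scripts.
# ZEVENT_POOL may be unset for events that are not tied to a pool.
format_zevent() {
    printf 'eid=%s class=%s pool=%s\n' \
        "${ZEVENT_EID:-?}" "${ZEVENT_CLASS:-?}" "${ZEVENT_POOL:-none}"
}

# In a real zedlet the line could be appended to a file of your choosing:
#   format_zevent >> /var/log/zfs-events.log   # hypothetical path
```

Keeping the formatting in a function makes it easy to try out by setting the variables by hand, without waiting for a real event.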
A history of events is also available via the zpool events command.
This history begins at ZFS kernel module load and includes events from
any pool. These events are stored in RAM and limited in count to a value
determined by the kernel tunable zfs_zevent_len_max. zed has an internal
throttling mechanism to prevent overconsumption of system resources
while processing ZFS events.
More detailed information about events is observable using
zpool events -v. The contents of the verbose events are subject to
change, based on the event and information available at the time of the
event.
Each event has a class identifier used for filtering event types.
Commonly seen events are those related to pool management with class
sysevent.fs.zfs.*, including import, export, configuration updates,
and zpool history updates.
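Filtering by class can be sketched with a small helper. It assumes the class is the last whitespace-separated field on each line of zpool events output; the name events_with_class is hypothetical.

```shell
# Print only the events whose class starts with the given prefix.
# Assumes the class is the last whitespace-separated field of each line.
events_with_class() {
    awk -v prefix="$1" 'index($NF, prefix) == 1'
}

# Typical use (-H omits the header line):
#   zpool events -H | events_with_class sysevent.fs.zfs.
#   zpool events -H | events_with_class ereport.
```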
Events related to errors are reported as class ereport.* and can be
invaluable for troubleshooting. Some faults can cause multiple
ereports as various layers of the software deal with the fault. For
example, on a simple pool without parity protection, a faulty disk could
cause an ereport.io during a read from the disk that results in an
ereport.fs.zfs.checksum at the pool level. These events are also
reflected by the error counters observed in zpool status. If you see
checksum or read/write errors in zpool status, then there should be
one or more corresponding ereports in the zpool events output.
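One way to cross-check the error counters against the event history is to tally ereports by class. The sketch below assumes the class is the last field of each zpool events line; the name count_ereports is hypothetical.

```shell
# Count ereport events per class and print "<count> <class>" lines,
# sorted by class name. Non-ereport events are ignored.
count_ereports() {
    awk '$NF ~ /^ereport\./ { n[$NF]++ }
         END { for (c in n) print n[c], c }' | sort -k2
}

# Typical use, alongside the per-device error counters:
#   zpool events -H | count_ereports
#   zpool status -v
```

Keep in mind that the event history is held in RAM and capped in length, so the tally only covers events since module load that are still in the buffer.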