@cult @feld It's not common and it's definitely not reasonable. If you want actual failure resiliency, you use either a JBOD with SAS multipath or something like Ceph. Those provide true resiliency to node failures without affecting the whole array.
The first reason is that you are trying to make a more resilient setup which still depends on a lot of things working properly: motherboard, memory, CPU(s). For true resiliency, the only proper way is multiple independent nodes.
Second reason is that PCIe _usually_ isn't hot-(un)pluggable, and motherboards that support it usually have bad implementations if they weren't made for PCIe flash storage. This means that once your HBA fails into a state where it completely stops responding, there's a high likelihood that the kernel will get confused or the PCIe bridge on the motherboard will get confused.
The third reason is about kernel safety and keeping a known good state. When you pull out random storage devices, the kernel subsystems receive a message about that and act accordingly. Yanking out a USB drive on Linux won't crash the kernel, because the USB subsystem handles the disconnect and the SCSI subsystem destroys the device. That leaves the FS driver unable to access blocks or the journal, and it eventually gives up.
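To make the "kernel knows the device is gone" part concrete, here is a minimal Python sketch that lists what sysfs reports for SCSI-backed block devices. It assumes a Linux system exposing `/sys/block/<dev>/device/state`; the function name `scsi_device_states` and its `sys_block` parameter are made up for illustration. After a clean unplug the device should simply drop out of the listing, while a device that is still listed (whatever its state) is one the kernel still believes exists.

```python
from pathlib import Path


def scsi_device_states(sys_block="/sys/block"):
    """Map block device name -> SCSI state string as reported by sysfs.

    `sys_block` is a parameter only so the function can be pointed at a
    test directory; on a real system the default is what you want.
    """
    states = {}
    root = Path(sys_block)
    if not root.is_dir():
        return states
    for dev in sorted(root.iterdir()):
        state_file = dev / "device" / "state"
        try:
            states[dev.name] = state_file.read_text().strip()
        except OSError:
            # No SCSI state attribute (e.g. loop devices) or the device
            # vanished while we were scanning; skip it.
            continue
    return states


if __name__ == "__main__":
    for name, state in scsi_device_states().items():
        print(f"{name}: {state}")
```

Run it before and after pulling a USB drive and the entry should disappear. A device sitting behind a wedged HBA may instead stay listed, possibly as `running`, even though nothing can reach it.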
When your HBA fails, it likely sends garbage data on the PCIe bus, the storage bus, or both, which should immediately panic the kernel or at least taint it. Once you receive garbage which you can't handle or correct within spec, you can no longer be sure about system integrity.
In this case the FS driver can no longer access the devices and the kernel does not know about a device disconnect (to the kernel the devices still exist). Once that happens the FS driver will give up, stop the journal and in the case of ZFS completely offline the whole pool. If the main root FS was on that array/device it's completely reasonable to panic the system. The system cannot function without the root FS. This is the equivalent of being surprised that your system won't boot past initramfs when there's no real root FS.
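If you want to see the pool-level outcome rather than guess, a small sketch like the following can help. It assumes OpenZFS with the `zpool` CLI on the PATH and uses `zpool list -H -o name,health`, which prints one pool per line without headers. The helper names (`parse_pool_health`, `unhealthy_pools`, `read_pool_health`) are hypothetical, and treating anything other than `ONLINE` as a problem is a simplification.

```python
import subprocess


def parse_pool_health(listing):
    """Parse `zpool list -H -o name,health` output into {pool: health}."""
    pools = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or unexpected line
        pools[fields[0]] = fields[1]
    return pools


def unhealthy_pools(pools):
    """Return only the pools whose health is not ONLINE."""
    return {name: health for name, health in pools.items() if health != "ONLINE"}


def read_pool_health():
    """Ask the local zpool CLI for pool health (requires OpenZFS installed)."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pool_health(result.stdout)
```

Calling `unhealthy_pools(read_pool_health())` from a monitoring job gives you something to alert on. If the root FS lives on the affected pool, though, don't expect this to run at all: the binaries it needs are on the storage that just went away.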