Conversation

CULTPONY :verified: :verified: (cult@pony.social)'s status on Monday, 02-Sep-2024 06:49:34 JST CULTPONY :verified: :verified:

I wiped one of my older systems for reasons. I do love watching nwipe while it's deleting its own root FS: you see some services begin to fail as they become unable to continue.
Root FS was ZFS, it managed 11 minutes before an error was reported after I issued blkdiscard to the entire disk and then started nwipe on it. After that it crashed on a kernel panic within a minute.
On the one hand, understandable. But for a filesystem touting its safety and stability, I don't think it should kernel panic that easily.
But that's honestly part of my experience. I've sysadmin'ed ZFS for 5 years, and it's only stable for common failure modes. If a controller breaks or disks do fun stuff like "return all zeroes and discard writes", then ZFS will crash your computer just as badly as the other filesystems will.
Soapboxing a tiny bit: we should write modern filesystems on the assumption that a malicious actor is gonna be messing with our ability to do IO. That also includes assuming "the device is discarding writes and returning zeroes without error". ZFS is great if you limit yourself to common disk failures (i.e., where errors or disconnects are reported). If the controller is faulty or the disk misbehaves in ways that don't raise errors, there's a good chance ZFS will trash the pool.
ext4 and btrfs mostly differ in that they take longer to notice something is wrong, or the corruption spreads further before anyone notices. ZFS just crashes faster.
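
To make the "discarding writes and returning zeroes without error" failure mode concrete, here is a minimal sketch in Python. `LyingDevice`, `HonestDevice` and `verified_write` are hypothetical names made up for this illustration; they are not part of ZFS or any real driver. It only shows the idea of checking a read-back against a checksum of what was written.

```python
import hashlib


class LyingDevice:
    """Hypothetical device: accepts writes, drops them, reads back zeroes."""

    def __init__(self, block_size=4096):
        self.block_size = block_size

    def write(self, block_no, data):
        # Reports success (no exception) but stores nothing.
        return None

    def read(self, block_no):
        return bytes(self.block_size)


class HonestDevice(LyingDevice):
    """Same interface, but actually stores the data."""

    def __init__(self, block_size=4096):
        super().__init__(block_size)
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks.get(block_no, bytes(self.block_size))


def verified_write(dev, block_no, data):
    """Write a block, read it back, and compare SHA-256 digests."""
    expected = hashlib.sha256(data).digest()
    dev.write(block_no, data)
    actual = hashlib.sha256(dev.read(block_no)).digest()
    return actual == expected
```

With a non-zero payload, `verified_write(LyingDevice(), 0, data)` returns `False` while the honest device returns `True`. On real hardware a read-back can be served from a cache rather than the medium, so this is an illustration of the principle, not a reliable detector.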

In conversation 6 months ago from pony.social permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Monday, 02-Sep-2024 06:49:23 JST feld
  in reply to
  
  @cult All of this sounds like a Linux problem not a ZFS problem. My experience is completely different having exclusively used ZFS on Solaris and FreeBSD
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Monday, 02-Sep-2024 06:49:30 JST CULTPONY :verified: :verified:
  in reply to
  
  Also this is the annual reminder that the last time anyone did a survey on how good filesystems are at reporting write errors upward, ZFS only qualified on reporting some errors, and only common ones. Btrfs and ext4 both mostly swallowed write errors.
  Part of that is infra: in the modern async "issue write and OK it to the process before the device OKs it" world, the FS can't reliably report such things.
  That's where we got the PostgreSQL "Fsync considered unreliable" from, a bug that persists on Linux and can cause data loss on any DB setting other than "fsync every write or use O_DIRECT".
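
For readers unfamiliar with the fsync point, a minimal sketch of what "fsync every write" looks like from an application, in Python. `durable_write` is a hypothetical helper name for this example. Because of the async behaviour described above, an error may only surface at `fsync()`, and only if the caller checks it.

```python
import os


def durable_write(path, data):
    """Write data; raise OSError unless write() and fsync() both succeed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        # An OSError raised here means the data may not be on stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```

If `os.fsync()` raises, the safer assumption is that the file contents are unknown. Simply calling `fsync()` again may report success even though the earlier write was lost, which is the core of the PostgreSQL issue mentioned above.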
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Monday, 02-Sep-2024 22:22:31 JST feld
  in reply to
  
  @cult I don't think any OS can survive an entire controller disappearing
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Monday, 02-Sep-2024 22:22:38 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld oh I've had these problems on Solaris too, this isn't a Linux exclusive!
  edit: To be fair, the fsync issue is Linux-exclusive. ZFS being badly written isn't Linux-exclusive, that's just the code base. I've had one Solaris machine deadlock without any kind of console output after a disk controller on the storage fabric stopped working.
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Monday, 02-Sep-2024 23:40:05 JST feld
  in reply to
  
  @cult show me a video of an OS surviving you physically disconnecting a controller while doing IO on it. Every OS I've ever used failed this, so a controller catastrophically failing and taking out the server does not surprise me
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Monday, 02-Sep-2024 23:40:06 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld No that's entirely common and reasonable, it's trivial to set up on Linux/Solaris? HBA/SC failover isn't black magic, but if it fails in the right way, the FS driver will crash the system because the FS assumes that the HBA's reply is something meaningful when it's not.
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 01:35:02 JST feld
  in reply to
  
  @cult I know exactly what you're talking about, but if a controller *dies* hard enough it will take down the OS with it. The hardware is not going to just ignore bad signals coming from the PCIe bus, it will kernel panic.
  
  Can it survive if connected drives die in mysterious ways? Yes
  
  Can it survive if the controller loses its link to the other end (JBOD or whatever)? yes
  
  Can it survive a dying controller sending indecipherable signals? No. For the same reason your OS will crash if a stick of RAM does this. Not even ECC will save you when it's that bad.
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Tuesday, 03-Sep-2024 01:35:10 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld just google for HBA/SC manuals? mdadm has a multipath flag for operating past controller failure in case the hardware driver can't do it on its own but can at least disconnect the HBA.
  Heck, even Windows Server supports this: LBFO/MPIO under Storage Spaces can handle controller failure in the context of Hyper-V setups.
  
  In conversation 6 months ago permalink
- Embed this notice
  Phantasm (phnt@fluffytail.org)'s status on Tuesday, 03-Sep-2024 05:24:08 JST Phantasm
  in reply to
  - feld
  @cult @feld It's not common and it's definitely not reasonable. If you want actual failure resiliency you either use a JBOD with SAS multipath or Ceph and similar. Those provide true resiliency to failures on nodes without affecting the whole array.
  
  First reason is that you are trying to make a more resilient setup which still depends on a lot of things working properly: motherboard, memory, CPU(s). For true resiliency, the only proper way is multiple independent nodes.
  
  Second reason is that PCIe _usually_ isn't hot-(un)pluggable, and motherboards that support it usually have bad implementations if they weren't made for PCIe flash storage. This means that once your HBA fails into a state where it completely stops responding, there's a high likelihood that the kernel will get confused or the PCIe bridge on the motherboard will get confused.
  
  Third reason is about kernel safety and keeping a known good state. When you pull out random storage devices, the kernel subsystems receive a message about that and will act accordingly. Yanking out a USB drive on Linux won't crash the kernel, because the USB subsystem handles disconnects, the SCSI subsystem will destroy the device, leaving the FS driver unable to access blocks/journal and eventually giving up.
  When your HBA fails, it likely sends garbage data either on the PCIe, the storage bus or both, which should immediately panic the kernel or at least taint it. Once you receive garbage which you can't handle/correct within spec, you can no longer be sure about the system integrity.
  In this case the FS driver can no longer access the devices and the kernel does not know about a device disconnect (to the kernel the devices still exist). Once that happens the FS driver will give up, stop the journal and in the case of ZFS completely offline the whole pool. If the main root FS was on that array/device it's completely reasonable to panic the system. The system cannot function without the root FS. This is the equivalent of being surprised that your system won't boot past initramfs when there's no real root FS.
  
  In conversation 6 months ago permalink
  
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 05:38:24 JST feld
  in reply to
  - Phantasm
  @phnt @cult best implementation of PCIE hotplugging is Thunderbolt and even that can still cause panics if the device behaves badly enough
  
  In conversation 6 months ago permalink
- Embed this notice
  Phantasm (phnt@fluffytail.org)'s status on Tuesday, 03-Sep-2024 05:39:21 JST Phantasm
  in reply to
  - Phantasm
  - feld
  @cult @feld In fact a non-zero number of storage servers I've touched had either panic_on_io_nmi or panic_on_oops set. When you are dealing with storage, it's in my opinion always better to crash early than to read/write garbage permanently and then crash. It doesn't really matter if your cold storage ZFS pool goes offline for 5 minutes when something goes horribly wrong, and a properly architected storage solution for active use also won't fall apart when one of the systems goes offline.
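
To check whether a Linux box has those two knobs set, a small Python sketch follows. It assumes the sysctls are exposed as files under `/proc/sys/kernel`; `read_panic_knobs` and its `root` parameter are hypothetical names for this example.

```python
from pathlib import Path

# Sysctl names as mentioned in the post above.
KNOBS = ("panic_on_oops", "panic_on_io_nmi")


def read_panic_knobs(root="/proc/sys/kernel"):
    """Return {knob: int or None}; None means missing or unreadable."""
    result = {}
    for name in KNOBS:
        try:
            result[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result
```

Calling `read_panic_knobs()` with no arguments reads the live values. A knob that does not exist on the running kernel or architecture comes back as `None` rather than raising.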
  
  In conversation 6 months ago permalink
  
- Embed this notice
  Pete Wright (pete_wright@nlogic.systems)'s status on Tuesday, 03-Sep-2024 05:45:56 JST Pete Wright
  in reply to
  - feld
  @feld @cult wouldn't that be the preferable path as opposed to causing catastrophic filesystem corruption?
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 05:45:56 JST feld
  in reply to
  - Pete Wright
  @pete_wright @cult yes, I think that's one of the underlying motivations for never trying to recover from it. It's just too dangerous.
  
  And this still leaves ZFS as the only filesystem that is almost guaranteed to survive this event without corruption. (There is one very rare exception, but I can't remember the ZFS terminology. Will update when I remember it)
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 05:48:54 JST feld
  in reply to
  - Phantasm
  @phnt @cult Take an external device and make it fast boot loop due to power cutting out and I'm pretty sure it still will panic from this but you can't be too mad about it really heh
  
  In conversation 6 months ago permalink
- Embed this notice
  Phantasm (phnt@fluffytail.org)'s status on Tuesday, 03-Sep-2024 05:48:56 JST Phantasm
  in reply to
  - feld
  @feld @cult Interestingly, Thunderbolt hasn't panicked on me yet after 3 years. But what it does regularly on Linux is fail to send a video signal after waking from S3 sleep, plus somewhat regular (monthly) hard lockups without a panic. On Windows it randomly sends USB HID devices into sleep or starts cutting out HID inputs under heavier IO.
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 05:57:26 JST feld
  in reply to
  
  @cult What is messy about it? That VDEVs are relatively permanent so shrinking is impossible and expanding is not economical?
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Tuesday, 03-Sep-2024 05:57:36 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld honestly, I'm not sure what the problem here is. The fact ZFS is just one of the better behaving filesystems doesn't make it good. The bar is just very low in filesystems aiming for recoverability and resistance to malicious attacks.
  The rest of ZFS being a mess of a design in terms of modern filesystem architecture is just part of it.
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 06:08:18 JST feld
  in reply to
  
  @cult I've seen that before -- agree it should never be possible to do that. They could probably mark it as readonly/locked with a zpool attribute so the writes will actually be denied
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Tuesday, 03-Sep-2024 06:08:42 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld Oh also superblock upgrades. Despite ZFS usually being fairly careful about feature upgrades, I've had one turn a zpool move into an irreversible migration. In that case, an old Solaris host had its zpool moved to a Linux machine, and importing the pool read-only caused it to become unmountable by any installation of Solaris we had at the company.
  
  In conversation 6 months ago permalink
- Embed this notice
  CULTPONY :verified: :verified: (cult@pony.social)'s status on Tuesday, 03-Sep-2024 06:08:44 JST CULTPONY :verified: :verified:
  in reply to
  - feld
  @feld RAIDZ comes with so many footnotes about what it does to a pool it's not even funny.
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 06:14:06 JST feld
  in reply to
  
  @cult
  
  > The bar is just very low in filesystems aiming for recoverability and resistance to malicious attacks.
  
  CPUs and storage IOPS were not fast enough for more complex filesystems. There's nobody to be mad at. Now is the time to build it if that's your interest -- it's finally feasible. But remember filesystems take a solid decade to mature, so you'll be married to it 🥲
  
  In conversation 6 months ago permalink
- Embed this notice
  feld (feld@friedcheese.us)'s status on Tuesday, 03-Sep-2024 06:29:44 JST feld
  in reply to
  - Phantasm
  @phnt @cult if I run into a problem where I really need deduplication I'll deploy Dragonfly for HAMMER or some SAN that does it
  
  As for fragmentation -- yeah, annoying. I wrote my own tool to force copy files and then move in place to defrag them. It can get it down to 1-2%
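
The "copy files and then move in place" approach can be sketched roughly as below. This is not feld's tool, just a minimal Python illustration of the idea; `rewrite_in_place` is a hypothetical name. It copies the file to a fresh temporary file in the same directory, so the filesystem allocates new blocks, then renames it over the original.

```python
import os
import shutil
import tempfile


def rewrite_in_place(path):
    """Copy a file to a temp file in the same directory, then rename it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            shutil.copyfileobj(src, dst)
            dst.flush()
            os.fsync(dst.fileno())  # make sure the copy is on disk first
        shutil.copystat(path, tmp_path)  # keep mode bits and timestamps
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary copy if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This sketch does not preserve ownership or hard links, and it needs free space for a second copy of the file. On a copy-on-write filesystem, existing snapshots keep referencing the old blocks, so the space is not freed until those snapshots go away.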
  
  In conversation 6 months ago permalink
- Embed this notice
  Phantasm (phnt@fluffytail.org)'s status on Tuesday, 03-Sep-2024 06:29:45 JST Phantasm
  in reply to
  - feld
  @feld @cult ZFS fragmentation can be a big problem for older deployments. Deduplication is very costly. What is also a problem is backwards compatibility with older ZFS after upgrading the pool.
  
  It also isn't that fast compared to other filesystems, which isn't a problem for me. Sure, you can add metadata special devices, L2ARC, SLOG and tweak the filesystem, but speed isn't its purpose.
  
  In conversation 6 months ago permalink
