Of all the worst possible fates someone could face, I’m sure the most sinister thing to wish upon someone would be “I hope your ZFS server has intermittent hardware issues.”
I had a compound issue of a defective CPU and bad RAM on a TrueNAS Core setup, and I originally suspected a poorly designed cheap motherboard. There were times when the system went unresponsive or kernel panicked, and it would usually take 2-5 minutes to get past the boot screen (or sometimes it would only advance after a key press, despite no errors or prompts). Eventually it got so erratic that I absolutely had to take it offline and run an extensive memtest. There were never any RAM errors, yet weirdly memtest itself would sometimes freeze completely. Then the system couldn’t even start anymore. So I said ‘screw it’ and bought a new motherboard, moved over the CPU and RAM, and it couldn’t boot on the new motherboard either. So I took my desktop CPU and put it in the new motherboard, and it booted fine. I also moved my desktop CPU back and carried the RAM from the server build over into my desktop, and that booted fine too. So I deduced that the CPU must be defective.
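For what it’s worth, the swap testing above is just elimination, and it can be sketched in a few lines of Python. This is only an illustration: the part labels are made up, I’m assuming the server RAM stayed in the new motherboard when the desktop CPU was tested there, and the whole approach assumes a single faulty part and that a clean boot really clears every part involved (which intermittent faults can easily violate).

```python
# Each observation: (set of parts in the machine, did it boot?).
# All labels are hypothetical names for the parts described above.
observations = [
    ({"server_cpu", "server_ram", "old_board"}, False),
    ({"server_cpu", "server_ram", "new_board"}, False),
    # Assumption: the server RAM was still installed for this test.
    ({"desktop_cpu", "server_ram", "new_board"}, True),
    ({"desktop_cpu", "server_ram", "desktop_board"}, True),
]


def suspects(observations):
    """Return parts present in every failing build and in no passing build.

    Assumes exactly one faulty part and that a passing boot clears
    every part that took part in it.
    """
    cleared = set()
    for parts, booted in observations:
        if booted:
            cleared |= parts

    result = None
    for parts, booted in observations:
        if not booted:
            remaining = parts - cleared
            result = remaining if result is None else result & remaining
    return result or set()


print(suspects(observations))
```

With these observations, the only part left standing is the server CPU. The old motherboard is never directly cleared; it only drops out because of the single-fault assumption.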
I had a 3-year warranty on it with Newegg, went to file a claim, and got redirected to the insurer’s website (the insurer had been bought out by Allstate). The dumbest thing was that the only option available to me was to bring it to a computer repair shop, and the insurer would pay for the cost of repair. But it’s a processor, not a serviceable component, and I actually AM the repair shop (this was a commissioned home server build for someone else), and I had concluded that it needed to be replaced. So I [politely] fought with them for days, and it was just the most patently absurd circular logic. It was essentially as if they had miscategorized the AMD processor SKU as a complete serviceable desktop computer, but even when I explained this, rationality wasn’t there. It was still under manufacturer warranty, so I contacted AMD instead and made my case. I sent it in, they confirmed it was indeed defective, and they sent a replacement (moral of the story: DO NOT buy a warranty on a processor from Newegg; the manufacturer warranty already exceeds what they offer, and the manufacturer will handle it more sanely).
While the CPU RMA was pending, I ordered a different processor just to get their server going again, and it’s been running just fine for them ever since.
… But then I had a nice spare AM4-socket server motherboard lying around that had been cleared as fine, plus the replacement processor that arrived later from the RMA, both unused. So I figured I’d just use the parts as an upgrade to my existing fileserver. Since the new board was DDR4 and my existing hardware was DDR3, I needed to order DDR4 RAM for it.
RAM comes in, I move my hard drives over to the new CPU/RAM/motherboard combo, and put it all in a nice rackmount case. I make sure my VMs and everything else start fine, and everything’s working. A month later: kernel error messages and kernel panics. I test the RAM, and it’s clearly defective. I send it in for RMA. Replacement RAM comes, I install it and test it, and the RAM test fails AGAIN. I move the presumed defective RAM into my desktop and test it there, and it fails there too. I RMA the replacement RAM, and then finally get fully functional RAM.
Everything runs great for many months. Then I start to rework the setup of my VMs, moving some to a different system (such as a separate box intended as a game server) and/or shutting off unused ones. A day or two later, a couple of minutes after I close my last SSH session to that server, it suddenly goes offline.
I turn it back on and everything’s running again. I can’t find any clear cause: nothing in any logs, nothing that stands out in the BMC. A day or two later it happens again. I start digging around online for answers, especially from people using the exact same ‘server’ motherboard, and I find recommendations to change the power management options in the motherboard firmware settings, one in particular called “Power Supply Idle Mode”. Apparently if CPU usage is low enough, the CPU drops into a low-power idle state, and some finicky power supplies just shut off completely at that load.
After correcting that setting, it’s been running for several months uninterrupted.
But man, that was the most stressful series of events, especially since it just had to happen ONLY in my fileserver use cases (two sets of bad RAM, even), but ZFS survived it. All of this occurred earlier this year.