Conversation
nepi (nepi@fedi.snepi.space)'s status on Sunday, 27-Nov-2022 11:26:02 JST:

Aaaand it's looking like I bought a replacement CPU for no reason. Oh well, guess it goes into the ZFS box!
nepi (nepi@fedi.snepi.space)'s status on Sunday, 27-Nov-2022 11:26:01 JST:

The series of unfortunate events went something like this:
- pull the CPU out of the socket the wrong way (red herring)
- in the process of removing/reinstalling the CPU, the PCIe riser cable goes from a marginal connection to one that no longer works
- machine no longer boots
- attempt to re-seat the cable, but the attempt isn't good enough, as the cable is really difficult to manipulate when the machine is fully installed in the case
- become increasingly convinced that the problem is the CPU, even though the lights on the motherboard get stuck at the VGA step (VGA initialization is handled directly by the CPU over PCIe, so a CPU problem could manifest as being unable to initialize graphics)
- reset the BIOS, which changes the default PCIe protocol to PCIe 4.0, which the riser cable does not support (apparently there's no way to fall back to a lesser protocol if both devices negotiate PCIe 4.0 at initialization?)
- swap GPUs for a GPU that I thought was good, but appears to be bad or at least incompatible with Ryzen 5000 systems
- install the CPU in the old motherboard with different RAM and said bad GPU; machine gets stuck at DIMM training for an improbably long time
- become convinced the problem is the CPU
- order a new CPU and install it into the current motherboard; the problem still persists
- plug the full-fat GPU into the old motherboard with the old CPU; machine boots just fine
- reconsider every choice in my life until this point
- plug the full-fat GPU into the current motherboard with the new CPU and reconfigure the PCIe link to 3.0 (a quick way to verify what a link actually trained at is sketched below)
- rebuild entire system, boots fine
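For anyone chasing a similar riser problem: a quick way to see what each PCIe link actually negotiated, versus what the devices are capable of, is the kernel's sysfs attributes. A minimal Python sketch, assuming a Linux system that exposes the standard current_link_speed/max_link_speed files (not every device does):

```python
#!/usr/bin/env python3
"""Flag PCIe devices whose negotiated link speed is below what they
advertise -- e.g. a riser holding a Gen4-capable slot down at Gen3."""
from pathlib import Path

def read_attr(dev, name):
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return None  # attribute missing or unreadable on this device

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    cur = read_attr(dev, "current_link_speed")
    cap = read_attr(dev, "max_link_speed")
    if cur and cap and cur != cap:
        # e.g. "0000:01:00.0: running at 8.0 GT/s, capable of 16.0 GT/s"
        print(f"{dev.name}: running at {cur}, capable of {cap}")
```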
arcanicanis (arcanicanis@were.social)'s status on Sunday, 27-Nov-2022 11:27:46 JST:

Of all the worst possible fates someone could face, I’m sure the most sinister thing to wish upon someone would be “..I hope your ZFS server has intermittent hardware issues..”
I had a compound issue of a defective CPU and later bad RAM on a TrueNAS Core setup, and I originally suspected it to be from a poorly designed cheap motherboard. There were times when the system went unresponsive or kernel panicked, and it would usually take like 2-5 minutes to get past the boot screen (or sometimes only advance by pressing a key, despite no errors or prompts). Eventually it got so erratic that I absolutely had to take it offline and do an extensive memtest, but there’d never be any RAM errors; yet weirdly, sometimes memtest itself would completely freeze. Then it couldn’t even start anymore.

So I said ‘screw it’ and bought a new motherboard, moved over the CPU and RAM, and on the new motherboard it couldn’t boot either. So I took my desktop CPU, put it in the new motherboard, and it booted fine. Then I moved my desktop CPU back and carried the RAM from the server build over into my desktop, and that booted fine too. So I deduced that the CPU itself must be defective.
I had a 3-year warranty on it with Newegg, went to file a claim, got redirected to the insurer’s website (the insurer had been bought out by Allstate), and the dumbest thing was: the only option available to me was to bring it to a computer repair shop, and the insurer would pay the cost of repair. But.. it’s a processor, it’s not a serviceable component, and I actually AM the repair shop (this was a commissioned home server build for someone else), and I had concluded that it needed to be replaced. So I [politely] fought with them for days, and it was the most patently absurd circular logic: it was essentially as if they had miscategorized the AMD processor SKU as a complete, serviceable desktop computer, but even when I explained this, rationality wasn’t there. It was still under manufacturer warranty, so I contacted AMD instead and made my case, sent it in, they confirmed it was indeed defective, and sent a replacement (moral of the story: DO NOT buy a warranty on a processor from Newegg; the manufacturer warranty already exceeds what they offer, and AMD will handle it more sanely).
Meanwhile, while the CPU RMA was in progress, I ordered a different processor just to get the machine going again, and it’s been running just fine for them ever since.
… But then I had a nice spare AM4-socket server motherboard lying around that had been concluded to be fine, plus the replacement processor that later arrived from the RMA, both unused. So I figured I’d just use the parts myself as an upgrade to my existing fileserver. As the new board was DDR4 while my existing hardware was DDR3, I needed to order DDR4 RAM for it.
The RAM comes in, I move my hard drives over to the new CPU/RAM/motherboard combo, and put it all in a nice rackmount case. I make sure my VMs and everything start fine, and everything’s working. A month later: kernel error messages and kernel panics. I test the RAM; it’s clearly defective. I send it in for RMA. The replacement RAM comes, I install it and test it, and the RAM test fails AGAIN. I move the presumed-defective RAM into my desktop and test it there; it fails there too. I RMA the replacement RAM, and then finally get fully functional RAM.
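Before pulling a machine offline for a proper memtest, a crude userspace pattern test can sometimes reproduce this kind of failure. A minimal sketch in the spirit of memtester; it's no substitute for a real offline test, since it only touches memory the OS hands this one process, and the test size here is an assumption to tune to your free RAM:

```python
#!/usr/bin/env python3
"""Crude userspace RAM pattern test: write classic alternating-bit
patterns across a large buffer, then read them back and compare."""

MiB = 1024 * 1024
SIZE = 256 * MiB                     # amount of RAM to exercise (assumption; tune)
PATTERNS = [0x00, 0xFF, 0x55, 0xAA]  # classic stuck-bit / coupling patterns

buf = bytearray(SIZE)
for pat in PATTERNS:
    chunk = bytes([pat]) * MiB
    for off in range(0, SIZE, MiB):   # write the pattern everywhere...
        buf[off:off + MiB] = chunk
    for off in range(0, SIZE, MiB):   # ...then verify it survived
        if buf[off:off + MiB] != chunk:
            bad = next(i for i in range(off, off + MiB) if buf[i] != pat)
            raise SystemExit(f"mismatch at offset {bad:#x} (pattern {pat:#04x})")
    print(f"pattern {pat:#04x}: OK over {SIZE // MiB} MiB")
print("all patterns passed")
```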
Everything runs great for many months. Then I start to rework the setup of my VMs, moving them to a different system (such as a separate machine intended as a game server) and shutting some unused VMs off. A day or two later, a couple of minutes after I close my last SSH session to that server, the server suddenly goes offline.
I turn it back on and everything’s running again; I can’t find any clear cause, nothing in any logs, nothing that stands out in the BMC. A day or two later it happens again. I start digging around online for answers, especially from people using the exact same ‘server’ motherboard, and I find recommendations to change power-management options in the motherboard firmware settings, one option originally called “Power Supply Idle Mode”. Apparently if CPU usage is low enough, the CPU drops into a low-power idle state, and some finicky power supplies respond to the near-zero load by completely shutting off.
After correcting that setting, it’s been running for several months uninterrupted.
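For boards that don't expose that firmware option, one OS-side stopgap is to keep the CPU out of its deepest idle states, so the package never reaches the near-zero draw that trips a finicky supply. A minimal sketch using Linux's cpuidle sysfs interface; run as root, the cutoff below is an assumption, and the change resets at reboot (the firmware setting remains the proper fix):

```python
#!/usr/bin/env python3
"""Disable the deeper CPU idle states via Linux's cpuidle sysfs."""
from pathlib import Path

KEEP = 2  # leave the first N shallow states (e.g. poll, C1) enabled -- assumption

for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    states = sorted(cpu.glob("cpuidle/state[0-9]*"),
                    key=lambda p: int(p.name[len("state"):]))
    for state in states[KEEP:]:
        name = (state / "name").read_text().strip()
        (state / "disable").write_text("1")  # 1 = kernel won't enter this state
        print(f"{cpu.name}: disabled {state.name} ({name})")
```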
But man, that was the most stressful series of events, especially since it just had to happen ONLY in my fileserver use-cases (two sets of bad RAM, even). But ZFS survived it. All of this occurred earlier this year.
arcanicanis (arcanicanis@were.social)'s status on Sunday, 27-Nov-2022 11:28:11 JST:

(and holy heck I hate Delete & Redraft; I hope that didn't spam everyone's timeline with 3 copies of that post)
arcanicanis (arcanicanis@were.social)'s status on Sunday, 27-Nov-2022 11:39:20 JST:

This was a build with a Ryzen 9 3900X.
But yes, in my main desktop I originally started with a 1600X, and for some reason there'd be times I'd come back to my desktop the next day and it'd be unresponsive, and I'd have to restart it. I upgraded to a different processor and haven't had an issue since.
Meanwhile, there is definitely a known defect in 1st-generation Ryzen under oddly specific heavy-load situations, and I suspect the one I have may be affected, per the serial number. But I never fully followed through on the RMA for that one.
nepi (nepi@fedi.snepi.space)'s status on Sunday, 27-Nov-2022 11:39:21 JST:

@arcanicanis Holy shit, was this on a 1st gen Ryzen processor, by chance? I bought one off eBay and had similar issues in a B350 chipset board until I disabled C-States! I think there was a bad run of early first gen 1700s.
In another hilarious coincidence, I also had ZFS arrays hooked up to that thing when it started failing. It survived as well, but I was really sweating when I started getting drives dropping out of arrays due to write errors because of segfaults.
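A pool can degrade quietly in situations like that, so a tiny watcher polling `zpool status -x` is cheap insurance; that flag prints a single "all pools are healthy" line when nothing is wrong. A minimal sketch; the poll interval is arbitrary, and the healthy-output string is worth double-checking against your OpenZFS version:

```python
#!/usr/bin/env python3
"""Log anything `zpool status -x` reports beyond a healthy summary."""
import subprocess
import time

INTERVAL = 60  # seconds between checks (arbitrary)

while True:
    result = subprocess.run(["zpool", "status", "-x"],
                            capture_output=True, text=True)
    out = (result.stdout + result.stderr).strip()
    if out != "all pools are healthy":
        print(time.strftime("%F %T"), "pool problem:", out, flush=True)
    time.sleep(INTERVAL)
```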