@nihil@misskey.gg if I can get my AMD GPU to be stable with Stable Diffusion, maybe I'll make a LoRA for it. I filed an issue on the AMDGPU driver's freedesktop GitLab
@mint@ryona.agency@nihil@misskey.gg the issue is, I think, a race condition in the AMDGPU driver itself, related to AMDKFD, ROCm and PyTorch, on the RX 6800 series at least. I might try ZLUDA again as a workaround (it failed last time), or boot into Windows and use DirectML, or fall back to CPU, or hell, maybe I can find a backend in either OpenCL or Vulkan, not sure
@mint@ryona.agency@nihil@misskey.gg my solution for now is to run Stable Diffusion in a TTY so the GPU is basically idle besides image generation and it seems to happen less often
@mint@ryona.agency@nihil@misskey.gg it could always be something other than a race condition, but that's what it looks like judging by these lines:
Feb 27 22:08:15 nixos systemd[1]: Starting Cleanup of Temporary Directories...
Feb 27 22:08:15 nixos systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Feb 27 22:08:15 nixos systemd[1]: Finished Cleanup of Temporary Directories.
Feb 27 22:08:24 nixos kernel: amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000028 SMN_C2PMSG_82:0x00000000
Feb 27 22:08:24 nixos kernel: amdgpu 0000:09:00.0: amdgpu: Failed to enable gfxoff!
Feb 27 22:08:29 nixos kernel: amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000028 SMN_C2PMSG_82:0x00000000
Feb 27 22:08:29 nixos kernel: amdgpu 0000:09:00.0: amdgpu: Failed to enable gfxoff!
Feb 27 22:08:30 nixos kernel: amdgpu 0000:09:00.0: [drm] *ERROR* [CRTC:95:crtc-1] flip_done timed out
Feb 27 22:08:31 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3400, emitted seq=3401
Feb 27 22:08:31 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Feb 27 22:08:31 nixos kernel: amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Feb 27 22:08:31 nixos kernel: amdgpu: Failed to suspend process 0x800c

Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: Machine check events logged
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc0754f4a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1709068166 SOCKET 0 APIC 3 microcode 8701030
Feb 27 15:09:36 nixos kernel: MCE: In-kernel MCE decoding enabled.

But the symptoms are:
1. The screen turns black
2. Stays black
3. Shuts off suddenly
Over SSH I'm able to pull those logs, and when rebooting I get an MCE error. Luckily the MCE error seems unrelated to my GPU itself; it seems the iwlwifi kernel driver freaks out at the same time as the GPU resets oddly. I have dmesg logs as well, but they are too big to fit in this post
@fuggy@nihil I see. Haven't seen it on my old RDNA1 card, but it has another can of worms (they pretty much axed the support in ROCm 5.3, requiring you to get a PCIe 3.0 mobo because of some obscure instruction or something). What if you blacklist the wifi card's kernel module or bind it to some dud module like vfio-pci?
@mint@ryona.agency@nihil@misskey.gg you can search for "HW problem" and see that it seems to be my Wi-Fi chipset, which I'm pretty sure uses PCIe internally, so it makes sense I guess
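For anyone who wants to repeat that "HW problem" search on a big dmesg dump, here is a minimal sketch, not anything from the actual bug report. It assumes the log was saved to a plain text file. The function name `filter_log` and the extra search strings besides "HW problem" are made up for illustration, based only on the journal lines quoted in this thread.

```python
# Substrings to look for. "HW problem" is the phrase mentioned in the thread;
# the others are assumptions taken from the journal lines quoted above.
NEEDLES = ("HW problem", "amdgpu", "mce:", "Hardware Error")


def filter_log(lines, needles=NEEDLES):
    """Return (line_number, text) pairs for lines containing any needle.

    Matching is a case-insensitive substring test; line numbers start at 1.
    """
    lowered = [needle.lower() for needle in needles]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        haystack = text.lower()
        if any(needle in haystack for needle in lowered):
            hits.append((number, text))
    return hits


# Example use ("dmesg.txt" is a hypothetical file name):
#
#     with open("dmesg.txt", errors="replace") as handle:
#         for number, text in filter_log(handle):
#             print(f"{number}: {text}")
```

Plain substring matching keeps it dumb on purpose: it only shrinks the log to something that fits in a post and says nothing about which message is the cause and which is the symptom.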
@mint@ryona.agency@nihil@misskey.gg I would probably have to figure out how to get logs by other means, but I think the Wi-Fi driver freaking out is just a symptom, not the cause. Also I'm limited to PCIe 3 because my motherboard is absolute dogshit; for the same reason I can't go above 64 GB of RAM, which I would like to