@nukie literally where the fuck did this dude go, he switched his xmpp pfp to hello kitty or some shit. I'm scared that he actually went trans for real this time
@Tij@nukie Actually, I'm tired of waiting and I want to see how it performs now, so I merged the develop branch and changed the Oban dependency to be pulled from git; I'll update the instances later. Also @feld, apparently adding the Lazarus plugin broke accessing the DB config: adminfe just shows an empty settings dropdown, while nu-PleromaFE throws "TypeError: n.tuple is undefined". This doesn't happen after switching to the develop branch, and I don't see anything out of place in /api/pleroma/admin/config aside from that. Nothing too critical, I just had to change config.exs instead to reduce the liability of a hostile actor on one of the instances. Screenshot_20240825_195527.png
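For reference, a minimal sketch of what pulling Oban from git instead of Hex can look like in mix.exs; the repo URL and branch here are assumptions for illustration, not the exact change made on these instances.

```elixir
# mix.exs of a hypothetical app -- sketch of swapping the Hex package for a git checkout.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", elixir: "~> 1.13", deps: deps()]
  end

  defp deps do
    [
      # URL and branch are assumptions; pin a :ref instead of a branch if you
      # want reproducible builds.
      {:oban, git: "https://github.com/oban-bg/oban.git", branch: "main"}
    ]
  end
end
```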
@mint@feld I updated to the latest Oban from git yesterday and it survived a DB repack, which is an improvement, I guess. Other actions that would previously crash Pleroma (related to the 502 gateway issue; that issue is a special case of this) no longer seem to do so.
Today Husky crapped out on me, probably because it couldn't authenticate while API requests were being dropped by db_connection. Increasing queue_target and queue_interval did help with that, but it might have other side effects.
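For context, those two knobs live in the Ecto repo config (db_connection's documented defaults are 50 ms and 1000 ms). A minimal sketch with made-up values, not the ones actually used here:

```elixir
# config/config.exs -- hypothetical values, for illustration only.
import Config

config :pleroma, Pleroma.Repo,
  # queue_target: threshold (ms) for time spent waiting in the pool queue; if
  # waits stay above it for a whole queue_interval (ms), db_connection starts
  # dropping new requests with "request was dropped from queue" errors.
  # Raising both trades dropped requests for longer waits under load.
  queue_target: 200,
  queue_interval: 2_000
```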
It's too early to tell if the newer Oban helps with the stalling federation. At least the performance isn't worse. @nukie@Tij
@phnt@feld@nukie@Tij >It's too early to tell if the newer Oban helps with the stalling federation
The description of the commit is a fairly close match to the symptoms we observed (Oban unaliving itself after receiving too many DB timeouts), so we'll see.
@phnt@Tij@feld@nukie Oh, it crashed in its entirety, but got restarted by the healthcheck script. Nothing in the logs except hundreds of DBConnection.ConnectionErrors, though there's also a bunch of "ERROR 57014 (query_canceled)" and "unknown registry: Pleroma.Web.StreamerRegistry" messages.
I have no idea what the supervisor tree in Pleroma looks like, but my theory is that after enough db_connection errors, the failures slowly propagate upward and eventually reach Pleroma's own supervisor. The maximum number of restarts is set to 3 in the default config, and once that is exceeded, the supervisor exits and init restarts it; see the sketch below.
There's a somewhat rare case where the Pleroma application shuts down completely, but the OS process itself still exists and therefore doesn't get restarted by init. That's the issue I talked about.
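A minimal, hypothetical illustration of that restart-intensity mechanism, not Pleroma's actual supervision tree: once a child exceeds max_restarts within max_seconds (OTP defaults: 3 in 5 seconds), the supervisor itself exits, and the failure keeps propagating upward until the application terminates.

```elixir
# Hypothetical demo, not Pleroma code. CrashyChild stands in for whatever
# process keeps dying on db_connection errors.
defmodule CrashyChild do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule DemoSupervisor do
  use Supervisor

  def start_link(_arg), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # More than 3 child crashes within 5 seconds and this supervisor gives up
    # and exits too, pushing the failure one level up the tree.
    Supervisor.init([CrashyChild], strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end

# Killing the child in a tight loop exhausts the restart budget:
# for _ <- 1..5, do: Process.exit(Process.whereis(CrashyChild), :kill)
```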
@phnt@nukie@Tij@mint if we can figure out what gets caught in the fast crash loop, we can change the way it starts that service to prevent that from crippling the app
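One hedged sketch of what "changing the way it starts that service" could look like, assuming the culprit is identified: wrap it in its own supervisor with a larger restart budget (or a :temporary restart strategy) so a fast crash loop exhausts that subtree's intensity instead of the top-level supervisor's. All names below are hypothetical.

```elixir
# Hypothetical sketch, not actual Pleroma code. SuspectWorker stands in for
# whichever child turns out to be crash-looping.
defmodule SuspectWorker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule IsolatedSupervisor do
  use Supervisor

  def start_link(_arg), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # A much bigger restart budget than OTP's 3-in-5-seconds default; if this
    # subtree still gives up, only it goes down rather than the whole app.
    Supervisor.init([SuspectWorker], strategy: :one_for_one, max_restarts: 50, max_seconds: 60)
  end
end
```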
It's very hard to tell, because even loading the FE sometimes floods the logs with DBConnection errors. Currently I have no way of even somewhat reliably triggering the crash.
There's one log in one of the other threads from a case that at least partially crippled Pleroma into not listening on any ports without causing a restart. https://fluffytail.org/notice/Al1dQDXk8Erhmg31sW
I'll look through my logs tomorrow for a proper log where Pleroma exited completely.
@feld Sorry for the delay. I've looked through my logs, and the last time Pleroma shut down and restarted was 11 days ago (no crash or stalled federation since then), when I was still running Oban 2.13.6.
The logs are mostly the same as the ones in the other thread linked above: lots of "connection not available and request was dropped from queue after X ms" errors, the occasional "connection closed by the pool, possibly due to a timeout..." message from db_connection, and even rarer "cancelling statement due to user request" messages from postgrex.
The db_connection errors always come in big batches, usually when disk iowait increases, which isn't under my control. They're not caused by Postgres autovacuum, as that runs much more frequently.