Four months back I muted Poa.st for a day just to see what happened. The results were surprising. What I don't understand about defederation is this: just because you don't want to talk to someone, nobody else should be able to either? That's fucked.
Bruh, I repeatedly said we need to Saigon-taxi the worthwhile ones out. They don't want to move; it's new and scary to them. It's why S2 and Styx both have Gab accounts and are absent here.
I was too forgiving is my point. Some garbage should stay in the bin where you found it.
> ask me how I know he doesn't test shit before pushing updates.
"there isn't a Pleroma instance that exists which cannot handle the load on available hardware" still irks me because it means that the person that wrote it was ignoring the performance issues that had been reported in the bug tracker as well as in messages on fedi. (I don't think most people could have convinced Pleroma to stay up when subjected to FSE's load.) I guess nothing has changed since FSE's last merge with upstream.
@p @phnt @cvnt @ins0mniak @transgrammaractivist @Owl @dj @NonPlayableClown In his defense, he's been a big help with identifying the root cause of a performance issue (a bug in Oban that made it crash its queue processing tasks) that until that point only affected my instances and one other guy who made a bug report.
The bae.st media import is actually running on a CM4; the real bottleneck is the disk and it's chugging, but I just sort of took it as a given that this is something I have to fix. There was an instance running on a Switch for a while, and I'm sure you're aware of mint's antics. lain cares about that kind of platform, so it's nice that lain is around.
@p @cvnt @ins0mniak @transgrammaractivist @Owl @dj @NonPlayableClown The main problem is that lain likely never expected Pleroma to grow this large, to instances with hundreds of users, something which would have influenced some of the design decisions. And that the developers don't run those big, active instances.
I don't remember what version FSE runs on (2.2.X I think), but I can say that at least things didn't get worse with releases from 2.5.0 to current develop (apart from maybe increased IO usage I have yet to investigate in 2.7.0). I heard that some of the expensive DB queries were improved recently. Feld also committed some fixes found with static analysis and that's about it.
You can still run Pleroma on cheap hardware, it just requires a lot of know-how in optimizing Postgres and Linux. FluffyTail runs on BuyVM's $7 tier with the DB stored on a slab, and it has yet to encounter a major performance issue after 1.5 years. Granted, I'm the only user, so there isn't a lot of stress on the system.
> The main problem is that lain likely never expected to Pleroma grow this large to instances with hundreds of users.
Well, sure, I made that remark in the linked-to thing:
> FSE is an edge case, but will not be forever. I think it is a good problem to have that scaling is becoming an issue, you don't wanna build super scalability into something that nobody ends up using, but it is the case that there are scaling issues that presently affect pleroma instances.
And that's November 2020.
> And that the developers don't run those active big instances.
I offered perf datasets many times. Server logs, Postgres logs. At the time that I wrote that post, there was a Prometheus endpoint that was open to the public on every single instance, so an interested party could collect performance data on every live Pleroma instance; it was closed in the release immediately after that post.
I get that it doesn't touch them and I get that the idea wasn't to run instances with this many users, but it's not the number of users that introduces the scaling issues: it's the number of activities.
> I don't remember what version FSE runs on (2.2.X I think),
@p @phnt @cvnt @ins0mniak @Owl @dj @NonPlayableClown making a release in the middle of rewriting the work queue was the stupid bit; not making a release a month after it's fixed, and weeks after people complained about it, isn't any better
they could cherry-pick that single commit into 2.7.2 (2.7.1 was a pleroma-fe emoji picker bug fix release apparently) for all i care
pleroma's codebase needs a hard fork, but with federation details relying on implementation edge cases that can't be allowed to change accidentally, it's hard to muster up the willpower to care enough to go through with it
I'm always running my changes live on my instances. Those servers used to be massively overpowered; now I have a severely underpowered server and it's still fine.
If I could reproduce reported issues it would be much easier to solve them but things generally just work for me.
A ton of work has been put into correctness (hundreds of Dialyzer fixes) and tracking down elusive bugs and looking for optimizations like reducing JSON encode/decode work when we don't need to, avoiding excess queries, etc.
I'm halfway done with an entire logging rewrite and telemetry integration which will make it even easier to identify bottlenecks.
> I mean, like I mentioned, the Prometheus endpoints were public at the time.
The problem is that this data is useful for monitoring the overall health of an instance but doesn't give granular enough information to track down a lot of issues. With the metrics/telemetry work I have in progress, we'll be able to export more granular Pleroma-specific metrics that will help a lot.
> The main bottleneck is the DB
So often it's just badly configured Postgres. If your server has 4 cores and 4GB of RAM, you can't go use pgtune and tell it you want to run Postgres with 4 cores and 4GB; there's nothing left over for the BEAM. You want at least 500MB-1GB dedicated to BEAM, more if your server has a lot of local users, so it can handle memory allocation spikes.
And then what else is running on your OS? That needs resources too. There isn't a good way to predict the right values for everyone. 😭 Like I said, it's running *great* on my little shitty thin client PC with old slow Intel J5005 cores and 4GB RAM. But I have an SSD for the storage and almost nothing else runs on the OS (FreeBSD). I'm counting a total of 65 processes before Pleroma, Postgres, and Nginx are running. Most Linux servers have way more services running by default. That really sucks when trying to make things run well on lower specced hardware.
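To make that split concrete, here's roughly the kind of thing I mean for a 4-core/4GB box. Treat every number below as an illustrative assumption, not a recommendation; the right values depend on your workload:

```
# postgresql.conf sketch for a 4-core / 4GB server that also runs Pleroma
# (illustrative values only; the point is leaving headroom for BEAM and the OS)
shared_buffers = 768MB          # less than the ~1GB pgtune would pick for a "4GB" box
effective_cache_size = 2GB      # assume ~2GB of page cache left after BEAM takes its share
work_mem = 16MB                 # keep per-query memory modest
maintenance_work_mem = 128MB
max_worker_processes = 4
max_parallel_workers = 2        # leave cores free for BEAM and everything else
```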
You also have to remember that BEAM is greedy and will intentionally hold the CPU longer than it needs to because it wants to produce soft-realtime performance. This needs to be tuned down on lower-resource servers, because otherwise BEAM itself will be preventing Postgres from doing productive work; it's just punching itself in the face at that point. Set these vm.args on any server that isn't massively overpowered:
+sbwt none +sbwtdcpu none +sbwtdio none
> using an entire URL for an index is costing a lot in disk I/O
For the new Rich Media cache (link previews stored in the db so they're not constantly refetched) I hashed the URLs for the index for that same reason. Research showed a hash and the chosen index type were super optimal.
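Schematically it's something like this. The table and column names below are made up for illustration and the real migration may differ, but it shows the "index the hash, not the URL" idea:

```elixir
defmodule MyApp.Repo.Migrations.AddRichMediaCacheSketch do
  # Illustrative Ecto migration: index a fixed-size hash of the URL
  # instead of the full URL text, so the index stays small.
  use Ecto.Migration

  def change do
    create table(:rich_media_cache_sketch) do
      add :url, :text, null: false
      # e.g. SHA-256 of the URL, computed in the application before insert
      add :url_hash, :binary, null: false
      add :fields, :map
      timestamps()
    end

    create unique_index(:rich_media_cache_sketch, [:url_hash])
  end
end
```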
Another thing: I noticed we were storing *way* too much data in Oban jobs. Like when you federated an activity, we were taking the entire activity's JSON and storing it in the jobs. Imagine making a post with 100KB of content that needs to go to 1000 servers: each delivery job in the table was HUGE. Now it's just the ID of the post and we do the JSON serialization at delivery time. Much better: lower resource usage overall, lower IO.
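As a sketch of the shape (the module and helper names here are hypothetical, not the actual Pleroma worker):

```elixir
defmodule MyApp.Workers.DeliverySketch do
  # The job row stores only the activity ID and the target inbox;
  # the JSON body is built at delivery time instead of in every job.
  use Oban.Worker, queue: :federator_outgoing, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"activity_id" => id, "inbox" => inbox}}) do
    # MyApp.Activities and MyApp.Publisher are stand-ins for the real code.
    activity = MyApp.Activities.get!(id)
    json = Jason.encode!(MyApp.Activities.render(activity))
    MyApp.Publisher.deliver(inbox, json)
  end
end
```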
Even better would be if we could serialize the JSON *once* for all deliveries, but it's tricky because we gotta change the addressing for each delivery. The Jason library has some features we might be able to leverage for this, but it doesn't seem important to chase yet. Even easier might be to put placeholders in the JSON text, store it in memory, and then just use regex or cheaper string replacement to fill those fields at delivery time. That saves all the repeated JSON serialization work.
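The placeholder idea would look roughly like this; purely a sketch of the approach, with made-up sample data, not code that exists anywhere:

```elixir
# Serialize the activity once with a placeholder for the per-recipient
# addressing, then patch it with cheap string replacement per delivery.
activity = %{"type" => "Create", "content" => "hello", "to" => []}
inboxes = ["https://example.com/inbox", "https://example.org/inbox"]

template =
  activity
  |> Map.put("to", "__TO__")
  |> Jason.encode!()

bodies =
  for inbox <- inboxes do
    # Swap the quoted placeholder for a real JSON array of recipients,
    # instead of re-encoding the whole activity for every delivery.
    String.replace(template, ~s("__TO__"), Jason.encode!([inbox]))
  end
```

The `bodies` list would then be handed off to whatever does the actual HTTP delivery.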
Other things I've been doing:
- making sure Oban jobs whose errors we should really treat as permanent are caught and not allowed to repeat. Retrying those is wasteful for us and rude to remote servers when we're fetching things (there's a sketch of the idea after this list)
- finding every possible blocker for rendering activities/timelines and making those things asynchronous. One of the most recent ones I found was with polls. They could stall rendering a page of the timeline if the poll wasn't refreshed in the last 5 mins or whatever. (and also... I'm pretty sure polls were still being refreshed AFTER the poll was closed 🤬)
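For the permanent-error bullet, the general shape is to map "this will never succeed" errors to a cancel instead of a retry (assuming a recent Oban where returning `{:cancel, reason}` stops retries; the fetcher module and error shapes below are made up):

```elixir
defmodule MyApp.Workers.RemoteFetchSketch do
  # Sketch: treat errors that can never succeed as permanent instead of retrying.
  use Oban.Worker, queue: :remote_fetcher

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"id" => ap_id}}) do
    case MyApp.Fetcher.fetch_object(ap_id) do
      {:ok, object} ->
        {:ok, object}

      # A 404/410 from the remote won't get better on retry, so stop here
      # rather than hammering the remote server with repeat fetches.
      {:error, {:http_status, status}} when status in [404, 410] ->
        {:cancel, :object_gone}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```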
I want Pleroma to be the most polite Fedi server on the network. There are still some situations where it's far too chatty and sends requests to other servers that could be avoided, so I'm trying to plug them all. Each of these improvements lowers the resource usage on each server. Just gotta keep striving to make Pleroma do *less* work.
I do have my own complaints about the whole Pleroma releases situation. I wish we were cutting releases like ... every couple weeks if not every month. But I don't make that call.
> Nobody proved there was an *Oban* bottleneck and still haven't.
Well, this was a remark from years back. (It does still irk me.) Everything I know about the current Oban bug is second-hand; I am running what might be the only live Pleroma instance with no Gleason commits (happy coincidence; I was actually dodging another extremely expensive migration and then kicked off the other project, which meant I didn't want to have to hit a moving target if I could avoid it, so I stopped pulling). At present, I backport a security fix (or just blacklist an endpoint) once in a while.
Unless you mean the following thing, but I haven't run 2.7.0, so I don't know what that bug is.
> If I could reproduce reported issues it would be much easier to solve them but things generally just work for me.
I mean, like I mentioned, the Prometheus endpoints were public at the time. You could see my bottlenecks. (I think that would be cool to reenable by default; they'd just need to stop having 1MB of data in them if people are gonna fetch them every 30s, because enough people doing that can saturate your pipe.)
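(Quick back-of-envelope with an assumed scraper count: 50 instances each pulling 1MB every 30s is about 1.7MB/s, call it 13-14Mbit/s of steady upload spent on nothing but metrics.)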
> A ton of work has been put into correctness (hundreds of Dialyzer fixes) and tracking down elusive bugs and looking for optimizations like reducing JSON encode/decode work when we don't need to, avoiding excess queries, etc.
I'm not sure what the Dialyzer is (old codebase), but improvements are good to hear about. That kind of thing gets you a 5%, 10% bump to a single endpoint, though. The main bottleneck is the DB; some cleverness around refetching/expiration would get you much larger performance gains, I think, and using an entire URL for an index is costing a lot in disk I/O. There's a lot of stuff to do, just not much of it is low-hanging fruit.