@colonelj@Terry Something weird has been going on with bae.st. It started a few weeks ago, but there were some odd incidents before that with them sending double posts.
It's been hitting both the public inbox endpoint and the user-specific inboxes. I don't know why it's doing that, but it's even doing it for users that no longer exist.
I don't think it's that, though; I think it's probably that the insert takes us longer than bae.st's local timeout. (cc @sjw ) Delivering a post is several steps¹, most of which have overlapping timers on both ends, so something can get delivered but not be viewed as a success on the other end, and they try to deliver it again.
There should probably just be a uniqueness constraint on (user_id, type, activity_id) for notifications. That would slow writes down further, though (explanation in the footnote), and they're already slow.
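A minimal sketch of that migration, assuming the stock column names (untested, module name made up):

    defmodule Pleroma.Repo.Migrations.NotificationsUniqueness do
      use Ecto.Migration

      # Build the index concurrently so the huge notifications table isn't locked
      # while it builds; that requires running outside a transaction.
      @disable_ddl_transaction true
      @disable_migration_lock true

      def change do
        create unique_index(:notifications, [:user_id, :type, :activity_id],
                 concurrently: true
               )
      end
    end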
¹ Abridged and simplified:
1. A bae.st worker receives the job to deliver the post to FSE. bae.st's DB connection pool checkout timer starts.
2. The worker does a DNS lookup. This is usually fast, but it should be cached locally, because a lot of lookups happen and the worker is holding a DB connection from the pool the whole time.
3. The worker eventually sends the signed request to FSE to deliver the post. The web request timer starts.
4. FSE receives the post over HTTP. nginx and Pleroma both start incoming request timers.
5. FSE starts processing the post. The post gets verified, goes through the MRF chain, etc. FSE's DB connection pool checkout timer starts.
6. FSE completes processing, including inserting the notification into the notifications table, which is huge.
7. FSE finishes processing the request. FSE's DB connection pool timer stops.
8. FSE sends the response back out through nginx. Pleroma's and nginx's request timers stop.
9. bae.st receives the response. bae.st's web request timer stops.
10. The bae.st worker processes the response, notes that it is a success, and records that to the DB. bae.st's DB connection pool checkout timer stops.
11. The job queue records the job's success and the job stops getting retried.
If any of those timers expires after step 6 finishes but before step 11 finishes, double delivery can be attempted and, although the post would not get recorded twice, double notifications might be created. A number of these steps are slower than they might be, like step 6: FSE has received 9,679,872 notifications so far; after several purges the table currently holds 3,385,727 rows, but it is constantly hit by reads and writes and is indexed all to hell (which makes writes to the table more expensive). Like Poast, bae.st's outgoing requests go through proxies, and some of those proxies (e.g., M247) have been responsible for spam accounts, so FSE doesn't treat all IP ranges the same. (There are a lot of weird things going on, because bae.st and FSE are both old servers that have accumulated quirks and scar tissue, plus both servers are kind of, like, outliers in terms of use.)
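For reference, most of those timers map onto ordinary Pleroma/Ecto config knobs like these, on both ends (values illustrative, not what either server actually runs):

    config :pleroma, Pleroma.Repo,
      pool_size: 20,        # DB connections available to workers (the pools in steps 1 and 5)
      queue_target: 50,     # ms a pool checkout may wait before timeouts start firing
      queue_interval: 1000,
      timeout: 15_000       # per-query timeout; the step-6 insert has to beat this

The sender's HTTP timeout (step 3) is a separate knob in the HTTP client config on bae.st's side.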
Typically the way you solve this kind of thing is a two-phase commit, but ActivityPub has no provision for that. It's just "toss the activity at the other server and let's all hope for the best". Granted, the network is not designed to be bulletproof, and it's difficult to design a decentralized protocol that *is* bulletproof; even if it's designed well, that sort of thing is more complicated to implement than the naïve version, so a two-phase commit might have seemed like overkill. (Something tells me that The Mastadan Netwark's German dictator was closer to "didn't know what a two-phase commit is" than "prudently decided not to complicate the protocol". In any case, if you want lots of implementations and broad adoption, it's better not to require more complicated implementations than you have to.)
So you end up having to deduplicate on both ends. (Content-addressed storage makes deduplication trivial. :revolvertan:)
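On the receiving end, given a unique index like the one above, the dedup is just telling Postgres to skip the conflicting insert. Rough sketch, not Pleroma's actual notification code; the module and helper names are made up:

    defmodule NotificationDedup do
      alias Pleroma.{Notification, Repo}

      # A re-delivered activity hits the unique index and becomes a no-op
      # instead of a second notification row.
      def create(user_id, activity_id, type) do
        %Notification{user_id: user_id, activity_id: activity_id, type: type}
        |> Repo.insert(
          on_conflict: :nothing,
          conflict_target: [:user_id, :type, :activity_id]
        )
      end
    end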
@p@Terry@sjw@colonelj Worth noting that Pleroma also sends out notifications to the user's websocket transiently, before they even land in the DB. I've seen it happen with @iamtakingiteasy's reportbot when there was an upstream bug with random reports not federating.
> Worth noting that Pleroma also sends out notifications to the user's websocket transiently,
If I've seen it happen, then it can't be that, because I'm using bloat. Now that I think about it, though, for all I know there could be an off-by-one error in the paging of notifications: muted threads always create moderate flakiness. I haven't seen this in the ssh client, which ignores thread-muting.
Next time it happens, I will have a look at the DB.
@mint@Terry@colonelj@iamtakingiteasy@sjw Also, Terry, while you're here I wanted to watch that video again but I can't find it and I swear I saved it. The garbage robot future one.
> Next time it happens, I will have a look at the DB.
Okay, it's two rows in the DB: activity AdVKpA5RgC3jKDypGa occupies rows 9680118 and 9680119, both with my user_id and type "mention". So it's definitely a backend problem. It coincided with a burst of traffic. (The traffic was Applebot and archive.org, and apparently some dipshits are hitting an RSS feed on the blog every 30 seconds, so I should probably trim it instead of just putting everything in there.)
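If anyone else wants to check their own DB, something like this from iex should turn up dupes (assumes the standard notifications columns):

    import Ecto.Query

    from(n in "notifications",
      group_by: [n.user_id, n.type, n.activity_id],
      having: count(n.id) > 1,
      select: {n.user_id, n.type, n.activity_id, count(n.id)}
    )
    |> Pleroma.Repo.all()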
@mint@Terry@colonelj@iamtakingiteasy@sjw Well, after dumping some of the traffic, I have verified that double delivery is going on. I do not know why. FSE indicates a successful result, and then bae.st keeps trying to re-deliver and gets 5xx responses starting with the second attempt.
I wish @sjw would take an interest in this, because bae.st today has been responsible for 6.8% of *all* requests hitting FSE and 63.6% of the 5xx errors; yesterday it was 10.7% of all requests and 80.3% of the 5xx errors, and the server keeps getting flooded because everything on his server is getting re-sent repeatedly. bae.st is responsible for 53.54% of the POST requests. It is enough traffic to affect the ping times. I am going to start just knocking /16s off the wall until my pipe is no longer saturated, because I have to work. screenshitter.png
@sjw@Terry@colonelj@p@mint Most likely the entry isn't being removed from the oban federator_outgoing queue. Purely software errors might get recorded in the errors column; that should be the easiest thing to check first. If not, enabling debug logs, triggering or waiting for a delivery attempt, and then sifting through many megabytes of output would be required. Postgres logs from the same moment might also be helpful. It could be a conflicting lock, a transaction getting rolled back due to a software error, or some database error. Either way, the combined output is likely to contain some hints.
Further speculation: runtime cache(s) not getting evicted, or lingering oban workers from restarts/reloads that didn't terminate the main process. If the issue stops reproducing after a clean instance restart, that could be the case, but locating such issues is bound to be an involved and interactive process.
It could also be a synchronization issue, if multiple workers are somehow getting the same task, but that's nearing improbability; a row-locking postgres queue is not that difficult to implement correctly, so I'd exhaust other options before doubting the oban implementation.
> Purely software errors might get recorded in the errors column; that should be the easiest thing to check first.
Seconded, yeah. The trick is figuring out which errors are real and which ones are incorrectly recorded as errors. @sjw , maybe try interacting with the @sjw account and see if you can double-notify yourself?
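Something like this from iex ought to surface them, assuming the stock Oban table and queue name:

    import Ecto.Query

    # Jobs that have errored: retryable ones will be attempted again, discarded ones gave up.
    from(j in "oban_jobs",
      where: j.queue == "federator_outgoing" and j.state in ["retryable", "discarded"],
      order_by: [desc: j.attempted_at],
      limit: 20,
      select: {j.id, j.state, j.attempt, j.errors}
    )
    |> Pleroma.Repo.all()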
> It could also be a synchronization issue, if multiple workers are somehow getting the same task, but that's nearing improbability;
Yeah, seems unlikely. I bet it's just that the number of workers is set too high, so they're running into a bottleneck and timing out. (It seems unlikely that any of the timeouts is set too *low*.)
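For reference, that knob is just the Oban queue concurrency in the instance config, e.g. prod.secret.exs (numbers here are illustrative, not a recommendation):

    config :pleroma, Oban,
      queues: [
        federator_outgoing: 10,
        federator_incoming: 10
      ]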
> Wonder if it's related to recently introduced priority queue.
Sounds likely. I had a look at the diff (https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4004/diffs) and it looks like it just uses the "cc" attribute and doesn't distinguish by type, which matches what I'm seeing: a big flood of "Like" activities. See publisher.ex:141 and :217; there's some discussion around the problem there. When I did the dump, it was Likes getting delivered to specific people and to followers addresses. (e.g., Terry smashed that like on a post on some other server, but https://bae.st/users/Terry/followers was in the CC field.)
I was just talking to @lanodan (who did the merge, according to https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4004 ) about some unrelated stuff (periodic "PL complaints" thread :terrysmoking:), but I'm not sure there's enough information even to file an issue. The code seems a little shaky (I have only had an initial look at it, I could be wrong), but it does seem to match the behavior I'm seeing here. The dedup is a trivial string match on line 258; maybe it should dedup by host. I'm not sure that I have read the bit that handles shared/private inboxes correctly.
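Something along these lines is what I mean by deduping by host; sketch only, shared_inbox_for/1 is a stand-in, and you'd only want to collapse recipients when the host actually advertises a shared inbox:

    # Illustrative, not the publisher.ex code.
    defmodule DedupSketch do
      def dedup_by_host(inboxes, shared_inbox_for) do
        inboxes
        |> Enum.group_by(fn inbox -> URI.parse(inbox).host end)
        |> Enum.map(fn
          {_host, [single_inbox]} -> single_inbox
          {host, _several} -> shared_inbox_for.(host)
        end)
      end
    end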
> None of that would've been an issue if sjw didn't shit his instance up with hundreds of bots reposting images from Danbooru.
I think they're responsible for only a small amount of the actual traffic; it's probably just a lot of people following each other on both servers.
@sjw@Terry@colonelj@p@mint There would be fewer reattempts, but the problem isn't the reattempts themselves; it's that they occur after a successful delivery. And for actually unsuccessful deliveries, it would lower the chances of eventually reaching a temporarily offline instance.
> There would be fewer reattempts, but the problem isn't the reattempts themselves; it's that they occur after a successful delivery.
Yeah, I think that's it; I think it's the change mint mentioned. If it were related to some bottleneck like I was thinking, then lowering the rate would have fixed it. Since it's still happening even though he has lowered the rate, it's safe to say it's not a bottleneck or a race condition; it's probably a logic problem.
> my posts here not federating to beast for 10 hours
I think "knocking off /16s" was probably thicker jargon than the audience would warrant. (In my defense, I was busy.) Basically I killed off bae.st traffic for a little while so that the box didn't get flooded excessively until I was less busy and could fix it. I freed up one of the addresses.
@sjw@Terry@colonelj@iamtakingiteasy@lanodan@mint It should be easy to tell if it fixes it, because the vast majority of incoming POSTs (including interactive use), around 80%, are from bae.st, and another ~15% are from ryona.agency, so I wonder if that patch is on there too. 80+15=95, so 19 out of 20 requests are from two instances that I am fairly certain do not, when the software is working correctly, send enough activities to swamp the server. everything_is_baest.png
@p@Terry@sjw@iamtakingiteasy@colonelj@lanodan Yeah, it's been merged upstream and I synced the sources a while ago, since there's a long-standing problem with delayed federation on cum.salon, so I thought partially mitigating it might be good.
@Terry How the average journalist is more trustworthy than some anonymous retard. Like that time you believed some Ukrainian guy did weird porn and made emojis of it.
@colonelj They ran with that headline that both you and I know was meant to give the impression that the country of Russia had completely run out of bullets.
@p@Terry@sjw@colonelj When did FSE fork from Pleroma? Because the behavior should be to enqueue messages right after the HTTP signature is verified, so Transmogrifier/MRF/… don't count toward the HTTP response lag. And that part of the code last changed in 2019.
Well, not really a fork; I just got bit by some updates, so I stopped updating so frequently, and now I don't want to have to deal with a moving target while I'm hacking on Revolver. FSE is on 2.2.2, with some manually adapted security fixes and some dumb patches, etc. ( https://git.freespeechextremist.com/gitweb/?p=fse;a=summary ) I think FSE might be one of the few Pleromae that has no Gleason code on it.
> And that part of the code last changed in 2019.
Well 2.2.2 is January 2021, so 2019 issues shouldn't apply. But the problem appears to be the priority delivery patch (merge request 4004); see mint's post and my confirmation.
@sjw@Terry@colonelj@iamtakingiteasy@lanodan@mint Okay, that's weird; it's like you don't have that patch included. So it's not a randomly determined timeout problem, and it's not this patch, because you don't have this patch.