@colonelj @Terry Something weird is going on with bae.st; it started a few weeks ago, but there were some weird incidents before that with them sending double posts.
It's been hitting both the public inbox endpoint and the user-specific inboxes. I don't know why it's doing that, but it's even doing it for users that no longer exist.
I don't think it's that, though; I think it's probably that it takes us longer to do the insert than bae.st's local timeout. (cc @sjw) Delivering a post takes several steps¹, most of which have overlapping timers on both ends, so something can get delivered but might not be viewed as a success on the other end, and they try to deliver it again.
There should probably just be a uniqueness constraint on (user_id, type, activity_id) for notifications. That would slow things down further, though (explanation in the footnote), and it's already slow.
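Roughly this shape, as a sketch (sqlite3 standing in for Postgres so it's self-contained; on Postgres it would be a unique index plus `INSERT ... ON CONFLICT DO NOTHING`, and the table here is a made-up miniature of the real notifications table):

```python
# Sketch of dedup-by-constraint, using sqlite3 so it runs anywhere.
# The real table is Postgres and has more columns; names here are assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE notifications (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        type TEXT NOT NULL,
        activity_id TEXT NOT NULL,
        UNIQUE (user_id, type, activity_id)
    )
""")

def notify(user_id, type_, activity_id):
    # "INSERT OR IGNORE" is sqlite's spelling of Postgres's
    # "INSERT ... ON CONFLICT DO NOTHING": a second delivery of the
    # same activity just doesn't create a second notification.
    db.execute(
        "INSERT OR IGNORE INTO notifications (user_id, type, activity_id) VALUES (?, ?, ?)",
        (user_id, type_, activity_id),
    )
    db.commit()

notify(1, "mention", "https://bae.st/objects/abc")  # first delivery
notify(1, "mention", "https://bae.st/objects/abc")  # retried delivery, deduplicated
print(db.execute("SELECT COUNT(*) FROM notifications").fetchone()[0])  # -> 1
```

The cost is that the constraint is backed by an index, so every insert pays for one more index update on a table that already gets hammered.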
¹ Abridged and simplified:
1. Worker receives job to deliver post to FSE. Baest DB connection pool checkout timer starts.
2. Worker does a DNS lookup. This is usually fast, but it should be cached locally, because a lot of lookups happen and the worker is holding a DB connection from the pool the whole time.
3. Worker eventually sends the signed request to FSE to deliver the post. Web request timer starts.
4. FSE receives the post by HTTP. nginx and Pleroma both start incoming request timers.
5. FSE starts processing the post. The post gets verified, goes through the MRF chain, etc. FSE's DB connection pool checkout timer starts.
6. FSE completes processing, including inserting the notification into the notifications table, which is huge.
7. FSE finishes processing the request. FSE's DB connection pool timer stops.
8. FSE sends the request back out through nginx. Pleroma and nginx request timers stop.
9. Baest receives the response. Baest's web request timer stops.
10. Baest worker processes the response, notes that the delivery was a success, and records that to the DB. Baest DB connection pool checkout timer stops.
11. The job queue records the job's success and the job stops getting retried.
If any of the timers expires after step 6 is over but before step 11 is over, double deliveries could be attempted and, although the post would not get recorded twice, double notifications might be created. A number of these steps are slower than they might be, like step 6: FSE has received 9,679,872 notifications so far; after several purges, the notifications table currently holds 3,385,727 rows, but it is frequently and repeatedly hit by reads and writes and is indexed all to hell (which makes writes to it more expensive). Like Poast, bae.st sends its outgoing requests through proxies, and some of those proxies (e.g., M247) have been responsible for spam accounts, so FSE doesn't treat all IP ranges the same. (There are a lot of weird things going on, because bae.st and FSE are both old servers that have accumulated quirks and scar tissue, plus both servers are kind of, like, outliers in terms of use.)
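Here's a toy version of the race, with made-up timings and a plain list standing in for the notifications table instead of real HTTP requests and the DB; the point is just that a delivery can succeed on the receiving end while still being counted as a failure on the sending end, so the retry produces a duplicate:

```python
# Toy model of the race between the sender's delivery timeout and the
# receiver's slow insert. Timings and names are all made up.
import time

SENDER_TIMEOUT = 0.1   # bae.st's side gives up after this long
INSERT_TIME    = 0.2   # FSE's notification insert takes longer than that

notifications = []     # stands in for FSE's notifications table (no unique constraint)

def fse_receive(activity_id):
    """FSE processes the delivery (steps 4-8): slow insert, then a 200 response."""
    time.sleep(INSERT_TIME)
    notifications.append(activity_id)   # the write happens on FSE's side...
    return 200

def baest_deliver(activity_id, max_attempts=3):
    """bae.st's delivery job (steps 1-3, 9-11): retries anything it counts as a failure."""
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        status = fse_receive(activity_id)
        elapsed = time.monotonic() - start
        if status == 200 and elapsed <= SENDER_TIMEOUT:
            return attempt              # only now does the job stop getting retried
        # ...but the response came back after the timer expired, so the job is
        # recorded as a failure and the post gets delivered again.
    return max_attempts

attempts = baest_deliver("https://bae.st/objects/abc")
print(attempts, len(notifications))     # -> 3 3: one post, three notifications
```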
Typically, the way you solve this kind of thing is a two-phase commit, but ActivityPub doesn't have a provision for that kind of thing. It's just "toss the activity at the other server and let's all hope for the best". Granted, the network is not designed to be bulletproof, and it's difficult to design a decentralized protocol that *is* bulletproof; even with a good design, that sort of thing is more complicated to implement than the naïve version, so a two-phase commit might have seemed like overkill. (Something tells me that The Mastadan Netwark's German dictator was closer to "didn't know what a two-phase commit is" than "prudently decided not to complicate the protocol". In any case, if you want lots of implementations and broad adoption, it's better not to require more complicated implementations than you have to.)
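For the record, a two-phase commit is roughly this shape (a bare-bones sketch, not anything ActivityPub defines): nobody applies the change until everyone has voted yes in a prepare phase, so either both sides record the delivery or neither does.

```python
# Bare-bones two-phase commit, just to show the shape ActivityPub doesn't have.
class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None
        self.committed = []

    def prepare(self, activity):
        # Phase 1: durably stage the work and promise to commit it if asked.
        self.staged = activity
        return True                     # vote "yes"

    def commit(self):
        # Phase 2: apply the staged work; must not fail after voting yes.
        self.committed.append(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(participants, activity):
    # Phase 1: collect votes. Any "no" (or a timeout) means everybody aborts.
    if all(p.prepare(activity) for p in participants):
        for p in participants:          # Phase 2: everyone voted yes, so commit.
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

sender, receiver = Participant("bae.st"), Participant("FSE")
two_phase_commit([sender, receiver], "post 123")
print(receiver.committed)               # -> ['post 123'] on both sides, or on neither
```

Even this sketch requires both sides to persist the "prepared" state and agree on recovery rules, which is exactly the kind of coordination the protocol doesn't ask for.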
So you end up having to deduplicate on both ends. (Content-addressed storage makes deduplication trivial. :revolvertan:)
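Sketch of that last bit (the hash choice and storage layout here are arbitrary): the key *is* the hash of the content, so a retried delivery lands on the same key and storing it twice is a no-op.

```python
# Content-addressed storage in a dict: the key is the hash of the content,
# so a re-delivered activity maps to the same key and dedup is free.
import hashlib
import json

store = {}

def put(activity: dict) -> str:
    blob = json.dumps(activity, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()
    store.setdefault(key, blob)   # second delivery of identical content changes nothing
    return key

a = put({"type": "Note", "content": "hello"})
b = put({"type": "Note", "content": "hello"})   # the retry
print(a == b, len(store))                       # -> True 1
```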