there is currently a bot inside MIT IP space, address 18[.]4[.]38[.]176, scanning fedi at large. i have confirmed this with 5+ unrelated instance admins, large and small instances, across mastodon/misskey/pleroma/akkoma.
the bot is poorly behaved. i have observed it making repeated requests, multiple times per second, for the exact same paths (the paths being, generally: user profiles, specific posts, and sometimes following links in posts). returning 403s does not stop this activity. one of my domains received hundreds of additional requests despite replying with 403 to all of them. i have also seen it make requests for paths containing html tags - seems like a badly written parser. the purpose of these requests and what data is being gathered is unclear.
PTR on the ip returns sts-drand03.mit.edu. a quick web search for "mit drand" brings back https://mitsloan.mit.edu/faculty/directory/david-g-rand and his personal website: https://davidrand-cooperation.com/ (note: other IPs in the /24 also have names in the PTR which match up with names of MIT faculty, but only the .176 IP appears to be involved in this activity). seems he's doing research into "misinformation" and "fake news" on social media. he also appears to be on fedi! so @Drand@techhub.social, given this activity is sourced from an IP with your name on it, could you share the purpose of this traffic? what data is being collected and how is it being used? do you plan to respect robots.txt or identify yourself in your useragent? is there a process for instance admins to opt out of this activity other than blocking the source IP?
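For reference, here is a minimal sketch of what "respect robots.txt and identify yourself in your useragent" could look like with only the Python standard library. The bot name, contact URL, and helper names are hypothetical and not taken from the actual scraper; the thread does not say how that bot is written beyond its user agent strings.

```python
from urllib import robotparser
import urllib.request

# Hypothetical identity: neither the bot name nor the contact URL comes from the thread.
USER_AGENT = "example-research-bot/0.1 (+https://example.edu/crawler-info)"


def build_parser(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: robotparser.RobotFileParser, url: str) -> bool:
    """Return True only if robots.txt allows our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def make_request(url: str) -> urllib.request.Request:
    """Build a request that names the crawler and gives admins a contact point."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

With this in place, an admin who wants to opt out can add a `User-agent: example-research-bot` / `Disallow: /` entry to robots.txt instead of blocking an IP, and the crawler checks `may_fetch` before every request.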
for those who have checked logs on their instances, could you share the dates when the activity started? on this instance, the first request i have is from 2023/11/29, steadily ramping up since then
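For admins who want to check, this is a rough sketch of pulling the first-seen date, per-day request counts, and user agents for one IP out of an access log. It assumes the common nginx/Apache "combined" log format; the regex and the `summarize` helper are illustrative, so adjust them to your own log format. If you sit behind a proxy or CDN, the logged address may be the proxy's rather than the bot's.

```python
import re
from collections import Counter
from datetime import datetime

# Assumes the "combined" access log format; adjust if your server logs differently.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(lines, ip):
    """Return (first_seen, requests_per_day, user_agents) for a single client IP."""
    first_seen = None
    per_day = Counter()
    agents = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match["ip"] != ip:
            continue  # skip unparseable lines and other clients
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        if first_seen is None or ts < first_seen:
            first_seen = ts
        per_day[ts.date().isoformat()] += 1
        agents[match["ua"]] += 1
    return first_seen, per_day, agents
```

Usage would be something like `summarize(open("/var/log/nginx/access.log"), "18.4.38.176")`, where the log path is an assumption about your setup.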
@natalie @Drand Found it in my logs, with UA string "unshortenit 0.4.0". It only asked for my profile and nothing else. I'm running behind Cloudflare, so maybe they started dropping him at some point.
aaronsw killed himself because he bulk-scraped some academic papers from MIT and MIT had him arrested and the federal prosecutor wanted 35 years for exceeding authorized access under the CFAA. So, now, MIT faculty member David Rand is exceeding authorized access in order to produce more academic papers for MIT. I wonder: if he produces a paper by scraping and then someone scrapes that paper, what happens?
> seems he's doing research into "misinformation" and "fake news" on social media.
He's a Professor of Marketing in the Brain and Cognitive Sciences Department at Sloan. It is safe to reason that (1) he is a psychopath, (2) he has no idea how any of the technical shit works and has a grad student doing it, the grad student in turn having hired an engineer undergrad.
> is there a process for instance admins to opt out of this activity
I guess, per his faculty page, you could just call him, or perhaps email him. Despite scraping FSE, he has chosen an instance that blocks FSE.
@Drand@natalie I can't reply to you because, although you are happy to scrape my instance, you have chosen an instance that blocks me.
I am skeptical that "researching content moderation policies" is an accurate way to describe what you are doing: I have not been asked about my content moderation policies. And how are you going to determine whether a given piece of content is in violation of an instance's policy if you don't even know whether the moderators have seen it? If you are interested in misinformation, you may wish to join or operate an instance that ensures that you can interact with the people you are exploiting for your career.
Can you explain whether or not you told the IRB that you were going to reimburse me for my bandwidth overages? Also, when I am done going through my logs to finish figuring out whose posts you have scraped and I notify them, where should I send them to request the removal of their posts from your dataset without compromising their anonymity?
@natalie Hi all, apologies for this, I didn't realize the bot was being poorly behaved - we've now stopped it. In terms of why we were doing the scraping, we are doing research on how content moderation policies vary across servers, and how this can help inform the Fediverse more broadly about effective approaches to content moderation. You can get more of a sense of the kind of research we do here: https://docs.google.com/document/d/1k2D4zVqkSHB1M9wpXtAe3UzbeE0RPpD_E2UpaPf6Lds/edit?usp=sharing
Sorry again about causing problems for folks! (And thanks to a couple of people for emailing me to let me know about this)
@Moon@natalie@Drand@coolboymew Same here. Incidentally, he has apparently screwed up some of his links, so his bot is fetching URLs like "GET /signin%3C/a%3E%3Cbr/%3E%3Cbr/%3EYang HTTP/1.1", "GET /about%3C/a%3E%3Cbr/%3ELooks HTTP/1.1", "/notice/ACV7evq9u0cd7bBo1Y%3C/a%3E".
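Paths like these decode to things such as `/signin</a><br/><br/>Yang`, which is what you would expect if links were cut out of post HTML with string matching instead of an HTML parser. That is a guess about the cause, not something confirmed in the thread. A small sketch, with hypothetical helper names, of how to flag such requests in logs and how link extraction with the standard-library parser avoids the problem:

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Matches an opening or closing HTML tag such as </a> or <br/>.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def has_leaked_markup(path: str) -> bool:
    """True if the percent-decoded request path contains an HTML tag."""
    return bool(TAG_RE.search(unquote(path)))


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and ignores all other markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text: str) -> list:
    """Return the link targets found in a fragment of HTML."""
    extractor = LinkExtractor()
    extractor.feed(html_text)
    return extractor.links
```

Fed the fragment `<a href="/signin">signin</a><br/><br/>Yang`, the parser yields only `/signin`, so the trailing tags and text never end up in a request path.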
Incidentally, he was scraping some Spinster user's profile from FSE, @Piss_Ant.
@p@natalie@Drand@boody I believe putting people on the spot, singling out or otherwise making somebody have to answer for something is against Mastodon ethos. it's baked in the CoC
@PurpCat@mischievoustomato@p@natalie@milk@Drand@boody i heard the full tilt homos were trying to make it 'their space' but i cant be assed to even bother going to look. not sure if gays fellating eachother on the timeline or people whining about xitter is worse tbh
@milk@p@natalie@graf@Drand@boody more importantly the quality users are leaving Twitter either because of musk derangement syndrome (twitter assfucking users now has a name and face), or because of bans/censorship, or fatigue with people of gender.
@PurpCat@p@natalie@graf@boody I'll post what I said in my mod chatroom. While fedi is comprised of lots of shitposting, i believe studies like these are a way to delegitimize anything posted as misinformation. these types like @Drand see themselves as authority of information if they are able to convince people that nothing on the fediverse is serious, it bolsters places that are sanctioned by the authorities of information like X...and the reason why X is sanctioned and endorsed by them and other kikes is because the ADL control the flow of information
X, the ADL, academia are worried that "freedom of speech doesn't mean freedom of reach" will eventually upset people fediverse poses a threat to these "people", always has
@natalie@Drand was just gonna say "who the fuck cares" but if my man is making me pay like an extra dollar for his spam im gonna have to bother doin somethin about that :happey:
@laurel@Drand@graf@natalie Or even relay endpoints. And if you make your relay endpoint follow people, sometimes the other server doesn't even report the follow.
It's where the catladies who listen to NPR and thom Hartmann and political theater about how trump is Hitler 2 go to bitch about trump and furries think they're being genocided.
@PurpCat@p@natalie@graf@milk@Drand@boody Twitter is an intel collection and influence platform and nothing more. There is no other logical reason why it is still operational; ad revenue doesn't cut it, and their stock value has always been dismal.
Elon was sent in to bring the right, who had been mass-censored and banned by the overzealous previous management, back into the platform so that they can be monitored and influenced more easily.
@PurpCat@p@natalie@graf@milk@Drand@boody yeah, that's incredibly annoying to see i started using lemmy back when the whole reddit debacle happened, and well, it was like 30+ posts about zomg lemmy so good zomg reddit so bad --- speaking of that, funnily, reddit is seemingly doing fine
@PurpCat@mischievoustomato@Drand@boody@graf@milk@natalie I am interested in the situation of a guy scraping fedi and I do not know how it turned into discussion of someone's polycule but I would like to opt out of that if possible so I don't have to mute the entire thread.
@laurel@p@natalie@graf@Drand Honestly, I gotta make a generic library for this so people do it right, and then me doing it looks just like "approved" people doing it
what's ironic is that people are only finding out about MIT scraping fedi because they're doing it wrong. if you're doing it right, you *need* to announce yourself like a cartoon villain in order for anybody to think it's happening.
everyone's on about "safety" and "security", but let me ask you this: Did the guy who called MGM and asked nicely for them to deploy ransomware cartoonishly announce himself as a social engineer?
@istvan@laurel@natalie@graf@Drand Well, this guy is the Sloan School of Business. Different crowd. (In fact, the rest of the crowd is a different crowd now that Microsoft bought up the CS department.)
> I would like to opt out of that if possible so I don't have to mute the entire thread
I have always wanted that feature and am annoyed whenever people refuse to emulate it by untagging people on request.
:elliot: What do you know about Sloan?
:oopsclumsyme: ...School of Business?
:elliot: That's the one, yes.
:oopsclumsyme: Scary math guys. All the statistics and econ classes are there, even undergrad.
your instance blocks everyone, you dont want a discussion you want to collect natsec money for being a script kiddie. fix your fucking shit spider you MIT skidmark fuck and then we'll tell you how moderation works, except you wont because you're a bad faith actor.
@atlas Thanks for your engagement! To clarify, the data we were collecting was on what news domains people on different servers share, how toxic the language they use is, etc., and then how that relates to the formal rules the server posts. But also we would LOVE to talk to admins about how they actually think about / do content moderation etc. (We have talked to some already, and are def interested in more!)
@Drand Speaking for myself, I appreciate the clarification and short explanation, as otherwise this can be perceived quite simply as an attack with no true proper reason.
One question, though. Yes, I acknowledge that information on my instance is public and I do not intend to intensely restrict access to it (as there is quite simply no good reason or way to do so without extensive compromises), but this does not explain why the bot needs access to either the public timelines or specific user accounts if it is just searching the vaguely-identified "content moderation policies." For convenience, I'm not going to question the relevance or possibility of moderation in a federated network, as it's quite simply irrelevant right now. There is a lot of ambiguity in that regard so rather than making assumptions, I'd rather ask: what's up with that? How is the information there relevant? (and also, why are single-user instances, like mine, that only have very few users, relevant to this as well, and is there a way to opt out?)
edit note (in response to the parent post's recent edit): I'm not entirely sure whether this answers the concerns expressed both on my end and on the O.P.'s end (natalie); perhaps I am missing an implication or misinterpreting certain things. Anywho, this is not my debate to start or partake in, so I'll digress. Blocking the IP remains a "good enough" opt-out mechanism, if a bit archaic, for my needs.
@sarvo @natalie @Drand this "misinformation" is just people telling lies on the internet for fun. studying random internet users' shitposts for natsec grant money is why MIT is a fucking embarrassment and a shell of its former self.
@Drand@techhub.social @natalie@nya.social quantitative statistical research is fucking garbage and it will not shed any light at all on how the fediverse works. if you are serious about it, go ahead and ask instance admins and users directly. Otherwise keep scraping data for even more garbage papers that say nothing but shit as your whole career.
2023/11/28: first contact via mastodonpy and afterwards browsing static content via Chrome 119 on Windows 10 (???)
2023/11/29: first traffic from the urllib bot
@natalie earliest one for me is 02/Dec/2023:18:45:36, and it seemed to have stopped on the 3rd after 3344 requests to my profile and the profile of the only other active user on my instance. the user agent was "Python-urllib/3.9". Actually, now it seems like they're trying to connect via IPv6, which I have explicitly disabled, so nginx is giving them a connection refused error. clearly that does not stop them, because the last request was about 2 hours ago