Jonathan Corbet (corbet@social.kernel.org)'s status on Thursday, 23-Jan-2025 08:20:50 JST Jonathan Corbet
A followup for folks who are curious about the whole AI botswarm problem...
Some of these bots are clearly running on a bunch of machines on the same net. I have been able to reduce the traffic significantly by treating everything as a class-C net and doing subnet-level throttling. That and simply blocking a couple of them.
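As an illustration only, and not the setup actually used on the site in question, subnet-level throttling of this kind could be sketched as a sliding-window limiter keyed on the /24 network rather than on the single address. The class name `SubnetThrottle`, its parameters, and the limits shown are all hypothetical.

```python
import ipaddress
from collections import defaultdict, deque


class SubnetThrottle:
    """Sliding-window rate limiter keyed on the client's subnet.

    Hypothetical sketch: the default prefix length of 24 corresponds to
    the old "class C" size for IPv4. IPv6 would need a different prefix.
    """

    def __init__(self, max_hits, window_seconds, prefix_len=24):
        self.max_hits = max_hits
        self.window = window_seconds
        self.prefix_len = prefix_len
        # Maps each subnet to the timestamps of its recently allowed hits.
        self.hits = defaultdict(deque)

    def subnet_of(self, ip):
        # strict=False masks off the host bits instead of raising an error.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` is allowed."""
        recent = self.hits[self.subnet_of(ip)]
        # Forget hits that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False
        recent.append(now)
        return True
```

With `SubnetThrottle(max_hits=2, window_seconds=60)`, a third request inside one minute from anywhere in 192.0.2.0/24 is refused, while a request from a different /24 is still served.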
But that leaves a lot of traffic with an interesting characteristic: there are millions of obvious bot hits (following a pattern through the site, for example) that all come from a different IP. An access log with 9M lines has over 1M IP addresses, and few of them appear more than about three times.
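A minimal sketch of how figures like these could be pulled out of an access log, assuming each line starts with the client IP followed by whitespace (as in the common log format). The function name `ip_hit_stats` and its `max_hits` parameter are made up for this example.

```python
from collections import Counter


def ip_hit_stats(lines, max_hits=3):
    """Summarize how requests are spread across client addresses.

    Assumes the first whitespace-separated field of each line is the
    client IP. Blank lines are skipped.
    """
    per_ip = Counter()
    for line in lines:
        fields = line.split(maxsplit=1)
        if fields:
            per_ip[fields[0]] += 1
    return {
        "lines": sum(per_ip.values()),
        "unique_ips": len(per_ip),
        # Addresses seen no more than `max_hits` times in the whole log.
        "ips_at_or_below_max": sum(1 for n in per_ip.values() if n <= max_hits),
    }
```

Passing an open log file works, since iterating over a file yields lines: `ip_hit_stats(open("access.log"))`, where the file name is a placeholder. When nearly every address falls at or below the threshold, a per-IP limit has almost nothing to act on.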
So these things are running on widely distributed botnets, likely on compromised computers, and they are doing their best to evade any sort of recognition or throttling. I don't think that any sort of throttling or database of known-bot IPs is going to help here...not quite sure what to do about it.
What a world we have made for ourselves...
Jonathan Corbet (corbet@social.kernel.org)'s status on Thursday, 23-Jan-2025 12:25:08 JST Jonathan Corbet
@smxi @monsieuricon Suggestions for these countermeasures - and how to apply them without hosing legitimate users - would be much appreciated. I'm glad they are obvious to you, please do share!
smxi (smxi@fosstodon.org)'s status on Thursday, 23-Jan-2025 12:25:09 JST smxi
@monsieuricon @corbet so you know the behavior and the pattern. Construct countermeasures. I'm honestly astounded to see guys close to the kernel unable to do this. Think like your opponent. Find his weak spots. Nothing has changed since Sun Tzu made his observations. All bots have weak spots.
K. Ryabitsev (monsieuricon@social.kernel.org)'s status on Thursday, 23-Jan-2025 12:25:10 JST K. Ryabitsev
@smxi @corbet we're kinda trying to tell you that a single IP will hit 2-3 times an hour or so. You can't do behavioural analysis over 3 hits. They request 2-3 specific URLs with generic browser client strings and then aren't seen again. But multiply this by tens of thousands of IPs all coming from different subnets and you have a problem.
smxi (smxi@fosstodon.org)'s status on Thursday, 23-Jan-2025 12:25:11 JST smxi
@corbet IP based blocks have been useless for decades. Block behaviors. Most bots cost money to run via bot net rental fees.
Jonathan Corbet (corbet@social.kernel.org)'s status on Thursday, 23-Jan-2025 23:39:10 JST Jonathan Corbet
@penguin42 They don't tell me what they are doing with the data... the distributed scraping is an easily observable fact, though. Perhaps they are firehosing the data back to the mothership for training?
penguin42 (penguin42@mastodon.org.uk)'s status on Thursday, 23-Jan-2025 23:39:11 JST penguin42
@corbet I'm trying to think of the AI training that would be using compromised hosts for scraping. I thought for training you had to do the training part on one or a small number of tightly coupled hosts, so then what is it?