Should you be wondering why @LWN is occasionally sluggish... since the new year, the DDoS onslaught from AI-scraper bots has picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.
This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.
@beasts @LWN We are indeed seeing that sort of pattern; each IP stays below the thresholds for our existing circuit breakers, but the aggregate load is overwhelming. Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick.
@johnefrancis @LWN Something like nepenthes (https://zadzmo.org/code/nepenthes/) has crossed my mind; it has its own risks, though. We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN. Tempting.
Thank you @corbet and all at @LWN for continuing the work of providing the excellent #LWN.
The "active defenses" against torrents of antisocial web scraping bots, has bad effects on users. They tend to be "if you don't allow JavaScript and cookies, you can't visit the site" even if the site itself works fine without.
I don't have a better defense to offer, but it's really closing off huge portions of the web that would otherwise be fine for secure browsers.
@monsieuricon @LWN @corbet are you implying that there are models busy being trained to call someone a fuckface over a misunderstanding of some obscure ARM coprocessor register, or to respond with Viro insults to the most unsuspecting victims?
"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick. "
if you're using iptables, ipset can block individual ips (hash:ip), and subnets (hash:net).
Just set it up last night for my much-smaller-traffic instances, feel free to DM
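A minimal sketch of what that can look like, assuming root and that ipset/iptables are installed; the set name and the subnets here are made-up examples, not anyone's real blocklist:

```python
#!/usr/bin/env python3
# Sketch only: load abusive subnets into an ipset and drop them with iptables.
# The set name "scraper-block" and the subnets are illustrative placeholders.
import subprocess

SET_NAME = "scraper-block"
SUBNETS = ["203.0.113.0/24", "198.51.100.0/24"]  # example (TEST-NET) ranges

def run(cmd):
    return subprocess.run(cmd, check=False)

# hash:net matches whole subnets; hash:ip would match single addresses.
# -exist makes create/add idempotent, so the script can be re-run safely.
run(["ipset", "create", SET_NAME, "hash:net", "-exist"])
for net in SUBNETS:
    run(["ipset", "add", SET_NAME, net, "-exist"])

# Drop anything whose source address matches the set; -C checks whether
# the rule is already present before inserting it.
rule = ["INPUT", "-m", "set", "--match-set", SET_NAME, "src", "-j", "DROP"]
if run(["iptables", "-C"] + rule).returncode != 0:
    run(["iptables", "-I"] + rule)
```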
@corbet @LWN would you be so kind as to write up whatever mitigations you come up with? I've been fighting this myself on our websites. You seeing semi-random user agents too?
@AndresFreundTec @LWN Yes, a lot of really silly traffic. About 1/3 of it results in redirects from bots hitting port 80; you don't see them coming back with TLS, they just keep pounding their heads against the same wall.
It is weird; somebody has clearly put some thought into creating a distributed source of traffic that avoids tripping the per-IP circuit breakers. But the rest of it is brainless.
@corbet @johnefrancis @LWN Struggling with likely the same bots over here. I deployed a similar tarpit* on a large-ish site a few days ago - taking care not to trap the good bots - but can't say it's been very successful. It might have taken some load off of the main site, but not nearly enough to make a difference.
One more thing I'm considering is prefixing all internal links with a '/botcheck/' path for potentially suspicious visitors, setting a cookie on that page, and stripping the prefix with JS. If the cookie is set when the /botcheck/ endpoint is hit, redirect to the proper page; otherwise tarpit them. This way the site would still work as long as the user has *either* JS or cookies enabled (roughly like the sketch below). Still not perfect, but slightly less invasive than most common active defenses.
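A rough sketch of that scheme, in Flask purely for illustration; the prefix, cookie name, and tarpit response are placeholders, not anyone's production setup:

```python
# Sketch of the "/botcheck/" idea: browsers with JS strip the prefix,
# browsers with cookies get redirected, clients with neither get stalled.
import time
from flask import Flask, request, redirect, make_response

app = Flask(__name__)
COOKIE = "seen-before"  # placeholder cookie name

# Client-side JS that strips the /botcheck/ prefix, so JS-enabled browsers
# never hit the check endpoint at all.
STRIP_JS = """<script>
for (const a of document.querySelectorAll('a[href^="/botcheck/"]'))
    a.href = a.href.replace('/botcheck', '');
</script>"""

@app.route("/botcheck/<path:target>")
def botcheck(target):
    if request.cookies.get(COOKIE):
        # Cookie survived from the previous page: a real client without JS.
        return redirect("/" + target)
    # Neither JS nor cookies: very likely a bot, so stall it.
    time.sleep(10)          # stand-in for a real tarpit
    return "nothing to see here", 429

@app.route("/<path:page>")
def page(page):
    # For suspicious visitors, internal links get the /botcheck/ prefix and
    # the page sets the cookie that lets cookie-enabled clients through.
    body = '<a href="/botcheck/other-article">next article</a>'
    resp = make_response(body + STRIP_JS)
    resp.set_cookie(COOKIE, "1")
    return resp
```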
@daniel @LWN The problem with restricting reading to logged-in people is that it will surely interfere with our long-term goal to have the entire world reading LWN. We really don't want to put roadblocks in front of the people we are trying to reach.
@mcdanlj @LWN What a lot of people are suggesting (nepenthes and such) will work great against a single abusive robot. None of it will help much when tens of thousands of hosts are grabbing a few URLs each. Most of them will never step into the honeypot, and the ones that do will not be seen again regardless.
@corbet @LWN I'm wondering if a link that a human wouldn't click on, but an AI wouldn't know any better than to follow, could be used in the nginx configuration to serve AI robots differently from humans, while excluding legitimate search crawlers from that treatment (something like the sketch below). What such a link looks like would differ from site to site. That would require thought from every site, but it would also create diversity, which would make it harder to guard against on the scraper side, so it could be more effective.
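A minimal sketch of that honeypot-link idea, in application code rather than nginx purely for illustration; the trap path and the flagging logic are invented, not a recipe for any particular site:

```python
# Sketch: a hidden link humans never click; anything that follows it gets
# flagged and served differently afterwards. Keep /trap/ disallowed in
# robots.txt so well-behaved crawlers never land here.
from flask import Flask, request, abort

app = Flask(__name__)
flagged = set()   # placeholder; real setups would share this state somewhere

@app.before_request
def drop_flagged_clients():
    # Anything that previously followed the trap link gets refused (or could
    # be handed decoy content instead).
    if request.remote_addr in flagged and not request.path.startswith("/trap/"):
        abort(403)

@app.route("/trap/do-not-follow")
def trap():
    # Only a scraper that ignores both the hidden styling and robots.txt
    # should ever reach this handler.
    flagged.add(request.remote_addr)
    return "hello, robot"

@app.route("/")
def index():
    # The trap link is present in the HTML for bots but invisible to people.
    return '<a href="/trap/do-not-follow" style="display:none">more</a>real content'
```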
I might be an outlier here in my feelings on whether training genai such as LLMs from publicly-posted information is OK. It felt weird decades ago when I was asked for permission to put content I posted to Usenet onto a CD (why would I care whether the bits were carried to the final reader on a phone line someone paid for or a CD someone paid for?), so it's not inconsistent, in my view, that I personally feel it's OK to use what I post publicly to train genai. (I respect that others feel differently here.)
That said, I'm beyond livid at being the target of a DDoS, and other AI engines might end up being collateral damage as I try to protect my site for use by real people.