Should you be wondering why @LWN is occasionally sluggish... since the new year, the DDoS onslaught from AI-scraper bots has picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.
This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.
@beasts @LWN We are indeed seeing that sort of pattern; each IP stays below the thresholds for our existing circuit breakers, but the aggregate load is overwhelming. Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick.
@johnefrancis @LWN Something like nepenthes (https://zadzmo.org/code/nepenthes/) has crossed my mind; it has its own risks, though. We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN. Tempting.
Thank you @corbet and all at @LWN for continuing the work of providing the excellent #LWN.
The "active defenses" against torrents of antisocial web scraping bots, has bad effects on users. They tend to be "if you don't allow JavaScript and cookies, you can't visit the site" even if the site itself works fine without.
I don't have a better defense to offer, but it's really closing off huge portions of the web that would otherwise be fine for secure browsers.
@monsieuricon @LWN @corbet are you implying that there are models busy being trained to call someone a fuckface over a misunderstanding of some obscure ARM coprocessor register, or to respond with Viro insults to the most unsuspecting victims?
"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick. "
if you're using iptables, ipset can block individual ips (hash:ip), and subnets (hash:net).
Just set it up last night for my much-smaller-traffic instances, feel free to DM
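A minimal sketch of what that can look like, assuming root and that ipset/iptables are installed; the set name and the subnets here are made-up examples, not anyone's real blocklist:

```python
#!/usr/bin/env python3
# Sketch only: load abusive subnets into an ipset and drop them with iptables.
# The set name "scraper-block" and the subnets are illustrative placeholders.
import subprocess

SET_NAME = "scraper-block"
SUBNETS = ["203.0.113.0/24", "198.51.100.0/24"]  # example (TEST-NET) ranges

def run(cmd):
    return subprocess.run(cmd, check=False)

# hash:net matches whole subnets; hash:ip would match single addresses.
# -exist makes create/add idempotent, so the script can be re-run safely.
run(["ipset", "create", SET_NAME, "hash:net", "-exist"])
for net in SUBNETS:
    run(["ipset", "add", SET_NAME, net, "-exist"])

# Drop anything whose source address matches the set; -C checks whether
# the rule is already present before inserting it.
rule = ["INPUT", "-m", "set", "--match-set", SET_NAME, "src", "-j", "DROP"]
if run(["iptables", "-C"] + rule).returncode != 0:
    run(["iptables", "-I"] + rule)
```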
@corbet @LWN would you be so kind as to write up whatever mitigations you come up with? I've been fighting this myself on our websites. You seeing semi-random user agents too?
@AndresFreundTec @LWN Yes, a lot of really silly traffic. About 1/3 of it results in redirects from bots hitting port 80; you don't see them coming back with TLS, they just keep pounding their heads against the same wall.
It is weird; somebody has clearly put some thought into creating a distributed source of traffic that avoids tripping the per-IP circuit breakers. But the rest of it is brainless.
@corbet @johnefrancis @LWN Struggling with likely the same bots over here. I deployed a similar tarpit* on a large-ish site a few days ago - taking care not to trap the good bots - but can't say it's been very successful. It might have taken some load off of the main site, but not nearly enough to make a difference.
One more thing I'm considering is prefixing all internal links with a '/botcheck/' path for potentially suspicious visitors, setting a cookie on that page, and stripping the prefix with JS. If the cookie is set when the /botcheck/ endpoint is hit, redirect to the proper page; otherwise tarpit them. This way the site would still work as long as the user has *either* JS or cookies enabled (roughly like the sketch below). Still not perfect, but slightly less invasive than most common active defenses.
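A rough sketch of that scheme, in Flask purely for illustration; the prefix, cookie name, and tarpit response are placeholders, not anyone's production setup:

```python
# Sketch of the "/botcheck/" idea: browsers with JS strip the prefix,
# browsers with cookies get redirected, clients with neither get stalled.
import time
from flask import Flask, request, redirect, make_response

app = Flask(__name__)
COOKIE = "seen-before"  # placeholder cookie name

# Client-side JS that strips the /botcheck/ prefix, so JS-enabled browsers
# never hit the check endpoint at all.
STRIP_JS = """<script>
for (const a of document.querySelectorAll('a[href^="/botcheck/"]'))
    a.href = a.href.replace('/botcheck', '');
</script>"""

@app.route("/botcheck/<path:target>")
def botcheck(target):
    if request.cookies.get(COOKIE):
        # Cookie survived from the previous page: a real client without JS.
        return redirect("/" + target)
    # Neither JS nor cookies: very likely a bot, so stall it.
    time.sleep(10)          # stand-in for a real tarpit
    return "nothing to see here", 429

@app.route("/<path:page>")
def page(page):
    # For suspicious visitors, internal links get the /botcheck/ prefix and
    # the page sets the cookie that lets cookie-enabled clients through.
    body = '<a href="/botcheck/other-article">next article</a>'
    resp = make_response(body + STRIP_JS)
    resp.set_cookie(COOKIE, "1")
    return resp
```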
@daniel @LWN The problem with restricting reading to logged-in people is that it will surely interfere with our long-term goal to have the entire world reading LWN. We really don't want to put roadblocks in front of the people we are trying to reach.
@mcdanlj @LWN What a lot of people are suggesting (nepenthes and such) will work great against a single abusive robot. None of it will help much when tens of thousands of hosts are grabbing a few URLs each. Most of them will never step into the honeypot, and the ones that do will not be seen again regardless.
@corbet @LWN I'm wondering if a link that a human wouldn't click on, but an AI wouldn't know any better than to follow, could be used in the nginx configuration to serve AI robots differently from humans, while excluding legitimate search crawlers from that treatment (something like the sketch below). What such a link looks like would differ from site to site. That would require thought from every site, but it would also create diversity, which would make it harder to guard against on the scraper side, so it could be more effective.
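A minimal sketch of that honeypot-link idea, in application code rather than nginx purely for illustration; the trap path and the flagging logic are invented, not a recipe for any particular site:

```python
# Sketch: a hidden link humans never click; anything that follows it gets
# flagged and served differently afterwards. Keep /trap/ disallowed in
# robots.txt so well-behaved crawlers never land here.
from flask import Flask, request, abort

app = Flask(__name__)
flagged = set()   # placeholder; real setups would share this state somewhere

@app.before_request
def drop_flagged_clients():
    # Anything that previously followed the trap link gets refused (or could
    # be handed decoy content instead).
    if request.remote_addr in flagged and not request.path.startswith("/trap/"):
        abort(403)

@app.route("/trap/do-not-follow")
def trap():
    # Only a scraper that ignores both the hidden styling and robots.txt
    # should ever reach this handler.
    flagged.add(request.remote_addr)
    return "hello, robot"

@app.route("/")
def index():
    # The trap link is present in the HTML for bots but invisible to people.
    return '<a href="/trap/do-not-follow" style="display:none">more</a>real content'
```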
I might be an outlier here in my feelings on whether training genai such as LLMs from publicly-posted information is OK. It felt weird decades ago when I was asked for permission to put content I posted to Usenet onto a CD (why would I care whether the bits were carried to the final reader on a phone line someone paid for or a CD someone paid for?), so it's not inconsistent, in my view, that I personally feel it's OK to use what I post publicly to train genai. (I respect that others feel differently here.)
That said, I'm beyond livid at being the target of a DDoS, and other AI engines might end up being collateral damage as I try to protect my site for use by real people.