@mcdanlj@LWN What a lot of people are suggesting (nepenthes and such) will work great against a single abusive robot. None of it will help much when tens of thousands of sites are grabbing a few URLs each. Most of them will never step into the honeypot, and the ones that do will not be seen again regardless.
@penguin42 They don't tell me what they are doing with the data... the distributed scraping is an easily observable fact, though. Perhaps they are firehosing the data back to the mothership for training?
@smxi@monsieuricon Suggestions for these countermeasures - and how to apply them without hosing legitimate users - would be much appreciated. I'm glad they are obvious to you; please do share!
To be clear, LWN has never "crashed" as a result of this onslaught. We'll not talk about what happened after I pushed up some code trying to address it...
Most seriously, though: I'm surprised that this situation is surprising to anybody at this point. This is a net-wide problem, it surely is not limited to free-software-oriented sites. But if the problem is starting to get wider attention, that is fine with me...
Some of these bots are clearly running on a bunch of machines on the same net. I have been able to reduce the traffic significantly by aggregating addresses into class-C-sized (/24) subnets and doing the throttling at that level. That, and simply blocking a couple of them outright.
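The mechanism is nothing fancy; a toy sketch of the general idea (not the code actually running on LWN - the limits and names here are made-up, illustrative values) looks something like this:

```python
# Toy sketch of subnet-level throttling: collapse each client address to
# its /24 and rate-limit on that key instead of on the individual IP.
# WINDOW and MAX_HITS are arbitrary example numbers, not LWN's settings.
import ipaddress
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds
MAX_HITS = 300     # requests allowed per /24 per window

hits = defaultdict(deque)   # /24 network -> timestamps of recent requests

def subnet_key(addr: str) -> str:
    """Collapse an IPv4 address to its containing /24."""
    return str(ipaddress.ip_network(f"{addr}/24", strict=False))

def allow_request(addr: str, now: float | None = None) -> bool:
    """Return False once this address's /24 has exceeded the window limit."""
    now = now or time.time()
    window = hits[subnet_key(addr)]
    while window and now - window[0] > WINDOW:
        window.popleft()
    window.append(now)
    return len(window) <= MAX_HITS
```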
But that leaves a lot of traffic with an interesting characteristic: there are millions of obvious bot hits (following a pattern through the site, for example) that each come from a different IP. An access log with 9M lines has over 1M IP addresses, and few of them appear more than about three times.
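Those numbers fall out of a quick pass over the log; a rough sketch of the sort of counting involved (assuming the client address is the first whitespace-separated field, as in the usual common/combined log formats):

```python
# Rough sketch: count how many distinct client IPs appear in an access
# log and how often each one shows up.  Assumes the address is the first
# field on each line; adjust for your own log format.
from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        fields = line.split(None, 1)
        if not fields:
            continue
        counts[fields[0]] += 1

print(f"{sum(counts.values())} requests from {len(counts)} addresses")
repeaters = sum(1 for n in counts.values() if n > 3)
print(f"{repeaters} addresses appeared more than three times")
```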
So these things are running on widely distributed botnets, likely on compromised computers, and they are doing their best to evade any sort of recognition or throttling. I don't think that any sort of throttling or database of known-bot IPs is going to help here... not quite sure what to do about it.
@daniel@LWN The problem with restricting reading to logged-in people is that it will surely interfere with our long-term goal to have the entire world reading LWN. We really don't want to put roadblocks in front of the people we are trying to reach.
@AndresFreundTec@LWN Yes, a lot of really silly traffic. About 1/3 of it results in redirects from bots hitting port 80; you don't see them coming back with TLS, they just keep pounding their heads against the same wall.
It is weird; somebody has clearly put some thought into creating a distributed source of traffic that avoids tripping the per-IP circuit breakers. But the rest of it is brainless.
It would appear to force readers to enable JavaScript, which we don't want to do. Plus it requires running all of our readers through Cloudflare, of course... and I suspect that the "free tier" is designed to exclude sites like ours. So probably not a solution for us, but it could well work for others.
@johnefrancis@LWN Something like nepenthes (https://zadzmo.org/code/nepenthes/) has crossed my mind; it has its own risks, though. We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN. Tempting.
@beasts@LWN We are indeed seeing that sort of pattern; each IP stays below the thresholds for our existing circuit breakers, but the aggregate load is overwhelming. Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick.
Should you be wondering why @LWN is occasionally sluggish... since the new year, the DDoS onslaughts from AI-scraper bots have picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.
This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.
I bought this card in Korea some years ago after having seen this theme - a tiger and a rabbit seemingly getting stoned together - in a number of places. There must be a story behind it, but my meager search skills have never managed to turn it up. I do still love the image, though...
They informed me that a replacement system would be $700, seemingly including installation. It'll be a little while before I can generate enthusiasm for spending that money, certainly...
Some new form of SunPower resurrecting the current hardware would be nice. I'd say that the chances of them making it work again without demanding more money are pretty small, though. Such is the world we live in - we only *think* we own that device...
Rather than putting an rPi system in the box, though, I just ran the Ethernet cable to a system I had with both wireless and wired interfaces; the WiFi sits on the home net, while the wired interface does DHCP to get an address from the SunPower box, then polls it to get the data out.
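The polling itself is simple; here is a rough sketch of the approach, using the supervisor address and dl_cgi endpoint that the community projects document for the SunPower PVS (the address, command, and field names below are their values, not anything I can vouch for - verify against your own unit):

```python
# Minimal sketch of polling a SunPower PVS supervisor over its installer
# LAN port.  Community projects report that the PVS hands out an address
# via DHCP on that port and answers on 172.27.153.1; the DeviceList call
# can take quite a while, hence the long timeout.
import requests

PVS_URL = "http://172.27.153.1/cgi-bin/dl_cgi?Command=DeviceList"

def poll_pvs():
    resp = requests.get(PVS_URL, timeout=120)
    resp.raise_for_status()
    for dev in resp.json().get("devices", []):
        # Inverter entries carry a serial number and current output power;
        # these field names are the ones the community integrations use.
        if dev.get("DEVICE_TYPE") == "Inverter":
            print(dev.get("SERIAL"), dev.get("p_3phsum_kw"), "kW")

if __name__ == "__main__":
    poll_pvs()
```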
Once that was set up, getting it into Home Assistant was mostly a matter of installing the integration. Figuring out which power signals belonged to which panel took a while; if you don't have that mapping yet, use the SunPower app to make a map of each panel's serial number and its location.
I'm debating whether to stick with this system, or to take up Enphase on its offer and swap out the SunPower box entirely. The Enphase monitor would be a supported product, and it seemingly has much better Home Assistant support.