Conversation
-
AI farms can just up and die, please. Add overscraping to the long list of ways public resources get ruined by commercial greed -- joining overfishing, overlogging, and overgrazing.
-
@monsieuricon Comparing wasted processor cycles to a public resource like a forest is questionable: a processor doesn't wear any more from running 3.5 billion cycles a second than from 1 billion, while a forest runs out of trees very fast and likely forever.
It seems the only effective way to curtail it would be to get your lawyers to send the scrapers Cease & Desist letters, noting that you're not clueless, that you know they're clearly planning on infringing the GPLv2, and that you'll be enforcing your copyright if it happens.
-
And no, it's not easy to "just throttle them" when it's thousands of IPs all coming from public cloud subnets with user-agents matching common modern browsers.
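To put numbers on it, here's a toy per-IP token bucket (the limits and addresses below are made up for illustration, not anything from a real config): it stops one hammering IP almost immediately, but waves the exact same request volume through once it's spread over a few thousand cloud addresses.

```python
# Sketch (hypothetical numbers): why a per-IP token bucket barely slows a
# scraper that spreads its requests across thousands of cloud IPs.
import time
from collections import defaultdict

RATE = 1.0    # allowed requests per second per IP (arbitrary)
BURST = 10    # bucket size per IP (arbitrary)

buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    """Classic token bucket keyed on the client IP."""
    b = buckets[ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False

# 50,000 requests from a single IP: almost all get limited.
single = sum(allow("198.51.100.7") for _ in range(50_000))

# The same 50,000 requests spread over 5,000 IPs: every one sails through,
# because each address stays comfortably inside its individual budget.
spread = sum(allow(f"10.0.{(i % 5000) // 256}.{(i % 5000) % 256}")
             for i in range(50_000))

print(f"allowed from one IP:    {single} / 50000")
print(f"allowed from 5000 IPs:  {spread} / 50000")
```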
-
@monsieuricon Throttling is not what you want - you want to temporarily null route such scrapers until they go away.
Too bad git (and apt too) is very bursty, meaning you can't set a rate limit low enough to stop scrapers that auto-adjust to the limit while also never blocking people who merely run `git clone` (I was wondering why git clone wasn't working until I realized that even a high rate limit wasn't high enough).
The only thing that seems to give those scrapers away is the hours they spend connected and scraping - so maybe a combined rate + connection-time limit could work - too bad you can pull off a lot of scraping in a few hours.
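To make the rate + connection-time idea concrete, here's a rough sketch that scans an nginx access log (assuming the default "combined" format; the thresholds are arbitrary) and flags clients that have been requesting heavily for hours rather than in one clone-sized burst:

```python
# Sketch of a "rate plus connection time" check: instead of a pure request-rate
# limit (which bursty `git clone` traffic trips), flag clients that have been
# requesting steadily for hours.  Log format (nginx "combined") and thresholds
# are assumptions for illustration.
import re
import sys
from collections import defaultdict
from datetime import datetime, timedelta

LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')
MIN_REQUESTS = 5_000                 # enough volume to matter
MIN_ACTIVE = timedelta(hours=3)      # sustained for hours, unlike a clone

first_seen, last_seen, count = {}, {}, defaultdict(int)

for line in sys.stdin:               # e.g. `python3 flag.py < access.log`
    m = LOG_RE.match(line)
    if not m:
        continue
    ip = m.group("ip")
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    first_seen.setdefault(ip, ts)
    last_seen[ip] = ts
    count[ip] += 1

for ip, n in sorted(count.items(), key=lambda kv: -kv[1]):
    active = last_seen[ip] - first_seen[ip]
    if n >= MIN_REQUESTS and active >= MIN_ACTIVE:
        # Candidate for a temporary null route rather than a throttle.
        print(f"{ip}\t{n} requests over {active}")
```

Anything that trips both thresholds would then be a candidate for the temporary null route rather than a rate limit.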
-
@wowaname It's much worse.
LLM scrapers scrape every single page and every single file, changing user agents and IPs and auto-adjusting their scrape rate to stay under rate limits, even setting a user agent that is the empty string "" (which shows up as "-" in nginx logs) or literally "-", which makes it easy to inadvertently 403 the wrong user agent, or even accidentally 403 all of them.
Meanwhile, classic crawlers at least identify themselves with a crawler user agent, which you can 403, or at worst they crawl at a rate that uses a negligible amount of bandwidth on any half-decent connection.
I believe that you're probably being hit by LLM scrapers that are pretending to be spiders.
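To illustrate the "-" confusion (the helper names and sample strings below are invented): nginx writes "-" into the access log when the User-Agent header is missing, but any rule you write is evaluated against the raw header, so a block rule copied from what the log shows catches only the scrapers that literally send "-" and lets the empty-header ones through - or, written sloppily, it 403s far more than intended.

```python
# Sketch of the user-agent pitfall: the log shows "-" for a *missing*
# User-Agent header, but rules see the raw header value, so "block the
# UA I see in the log" and "block the UA the client actually sent" are
# not the same rule.  Purely illustrative.

def logged_ua(raw_ua: str | None) -> str:
    """What ends up in the access log ('-' stands in for a missing header)."""
    return raw_ua if raw_ua else "-"

def block_from_log_rule(raw_ua: str | None) -> bool:
    """Rule copied from the log: refuse requests whose UA 'is -'."""
    return raw_ua == "-"              # misses the missing-header case entirely

def block_both(raw_ua: str | None) -> bool:
    """Refuse both the missing-header and the literal '-' variants."""
    return raw_ua is None or raw_ua.strip() in ("", "-")

samples = (None, "", "-",
           "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0")

for raw in samples:
    print(f"log shows {logged_ua(raw)!r:6}  "
          f"log-rule blocks: {block_from_log_rule(raw)!s:5}  "
          f"block-both: {block_both(raw)}")
```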
-
@monsieuricon is it genuinely worse than all the misbehaved spiders we've always had? i'm asking because i haven't been able to keep any websites up this year to monitor my logs