Conversation
-
AI farms can just up and die, please. Add overscraping to the long list of ways public resources get ruined by commercial greed -- joining overfishing, overlogging, and overgrazing.
-
@monsieuricon Comparing wasted processor cycles to a public resource like a forest is questionable: a processor doesn't wear any more from running 3.5 billion cycles a second than from 1 billion, while a forest runs out of trees very fast and likely forever.
It seems the only effective way to curtail it would be to get your lawyers to send the scrapers Cease & Desist letters, noting that you're not clueless, that you know they're clearly planning on infringing the GPLv2, and that you'll be enforcing your copyright if it happens.
-
And no, it's not easy to "just throttle them" when it's thousands of IPs all coming from public cloud subnets with user-agents matching common modern browsers.
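To put numbers on it, here's a toy per-IP token bucket (the limits and addresses below are made up for illustration, not anything from a real config): it stops one hammering IP almost immediately, but waves the exact same request volume through once it's spread over a few thousand cloud addresses.

```python
# Sketch (hypothetical numbers): why a per-IP token bucket barely slows a
# scraper that spreads its requests across thousands of cloud IPs.
import time
from collections import defaultdict

RATE = 1.0    # allowed requests per second per IP (arbitrary)
BURST = 10    # bucket size per IP (arbitrary)

buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    """Classic token bucket keyed on the client IP."""
    b = buckets[ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False

# 50,000 requests from a single IP: almost all get limited.
single = sum(allow("198.51.100.7") for _ in range(50_000))

# The same 50,000 requests spread over 5,000 IPs: every one sails through,
# because each address stays comfortably inside its individual budget.
spread = sum(allow(f"10.0.{(i % 5000) // 256}.{(i % 5000) % 256}")
             for i in range(50_000))

print(f"allowed from one IP:    {single} / 50000")
print(f"allowed from 5000 IPs:  {spread} / 50000")
```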
-
@monsieuricon Throttling is not what you want - you want to temporarily null route such scrapers until they go away.
Too bad git (and apt too) is very bursty, meaning you can't set a rate limit low enough to stop scrapers that auto-adjust to the limit while also never blocking people who merely run `git clone` (I was wondering why git clone wasn't working until I realized that even a high rate limit wasn't high enough).
The only thing that seems to give those scrapers away is the hours they spend connected and scraping - so maybe a combined rate + connection-time limit could work - too bad you can pull off a lot of scraping in a few hours.
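To make the rate + connection-time idea concrete, here's a rough sketch that scans an nginx access log (assuming the default "combined" format; the thresholds are arbitrary) and flags clients that have been requesting heavily for hours rather than in one clone-sized burst:

```python
# Sketch of a "rate plus connection time" check: instead of a pure request-rate
# limit (which bursty `git clone` traffic trips), flag clients that have been
# requesting steadily for hours.  Log format (nginx "combined") and thresholds
# are assumptions for illustration.
import re
import sys
from collections import defaultdict
from datetime import datetime, timedelta

LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')
MIN_REQUESTS = 5_000                 # enough volume to matter
MIN_ACTIVE = timedelta(hours=3)      # sustained for hours, unlike a clone

first_seen, last_seen, count = {}, {}, defaultdict(int)

for line in sys.stdin:               # e.g. `python3 flag.py < access.log`
    m = LOG_RE.match(line)
    if not m:
        continue
    ip = m.group("ip")
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    first_seen.setdefault(ip, ts)
    last_seen[ip] = ts
    count[ip] += 1

for ip, n in sorted(count.items(), key=lambda kv: -kv[1]):
    active = last_seen[ip] - first_seen[ip]
    if n >= MIN_REQUESTS and active >= MIN_ACTIVE:
        # Candidate for a temporary null route rather than a throttle.
        print(f"{ip}\t{n} requests over {active}")
```

Anything that trips both thresholds would then be a candidate for the temporary null route rather than a rate limit.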
-
@wowaname It's much worse.
LLM scrapers scrape every single page and every single file, changing user agents and IPs and auto-adjusting their scrape rate to stay under rate limits, even setting a user agent that is the empty string "" (which shows up as "-" in nginx logs) or literally "-", which makes it easy to inadvertently 403 the wrong user agent, or even accidentally 403 all of them.
Meanwhile, classic crawlers at least identify themselves with a crawler user agent, which you can 403, or at worst they crawl at a rate that uses a negligible amount of bandwidth on any half-decent connection.
I believe that you're probably being hit by LLM scrapers that are pretending to be spiders.
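To illustrate the "-" confusion (the helper names and sample strings below are invented): nginx writes "-" into the access log when the User-Agent header is missing, but any rule you write is evaluated against the raw header, so a block rule copied from what the log shows catches only the scrapers that literally send "-" and lets the empty-header ones through - or, written sloppily, it 403s far more than intended.

```python
# Sketch of the user-agent pitfall: the log shows "-" for a *missing*
# User-Agent header, but rules see the raw header value, so "block the
# UA I see in the log" and "block the UA the client actually sent" are
# not the same rule.  Purely illustrative.

def logged_ua(raw_ua: str | None) -> str:
    """What ends up in the access log ('-' stands in for a missing header)."""
    return raw_ua if raw_ua else "-"

def block_from_log_rule(raw_ua: str | None) -> bool:
    """Rule copied from the log: refuse requests whose UA 'is -'."""
    return raw_ua == "-"              # misses the missing-header case entirely

def block_both(raw_ua: str | None) -> bool:
    """Refuse both the missing-header and the literal '-' variants."""
    return raw_ua is None or raw_ua.strip() in ("", "-")

samples = (None, "", "-",
           "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0")

for raw in samples:
    print(f"log shows {logged_ua(raw)!r:6}  "
          f"log-rule blocks: {block_from_log_rule(raw)!s:5}  "
          f"block-both: {block_both(raw)}")
```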
-
@monsieuricon is it genuinely worse than all the misbehaved spiders we've always had? i'm asking because i haven't been able to keep any websites up this year to monitor my logs