Conversation

Russ Garrett (russss@chaos.social)'s status on Monday, 07-Apr-2025 21:51:05 JST
in reply to
- Harry Wood
@harry_wood oh it's probably LLMs or something LLM-adjacent. But I doubt it's the big AI players who are responsible for the excessive scraping.

- Harry Wood (harry_wood@en.osm.town)'s status on Monday, 07-Apr-2025 21:51:06 JST
  in reply to
  
  @russss Are you saying the data isn't even necessarily being used for training LLMs? The problem is just correlated with the rise of LLMs because LLMs are making it a lot easier to write scrapers (and I guess ChatGPT will also happily advise on how to bypass mitigations).
  
- Russ Garrett (russss@chaos.social)'s status on Monday, 07-Apr-2025 21:51:07 JST
  in reply to
  - Harry Wood
  @harry_wood I don't think it's the big AI companies that are scraping excessively, it's random people who probably got an LLM to write their scraper bots... To make it more confusing, in some cases they're cloning the user-agents of the major AI bots.
  
- Harry Wood (harry_wood@en.osm.town)'s status on Monday, 07-Apr-2025 21:51:08 JST
  
  Vicious criticism of LLMs from this sysadmin who has to deal with their scrapers: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
  The LLM scraper problem seems surprising to me. The makers of big new generative AI systems are mostly big-tech firms. Don't they value their reputation, or even the reputation of the AI concept overall, too much to commission these cowboys to do their scraping? But maybe they've already decided that, due to the copyright risk, it's best done at arm's length via shady intermediaries.
  Attachments
  1. Please stop externalizing your costs directly into my face
- Harry Wood (harry_wood@en.osm.town)'s status on Friday, 25-Apr-2025 17:32:10 JST
  in reply to
  
  More on the LLM scraper problem: https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html by @gluejar. "Thousands of developer hours are being spent on defense against the dark bots and those hours are lost to us forever. We'll never see the wonderful projects and features they would have come up with in that time"
  Attachments
  1. AI bots are destroying Open Access
    
    There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, ...
