@russss Are you saying the data isn't even necessarily being used for training LLMs? The problem is just correlated with the rise of LLMs because LLMs are making it a lot easier to write scrapers (and I guess ChatGPT will also happily advise on how to bypass mitigations).
@harry_wood I don't think it's the big AI companies that are scraping excessively, it's random people who probably got an LLM to write their scraper bots... To make it more confusing, in some cases they're cloning the user-agents of the major AI bots.
Vicious criticism of LLMs from this sysadmin who has to deal with their scrapers: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html The LLM scraper problem seems surprising to me. The makers of big new generative AI systems are mostly big-tech firms. Don't they value their reputation, or even the reputation of the AI concept overall, too much to commission these cowboys to do their scraping? But maybe they've already decided that, due to the copyright risk, it's best done at arm's length via shady intermediaries.