JP (jplebreton@mastodon.social)'s status on Wednesday, 02-Jul-2025 09:24:57 JST

Does anyone know the concrete technical reason(s) that LLM website scrapers have been so much nastier to deal with than the ones used by major search engines? Do these people just not know how to write a scraper that won't DDoS (or effectively DDoS) a server? Are they trying to get the data faster or more thoroughly than other scrapers? Do they just not care? Obviously they don't care, but I can't tell whether that's the main reason they're so horrible or whether there's some more technical point.
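[Editor's note: for context on what "a scraper that won't DDoS a server" means in practice, here is a minimal sketch of a polite crawler in Python, using only the standard library. The user agent, robots.txt URL, and delay values are illustrative, not from the thread.]

```python
# Minimal sketch of a "polite" crawler: check robots.txt, honor
# Crawl-delay, and rate-limit requests to the host. Illustrative only.
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"  # hypothetical
DEFAULT_DELAY = 10.0  # seconds between requests to the same host

def process(body: bytes) -> None:
    pass  # placeholder: parse/store the fetched page

def polite_crawl(robots_url: str, urls: list[str]) -> None:
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # one fetch of robots.txt up front
    # Honor an explicit Crawl-delay directive if the site sets one.
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    for url in urls:
        if not rp.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt; a polite crawler skips it
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            process(resp.read())
        time.sleep(delay)  # never hammer the same host back-to-back
```

[Search-engine crawlers do roughly this at scale, with per-host request queues and backoff on 429/503 responses; the complaint in this thread is about scrapers that skip all of it.]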
Rich Felker (dalias@hachyderm.io)'s status on Wednesday, 02-Jul-2025 09:24:57 JST

@jplebreton They used LLM codegen to write their scrapers, making them particularly shit. 🙃
silverwizard (silverwizard@convenient.email)'s status on Wednesday, 02-Jul-2025 13:50:13 JST

@jplebreton Part of it is that people are hitting everything randomly. My node gets 10+ hits an hour to its search endpoint for random subjects, which is just berserk behaviour for a crawler in general.
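[Editor's note: a search endpoint runs a query on every hit, so it's far more expensive to serve than a static or cached page, and it's exactly the kind of path sites typically disallow in robots.txt. Below is a minimal sketch of one server-side defense, a per-IP token bucket in front of the search path, written as WSGI middleware; the path, rate, and burst values are illustrative, not anything from the thread.]

```python
# Minimal sketch: per-client throttling of an expensive endpoint,
# as WSGI middleware. Path, rate, and burst are illustrative.
import time

class SearchThrottle:
    def __init__(self, app, path="/search", rate=1.0, burst=5):
        self.app = app
        self.path = path
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # max tokens a client can accumulate
        self.buckets = {}     # client IP -> (tokens, last_seen_time)

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO", "").startswith(self.path):
            ip = environ.get("REMOTE_ADDR", "?")
            tokens, last = self.buckets.get(ip, (self.burst, time.monotonic()))
            now = time.monotonic()
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens < 1.0:
                # Out of tokens: reject and tell the client when to retry.
                start_response("429 Too Many Requests",
                               [("Retry-After", "10"),
                                ("Content-Type", "text/plain")])
                return [b"slow down\n"]
            self.buckets[ip] = (tokens - 1.0, now)  # spend one token
        return self.app(environ, start_response)
```

[Wrapped around any WSGI app (`app = SearchThrottle(app)`), this returns 429 with a Retry-After header once a client exhausts its burst. A well-behaved crawler backs off when it sees 429, which is part of what separates it from the scrapers being complained about here.]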