翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Friday, 11-Oct-2024 01:25:07 JST
@wowaname It's much worse.
LLM scrapers scrape every single page and every single file, changing useragents, rotating IPs, and auto-adjusting their scrape rate to avoid getting blocked. Some even set a useragent that is empty, "" (which shows up as "-" in nginx logs), or a literal "-", so when you try to block it you inadvertently 403 the wrong useragent, or even accidentally 403 all of them.
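If you're curious, this is roughly the footgun in nginx terms (a sketch, not a drop-in config; note that "" and "-" are two different header values even though the default log format prints both as "-"):

# goes at the http{} level
map $http_user_agent $deny_ua {
    default 0;
    ""      1;  # request sent no User-Agent header at all; logged as "-"
    "-"     1;  # request sent a literal "-" User-Agent; also logged as "-"
}

# then inside a server{} or location{} block:
# if ($deny_ua) { return 403; }

Work from the log line alone and you can't tell which of the two you're actually blocking.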
Meanwhile, crawlers at least seem to identify themselves with a crawler useragent, which you can 403, or they crawl at a rate that uses a negligible amount of bandwidth on any half-decent connection.
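Which makes the polite ones trivial to block if you want to (another sketch; the bot names here are just examples, substitute whatever actually shows up in your own access logs):

# http{} level; case-insensitive regex match on self-identified bots
map $http_user_agent $deny_crawler {
    default                        0;
    "~*(GPTBot|CCBot|Bytespider)"  1;
}

# server{} or location{} block:
# if ($deny_crawler) { return 403; }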
I believe that you're probably being hit by LLM scrapers that are pretending to be spiders.