@wowaname It's much worse.
LLM scrapers scrape every single page and every single file, changing useragents and IPs, auto-adjusting their scrape rate to avoid detection, and even setting a useragent that is the empty string "" (which shows up as "-" in nginx logs) or the literal string "-", which causes you to inadvertently 403 the wrong useragent, or even accidentally 403 all of them.
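To illustrate the trap (a rough sketch of nginx config, not something I've tested against these scrapers): nginx prints "-" in the access log as a placeholder when the useragent is empty, but in the config the variable itself is actually empty, so matching on the literal "-" blocks the wrong clients:

    # Sketch: why matching the log placeholder "-" backfires.
    # nginx logs "-" when $http_user_agent is empty, but the variable
    # itself is "", so this rule only hits clients that literally send
    # "User-Agent: -" and misses the empty-useragent scrapers entirely:
    if ($http_user_agent = "-") {
        return 403;
    }

    # Matching the empty string is what actually catches a missing or
    # blank User-Agent header:
    if ($http_user_agent = "") {
        return 403;
    }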
Meanwhile, crawlers seem to at least identify themselves with a crawler useragent, which you can 403 outright, and at worst their crawl rate uses a negligible amount of bandwidth on any half-decent connection.
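That's why the self-identifying ones are easy to deal with, something like this (again just a sketch; GPTBot and CCBot are real crawler tokens, the structure is illustrative):

    # Sketch: self-identifying crawlers keep a stable token in their
    # useragent, so a case-insensitive map is enough to 403 them.
    # Goes in the http{} block; GPTBot (OpenAI) and CCBot (Common
    # Crawl) are real tokens, add whatever else shows up in your logs.
    map $http_user_agent $blocked_crawler {
        default     0;
        ~*GPTBot    1;
        ~*CCBot     1;
    }

    # Then inside server{} or location{}:
    if ($blocked_crawler) {
        return 403;
    }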
I suspect you're being hit by LLM scrapers that are pretending to be spiders.