Are they one of the ones that tries the "/ai.txt" or something or do they just fucking scrape?
Nope, they ask for robots.txt and then immediately ignore it.
18.119.253.53 - - [23/Feb/2025:02:08:20 +0000] "GET /robots.txt HTTP/2.0" 200 1833 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
I ended up just killing off their IPs, but because I also had to wipe the logs (media.fse ran out of space on /var) I can't check if they did.
With Claude it's at least easy. Return a 403 to that UA and you're done. Which, btw, still doesn't stop their attempts at scraping: they will keep hitting the webserver even when they obviously aren't being let through. From there a log monitor will do the job.
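The UA block can be done straight in the webserver config; a minimal sketch, assuming nginx (the UA string matches the ClaudeBot line from the log above):

```nginx
# Goes inside the relevant server {} block.
# Case-insensitive match on the UA seen in the access log; anything
# identifying as ClaudeBot gets a 403 before touching the site.
if ($http_user_agent ~* "claudebot") {
    return 403;
}
```

Their continued hits then show up as a clean stream of 403s in the log, which is exactly what a log monitor can key on.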
With the Chinese scrapers, automated log monitoring is a bit harder. They're clever in one way: they won't send you more than roughly 3 requests from any single IP, so the typical tools like fail2ban or anything custom won't work. None of the ones I know of do subnet/ASN detection, and if you lower the thresholds enough to catch them, they get very trigger-happy.
Thankfully they're careless in other ways, which makes them stick out like a sore thumb in the logs. Currently I just eyeball the logs every few days (unless something trips an alert) and throw the whole announced prefix into the trash. So far that has worked out great.
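The aggregation that fail2ban-style tools miss can be sketched in a few lines: count hits per prefix instead of per IP, so a scraper sending ~3 requests from each address in a subnet still crosses a threshold. This is an illustrative sketch only; it groups by /24 as a stand-in for the real announced prefix (a proper version would look the prefix up in BGP/ASN data), and the threshold and addresses are made up.

```python
# Low-and-slow scraper detection sketch: per-IP counters never trip,
# but summing over the covering /24 does.
from collections import Counter
from ipaddress import ip_network

def flag_prefixes(ips, threshold=20):
    """Return /24 prefixes whose total request count reaches threshold."""
    counts = Counter(str(ip_network(f"{ip}/24", strict=False)) for ip in ips)
    return {pfx: n for pfx, n in counts.items() if n >= threshold}

# 30 hypothetical scraper IPs in one subnet, ~3 requests each: 90 total.
hits = [f"203.0.113.{i}" for i in range(1, 31) for _ in range(3)]
hits += ["198.51.100.7"] * 2  # a normal visitor, stays under threshold
print(flag_prefixes(hits))  # the 203.0.113.0/24 prefix gets flagged
```

Once a prefix is flagged, the actual ban (null route, firewall drop, whatever) is a separate step; this only does the detection half.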