Are they one of the ones that tries the "/ai.txt" or something or do they just fucking scrape?
Nope, they ask for robots.txt and then immediately ignore it.
18.119.253.53 - - [23/Feb/2025:02:08:20 +0000] "GET /robots.txt HTTP/2.0" 200 1833 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
I ended up just killing off their IPs, but because I also had to wipe the logs (media.fse ran out of space on /var) I can't check if they did.
With Claude it's at least easy. Return a 403 to that UA and you're done. Which, btw, still doesn't stop their attempts at scraping: they will keep hitting the webserver even when they obviously aren't being let through. From there a log monitor will do the job.
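The UA block can be done straight in the webserver config; a minimal sketch, assuming nginx (the UA string matches the ClaudeBot line from the log above):

```nginx
# Goes inside the relevant server {} block.
# Case-insensitive match on the UA seen in the access log; anything
# identifying as ClaudeBot gets a 403 before touching the site.
if ($http_user_agent ~* "claudebot") {
    return 403;
}
```

Their continued hits then show up as a clean stream of 403s in the log, which is exactly what a log monitor can key on.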
With the Chinese scrapers, automated log monitoring is a bit harder. They're clever in one way: they won't send you more than roughly 3 requests from any single IP, so the typical tools like fail2ban or anything custom won't work. None of the ones I know of do subnet/ASN detection, and if you lower the thresholds enough to catch them, they get very trigger-happy.
Thankfully they're careless in other ways, which makes them stick out like a sore thumb in the logs. Currently I just eyeball the logs every few days (unless something trips an alert) and throw the whole announced prefix into the trash. So far that has worked out great.
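The aggregation that fail2ban-style tools miss can be sketched in a few lines: count hits per prefix instead of per IP, so a scraper sending ~3 requests from each address in a subnet still crosses a threshold. This is an illustrative sketch only; it groups by /24 as a stand-in for the real announced prefix (a proper version would look the prefix up in BGP/ASN data), and the threshold and addresses are made up.

```python
# Low-and-slow scraper detection sketch: per-IP counters never trip,
# but summing over the covering /24 does.
from collections import Counter
from ipaddress import ip_network

def flag_prefixes(ips, threshold=20):
    """Return /24 prefixes whose total request count reaches threshold."""
    counts = Counter(str(ip_network(f"{ip}/24", strict=False)) for ip in ips)
    return {pfx: n for pfx, n in counts.items() if n >= threshold}

# 30 hypothetical scraper IPs in one subnet, ~3 requests each: 90 total.
hits = [f"203.0.113.{i}" for i in range(1, 31) for _ in range(3)]
hits += ["198.51.100.7"] * 2  # a normal visitor, stays under threshold
print(flag_prefixes(hits))  # the 203.0.113.0/24 prefix gets flagged
```

Once a prefix is flagged, the actual ban (null route, firewall drop, whatever) is a separate step; this only does the detection half.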