Conversation
Christmas Sun (sun@shitposter.world), Saturday, 13-Apr-2024 21:33:34 JST:

@beep if a bot honors robots.txt then the htaccess rule is unnecessary. if it doesn't honor robots.txt it's probably going to fake its user agent too, unfortunately. private scrapers just fake a browser user agent.
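(For reference, a minimal sketch of the kind of user-agent-matching .htaccess rule being discussed, assuming Apache with mod_rewrite enabled; the bot names here are only illustrative, not anyone's actual blocklist:)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
RewriteRule .* - [F,L]

This returns 403 Forbidden to any request whose user agent matches, which is exactly what a scraper defeats by sending a browser user agent instead.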
Another thing you can do in robots.txt is disallow all and then whitelist specific crawlers:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
the empty Disallow permits Googlebot (a crawler obeys the most specific group that matches its user agent), while the wildcard group blocks everything else.
I can tell you how to block a lot of evil scrapers, though: put a nonexistent URL in your robots.txt, then use something like fail2ban to watch your web server log for requests to that URL and ban any IP that accesses it. This takes a more complex setup than Apache alone, unfortunately, but it actually enforces the block against crawlers that ignore robots.txt.
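A minimal sketch of that honeypot setup, assuming Apache logging to /var/log/apache2/access.log; the trap path /do-not-crawl/ and the jail name robots-honeypot are placeholders:

# robots.txt -- the trap URL is linked nowhere else, so only a crawler that
# harvests robots.txt and ignores the Disallow will ever request it
User-agent: *
Disallow: /do-not-crawl/

# /etc/fail2ban/filter.d/robots-honeypot.conf
[Definition]
failregex = ^<HOST> .*"(GET|POST|HEAD) /do-not-crawl/
ignoreregex =

# /etc/fail2ban/jail.local
[robots-honeypot]
enabled  = true
port     = http,https
filter   = robots-honeypot
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

maxretry = 1 bans on the first hit, since no legitimate visitor should ever request a path that appears only in robots.txt.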
Have a nice morning!
Ethan Marcotte 🦊 (beep@follow.ethanmarcotte.com), Saturday, 13-Apr-2024 21:33:41 JST:
wrote up how i’m blocking “artificial intelligence” bots from accessing my website, with some copy-and-paste code that should (🤞🏻) stay up-to-date whenever i update my blocklist https://ethanmarcotte.com/wrote/blockin-bots/