@beep If a bot honors robots.txt, then the .htaccess rule is unnecessary. If it doesn't honor robots.txt, it's probably going to fake its user agent too, unfortunately. Private scrapers just fake a browser user agent.
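For reference, a user-agent block in .htaccess looks roughly like this (a minimal sketch assuming Apache with mod_rewrite enabled; "BadBot" is a placeholder for whatever agent string you want to match):

RewriteEngine On
# Case-insensitive match on the User-Agent header ("BadBot" is a placeholder)
RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
# Refuse the request with a 403 Forbidden
RewriteRule . - [F,L]

Which is exactly why it's weak: a bot only has to send a different User-Agent string to walk right past it.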
Another thing you can do in robots.txt is disallow all and then whitelist specific crawlers:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
The empty Disallow permits Googlebot while the wildcard group blocks everything else: a crawler obeys only the most specific group matching its user agent, so Googlebot follows its own group and ignores the wildcard one.
I can tell you how to block a lot of evil scrapers, though. Put a disallowed URL in your robots.txt that doesn't exist on your site and isn't linked from anywhere. Since the only way to discover it is by reading robots.txt, anything that requests it has read the disallow and ignored it (or deliberately targeted it). Then use something like fail2ban to watch your web server's log file for requests to that URL and ban any IP that accesses it. This takes a more complex setup than Apache alone, unfortunately, but it actually enforces the block.
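A minimal sketch of that setup (the trap path /trap-4f9a/, the filter name, and the Debian-style Apache log path are all placeholders; adjust them for your server):

# In robots.txt: a path that exists nowhere and is linked from nowhere
User-agent: *
Disallow: /trap-4f9a/

# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
# Match any request for the trap path in a combined-format access log
failregex = ^<HOST> .* "(GET|POST|HEAD) /trap-4f9a/

# /etc/fail2ban/jail.local
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/apache2/access.log
# One hit is enough; ban the IP for a day
maxretry = 1
bantime  = 86400

With maxretry = 1, a single request to the trap path gets the IP banned at the firewall, which holds no matter what user agent the bot claims to be.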
Have a nice morning!