Conversation
Christmas Sun (sun@shitposter.world), Saturday, 13-Apr-2024 21:33:34 JST:

@beep if a bot honors robots.txt then the htaccess rule is unnecessary. if it doesn't honor robots.txt it's probably going to fake its user agent too, unfortunately. private scrapers just fake a browser user agent.
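(For reference, a minimal sketch of the kind of user-agent-matching .htaccess rule being discussed, assuming Apache with mod_rewrite enabled; the bot names here are only illustrative, not anyone's actual blocklist:)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
RewriteRule .* - [F,L]

This returns 403 Forbidden to any request whose user agent matches, which is exactly what a scraper defeats by sending a browser user agent instead.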
Another thing you can do in robots.txt is disallow all and then whitelist specific crawlers:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
the empty Disallow permits Googlebot (a crawler obeys the most specific group that matches its user agent), while the wildcard group blocks everything else.
I can tell you how to block a lot of evil scrapers, though: put a nonexistent URL in your robots.txt, then use something like fail2ban to watch your web server log for requests to that URL and ban any IP that accesses it. This takes a more complex setup than Apache alone, unfortunately, but it actually enforces the block against crawlers that ignore robots.txt.
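A minimal sketch of that honeypot setup, assuming Apache logging to /var/log/apache2/access.log; the trap path /do-not-crawl/ and the jail name robots-honeypot are placeholders:

# robots.txt -- the trap URL is linked nowhere else, so only a crawler that
# harvests robots.txt and ignores the Disallow will ever request it
User-agent: *
Disallow: /do-not-crawl/

# /etc/fail2ban/filter.d/robots-honeypot.conf
[Definition]
failregex = ^<HOST> .*"(GET|POST|HEAD) /do-not-crawl/
ignoreregex =

# /etc/fail2ban/jail.local
[robots-honeypot]
enabled  = true
port     = http,https
filter   = robots-honeypot
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

maxretry = 1 bans on the first hit, since no legitimate visitor should ever request a path that appears only in robots.txt.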
Have a nice morning!
Ethan Marcotte 🦊 (beep@follow.ethanmarcotte.com), Saturday, 13-Apr-2024 21:33:41 JST:
wrote up how i’m blocking “artificial intelligence” bots from accessing my website, with some copy-and-paste code that should (🤞🏻) stay up-to-date whenever i update my blocklist https://ethanmarcotte.com/wrote/blockin-bots/