... Summing up the top UA groups, it looks like my server is doing 70% of all its work for these fucking LLM training bots that don’t to anything except for crawling the fucking internet over and over again. Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki. And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users. If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
https://files.mastodon.social/media_attachments/files/113/724/239/725/832/913/original/ced017dd3cd7a9c6.png