djsumdog (djsumdog@djsumdog.com)'s status on Saturday, 10-May-2025 16:00:16 JST
Why does Arch Wiki, a public open source documentation system, not want a crawler to index their site? People can scream AI all they want, but the admins are also destroying any new attempts to break into the search engine market. Do they think Google/Bing/Yandex don't already get past this, or do their servers return different results for the big search bots?
People used to be able to view the Arch wiki without Javascript. Now they can't. 😡
@shortstories@djsumdog Arguably not much more than they can spy on you anyway [especially if you have scripts enabled]. The main objection is that javascript is fucking shit and makes everything worse.
uBlock Origin makes it easy to disable JS universally and enable it selectively (see the example rules after this post), without the complexity of uMatrix. The speed difference is noticeable.
Also, the Gentoo wiki has no such Javascript "proof of work" for accessing it.
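Roughly, and assuming uBlock Origin's current "My rules" syntax, that setup is just a global no-scripting switch plus per-site exceptions; the wiki domain below is only an example:

```
no-scripting: * true
no-scripting: wiki.archlinux.org false
```

The first rule turns scripting off everywhere; the second re-enables it for a single site you trust.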
@shortstories@djsumdog Browse the web with scripts disabled, then browse it with scripts enabled. Observe, with your own eyes, the performance difference.
If I had to guess, it is not javascript itself that is the problem, but javascript libraries plus trying to keep things updated.
If you write code to do certain basic things that do not need constant updating and do not access a library, then if it works it should continue to work.
Once someone puts library software in the code, working code can malfunction when the library changes.
I have no idea what you're arguing. The problem I had was that I couldn't see a static website because it requires Javascript, NOT for functionality, not even for DDoS protection, but to solve a proof of work because they don't want to be scraped by "AI" ... an open source documentation site not wanting to be scraped. Let that sink in.
@Zergling_man@djsumdog If someone does not know how to program something, or is too lazy to do it themself, they will use someone else's library.
These libraries can change at any time.
So what might have worked might stop working when the library is changed.
These libraries allow people who do not know what they are doing to look competent and slip in bad code that will malfunction later, after they get paid to do their job.
I would suggest that these libraries are an additional serious problem.
I would suggest there are two reasons for the difference:
1. Having more code to run slows everything down, in exchange for whatever additional feature is provided by running the code.
2. The problem is not primarily from Javascript itself, other than the mistake of whoever put in the library feature. I would suggest the problem is with people who are bad at computer programming writing the code in Javascript using libraries, because they do not know how to program.
@shortstories@djsumdog It's just 1. It's all 1. If it were actually an "additional feature" it would make sense. 99% of the time it is not. Like loading a form; as if there isn't a standard way to do that already.
@tyil@djsumdog Please do not immorally attack people with proprietary software, tyil - you know better.
One way to solve that issue is to set bait with gzip bombs: https://idiallo.com/blog/zipbomb-protection ("Content-Encoding: deflate, gzip" is incorrect; it should be "Content-Encoding: gzip") - many bots will fetch such bombs and crash.
Most scraper bots seem to use Apple user agents, and just returning 403 for .*AppleWebKit.* fixes that issue for cgit (or, if you still want to allow isheep access to your website, maybe attacking apple users with more proprietary malware is what they deserve). A rough sketch of both ideas follows below.
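As a sketch only (the user-agent pattern, payload size, and port are illustrative, not taken from the linked article or from any real cgit setup), both ideas combined could look something like this:

```python
# Sketch: serve a pre-built gzip "bomb" to suspected scraper bots, plain text to everyone else.
import gzip
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# 10 MiB of zeros compresses to a few KiB on the wire; real deployments use far larger payloads.
BOMB = gzip.compress(b"\0" * (10 * 1024 * 1024), compresslevel=9)

# Illustrative pattern only; tune it to whatever actually hammers your logs.
BOT_UA = re.compile(r"AppleWebKit", re.I)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if BOT_UA.search(ua):
            self.send_response(200)
            # Must be exactly "gzip" (not "deflate, gzip"), or clients won't
            # decompress the payload the way the bait relies on.
            self.send_header("Content-Encoding", "gzip")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(BOMB)))
            self.end_headers()
            self.wfile.write(BOMB)
        else:
            body = b"hello\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

In practice the user-agent match and the 403/bomb response would live in the reverse proxy in front of cgit rather than in a standalone script, but the headers are the part that matters.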
@djsumdog@djsumdog.com Assuming they suffer the same issues as my cgit instance, the choice is to either be down completely because LLM scrapers constantly overload the instance, or force JS so at least some people can use the site.
I don't enjoy using Anubis; I think it is stupid to waste CPU cycles like this. I do use Anubis on services that would otherwise go down all day, because I currently have no better solution to fight back against LLMs. I don't have the money for infinite resources, and I don't have the time to constantly log potential LLM bots and block them.
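For what it's worth, the CPU cycles in question go into a hash-search challenge. A rough sketch of the general scheme (not Anubis's actual protocol; the difficulty and nonce format here are made up) looks like this:

```python
# Sketch of a SHA-256 proof-of-work challenge: the client searches for a nonce,
# the server only has to check one hash.
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    """Find a nonce so sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

if __name__ == "__main__":
    nonce = solve("example-challenge", 4)                 # the client burns CPU here
    print(nonce, verify("example-challenge", nonce, 4))   # the server check is cheap
```

The asymmetry is the whole point: a single visitor barely notices the delay, but a scraper requesting millions of pages has to pay it millions of times.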