@vnpower It's a real shame that Codeberg now sends malware to browsers instead of the requested file, to those who just want to look at a few source files without doing a full git clone, huh?
@vnpower Unfortunately, I didn't find any free software, as the repos either had no license, or had a license file dumped into the root (which means nothing without a comment stating which files the license applies to, or better, a license header on each nontrivial file).
Would you please consider releasing your software as free software?
@vnpower Anubis is malware as it sends malware to users that wastes CPU cycles for the sole reason of wasting them (the world wide web of arbitrary remote code execution doesn't grant the user the 4 freedoms, even if the JavaScript is under a free license).
There are actually effective ways to defend against LLM scraping without attacking users with malware - therefore it's always unfair for innocent people to get attacked with malware, even if the target is malicious scrapers (but that doesn't seem to be the case, as "mozilla" is targeted, rather than the scraper favorite of a useragent that contains "AppleWebKit", or "Amazon" or "GPTbot").
Unfortunately, due to how all creative works are automatically copyrighted, all software is by default proprietary unless it is validly placed under a license or validly released to the public domain (although some countries do not recognize the public domain).
@tyil @vnpower The vast majority of LLM scrapers identify themselves with a useragent that contains "AppleWebKit" and/or "Chrome" and/or "Safari".
It seems that the implementers haven't bothered to actually read the useragents past the Mozilla bit at the start!
Those who use browsers that insert "AppleWebKit" and/or "Chrome" and/or "Safari" into the useragent string clearly aren't human, thus I have no comment about such sort being served malware - but those using GNU IceCat for example should not be served malware.
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>"mozilla" is targeted, rather than the scraper favorite of a useragent that contains "AppleWebKit", or "Amazon" or "GPTbot"
That's because the vast majority of LLM scrapers try to impersonate a regular browser. The ones that identify themselves appropriately with their User-Agent are easily blocked. Anubis exists exactly to stop those that _don't_ identify themselves properly.
@vnpower Please remember to add an unambiguous note as to which file(s) the license apply to in the README, or better a license header to each nontrivial file - after all, a license file just sitting there doesn't mean anything - it could be just there to look nice after all.
@tyil @vnpower I'm not pretending to be a browser - I am a browser.
The user being forced to use workarounds that change the useragent isn't reasonable (as it's trivial for scrapers to change their useragents, but not so for browsers with resistFingerprinting enabled).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>The vast majority of LLM scrapers identify themselves
Factually incorrect, but it doesn't matter much either. Anubis doesn't block user agents that don't match those used by browsers. It specifically only injects itself when a connection pretends to be a browser, and asks the client to prove it. You can alter your user agent and Anubis won't hurt you.
@tyil @vnpower
>because we (website admins) cannot differentiate if you all use the same user agent.
It's quite easy. If it says "AppleWebKit" and/or "Chrome" and/or "Safari", it's not human.
It's a massive SKILL ISSUE putting "Mozilla" where "AppleWebKit" belongs;
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
As you can see, most LLM scrapers are banking on blocking or attacking iSheep or Chrome users being too costly, but for free software projects, it isn't.
There are also some scrapers that say "Mozilla", but you clearly 403 the "DotBot" & "Barkrowler" part, not the Mozilla part;
"Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
"Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)"
Clearly, if it says GNU/Linux or IceCat etc, it's human (even though it also contains "Mozilla").
>Please make them do so, because then those unique user agents can be blocked with a single line.
If they want to scrape with "AppleWebKit" and/or "Chrome" and/or "Safari" blocked, then they'd have to change their useragent, clearly.
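The substring heuristic argued for above can be sketched in a few lines - a minimal illustration of the idea as stated in this thread, not Anubis's actual logic; the GPTBot useragent is one quoted above, and the IceCat-style useragent is an illustrative assumption:

```python
# Heuristic from the thread: treat any useragent containing
# "AppleWebKit", "Chrome" or "Safari" as a non-human client,
# while letting through free browsers (e.g. GNU IceCat) that
# say "Mozilla" but omit those tokens.
SCRAPER_TOKENS = ("AppleWebKit", "Chrome", "Safari")

def looks_nonhuman(user_agent: str) -> bool:
    """Return True if the useragent matches the 'not human' heuristic."""
    return any(token in user_agent for token in SCRAPER_TOKENS)

# A scraper useragent quoted in this thread:
gptbot = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
          "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
# Illustrative IceCat-style useragent (assumed, Firefox-like):
icecat = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

print(looks_nonhuman(gptbot))  # True
print(looks_nonhuman(icecat))  # False
```

Note the obvious limitation discussed later in the thread: this also matches every real Chromium- and WebKit-based browser, free or not.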
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>I am a browser.
So you say. But so do thousands of other connections that are not browsers. Anubis thus asks you to prove it, because we (website admins) cannot differentiate if you all use the same user agent.
I don't like that this is needed either, but there's no better solution so far that can be implemented cheaply.
>it's trivial for scrapers to change their useragents
Please make them do so, because then those unique user agents can be blocked with a single line.
@tyil @vnpower
>Actual browsers use those in their user agent
Only proprietary browsers use those, which are of lesser concern than free browsers.
>will still serve literal tens of thousands of requests to connections pretending to be users
Tell me, what percentage of those LLM scraper connection requests contain "AppleWebKit" and/or "Chrome" and/or "Safari"? (I suspect it'll be >90%.)
Another effective mitigation is to run a tor middle relay on the same IPv4 address as your cgit - with a mere 7Mbit/second of bandwidth allocated, the Great Firewall of China will handle the problem of most Chinese (computer-utilizing) scrapers for you, and you help the tor network too (just make sure your cgit is also reachable via IPv6, so people from China can access it).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>It's quite easy. If it says "AppleWebKit" and/or "Chrome" and/or "Safari" it's not human.
Whether you like it or not, that's not the case. Actual browsers use those in their user agent, and you said yourself earlier we cannot expect users to fix this.
>It's a massive SKILL ISSUE putting "Mozilla" where "AppleWebKit" belongs;
Then so too it's a skill issue to not identify IceCat as IceCat/1.0 or whatever version it is.
>As you can see, most LLM scrapers are banking on blocking or attacking iSheep or Chrome users being too costly
I can see only 3 potential LLM scrapers there, out of many. For the record, I am blocking those, and without Anubis my cgit instance will still serve literal tens of thousands of requests to connections pretending to be users. So no, not "most"; only a few, maybe two dozen or so, identify themselves appropriately. The other dozens if not hundreds do not.
@tyil @vnpower Without Anubis - with (quite high, to allow for git clone) nginx rate-limiting (which aggressive scrapers will still hit), LLM scraper useragents 403'd, gzip bombs, and a tor middle hosted - rapid scraping stops (all that is left is very slow scraping bots, which can probably be dealt with via heuristics).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>Only proprietary browsers use those
Objectively false. Ungoogled Chromium for instance uses it, and that is a free as in freedom browser.
>Another effective mitigation is to run a tor middle
I am running a Tor relay already. It is not effective in the least.
>just make sure your cgit is also reachable via IPv6
It already is.
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>I am dubious of that claim.
You are free to be dubious, but you are also wrong. While I don't like that browser, it is free software and therefore I have no moral issues with it.
>Idk, works on other machines.
Clearly it does not work on other machines, hence Anubis is used so much. In this thread alone, the majority uses it to bar LLM scrapers from disrupting their services.