@vnpower It's a real shame that Codeberg now sends malware to browsers instead of the requested file, to those who just want to look at a few source files without doing a full git clone, huh?
@vnpower Unfortunately, I didn't find any free software, as the repos either had no license, or had a license file dumped into the root (which means nothing without a comment stating which files the license applies to, or better, a license header on each nontrivial file).
Would you please consider releasing your software as free software?
@vnpower Anubis is malware as it sends malware to users that wastes CPU cycles for the sole reason of wasting them (the world wide web of arbitrary remote code execution doesn't grant the user the 4 freedoms, even if the JavaScript is under a free license).
There are actually effective ways to defend against LLM scraping without attacking users with malware - therefore it's always unfair for innocent people to get attacked with malware, even if the target is malicious scrapers (but that doesn't seem to be the case, as "mozilla" is targeted, rather than the scraper favorite of a useragent that contains "AppleWebKit", or "Amazon" or "GPTbot").
Unfortunately, due to how all creative works are automatically copyrighted, all software is by default proprietary unless it is validly placed under a license or validly released to the public domain (although some countries do not recognize the public domain).
@tyil @vnpower The vast majority of LLM scrapers identify themselves with a useragent that contains "AppleWebKit" and/or "Chrome" and/or "Safari".
It seems that the implementers haven't bothered to actually read the useragents past the Mozilla bit at the start!
Those who use browsers that insert "AppleWebKit" and/or "Chrome" and/or "Safari" into the useragent string clearly aren't human, thus I have no comment about such sort being served malware - but those using GNU IceCat for example should not be served malware.
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>"mozilla" is targeted, rather than the scraper favorite of a useragent that contains "AppleWebKit", or "Amazon" or "GPTbot"
That's because the vast majority of LLM scrapers try to impersonate a regular browser. The ones that identify themselves appropriately with their User-Agent are easily blocked. Anubis exists exactly to stop those that _don't_ identify themselves properly.
@vnpower Please remember to add an unambiguous note as to which file(s) the license apply to in the README, or better a license header to each nontrivial file - after all, a license file just sitting there doesn't mean anything - it could be just there to look nice after all.
@tyil @vnpower I'm not pretending to be a browser - I am a browser.
The user being forced to use workarounds that change the useragent isn't reasonable (as it's trivial for scrapers to change their useragents, but not so for browsers with resistFingerprinting enabled).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>The vast majority of LLM scrapers identify themselves
Factually incorrect, but it doesn't matter much either. Anubis doesn't block user agents that don't match those used by browsers. It specifically only injects itself when a connection pretends to be a browser, and asks the client to prove it. You can alter your user agent and Anubis won't hurt you.
@tyil @vnpower
>because we (website admins) cannot differentiate if you all use the same user agent.
It's quite easy. If it says "AppleWebKit" and/or "Chrome" and/or "Safari", it's not human.
It's a massive SKILL ISSUE putting "Mozilla" where "AppleWebKit" belongs;
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
As you can see, most LLM scrapers are banking on blocking or attacking iSheep or Chrome users being too costly, but for free software projects, it isn't.
There are also some scrapers that say "Mozilla", but you clearly 403 the "DotBot" & "Barkrowler" part, not the Mozilla part;
"Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
"Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)"
Clearly, if it says GNU/Linux or IceCat etc, it's human (even though it also contains "Mozilla").
>Please make them do so, because then those unique user agents can be blocked with a single line.
If they want to scrape with "AppleWebKit" and/or "Chrome" and/or "Safari" blocked, then they'd have to change their useragent, clearly.
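The substring heuristic argued for above can be sketched in a few lines - a minimal illustration of the idea as stated in this thread, not Anubis's actual logic; the GPTBot useragent is one quoted above, and the IceCat-style useragent is an illustrative assumption:

```python
# Heuristic from the thread: treat any useragent containing
# "AppleWebKit", "Chrome" or "Safari" as a non-human client,
# while letting through free browsers (e.g. GNU IceCat) that
# say "Mozilla" but omit those tokens.
SCRAPER_TOKENS = ("AppleWebKit", "Chrome", "Safari")

def looks_nonhuman(user_agent: str) -> bool:
    """Return True if the useragent matches the 'not human' heuristic."""
    return any(token in user_agent for token in SCRAPER_TOKENS)

# A scraper useragent quoted in this thread:
gptbot = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
          "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
# Illustrative IceCat-style useragent (assumed, Firefox-like):
icecat = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

print(looks_nonhuman(gptbot))  # True
print(looks_nonhuman(icecat))  # False
```

Note the obvious limitation discussed later in the thread: this also matches every real Chromium- and WebKit-based browser, free or not.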
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>I am a browser.
So you say. But so do thousands of other connections that are not browsers. Anubis thus asks you to prove it, because we (website admins) cannot differentiate if you all use the same user agent.
I don't like that this is needed either, but there's no better solution so far that can be implemented cheaply.
>it's trivial for scrapers to change their useragents
Please make them do so, because then those unique user agents can be blocked with a single line.
@tyil @vnpower
>Actual browsers use those in their user agent
Only proprietary browsers use those, which are of lesser concern than free browsers.
>will still serve literal tens of thousands of requests to connections pretending to be users
Tell me, what percentage of those LLM scraper connection requests contain "AppleWebKit" and/or "Chrome" and/or "Safari"? (I suspect it'll be >90%.)
Another effective mitigation is to run a tor middle relay on the same IPv4 address as your cgit - with a mere 7Mbit/second of bandwidth allocated, the Great Firewall of China will handle the problem of most Chinese (computer-utilizing) scrapers for you, and you help the tor network too (just make sure your cgit is also reachable via IPv6, so people from China can access it).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>It's quite easy. If it says "AppleWebKit" and/or "Chrome" and/or "Safari" it's not human.
Whether you like it or not, that's not the case. Actual browsers use those in their user agent, and you said yourself earlier we cannot expect users to fix this.
>It's a massive SKILL ISSUE putting "Mozilla" where "AppleWebKit" belongs;
Then so too it's a skill issue to not identify IceCat as IceCat/1.0 or whatever version it is.
>As you can see, most LLM scrapers are banking on blocking or attacking iSheep or Chrome users being too costly
I can see only 3 potential LLM scrapers there, out of many. For the record, I am blocking those, and without Anubis my cgit instance will still serve literal tens of thousands of requests to connections pretending to be users. So no, not "most"; only a few, maybe two dozen or so, identify themselves appropriately. The other dozens if not hundreds do not.
@tyil @vnpower Without Anubis - with (quite high, to allow for git clone) nginx rate-limiting (which aggressive scrapers will still hit), LLM scraper useragents 403'd, gzip bombs, and a tor middle hosted - rapid scraping stops (all that is left is very slow scraping bots, which can probably be dealt with via heuristics).
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>Only proprietary browsers use those
Objectively false. Ungoogled Chromium for instance uses it, and that is a free as in freedom browser.
>Another effective mitigation is to run a tor middle
I am running a Tor relay already. It is not effective in the least.
>just make sure your cgit is also reachable via IPv6
It already is.
@Suiseiseki@freesoftwareextremist.com @vnpower@mstdn.maud.io
>I am dubious of that claim.
You are free to be dubious, but you are also wrong. While I don't like that browser, it is free software and therefore I have no moral issues with it.
>Idk, works on other machines.
Clearly it does not work on other machines, hence Anubis is used so much. In this thread alone, the majority uses it to bar LLM scrapers from disrupting their services.