The irony is not lost on me that the Internet Archive went out of its way to acquire the physical versions of millions of books and loan them out carefully and in a limited way, and is facing a near-extinction-level event over it, while for-profit and VC-backed companies are just stealing people’s content and making up excuses to validate the bad behavior.
@feld @mitch @ernie there's no way to square this circle anyway, copyright is a nonsensical 'right' and i'm happy that llms are exposing it in the way that they are
@ernie after all, coaxing an LLM to reproduce a reference work basically in full is pretty established research at this point. We know it's possible — it's how the tech started, by being able to reproduce a ground truth image despite never having actually been exposed to the original file.
@ernie you know, I have to wonder if the inaction on prosecuting LLM training companies actually introduced a legal loophole for libraries.
Consider that right now, the American legal standard is that GenAI output is considered a derivative work, even if it is derived from 30 billion works. I wonder: if the Internet Archive "chunked" editions of books together into a specialized model, could they then "loan" the book out by inferencing a near-exact but legally 'distinct' copy of that work?
your post gave me the following idea:
the archive should train some LLM on all of those books, and then publish the trained model.
who'd want to borrow the books under DRM if they can have a locally-running LLM that can search, summarize or even "write" them on demand?
crossing these rays would pit the LLM giants against the book MAFIAA. in such a fight, we should all be rooting for the fight, but if it brings LLM giants to defend the Internet Archive, that could be good?
cc: @brewsterkahle
@ernie @mybarkingdogs @pbaesse @brewsterkahle
Yes Alexandre, I basically had the same thoughts when reading what Ernie wrote. Thanks Ernie. And the picture is even worse for the Internet Archive, and other archives, given that these #LLM (both of those 'L' stand for 'leeching', don't they?) are at the same time acting as denial of service attackers.
There is another aspect: how can it be that our media is basically owned by a handful of corporations that not only profit from the sale of the content, but are now also able to watch the people watching it? It's abuse, and a well-trained LLM could help mitigate it.
I'm an #I2P maximalist and I've seen these LLMs supposedly available over I2P torrent; they are in the tens to hundreds of gigabytes. I don't know if it's worth the FSF operating one that can be trained on different software so as to help people set up, customise and enjoy free software. The subdomain might be a derogatory take on the #AI acronym, like aidiot.fsf.org, and maybe it can be trained to tell people that it may talk complete garbage, so check sources and/or man(ual) pages if in doubt. Anyway, I'm no expert in this area. I just don't like the (a)idea of it being used only by the large players to further entrench their power.
@lxo @ernie @mybarkingdogs @pbaesse @brewsterkahle
This is not to disparage, but it must be pointed out that the Internet Archive is huge, and maybe it is wrong that we burden the IA with the task of being the sole #archivist across the web. I know that not everyone can operate an archive, but we ought to have at least a couple of archives on each continent that can at a minimum archive the content that originates on their continent, or content in their language. I'm very interested in finding a way to store and deliver content in a decentralised manner, where people might even be rewarded in some small way for hosting it.
Maybe I need to ask #aidiot 🤖 how to set this up. 🤣🤣🤣
(As an aside, if I may, I'd like to say a word about the health of the fediverse that relates to the above point: please be careful of large content delivery systems for fedi. This might include mastoHost, which appears to host writing.exchange. It also includes Cloudflare, which delivers images for freeradical.zone or, in the case of ursal.zone, afaict, delivers their entire service.)
> lets focus on the point at hand (...) too many good discussions get derailed
Yes, I put that last note in parentheses and started with "as an aside" so as to just provide a cautionary semi-related note. A lot of fedizens don't know this side of things, and I find many appreciate learning about it. I can't say I've ever seen a derailment caused by talking about this.
@dcent @lxo to drill into your broader point, having a single archive work as the internet's custodian doesn't make sense long term, as it puts us at risk of broader legal challenges like the ones we're facing.
The LLM idea is very clever, but I will point out that the courts don't seem to have a great appetite for novel legal theories, given that this whole debate hinges on controlled digital lending.
We need better strategies for protecting archives.
@ernie @lxo @brewsterkahle I don't have experience with courts in this area, but you're probably right, and thus an LLM might be an area to explore, not as an argument in court but as a practical method (of burning more energy than is needed) to help people find and understand works.
I get the sense that the corporate actors have minimal or no interest in #archives until they need to cite something; then they appreciate them. Many websites are under #copyright. How then can a distinction be made between someone requesting a copyrighted webpage and someone requesting a copyrighted book? Is it a matter of payment? Webpages are delivered freely, unlike #books, but if one were to bring together enough #webpages with enough quotes from a book or film, you might be able to collect something that encompasses the entire work. Should works be immune from being archived because someone demands payment for them?
Maybe #lawsuits of this nature are a sign that in an epoch where everything seems to be on a cost curve approaching zero, people are desperate to claw back as much income as possible and are now starting to attack basic #institutions. Maybe a universal basic income (#UBI) is needed to allow artists and writers of all ilks to survive, including #journalists, whose plight seems to be a recurring topic here in #Australia.