I am, on balance, opposed to AI art, but there are some important caveats to that position. For starters, I think it's unequivocally wrong - as a matter of law - to say that scraping works and training a model with them infringes copyright. This isn't a moral position (I'll get to that in a second), but rather a technical one.
Break down the steps of training a model and it quickly becomes apparent why it's technically wrong to call this a copyright infringement. First, the act of making transient copies of works - even billions of works - is unequivocally fair use. Unless you think search engines and the @internetarchive shouldn't exist, you should support scraping at scale:
When it comes to AI art (or "art"), it's hard to find a nuanced position that respects creative workers' labor rights, free expression, copyright law's vital exceptions and limitations, and aesthetics.
--
If you'd like an essay-formatted version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
And unless you think that Facebook should be allowed to use the law to block projects like Ad Observer, which gathers samples of paid political disinformation, then you should support scraping at scale, *even when the site being scraped objects* (at least sometimes):
After making transient copies of lots of works, the next step in AI training is to subject them to mathematical analysis. Again, this isn't a copyright violation.
Making quantitative observations about works is a longstanding, respected and important tool for criticism, analysis, archiving and new acts of creation. Measuring the steady contraction of the vocabulary in successive Agatha Christie novels turns out to offer a fascinating window into her dementia:
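To make "quantitative observations about works" concrete, here's a minimal sketch of one such measurement: counting how many distinct words a passage uses relative to its length. The function name `vocabulary_stats` and both sample sentences are invented for illustration; they aren't drawn from Christie's novels or from any published study. The raw ratio is sensitive to text length, so a real comparison would use equal-sized samples.

```python
import re


def vocabulary_stats(text: str) -> tuple[int, int, float]:
    """Return (total words, distinct words, distinct/total ratio) for a text."""
    # Lowercase the text and pull out runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    distinct = len(set(words))
    ratio = distinct / total if total else 0.0
    return total, distinct, ratio


# Hypothetical samples of equal length (12 words each), invented for this sketch.
early_sample = "the detective examined the curious brass key and found a faded letter"
late_sample = "the thing was in the thing and the thing was a thing"

for label, sample in (("early", early_sample), ("late", late_sample)):
    total, distinct, ratio = vocabulary_stats(sample)
    print(f"{label}: {distinct} distinct words out of {total} (ratio {ratio:.2f})")
```

Nothing here stores or republishes the text being measured. The output is a handful of numbers *about* the work, which is the point of the argument.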
Programmatic analysis of scraped online speech is also critical to the burgeoning formal analyses of the language spoken by minorities, producing a vibrant account of the rigorous grammar of dialects that have long been dismissed as "slang":
Since 1988, UCL Survey of English Language has maintained its "International Corpus of English," and scholars have plumbed its depth to draw important conclusions about the wide variety of Englishes spoken around the world, especially in postcolonial English-speaking countries:
The final step in training a model is publishing the conclusions of the quantitative analysis of the temporarily copied documents as software code. Code itself is a form of expressive speech - and that expressivity is key to the fight for privacy, because the fact that code is speech limits how governments can censor software:
Are models infringing? Well, they certainly *can* be. In some cases, it's clear that models "memorized" some of the data in their training set, turning the fair-use, transient copy into an infringing, permanent one. That's generally considered to be the result of a programming error, and it could certainly be prevented (say, by comparing the model to the training data and removing any memorizations that appear).
Not every seeming act of memorization *is* a memorization, though. While specific models vary widely, the amount of data from each training item retained by the model is *very* small. For example, Midjourney retains about one byte of information from each image in its training data. If we're talking about a typical low-resolution web image of, say, 300KB, that would be one three-hundred-thousandth (about 0.00033%) of the original image.
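As a quick check on that arithmetic, here's a small sketch. It takes the figures from the paragraph above as given (roughly one byte retained per image, a 300KB image) and assumes "300KB" means 300,000 bytes. The function name `retained_fraction` is made up for this example.

```python
def retained_fraction(retained_bytes: float, original_bytes: float) -> float:
    """Fraction of the original file's size that the retained data represents."""
    return retained_bytes / original_bytes


# Figures taken from the text; 300KB is assumed to mean 300,000 bytes.
fraction = retained_fraction(1, 300_000)
percent = fraction * 100

print(f"fraction: {fraction:.7f}")  # one three-hundred-thousandth
print(f"percent:  {percent:.5f}%")
```

One three-hundred-thousandth is about 0.0000033 as a plain fraction, which is about 0.00033 when expressed as a percentage.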
By Cory Doctorow (GPG 0xBF3D9110957E5F4C) @doctorow. Archived at pluralistic.net. I post long threads. If you don't like these in your timeline but want to read them, I suggest unfollowing me here and subscribing to my RSS, or my newsletter, or any of my various long-form feeds. Links at https://pluralistic.net. tootfinder