Conversation

Notices

see shy jo (joeyh@hachyderm.io)'s status on Sunday, 17-Mar-2024 01:41:46 JST

I am disappointed in Software Heritage.
They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi
These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a fig leaf for extractive behavior.
In conversation about a year ago from hachyderm.io permalink
Attachments
1. Software Heritage Statement on Large Language Models for Code
  
  from Marla da Silva
  
  Our mission at Software Heritage is to collect, preserve, and make publicly available the entire body of software, in the preferred form for making modifications to it. We consider that publicly available source code, and even more so Free and Open Source Software (FOSS), is a digital commons that embodies decades of human creative effort. […]
- Haelwenn /элвэн/ :triskell: likes this.
- tech? no! man, see... (technomancy@icosahedron.website)'s status on Sunday, 17-Mar-2024 01:41:44 JST
  
  @joeyh did you see what happens when you click opt out?
  https://github.com/bigcode-project/opt-out-v2/issues
  there's 247 open issues they're ignoring that go back nearly a year; meanwhile only one opt-out request has ever been actually closed
  
  In conversation about a year ago permalink
  
  Haelwenn /элвэн/ :triskell: likes this.
- see shy jo (joeyh@hachyderm.io)'s status on Sunday, 17-Mar-2024 01:41:45 JST
  
  "3. Mechanisms should be established, where possible, for authors to exclude their archived code from the training inputs before model training begins."
  But in practice, they seem ok with this post-training removal process: https://huggingface.co/spaces/bigcode/in-the-stack
  In conversation about a year ago permalink
  Attachments
  1. Am I in The Stack? - a Hugging Face Space by bigcode
    
    Discover amazing ML apps made by the community
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:37:11 JST
  
  The insufficiency is simple: when an LLM trained on software can output portions of copyrighted software, which such models absolutely can and do, and when that output gets used in proprietary software, all the provenance tracking of the dataset used to train it becomes irrelevant. At that point my license has been violated.
  The silence of Software Heritage's statement on this topic, in its list of principles, is deafening.
  
  In conversation about a year ago permalink
  
  Haelwenn /элвэн/ :triskell: likes this.
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:38:01 JST
  
  Yes, the terms of use of The Stack require updating your copy of the dataset when it's updated to remove software https://huggingface.co/datasets/bigcode/the-stack-v2
  But they say nothing about stopping using models already trained on that data.
  And "the most recent usable version" gives considerable leeway. Presumably if we all removed all our software from The Stack, it would no longer be usable.
  Also, interesting how THEIR terms matter, but MY terms don't.
  In conversation about a year ago permalink
  Haelwenn /элвэн/ :triskell: likes this.
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:39:00 JST
  
  (I should note that I've had considerable difficulty getting my software into Software Heritage in the first place, since I refuse to host it on GitHub. The irony.)
  
  In conversation about a year ago permalink
  
  Haelwenn /элвэн/ :triskell: likes this.
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:39:01 JST
  
  By facilitating a corporation that is attempting to set itself up as a governance over my community, how is Software Heritage not behaving in a way that runs counter to their mission statement of preserving software?
  My immediate reaction is to consider removing my software from Software Heritage itself!
  Asking to be removed from The Stack would implicitly legitimize this claim of governance over me.
  
  In conversation about a year ago permalink
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:39:02 JST
  
  "The Stack is an open governance interface between the AI community and the open source community."
  This is a seizure of power. It is not legitimate governance.
  
  In conversation about a year ago permalink
- Stefano Zacchiroli (zacchiro@mastodon.xyz)'s status on Tuesday, 19-Mar-2024 00:41:34 JST
  
  @joeyh I'd also love for the legal reality of the world to be that, if you train on copyleft code, the output is copylefted. But the jury is still out on that (and it doesn't look good).
  While waiting for that, I'd settle for the complete and corresponding source (CCS) of a code LLM to include all of its training set, which is part of what the principles entail.
  
  In conversation about a year ago permalink
  
  Haelwenn /элвэн/ :triskell: likes this.
- James Just James (purpleidea@mastodon.social)'s status on Tuesday, 19-Mar-2024 00:41:36 JST
  in reply to Stefano Zacchiroli
  @joeyh @zacchiro might be a knowledgeable person to discuss this with.
  
  In conversation about a year ago permalink
- Stefano Zacchiroli (zacchiro@mastodon.xyz)'s status on Tuesday, 19-Mar-2024 00:41:36 JST
  in reply to James Just James
  @purpleidea Thanks James. Hello @joeyh .
  I'd love to hear how you think the principles can be made stronger. (Disclosure: I've contributed inputs to those principles, but I'm not the decision maker.)
  For context, my general take is that, given code LLMs exist anyway, we (= free software activists) need them to be free/open (in their various parts) to create more free software.
  
  In conversation about a year ago permalink
- see shy jo (joeyh@hachyderm.io)'s status on Tuesday, 19-Mar-2024 00:41:38 JST
  
  By the way, I'd love for someone to tell me I've gotten some or all of this wrong! I really want to not lose my respect for SWH.
  (No interest in debating LLM-as-copyright laundering here or ever tho. Or with any apologists for any corporations.)
  
  In conversation about a year ago permalink