@mishari "We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT"
@dentangle I don't think it's that simple. I was reading a commentary arguing that, given how small models are relative to their training sets, it is very unlikely that any single byte of the original code is stored in the model in any meaningful way.
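The capacity argument behind that commentary can be put in rough numbers. A minimal back-of-envelope sketch, assuming approximate public figures (Pythia-12B at ~12 billion parameters, trained on The Pile at ~825 GiB); the exact sizes are assumptions for illustration, not measurements:

```python
# Back-of-envelope: how many model bytes exist per byte of training data?
# Figures below are approximate public numbers, used only for illustration.
params = 12e9                  # assumed: ~12B parameters (Pythia-12B scale)
bytes_per_param = 2            # fp16 weights
model_bytes = params * bytes_per_param
train_bytes = 825 * 2**30      # assumed: ~825 GiB (The Pile)

ratio = model_bytes / train_bytes
print(f"model: {model_bytes / 2**30:.0f} GiB, "
      f"ratio: {ratio:.3f} model bytes per training byte")
```

On these assumed figures the model has well under 0.05 bytes of weights per byte of training text, so it cannot store the corpus wholesale; the extraction result above shows that some sequences are nonetheless memorized verbatim, so the two claims are not actually in conflict.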
"The output from an LLM is a derivative work of the data used to train the LLM.
If we fail to recognise this, or are unable to uphold this in law, copyright (and copyleft on which it depends) is dead. Copyright will still be used against us by corporations, but its utility to FOSS to preserve freedom is gone."