> I wouldn't be surprised if some Wikipedia editors balk at their volunteer work being actively marketed and reformatted for ease of LLM training
As someone who avidly edited Wikipedia for 6-8 years, I am happy to see my volunteer work used for LLM training. I also agree some other editors likely aren't.
Given that all Wikipedia editors have explicitly consented to their contributions being released under the Creative Commons Attribution-ShareAlike 4.0 License, they have no further say over reuse of that content for any purpose, so long as the license's attribution and share-alike conditions are honored.
Redistribution of content is an entirely different matter, and the legal status of copyrighted material in relation to LLM training is an open issue that is currently the subject of litigation.
> "it is important to note that Creative Commons licenses allow for free reproduction and reuse, so AI programs like ChatGPT might copy text from a Wikipedia article or an image from Wikimedia Commons. However, it is not clear yet whether massively copying content from these sources may result in a violation of the Creative Commons license if attribution is not granted. Overall, it is more likely than not if current precedent holds that training systems on copyrighted data will be covered by fair use in the United States, but there is significant uncertainty at time of writing."
The new Wikimedia Enterprise APIs facilitate attribution. For example, the "api.enterprise.wikimedia.com/v2/structured-contents/{name}" response [2] includes an "editor" object inside a "version" object, so attributing the Wikipedia editor who most recently edited an article seems quite feasible. ML apps could incorporate such attribution into their offerings and help satisfy the "BY" clause of the underlying CC BY-SA 4.0 license for Wikipedia content.
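To make that concrete, here is a minimal sketch of composing an attribution string from a response shaped like the one described above. The exact field names (`name`, `url`, `version`, `editor`) are assumptions based on the comment, not a verified API contract, and the sample payload is hypothetical:

```python
import json

# Hypothetical response fragment from
# api.enterprise.wikimedia.com/v2/structured-contents/{name}.
# Field names are assumed, not taken from official API documentation.
sample_response = json.loads("""
{
  "name": "Example_article",
  "url": "https://en.wikipedia.org/wiki/Example_article",
  "version": {
    "editor": {"name": "SomeEditor"}
  }
}
""")

def attribution_line(doc: dict) -> str:
    """Compose a human-readable credit aimed at the CC BY-SA "BY" clause."""
    editor = doc.get("version", {}).get("editor", {}).get("name", "unknown editor")
    return (f'"{doc["name"]}" ({doc["url"]}), '
            f'last edited by {editor}, licensed CC BY-SA 4.0')

print(attribution_line(sample_response))
```

Crediting only the most recent editor is of course a simplification; a fuller approach might link to the article's history page, which CC BY-SA attribution guidance generally accepts for massively collaborative works.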