> You absolutely can; the model is quite capable of reproducing works it was tra...

> You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output. The same model can be used across multiple software applications for different purposes. If I were to go to https://huggingface.co/deepseek-ai/DeepSeek-V3.2/tree/main (for example) and download those files, I wouldn't be able to reverse-engineer the training data without building more software.

Compare that to a search database, which needs the full text in an indexable format, directly associated with the document it came from. Although you can encrypt the database, at some point, it needs to have the text mapped to documents, which would make it much easier to reconstruct the complete original documents.

> That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The threshold of originality defines whether something can be protected by copyright. There are plenty of small snippets of code that can't be protected. But there are still questions about these small snippets that were consumed in the context of a larger, protected work, especially when there are only so many ways to express the same concept in a given language. It's definitely easier in written text than code to reason about.