Wonder how much the addition of copyrighted material affects how smart the resulting model is. If it's even 20% better, LLM makers could be forced out of the US into jurisdictions that allow the use of copyrighted data.
I suspect most LLM users will ~always choose the smartest model.
> most LLM users will ~always choose the smartest model
Most LLM users will choose the cheapest model which is good enough.
I think that LLMs' performance is already "good enough" for a lot of applications. We're in the diminishing returns part of the curve.
There are two other concerns:
1. being able to run the model on trusted infrastructure locally (so some jerk won't turn it off on a whim, and the data will remain safe and comply with the local data protection laws and policies)
2. having good tools to create AI applications (like how easy it is to fine-tune it to customer needs)
> how much the addition of copyrighted material affects how smart the resulting model is
Copyrighted material improves models not so much by making them smarter as by making them more factually correct, because they get trained on reputable, reliable, and up-to-date sources.
The jump from Llama 2 to Llama 3 had something to do with Meta downloading every textbook ever published and using it as training data.
The arguments by Meta so far in that court case are absolutely terrible, and I'm half expecting to see the world's first trillion-dollar copyright infringement award.