More training data at this point yields only marginal improvements; the curve is flattening, so the advantage is low. Especially when Anthropic clearly has the budget and talent to run the same study themselves.
On the other hand, having it leak that you train on your customers' data, ignoring the opt-out, is probably existential when close alternatives exist in the market.
You probably also thought Anthropic did not use pirated PDFs. You don't know how these companies actually operate, and you don't know what weasel language they use in their contracts to get away with exactly what I assume to be the case.
There is no AI moat; all these companies really have is the chat logs. So unless you have further evidence on what they do or don't do behind the scenes, I recommend a more conservative approach in your assumptions about what they use for training.
No, why would they care about using pirated PDFs? Did you actually read and understand what I wrote? Violating their customers' trust comes with risk for them. Violating the copyright of unrelated textbook authors does not. If that's even what they did.
They are currently paying book authors over a billion dollars in damages. You're out of your depth in this discussion, so further engagement is not going to be fruitful for anyone involved. Good luck.