Well, OpenAI raised eyebrows by crawling the internet and using everyone's data to make a commercial product.
One day some new startup will train on all of LibGen and the torrent networks, but it will be very hard to prove. You'll keep getting these jumps in questionable morality and legality, and even OpenAI will complain about playing fair.
Google Classroom: teenagers' essays, written by humans, for learning what it means to be human, and graded by humans. That's a richer dataset than anything else I can think of, and one that nobody else could get their hands on.
Yup, and they're doing it the whole country over, putting that data into Google Classroom so Bard knows "this is C-grade work" and "this is A-grade work". Knowing what's deemed good and bad writing is where I think this dataset shines for training LLMs.