Well, OpenAI raised eyebrows by crawling the internet and using everyone's data...

pas · on July 11, 2023

The Pile already contains some content from a torrent, and there's a lawsuit alleging that Meta committed copyright infringement by using it.

https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...

why_only_15 · on July 11, 2023

Many people train on libgen/torrent content in the form of Books3 (e.g. LLaMA does this).

fragmede · on July 11, 2023

Google Classroom: teenagers' essays, written by humans, for learning what it means to be human, and graded by humans. That's a richer dataset than anything else I can think of, and one that nobody else could get their hands on.

londons_explore · on July 11, 2023

An awful lot of teachers can grade a 10-page essay in about 90 seconds...

Skim-read it, mark a few grammar errors, and assign a grade based on the quality of the opening and closing paragraphs.

fragmede · on July 13, 2023

Yup, and they're doing it the whole country over, and putting that data into Google Classroom for Bard to know "this is C-grade work" and "this is A-grade work". Knowing what's deemed good and bad writing is where I think this dataset shines for training LLMs.