
Not at all.

At the very minimum, you can assume every piece of text data from before Dec 2022, and every image from before Aug 2022, to be entirely human-made. That still leaves decades of purely human digital data, and multiple centuries of distilled human data (books), to train on. A rough sketch of that kind of date-cutoff filter is below.
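A minimal sketch of the cutoff idea, assuming a hypothetical record schema with 'kind' and 'created_at' fields (the dates are just the ones mentioned above, not authoritative):

    from datetime import datetime, timezone

    # Cutoffs from the claim above: text before Dec 2022, images before Aug 2022
    TEXT_CUTOFF = datetime(2022, 12, 1, tzinfo=timezone.utc)
    IMAGE_CUTOFF = datetime(2022, 8, 1, tzinfo=timezone.utc)

    def presumed_human(record):
        # 'kind' and 'created_at' are a hypothetical schema, not any real dataset's fields
        cutoff = TEXT_CUTOFF if record["kind"] == "text" else IMAGE_CUTOFF
        return record["created_at"] < cutoff

    corpus = [
        {"kind": "text", "created_at": datetime(2019, 5, 1, tzinfo=timezone.utc)},
        {"kind": "image", "created_at": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    ]
    human_only = [r for r in corpus if presumed_human(r)]  # keeps only the 2019 text record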

And we haven't even gotten into video yet, which is another giant source of data that remains largely untapped.

Never forget: humans train on human-generated data. There's no theoretical reason why AI cannot train on AI-generated data.



Humans may train on human-generated data, but humans have many other ways of gaining knowledge about the world besides reading. This means that human-generated data may be rich with information not present in the writings or recordings of previous humans. Current LLMs are trained only on existing text for the moment (video, images, and sound coming soon), and aren't given access to raw sensory input.


To extend the lossy-compression hypothesis: human-generated text is a lossy compression of our sensory experience of reality, while LLMs are a lossy compression of that text.


Prediction: post-2022 content will be presented as vintage pre-2023 content.



