I would like to take a parallel view to the Bitter Lesson and how it's playing out. There are exceptions. It's not only computation but also a mix of:
1. decades of theoretical breakthroughs coming together.
2. collective human creativity and perseverance.
People like Yann LeCun, Geoff Hinton, etc. have been working since the 90s, and several milestones were hit along the way, but the field only caught fire/went on steroids once the application (and the associated funding) was found, thanks to creativity in the tech sector. And even if the computation had somehow been available earlier, I am not sure it would have happened so quickly.
Another example: not all methods under the AI umbrella depend on crazy amounts of computation and data. Take autoregressive models in the social/life sciences. For instance, look at Stan, which broadly does hierarchical Bayesian inference using Monte Carlo based methods, and is widely used in social science.
It took some hard theoretical advancements to move the needle on Monte Carlo simulation methods: detecting convergence, getting posterior sampling to work with non-conjugate priors, etc. The new methods are better by leaps and bounds than the conventional methods in the field, and the computation available in 2013 would be enough to run the modern models for most cases.
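To make the convergence-detection point concrete, here is a toy sketch of the classic Gelman-Rubin R-hat diagnostic applied to a minimal random-walk Metropolis sampler. This is purely illustrative: Stan itself uses far more sophisticated machinery (Hamiltonian Monte Carlo / NUTS and split-R-hat), and the function names below are my own.

```python
import math
import random
import statistics

def metropolis(log_target, n_samples, start, step=1.0, seed=0):
    """Minimal random-walk Metropolis sampler for a 1-D log-density."""
    rng = random.Random(seed)
    x, logp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)
        logp_prop = log_target(prop)
        # Accept with probability min(1, target(prop)/target(x)), in log space.
        if math.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
        samples.append(x)
    return samples

def r_hat(chains):
    """Gelman-Rubin potential scale reduction factor.

    Compares between-chain and within-chain variance; values near 1.0
    suggest the chains have mixed, values well above 1.0 signal
    non-convergence.
    """
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)

# Four chains targeting a standard normal, started from dispersed points.
std_normal = lambda x: -0.5 * x * x
chains = [metropolis(std_normal, 2000, s, seed=i)[500:]  # drop burn-in
          for i, s in enumerate([-5.0, -2.0, 2.0, 5.0])]
print(r_hat(chains))  # close to 1.0 once the chains have mixed
```

The point of the diagnostic is exactly the theoretical advance described above: it gives a principled, cheap-to-compute answer to "have my Monte Carlo chains converged?", which is what makes these methods usable in practice.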
Both your points are not really valid. There have been decades of theoretical breakthroughs in computational linguistics too (have there been any in Deep Learning?). There has also been a large amount of human creativity and perseverance in computational linguistics, arguably more than I have seen in Deep Learning. Yet not one useful algorithm has come from linguistics. In fact, the old adage from speech processing can be applied to Natural Language Processing: "Every time I fire a linguist my performance improves by a few percent."
The bitter lesson is bitter, and important to keep in mind, precisely because human creativity and perseverance do not matter in front of it. Consistently, the only methods that work are those that scale with computation; everything else does not matter. I would take an even more extreme view: if computation hadn't followed Moore's law, we wouldn't have invented alternative methods that avoid massive computation; we would simply have failed at even the most basic tasks of intelligence and still be stuck in the 1960s. A scary thought, but a true one, I reckon. Conversely, if computation had kept following Moore's law but a few stalwarts like Yann LeCun didn't exist, we would likely have found alternative architectures that scale and work. Maybe not as good as ConvNets, but then transformers aren't as good as ConvNets either; they just need to scale.
I'm not sure that the Bitter Lesson is the end of the story. The Bitter Corollary seems to be that scaling computation also requires scaling data.
Sometimes that's easy; self-play in Go, for example, can generate essentially infinite data.
On the other hand, sometimes data isn't infinite. It can seem infinite, as in the aforementioned NLP work, where computation-heavy ML systems can process more data than a human can read in a lifetime. However, our LLMs are already within an order of magnitude of reading every bit of human writing ever produced, and we're scaling our way toward that data limit.
"Clever" human algorithms are all a way of doing more with less. People are still more data-efficient learners than large ML systems, and I'm less sure that we'll be able to compute our way to that kind of efficiency.
I think Geoffrey Hinton addresses this point well in his recent podcast with Pieter Abbeel. He says, and I paraphrase: current Deep Learning methods are great at learning from large amounts of data with a relatively small amount of compute. The human brain, on the other hand, with around 150 trillion synapses/parameters, has the opposite problem: parameters/compute are cheap but data is expensive. It needs to learn a large amount from very little data, and a large amount of regularization (things like dropout) will likely be required to do this without over-fitting. I think we will have a real shot at AGI once 100-trillion-parameter models become feasible, which might happen within this decade.
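Since dropout is the concrete regularizer named above, here is a minimal sketch of the standard "inverted dropout" trick: during training, each unit is zeroed with probability p_drop and the survivors are rescaled so the expected activation is unchanged; at inference nothing is dropped. The function below is my own illustrative version, not code from any particular framework.

```python
import random

def dropout(activations, p_drop, rng=None, train=True):
    """Inverted dropout on a list of activations.

    During training, each unit is kept with probability (1 - p_drop)
    and scaled by 1/(1 - p_drop), so the expected value of each unit
    is the same as without dropout. At inference time (train=False)
    the input passes through unchanged.
    """
    if rng is None:
        rng = random.Random()
    if not train or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng))
```

The rescaling is the key design choice: by keeping the expected activation constant, the network sees the same statistics at train and test time, so no correction is needed at inference.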