I would like to take a parallel view to the Bitter Lesson and how it's playing out. There are exceptions. It's not only computation but also a mix of:
1. decades of theoretical breakthroughs coming together.
2. collective human creativity and perseverance.
People like Yann LeCun, Geoff Hinton, etc. have been working since the 90s, and several milestones were hit along the way, but the field only caught fire/went on steroids once the application (and the associated funding) was found, thanks to creativity in the tech sector. And even if the computation had somehow been available earlier, I am not sure it would have happened so quickly.
Another example: not all methods under the AI umbrella depend on crazy amounts of computation and data. Take autoregressive models in the social/life sciences. For instance, look at Stan, which broadly does hierarchical Bayesian inference using Monte Carlo based methods, and is widely used in social science.
It took some hard theoretical advancements to move the needle on Monte Carlo simulation methods: detecting convergence, getting posterior sampling to work with non-conjugate priors, etc. The new methods are better by leaps and bounds than the conventional methods in the field, and the computation available in 2013 would be enough to run the modern models for most cases.
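To make the convergence-detection point concrete, here is a toy sketch of the classic Gelman-Rubin R-hat diagnostic applied to a minimal random-walk Metropolis sampler. This is purely illustrative: Stan itself uses far more sophisticated machinery (Hamiltonian Monte Carlo / NUTS and split-R-hat), and the function names below are my own.

```python
import math
import random
import statistics

def metropolis(log_target, n_samples, start, step=1.0, seed=0):
    """Minimal random-walk Metropolis sampler for a 1-D log-density."""
    rng = random.Random(seed)
    x, logp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)
        logp_prop = log_target(prop)
        # Accept with probability min(1, target(prop)/target(x)), in log space.
        if math.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
        samples.append(x)
    return samples

def r_hat(chains):
    """Gelman-Rubin potential scale reduction factor.

    Compares between-chain and within-chain variance; values near 1.0
    suggest the chains have mixed, values well above 1.0 signal
    non-convergence.
    """
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)

# Four chains targeting a standard normal, started from dispersed points.
std_normal = lambda x: -0.5 * x * x
chains = [metropolis(std_normal, 2000, s, seed=i)[500:]  # drop burn-in
          for i, s in enumerate([-5.0, -2.0, 2.0, 5.0])]
print(r_hat(chains))  # close to 1.0 once the chains have mixed
```

The point of the diagnostic is exactly the theoretical advance described above: it gives a principled, cheap-to-compute answer to "have my Monte Carlo chains converged?", which is what makes these methods usable in practice.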
Both your points are not really valid. There have been decades of theoretical breakthroughs in computational linguistics too (have there been any in Deep Learning?). There has also been a large amount of human creativity and perseverance in computational linguistics, arguably more than I have seen in Deep Learning. Yet not one useful algorithm has come from linguistics. In fact, the old adage from speech processing can be applied to Natural Language Processing: "Every time I fire a linguist my performance improves by a few percent."
The bitter lesson is bitter, and important to keep in mind, precisely because human creativity and perseverance do not matter in front of it. Consistently, the only methods that work are those that scale with computation; everything else does not matter. I would take an even more extreme view: if computation hadn't followed Moore's law, we wouldn't have invented alternative methods that avoid massive computation; we would simply have failed at even the most basic tasks of intelligence and still be stuck in the 1960s. A scary thought, but a true one, I reckon. Conversely, if computation had kept following Moore's law but a few stalwarts like Yann LeCun didn't exist, we would likely have found alternative architectures that scale and work. Maybe not as good as ConvNets, but then transformers aren't as good as ConvNets either; they just need to scale.
I'm not sure that the Bitter Lesson is the end of the story. The Bitter Corollary seems to be that scaling computation also requires scaling data.
Sometimes that's easy; self-play in Go, for example, can generate essentially infinite data.
On the other hand, sometimes data isn't infinite. It can seem infinite, as in the aforementioned NLP work, where computation-heavy ML systems can process more data than a human can read in a lifetime. However, our LLMs are already within an order of magnitude of reading every bit of human writing ever produced, and we're scaling our way toward that data limit.
"Clever" human algorithms are all a way of doing more with less. People are still more data-efficient learners than large ML systems, and I'm less sure that we'll be able to compute our way to that kind of efficiency.
I think Geoffrey Hinton addresses this point well in his recent podcast with Pieter Abbeel. He says, and I paraphrase: current Deep Learning methods are great at learning from large amounts of data with a relatively small amount of compute. The human brain, on the other hand, with around 150 trillion synapses/parameters, has the opposite problem: parameters/compute are cheap but data is expensive. It needs to learn a large amount from very little data, and a large amount of regularization (things like dropout) will likely be required to do this without over-fitting. I think we will have a real shot at AGI once 100-trillion-parameter models become feasible, which might happen within this decade.
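Since dropout is the concrete regularizer named above, here is a minimal sketch of the standard "inverted dropout" trick: during training, each unit is zeroed with probability p_drop and the survivors are rescaled so the expected activation is unchanged; at inference nothing is dropped. The function below is my own illustrative version, not code from any particular framework.

```python
import random

def dropout(activations, p_drop, rng=None, train=True):
    """Inverted dropout on a list of activations.

    During training, each unit is kept with probability (1 - p_drop)
    and scaled by 1/(1 - p_drop), so the expected value of each unit
    is the same as without dropout. At inference time (train=False)
    the input passes through unchanged.
    """
    if rng is None:
        rng = random.Random()
    if not train or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng))
```

The rescaling is the key design choice: by keeping the expected activation constant, the network sees the same statistics at train and test time, so no correction is needed at inference.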