I think it’s evidence that timing is everything. In the 2010s, deep learning was still figuring out how to leverage GPUs. The scale of compute required for everything after GPT-2 would have been nearly impossible in 2017/2018; our courses at Udacity used a few hours of time on K80 GPUs. By 2020 it had become possible to throw unbelievable amounts of compute at these models to test the scale hypothesis. The rise of LLMs is stark proof of the Bitter Lesson, because it is at least as much a story of GPU advancement as of algorithms.
