If that was the case we would be finding globally optimal solutions for complica...

blt · on May 8, 2024

That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the #parameters exceeds the #data, SGD reliably finds parameters that interpolate the training data with 0 loss. Surprisingly, most of these generalize well and overfitting is not a big problem.

In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.
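A minimal sketch of the interpolation claim, under loud assumptions: this is a toy linear model, not a deep network, and the sizes, learning rate, and step count are made up for illustration. It fits pure-noise labels (in the spirit of the random-label experiments in [1]) with plain per-sample SGD, using more parameters than data points, and the training loss goes to essentially zero. It says nothing about generalization.

```python
import random

random.seed(0)

# Assumed toy sizes: more parameters than data points (overparameterized).
n_samples, n_params = 5, 20
X = [[random.gauss(0, 1) for _ in range(n_params)] for _ in range(n_samples)]
y = [random.gauss(0, 1) for _ in range(n_samples)]  # labels are pure noise


def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w):
    """Mean squared error over the whole training set."""
    return sum((predict(w, x) - t) ** 2 for x, t in zip(X, y)) / n_samples


w = [0.0] * n_params
lr = 0.01  # assumed step size, small enough to be stable for these features
initial_loss = mse(w)

for _ in range(20000):
    i = random.randrange(n_samples)      # the "S" in SGD: one random sample per step
    err = predict(w, X[i]) - y[i]        # residual on that sample
    w = [wi - lr * err * xi for wi, xi in zip(w, X[i])]  # gradient step on 0.5 * err**2

final_loss = mse(w)
print(initial_loss, final_loss)
```

With 20 unknowns and only 5 equations there are infinitely many weight vectors that fit the data exactly, which is the linear analogue of "many global optima".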

l33tman · on May 8, 2024

Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that try to argue one way or another why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minima (with bad generalization) and ended up in a "good" local minimum (that generalized well). Their interpretation was that, due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).

Another paper showed that networks trained on roughly the same dataset, but from different random initializations, were related in the loss landscape by a permutation of all the weights. You could always find a permutation that then allowed you to linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting, even if it wasn't directly related to generalization performance. It has potential applications in network merging, I guess.
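To make the permutation symmetry concrete, here is a small sketch, assuming a hypothetical one-hidden-layer tanh network with made-up sizes. Reordering the hidden units (rows of the first layer, plus the matching biases and output weights) gives a different point in weight space that computes the same function. It only shows the symmetry itself; it does not reproduce the paper's result about finding a permutation that lets two independently trained networks be linearly interpolated.

```python
import math
import random

random.seed(0)

# Assumed toy sizes for a hypothetical one-hidden-layer network.
n_in, n_hidden = 3, 4

W1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [random.gauss(0, 1) for _ in range(n_hidden)]
w2 = [random.gauss(0, 1) for _ in range(n_hidden)]


def forward(W1, b1, w2, x):
    """Scalar output of a tanh MLP with a single hidden layer."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(v * h for v, h in zip(w2, hidden))


# Reorder the hidden units: the same permutation is applied to the rows of W1,
# the entries of b1, and the entries of w2.
perm = [2, 0, 3, 1]
W1_p = [W1[j] for j in perm]
b1_p = [b1[j] for j in perm]
w2_p = [w2[j] for j in perm]

x = [0.5, -1.0, 2.0]
out_original = forward(W1, b1, w2, x)
out_permuted = forward(W1_p, b1_p, w2_p, x)

# The two outputs agree up to floating-point rounding.
print(abs(out_original - out_permuted))
```

The weights differ but the loss is identical, so every minimum comes with many permuted copies of itself.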

patrick451 · on May 8, 2024

I have read this in several places and want to learn more. Do you have a reference handy?

blt · on May 8, 2024

[1] was an empirical paper that inspired much theoretical follow-up.

[2] is one such follow-up, and the references therein should point to many of the other key works in the years between.

[3] introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on NTK is the right way towards long-term theoretical progress.)

[4] is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.

---

[1] Understanding deep learning requires rethinking generalization. Zhang et al., arXiv, 2016. https://arxiv.org/abs/1611.03530

[2] Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization. Azizan et al., arXiv, 2019. https://arxiv.org/abs/1906.03830

[3] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al., NeurIPS, 2018. https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62...

[4] A Universal Law of Robustness via Isoperimetry. Bubeck et al., NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/f197002b9a085...

patrick451 · on May 15, 2024

Awesome, thank you!

nephanth · on May 8, 2024

This is something I saw a talk about a while ago. There are probably more recent papers on this topic; you might want to browse the citations of this one:

https://arxiv.org/abs/2003.00307