If that was the case we would be finding globally optimal solutions for complicated non-convex optimization problems.
The reality is different, you need to really explore the space to find the truly global optimal solution.
A better explanation is that for ml you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit an unseen data set.
That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the #parameters exceeds the #data, SGD reliably finds parameters that interpolate the training data with 0 loss. Surprisingly, most of these generalize well and overfitting is not a big problem.
In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.
Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that tries to argue one way or another why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minimas (with bad generalization) and ended up in a "good" local minima (that generalized well), and their interpretation was that due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).
Another paper showed that networks trained on roughly the same dataset but initialized from different random initializations, had a symmetry relation in the loss landscape by a permutation of all the weights. You could always find a permutation that allowed you to then linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting even if it wasn't directly related to generalization performance. It has potential applications in network merging I guess.
[1] Was an empirical paper that inspired much theoretical follow-up.
[2] Is one such follow-up, and the references therein should point to many of the other key works in the years between.
[3] Introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on NTK is the right way towards long-term theoretical progress.)
[4] Is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.
This is something I saw a talk about a while ago. There are probably more recent papers on this topic, you might want to look browse the citations of this one
The reality is different, you need to really explore the space to find the truly global optimal solution.
A better explanation is that for ml you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit an unseen data set.