Does gradient descent really do well for deep learning when the gradient is computed with respect to the whole dataset? I assumed that the noise in SGD played an important role for escaping local minima.
There aren't really local minima in most deep networks. When you get into millions/billions of parameters, there will essentially always be some directions that point downwards. You have to get really really close to the true minimum for there to be no direction to go that improves the loss.
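A toy way to see the intuition (this is an illustration with a random symmetric matrix, not a claim about any real loss landscape): model the Hessian at a random critical point as a large random symmetric matrix and check how many directions point downhill.

```python
import numpy as np

# Illustrative sketch: in high dimensions, a random symmetric "Hessian"
# almost surely has negative eigenvalues, i.e. directions in which the
# loss still decreases, so a random critical point is a saddle, not a
# local minimum.
rng = np.random.default_rng(0)
n = 500
M = rng.standard_normal((n, n))
H = (M + M.T) / 2                  # random symmetric matrix
eigs = np.linalg.eigvalsh(H)
frac_negative = np.mean(eigs < 0)  # fraction of downhill directions
print(frac_negative)               # close to one half for large n
```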
Incidentally, this same phenomenon is IMO how evolution is able to build things like the eye. Naively you'd think that since you need so many parts arranged so well, it's impossible to find a step-by-step path where fitness goes up at every step, i.e. if you just have a retina with no lens or vice versa, it doesn't work. However, due to the high dimensionality of DNA, there is essentially guaranteed to be a path with monotonically increasing fitness, just because there are so many different possible paths from A to B in the high-dimensional space that at least one is bound to work.
Now this isn't strictly true for every high-dimensional system. You need a degree of symmetry or redundancy in the encoding for it to work. For example, in convolutional neural networks you see this phenomenon where some filters get "stuck" in training, and those are local minima for that subspace. What happens, though, is that if one filter gets stuck, the network will just use another one that had a better initialization. This is also why things like pruning and lottery tickets work. Residual connections enhance this effect: you can even be stuck in a whole layer and the training process can just bypass it.
You see the same thing with life, where you could put a sequence for the same protein in different parts of the genome and it could still be produced, regardless of the position. There are also many different ways to encode the exact same protein, and many different possible proteins that will have the same shape in the critical areas. Life finds a way.
If that were the case, we would be finding globally optimal solutions to complicated non-convex optimization problems.
The reality is different: you need to really explore the space to find the truly globally optimal solution.
A better explanation is that for ml you don't want a globally optimal solution that overindexes on your training set. You want a suboptimal solution that might also fit an unseen data set.
That is what people thought until around 2018, but it was wrong. It turns out that deep learning optimization problems have many global optima. In fact, when the number of parameters exceeds the number of data points, SGD reliably finds parameters that interpolate the training data with zero loss. Surprisingly, most of these solutions generalize well, and overfitting is not a big problem.
In other words, deep learning is a very special nonconvex optimization problem. A lot of our old intuition about optimization for ML is invalid in the overparameterized regime.
Why DL generalizes well is still an open research problem AFAIK. I've read numerous papers that try to argue one way or another about why this works, and they are all interesting! One paper (that I found compelling, even though it didn't propose a thorough solution) showed that SGD successfully navigated around "bad" local minima (with bad generalization) and ended up in a "good" local minimum (that generalized well). Their interpretation was that, due to the S in SGD, it will only find wide loss basins, and thus the conclusion was that solutions that generalize well seem to have wider basins (for some reason).
Another paper showed that networks trained on roughly the same dataset, but starting from different random initializations, ended up at weights related by a permutation symmetry of the loss landscape. You could always find a permutation of the weights that allowed you to linearly interpolate between the two weight sets without climbing over a loss mountain. Also very interesting, even if it wasn't directly related to generalization performance. It has potential applications in network merging, I guess.
[1] Was an empirical paper that inspired much theoretical follow-up.
[2] Is one such follow-up, and the references therein should point to many of the other key works in the years between.
[3] Introduces the neural tangent kernel (NTK), a theoretical tool used in much of this work. (Not everyone agrees that reliance on NTK is the right way towards long-term theoretical progress.)
[4] Is a more recent paper I haven't read yet that goes into more detail on interpolation. Its authors were well known in more "clean" parts of ML theory (e.g. bandits) and recently began studying deep learning.
This is something I saw a talk about a while ago. There are probably more recent papers on this topic; you might want to browse the citations of this one.
> There aren't really local minima in most deep networks.
How so?
If there were no local minima other than the global one, there would be convex optimization methods far faster than SGD or Adam. The only reason these methods exist is that training deep networks is a non-convex optimization problem.
There are many nonconvex functions where every local minimum is global, and even many nonconvex functions with a unique local minimum (which is de facto global). Convex methods can fail on those.
The reason GD and friends are a good choice in deep learning is that computing the gradient is cheap (and approximating it even more so). Every descent method relies on solving a subproblem of sorts, typically projecting the current iterate onto a sublevel set of an approximation of the function, for some definition of projection. With GD, it's as cheap as it gets: just subtract a scaled-down version of the gradient. Subproblems in other algorithms are a lot more expensive computationally, particularly in high dimensions. So more efficient as in requiring fewer function evaluations, yes, but at the cost of doing a lot more work at each step.
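The "cheap subproblem" point can be sketched in a few lines (the problem instance here is made up purely for illustration):

```python
import numpy as np

# Minimal gradient descent sketch: minimize f(x) = 0.5 x^T A x - b^T x,
# whose gradient is A x - b. Each iteration's entire "subproblem" is one
# matrix-vector product plus a scaled subtraction.
rng = np.random.default_rng(1)
n = 50
Q = rng.standard_normal((n, n))
A = Q.T @ Q + np.eye(n)           # symmetric positive definite
b = rng.standard_normal(n)

x = np.zeros(n)
lr = 1.0 / np.linalg.norm(A, 2)   # step size 1/L for stability
for _ in range(2000):
    grad = A @ x - b              # the only per-step work: one matvec
    x -= lr * grad                # the whole "projection" step

residual = np.linalg.norm(A @ x - b)
print(residual)                   # small: x is close to the minimizer
```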
To me, the clearest way to think about signed vs. unsigned integers is that different representatives of the integers modulo n are chosen.
For example, for 8-bit signed integers we choose the representatives -128, -127, ..., 127 of the residue classes -128 + 256Z, -127 + 256Z, ..., 127 + 256Z in the ring of integers modulo 256.
For 8-bit unsigned integers, we instead choose the representatives 0, 1, ..., 255.
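The "choice of representatives" view can be made concrete with two small helpers (the function names are mine, purely illustrative):

```python
# Both signed and unsigned 8-bit integers are residues mod 256; they
# just pick different representatives of each residue class.

def as_unsigned8(x):
    """Representative in 0..255 of the class x + 256Z."""
    return x % 256

def as_signed8(x):
    """Representative in -128..127 of the class x + 256Z."""
    r = x % 256
    return r - 256 if r >= 128 else r

print(as_unsigned8(200))  # 200
print(as_signed8(200))    # -56: same residue class, different representative
print(as_signed8(255))    # -1
print(as_unsigned8(-1))   # 255
```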
Mathematically, I do not see how anything is "breaking" as the article claims.
From page 6, when they describe their synthetic textbook dataset:
"Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is singular or nonsingular using the determinant function. [...]"
No. The determinant is not a suitable way to do that. A proper way to numerically measure singularity would be to compute the condition number of the matrix (the ratio of its largest to smallest singular value).
While you're correct that condition number is a more robust numerical method for arbitrary matrices, the determinant is certainly suitable for many matrices. For small matrices with small integer values such as this one, there is no issue with the determinant.
There is no one bulletproof general method to approximate mathematical calculations with floating point numbers. More context is generally required, including the actual problem that is being approximated, to determine if a method is reliable. Painting this as a black and white situation where the determinant is wrong and the condition number is right gives a misleading picture of how we evaluate numerical methods for fit-to-purpose.
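For the concrete matrix from the textbook example, the two diagnostics happen to agree (a quick check, illustrative only):

```python
import numpy as np

# The textbook matrix: exactly singular, since the second row is twice
# the first. Here determinant and condition number tell the same story;
# they diverge for badly scaled matrices.
A = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.linalg.det(A))   # essentially zero
print(np.linalg.cond(A))  # astronomically large (or infinite)
```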
Am I misreading this, or is it really about a 2x2 matrix? (In which case computing the determinant involves just two multiplications and one subtraction.)
They define a generic "is_singular" function and test it with a 2x2 matrix.
The problem with the determinant is not about performance. It is just useless for determining if a matrix is singular. The thing that gives it away is that the determinant is influenced by a rescaling of the matrix:
det(sA) = s^n det(A), where A is an n x n matrix.
As an example, would you say that
[[1e-10, 0], [0, 1e-10]]
is singular? It has condition number 1.
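That example is easy to verify numerically (a quick sketch):

```python
import numpy as np

# Scaled identity: its determinant is tiny, but it is perfectly
# well-conditioned, so "det close to 0" is a bad singularity test
# in floating point.
A = 1e-10 * np.eye(2)
print(np.linalg.det(A))   # about 1e-20
print(np.linalg.cond(A))  # 1.0
```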
It does work in theory and for integer matrices. Such an algorithm might even be used in practice when numerical stability is not a concern. The condition number can certainly be defined somewhere in another textbook; you just need more context before telling it to use another method.
Edit: Sorry, I completely messed up my original answer here. A better version:
Let's say we are in a setting where we only work with integers. A matrix is invertible iff its determinant is invertible in the underlying ring. The only invertible elements in Z are -1 and 1.
So, the code is also incorrect in the integer setting. Here, we should not check for 0, but for -1 or 1.
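A quick sketch of that integer-setting check (the helper names are mine):

```python
# Over the integers Z, a matrix is invertible iff its determinant is a
# unit in Z, i.e. equal to -1 or 1. Checking "det != 0" is not enough.

def det2(A):
    """Exact determinant of a 2x2 integer matrix."""
    (a, b), (c, d) = A
    return a * d - b * c

def invertible_over_Z(A):
    return det2(A) in (-1, 1)

print(invertible_over_Z([[1, 2], [2, 4]]))  # False: det is 0
print(invertible_over_Z([[2, 3], [3, 4]]))  # True: det is -1
print(invertible_over_Z([[2, 0], [0, 2]]))  # False: det 4 is a unit in Q, not in Z
```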
If I'm reading you correctly, you'd also need to flip the response to the check: where the original test for 0 determines that a matrix is singular if it does find 0, the new test for ±1 should determine that a matrix is singular if it does not find ±1?
Saying that the determinant is useless for determining whether a matrix is singular is incorrect, or misleading at best.
This is purely a matter of numerical stability. Of course [[1e-10, 0], [0, 1e-10]] is nonsingular, and its determinant is 1e-20, which does not equal zero.
Yes, when it comes to floating point issues we might want to use something else, and that's a valid complaint when it comes to NumPy code, but from a theoretical perspective the determinant is an excellent tool to determine singularity.
Of course. Theoretically, the determinant answers the binary question "singular" or "nonsingular".
Numerically, such a binary answer is pretty useless. Here, we need a measure of how singular/nonsingular a matrix is relative to the numerical precision we are working with.
GPT-4's explanation of its optimization does not make sense to me. It writes
"Instead of moving it to P, we can directly use S in the following comparisons, saving one instruction." but then proceeds to use P as if that mov had happened.
AlphaDev's optimization relies on the fact that B and C are already in the correct order. This precondition is missing from the prompt given to GPT-4. It seems that GPT-4 is hallucinating something that only resembles the correct optimization at first glance.
And that's not enough. Think of the swap sequence t=a,a=b,b=t: all three moves are necessary.
While removing moves is a common optimization, it can't be done willy nilly.
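The three-move swap is easy to simulate (registers modeled as a dict so Python's tuple tricks don't hide the moves; the setup is mine):

```python
# Register-style swap: t = a, a = b, b = t. Dropping any one of the
# three moves breaks the swap, so none can be eliminated without an
# extra precondition.

def swap_with_temp(a, b):
    regs = {"a": a, "b": b, "t": None}
    regs["t"] = regs["a"]   # t = a
    regs["a"] = regs["b"]   # a = b
    regs["b"] = regs["t"]   # b = t
    return regs["a"], regs["b"]

def swap_missing_last_move(a, b):
    regs = {"a": a, "b": b, "t": None}
    regs["t"] = regs["a"]   # t = a
    regs["a"] = regs["b"]   # a = b; b is never updated
    return regs["a"], regs["b"]

print(swap_with_temp(1, 2))          # (2, 1)
print(swap_missing_last_move(1, 2))  # (2, 2): not a swap
```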
As mentioned in the comment you replied to, the algorithm is wrong without the extra precondition B<C that is not included in the prompt, as should be clear from the fact that the largest value is computed as max(A,C) rather than max(A,B,C). This is the same precondition that allows removing the move, effectively computing the smallest value as min(A,B) instead of min(A,B,C).
So the only correct answer to the prompt is to find a counterexample which does not sort correctly, for example 3 7 5. Not to remove an instruction randomly.
There was already a paper on Transformer models and their inability to redefine variables. They had real problems with swapping variables and would forget that the swap had happened.