My work might make me biased but I think there’s a lot of room to make more general purpose code differentiable.
I work on a ton of stuff that would very much benefit from being differentiable, but also very much can’t fit at all into the “stack a ton of linalg-like layers”.
It’s a huge engineering effort even to think about how I might start taking derivatives. It’s possible, but there’s so much overhead to doing anything with it.
Julia has metaprogramming as a fundamental principle of the language. This makes for a very concise and powerful system that, together with a well-written AD framework like Zygote, makes every expression differentiable, meaning that effectively the entire language is differentiable.
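Zygote does source-to-source reverse-mode AD on Julia code; as a rough, dependency-free illustration of the same principle in Python, forward-mode dual numbers can differentiate ordinary code with branches and loops, not just stacked linalg layers (a sketch of the idea, not how Zygote itself works):

```python
# Minimal forward-mode AD with dual numbers: a sketch of how an AD
# system can differentiate ordinary code (branches, loops), not just
# stacked linear-algebra layers. (Zygote itself is reverse-mode and
# source-to-source; this only illustrates the principle.)
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val      # primal value
        self.dot = dot      # derivative w.r.t. the chosen input

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

    def __lt__(self, o):
        return self.val < (o.val if isinstance(o, Dual) else o)

def derivative(f, x):
    # Seed the input's derivative with 1.0 and read off the output's.
    return f(Dual(x, 1.0)).dot

# Plain Python with a loop and a branch -- still differentiable:
def g(x):
    acc = x
    for _ in range(3):
        acc = acc * x      # builds x**4
    if x < 0:
        acc = acc * -1.0
    return acc

print(derivative(g, 2.0))  # d/dx x**4 at x=2 -> 32.0
```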
And that isn't even the coolest thing about working in Julia: just wait til you see what people can squeeze out of macros.
So why hasn't it taken over the ML world already? Or has it? Or are there too many ML "researchers" who haven't bothered to improve their own tooling and are trapped in Anaconda?
The Julia community is small and has no large commercial backers. Projects such as TF/PyTorch require community support and a lot of investment which Julia just doesn't have. In fact, Julia isn't even trying at the moment to "compete" with TF/PyTorch [1, 2].
I've worked at 2 companies that would have liked to use Julia but it wasn't (and still isn't) product ready for anything involving high reliability or robustness.
Pytorch can be used in a very general purpose way. It's essentially numpy + automatic differentiation + GPU support. All the 'linalg-like layers' are entirely optional. If you write y = A*x+b in pytorch, that works, and is differentiable.
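In actual PyTorch that would be tensors with requires_grad=True followed by y.backward(); as a dependency-free sketch of the mechanics behind it, here is a tiny scalar reverse-mode tape for y = a*x + b (illustrative only, not PyTorch's real implementation):

```python
# Dependency-free sketch of what autodiff does for y = a*x + b:
# record the computation on a tape, then backpropagate. (PyTorch's
# real API is torch.Tensor with requires_grad=True and y.backward();
# this tiny scalar version only illustrates the mechanics.)
class Var:
    def __init__(self, val):
        self.val, self.grad, self._back = val, 0.0, []

    def __mul__(self, o):
        out = Var(self.val * o.val)
        out._back = [(self, o.val), (o, self.val)]  # local derivatives
        return out

    def __add__(self, o):
        out = Var(self.val + o.val)
        out._back = [(self, 1.0), (o, 1.0)]
        return out

    def backward(self, seed=1.0):
        self.grad += seed
        for var, local in self._back:
            var.backward(seed * local)  # chain rule, outputs-to-inputs

a, x, b = Var(3.0), Var(2.0), Var(1.0)
y = a * x + b          # y.val == 7.0
y.backward()
print(a.grad, x.grad, b.grad)  # 2.0 3.0 1.0
```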
Non-smooth functions (e.g. abs(x)) can be handled with bundle methods, but how would one make inherently discontinuous (non-convex) functions differentiable? (e.g. if x then 1 else 5)
Discrete problems are inherently non-differentiable. There are approaches like complementarity methods and switching functions (e.g. tanh), but they usually end up with numerical issues.
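To make the numerical issue concrete, here is a sketch of the switching-function trick applied to the earlier example, reading `if x then 1 else 5` as a threshold on a real-valued x (the sigmoid surrogate and the steepness parameter k are illustrative choices, not a standard recipe):

```python
import math

# Replace the discontinuous
#   f(x) = 1 if x > 0 else 5
# with a smooth sigmoid surrogate of steepness k. The surrogate is
# differentiable everywhere, but as k grows the gradient collapses
# to ~0 away from the switch point -- the numerical issue in practice.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smooth_f(x, k):
    return 5.0 - 4.0 * sigmoid(k * x)   # ~5 for x<<0, ~1 for x>>0

def grad_f(x, k):
    # d/dx [5 - 4*sigmoid(k*x)] = -4*k*sigmoid'(k*x)
    s = sigmoid(k * x)
    return -4.0 * k * s * (1.0 - s)

print(smooth_f(-2.0, 10.0))   # close to 5
print(smooth_f(+2.0, 10.0))   # close to 1
print(grad_f(1.0, 10.0))      # tiny: the gradient has nearly vanished
print(grad_f(1.0, 100.0))     # effectively zero at higher steepness
```

A sharper switch (larger k) better approximates the original function, but it also flattens the gradient almost everywhere, so a gradient-based optimizer gets no signal away from the jump.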
This already happens to an extent in existing ML pipelines. The ReLU activation function is discontinuous in its derivative, yet it is one of the most widely used functions in neural networks. Its derivative looks like this:
if (i < 0) return 0;
else return 1;
Now ReLU is continuous itself (as well as being monotonic) so it still cooperates relatively well with gradient descent algorithms. I think this is where the problem lies - not with differentiability itself, but with gradient descent not working due to the highly non-convex search space that such general programming constructs will produce.
ReLU is not discontinuous; it is nonsmooth but continuous, so its derivative exists everywhere except at the hinge point (x = 0).
Inherently discontinuous functions OTOH are disconnected and nonconvex. Gradient descent can still be used, but you first have to add a step that partitions the discrete space, as in branch and bound. That involves solving the continuous relaxation to obtain a bound. This does not require differentiability (the original function is not differentiable), but the price to pay is that the search is combinatorial (NP-hard).
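As a concrete (toy) instance of the partition-and-bound idea described above, here is a small branch-and-bound for a 0/1 knapsack, where the continuous (fractional) relaxation supplies the pruning bound; the problem instance is just an illustration:

```python
# Toy branch-and-bound for 0/1 knapsack: the discrete space is
# partitioned by branching on each item, and the continuous
# (fractional) relaxation gives an upper bound used to prune.
# No derivatives needed, but the worst case is exponential.
def knapsack_bnb(values, weights, capacity):
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i] / weights[i],
                   reverse=True)  # best value density first

    def relax_bound(idx, cap, acc):
        # Continuous relaxation: allow a fractional piece of one item.
        for i in order[idx:]:
            if weights[i] <= cap:
                cap -= weights[i]
                acc += values[i]
            else:
                return acc + values[i] * cap / weights[i]
        return acc

    best = 0.0
    def branch(idx, cap, acc):
        nonlocal best
        if idx == n:
            best = max(best, acc)
            return
        if relax_bound(idx, cap, acc) <= best:
            return  # prune: relaxation proves no improvement here
        i = order[idx]
        if weights[i] <= cap:                     # branch: take item i
            branch(idx + 1, cap - weights[i], acc + values[i])
        branch(idx + 1, cap, acc)                 # branch: skip item i

    branch(0, capacity, 0.0)
    return best

print(knapsack_bnb([60, 100, 120], [10, 20, 30], 50))  # 220.0
```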
The OP was talking about general differentiability but inherently discontinuous functions form a large and important class of functions (from software programming) that are not differentiable.
> Now ReLU is continuous itself (as well as being monotonic) so it still cooperates relatively well with gradient descent algorithms.
A function with a discontinuous derivative cannot cooperate with gradient descent algorithms. That's why you have the famous problem of "dead neurons".
Imagine an alternative ReLU with a narrow curved section smoothing out the discontinuity. Now it has a continuous derivative, but that gradient is still zero for values < 0. This flat region is the cause of dead neurons: backprop multiplies the propagated error by the gradient to update the weight, and if the gradient is 0, the result of the multiplication is 0, so the neuron's weights never get adjusted.
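The dead-neuron mechanism can be sketched in a few lines: a single ReLU "neuron" whose pre-activation is negative for every input gets a zero gradient through the flat region, so every update is zero (the weights, inputs, and unit upstream error here are arbitrary illustrative choices):

```python
# Sketch of the dead-neuron effect: a single ReLU "neuron" whose
# pre-activation w*x + b is negative for every training input. The
# local gradient is 0 on that side, so every weight update is 0 and
# the neuron never recovers.
def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

w, b = -2.0, -1.0          # weights that keep the neuron in the flat region
inputs = [0.5, 1.0, 2.0]   # all positive, so w*x + b is always < 0
lr = 0.1

for x in inputs:
    z = w * x + b
    upstream = 1.0                      # pretend dL/d(relu(z)) = 1
    dw = upstream * relu_grad(z) * x    # chain rule through ReLU
    db = upstream * relu_grad(z)
    w -= lr * dw                        # update is 0 -- neuron is dead
    b -= lr * db

print(w, b)  # unchanged: -2.0 -1.0
```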
And then people go to empirical data and apply the Great Smoothing: by throwing ML/DL methods at the data (including results produced by discontinuous behavior), continuity is often implicitly assumed.