Hacker News

My understanding of automatic differentiation (AD) is that it's only really possible at the compiler level, since you need the ability to interpret and manipulate function definitions themselves. Certainly, no library would be able to offer the same level of guarantees telling you if you've done it wrong, nor the same opportunities for optimisation.


It's certainly not the case that autodiff is only possible at the compiler level. I've implemented forward-mode (via dual numbers) and reverse-mode (via tapes / Wengert lists) autodiff in libraries before.
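For illustration, reverse mode really can live entirely in a library. Here is a minimal sketch of a tape-based implementation in Python (the names `Var`, `tape`, and `vsin` are hypothetical, not from any particular library): each operation records its inputs and local partial derivatives on a tape, and a backward pass walks the tape in reverse, accumulating adjoints.

```python
import math

# Global tape: list of (output, [(input, local_partial), ...]) records.
tape = []

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append((out, [(self, other.value), (other, self.value)]))
        return out

def vsin(a):
    out = Var(math.sin(a.value))
    tape.append((out, [(a, math.cos(a.value))]))
    return out

def backward(out):
    # Walk the tape in reverse, pushing the adjoint of each output
    # back onto its inputs via the recorded local partials.
    out.grad = 1.0
    for node, parents in reversed(tape):
        for parent, partial in parents:
            parent.grad += node.grad * partial

x = Var(0.5)
y = x * x + vsin(x)   # y = x^2 + sin(x)
backward(y)
print(x.grad)         # dy/dx = 2x + cos(x) = 1.8775825618903728
```

No compiler support is involved; operator overloading builds the tape as the program runs.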


Notice the qualifier "really". Obviously you can implement autodiff outside the compiler, since PyTorch and TensorFlow exist, but those implementations constrain you to a select few compositions (please, no comments on Turing completeness with just loops and conditionals). So, for example, if statements in PyTorch are not differentiable (they might have piecewise continuous derivatives) because PyTorch doesn't actually trace the AST. I'm not a languages expert, but outside of implementing it in the compiler I imagine you'd need a homoiconic language to implement it as a library.


If statements aren't really meaningfully differentiable, regardless of how you do it.

Take

    if x == 59:
        return 1000
    elif x > 59:
        return -x
    else:
        return x
How do you optimize x to maximize this, regardless of what language you're in?

It's true that you can get a derivative, but the derivative is essentially meaningless.


  if x <= 0:
      return 0
  else:
      return x
Is in the core of most neural networks today (relu activation), so it is definitely useful.
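For what it's worth, this is exactly how a forward-mode implementation handles ReLU: branch on the primal value and propagate the derivative only through the branch actually taken. A minimal sketch with a hypothetical `Dual` pair:

```python
class Dual:
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

def relu(x):
    # Branch on the primal value; the derivative is 0 on one side and
    # 1 on the other (the kink at 0 gets whichever branch the
    # comparison picks).
    return Dual(0.0, 0.0) if x.value <= 0 else Dual(x.value, x.deriv)

print(relu(Dual(2.0, 1.0)).deriv)   # 1.0
print(relu(Dual(-2.0, 1.0)).deriv)  # 0.0
```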


I don't understand. It's a piecewise differentiable function, and you maximize it how you maximize any such function: do gradient ascent where it's differentiable and compare against values at the boundary points (i.e. the start and end of the interval, and the points where there's a removable discontinuity).


I don't disagree that a gradient exists. As the other commenter noted, the gradient/subgradient will usually exist.

What I'm arguing is that this gradient will not allow you to optimize anything of interest for the vast majority of programs.


Yeah sometimes this is called a "subgradient".


There is a derivative, but since the function is discontinuous, the derivative will likewise be messed up (though just around x=59). You really don’t want to be climbing a gradient around discontinuous functions!


The function has a derivative by some notions of derivative.

But the function's derivative can't be derived by an application of the chain rule and the known derivatives of primitive functions, which is what algorithmic/automatic differentiation ultimately does (though it does this at run time, not compile time, since ordinary symbolic differentiation explodes in memory for complicated functions).

Also:

The continuous function

    int f(int x) {
        if(x > 2) 
            return x + x;
        else 
            return x*2;
    }
Is not automatically or symbolically differentiable when represented that way.


I believe a source-to-source differentiator could deal with all these (where well-defined, of course), e.g.:

https://github.com/FluxML/Zygote.jl

    julia> fs = Dict("sin" => sin, "cos" => cos, "tan" => tan);
    
    julia> gradient(x -> fs[readline()](x), 1)
    sin
    0.5403023058681398


Yeah, automatic differentiation is essentially only usable on functions specified the way mathematicians specify functions: as compositions of a series of primitives (plus operators like the integral and the differential itself, as well as inverse relations).

AD is not usable on loops, conditionals or recursive calls.

So basically, whatever way you specify your functions, you are effectively going to have a DSL (within a general-purpose language or otherwise), since not all the functions you can form are going to be differentiable by AD (and there's some confusion between differentiable in the abstract and differentiable by the methods of AD).

Edit: actually, it's pretty easy to extend AD so that functions defined piecewise on intervals are in the class amenable to AD. What's hard/impossible is extending it to functions defined by loops or recursion.


AD on loops/recursion works fine when you implement it using dual numbers (see https://news.ycombinator.com/item?id=20893414 for an example). If you used this to implement x^n using a for loop, you would get the correct derivative (n * x ^ (n - 1)).
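A minimal sketch of that claim in Python (the `Dual` class and `power` function here are hypothetical, not from any particular library):

```python
# Forward-mode AD via dual numbers: a plain for loop computing x^n
# carries the derivative along with the value, so nothing about the
# loop itself needs to be differentiated.
class Dual:
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def power(x, n):
    result = Dual(1.0, 0.0)   # the constant 1 has derivative 0
    for _ in range(n):
        result = result * x
    return result

y = power(Dual(2.0, 1.0), 5)  # seed dx/dx = 1
print(y.value, y.deriv)       # 32.0 80.0, i.e. x^5 and 5 * x^4 at x = 2
```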


AD is totally possible at the library level (I did it in C# 10 years ago using Conal Elliott’s ideas), however if you want to autodiff more than an expression-only language, compiler support is useful.


Maybe this is the piece I don't understand. What do you mean by "more than an expression-only language"? What is a non-expression in a language? And what would it mean to have a derivative for your non-expression?


You can easily use expressions to create trees rather than values in a library. E.g. a + b need not compute a value; through a bit of operator overloading it can compute the tree plus(tree-a, tree-b).
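A sketch of that trick in Python (the `Node` class is hypothetical): overloaded operators return tree nodes instead of numbers.

```python
class Node:
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def __add__(self, other):
        # a + b builds plus(a, b) rather than computing a value.
        return Node('plus', self, other)

    def __mul__(self, other):
        return Node('times', self, other)

    def __repr__(self):
        if not self.children:
            return self.op
        return f"{self.op}({', '.join(map(repr, self.children))})"

a, b = Node('a'), Node('b')
expr = a + b * a
print(expr)  # plus(a, times(b, a))
```

The resulting tree can then be differentiated or compiled by the library.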

This “trick” does not extend to statements, however. You can’t override if or semicolon in most languages. You can encode statements as expressions, but then you have to worry about things like variable bindings on your own.


Building expression trees is not the only way to do automatic differentiation. A simpler way is to just carry the value and its derivative as a pair:

    #include <math.h>
    #include <stdio.h>

    struct autodiff { double value, deriv; };

    autodiff just(double x) { return { x, 1 }; }

    autodiff operator +(autodiff a, autodiff b) {
        return { a.value + b.value, a.deriv + b.deriv };
    }

    autodiff operator *(autodiff a, autodiff b) {
        return { a.value * b.value, a.deriv*b.value + a.value*b.deriv };
    }

    autodiff sin(autodiff a) {
        return { sin(a.value), cos(a.value)*a.deriv };
    }

    int main() {
        autodiff x = just(.1);
        for (int ii = 0; ii<4; ii++) {
            x = x*x + sin(x);
        }
        printf("value: %lf, deriv: %lf\n", x.value, x.deriv);
        return 0;
    } 
There is no need to differentiate the for loop or the semicolons. This way is not doing symbolic differentiation. It's implementing the differentiation rules in parallel to calculating the values at run time.

This generalizes to partial derivatives for multivariate functions too:

     template<int dims>
     struct autograd { double value, grad[dims]; };


This only works if you don't have x as a value determining the length of the for loop. If you try to take the derivative of x^x using a loop that multiplies by x, x times, you'll get a wrong answer (admittedly, expecting a right answer would be foolish).

But this goes to the question at hand: whether AD should be a library/DSL or whether it should be a primitive of a general-purpose language. The thing is, a general-purpose language lets you present things as functions that even dual numbers won't take the derivative of correctly - a dual scheme can't distinguish variable-order loops from constant-order loops.
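To make the failure mode concrete, here is a sketch (hypothetical `Dual` class, with the loop's trip count read off the primal value): the dual part never sees that the loop count depends on x, so it differentiates x^n with n held fixed.

```python
import math

class Dual:
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def x_to_the_x(x):
    # The trip count comes from x's *value*; that dependence is
    # invisible to the dual part, which tracks x^n with n fixed.
    result = Dual(1.0, 0.0)
    for _ in range(int(x.value)):
        result = result * x
    return result

y = x_to_the_x(Dual(3.0, 1.0))
print(y.deriv)                 # 27.0, i.e. d/dx x^3 at x = 3
print(27 * (math.log(3) + 1))  # ~56.66, the true d/dx x^x at x = 3
```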


That's a fair point (and worth documenting as part of the library), but I'd be much more likely to implement pow(x, x) carefully than switch to a different language or compiler.


I appreciate the library approach and I basically agree with you. If anything, I wanted to point out that a normal general-purpose language is so general that adding differentiation to it would be highly costly - just this feature would require that every loop and every conditional return be watched (relative to pow(x, x), an even more challenging example is a Newton's-method implementation for a continuous function using a loop).

An alternative would be creating a general purpose language just for this feature - an interesting though rather specialized project.


I'm not certain, but I suspect you'd hit the halting problem if you tried to be completely general purpose.



