It's impossible for a piecewise-linear function to be anything other than linear outside the training sample. By definition, such functions are unable to do anything but interpolate.
(Side note: Transformers aren’t piecewise linear. The dot products are bilinear, and feeding the same input twice (under different linear maps) into a bilinear map produces a quadratic map, not a linear one.)
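Spelled out (using the standard query/key projection notation, which the side note leaves implicit): with projections $W_Q$ and $W_K$, the attention score between a token representation $x$ and itself is

$$(W_Q x)^\top (W_K x) = x^\top W_Q^\top W_K x,$$

which is a quadratic form in $x$, not a linear function of it.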
People arguing about this are basically all speaking ambiguously, in ways that tend either to create an apparent disagreement where there is none, or to hide the location of the actual disagreement.
It is true that a piecewise-linear function, within any one linear component, sends any convex combination of points to the corresponding convex combination of its outputs at those points.
It is not true that a piecewise-linear model trained on a set of data points will produce only outputs which are a convex combination of outputs that appear in the training set.
These are both obvious.
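A toy example of both claims (mine, not anything either side said): the one-unit ReLU “network” below is piecewise linear, fits a tiny training set exactly, and still emits outputs far outside the convex hull of the training outputs.

```python
# Toy illustration (my own example): a one-unit ReLU "network".
def f(x):
    return max(0.0, x)                      # two linear pieces: 0 for x < 0, x for x >= 0

train = [(-1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
assert all(f(x) == y for x, y in train)     # fits the training data exactly

# Within the linear piece x >= 0, convex combinations go to convex combinations:
a, b, t = 0.25, 1.0, 0.5
assert f(t * a + (1 - t) * b) == t * f(a) + (1 - t) * f(b)

# But an output at a new input need not be a convex combination of training outputs:
print(f(5.0))                               # 5.0, well outside [0, 1]
```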
If one person takes the former to be what “it just interpolates between points” means, and another takes the latter to be what “it doesn’t just interpolate between points” means, and then they argue about which of them is right, then both are being silly.
I’m not saying that this is literally what is happening. This is meant as a metaphor for somewhat more sophisticated/reasonable interpretations of “(doesn’t) just interpolate(s) between points in the data set”.
_____________
A model trained on images which produced only convex combinations of images in its training set, would clearly be producing what could be called “interpolations between images in its training set”, and taking convex combinations of images is unimpressive.
This is obviously not what today’s image ML models do.
And, of course, you aren’t claiming that they do.
______
I should speak plainly.
Much of the disagreement lies in, or hides behind, disagreement as to the meaning of “just interpolation”.
At one end, “just interpolation” could refer to “take the Voronoi cells of the inputs in the training set (or maybe the dual of it, whatever), and at runtime, find the nearest neighbors of the point and take the linear combination of their assigned outputs, weighted according to the distances to the point.”
This would certainly be “interpolation”, and it is not impressive; calling it “just interpolation” seems quite fitting. However, it is obviously not what ML models do.
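For concreteness, here is a minimal sketch of that kind of scheme (a hypothetical distance-weighted nearest-neighbour predictor; the names are mine, not anyone’s actual system):

```python
import numpy as np

def nn_interpolate(x, train_X, train_y, k=3, eps=1e-12):
    """Distance-weighted average of the k nearest training outputs."""
    d = np.linalg.norm(train_X - x, axis=1)   # distance from x to every training input
    idx = np.argsort(d)[:k]                   # indices of the k nearest neighbours
    w = 1.0 / (d[idx] + eps)                  # closer neighbours get larger weights
    w /= w.sum()
    return w @ train_y[idx]                   # a convex combination of their outputs

train_X = np.array([[0.0], [1.0], [2.0]])
train_y = np.array([0.0, 1.0, 4.0])
print(nn_interpolate(np.array([1.5]), train_X, train_y, k=2))  # lands between 1.0 and 4.0
```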
On the other end of the scale, “interpolation” could be interpreted as meaning “any process whatsoever for computing an output for a given input, so long as the process is generated mechanically from, and based primarily on, the training data.”
And, certainly, today’s ML models satisfy this description, but with this description the moniker “just” seems inappropriate. It is like saying “just a process”. Well, yeah, everything is a process.
__________
It seems to me like much of what the disagreement ought to be about (which might not be what it is actually about) is along the lines of: how many conceptual layers of something are captured?
Like, say something modeled images of faces as “linear combinations of images from this list of images of faces”. That’s an extremely basic thing.
Then, very slightly more sophisticated, would be something that determines the positions of facial features, does stretching etc. of the images to make them line up with the image to be reproduced, and then takes linear combinations.
Then, suppose something takes the parts of the images of the face which are just skin (not, say, lip skin or eyes), takes local averages of this in a number of general locations of the face (relative to the locations of facial features), and takes principal components of this across the training set (with the principal components perhaps corresponding to one or two for skin tone, then the directionality of the lighting in the image, and maybe a component for how shiny the skin is).
A model which represents a face in terms of variables which we can interpret as things like “position of eyes”, “skin tone”, and “lighting” seems notably less in the direction one might call “interpolating” than one which just lists a coefficient for each image in the training set (or for each principal component of the images (taken as plain vectors) in the dataset).
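To make the contrast concrete (this is entirely my own toy framing, with made-up data): the first kind of representation is literally just coefficients on pixel-space principal components, while the second kind would expose a handful of interpretable variables.

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))       # hypothetical flattened face images
mean = faces.mean(axis=0)
_, _, Vt = np.linalg.svd(faces - mean, full_matrices=False)

def pixel_space_code(img, n_components=20):
    """Representation 1: one coefficient per pixel-space principal component."""
    return Vt[:n_components] @ (img - mean)

# Representation 2 would instead expose a handful of named variables, e.g.
# {"eye_position": ..., "skin_tone": ..., "lighting": ...} -- quantities about
# faces in general rather than about the particular images in the training set.
```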
And, of course, one can go farther than this in this direction.
And the further one goes in this direction (so, the more that what the individual images in the training set tell the model is “here is more data about an overarching pattern”), the less it seems like what one might be inclined to call “just interpolation”.
>It is not true that a piecewise-linear model trained on a set of data points will produce only outputs which are a convex combination of outputs that appear in the training set.
No, and I didn't claim that. I said that, outside the training sample, the model is linear (or quadratic in the case of transformers, thanks for pointing that out). Whether linear or quadratic, a model that has a fixed structure outside the training sample will obviously not fit data which lies far away from the training sample - i.e. it will not extrapolate. This isn't controversial - it's just something people like to forget about.
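A quick way to see this for yourself (my own toy experiment, using scikit-learn's MLPRegressor as a stand-in ReLU network): train on y = x^2 inside [-1, 1], then query far outside that range - the prediction grows roughly linearly, nowhere near x^2.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(500, 1))
y_train = X_train.ravel() ** 2                 # target function: y = x^2

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

X_far = np.array([[3.0], [5.0], [10.0]])       # far outside the training range
print(net.predict(X_far))                      # grows roughly linearly out here
print(X_far.ravel() ** 2)                      # 9, 25, 100 -- what x^2 actually does
```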
>A model trained on images which produced only convex combinations of images in its training set, would clearly be producing what could be called “interpolations between images in its training set”, and taking convex combinations of images is unimpressive.
True! I should have clarified that it's not linear interpolation in pixel space (or input space generally), but interpolation on the latent manifold. This is where both the power and the limitations of deep learning come from.
It's definitely non-trivial to identify the latent manifold of data - different dimensions of the manifold may sometimes even correspond to independent components, as you mention (position of eyes, skin tone,...) (though empirically, finding disentangled latent codes is mostly a function of the random seed).
How does an NN process a new input? It maps the input to the latent manifold.
In the input space, it will be some highly non-linear, non-trivial combination of points which, in terms of Euclidean distance in the input space, could be arbitrarily close or far away.
In the latent space, the output will be some convex combination of nearby points.
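In sketch form (a deliberately crude stand-in encoder/decoder, just to pin down what "convex combination in latent space" means; no particular architecture is implied):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 784))     # stand-in "encoder": input space -> latent space
W_dec = rng.normal(size=(784, 8))     # stand-in "decoder": latent space -> input space

def encode(x):
    return np.tanh(W_enc @ x)         # nonlinear map into the latent representation

def decode(z):
    return W_dec @ z

x_a, x_b = rng.random(784), rng.random(784)
z_a, z_b = encode(x_a), encode(x_b)

t = 0.5
z_mid = t * z_a + (1 - t) * z_b       # convex combination *in latent space*
x_mid = decode(z_mid)                 # in input space, a highly nonlinear blend of x_a and x_b
```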
Here's the kicker - even if your problem happens to be well-modeled as a continuous, low-dimensional manifold embedded in a high-dimensional space (and many, many problems aren't), and even if you manage to obtain a super dense sampling of input space, so that the manifold can be well-approximated (which is impractical or impossible for most problems),
you will never be able to generalize beyond the data distribution.
Our brains don't stop working as soon as conditions are slightly different from what we've seen before. If there's a slight fog over a Stop sign, we can still see the stop sign. If the Go board is 9x9 rather than 19x19, we can still play Go. If we can play Starcraft on one map, we're pretty much as good on a different map; we don't need to relearn the game over the next several thousand years.
How come? Because we aren't just latent space interpolators. We can extrapolate.