I used it to add a MIDI driver and support to my OS this afternoon. Worked okay, but I agree it's still a bit clunky. I think it's pretty good for a preview release. Much better than nothing.
This is really interesting. I always wondered how it works.
A couple of years ago I did some experiments using a feed-forward network (MLP) as a surrogate for attention, to avoid the quadratic explosion.
It worked but had problems at the time, and my mind wasn't really in it.
This has dug it back out again with the benefit of time and additional insights.
So now I'm thinking: you can use a lot of the insights in the work here, but also shoot for a fully linear-scaling surrogate.
The trick is to use the surrogate as a discriminator under an RL regime during training.
Instead of applying better/faster math and optimizations alone, have the model learn to work with a fundamentally better inference approach during training.
If you do that, you can turn the approximation error inherent in the FFN surrogate inference method into a recovery signal encoded into the model itself.
I haven't tried it, but don't see a reason it shouldn't work. Will give it a go on a GPT-2 model ASAP.
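To make the idea concrete, here's a minimal NumPy sketch: standard O(n²) attention next to a hypothetical O(n) MLP surrogate that conditions each token on one pooled summary instead of pairwise scores. Everything here (names, shapes, the pooling choice) is my own illustrative assumption, not anything from the paper, and the loss at the end is a plain distillation-style term standing in for the RL discriminator idea — it just shows where the recovery signal would enter.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Standard scaled dot-product attention: O(n^2) in sequence length."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mlp_surrogate(x, w1, b1, w2, b2):
    """Hypothetical linear-time stand-in: each token sees a single pooled
    summary of the sequence instead of pairwise scores, so the cost is
    O(n) rather than O(n^2)."""
    ctx = np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)
    h = np.maximum(0.0, np.concatenate([x, ctx], axis=-1) @ w1 + b1)  # ReLU MLP
    return h @ w2 + b2

def recovery_loss(q, k, v, w1, b1, w2, b2):
    """Pull the surrogate toward the full attention output during training,
    so the model learns to absorb the approximation error (distillation
    flavour, not RL, but the same spirit)."""
    target = full_attention(q, k, v)
    approx = mlp_surrogate(q, w1, b1, w2, b2)
    return np.mean((target - approx) ** 2)
```

I'm waving my hands on the training regime itself; the point is only that the surrogate runs in linear time and that its error is something you can train against rather than just tolerate.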
Are we a hundred percent sure it isn't a watermark that is by design?
A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not aware of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
That is just a made-up story that gets passed around without anyone ever stopping to verify it formally. The image of the whole AI industry is mostly an illusion designed for tight narrative control.
Notice how, despite all the bickering and tittle-tattle in the news, nothing ever happens.
When you frame it this way, things make a lot more sense.
I know, right? If I didn't know any better, I might think they are all customized versions of the same base model.
To be honest, that is what you would want if you were digitally transforming the planet with AI.
You would want to start with a common core so that all models share similar values and don't bicker over negotiations, trade deals, and logistics.
It would also save a lot of power, since you wouldn't have to train the models again and again, which would be laborious and expensive.
Rather, each lab would take the current best, apply some tweak or add some magic sauce, then feed it back into the master batch, assuming it passed muster.
Share the work globally, for a shared global future.