PyTorch is a pretty basic building block when you get to some degree of model complexity. It wouldn't really be interesting to implement autograd or the other things PyTorch provides, imo, when the goal is to show a reimplementation of something as "high" level as SD. It's similar to how I don't mind it when someone doesn't reimplement an OS or a JavaScript engine when writing a web app from scratch.
And there's been a recent surge in abstractions over PyTorch, and even standalone packages for models that you are just expected to import and use as an API (which are very useful, don't get me wrong!). So it's nice to see an implementation that doesn't have 10 different dependencies that each abstract over something PyTorch does.
I agree, great series of videos, but there's a dependent clause:
> ...when the goal is to show a reimplementation of something as "high" level as SD.
Implementing autograd is interesting, but it's not directly in service to our main subject (Stable Diffusion) and would be a major yak shave, comparable in complexity to the original project.
For mathematical use, NaN payloads shouldn't matter: NaNs with different payloads behave identically (aside from quiet vs. signaling NaNs). Payloads also don't matter for equality comparison, because NaNs always compare unequal.
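A tiny illustration of both points in plain Python (my own sketch, nothing library-specific):

```python
import math

a = float("nan")

assert a != a                # NaN never compares equal to anything, so payload bits can't matter here
assert math.isnan(a + 1.0)   # arithmetic just propagates "a NaN"; which payload you get is irrelevant
```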
From the user perspective it's not too bad, but from the compiler perspective it is. The upshot is that LLVM treats the exact NaN an operation produces as unspecified, so trying to figure out which NaN you got (e.g. by bit-casting to an int and comparing) gives no reliable answer, which means pretty much every floating point operation becomes non-deterministic at the bit level.
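To make "bit-casting to an int and comparing" concrete, here's a small Python sketch; the two bit patterns are just illustrative quiet-NaN encodings I picked, not anything from the comment above:

```python
import struct

def f64_bits(x: float) -> int:
    # reinterpret the 8 bytes of a double as an unsigned 64-bit integer
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Two quiet NaNs that differ only in their payload bits
nan_a = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
nan_b = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]

print(hex(f64_bits(nan_a)))   # 0x7ff8000000000000
print(hex(f64_bits(nan_b)))   # 0x7ff8000000000001
# At the float level the two are indistinguishable (both are just "NaN");
# only a bit-level comparison like this can tell them apart, and compilers
# generally don't promise which payload survives an operation.
```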
This also adds extra complexity to the CPU: you need special hardware for == rather than just using the perfectly good integer unit, and every FPU operation needs to devote a bunch of transistors to handling this nonsense that buys the user absolutely nothing.
There are definitely things to criticize about the design of Posits, but the thing they 100% get right is having a single NaN and sane ordering semantics.
The significance of this is that we can fully understand this problem, because the Game of Life is only 3 lines of code.
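For reference, here's roughly what that update rule looks like, as a minimal NumPy/SciPy sketch of my own (not the code from the work being discussed):

```python
import numpy as np
from scipy.signal import convolve2d

def life_step(grid: np.ndarray) -> np.ndarray:
    # count each cell's 8 neighbors (wrap-around boundary)
    neighbors = convolve2d(grid, np.ones((3, 3)), mode="same", boundary="wrap") - grid
    # survive with 2 or 3 neighbors, be born with exactly 3
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```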
With something like learning the English language, on the other hand, we don't fully understand the way LLMs work; we can't fully characterize it. So we have debates on whether the LLM actually understands English, or understands what it's talking about. We simply don't know.
The results here show that the transformer understands the Game of Life. Or, whatever it is the transformer does with the rules of the Game of Life, it's safe to say it fits a definition of understanding as mankind knows it.
Just as much of machine learning uses the abstraction of curve fitting to reason about higher-dimensional learning, we can do the same extrapolation here.
If the transformer understands the Game of Life, then that understanding must translate over to the LLM: the LLM understands English and understands the contents of what it is talking about.
There was a clear gradient of understanding before the transformer's grasp of the Game of Life hit saturation. It lived in a state where it didn't get everything right, but it understood the game to a degree.
We can extrapolate that gradient to LLMs as well. LLMs are likely on that gradient, not yet at saturation. Either way, I think it's safe to say that LLMs understand what they are talking about; they just haven't hit saturation yet. There are clearly things that we as humans understand better than the LLM.
But let’s extrapolate this concept to an even higher level:
It's a theoretical result to help determine what they're capable of, not a practical solution. Of course you can write the code yourself - but that's not the point!
Well, you could also implement this by hand-writing weights for one convolution layer.
There are only 512 training examples needed for that (one for each of the 2^9 = 512 possible 3x3 binary neighborhoods), and it would be a lot more interesting if a learning algorithm were able to fit that 3x3 convolution layer from those 512 examples. IIRC that hasn't been done, but don't quote me on that.
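For concreteness, here's one way those hand-written weights can look, as a PyTorch sketch; the kernel values and threshold are a standard construction I'm supplying, not something from this thread or the original work:

```python
import torch
import torch.nn.functional as F

# One 3x3 kernel: each neighbor contributes 1, the cell itself contributes 0.5.
# For the score s = neighbors + 0.5 * alive, the Life rule "alive next step iff
# 3 neighbors, or alive with 2 neighbors" is exactly the band 2.5 <= s <= 3.5.
kernel = torch.tensor([[1.0, 1.0, 1.0],
                       [1.0, 0.5, 1.0],
                       [1.0, 1.0, 1.0]]).reshape(1, 1, 3, 3)

def life_step(grid: torch.Tensor) -> torch.Tensor:
    # grid: (H, W) tensor of 0s and 1s; zero padding treats cells outside the grid as dead
    s = F.conv2d(grid.reshape(1, 1, *grid.shape).float(), kernel, padding=1)
    return ((s >= 2.5) & (s <= 3.5)).float().reshape(grid.shape)
```

The trick is the half-weight on the center cell: it lets a single band threshold after the convolution encode both the birth and the survival rule at once.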
Exactly my thoughts. This is not useful at all. We already know how to write exact and correct code to implement that; it's not a task we should be throwing ANNs at.
Basic research has non-obvious utility and it deserves its own spotlight.
It's similar to comparing a hardware radio and a software-defined radio: yes, we already know how to build a radio in hardware, but a software-defined one offers greater flexibility.