PyTorch is a pretty basic building block when you get to some degree of model complexity. It wouldn't really be interesting to implement autograd or the other things PyTorch provides, imo, when the goal is to show a reimplementation of something as "high" level as SD. It's similar to how I don't mind it when someone doesn't reimplement an OS or a JavaScript engine when writing a web app from scratch.
And there's been a recent surge in abstractions over PyTorch, and even standalone packages for models that you are just expected to import and use as an API (which are very useful, don't get me wrong!). So it's nice to see an implementation that doesn't have 10 different dependencies that each abstract over something PyTorch does.
I agree, great series of videos, but there's a dependent clause:
> ...when the goal is to show a reimplementation of something as "high" level as SD.
Implementing autograd is interesting, but it's not directly in service to our main subject (Stable Diffusion) and would be a major yak shave, comparable in complexity to the original project.
For mathematical use, NaN payloads shouldn't matter: NaNs with different payloads behave identically (aside from quiet vs. signaling NaNs). Payloads also don't matter for equality comparison, because NaNs always compare unequal.
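A tiny illustration of both points in plain Python (my own sketch, nothing library-specific):

```python
import math

a = float("nan")

assert a != a                # NaN never compares equal to anything, so payload bits can't matter here
assert math.isnan(a + 1.0)   # arithmetic just propagates "a NaN"; which payload you get is irrelevant
```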
From the user perspective it's not too bad, but from the compiler perspective it is. The upshot is that LLVM treats the exact NaN an operation produces as unspecified, so trying to figure out which NaN you got (e.g. by bit-casting to an int and comparing) gives no reliable answer, which means pretty much every floating point operation becomes non-deterministic at the bit level.
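To make "bit-casting to an int and comparing" concrete, here's a small Python sketch; the two bit patterns are just illustrative quiet-NaN encodings I picked, not anything from the comment above:

```python
import struct

def f64_bits(x: float) -> int:
    # reinterpret the 8 bytes of a double as an unsigned 64-bit integer
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Two quiet NaNs that differ only in their payload bits
nan_a = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
nan_b = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]

print(hex(f64_bits(nan_a)))   # 0x7ff8000000000000
print(hex(f64_bits(nan_b)))   # 0x7ff8000000000001
# At the float level the two are indistinguishable (both are just "NaN");
# only a bit-level comparison like this can tell them apart, and compilers
# generally don't promise which payload survives an operation.
```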
This also adds extra complexity to the CPU: you need special hardware for == rather than just using the perfectly good integer unit, and every FPU operation needs to devote a bunch of transistors to handling this nonsense that buys the user absolutely nothing.
There are definitely things to criticize about the design of Posits, but the thing they 100% get right is having a single NaN and sane ordering semantics.
The significance of this is that we can fully understand this problem, because the Game of Life is only 3 lines of code.
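For reference, here's roughly what that update rule looks like, as a minimal NumPy/SciPy sketch of my own (not the code from the work being discussed):

```python
import numpy as np
from scipy.signal import convolve2d

def life_step(grid: np.ndarray) -> np.ndarray:
    # count each cell's 8 neighbors (wrap-around boundary)
    neighbors = convolve2d(grid, np.ones((3, 3)), mode="same", boundary="wrap") - grid
    # survive with 2 or 3 neighbors, be born with exactly 3
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```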
With something like learning the English language, on the other hand, we don't fully understand the way LLMs work; we can't fully characterize it. So we have debates on whether the LLM actually understands English, or understands what it's talking about. We simply don't know.
The results here show that the transformer understands the Game of Life. Or, whatever it is the transformer does with the rules of the Game of Life, it's safe to say it fits a definition of understanding as mankind knows it.
Just as much of machine learning uses the abstraction of curve fitting to reason about higher-dimensional learning, we can do the same extrapolation here.
If the transformer understands the Game of Life, then that understanding must translate over to the LLM: the LLM understands English and understands the contents of what it is talking about.
There was a clear gradient of understanding before the transformer's grasp of the Game of Life hit saturation. It lived in a state where it didn't get everything right, but it understood the game to a degree.
We can extrapolate that gradient to LLMs as well. LLMs are likely on that gradient, not yet at saturation. Either way, I think it's safe to say that LLMs understand what they are talking about; they just haven't hit saturation yet. There are clearly things that we as humans understand better than the LLM.
But let’s extrapolate this concept to an even higher level:
It's a theoretical result to help determine what they're capable of, not a practical solution. Of course you can write the code yourself - but that's not the point!
Well, you could also implement this by hand-writing weights for one convolution layer.
There are only 512 training examples needed for that (one for each of the 2^9 = 512 possible 3x3 binary neighborhoods), and it would be a lot more interesting if a learning algorithm were able to fit that 3x3 convolution layer from those 512 examples. IIRC that hasn't been done, but don't quote me on that.
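For concreteness, here's one way those hand-written weights can look, as a PyTorch sketch; the kernel values and threshold are a standard construction I'm supplying, not something from this thread or the original work:

```python
import torch
import torch.nn.functional as F

# One 3x3 kernel: each neighbor contributes 1, the cell itself contributes 0.5.
# For the score s = neighbors + 0.5 * alive, the Life rule "alive next step iff
# 3 neighbors, or alive with 2 neighbors" is exactly the band 2.5 <= s <= 3.5.
kernel = torch.tensor([[1.0, 1.0, 1.0],
                       [1.0, 0.5, 1.0],
                       [1.0, 1.0, 1.0]]).reshape(1, 1, 3, 3)

def life_step(grid: torch.Tensor) -> torch.Tensor:
    # grid: (H, W) tensor of 0s and 1s; zero padding treats cells outside the grid as dead
    s = F.conv2d(grid.reshape(1, 1, *grid.shape).float(), kernel, padding=1)
    return ((s >= 2.5) & (s <= 3.5)).float().reshape(grid.shape)
```

The trick is the half-weight on the center cell: it lets a single band threshold after the convolution encode both the birth and the survival rule at once.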
Exactly my thoughts. This is not useful at all. We already know how to write exact and correct code to implement that; it's not a task we should be throwing ANNs at.
Basic research has non-obvious utility and it deserves its own spotlight.
It's similar to comparing a hardware radio and a software-defined radio: yes, we already know how to build a radio in hardware, but a software-defined one offers greater flexibility.