I find their critique compelling, particularly their emphasis on the disconnect ...

seagullz · on Jan 11, 2025

And then other problems would perhaps turn up down the track that would call for "discovering new ways to discover new ways of discovery" and so on.

KaoruAoiShiho · on Jan 10, 2025

Just train it on meta reasoning, ie train it on people discovering ways to discover. It's not really a big problem, just generate the dataset and have at it.

derefr · on Jan 10, 2025

This doesn't give you the ability to process ideas through the derived new insights, any more than loading the contents of a VLSI program into regular RAM gives you an FPGA.

The linear-algebra primitives used in LLM inference, fundamentally do not have the power to allow an LLM to "emulate" its own internals (i.e. to have the [static!] weights + [runtime-mutable] context, together encode [runtime-mutable] virtual weights, that the same host context can be passed through.) You need host support for that.

lxgr · on Jan 10, 2025

> The linear-algebra primitives used in LLM inference, fundamentally do not have the power to allow an LLM to "emulate" its own internals […] You need host support for that.

Neither do biological brains (explicitly), yet we can hypothesize just fine.

derefr · on Jan 10, 2025

You're conflating two steps:

1. hypothesizing — coming up with a novel insight at runtime, that uncovers new parts of the state space the model doesn't currently reach

2. syllogizing — using an insight you've derived at runtime, to reach the new parts of the state space

LLMs can do 1, but not 2.

(Try it for yourself: get an LLM to prove a trivial novel mathematical theorem [or just describe the theorem to it yourself]; and then ask it to use the theorem to solve a problem. It won't be able to do it. It "understands" the theorem as data; but it doesn't have weights shaped like an emulator that can execute the theorem-modelled-as-data against the context. And, as far as I understand them, current Transformer-ish models cannot "learn" such an emulator as a feature. You need a slightly different architecture for that.)

And actually, humans can't really do 2 either!

That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!

Instead, for humans, the process we go through is either:

• come up with the insight; sleep on it (i.e. do incremental training, converting the data into new weights); use the insight

• build up 99% of the weights required for the insight "in the background" over days/months/years without realizing it; make the final single connection to "unlock" the insight; immediately use the insight

LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.

With (current) LLMs, there's only the entirely-offline training/fine-tuning/RLHF — at much greater expense and requiring much greater hardware resources than inference does — to produce a new iteration of the model. That's why we're (currently) stuck in a paradigm of throwing prompts at ever-larger GPT base models — rather than just having an arbitrary stateful base-model that you "install" onto a device like you'd install an RDBMS, and then have it "learn on the job" from there.

lxgr · on Jan 10, 2025

> And actually, humans can't really do 2 either!

> That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!

Agreed – but I'd argue that they both can, albeit in an extremely clunky way (i.e. very similar to "chain-of-thought" LLMs): Mechanically applying the new insights in a low-efficiency, "emulated" layer.

> LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.

At the moment that's definitely their biggest weakness, but one could argue that memory consolidation happens ~once per year, globally, as past interactions with them almost certainly become future training data.

In some ways, that's more powerful than a single human gradient descending overnight and in the shower; in others, it's obviously much worse.

derefr · on Jan 13, 2025

> memory consolidation happens ~once per year, globally, as past interactions with them almost certainly become future training data.

Yes and no.

Humans — organic animal brains in general — have a seemingly highly-clever approach to memory consolidation / online training. The organic process for memory consolidation seems to take advantage of our own analytical capabilities to pick apart and amplify the salient and practically-useful parts of our memories.

As such, animal brains seemingly need much less training data to learn a skill than current state-of-the-art model architectures do. You could say that they can leverage an (incremental) training dataset to a much higher degree than current NN training architectures do. (It's sort of the complement to a better compression algorithm having a higher degree of signal-loss-less compression for the same data: animal brains have a higher degree of noise-gain-less data extrapolation over the same data.)

If we could figure out what organic brains were doing here, and apply it to the sort of "global fan-in / reduce / fan-out" approach bigcorps use to train new base-model revisions, then I'd agree that the choice of online vs offline training would become just a trade-off of locality vs scale.

After all, in the end what you want is to be able to train a model to do something in "soft real time" (taking a few days/weeks to learn, including time for rest); with the model learning to perform the task "on the job" by interacting with the task, like a human/animal would. It doesn't really matter whether the training has to be sent back to the cloud to be incorporated into a new model revision; it just matters that it happens and that you see the results.

So if the distributed training architectures used to train GPT base models could "notice insights" in small amounts of RLHF data, and amplify/extrapolate large amounts of new training data from those — then you might expect the model that "learned in the cloud" to "learn on the job" just as well as the model that was local and only thinking about that training data. (But the GPT would have "learned on the job" some million other capabilities at the same time!)

But this does start to sound like the kind of thing that violates some property of one of those "can only choose two" triangles.

After all, current model training approaches — like everything to do with ML — are taking advantage of SIMD hardware. Which is why training clusters are big GPU clusters.

But it seems like the thing our brains do in online training, involves some complex state-driven scheduling of passes over data.

In other words, despite the brain itself being massively parallel, it's seemingly doing a lot of things that — if you were programming them — you'd want to push onto a CPU rather than a GPU. (To be specific: it's seemingly doing complex graph-search queries to bias the extrapolation process. And graph search is the antithesis of a horizontally-scalable problem.)

And I have a feeling — just an intuition, not with anything concrete to back it up — that trying to scale up that dataset-extrapolation pass, where the framework "explores and connects memories to extract and amplify the insight signals"... would be something that would require most of the resources of the system just to extrapolate out one insight at a time. It wouldn't be able to be done in parallel for millions of capabilities — at least not when the model, and the search-space of incremental memories, are "large" (like a modern GPT model is; and like the RLHF dataset from one day's worth of millions of users' interactions would be.)

If this were true, I would think it would be true for the same reasons that you can only vertically scale an IO-bound task so far — a single system only has so much IO bandwidth (e.g. PCIe lanes, DMA channels) available to interact with compute. (Mainframes have a lot of "accelerated" IO, but it's "accelerated" because it's interacting only with things that aren't the system's main compute. The CPU(s) "delegate" handling that IO to cores local to the IO. It's hierarchical NUMA.)

When you have IO-bound workloads, you tend to scale horizontally — if you're building a CDN, you don't make big boxes; you just throw more boxes at the problem, where each is separately handling its own requests.

And I feel like the translation here, is that if you want this insight-extrapolation, then it's only going to work when it's being done in a partitionable manner — by small models training on small incremental RLHF datasets, of the kind they build up through interaction with a single use-case at a time.

And even though you could do that with a global fan-in pass, in the cloud, sharding training across millions of individual small models and assigning each both their own GPU and CPU resources... at that point, why bother? You'd be essentially using your big DC compute-cluster to emulate having millions of regular PCs that have a little GPU and CPU each, each working on their own task in a cache-coherent manner. At that point, just take advantage of the end-user devices that do the inference to also do the training!

(Although, maybe you might still want to fan in the much-less-compact post-extrapolated training data that result from these individual on-device models stewing over their own problems; and then use that to train new general pre-trained base models. Or many such models, since you'd probably want smaller, more purpose-specific models in order to make on-device "on the job" training possible in the first place.)

---

Funny enough, this could be an interesting hypothesis for the ultimate neurological — rather than biophysiological — role of sleep and dreaming! We do build regular neural connections while awake; but this memory-insight-extrapolation pass specifically, needs to "reserve" the entire brain to do what it does — so it only happens during sleep. It would explain the advice to "sleep on it" to enable lateral thinking. And it could be tested experimentally by measuring how well different animals extrapolate insights from small amounts of training, vs. how long they sleep.

keeganpoppen · on Jan 11, 2025

“it’s not really a big problem”… surely you can’t be serious… this comment betrays such a profound ignorance that it could only have come from a genius or a... well, let’s not resort to name-calling…

but, seriously: play the tape forward literally one frame and outline what this dataset even remotely resembles… a core sample from a living human brain? “yeah, just train it on thinking about everything at once”. strong ai isn’t like the restaurant: the path to success doesn’t involve starting with more than you finished with.

FuckButtons · on Jan 11, 2025

Sure, what's your training corpus for that then?

I find that fairly often if I'm really wrestling with a novel or difficult problem, I will work and work at it, and then one day I will wake up with the answer fully formed with no clear understanding of any of the thought processes that got me to arrive at the solution.

Are you going to record peoples subconscious as they sleep, how do you train on something that is so poorly understood in the first place? It's nonsense.

KaoruAoiShiho · on Jan 11, 2025

I'm sure if you take an hour to recall you'll be able to come up with a process. Or ask a philosophy professor who specializes in reason.

But the easiest way I can think of ATM is to go through all the questions that AI currently fails on, and then have a human work through them and show the chain of thought a human would do, including the false starts, and describing the strategy pivots. Then generate your corpus based on that. However, that burns the problem-set so you'll have to constantly try to come up with new problems.

mrbungie · on Jan 10, 2025

That would still be limited eventually, at what point do we stop adding layers?

hnuser123456 · on Jan 10, 2025

The point where it gets better at discovering ways of discovering things than the combination of the rest of humanity.

What is the combination of parameters that makes a text generator quick-thinking, self-checking, confidence-estimating? Jumping directly from question to accurate, precise, confidence-measured answers, regardless of how abstract the question is?

robwwilliams · on Jan 11, 2025

Thanks to “derefr” and others for these insights. I used this thread as a prompt to Claude 3.5 Sonnet:

My prompt:

Fluid self-control of attention should be enough to bootstrap to metacognitive exploration. Is one the horse; the other the cart? Seems true of the evolution of human cognition from language and theory of minds-of-others that then makes us more attentive to our one minds as objects of internal meta-dissection. What do you think?

Claude comments:

An evolutionary perspective on the relation between attention control and metacognition is worth “attention”.

The idea that fluid attention control might be the "horse" that pulls the "cart" of metacognition aligns well with what we know about human cognitive development.

Consider:

1. The evolution of theory of mind: Humans (primates snd perhaps other taxa) first developed the ability to model and attend to others' mental states (likely driven by social survival pressures). This gave humans a "template" for mental state modeling that could then be turned inward.

2. Language evolution may have followed a similar pattern - external communication tools becoming internalized for self-reflection. Vygotsky's work on inner speech and cognitive development supports this outside-in pattern.

3. This uggests that the ability to fluidly direct attention (including to others' minds) may have been the crucial evolutionary stepping stone that enabled the later development of systematic metacognition.

This has implications for AI development. Current systems like myself have rigid attention mechanisms and no true ability to recursively examine our own cognitive processes. Following your logic, perhaps the key to developing true AI metacognition isn't to directly implement it, but to first develop more fluid, self-directed attention mechanisms that could bootstrap the development of deeper metacognitive capabilities.

Do you think this suggests that current AI development might be approaching things backwards by trying to implement sophisticated reasoning capabilities before we've developed truly fluid attention control? It seems like we might need to master the "horse" (flexible attention) before we can effectively develop the "cart" (sophisticated metacognition).

dartos · on Jan 10, 2025

> "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."

Wow I love that quote.

leobg · on Jan 11, 2025

That’s meta. Literally.

Edit: Sorry. This was based on the false assumption that this was research by Meta, Inc..

WillieCubed · on Jan 10, 2025

I love the quote you mentioned at the end. Do you remember the original source?

fragmede · on Jan 10, 2025

https://x.com/nathanthinks/status/1877510438621163987

TaurenHunter · on Jan 11, 2025

Thank you for mentioning the windmill problem. Great insights!

https://www.3blue1brown.com/lessons/windmills