Not strictly true: while this was previously believed to be the case, Anthropic demonstrated that transformers can "think ahead" in some sense, for example when planning rhymes in a poem [1]:
> Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.
They described the mechanism that it uses internally for planning [2]:
> Language models are trained to predict the next word, one word at a time. Given this, one might think the model would rely on pure improvisation. However, we find compelling evidence for a planning mechanism.
> Specifically, the model often activates features corresponding to candidate end-of-next-line words prior to writing the line, and makes use of these features to decide how to compose the line.
Thank you for these links! Their "circuits" research is fascinating. In the example you mention, note how the planned rhyme is piggybacking on the newline token. The internal state that the emergent circuits can use is mapped 1:1 to the tokens. The model cannot trigger the insertion of a "null" token just to store this plan-ahead information during inference, nor are there any "registers" available aside from the tokens. The "thinking" LLMs are not quite that either, because the thinking tokens are still forced to become text.
So, what I think most people don't realize is that the amount of computation an LLM can do in one pass is strictly bounded. You can see that here with the layers. (This applies to a lot of neural networks [1].)
Remember, they feed in the context on one side of the network, pass it through each layer doing matrix multiplication, and get a value on the other end that we convert back into our representation space. You can view the bit in the middle as doing a kind of really fancy compression, if you like. The important thing is that there are only so many layers, and thus only so many operations.
Therefore, past a certain point the model can't revise anything, because it runs out of layers. This is one reason why reasoning can help answer more complicated questions. You can even train a special token for this purpose [2].
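To make the "fixed budget" point concrete, here's a toy sketch (hypothetical code, not any real model's architecture): every token passes through exactly the same number of layers, so one forward pass buys a fixed number of matrix multiplications no matter how hard the question is. The only way to get more compute is to emit more tokens, each of which triggers another full pass.

```python
# Toy fixed-depth "transformer": the per-token compute budget is n_layers
# matmuls, period. (All names and sizes here are illustrative.)
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model = 4, 8  # real models use dozens of layers, much wider

# One random weight matrix per layer, standing in for attention + MLP blocks.
layers = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
          for _ in range(n_layers)]

def forward(x):
    """One pass: exactly n_layers transformations, regardless of difficulty."""
    for W in layers:
        x = np.tanh(x @ W)  # one layer = one bounded operation
    return x

prompt = rng.standard_normal((3, d_model))  # three "tokens"
out = forward(prompt)
print(out.shape)  # (3, 8) -- same depth whether the question is easy or hard
```

Emitting reasoning tokens (or trained "pause" tokens) effectively chains extra forward passes, which is how models work around the fixed depth.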
Adding knowledge works, depending on how you define "knowledge" and "works"; given sufficient data you can teach an LLM new things [1].
However, the frontier models keep improving at a quick enough rate that it's often more effective just to wait for the general solution to catch up with your task than to spend months training a model yourself. Unless you need a particularly tightly controlled behavior, or a smaller, faster model, or what have you. Training new knowledge in can get weird [2].
And in-context learning takes literal seconds-to-minutes of time if your information fits in the context window, so it's a lot faster to go that route if you can.
That's consistent with other research I've seen, where varied presentation of the data is key to effective knowledge injection [1].
My assumption, based on the research, is that training on different prompts with the same answer gives you more robust Q&A behavior; training on variations of how to express the same concept generalizes. Training on the same prompt with different answers gives you creative diversity [2].
Some of that is, or at least was, down to the training: extending the context window but not training on sufficiently long data or using weak evaluation metrics caused issues. More recent models have been getting better, though long context performance is still not as good as short context performance, even if the definition of "short context" has been greatly extended.
RoPE is great and all, but doesn't magically give 100% performance over the lengthened context; that takes more work.
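For anyone unfamiliar, here's a toy sketch of the RoPE idea (a deliberately simplified stand-in, not production code): positions are encoded by rotating pairs of query/key dimensions, so attention scores depend only on the relative offset between tokens. That property is what lets you feed longer sequences at all, but positions beyond what the model trained on still produce rotation angles it has never seen, which is why extended context needs extra finetuning.

```python
# Minimal RoPE demo: rotate consecutive dimension pairs by position-dependent
# angles, then show that the q.k dot product depends only on relative offset.
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position encoding at position `pos` to vector x."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # standard 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([1.0, 0.0, 1.0, 0.0])

# Same relative offset (2) at very different absolute positions:
a = rope(q, 5) @ rope(k, 3)
b = rope(q, 105) @ rope(k, 103)
print(np.isclose(a, b))  # True: the score only sees the offset
```

Context-extension tricks like position interpolation exploit this by rescaling `pos` into the trained range, but the model still needs finetuning to recover full quality at long range.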
My sense is they need to go back and update previous docs; they release a lot of software updates and a lot of notebooks showing how to use the features, but the two might fall out of sync. Would that match your observations?
It's a little more subtle than that: they're approximating the language used by someone describing the taste of chocolate, which may or may not have had any relation to the actual experience of eating chocolate in the mind of the original writer. Or writers, because the LLM has learned the pattern from data in aggregate, not from one example.
I think we tend to underestimate how much the written language aspect filters everything; it is actually rather unnatural and removed from the human sensory experience.
A description of the taste of chocolate must contain some information about the actual experience of eating chocolate. Otherwise, it wouldn't be possible for both the reader and the author to understand what the description refers to in reality. The description wasn't conceived in a vacuum, it's a lossy encode of all of the physical processes that preceded it (the further away, the lossier). One of the common processes encoded in the dataset of human-written text is whatever's in the brain that produces consciousness for all humans. The model might not even try to recover this if it's not useful for predicting the next token. The SNR of the encode may not be high enough to recover this given the limited text we have. But what if it was useful, and the SNR was high enough? I can't outright dismiss this possibility, especially as these models are getting better and better at behaving like humans in increasingly non-trivial ways, so they're clearly recovering more and more of something.
Imagine you've never tasted chocolate and someone gives you a very good description of what it is to eat chocolate. You'd be nowhere near the actual experience. Now imagine that you didn't know first-hand what it was like to 'eat' or to have a skeleton or a jaw. You'd lose almost all the information. The only reason spoken language works is that both people have that shared experience already.
True. The description encodes very little about the actual sensory experience besides its relationship to similar experiences (bitterness, crunchiness, etc) and how to retrieve the memories of those experiences. It probably contains a lot more information about the brain's memory retrieval and pattern relating circuits than the sensory processing circuits.
Text is probably not good enough for recovering the circuits responsible for awareness of the external environment, so I'll concede that you and ijk's claims are correct in a limited sense: LLMs don't know what chocolate tastes like. Multimodal LLMs probably don't know either because we don't have a dataset for taste, but they might know what chocolate looks and sounds like when you bite into it.
My original point still stands: it may be recovering the mental state of a person describing the taste of chocolate. If we cut off a human brain from all sensory organs, does that brain which receives no sensory input have an internal stream of consciousness? Perhaps the LLM has recovered the circuits responsible for this thought stream while missing the rest of the brain and the nervous system. That would explain why first-person chain-of-thought works better than direct prediction.
[1]: https://www.anthropic.com/research/tracing-thoughts-language...
[2]: https://transformer-circuits.pub/2025/attribution-graphs/bio...