Isn't talking about "here's how LLMs actually work" in this context a bit like saying "a human can't be relevant to X because a brain is only a set of molecules, neurons, and synapses"?
Or even "this book won't have any effect on the world because it's only a collection of letters, see here, black ink on paper, that is what it IS, it can't DO anything"...
Saying an LLM is a statistical prediction engine for the next token is, IMO, sort of confusing what it is with the medium it is expressed in / built of.
For instance, take those small experiments mentioned in a sibling post that train a network on addition problems. The weights end up forming an addition machine. An addition machine is what it is; that is the emergent behavior. The machine-learning weights are just the medium it is expressed in.
What's interesting about LLMs is such emergent behavior. Yes, it's statistical prediction of likely next tokens, but training weights for that might well have the side effect of wiring up some kind of "intelligence" (for reasonable everyday definitions of the word, such as programming as well as a median programmer). We don't really know this yet.
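To make the addition example concrete, that kind of toy experiment looks roughly like this (a minimal sketch, assuming PyTorch; the architecture and hyperparameters are arbitrary choices of mine):

    # Train a tiny MLP to map two numbers to their sum. After training, the
    # weights form an "addition machine"; the weights are just the medium.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for step in range(2000):
        a = torch.rand(256, 1) * 10            # random operands in [0, 10)
        b = torch.rand(256, 1) * 10
        pred = model(torch.cat([a, b], dim=1))
        loss = loss_fn(pred, a + b)            # target is the true sum
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(model(torch.tensor([[3.0, 4.0]])))   # roughly 7.0 after training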
It's pretty clear that the problem of solving AI is a software problem; I don't think anyone would disagree.
But that problem is MUCH MUCH MUCH harder than people make it out to be.
For example, you can reliably train an LLM to produce accurate output for assembly code that fits into a context window. However, let's say you give it a terabyte of assembly code: it won't be able to produce correct output, as it will run out of context.
You can get around that with agentic frameworks, but all of those right now are manually coded.
So how do you train an LLM to correctly take any length of assembly code and produce the correct result? The only way is to essentially train the structure of the neurons inside of it to behave like a computer, but the problem is that you can't do back-propagation with discrete 0 and 1 values unless you explicitly code in the architecture for a CPU inside. So obviously, error correction with inputs/outputs is not the way we get to intelligence.
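To illustrate the back-propagation point, here is a minimal sketch (assuming PyTorch): a hard rounding step passes zero gradient, so nothing upstream of it can learn; the usual workaround, the straight-through estimator, is an approximation rather than a CPU:

    import torch

    w = torch.tensor(0.7, requires_grad=True)
    x = torch.tensor(1.0)

    # Hard 0/1-style decision: round() participates in autograd, but its
    # gradient is zero everywhere, so no learning signal ever reaches w.
    hard = torch.round(w * x)
    ((hard - 0.0) ** 2).backward()
    print(w.grad)                       # tensor(0.)

    # Straight-through estimator: use the hard value on the forward pass but
    # pretend it was the identity on the backward pass (biased, but trainable).
    w.grad = None
    soft = w * x
    ste = soft + (torch.round(soft) - soft).detach()
    ((ste - 0.0) ** 2).backward()
    print(w.grad)                       # nonzero gradient now reaches w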
It may be that the answer is pretty much a stochastic search where you spin up x instances of trillion-parameter nets and make them operate in environments with some form of genetic algorithm until you get something that behaves like a human, and any shortcutting of this is not really possible because of essentially chaotic effects.
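In toy form, that kind of search is just an evolutionary loop like this (a sketch using NumPy; the fitness function is a placeholder rather than a real environment, and the sizes are obviously nowhere near trillion parameters):

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, POP, GENS = 1000, 32, 200      # stand-ins for "trillion parameters, x instances"

    def fitness(params):
        # Placeholder for "behaves like a human": score behaviour in an environment.
        return -float(np.sum(params ** 2))

    population = rng.normal(size=(POP, DIM))
    for _ in range(GENS):
        scores = np.array([fitness(p) for p in population])
        elite = population[np.argsort(scores)[-POP // 4:]]      # keep the best quarter
        parents = elite[rng.integers(len(elite), size=POP)]     # resample with replacement
        population = parents + rng.normal(scale=0.1, size=parents.shape)  # mutate

    print(max(fitness(p) for p in population))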
> For example, you can reliably train an LLM to produce accurate output for assembly code that fits into a context window. However, let's say you give it a terabyte of assembly code: it won't be able to produce correct output, as it will run out of context.
Fascinating reasoning. Should we conclude that humans are also incapable of intelligence? I don't know any human who can fit a terabyte of assembly into their context window.
Any human who would try to do this is probably a special case. A reasonable person would break it down into sub-problems and create interfaces to glue them back together...a reasonable AI might do that as well.
I can tell you from first-hand experience that Claude plus the Ghidra MCP server is very good at understanding firmware, labeling functions, finding buffer overflows, and patching in custom functionality.
On the other hand, the average human has a context window of 2.5 petabytes that's streaming inference 24/7 while consuming the energy equivalent of a couple of sandwiches per day. Oh, and can actually remember things.
Citation desperately needed? Last I checked, humans could not hold the entirety of Wikipedia in working memory, and that's a mere 24 GB. Our GPU might handle "2.5 petabytes" but we're not writing all that to disc - in fact, most people have terrible memory of basically everything they see and do. A one-trick visual-processing pony is hardly proof of intelligence.
I think the idea is that we may not store 2.5 petabytes of facts like wikipedia.
But we do store a ton of “data” in the form of innate knowledge, memories, etc.
I don’t think human memory/intelligence maps cleanly to computer terms though.
>So obviously, error correction with inputs/outputs is not the way we get to intelligence.
This doesn't seem to follow at all, let alone obviously? Humans are able to reason through code without having to become a completely discrete computer, but probably can't reason through any length of assembly code, so why is that requirement necessary, and how have you shown LLMs can't achieve human levels of competence on this kind of task?
> but probably can't reason through any length of assembly code
Uh, what? You can sit there and execute assembly code step by step, writing things down on a piece of paper, and get the correct final result. The limits are things like attention span, which is separate from intelligence.
Human brains operate continuously, with multiple parts being active at once, and with weight adjustment done in real time, both in the style of backpropagation and as real-time updates for things like "memory". How do you train an LLM to behave like that?
So humans can get pen and paper and sleep and rest, but LLMs can't get files and context resets?
Give the LLM the ability to use a tool that looks up and records instructions from/to files, instead of holding everything in the context window, and to actively manage its context (write a new context and start fresh), and I think you would find the LLM could probably do it about as reliably as a human?
Context is basically "short term memory". Why do you set the bar higher for LLMs than for humans?
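For what it's worth, the "files instead of context" setup is basically just a couple of tools plus a scratch directory. A rough sketch (the function names, the scratch path, and the tool-description format are all illustrative; the real wire format depends on the provider's API):

    from pathlib import Path

    WORKDIR = Path("scratch")            # hypothetical scratch directory
    WORKDIR.mkdir(exist_ok=True)

    def read_chunk(name: str, start_line: int, n_lines: int) -> str:
        """Return n_lines of a stored file starting at start_line (0-based)."""
        lines = (WORKDIR / name).read_text().splitlines()
        return "\n".join(lines[start_line:start_line + n_lines])

    def write_note(name: str, text: str) -> str:
        """Append an intermediate result so it survives a context reset."""
        with open(WORKDIR / name, "a") as f:
            f.write(text + "\n")
        return "ok"

    # Provider-agnostic descriptions the agent loop would hand to the model:
    TOOLS = [
        {"name": "read_chunk", "description": "Read a slice of a stored file",
         "parameters": {"name": "str", "start_line": "int", "n_lines": "int"}},
        {"name": "write_note", "description": "Persist an intermediate result",
         "parameters": {"name": "str", "text": "str"}},
    ]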
Couldn't you periodically re-train it on what it's already done and use the context window for more short-term memory? That's kind of what humans do: we can't learn a huge amount in a short time but can accumulate a lot slowly (school, experience).
A major obstacle is that they don't learn from their users, probably because of privacy. But imagine if your context window was shared with other people, and/or all your conversations were used to train it. It would get to know individuals and perhaps treat them differently, or maybe even manipulate how they interact with each other so it becomes like a giant Jeffrey Epstein.
You're putting a bunch of words in the parent commenter's mouth, and arguing against a strawman.
In this context, "here’s how LLMs actually work" is what allows someone to have an informed opinion on whether a singularity is coming or not. If you don't understand how they work, then any company trying to sell their AI, or any random person on the Internet, can easily convince you that a singularity is coming without any evidence.
This is separate from directly answering the question "is a singularity coming?"
One says "well, it was built as a bunch of pieces, so it can only do the things the pieces can do", which is reasonably dismissed by noting that basically the only people who predicted current LLM capabilities are the ones who are remarkably worried about a singularity occurring.
The other says "we can evaluate capabilities and notice that LLMs keep gaining new features at an exponential, now bordering on hyperbolic, rate", like the OP link. And those people are also fairly worried about the singularity occurring.
So mainly you get people using "here's how LLMs actually work" to argue against the Singularity if and only if they are also the ones arguing that LLMs can't do the things that they can provably do today, or are otherwise making arguments that would also declare humans incapable of intelligence, reasoning, etc.
False dichotomy. One can believe that LLMs are capable of more than their constituent parts without necessarily believing that their real-world utility is growing at a hyperbolic rate.
Fair - I meant there are two major clusters in the mainstream debate, but like all debates there are obviously a few people off in all sorts of other positions.
There is more to it than molecules, neurons, and synapses. Those are made from lower-level stuff that we have no idea about (well, we do in this instance, but you get the point). They are just higher-level abstractions that are useful for explaining and understanding some things, but they don't describe or capture the whole thing. For that you would need to go to lower and lower levels, and so far it seems the levels go on indefinitely. Currently we are stuck at the quantum level; that doesn't mean it's the final level.
OTOH, an LLM is just a token prediction engine; that description fully and completely covers it. There are no lower-level secrets hidden in the design that nobody understands, because it could not have been created if there were. The fact that the output can be surprising is not evidence of anything; we have always had surprising outputs, like funny bugs or unexpected features. Using the word "emergence" for this is just deceitful.
The algorithm has fundamental limitations, and if you look closely, they have not been getting better. For instance, you could vibe code a C compiler now, but it's 80% there: a cute trick, not usable in the real world. Just like anything else, it cannot be economically vibe coded to 100%. They are not going back and vibe coding the previous, simpler projects to 100% with "improved" models; instead they are just vibe coding something bigger to 80%. This is not an improvement in the limitations; it is actually communicating between the lines that the limitations cannot be overcome.
They're not power tools lol. Tech has plenty of power tools and we automated the crap out of our job already.
Writing code has never been the limiting factor, it's everything else that goes into it.
Like, I don't mind that there's a bunch of weekend warriors out here building shoddy gazebos and sheds with their brand new overpriced tools, incorrecting each other on the best way to do things. We had that with the bitcoin and NFT bros already.
What I do roll my eyes at is when the bros start talking about how they're totally going to build bridges and planes and it's gonna be soooo easy to get to new places, just slap down a bridge.
Uh huh. Y'all do not understand what building those actually entails lol.
But if you try some penny-saving cheap model like Sonnet [..bad things..]. [Better] pay through the nose for Opus.
After blowing $800 of my bootstrap startup funds on Cursor with Opus for myself in a very productive January, I figured I had to try to change things up... so this month I'm jumping between Claude Code and Cursor, sometimes writing the plans and having the conversation in Cursor and dumping the implementation plan into Claude.
Opus in Cursor is just so much more responsive and easy to talk to, compared to Opus in Claude Code.
Cursor has this "Auto" mode, which feels like it has very liberal limits (amortized cost, I guess), that I'm also trying to use more, but I don't really like to flip a coin and, if it lands heads, waste half an hour discovering the LLM made a mess and then try again forcing a specific model.
Perhaps in March I'll bite the bullet and take this author's advice.
Yeah, I can't recommend gpt-5.3-codex enough; it's great! I've been using it with the new macOS app and I'm impressed. I've always been a Claude Code guy, and I find myself using Codex more and more. Opus is still much nicer at explaining issues and walking me through implementations, but Codex is faster (even with xhigh effort) and gets the job done 95% of the time.
I was spending unholy amounts of money and tokens (subsidized cloud credits tho) forcing Opus for everything, but I'm very happy with this new setup. I've also experimented with OpenCode and their Zen subscription to test Kimi K2.5 and similar models, and they also seem like a very good alternative for some tasks.
What I cannot stand tho is using Sonnet directly (it's fine as a subagent); I've found it hard to control, and it doesn't follow detailed instructions.
Out of curiosity, what’s your flow? Do you have codex write plans to markdown files? Just chat? What languages or frameworks do you use?
I'm an avid Cursor user (with Opus), and have been trying alternatives recently. Codex has been an immense letdown. I think I was too spoiled by Cursor's UX and internal planning prompt.
It's incredibly slow, produces terribly verbose and over-complicated code (unless I use high or xhigh, which are even slower), and misses a lot of details. Python/Django and React frontend.
For the first time I felt like I could relate to those people who say it doesn't make them faster, because they have to keep fixing the agent's output. Never felt that with Opus 4.5 and 4.6 and Cursor.
The Codex CLI is a very performant CLI though, better than any other CLI code assistant I've used.
I mean, does it matter what code it's producing? If it renders and functions, just use it. I think it's better to take the L on verbose code and optimize the really ugly bits by hand in a few minutes than to be kneecapped every 5 hours by limits and constant pleas to shift to Sonnet.
I promise you, you're just going to continue to light money on fire. Don't fall for this token madness: the bigger your project gets, the less capable the LLM will get and the more you spend per request on average. This is literally all marketing tricks by inference providers. Save your money and code it yourself, or use very inexpensive LLM methods if you must.
I think we are going to start hearing stories of people going into thousands in CC debt because they were essentially gambling with token usage, thinking they would hit some startup jackpot.
Compared to the salary I lose by not taking a consulting gig for half a year, these $800 aren't all that much. (I guess depending on the definition of bootstrap, mine might not be one, as I support myself with saved consulting income.)
Startup is a gamble with or without the LLM costs.
I have been coding for 20 years; I have a good feel for how much time I would have spent without LLM assistance. And if LLMs vanish from the face of the earth tomorrow, I still saved myself that time.
Looks very interesting; I fully agree that running CI locally is viable.
But what I didn't pick up from a quick scan of the README is the best pattern for integrating with git. Do you expect users to run (a script calling) selfci manually, or is it hooked up to git or similar? When do the merge hooks come into play? Do you ask selfci to merge?
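For reference, the generic pattern I'm wondering about would be something like a pre-push hook; here's a sketch (the "selfci run" invocation is purely my guess, not something I found in the README):

    #!/usr/bin/env python3
    # .git/hooks/pre-push -- run the local CI before anything leaves the machine.
    import subprocess
    import sys

    result = subprocess.run(["selfci", "run"])   # hypothetical command name
    if result.returncode != 0:
        print("local CI failed; aborting push", file=sys.stderr)
        sys.exit(1)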
Not an AI researcher and I don't really know, but intuitively it makes a lot of sense to me.
To do well as an LLM, you want to end up with the weights that get furthest in the direction of "reasoning".
So assume that with just one language there's a possibility of getting stuck in local optima of weights that do well on the English test set but don't reason well.
If you then take the same model size but require it to learn several languages with the same number of weights, that would eliminate a lot of those local optima, because if you don't manage to get the weights into a regime where real reasoning / deeper concepts are "understood", then it's not possible to do well across several languages with the same number of weights.
And if you speak several languages, that naturally brings in more abstraction: the concept of "cat" is different from the word "cat" in a given language, and so on.
Asking because I was looking at both Cloudflare and Bunny literally this week...and I feel like I don't know anything about it. Googling for it, with "hackernews" as keyword to avoid all the blogspam, didn't bring up all that much.
(I ended up with Cloudflare and am sure that for my purposes it doesn't matter at all which I choose.)
- The free CDN is basically unusable with my ISP, Telekom Germany, due to a long-running and well-documented peering dispute. This is not necessarily an issue with Cloudflare itself, but it means that I have to pay for the Pro plan for every domain if I want a functioning site in my home country. The $25 per domain/project adds up.
- Cloudflare recently had repeated, long outages that took down my projects for hours at a time.
- Their database offering (D1) had some unpredictable latency spikes that I never managed to fully track down.
- As a European, I'm trying to minimize the money I spend on US cloud services and am actively looking for European alternatives.
You don't have to get the Pro plan to solve the Deutsche Telekom issues. You can also use their Argo product for $5/month, but it only makes sense if your egress costs wouldn't exceed the Pro plan's pricing.
The reverse. Argo gives better peering than any paid plan. It's the reason for the product's existence. They can use more costly peering that they couldn't use with their free egress model.
Thanks for the pointer; I'm not doubting that it's true. My egress is unfortunately too large for it to make financial sense.
However, at the time I did plenty of traceroutes to confirm that the Pro plan's peering is at least better than the Free plan's for the Telekom problem. The Free plan would route traffic to NYC and back, while Pro plan traffic terminates in Frankfurt.
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into a solo-developer bootstrapped startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively, it feels like Claude Code spends an eternity to do something.
(That said, Cursor's UI does drive me crazy sometimes. In particular, the extra layer of diff review of AI changes (red/green) is not integrated into git -- I would have preferred it to instead actively use something integrated with git (staged vs. unstaged hunks). It's more important to have a good code-review experience than to remember which changes I made vs. which changes the AI made.)
For me Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. It sometimes feels like Claude Code is presented more as a YOLO option where you put more trust in the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
Same. For actual production apps I'm typically reviewing the thinking messages and code changes as they happen to ensure it stays on the rails. I heavily use "revert" to a previous state so I can update the prompt with more accurate info that might have come out of the agent's trial and error. I find that if I don't do this, the agent makes a mess that often doesn't get cleaned up on its way to the actual solution. Maybe a similar workflow is possible with Claude Code...
You can ask Claude to work with you step by step and use /rewind. It only shows the diff though, which hides some of the problem, since diffs can seem fine in isolation but have obvious issues when viewed in context.
Ya, I guess if you have the IDE open and monitor unstaged git changes, it's a similar workflow. The other Cursor feature I use heavily is the ability to add specific lines and ranges of a file to the context. Feels like in the CLI this would just be pasted text, and Claude would have to work a lot harder to resolve the source file and range.
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I had on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone come up with a proposal on how to implement something; even if it's not the perfect way, it's good as a starter.
Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing.
Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
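Roughly, the startup hook had this shape (a simplified sketch; how it gets registered, e.g. as a session-start entry in Claude Code's hooks config, is per the Claude Code docs, and the 20-file threshold is arbitrary):

    #!/usr/bin/env python3
    # Emit a project file listing (or the full source on tiny repos) so it
    # lands in the model's context at session start.
    import subprocess

    files = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    ).stdout.splitlines()

    print("Project layout:")
    for f in files:
        print(f"  {f}")

    # On very small repos, inline the code itself instead of just the paths.
    if len(files) <= 20:
        for f in files:
            print(f"\n--- {f} ---")
            print(open(f, encoding="utf-8", errors="replace").read())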
I see in your public issue tracker that a lot of people are desperate simply for an option to turn that thing off ("Automatically accept all LLM changes"). Then we could use any kind of plugin really for reviews with git.
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
I mainly use Opus as well; Cursor isn't tied to any one AI model, and Opus, Sonnet, and a lot of others are available. Of course there are differences in how the context is managed, but Opus is usually amazing in Cursor at least.
I will very quickly @-reference the parts of the code that are relevant to get the context up and running right away. Seems like in Claude that's harder...
(They also have their own model, "Composer 1", which is just lightning fast compared to the others... and it sometimes feels as smart as Opus, but now and then it doesn't find the solution if it's too complicated and I have to ask Opus to clean it up. But for simple stuff I switch to it.)
> remember which changes I made vs which changes AI made..
They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.
You'll be able to hover over lines to see whether you wrote them or an AI did. If it was an AI, it will show which model and a reference to the prompt that generated it.
Others have mentioned SVG AI tools... I've tried 3-4 over the previous days and eventually ended up with svgai.org (after having used Google Gemini for the bitmap).
You can instruct it to make edits, or say "Use SVG gradients for the windows" and so on and you can further iterate on the SVG.
It can be frustrating at times, but the end result was worth it for me.
Though for some images I've done 2-3 round trips of manual editing, Nano Banana, svgai.org...
The advantage is that it produces sane output paths that I can edit easily for final manual touches in Inkscape.
Some of the other "AI" tools are often just plain bitmap-to-vector algorithms, and the paths/curves they produce are harder to work with and also give a specific feel to the vector art.