
There genuinely is a lot of interesting discussion to be had about LLMs, and I know this is true because I discuss things with my coworkers daily and learn a lot. I do admit that conversation online about LLMs is frequently lacking. I think it's a bit like politics - everyone has an opinion about it, so unfortunately online discourse devolves to the lowest common denominator. Hey guys, have you noticed that if you use LLMs frequently it's possible you'll forget to think critically?

But "there can't be any interesting discussion about AI programming" is completely false.


For me, almost every single time a conversation like this happens in real life it boils down to the one side claiming that "This is the future" and "Don't get left behind" followed by a torrent of hype and buzzwords. So no, there is no interesting conversations to be had about LLM programming anymore.

> "This is the future"

Yeah, that's silly, it is already the present!

Some interesting conversation one can have with coworkers specifically:

1. How should code review and responsibility for code be updated to a) increase velocity, b) keep quality, and c) keep reviewers from burning out? There are plenty of scenarios in which vibe coding a component in an afternoon is the correct choice, even if it is buggy, insecure, and no one really understands it.

2. Which parts of the codebase work well with code assistants, which don't? Why? What could be changed to make it easier? In my experience, Claude Code sometimes loses its mind on infra topics. It is also not very good at complex, interconnected services (humans aren't either).

3. Which tasks could be offloaded to agents to save everyone time and sanity? Creating Jira tickets from meeting transcripts is an obvious one; collecting and curating bug reports is another.

4. How should we design systems to better work for coding agents? Does it influence our tech choices? Should it influence them?

5. Is AI a net positive or negative for security?

And so much more. The last topic in particular is incredibly important, and things are developing so fast that you can probably have a new conversation on it every two weeks.


Maybe you struggle to have good conversations because I just provided an anecdote and you immediately stated that my anecdote is false? If this is how you typically interact with people I’m not surprised you’re not having interesting conversations.

He didn't state your anecdote is false, the first two words of his comment are "for me". That means in his experience, not yours.

Ironically, in your crusade to wave the "I'm being censored!" flag, the only person who is trying to do any censoring... is you!

And to top it off, as if that wasn't enough, you're also incredibly snarky and basically implying that this person who just vaguely disagrees with you must be unlikable or something. Which, in another twist of irony, actually makes YOU appear unlikable, because what well-adjusted adult would feel the need to throw someone under the bus for slightly disagreeing with them?


The problem is that mrcsharp added the last sentence: "So no, there is no interesting conversations to be had about LLM programming anymore." So they are definitely trying to turn their anecdote into a universal.

Most of your comments about johnfn are still apropos, though...


IMO, since it started so clearly as an opinion/anecdote, that last part should be taken in that context, not universally.

He said:

> "there is no interesting conversations to be had about LLM programming anymore"

The first sentence was in his experience, but this is a universal assertion. He is claiming no one, in the world, is having interesting conversations about LLMs.

Is my response really so off-base? Imagine you say "I like Rust because it made my app go fast" and someone replies "There is no one who has used Rust to improve performance." Do you really think that's a normal way to respond to someone sharing an anecdote?


But that's not how he responded because there's a whole ass comment before that.

Okay sure, if you read the comment and then use a Men in Black mind wiper thingy before reading the last sentence, then it might seem brazen or universal. But that's not what you did.

The only way that last comment can reasonably be taken to mean "for everyone on Earth" is if you did not read the lines before it. Because, in that context, to me, it's clear he is only talking about his experience.

This is a phenomenon I've noticed lately: everyone feels the need to add a disclaimer to everything, and not doing so is seen as an "aha, gotcha!" type thing. But we're not algorithms. You do not read one line at a time and then digest it.

You're human, he's human, and there's context. You know that it would be extremely unreasonable for someone to think that nobody, anywhere, has anything to say about LLMs right? Okay. That doesn't mean that this person is being unreasonable.

It means that that's probably not what he meant.


Read his comment in context. The comment thread, condensed, is:

mudkipdev says "There can't be any interesting discussion about AI programming". We agree that is a universal claim. (Right? I mean, your comment says "You know that it would be extremely unreasonable for someone to think that nobody, anywhere, has anything to say about LLMs right" but isn't this a clear example?)

I say "There can be". We agree that is anecdotal.

mrcsharp says "For me, [stuff that supports mudkipdev]. Therefore, there is no interesting discussion".

He is re-asserting mudkipdev's point. mudkipdev says A, I say !A, he says, actually, A.

Your interpretation has him read the back-and-forth between me and mudkipdev, and respond to "A", "!A" with "B". If you only read mrcsharp and nothing else in the thread I can understand this reading, but the context changes things.


Brilliantly mirrored! Unfortunately there are far more people like this than I would have ever imagined pre-AI.

    if you use LLMs frequently it's possible you'll forget to think critically?
Nowadays, you can have a sub-agent to think critically for you. ;)

Are you upset about Calvin and Hobbes being a reference to John Calvin and Thomas Hobbes? Probably not? I think OP is asking an interesting question and you are being unnecessarily combative.

IMHO naming characters after famous people is more likely to be considered an "honest" homage than naming a company after them...

Are you upset that Tesla is named Tesla? Probably not? A lot of people are angry about Tesla and I think even then I haven't ever heard of that particular complaint.

This is written by an LLM. Also, it doesn't make sense:

> 57K lines, 0 tests, vibe coding in production

Why on earth would you ship your tests?


"Why would you ship tests?" — Fair point. Source maps only include production bundle files — tests wouldn't appear in the map regardless. Tests may well exist in Anthropic's internal repo, and we can't claim otherwise. However, the bugs we found speak for themselves: a watchdog that doesn't protect the most vulnerable code path for 5+ months, a fallback with telemetry that never executes where it's needed, Promise.race without catch silently dropping tool results. If tests exist, they clearly don't cover the streaming pipeline adequately — these are the kind of issues that even basic integration tests would catch.

You're not beating the "written by an LLM" allegations.

I write it myself; the agent only translates it into English.

This is against the hacker news guidelines[1]:

> Don't post generated comments or AI-edited comments. HN is for conversation between humans.

[1]: https://news.ycombinator.com/newsguidelines.html


It's just Claude bragging about being the first AI whistleblower.

Surely "so frustrating" isn't explicit content?

It is absolutely not paranoia. People are distilling Claude code all the time.

Is this true? Non-reasoning LLMs are autoregressive. Reasoning LLMs can emit thousands of reasoning tokens before "line 1" where they write the answer.

They are all autoregressive. They have just been trained to emit thinking tokens like any other tokens.

reasoning is just more tokens that come out first wrapped in <thinking></thinking>

there are no reasoning LLMs.

This is an interesting denial of reality.

A "reasoning" LLM is just an LLM that's been instructed or trained to start every response with some text wrapped in <BEGIN_REASONING></END_REASONING> or similar. The UI may show or obscure this part. Then when the model decides to give its "real" response, it has all that reasoning text in its context window, helping it generate a better answer.
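Mechanically, that's all the structure amounts to; here's a toy sketch of splitting such a response (the tag names mirror the hypothetical ones above, not any real provider's API, and the helper is invented for illustration):

```python
def split_reasoning(response: str,
                    open_tag: str = "<BEGIN_REASONING>",
                    close_tag: str = "</END_REASONING>") -> tuple[str, str]:
    """Separate the reasoning span from the final answer in a raw response.

    The model emits reasoning tokens autoregressively like any others;
    whether the span is displayed is purely a UI decision.
    """
    start = response.find(open_tag)
    end = response.find(close_tag)
    if start == -1 or end == -1:  # no reasoning span present
        return "", response.strip()
    reasoning = response[start + len(open_tag):end].strip()
    answer = response[end + len(close_tag):].strip()
    return reasoning, answer

raw = "<BEGIN_REASONING>Check both cases first.</END_REASONING>The answer is 6."
thoughts, answer = split_reasoning(raw)
print(thoughts)  # Check both cases first.
print(answer)    # The answer is 6.
```

The key point is that the "real" response is generated with all that reasoning text already in the context window, which is what helps produce a better answer.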

The section on "artificially low costs" does not make a lot of sense to me. If anything I feel like the costs are inflated for the frontier models, not "artificially low". Easy proof: GLM-5 costs about 1/10 as much as Opus. I'm not going to tell you it's as good as Opus 4.6 -- it's not -- but it performs comparably to where frontier models were 6 months ago. (It's on par with Sonnet 4.5 on leaderboards, though in practice it's probably closer to Sonnet 4.0.)

If I can switch to an open source model today, run it myself, and spend 1/10 as much as Opus, and get to about where frontier models were 6 months ago, fear-mongering about how we'll have to weather "orders-of-magnitude price hikes" and arguing that that one shouldn't even bother to learn how to use AI at all seems disconnected from reality. Who cares about the "shady accounting" OpenAI is doing, or that AI labs are "wildly unprofitable"? I can run GLM 5 right now, forever, for cheap.


The post is factoring in training costs, not just inference.

No it's not. Otherwise this part doesn't make sense

> in fact, they actually compound the problem by encouraging significantly more usage

because if eliminating training costs makes running the model above cost, the problem is helped by significantly more usage not compounded.

More usage compounds the problem only if inference is unprofitable.

(the article briefly mentions training but that's later).


It made sense to me understanding that you can have a unit-profitable API but lose money on loss-leading campaigns like Code subscriptions. Those losses are amplified by encouraging usage. Perhaps I'm mistaken.

Again, that is a statement about inference time costs, not training costs.

> More usage compounds the problem only if inference is unprofitable.

No... only if you're charging full freight for that inference. As I said above, loss-leading caps are in play here. Obviously, encouraging people to use more of basically anything that is an all-you-can-eat subscription leads to less profitability. Not sure if we're talking past each other or what.


We are kind of talking past each other. I'm saying something simpler. This all goes back to the original point I made in reference to your reply to johnfn:

>> The post is factoring in training costs, not just inference.

It is not because training costs are irrelevant here. Training costs do not cause your costs to go up as you accumulate more users.

None of these calculations we're talking about include training costs. You're saying that inference is unprofitable (at least given the subscription plans). I'm simply pointing out that we are talking about inference not training as you stated earlier. You are (very accurately) not talking at all about training costs.


But I don't need to pay training costs to use GLM-5?

Sure, but somebody needs to pay for GLM-6 unless you're happy to stop here.

If everybody stopped training models today and Anthropic and OpenAI were deleted from the universe, I'd be happy to just keep using GLM-5 at its current inference cost. The article's author assumes that there will be a point where we will no longer have access to good models at reasonable cost because current models are subsidized, but GLM-5 disproves that.

Even in this hypothetical future, I will continue to use frontier models until they become "orders of magnitude more expensive", at which point I'll just fall back to the best open source model, which will still only be about 6 months behind. I don't see where the issue is?

I think Sora is an excellent way to see how people's beliefs clash with reality. Even in this post, I see people likening Sora to unveiling "a weapon", it filling them with "bland dread", or comparing it to creating "killing robots". But now that Sora is being shut down, what impact did Sora actually have on society, other than getting a couple of people to waste their time making some funny meme videos? Did any of those negative externalities actually play out?

If you are autistic, I feel that it causes you to see reality more accurately than most here on this thread.


At least according to the Head of Product at X, Sora was by far the most widely used tool to create fake war videos[0] aiming to push various false narratives. Given how popular fake content is at Meta I can only imagine what they see there (if they even have anybody looking at this kind of thing).

[0] https://x.com/nikitabier/status/2029024577624650041


On X, viewing actual war footage was locked behind age-gating and identity verification, while any idiots' fake war footage was uncensored and consumable by anyone.

I understand that misinformation is a bad thing, and your point is taken that I was probably too quick to brush off the worst thing that Sora did as 'some funny memes'. But still. Photoshop is used to make a lot of misinformation, probably 1000x to 10,000x as much as Sora did, or even more than that. Does anyone say the latest version of Photoshop is like unveiling a weapon? Does anyone say that AI driven generative fill in Photoshop is like creating killing robots?

Sora was one of the earliest demos of a "wow okay that is good enough to be mistaken for real" GenAI model, which is what that comment was referencing with the "weapon" reference (the tech behind it not just Sora™ Videos).

Sure, by the time they productized it, Sora was no longer SOTA thanks to the AI arms race. And ultimately positioned as a TikTok for Slop with an annoying watermark so didn't take the world by storm on its own.

But since it was unveiled GenAI videos as a whole have become commonplace everywhere else on the internet, with plenty of negative impact already in terms of spam or manipulation, and we're barely in year 2 so far.


As someone who generally liked the products that OpenAI puts out, I think Sora was their first product that I really didn't like. I liked GPT primarily because I felt like it respected me: I never felt like it was trying to distract me from my work or get me to waste time doomscrolling. Its primary value proposition wasn't to trick me with addictive content, but to get me high-quality answers as fast as possible. And I felt like OpenAI's other products, like Deep Research, agent mode, etc., were the same way. Even Atlas, although I suspect it will be equally ill-fated, attempts to follow this same pattern. It really felt like OpenAI was separating themselves from the common popular apps like TikTok, Reddit, Instagram, etc., which seemed to exist entirely to distract me from things I care about and waste my time.

Sora was the first product OpenAI shipped where I felt that fell into that second category, and for that I was very disappointed. You have all those GPUs, and the most incredible technology in the world, and the most brilliant engineers, and all you can think to do with them is to make an app that just makes meme videos? I mean, c'mon!

Still, I am mystified by how rapidly Sora went from launch to shutdown. Does anyone have any guess what happened there? Even if Sora wasn't a spectacular success, it seems to me like subsequent model improvements could have moved the needle - shutting it down so soon seems premature. I mean, what if this is the equivalent of making ChatGPT with GPT 3?


> I liked GPT primarily because I felt like it respected me: I never felt like it was trying to distract me from my work or get me to waste time doomscrolling

i recently used gpt for the first time in several months (i'm a daily claude user) and didn't find this at all. it is most certainly trying to pull you into engagement with how it ends each response. "if you want, i could tell you about this thing that's relevant to what you are discussing and tease just enough so that you addictively answer yes"


What happened is that they make no money, because people use it en masse to generate videos that they then post on TikTok and Instagram; nobody actually doomscrolls Sora.

Hosting videos is really expensive. AI video generation inference is really expensive. I'd love to see how much money this experiment cost.

So much that they walked away from a billion dollar deal with Disney by dropping Sora.

It's not clear to me what that billion-dollar deal meant.

To me it seems it was "Disney gets shares and we get to use their characters in Sora".

Even if Sora breaks even, why would you gift Disney stock? It's not like they actually gave $1B to OpenAI.


I don't think anyone outside of Disney/ClosedAI knows what deal was actually made. Maybe they just shut down public use of Sora but Disney will still be able to use it internally? Maybe they never even signed anything, as is too often the case with AI deals, especially big ones, where we read about signed/inked deals that turn out to have been nothing but spoken words. Maybe they took the cash, then shut Sora down to save money? Any number of things could have happened that we might never know about.

Hosting videos is not that expensive compared to generation and inference costs. It's not cheap, but it's not that horrible.

> I liked GPT primarily because I felt like it respected me: I never felt like it was trying to distract me from my work or get me to waste time doomscrolling.

Not about Sora, but about ChatGPT. I felt the same way for quite a while until I noticed that its response pattern has changed, apparently aiming for higher engagement. Someone aggressively pursued a metric.

At some point, ChatGPT started leaving annoying cliffhangers in its every response, like "Do you want me to share a little-known secret of X that professionals often use?" Like, come on!


> I liked GPT primarily because I felt like it respected me: I never felt like it was trying to distract me from my work or get me to waste time doomscrolling. Its primary value proposition wasn't to trick me with addictive content, but to get me high quality answers as fast as possible.

I'm curious if you still feel this way about current iterations of ChatGPT? It seems like it's now primed to engagement-bait the user, especially when used through the web UI. You can ask it a simple question with a straightforward answer and it will still try to get you to follow up with more.

> What is the minimum thickness for Shimano M8100 disc brake rotors?

> For Shimano XT M8100-series rotors (like RT-MT800 / RT-MT900 commonly used with M8100 brakes), the minimum thickness is 1.5 mm. If the rotor measures 1.5 mm or thinner, Shimano says it should be replaced.

> (a bunch of pointless details in bullet points)

> If you want, tell me the exact rotor model (e.g., RT-MT800, RT-MT900, size), and I can confirm the spec for that specific one and what typical wear looks like.

The entire query could have been answered with "1.5mm". The "if you want" follow ups are so annoying.


"I am mystified by how rapidly Sora went from launch to shutdown"

I suspect they promised synthetic movies but it quickly became clear that they were never going to be able to deliver on this.

Slick fifteen second lulz-clips, sure, but I don't think they can make several of them consistent enough to fit into a larger video narrative without the audience finding it jarring and incoherent.

Perhaps legal at Disney also concluded that the output wouldn't be possible to copyright, which is their core business.


Every studio that made video content using AI video generation — think those all come commercials— basically just generated and regenerated the same few-second clips until they got an acceptable one. Hundreds and hundreds of times. I would be astonished if it would have been cheaper than actual CGI had the generation not been so heavily subsidized, and the product sucked.

*Coke commercials

> Still, I am mystified by how rapidly Sora went from launch to shutdown. Does anyone have any guess what happened there?

My guess is they over committed server/energy resources, since they were generating ~30 images per frame of 1 second of video for results that may be discarded and then tried again.

Now that energy costs are increasingly unpredictable because of the war, they're prioritizing what is sustainable. They were willing to blow up the $1 billion Disney deal for Sora because that popular IP would have increased discarded server time.


I'm also curious if Sora has been used by Iran to generate those Lego propaganda videos critical of the President. Given how close Sam Altman is with the current administration, I wouldn't be surprised if Sora is now reserved for U.S. government propaganda only.

Might be why the latest Iran propaganda video could be created in PowerPoint: https://bsky.app/profile/rachelbitecofer.bsky.social/post/3m...


Are there known tells that could be used to determine which model the video came from?

(This sort of question, and the Grok sexual abuse, is why I'd like to see mandatory invisible watermarks on generated images/video)


I don't think so. There are tons of self hosted models for video (they are smaller and easier to run).

Most people serious about this stuff usually have their own pipelines.


I'm not sure, but you could be right. Sora is/was the top-of-the-line platform for video generation, and the Lego IP videos were polished. Makes sense to outsource when your own energy grid is being destroyed. Anyone with an account and VPN could utilize the platform.

I'd like to know what self hosted models they've been using, if any, and who provided them, trained on Lego IP.


Since you seem to be better informed, I'm also interested in what self hosted models for video you recommend for creating my own Lego movie clips now that Sora is no longer an option for a paid service. There's tons, right?

Look up Wan and Hunyuan for starters.

These are open weight models, so you can fine tune them on Lego content… But presumably they already have enough training data since they were made by Chinese companies who don’t give a shit about Western IP rights.


> Still, I am mystified by how rapidly Sora went from launch to shutdown

I think if you had to foot the bill for generating a bajillion gigabytes of slop with no real utility, you wouldn't be too mystified.

They showed off their technology and proved it was impressive. That's all it had to do.


For me, Sora changed the way I viewed Sam Altman as a person.

I really thought he wasn't like the previous generations of tech leaders - as you mentioned OpenAI (with him in charge) seemed to be genuine about making a product that could improve people's lives.

He'd go on podcasts and quite convincingly talk about how ChatGPT could prevent real-world harm like suicide, and possibly even contribute to curing disease too.

Then they drop this and it just doesn't gel. So much of what they've done since has just doubled down on the Zuck-esque scumminess and greed too.

Part of me still sees Dario as genuine in the way that Sama seemed back in 2024, but I'm sure once he has enough investor pressure he'll cave the same way too.


> He'd go on podcasts and quite convincingly talk about how ChatGPT could prevent real world harm like suicide, and possibly even contribute to helping disease too.

He is a con man. Of course he’s charming and convincing, that’s how he ended up where he is. But he’s just as full of it as Musk when he was waxing lyrical about saving the world and going to Mars. They lie very convincingly.


Sam Altman made his stake at the table with a shady and failed location data harvesting app (https://en.wikipedia.org/wiki/Loopt). That's who he is, that's what he does, and we're all better off paying less attention to the sounds he emits, and more to the things he does.

> the things he does.

The thing he does is convince investors to give him billions of dollars to build what he wants. Where exactly does that leave us?


A fool and his money shall soon be parted. Sam is a face. If it wasn't him, it would be someone else.

Multiple people have attested that Sam Altman is extremely charming (especially in more casual, intimate settings) and talks very nobly about his goals, but his actual work is just…all kinds of awful. And I think that charm only goes so far as it seems clear that people are starting to demand that OpenAI actually match its words with work it cannot produce.

I think his board fight within OpenAI where he essentially lied to the board, his obsession with retinal-scanning everyone for his biometric cryptocurrency (Worldcoin), and how he left Y Combinator are just evidence that he's not very heroic. Most cringe to me is that he and many others seem aware that what they are doing is corrosive and harmful to society on some level, as Altman has admitted to having a bunker somewhere around Big Sur [0]. Which…WTF.

[0] https://www.newyorker.com/magazine/2016/10/10/sam-altmans-ma...


> how he left Y Combinator

Not too familiar with that history, but he still is listed as a courtesy credit/reviewer at the end of PG's blog entries, so I assume he didn't have too much of a bad exit?


We'll never know exactly what transpired, but I think the existing evidence is clear that as President of Y Combinator he should not have also been as involved in OpenAI as he was.

This is a conflict of interest, and I think a very obvious one. He tried to have it both ways and was forced to choose in the end. I think putting himself in that situation, rather than resigning up front to pursue his OpenAI ambitions, says a lot about his character.


He is a conman, and potentially a terrible person (look for it)

> ChatGPT could prevent real world harm like suicide

It could prevent suicide, maybe, but we know that it does cause suicides, at least in some cases. Seems like a poor value proposition.


I haven't followed him much as I really don't care, but the one clip I've seen of him that really stands out to me (I've seen more but this is the one I remember) is one where he's talking to some guy who doubts the LLMs genius, and Sam says something like "what if ChatGPT solved quantum gravity, would you be convinced then?"

To me, this just came off as pathetic. It hasn't solved anything and there's no reason to believe it ever will. The whole question is completely pointless, except to put the idea in viewers' heads that ChatGPT will soon revolutionize science, with no actual substance behind it. It's not even a question; there's only one possible answer. He's holding the guy verbally hostage just to manipulate dumb viewers.

So anyway that's the only memorable clip I've seen of Sam Altman, and based on that alone, fuck that guy.


The most memorable clip I've seen of him was the one from Brad Gerstner's podcast (Gerstner is an OpenAI investor). Gerstner questioned Altman about the financials of OpenAI: how could it have committed to spend so much given its revenue? It's a decent question, and one that's been up in the air for a while across the media.

Altman's reaction was very telling of the kind of person he is, just immediately lashing out at Gerstner in a childish way, asking if Gerstner wanted to sell his shares because he could find a buyer in no time.

It was a pathetically immature reaction, I wouldn't expect that from any kind of professional, even less someone who has held positions as Altman has and now sits at the top of the leadership for a company sucking hundreds of billions of investment.

Apart from that clip there's also the whole saga of sama @ Reddit, full of lies, deceptions, and the same kind of immature attitude peppered across Reddit itself.


> Gerstner questioned Altman about the financials of OAI

After glazing OpenAI and Sam personally for 45 minutes straight. But as soon as Sam was questioned in the slightest, he exploded.


My most memorable clip was when he was interviewed about the "suicide" of an ex-employee and Sama lied through his teeth. I can't understand people who say this snake is "charming"... he's a bad liar and has sub-zero charisma.

https://www.youtube.com/watch?v=zrgEZ8FeZEc


> It was a pathetically immature reaction, I wouldn't expect that from any kind of professional, even less someone who has held positions as Altman has and now sits at the top of the leadership for a company sucking hundreds of billions of investment.

If you're familiar with nepobaby brats and narcissists, this is not surprising.


> He's holding the guy verbally hostage just to manipulate dumb viewers.

Why? The other person can say "Yes". That doesn't mean ChatGPT has the capability to do it?


That's the point. The other guy can only say yes: if ChatGPT solved a hard problem and improved our understanding of the universe, there would be no discussion as to its capability to do so.

"No" is not a reasonable answer to the question. It's like asking an atheist, "If God and Jesus and all the angels came to earth and showed themselves for all to see, would you believe in God then?" Well, yes, of course; I believe in all the things we can all see. The lack of evidence is the whole point.

So asking "if there was evidence, would you think differently?" is either a fundamental misunderstanding of the person's position, or just a cheap ploy to manipulate people. In Sam's case I'm thinking it was the latter. He's a clever guy; he knows he's on camera. He asked that question just to plant the idea in people's minds. Not in the mind of the guy he was talking to, who didn't even need to answer the question because, as already said, there's only one answer to it, but in everyone watching. Sam basically just put it out there that ChatGPT solving quantum gravity is within the realm of possibility. Which it probably isn't.


Fair, thanks for explaining

Thinking that Scam Altman of Worldcoin etc. fame was "genuine about making a product that could improve people's lives" seems like a strange kind of delusion.

I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)

You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.

In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.

Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.


If only 5-10 people have ever tried to solve something in programming, every LLM will start regurgitating your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...), but you can spend upwards of 100 million tokens in gemini-cli or Claude Code and still not make any progress.

It's, after all, still a remix machine: it can only interpolate between things that already exist. That's good for a lot of tasks, considering everything is a remix, but it can't do truly new ones.


What is a "truly new task"? Does there exist such a thing? What's an example of one?

Everything we do builds on top of what's already been done. When I write a new program, I'm composing a bunch of heuristics and tricks I've learned from previous programs. When a mathematician approaches an open problem, they use the tactics they've developed from their experience. When Newton derived the laws of physics, he stood on the shoulders of giants. Sure, some approaches are more or less novel, but it's a difference in degree, not kind. There's no magical firebreak to separate what AI is doing or will do, and the things the most talented humans do.


The phrase "everything is a remix" was highlighted for a good reason: there's a documentary of the same name, and I can certainly recommend it.

At the same time, there are things that are truly novel. Even if the idea is based on combining two common approaches, the implementation might need to be truly novel, with new formulas and new questions that arise from those. AI can't help there, speaking from experience.


That's why context management is so important. AI not only gets more expensive if you waste tokens like that, it may perform worse too.

Even as context sizes get larger, this will likely stay relevant. Especially since AI providers may jack up the price per token at any time.
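To make the idea concrete, here's a minimal sketch of one common context-management tactic: keeping only the newest messages that fit a token budget. The whitespace-based token count and the function names are assumptions for illustration; real APIs use their own tokenizers and message formats.

```python
# Toy context trimming: keep the most recent messages that fit a token budget.
# Whitespace splitting stands in for a real tokenizer (an approximation).

def rough_token_count(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined rough token count fits the budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = rough_token_count(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["first long message about setup details", "short follow-up", "latest question"]
print(trim_history(history, budget=5))  # -> ['short follow-up', 'latest question']
```

Dropping the oldest messages first is the simplest policy; summarizing them instead is a common refinement when earlier context still matters.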


Try the refactor again tomorrow. It might have gotten easier or more difficult.

You're glossing over the fact that mathematics uses only one token per variable (`x = ...`), whereas software engineering best practices demand an excessive number of tokens per variable for clarity.

It's also a pretty silly thing to say difficulty = tokens. We all know line counts don't tell you much, and it shows in their own example.

Even if you did have math-like tokenisation, refactoring a thousand lines of "X=..." to "Y=..." isn't a difficult problem, even though it would take at least a thousand tokens. And if you could come up with E=mc^2 in a thousand tokens, that wouldn't make the two tasks remotely comparable in difficulty.
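The thousand-line rename above can be made concrete: it is a token-heavy edit that takes no intelligence at all, just a mechanical substitution. A sketch (the variable names and line count are illustrative):

```python
# A thousand-token edit with zero difficulty: mechanically rename x to y
# across a thousand lines of "x = ..." assignments.
source = "\n".join(f"x = {i}" for i in range(1000))
renamed = source.replace("x =", "y =")
print(renamed.count("y ="))  # -> 1000
```

One line of code performs the whole "thousand-token" task, which is the point: token count measures output length, not problem difficulty.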


> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)

The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.

You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from English into some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.


I think it's more of a data vs intelligence thing.

They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).

Certain problems are in between the two (a React refactor probably sits there). So no, tokens are probably not a good proxy for complexity; data-heavy problems will trivially outgrow the former category.


I don't think so. I went through the output of Opus 4.6 vs GPT 5.4 pro. Both were given different directions/prompts. Opus 4.6 was asked to test and verify many things. Opus 4.6 tried many different approaches, and its chains of thought were more interesting to me.

You might be joking, but you're probably also not that far off from reality.

I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy, and the significance of the problems is opaque to most.

We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.


I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math "proofs" that 1 + 1 = 3.

If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
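The strawberry example is worth spelling out, since it shows how disconnected token expenditure is from actual difficulty. In code, the task an LLM agonizes over is a single deterministic call:

```python
# The task an LLM may burn hundreds of reasoning tokens on is one
# deterministic string operation in code.
count = "strawberry".count("r")
print(count)  # -> 3
```

LLMs struggle with this not because it is hard, but because they see subword tokens rather than individual letters, so letter-level questions fall outside their natural representation.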


You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.

A lot of math problems/proofs are like minesweeper or sudoku in a way though. They're a long series of individually kinda simple logical deductions that eventually result in a solution. Some really hard problems are only really hard because each one of those "simple" deductions requires you to have expert knowledge in some disparate area to make that leap.

Some thoughts.

1. LLMs aren't "efficient": they seem as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.

2. LLMs aren't "efficient": they use the same amount of compute for each token, but sometimes all that compute is making an interesting decision about which token comes next, and sometimes there's really only one follow-up to the phrase "and sometimes there's really only", so that compute is clearly unnecessary.

3. A (theoretically) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things, like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute", that's never going to show up.

Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.


I fear that under those constraints, the only optimal output is “42”

This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:

1. Knowing how to state the problem. I.e., go from the vague feeling of "I don't like this, but I do like that" to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement means.

2. Verifying that the proposed solution actually is a full solution.

This math problem actually illustrates both of them really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work -- manually reviewing it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".
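The verification step mentioned above can in principle be mechanized by a proof assistant. As a toy illustration (shown in Lean rather than Coq; the statement is deliberately trivial), a machine-checked fact looks like:

```lean
-- The proof assistant, not a human reader, confirms this statement holds.
-- `rfl` closes the goal because both sides reduce to the same value.
theorem one_plus_one : 1 + 1 = 2 := rfl
```

Formalizing a real open-problem solution this way is exactly the "ton of background work" the comment describes: translating informal arguments into statements a checker can verify is often a project in itself.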


>The details about human involvement are always hazy and the significance of the problems are opaque to most.

Not really. You're just in denial and not really all that interested in the details. This very post has the chat transcript of the solution.


I mean the details are in the post. You can see the conversation history and the mathematician survey on the problem
