Hacker News: stephantul's comments

It’s a serious comment, posted solely with the intent to sow doubt.

Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?

You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.

Ah ok sorry, it just looked like you were speaking on his behalf.

The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.

What if you give Opus the same harness? Do people even care about meaningful comparisons any more, or is it all just “numbers go up”?


When you're on the hunt for VC cash, "numbers go up" is the main criterion.

Would the single sentence "Imagine you are a regular computer player and accustomed to the usual elements of games" count as a harness?

Does it matter though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also, perhaps this hints at something bigger: we're wasting our time focusing on the model when we could be focusing on the harness.

Yes it matters, because it’s not a measurement of whether it accomplishes the task if a human tells it how to solve it.

...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".

Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).

But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):

> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration

but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.


I’m not saying it invalidates the result. I am saying that they knew the headline and comparison were not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly VC dollars.

Mannnn, here I thought this was going to be an informative article! But it’s just a commercial for the author’s consulting business.


Oops! That's actually out of date from a prior template I had. I don't actually consult at the moment :). Removing!


The author itself is probably AI-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone.


I work with the author; the author is definitely not AI-generated.


This is definitely a mistake! What contact section are you referring to? The only references to contact I see in this post now are at the end where I linked to my X/LinkedIn profiles but those links look right to me?


Do you have benchmarks? Naively I would compare this to Numba? But maybe I am way off the mark here


This is it. Getting something on the table for stakeholders to look at trumps anything else.


It would have taken the same time, if not less, given the extra time for mitigations, trying different optimization techniques, runtimes, etc.

One of the reasons the project was killed was that we couldn't port it to our line of low-powered devices without a full rewrite in C.

Please note this was more than a decade ago, way before Rust was the language it is today. I wouldn't choose anything else besides Rust today, since it gives the best of both worlds: a truly high-level language with low-level resource controls.


I’ve never been so conflicted about an article. It has clearly been generated by an LLM, but still has useful content. It’s a good article, but…


Same. Admitting to it is one thing, but still it takes a certain kind of attitude to outright forbid people to write tests.


Tokens saved should not be your north star metric. You should be able to show that tool call performance is maintained while consuming fewer tokens. I have no idea whether that is the case here.

As an aside: this is a cool idea, but the prose in the README and the above post seems to be fully generated, so who knows whether it is actually true.


The AI prose is getting so tiring to read:

"We measured this. Not estimates — actual token counts using the cl100k_base tokenizer against real schemas, verified by an automated test suite."


[flagged]


Are you an LLM? That would be so ironic.


I found this comment because I was wondering the same thing on a completely unrelated thread. I strongly suspect this is a bot.


[flagged]


OK, I'll stop. I am not the only person who suspected you!


[flagged]


This is such a funny interaction


This is a thinly veiled commercial, not really useful.

