Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.
Does it matter though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also perhaps this hints at something bigger, i.e.: we're wasting our time focusing on the model when we could be focusing on the harness.
...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".
Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).
But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):
> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration
but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.
I’m not saying it invalidates the result. I am saying that they knew the headline and comparison were not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly VC dollars.
The author itself is probably AI-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone.
This is definitely a mistake! What contact section are you referring to? The only references to contact I see in this post now are at the end where I linked to my X/LinkedIn profiles but those links look right to me?
It would have taken the same time, if not less, given the extra time for mitigations, trying different optimization techniques, runtimes, etc.
One of the reasons the project was killed was that we couldn't port it to our line of low powered devices without a full rewrite in C.
Please note this was more than a decade ago, way before Rust was the language it is today. I wouldn't choose anything else besides Rust today since it gives the best of both worlds: a truly high-level language with low-level resource controls.
Tokens saved should not be your north star metric. You should be able to show that tool call performance is maintained while consuming fewer tokens. I have no idea whether that is the case here.
As an aside: this is a cool idea, but the prose in the readme and the above post seems to be fully generated, so who knows whether it is actually true.