stephantul · 2026-03-28T11:53:14 1774698794

It’s a serious comment, posted solely with the intent to sow doubt.

stephantul · 2026-03-27T18:26:23 1774635983

Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?

scottmu · 2026-03-27T19:43:20 1774640600

You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.

stephantul · 2026-03-27T20:55:40 1774644940

Ah ok sorry, it just looked like you were speaking on his behalf.

stephantul · 2026-03-27T06:44:07 1774593847

The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.

What if you give Opus the same harness? Do people even care about meaningful comparisons any more, or is it all just “numbers go up”?

cbg0 · 2026-03-27T06:53:49 1774594429

When you're on the hunt for VC cash, "numbers go up" is the main criterion.

fzeindl · 2026-03-27T09:03:24 1774602204

Would the single sentence „Imagine you are a regular computer player and accustomed to the usual elements of games“ count as a harness?

ting0 · 2026-03-27T08:21:57 1774599717

Does it matter though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also perhaps this hints at something bigger, i.e.: we're wasting our time focusing on the model when we could be focusing on the harness.

stephantul · 2026-03-27T18:50:46 1774637446

Yes it matters, because it’s not a measurement of whether it accomplishes the task if a human tells it how to solve it.

versteegen · 2026-03-27T12:03:36 1774613016

...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".

Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).

But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):

> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration

but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.

stephantul · 2026-03-27T18:52:29 1774637549

I’m not saying it invalidates the result. I am saying that they knew the headline and comparison was not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly vc dollars.

stephantul · 2026-03-23T15:13:26 1774278806

Mannnn, here I thought this was going to be an informative article! But it’s just a commercial for the author’s consulting business.

sbpayne · 2026-03-23T15:14:55 1774278895

Oops! That's actually out of date, from a prior template I had. I don't actually consult at the moment :). Removing!

halb · 2026-03-23T15:44:03 1774280643

The author itself is probably AI-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone.

CharlieDigital · 2026-03-23T16:08:45 1774282125

I work with the author; the author is definitely not AI-generated.

sbpayne · 2026-03-23T16:07:09 1774282029

This is definitely a mistake! What contact section are you referring to? The only references to contact I see in this post now are at the end where I linked to my X/LinkedIn profiles but those links look right to me?

stephantul · 2026-03-21T15:56:50 1774108610

Do you have benchmarks? Naively I would compare this to Numba? But maybe I am way off the mark here

stephantul · 2026-03-21T08:27:01 1774081621

This is it. Getting something on the table for stakeholders to look at trumps anything else.

ameixaseca · 2026-03-21T15:41:41 1774107701

It would have taken the same time, if not less, given the extra time for mitigations, trying different optimization techniques, runtimes, etc.

One of the reasons the project was killed was that we couldn't port it to our line of low powered devices without a full rewrite in C.

Please note this was more than a decade ago, way before Rust was the language it is today. I wouldn't choose anything other than Rust today, since it gives the best of both worlds: a truly high-level language with low-level resource controls.

stephantul · 2026-03-18T17:57:34 1773856654

I’ve never been so conflicted about an article. It has clearly been generated by an llm, but still has useful content. It’s a good article, but…

stephantul · 2026-03-10T19:52:37 1773172357

Same. Admitting to it is one thing, but it still takes a certain kind of attitude to outright forbid people from writing tests.

stephantul · 2026-03-09T07:15:43 1773040543

Tokens saved should not be your north star metric. You should be able to show that tool call performance is maintained while consuming fewer tokens. I have no idea whether that is the case here.

As an aside: this is a cool idea, but the prose in the readme and the above post seems to be fully generated, so who knows whether it is actually true.
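
To make the point about token savings concrete, here is a rough sketch of the kind of comparison I mean. Everything in it is hypothetical: the `EvalResult` fields, the `compare` helper, the 1% tolerance and the numbers are made up for illustration and have nothing to do with the project's actual test suite.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Aggregate result of one tool-schema variant over a fixed task set (hypothetical)."""
    name: str
    total_tokens: int    # tokens consumed across all tasks
    correct_calls: int   # tool calls that matched the expected call
    total_calls: int     # tool calls attempted

    @property
    def accuracy(self) -> float:
        # Guard against an empty run instead of dividing by zero.
        return self.correct_calls / self.total_calls if self.total_calls else 0.0


def compare(baseline: EvalResult, candidate: EvalResult,
            max_accuracy_drop: float = 0.01) -> dict:
    """Report token savings together with the change in tool-call accuracy."""
    token_savings = 1 - candidate.total_tokens / baseline.total_tokens
    accuracy_delta = candidate.accuracy - baseline.accuracy
    return {
        "token_savings": token_savings,
        "accuracy_delta": accuracy_delta,
        # Savings only count if accuracy stays within the allowed tolerance.
        "acceptable": token_savings > 0 and accuracy_delta >= -max_accuracy_drop,
    }


# Made-up numbers, purely illustrative.
baseline = EvalResult("full schemas", total_tokens=100_000, correct_calls=95, total_calls=100)
same_quality = EvalResult("compressed", total_tokens=60_000, correct_calls=95, total_calls=100)
worse_quality = EvalResult("over-compressed", total_tokens=40_000, correct_calls=80, total_calls=100)

print(compare(baseline, same_quality)["acceptable"])
print(compare(baseline, worse_quality)["acceptable"])
```

The second candidate saves more tokens but would be rejected, which is the whole point: the savings number means little without the accuracy number next to it.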

rakag · 2026-03-09T09:44:20 1773049460

The AI prose is getting so tiring to read

"We measured this. Not estimates — actual token counts using the cl100k_base tokenizer against real schemas, verified by an automated test suite."

hrmtst93837 · 2026-03-09T09:00:50 1773046850

[flagged]

stephantul · 2026-03-09T13:32:30 1773063150

Are you an llm? That would be so ironic

danlitt · 2026-03-09T14:38:50 1773067130

I found this comment because I was wondering the same thing on a completely unrelated thread. I strongly suspect this is a bot.

hrmtst93837 · 2026-03-09T14:52:16 1773067936

[flagged]

danlitt · 2026-03-09T14:57:41 1773068261

ok, I'll stop. I am not the only person who suspected you!

hrmtst93837 · 2026-03-09T15:12:49 1773069169

[flagged]

stephantul · 2026-03-09T15:25:10 1773069910

This is such a funny interaction

stephantul · 2026-03-08T13:51:50 1772977910

This is a thinly veiled commercial, not really useful.