Hacker News | past | comments | ask | show | jobs | submit | HereBePandas's comments

Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to have been run on a standard eval harness.

It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.


All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI.

What do you mean by "standard eval harness"?


I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI.


Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)?


Yes, two things: 1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1. 2. More importantly, GPT-5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for it. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.

Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.


> But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.

That'd be a bad idea: models are often trained for specific tools (GPT-5.1 Codex is trained for Codex CLI, and Sonnet has been trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, as they all work differently.

Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will get gamed instead.


[comment removed]


The reported results where GPT 5.1 beats Gemini 3 are on SWE Bench Verified, and GPT 5.1 Codex also beats Gemini 3 on Terminal Bench.


You're right on SWE Bench Verified, I missed that and I'll delete my comment.

GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically when run on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is the Codex-specific harness vs. the model). Looking forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.




They explicitly address this in page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.


I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.


I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.


I'm not saying it applies to the new architecture; I'm saying that's a big issue I've observed in existing models, and so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).


Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?

What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
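The sweep above could be sketched like this (a toy harness to make the idea concrete; `ask_model` is a hypothetical stand-in for a real LLM API call, stubbed here so the harness itself runs):

```python
import random

def build_prompt(question: str, filler_tokens: int) -> str:
    """Pad the prompt with irrelevant filler words to hit a target saturation."""
    filler = " ".join(random.choice(["lorem", "ipsum", "dolor"])
                      for _ in range(filler_tokens))
    return f"{filler}\n\nQuestion: {question}\nAnswer:"

def saturation_sweep(qa_pairs, context_budget, ask_model, levels=(0.1, 0.5, 0.9)):
    """Run the same QA set at several saturation levels; return {level: accuracy}."""
    results = {}
    for level in levels:
        filler_tokens = int(context_budget * level)
        correct = 0
        for question, expected in qa_pairs:
            answer = ask_model(build_prompt(question, filler_tokens))
            correct += int(answer.strip() == expected)
        results[level] = correct / len(qa_pairs)
    return results

# Dummy model that ignores filler entirely -- a "non-muddy" baseline.
def dummy_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt.split("Question:")[-1] else "?"

print(saturation_sweep([("What is 2 + 2?", "4")],
                       context_budget=1000, ask_model=dummy_model))
# → {0.1: 1.0, 0.5: 1.0, 0.9: 1.0}
```

A flat curve across levels (as with the dummy baseline) would indicate no muddiness; a real model that degrades at 90% saturation would show accuracy dropping as `level` rises.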


Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?

A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.


You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.


Would be awesome if it is solved but seems like a much deeper problem tbh.


Unfortunately Google's track record with language models is one of overpromising and underdelivering.


It's only the web-interface LLMs of the past few years that have been lackluster. That statement isn't correct for their overall history: W2V-based language models and BERT/Transformer models in the early days (publicly available, though not in a web interface) were far ahead of the curve, as Google produced those innovations. Effectively, DeepMind/Google are academics, where the real innovations are made, but they struggle to produce corporate products, which is where OpenAI shines.


I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.

Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.

Of course, none of this even matters anyway because the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.


Their in-context long-sequence understanding "benchmark" is pretty interesting.

There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]

They set up a test of in-context learning capabilities at long context - they asked 3 long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations.

This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
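A hedged sketch of how the three prompt conditions might be assembled (names like `make_prompt` and `grammar_book` are illustrative placeholders, not the paper's actual harness):

```python
def make_prompt(sentence: str, direction: str, grammar_book: str, condition: str) -> str:
    """Build a translation prompt for one of the three test conditions."""
    if condition == "0-shot":
        reference = ""  # no Kalamang material in context at all
    elif condition == "half-book":
        reference = grammar_book[: len(grammar_book) // 2]  # ~125k tokens
    elif condition == "full-book":
        reference = grammar_book  # the whole ~250k tokens
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"{reference}\n\nTranslate {direction}:\n{sentence}"

# Tiny stand-in for the real ~250k-token grammar corpus.
book = "grammar " * 10
print(len(make_prompt("hello", "English -> Kalamang", book, "half-book"))
      < len(make_prompt("hello", "English -> Kalamang", book, "full-book")))
# → True
```

The human-rated translations then measure how much of the grammar the model actually absorbed from context alone, condition by condition.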

It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or such drastically different methodologies that there's no fair comparison.

[1] from https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The available resources for Kalamang are: field linguistics documentation comprising a ∼500 page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total the available resources for Kalamang add up to around ∼250k tokens.


I believe that's a limitation of using vectors of high dimensions. It'll be muddy.


Not unlike trying to keep the whole contents of the document in your own mind :)


It's amazing we are in 2024 discussing the degree a machine can reason over millions of tokens of context. The degree, not the possibility.


Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."

The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.

I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.


Did you think the extraction of information from the Buster Keaton film was muddy? I thought it was incredibly impressive how precise it was.


That was not muddy, but it's not the kind of scenario where muddiness shows up.


Page 8 of the technical paper [1] is especially informative.

The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend and an increase in accuracy when working with >=1M tokens. The Gemini 1.0 curve is overlaid and supports the experience of 'muddiness'.

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...


Yes


Weird - just tried this and it worked for me.


I tried this

> write a powershell script to crawl an entire website and download all images

It still refuses to generate code for that.


The tech report seems to hint that GPT-4 may have had some training/testing data contamination, so GPT-4's performance may be overstated.


From the report:

"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."


Great catch!


I'd be shocked - given the incentives - if it hasn't already happened to a great extent. Many of the types of people Google DeepMind hires are also the types of people hedge funds hire.


Sure, but this time in 2022, you probably wouldn't have let it do that much (or even had this thought).

Progress!


?

BERT was 5 years ago. Of course it's worse than anything introduced more recently (both inside and outside Google).

https://en.m.wikipedia.org/wiki/BERT_(language_model)


The US has had the Bill of Rights and its constitutional system / property protection for a long time but only recently has had the degree of NIMBYism that it's had.

It's easy to pretend the bug is actually an important and noble feature, but sometimes it's just a bug that needs fixing.

