Hacker News

The bird not having wings, but all of us calling it a 'solid bird', is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet', which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly stalled out at on MMLU and AIME) to be 'solved' is really concerning to me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.




This test is so far beyond AGI. Try to spit out the SVG for a pelican riding a bicycle. You are only allowed to use a simple text editor. No deleting or moving the text cursor. You have 1 minute.

Sorry, is your definition of AGI "doing things worse than humans can, but way faster"? Because that's been true of computers for a long time.

I mean for this particular benchmark, yes.

You'd have to put it in an agentic loop to perform corrections otherwise.


MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025


Here's the score for new AIMEs, where we know the answers aren't in the training data.

https://matharena.ai/?view=problem&comp=aime--aime_2026

As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?

As implied by the video, wouldn't it then take one intern a week, max, to fix those errors and allow any AI lab to become the first to consistently score 100% on MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over that opportunity if it were a real problem.


The benchmarks are harder than you might imagine and contain more wrong answers and terrible questions than you would expect.

You don't need to take my word for it, try playing MMLU yourself.

https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...

It's not MMLU-Pro, btw, which is considerably harder.


Sure and AGI will 100% it 100% of the time, even if it is hard.

Your definition of AGI must be absurd.

It has a wing. Look at the code comments in the SVG!


