I don't think it's just an engineering problem - decades of research have failed to produce a convincing, general definition of intelligence, capability or agency. You can try to form proxy metrics by combining benchmarks, but existing benchmarks are flawed, and should be taken with a pinch of salt.
It's evident in the fact that every time AI has historically cleared some threshold (chess-playing, the Turing Test, fluent language), we play with the resulting systems a little more and find out there's still something lacking.
I use Gemini via its web app, which aggressively autoswitches to Flash instead of Pro, but I usually notice quickly because the answers are weird or the logic doesn't quite follow. I feel like, at least for 'daily driver' usage, small models are still a little disappointing. That said, they're getting very good for automation-style work with simple, well-constrained scopes.
That's likely because they're chasing enterprise - see their deals with HSBC, ASML, AXA, BNP Paribas, etc. Given swelling anti-US sentiment and their status as a French 'national champion', Mistral are probably in a strong position for now regardless of model performance, research quality or consumer uptake.
pass@k means that you run the model k times and count it as a pass if any of the k answers is correct. I guess Lean is one of the few use cases where pass@k actually makes sense, since you can automatically validate correctness rather than eyeballing k outputs.
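For the curious, the naive version of the metric is trivial to sketch (pure Python; the function name and toy answers are mine, not from any benchmark harness):

```python
def pass_at_k(samples, is_correct):
    # naive pass@k: the problem counts as solved if any of the
    # k sampled answers validates (e.g. the Lean checker accepts it)
    return any(is_correct(s) for s in samples)

# toy setup: the "model" proposed these k=4 answers; 42 is correct
attempts = [7, 13, 42, 99]
print(pass_at_k(attempts, lambda a: a == 42))  # True
```

(Real evaluations like HumanEval use an unbiased estimator over n > k samples rather than literally running k times, but the idea is the same.)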
The only other pair I've seen treated that way is Nothing's Headphones - although the (maybe niche) musicians I've seen wearing them lean into y2k aesthetics, where Apple's products are more broadly appealing.
Jaxtyping is the best option currently - despite the name it also works for Torch and other libs. That said, I think it still leaves a lot to be desired. It's runtime-only, so unless you wire it into a typechecker it's only a hint. And, for me, the hints aren't parsed by IntelliSense, so you don't see shape hints at call sites - only when reading the function definition directly.
Personally, I also think the syntax is a little verbose: for a generic shape hint you need something like `Shaped[Array, "m n"]`. But 95% of the time I only really care about the shape "m n". It doesn't sound like much, but I recently tried hinting a codebase with jaxtyping and gave up because it was adding so much visual clutter, without clear benefits.
This would be an insta-switch feature for me! Jaxtyping is a great idea, but the runtime-only aspect kills it for me - I just resort to shape assertions + comments, but it's a pretty poor solution.
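For reference, the fallback I'm describing is just something like this (a minimal sketch; the helper name and the stub array class are mine, so it runs without numpy/torch installed):

```python
def assert_shape(arr, expected):
    """Runtime check that arr.shape matches `expected`; works for
    anything exposing a numpy/torch-style .shape tuple."""
    actual = tuple(arr.shape)
    if actual != tuple(expected):
        raise ValueError(f"expected shape {tuple(expected)}, got {actual}")
    return arr

# toy stand-in for a tensor, just so the sketch is self-contained
class FakeArray:
    def __init__(self, shape):
        self.shape = shape

x = assert_shape(FakeArray((3, 4)), (3, 4))  # passes silently
```

It works, but the checks only fire on the code paths you actually execute, and nothing connects them to the static type system - which is exactly the gap.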
A follow-up question: Google's old `tensor_annotations` library (RIP) could statically analyse operations - e.g. `reduce_sum(Tensor[Time, Batch], axis=0) -> Tensor[Batch]`. I guess that kind of static analysis wouldn't come with jaxtyping?
I wonder how much this is just a sampling bias. Older media has been repeatedly filtered over time, so you don't see all the bland, derivative ripoffs that were abundant at the time. Likewise, interesting and forward-thinking work produced today may not be widely appreciated for many years - consider that Van Gogh's work was largely ignored during his lifetime.
It's unintuitive to me that architecture doesn't matter - deep learning models, for all their impressive capabilities, are still deficient compared to human learners as far as generalisation, online learning, representational simplicity and data efficiency are concerned.
Just because RNNs and Transformers both work with enormous datasets doesn't mean that architecture/algorithm is irrelevant; it just suggests that they share underlying primitives. But those primitives may not be the right ones for 'AGI'.
LeCun was stubbornly 'wrong and boneheaded' in the 80s, but turned out to be right. His contention now is that LLMs don't truly understand the physical world - I don't think we know enough yet to say whether he is wrong.