OpenAI codenamed one of their models "Project Strawberry" and, IIRC, Sam Altman himself took a victory lap over the fact that it can count the number of "r"s in "strawberry".
Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems and just being fine-tuned for a particular benchmark that's making the rounds.