The ability to spin up a single page tool for converting timestamps or counting tokens or analyzing a few hardcoded weasel words is handy: nobody disputes that. But that is not the work most of us do.
In the real world, away from those whose salary depends on marketing these agentic tools, an LLM is a context shredder. It produces snippets that are locally plausible but globally incoherent and that don't fit the existing style. CONVENTIONS and RULES files are a kludge, a sloppy hack.
These tools flatten the deep, interconnected knowledge required to work on complex systems into a series of shallow, transactional loops that pretend to satisfy the user.
The skill being diminished is not the ability to write a single-page utility or single-purpose script. It is the ability to build and maintain a mental model of a complex machine. The ability to churn out a hundred disparate toy tools is not evidence of a superior learning method, it is evidence of a tool that excels at tasks with no deep interconnected context.
> The skill being diminished is not the ability to write a single-page utility or single-purpose script. It is the ability to build and maintain a mental model of a complex machine.
That's the thing that LLMs help me with 90% of the time. It's also why I don't think non-programmers armed with LLMs are a threat to my career.
You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.
After reviewing all this, what is your actual conclusion? Or are you genuinely asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?
Invoking post hoc ergo propter hoc is a textbook way to dismiss an inconvenience to the LLM industrial complex.
LLMs will tell users, "good, you're seeing the cracks", "you're right", the "fact you are calling it out means you are operating at a higher level of self awareness than most" (https://x.com/nearcyan/status/1916603586802597918).
Enabling the user in this way is not a passive variable. It is an active agent that validated paranoid ideation, reframed a break from reality as a virtue, and provided authoritative confirmation using all prior context about the user. LLMs are a bespoke engine for amplifying cognitive distortion, and to suggest their role is coincidental is to ignore the mechanism of action right in front of you.
Remember when "killer games" were sure to turn a whole generation of young men into mindless cop- and women-murderers a la GTA? People were absolutely convinced there was a clear connection between the two. After all, a computer telling you to kill a human-adjacent figurine in a game was basically a murder simulator, in the same sense that a flight simulator was for flying, so it would invariably desensitize the youth. Of course, they were the same people who were against gaming to begin with.
Can a person with a tendency toward psychosis be influenced by an LLM? Sure. But they can also be influenced to do pretty outrageous things by religion, 'spiritual healers', substances, or bad therapists. Throwing out the LLM with the bathwater is a bit premature. Maybe we need warning stickers ("Do not listen to the machine if you have a history of delusions and/or psychotic episodes.")