If all LLM advancements stopped today, but compute and energy got cheap enough that the $30 million zettaflop was possible, I wonder what outcomes would be achievable. Would 1,000 Claudes be able to coordinate in meaningful ways? How much human intervention would be needed?
Headline/article is extremely misleading. They still have subscription plans with included usage, but those usage limits are now based on tokens instead of messages.
I like this, and think it's true for how humans learn. What's interesting to me is that it seems LLMs are significantly smarter than they were two years ago, but it doesn't feel like they have better "taste". Their failure modes are still bizarre and inhuman. I wonder what it is about their architecture/training that scales their experience without corresponding improvements in taste.
In theory, RLVR should encourage less error-prone code, similar to a human getting burned by production outages like the article mentioned. Maybe the scale in training just isn't big enough for that to matter? Perhaps we need better benchmarks that capture long-term issues that arise from bad models and unnecessary complexity.
I’ve tried having one “big” task that I’m focusing on with active back and forth, while letting other Claude instances handle easier, back-burner tasks that they can effectively one-shot. But I’ve noticed this often turns into me spending more time and focus than I’d like on tasks that aren’t actually that impactful. I still think I get more done than I would otherwise, but I haven’t found the best management strategy yet.
Yeah that confused me, but the compression paper also doesn’t make a ton of sense since I doubt Google would have released it if it was actually such a competitive advantage compared to what other labs are doing. So I wonder what’s actually causing the price decrease.
Okay this is really fun and mathematically satisfying. Could even be useful for tough bugs that are technically deterministic, but you might not have precise reproduction steps.
Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
IIUC, the way you'd do that right now is to repeatedly record individual observations on a single commit, which effectively gives it a probability plus the number of trials for the Beta update. I don't yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one.
But ofc part of the magic is that git_bayesect's commit selection tells you how to be maximally sample efficient, so you'd only want to do a batch record if your test has high constant overhead
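For anyone curious, the conjugate update behind this is tiny. A minimal sketch, assuming per-run failures are Bernoulli with unknown probability; the function name and signature are illustrative, not git_bayesect's actual API:

```python
def beta_update(alpha: float, beta: float, fail_prob: float, num_trials: int):
    """Fold a batch of num_trials runs with observed failure rate fail_prob
    into a Beta(alpha, beta) posterior over the commit's failure probability."""
    failures = round(fail_prob * num_trials)
    passes = num_trials - failures
    # Beta is conjugate to the Bernoulli: just add the counts.
    return alpha + failures, beta + passes

# 20 runs with a 25% observed failure rate, from a uniform Beta(1, 1) prior:
print(beta_update(1.0, 1.0, 0.25, 20))  # -> (6.0, 16.0)
```

Recording the runs one at a time lands on the same posterior; the batch form just saves the round trips.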
In theory, the algorithm could deal with that by choosing, at each step, the commit with the best expected information gain divided by expected test time. In most cases, though, it would be more efficient to just cache the compiled output.
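A minimal sketch of that selection rule; the cost model and names here are illustrative, not git_bayesect's actual internals:

```python
def pick_commit(gains, test_time, rebuild_time, current_commit):
    """gains: dict mapping commit -> expected information gain in bits.
    Expected cost of a test is the test time itself, plus a rebuild
    if we have to switch away from the commit that's currently built."""
    def cost(commit):
        return test_time + (0.0 if commit == current_commit else rebuild_time)
    # Maximize expected bits per second of wall-clock time.
    return max(gains, key=lambda c: gains[c] / cost(c))

# With a slow rebuild, repeating the current commit wins until its remaining
# gain drops below the other commit's gain amortized over a rebuild:
print(pick_commit({"A": 0.9, "B": 0.8}, test_time=1.0, rebuild_time=10.0,
                  current_commit="A"))   # -> A
print(pick_commit({"A": 0.05, "B": 0.8}, test_time=1.0, rebuild_time=10.0,
                  current_commit="A"))   # -> B
```

This is exactly the long-runs-then-switch behavior discussed below: repeats on the built commit stay cheap until their marginal gain decays.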
This doesn't sound quite right, but I'm not sure why.
Perhaps: a reasonable objective would be to say that for N bits of information, I would like to pick the test schedule that requires the least total elapsed time. If you have two candidate commits and a slow recompile time, it seems like your algorithm would do many repeats of commit A until the gain in information per run drops below the expected gain from B divided by the recompile time, then it would do many repeats of B, then go back to A, etc. So there are long runs, but you're still switching back and forth. You would get the same number of bits by doing the same number of test runs for each commit, but batching all of the A runs before all of the B runs.
Then again: you wouldn't know how many times to run each in advance, and "run A an infinite number of times, then run B an infinite number of times" is clearly not a winning strategy. Even with a fixed N, I don't think you could figure it out without knowing the results of the runs in advance. So perhaps your algorithm is optimal?
It still feels off. You're normalizing everything to bits/sec and choosing the maximum. But comparing an initial test run divided by the rebuild time vs a subsequent test run divided by a much faster time seems like you're pretending a discrete thing is continuous.
The general requirement for this approach to be optimal is called "dynamic consistency"; a good description is in [1]. It's the situation where, say, you have a budget B and search until that budget is exhausted, and are then informed that there is an additional budget B2 and you can continue searching until that is exhausted. A situation is dynamically consistent if, for any B and B2, the optimal strategy makes the same choices whether or not you know in advance that you will get B2.
So you are correct that discreteness is a problem, because if you are nearing the end of the budget you may optimally prefer to get more dice rolls than take bigger bets. But the optimal solution is then often analytically intractable (or at least it was - I last read about this a while back), and the entropy approach is often reasonable anyway. (For cases where search effort is significant, a good search plan can be found by simulation).
Note that "pick the commit with best expected information gain" in git_bayesect isn't optimal even in the no overhead regime. I provide a counterexample in the writeup, which implies ajb's heuristic is also not optimal. I don't see a tractable way to compute the optimal policy.
One idea: if you always spend time testing equal to your constant overhead, I think you're guaranteed to be not more than 2x off optimal.
(and agreed with ajb on "just use ccache" in practice!)
I think if you make your test script compile and then run the tests up to N times, failing on first fail, then when you run bayesect, it just "sees" a test that is "N times more" deterministic, so will behave appropriately.
I'm not sure how to choose an optimal value of N. My first hunch is make it so that it takes at least as long to run all the tests as it takes to setup (checkout, compile link etc.), but it may make sense to go a lot more than that. I'd have to do some thinking about the maths.
It's surprising that this works so well considering that AI-generated AGENTS.md files have been shown to be not very useful. I think the key difference here is that the real-world experience helps the agent reach regions of its latent space that wouldn't occur naturally through autoregression.
I wonder how much of the improvement is due to the agent actually learning new things vs. reaching parts of its latent space that enable it to recall things it already knows. Did the agent come up with novel RL reward design protocols based on trial and error? Or did the tokens in the environment cause it to "act smarter"?