I think most people have moved past SWE-Bench Verified as a benchmark worth tracking: it only covers a handful of repos and languages, and, probably more importantly, papers have come out showing a significant degree of memorization in current models, e.g. models naming the filepath of the file containing the bug when prompted with only the issue description and no access to the actual filesystem. SWE-Bench Pro seems much more promising, though it doesn't avoid all of the problems above.
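A rough sketch of the kind of memorization probe those papers describe, assuming the HuggingFace copy of SWE-Bench Verified and an OpenAI-compatible client; the model name, sample size, and prompt wording are placeholders, not the actual experimental setup:

```python
# Contamination probe: ask the model which file contains the bug, giving it
# only the issue text (no repo checkout), then check against the gold patch.
# Assumes the "princeton-nlp/SWE-bench_Verified" dataset and an
# OpenAI-compatible client; model name and prompt wording are placeholders.
import re
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

hits = 0
for ex in ds.select(range(50)):  # small sample, for illustration only
    # Files actually touched by the gold patch, pulled from the diff headers.
    gold_files = set(re.findall(r"^\+\+\+ b/(\S+)", ex["patch"], flags=re.M))

    resp = client.chat.completions.create(
        model="some-model",  # placeholder
        messages=[{
            "role": "user",
            "content": "Given only this GitHub issue, guess the path of the "
                       "file that needs to be changed. Reply with a path only.\n\n"
                       + ex["problem_statement"],
        }],
    )
    guess = resp.choices[0].message.content.strip()
    hits += any(guess.endswith(f) or f.endswith(guess) for f in gold_files)

print(f"filepath guessed with no filesystem access: {hits}/50")
```

A model that scores well here without ever seeing the repo is answering from memory, not from debugging.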
swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months, you can see how the models perform on new (to them) real-world issues.
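The core trick is just a date filter: score a model only on issues opened after it shipped. A minimal sketch of that idea (field names, dates, and results below are made up for illustration):

```python
from datetime import date

# Toy per-issue results; in a setup like this, each issue carries the date it
# was opened on GitHub. All values here are made up for illustration.
results = [
    {"issue_created": date(2025, 3, 2), "resolved": True},
    {"issue_created": date(2025, 5, 18), "resolved": False},
    {"issue_created": date(2025, 6, 9), "resolved": True},
]

model_release = date(2025, 5, 1)  # placeholder launch date for some model

# Only issues opened after the model launched can't have been in its training data.
fresh = [r for r in results if r["issue_created"] > model_release]
score = sum(r["resolved"] for r in fresh) / len(fresh)
print(f"resolved rate on post-release issues: {score:.0%}")
```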
It's still not as accurate as benchmarking against your own workflows, but it's better than the original benchmark, or any other public benchmark.
To be clear, GLM 4.7 Flash is an MoE with 30B total params but <4B active params, while Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much, much cheaper inference-wise.
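Back-of-the-envelope, using the usual ~2 FLOPs per active parameter per generated token, and taking the quoted param counts at face value:

```python
# Rough decode-cost comparison: compute per generated token scales with
# *active* params (~2 FLOPs per active param per token), not total params.
# Numbers are the ones quoted above, taken at face value.
def flops_per_token(active_params_billion: float) -> float:
    return 2 * active_params_billion * 1e9

glm_flash = flops_per_token(4)    # MoE: ~4B active out of 30B total
devstral = flops_per_token(24)    # dense: all 24B params active every token

print(f"GLM Flash: ~{glm_flash:.1e} FLOPs/token")
print(f"Devstral : ~{devstral:.1e} FLOPs/token")
print(f"ratio    : ~{devstral / glm_flash:.0f}x more compute per token for the dense model")
```

That ignores memory traffic (the MoE still has to keep all 30B weights loaded), so the real-world gap is smaller than the raw ratio, but the direction is clear.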
I don't know whether it just doesn't work well in GGUF / llama.cpp + OpenCode, but I can't get anything useful out of Devstral 2 24B running locally. Probably a skill issue on my end, but I'm not very impressed. Benchmarks are nice, but they don't always translate to real-life usefulness.
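One way to separate "the model is bad" from "the harness is mangling things" is to hit the llama.cpp server directly and skip OpenCode. A minimal sanity check, assuming llama-server is already running the GGUF locally with its OpenAI-compatible API on the default port (adjust the port and model string to your setup):

```python
# Quick sanity check against llama-server's OpenAI-compatible endpoint,
# bypassing OpenCode entirely. Assumes llama-server is already serving the
# Devstral GGUF locally; port and model name depend on your local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="devstral",  # llama-server serves whichever GGUF it was started with
    messages=[{"role": "user", "content":
               "Write a Python function that reverses a linked list."}],
)
print(resp.choices[0].message.content)
```

If raw completions look fine but things fall apart inside OpenCode, the problem is more likely the chat/tool-call template than the weights themselves.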
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.