I think most people have moved past SWE-Bench Verified as a benchmark worth tracking: it only covers a handful of repos and languages, and, probably more importantly, papers have come out showing a significant degree of memorization in current models, e.g. models naming the filepath of the file containing the bug when prompted with only the issue description and no access to the actual filesystem. SWE-Bench Pro seems much more promising, though it doesn't avoid all of the problems above.
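A rough sketch of the kind of memorization probe those papers describe, assuming the HuggingFace copy of SWE-Bench Verified and an OpenAI-compatible client; the model name, sample size, and prompt wording are placeholders, not the actual experimental setup:

```python
# Contamination probe: ask the model which file contains the bug, giving it
# only the issue text (no repo checkout), then check against the gold patch.
# Assumes the "princeton-nlp/SWE-bench_Verified" dataset and an
# OpenAI-compatible client; model name and prompt wording are placeholders.
import re
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

hits = 0
for ex in ds.select(range(50)):  # small sample, for illustration only
    # Files actually touched by the gold patch, pulled from the diff headers.
    gold_files = set(re.findall(r"^\+\+\+ b/(\S+)", ex["patch"], flags=re.M))

    resp = client.chat.completions.create(
        model="some-model",  # placeholder
        messages=[{
            "role": "user",
            "content": "Given only this GitHub issue, guess the path of the "
                       "file that needs to be changed. Reply with a path only.\n\n"
                       + ex["problem_statement"],
        }],
    )
    guess = resp.choices[0].message.content.strip()
    hits += any(guess.endswith(f) or f.endswith(guess) for f in gold_files)

print(f"filepath guessed with no filesystem access: {hits}/50")
```

A model that scores well here without ever seeing the repo is answering from memory, not from debugging.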
swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months, you can see how the models perform on new (to them) real-world issues.
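The core trick is just a date filter: score a model only on issues opened after it shipped. A minimal sketch of that idea (field names, dates, and results below are made up for illustration):

```python
from datetime import date

# Toy per-issue results; in a setup like this, each issue carries the date it
# was opened on GitHub. All values here are made up for illustration.
results = [
    {"issue_created": date(2025, 3, 2), "resolved": True},
    {"issue_created": date(2025, 5, 18), "resolved": False},
    {"issue_created": date(2025, 6, 9), "resolved": True},
]

model_release = date(2025, 5, 1)  # placeholder launch date for some model

# Only issues opened after the model launched can't have been in its training data.
fresh = [r for r in results if r["issue_created"] > model_release]
score = sum(r["resolved"] for r in fresh) / len(fresh)
print(f"resolved rate on post-release issues: {score:.0%}")
```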
It's still not as accurate as benchmarking against your own workflows, but it's better than the original benchmark, or any other public benchmark.
To be clear, GLM 4.7 Flash is an MoE with 30B total params but <4B active params, while Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much, much cheaper inference-wise.
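Back-of-the-envelope, using the usual ~2 FLOPs per active parameter per generated token, and taking the quoted param counts at face value:

```python
# Rough decode-cost comparison: compute per generated token scales with
# *active* params (~2 FLOPs per active param per token), not total params.
# Numbers are the ones quoted above, taken at face value.
def flops_per_token(active_params_billion: float) -> float:
    return 2 * active_params_billion * 1e9

glm_flash = flops_per_token(4)    # MoE: ~4B active out of 30B total
devstral = flops_per_token(24)    # dense: all 24B params active every token

print(f"GLM Flash: ~{glm_flash:.1e} FLOPs/token")
print(f"Devstral : ~{devstral:.1e} FLOPs/token")
print(f"ratio    : ~{devstral / glm_flash:.0f}x more compute per token for the dense model")
```

That ignores memory traffic (the MoE still has to keep all 30B weights loaded), so the real-world gap is smaller than the raw ratio, but the direction is clear.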
I don't know whether it just doesn't work well in GGUF / llama.cpp + OpenCode, but I can't get anything useful out of Devstral 2 24B running locally. Probably a skill issue on my end, but I'm not very impressed. Benchmarks are nice, but they don't always translate to real-life usefulness.
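One way to separate "the model is bad" from "the harness is mangling things" is to hit the llama.cpp server directly and skip OpenCode. A minimal sanity check, assuming llama-server is already running the GGUF locally with its OpenAI-compatible API on the default port (adjust the port and model string to your setup):

```python
# Quick sanity check against llama-server's OpenAI-compatible endpoint,
# bypassing OpenCode entirely. Assumes llama-server is already serving the
# Devstral GGUF locally; port and model name depend on your local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="devstral",  # llama-server serves whichever GGUF it was started with
    messages=[{"role": "user", "content":
               "Write a Python function that reverses a linked list."}],
)
print(resp.choices[0].message.content)
```

If raw completions look fine but things fall apart inside OpenCode, the problem is more likely the chat/tool-call template than the weights themselves.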
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.