agtestdvn's comments

I work at Windsurf and would love to discuss, product-agnostically, any ideas/thoughts people have around how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are saturated and gamed/trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode definitely isn't the final form factor either!


let's goo


"This may seem construed, but most real-world tasks have many layers of nuance that all have the potential to be miscommunicated.You might think that a simple solution would be to just copy over the original task as context to the subagents as well. That way, they don’t misunderstand their subtask. But remember that in a real production system, the conversation is most likely multi-turn, the agent probably had to make some tool calls to decide how to break down the task, and any number of details could have consequences on the interpretation of the task."

