Author here. My argument is: we give instructions to coding agents dozens of times a day. Over time, speaking those instructions naturally tends to produce more detailed context than typing them out, because the friction of typing makes you abbreviate.
I've been using VoiceInk on macOS for a few months now. The workflow is just: hold shortcut, speak, release, text appears at cursor and works in terminal, editor, chat, wherever.
The post covers Handy, Whispering, VoiceInk, OpenWhispr, and FluidVoice. All open-source, all do local transcription, all paste directly into the active window. The differences are mostly platform support, model selection, and how much extra stuff (AI post-processing, voice-activated mode, etc.) they add.
Happy to answer questions about any of these or about the voice-typing-for-agents workflow in general.
I encourage everyone to use speech-to-text tools to give detailed context to coding agents. As a developer, I love my keyboard, and I understand if you're skeptical. I was too. But adopting speech-to-text is one of the higher-leverage changes you can make as a developer.
We all know LLMs work better when given more context and clear instructions. When you're working with coding agents, you're giving them instructions multiple times a day, every day. Over time, you end up giving much better instructions and richer context with speech-to-text than you would by typing everything out.
There are tons of open-source and proprietary speech-to-text products, offering inference on your local machine or in the cloud. So I put together a curated list of 30+ open-source tools across Linux, macOS, Windows, Android, and iOS. Most support offline recognition. Pick whatever you find suitable, but I definitely recommend giving speech-to-text a try for your LLM workflows. And if you're skeptical, give it a week and then re-evaluate.
I picked India and a random year, 1985 [1]. The number 3 song caught my eye because it had the thumbnail of a famous movie that came out in 2004, although the correct song played. When I went to the linked Spotify playlist for that year, the song at number 3 was wrong: it linked to the song from the 2004 movie.
Not sure what the data source is, but it needs a bit of cleaning and validation. Not critiquing, this project is awesome, just giving a heads up.
Thanks for the feedback! Yes, there are still some inaccuracies that I am fixing manually. I implemented a suggest feature so that I can get some external help to expand and polish: https://88mph.fm/suggest
I found two obvious issues in the first playlist I tried and used /suggest, but two errors out of ten songs doesn't inspire confidence. Maybe, in addition to /suggest, extend the app to run a checker over each compiled playlist. Of the two songs I noticed, one was released over 30 years later, and the other wasn't even from the same century and was an unrelated genre besides. Simply comparing each song's release date against the playlist year would have caught both.
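To illustrate, a minimal sketch of such a checker, assuming the app can look up a release year for each track (the `Song` type and function names here are hypothetical, not part of the actual app):

```python
from dataclasses import dataclass

@dataclass
class Song:
    title: str
    release_year: int

def find_anachronisms(playlist_year: int, songs: list[Song],
                      tolerance: int = 0) -> list[Song]:
    """Flag songs released after the playlist's year (plus an optional tolerance)."""
    return [s for s in songs if s.release_year > playlist_year + tolerance]

# Example: a 1985 playlist that accidentally includes a track from a 2004 movie
playlist = [
    Song("Correct 1985 Song", 1985),
    Song("Song From 2004 Movie", 2004),
]
flagged = find_anachronisms(1985, playlist)
print([s.title for s in flagged])  # only the 2004 track is flagged
```

It wouldn't catch everything (covers and re-releases muddy release dates), but it would have flagged both errors I hit.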
I ran this live in Tokyo with ~50 engineers. The biggest "aha" moment was Phase 4, when the agent loop closes and the LLM starts chaining tool calls autonomously. People go from "I'm building a chatbot" to "oh, this is an agent".
Also, there have been plenty of "build a coding agent in 200 lines" posts on HN in the past year, and they're great for seeing the final picture. I created this simple structured exercise so we start from an empty loop and build each piece ourselves, phase by phase. So instead of just reading the implementation, I hope more people try the implementation themselves.
These are the 7 phases of implementation:
1. LLM in the loop: replace the canned response with an actual LLM call
2. Read file tool: implement the tool + pass its schema to the LLM + detect tool use in the response
3. Tool execution: execute the tool the LLM requested and display the result
4. Agent loop: the inner loop where tool results go back to the LLM until no more tool calls
5. Edit file tool: create and edit files
6. Bash tool: execute shell commands with user confirmation
7. Memory: use the agent to build the agent, add AGENTS.md support for persistent memory across sessions
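The core of phases 1 through 4 can be sketched in a few lines. This is a toy illustration, not the workshop's actual code: `call_llm` and `read_file` are stubs standing in for a real LLM API and a real file read, so the control flow is runnable on its own.

```python
# Hypothetical stub for an LLM API call; a real implementation would send
# `messages` plus tool schemas to a provider and parse tool-use blocks (Phase 2).
def call_llm(messages):
    last = messages[-1]
    if last["role"] == "user":
        return {"tool": "read_file", "args": {"path": "notes.txt"}}
    return {"text": f"The file says: {last['content']}"}

def read_file(path):
    # Stubbed tool; a real one would open `path` on disk.
    return "hello from notes.txt"

TOOLS = {"read_file": read_file}

def agent_loop(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = call_llm(messages)
        if "tool" in reply:
            # Phase 3: execute the tool the LLM requested...
            result = TOOLS[reply["tool"]](**reply["args"])
            # ...Phase 4: feed the result back and loop until no more tool calls
            messages.append({"role": "tool", "content": result})
        else:
            return reply["text"]

print(agent_loop("What does notes.txt say?"))
```

The `while True` with tool results flowing back into `messages` is the Phase 4 "aha": once that loop closes, the chatbot becomes an agent.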
Feedback and PRs welcome. Happy to answer any questions.
Been working on a newsletter [1] so you can stay fully informed about agentic coding with one email, once a week. I keep the focus narrow: only what engineers and tech leaders would find useful for shipping code and leading teams, which means I filter out generic AI news, which-CEO-said-what coverage, and marketing fluff.
To be clear, GLM 4.7 Flash is MoE with 30B total params but <4B active params, while Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much, much cheaper, inference-wise.
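As rough napkin math (ignoring attention, KV cache, and memory bandwidth), per-token decode compute scales with *active* parameters, at the usual rule-of-thumb ~2 FLOPs per active param:

```python
def decode_flops_per_token(active_params_billions):
    # Rule of thumb: ~2 FLOPs per active parameter per generated token
    return 2 * active_params_billions * 1e9

glm_flash = decode_flops_per_token(4)    # MoE: ~4B active out of 30B total
devstral = decode_flops_per_token(24)    # dense: all 24B params active

print(f"Devstral Small needs ~{devstral / glm_flash:.0f}x the compute per token")
```

So roughly a 6x per-token compute gap, even before any serving-stack differences.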
I don't know whether it just doesn't work well in GGUF / llama.cpp + OpenCode, but I can't get anything useful out of Devstral 2 24B running locally. Probably a skill issue on my end, but I'm not very impressed. Benchmarks are nice, but they don't always translate to real-life usefulness.