Hacker News | primaprashant's comments

I've been using speech-to-text tools every day now, especially for dictating detailed prompts to LLMs and coding agents. I personally use VoiceInk, which is open-source.

I looked into what other solutions are available and collected the best open-source ones in this awesome-style GitHub repo. Hope you find something that works for you!

https://github.com/primaprashant/awesome-voice-typing


Will check it out. I made a quick-and-dirty custom solution for the coding agent we use. Speaking is much faster than typing, but unlike typing, it takes mental effort to lay out your thoughts before speaking.

I love using (tiling) window managers, and one of the most important requirements for me is having a key binding for switching to the last active workspace. The proposed solution in the blog doesn't achieve this. I use Aerospace on macOS right now and think it's the best solution available.

I generally have fixed workspaces for different things: first for a browser, second for a code editor, third for a terminal, and so on. If I want to switch between the browser and code editor, I can do that with a single key binding, usually Alt+Tab. The same binding lets me switch between the code editor and terminal just as easily.

When you have something like 10 different workspaces, not having this key binding becomes annoying. If you need to alternate between windows on workspace one and workspace eight, you're stuck using both hands to press Control+1 and then Control+8. But with a last-active-workspace key binding, you can just Alt+Tab between them. This is the killer feature I always need.
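For anyone curious, here's roughly what that looks like in AeroSpace's TOML config. The built-in command is `workspace-back-and-forth` as I understand the docs; the specific key choices below are just my setup, not anything canonical:

```toml
# ~/.aerospace.toml (sketch; key choices are illustrative)
[mode.main.binding]
# the killer feature: jump to the last active workspace
alt-tab = 'workspace-back-and-forth'

# fixed workspaces for browser / editor / terminal, etc.
ctrl-1 = 'workspace 1'
ctrl-2 = 'workspace 2'
ctrl-8 = 'workspace 8'
```

With this, Control+1 then Control+8 gets you between the two workspaces once, and Alt+Tab bounces between them from then on.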


Actually, that was just recently merged into InstantSpaceSwitcher!

Speech-to-text has become an integral part of my dev flow, especially for dictating detailed prompts to LLMs and coding agents.

I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!

https://github.com/primaprashant/awesome-voice-typing


Can you explain how exactly dictation is used for development? I type about 120 WPM, so typing is always going to be way faster for me than talking. Aside from accessibility, is dictation-driven development for slower typists, or is it more so you can relax on a couch while vibe coding? If this comes off as condescending, it's not intended; I am genuinely out of the loop here.

I think most people can speak faster than 120 WPM. For example, this site says I speak at 343 WPM (https://www.typingmaster.com/speech-speed-test/), and I self-measure 222 WPM on dense technical text.

Micro machines guy could be vibe coding at an absurd rate.

My LLM types at 2k WPM. So I use that to talk to my LLMs.

For me personally, it's not really about typing speed. I can type pretty fast, and I most likely speak faster than I type, but typing and dictating are simply different ways of working for me. The end result of both is the same; it's not a competition between the two.

I regularly sit down and describe whatever I'm trying to do in detail: I speak my entire thought process out loud, including the trade-offs I'm weighing, my concerns, and any edge cases and patterns I have in mind. I often speak for 5 to 10 minutes, sometimes pausing in between to think things through.

And I'm not doing it just for vibe coding; I'm using it for everything. Driving coding agents, obviously, but also describing my thoughts while brainstorming or running critique sessions with LLMs on my ideas. For everything, I'm just using dictation.

One other benefit for me personally: since I'm interacting with coding agents and LLMs again and again every day, I end up giving much more context and detail when speaking than when typing. Typing one or two extra sentences can feel like a chore, but speaking doesn't have that kind of friction.


Most English speakers speak faster than 120 WPM, so that's probably why people prefer it, especially those who can't type at speeds like yours.

Typing is considerably less energy intensive than speaking. At least it is for me. I save the speaking for meetings, etc.

Author here. My argument is: we give instructions to coding agents dozens of times a day. Over time, speaking those instructions naturally tends to produce more detailed context than typing them out, because the friction of typing makes you abbreviate.

I've been using VoiceInk on macOS for a few months now. The workflow is just: hold a shortcut, speak, release, and the text appears at the cursor. It works in the terminal, editor, chat, wherever.

The post covers Handy, Whispering, VoiceInk, OpenWhispr, and FluidVoice. All open-source, all do local transcription, all paste directly into the active window. The differences are mostly platform support, model selection, and how much extra stuff (AI post-processing, voice-activated mode, etc.) they add.
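To make the shared shape of these tools concrete, here's a minimal sketch of the hold-to-dictate flow. This is not any of these apps' actual code; `transcribe` and `paste_at_cursor` are hypothetical stand-ins for what they really do with a local Whisper model and OS-level paste/keystroke APIs:

```python
class DictationSession:
    """Hold shortcut -> buffer mic audio -> on release, transcribe and paste."""

    def __init__(self, transcribe, paste_at_cursor):
        self._transcribe = transcribe    # stand-in for a local STT model
        self._paste = paste_at_cursor    # stand-in for pasting into the active window
        self._chunks = []
        self._held = False

    def on_shortcut_down(self):
        # start a fresh recording while the shortcut is held
        self._held = True
        self._chunks = []

    def on_audio_chunk(self, chunk: bytes):
        # audio callback: only buffer while the shortcut is held
        if self._held:
            self._chunks.append(chunk)

    def on_shortcut_up(self):
        # stop recording, transcribe the buffered audio, paste the result
        self._held = False
        text = self._transcribe(b"".join(self._chunks)).strip()
        if text:
            self._paste(text)
        return text
```

The differences between the five tools mostly live inside those two stand-ins: which model runs `transcribe`, and whether `paste_at_cursor` uses the clipboard, synthetic keystrokes, or accessibility APIs.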

Happy to answer questions about any of these or about the voice-typing-for-agents workflow in general.


I've found the RTK CLI proxy [1] quite useful for reducing token usage.

[1]: https://github.com/rtk-ai/rtk/


I encourage everyone to use speech-to-text tools to give detailed context to coding agents. As a developer, I love my keyboard and I can understand if you're skeptical. I was too. But using speech-to-text is one of the high-leverage things you can do as a developer.

We all know LLMs work better when given more context and clear instructions. When you're working with coding agents, you're giving them instructions multiple times a day, every day. Over time, you end up giving much better instructions and more detailed context with speech-to-text than you would by typing everything out manually.

There are tons of open-source and proprietary speech-to-text products, offering inference on your local machine or in the cloud. So I put together a curated list of 30+ open-source tools across Linux, macOS, Windows, Android, and iOS; most support offline recognition. Pick whatever you find suitable, but I definitely recommend giving speech-to-text a try for your LLM workflows. And if you're skeptical, give it a week and then re-evaluate.

https://github.com/primaprashant/awesome-voice-typing


I picked India and a random year, 1985 [1]. The number 3 song caught my eye because it had the thumbnail of a famous movie that came out in 2004, although the correct song played. When I went to the linked Spotify playlist for that year, the song at number 3 was wrong and linked to the song from the 2004 movie.

Not sure what the data source is, but it needs a little cleaning and validation. Not critiquing; this project is awesome, just giving a heads up.

[1]: https://88mph.fm/in/1985


Thanks for the feedback! Yes, there are still some inaccuracies that I am fixing manually. I implemented a suggest feature so that I can get some external help to expand and polish: https://88mph.fm/suggest


I found two obvious issues in the first playlist I tried and used /suggest, but two out of ten doesn't inspire confidence. Maybe, in addition to /suggest, extend your app with a checker that runs over each compiled playlist. Of the two songs I noticed, one was released over 30 years later, and the other wasn't even from the same century and was an unrelated genre. Simply comparing each song's release date against the playlist year would have caught both.
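Something like this would do as a first pass. The dict shape is just an assumption for illustration, not 88mph's actual data model:

```python
# Flag playlist entries whose release year lands after the chart year.
# A song charting in 1985 can predate 1985, but it can't have been released
# meaningfully *after* it, so only that direction is checked.

def suspicious_entries(playlist_year, tracks, tolerance=1):
    """Return tracks released more than `tolerance` years after the chart year."""
    return [
        t for t in tracks
        if t.get("release_year") is not None
        and t["release_year"] > playlist_year + tolerance
    ]
```

Run over a compiled 1985 playlist, a track tagged with a 2004 release would come back flagged for manual review, while tracks with unknown release years are simply skipped.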


Continuing my weekly newsletter about agentic coding updates:

https://www.agenticcodingweekly.com/


I ran this live in Tokyo with ~50 engineers. The biggest "aha" moment was Phase 4, when the agent loop closes and the LLM starts chaining tool calls autonomously. People go from "I'm building a chatbot" to "oh, this is an agent".

Also, there have been plenty of "build a coding agent in 200 lines" posts on HN in the past year, and they're great for seeing the final picture. I created this simple structured exercise so we start from an empty loop and build each piece ourselves, phase by phase. Instead of just reading an implementation, I hope more people try implementing it themselves.

These are the 7 phases of implementation:

  1. LLM in the loop: replace the canned response with an actual LLM call
  2. Read file tool: implement the tool + pass its schema to the LLM + detect tool use in the response
  3. Tool execution: execute the tool the LLM requested and display the result
  4. Agent loop: the inner loop where tool results go back to the LLM until no more tool calls
  5. Edit file tool: create and edit files
  6. Bash tool: execute shell commands with user confirmation
  7. Memory: use the agent to build the agent, add AGENTS.md support for persistent memory across sessions
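To give a feel for the Phase 4 "aha" moment, here's a minimal sketch of that inner loop. This is not the workshop's actual code; `call_llm` is a hypothetical stand-in for a real API client that returns `{"text": ..., "tool_calls": [...]}`:

```python
# Phase 4 in miniature: keep calling the LLM, execute any tool it requests,
# feed the result back, and stop when a reply contains no tool calls.

def read_file(path):
    # Phase 2's tool: return the file's contents as text
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}

def agent_loop(messages, call_llm, max_turns=10):
    for _ in range(max_turns):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply["text"]})
        if not reply["tool_calls"]:
            # no tools requested: the agent is done
            return reply["text"]
        for call in reply["tool_calls"]:
            # execute the requested tool and feed the result back
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not converge within max_turns")
```

Once this loop closes, the LLM can chain reads across files on its own, which is exactly the chatbot-to-agent transition people notice in Phase 4.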

Feedback and PRs welcome. Happy to answer any questions.


Been working on a weekly newsletter [1] to stay fully informed about agentic coding with one email, once a week. I keep the focus narrow: only what engineers and tech leaders find useful for shipping code and leading teams, which means I filter out generic AI news, which-CEO-said-what coverage, and marketing fluff.

[1]: https://www.agenticcodingweekly.com/

