Why I Built Offline Voice Transcription

I haven't typed more than a few sentences in the last three years.

Everything I write, whether it's a message to a client, a prompt to Claude, a note to myself, even this post, starts as speech. I talk, and words appear on screen.

This wasn't always the case. I used to type everything, like most people. But around 2022, open source speech recognition models got good enough that I started experimenting, and somewhere along the way I stopped going back to the keyboard.

Here's what changed. The models, specifically the open source ones like Whisper, got accurate enough that corrections became rare rather than constant. And most modern Macs and PCs ship with hardware that can run these models locally, whether it's Apple silicon with GPU acceleration through Metal or the integrated graphics on Windows machines. You don't need a subscription. You don't need to send your audio to a server somewhere. It just runs on your computer, and it works.

But the free part isn't actually the biggest win.

The biggest win is having it fully wired into how I already work. I have a single keyboard shortcut. Press it once, recording starts. Press it again, recording stops. The audio gets transcribed in a few seconds, and the transcription auto-pastes directly wherever my cursor is. That's it. No app to switch to, no window to open, no text to copy manually. Whatever app I'm in, whatever field I'm focused on, it just appears. And I'm working on adding screen context too, so the AI sees what I see. More on that in a moment.
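
For the curious, the press-once, press-again flow can be sketched as a tiny two-state toggle. This is a minimal sketch, not my actual implementation: the four callables (start_recording, stop_recording, transcribe, paste) are hypothetical placeholders for whatever microphone, speech model, and paste mechanism you wire in, and the stand-ins at the bottom exist only so the example runs without any of them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DictationToggle:
    """One hotkey, two states: idle and recording."""

    start_recording: Callable[[], None]  # hypothetical: begins capturing mic audio
    stop_recording: Callable[[], bytes]  # hypothetical: stops and returns the audio
    transcribe: Callable[[bytes], str]   # hypothetical: local speech-to-text model
    paste: Callable[[str], None]         # hypothetical: inserts text at the cursor
    recording: bool = False

    def on_hotkey(self) -> None:
        """Call this every time the keyboard shortcut is pressed."""
        if not self.recording:
            # First press: start recording and wait for the next press.
            self.start_recording()
            self.recording = True
            return
        # Second press: stop, transcribe, and paste where the cursor is.
        audio = self.stop_recording()
        self.recording = False
        text = self.transcribe(audio).strip()
        if text:  # skip the paste when nothing was said
            self.paste(text)


# Stand-ins so the sketch runs without a microphone or a model.
pasted: List[str] = []
toggle = DictationToggle(
    start_recording=lambda: None,
    stop_recording=lambda: b"fake-audio",
    transcribe=lambda audio: " Hello from my voice. ",
    paste=pasted.append,
)
toggle.on_hotkey()  # first press: recording starts
toggle.on_hotkey()  # second press: stop, transcribe, paste
```

Keeping the four steps as swappable functions means the hotkey logic stays the same no matter which model or paste method sits behind it.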

These models handle punctuation and basic formatting on their own, so what comes out is usually clean enough to send as-is. For a quick Slack message or an AI prompt, I don't even look at it before it lands. For something longer, like an email, I'll glance through and make a few corrections, maybe fix a name or restructure a sentence. But I'm editing, not rewriting.

Custom vocabulary helps with edge cases. My company is spelled Idea Labz, with a Z, which most models would get wrong by default. Same with names that have unusual spellings, like Srikanth. You add them once as custom vocabulary and they come out spelled correctly from then on.
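
One simple way to get this effect, and not necessarily how any particular tool does it, is a find-and-replace pass over the transcript after the model is done. The sketch below assumes a hypothetical CUSTOM_VOCABULARY mapping from the spellings a model tends to produce to the spellings you want; the misspellings listed are illustrative guesses, not real model output.

```python
import re
from typing import Dict

# Hypothetical mapping: what the model tends to write -> the spelling I want.
CUSTOM_VOCABULARY: Dict[str, str] = {
    "idea labs": "Idea Labz",
    "srikant": "Srikanth",
    "shrikant": "Srikanth",
}


def apply_vocabulary(text: str, vocabulary: Dict[str, str]) -> str:
    """Replace known misspellings, matching whole words and ignoring case."""
    for wrong, right in vocabulary.items():
        # \b keeps "srikant" from matching inside an already-correct "Srikanth".
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda _match, fixed=right: fixed, text)
    return text


fixed = apply_vocabulary("I met shrikant at idea labs today.", CUSTOM_VOCABULARY)
```

The other common approach is to hint the spellings to the model up front, for example through the initial_prompt argument that the open source Whisper package accepts. The two can be combined.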

The place where this really pays off is communicating with AI. When you're typing, there's friction. You shorten your thoughts. You skip context because typing it out feels like too much work. When you're speaking, you don't have that problem. You just say what you're thinking, and it comes out complete. The prompts end up longer, cleaner, more specific. The AI understands what you actually want because you told it everything instead of compressing it into the shortest thing you could bring yourself to type.

Small mistakes don't matter much. If the transcription gets a word slightly wrong, Claude or Codex or whatever you're using understands it anyway. It's forgiving. And mid-stream corrections work too. You can say "oh wait, when I said dashboard a few minutes back, I meant the admin panel" and the AI just handles it, instead of you clicking back, highlighting, deleting, retyping. You go with the flow and the AI follows along. So the bar for when transcription is "good enough" drops even further.

The compound effect after three years is subtle but real. It's a quality of life feature. It might not solve a major problem, but when you speak out loud, your quality of thought changes. You explain things more fully. You catch yourself mid-sentence and correct course. That shift in how you think, not just how you type, is what compounds over time.

This has been my workflow for a while now, and it works. But where I've taken it further is integrating it with automations.

Right now, the flow is: speak, transcribe, auto-paste. What I've built on top of that is having a local model format the text before pasting. So instead of just dictating the content, I dictate the content and the instruction in one breath. Something like, "Format this using my voiceSkill.md and copy it to my clipboard." Or "Turn this into a LinkedIn post."

You're suddenly using voice not just to get words on screen, but to talk to your local AI and tell it what to do with those words. The transcription becomes the input to another step, and that step runs automatically based on what you said.
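
A rough sketch of the routing idea: look for a known spoken instruction in the transcript, strip it out, and hand back the remaining content plus the name of the action to run. Everything here is an assumption for illustration, including the TRIGGERS table, the action names, and the plain phrase matching. A real setup could just as well let the local model decide what the instruction is.

```python
import re
from typing import Dict, Tuple

# Hypothetical spoken phrases mapped to hypothetical action names.
TRIGGERS: Dict[str, str] = {
    "turn this into a linkedin post": "linkedin_post",
    "format this as an email": "email",
}


def split_instruction(transcript: str) -> Tuple[str, str]:
    """Return (content, action). The action is 'paste' when no trigger is spoken."""
    for phrase, action in TRIGGERS.items():
        # Match the phrase in any casing, plus punctuation the model put after it.
        pattern = re.escape(phrase) + r"[.!?,]*"
        match = re.search(pattern, transcript, re.IGNORECASE)
        if match:
            content = transcript[: match.start()] + transcript[match.end():]
            return content.strip(), action
    return transcript.strip(), "paste"


content, action = split_instruction(
    "We shipped the new admin panel today. Turn this into a LinkedIn post."
)
```

The action name would then pick which prompt or formatting step runs on the content before it reaches the cursor.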

The obvious next piece, and what I'm working toward now, is adding screen context. When my keyboard shortcut starts recording, what if it also captures a screenshot automatically and sends it along with the transcription? The AI gets both what I said and what I was looking at. Imagine reading a research paper, or a Reddit thread, or a client's dashboard, or an analytics review. You start speaking, taking notes out loud. When you stop, the AI receives your words plus a snapshot of your screen. It sees what you see. The context you'd otherwise have to describe or copy-paste is just there.
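
Since this part is still in progress, here is only a sketch of the shape it could take: bundle the transcript and the screenshot into one message. The capture_screen callable and the JSON field names are hypothetical, and the stand-in below returns just the PNG signature bytes rather than a real image.

```python
import base64
import json
from typing import Callable

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_context_payload(transcript: str, capture_screen: Callable[[], bytes]) -> str:
    """Bundle the spoken words and a screenshot into one JSON message."""
    png_bytes = capture_screen()  # hypothetical: returns PNG bytes of the screen
    payload = {
        "transcript": transcript,
        # Binary image data is base64-encoded so it can travel inside JSON.
        "screenshot_png_base64": base64.b64encode(png_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def fake_capture() -> bytes:
    """Stand-in for a real screen grab; not an actual image."""
    return PNG_SIGNATURE


message = build_context_payload("Summarize the chart on the left.", fake_capture)
```

Whether the screenshot is taken when recording starts or when it stops is a separate choice. This sketch only covers how the two pieces travel together.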

If you write regularly, or communicate a lot over text, or spend any meaningful time prompting AI, this kind of setup is worth building. The tools are all available. It's just about wiring them together in a way that fits how you work.

Let's build it together