I've been trying to build a locally hosted voice assistant that actually works. Not "works in a demo" works. "I use it every day and it doesn't make me want to throw it out the window" works.
After several iterations, I have something that meets that bar. Here's what I learned.
The stack that works for me right now: faster-whisper for speech-to-text, a local LLM for processing, and edge-tts (or a local TTS engine) for speech output. The key insight was that each component has matured enough individually that the full pipeline is now viable, even though no single all-in-one solution is ready yet.
Let me talk about each piece.
Speech-to-text: faster-whisper. This is the component that improved the most in the last year. Whisper was already good when OpenAI released it, but faster-whisper's CTranslate2-based implementation runs roughly 4-6x faster than the original on CPU and is even faster on GPU. On my hardware, I get real-time transcription with the medium model - meaning it processes speech as fast as I can talk. Accuracy is good enough that I almost never have to repeat myself.
The trick with STT is that the bar for "good enough" is really high. If it gets one word in twenty wrong, the LLM usually figures out what you meant. But if it gets one word in five wrong, the experience is maddening. Faster-whisper with the medium model consistently hits the "good enough" threshold.
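To make that one-in-twenty versus one-in-five distinction concrete, here's a toy word-error-rate calculation - not part of my pipeline, just the standard edit-distance metric applied to a twenty-word command with one wrong word versus four:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length (classic WER)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

ref = "please turn off all the lights in the living room before you go to bed tonight thank you very much"
one_in_twenty = "please turn off all the light in the living room before you go to bed tonight thank you very much"
one_in_five = "please turn of all the light in the dining room before you go to bad tonight thank you very much"

print(word_error_rate(ref, one_in_twenty))  # 0.05 - the LLM shrugs this off
print(word_error_rate(ref, one_in_five))    # 0.2 - maddening territory
```

At 5% the LLM has nineteen correct words of context to recover from; at 20% whole clauses come out wrong.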
The brain: local LLM. This is where things get opinionated. For a voice assistant, you need a model that's fast more than you need one that's smart. Waiting five seconds for a response kills the conversational flow. I'm using a smaller model with lower latency, and honestly, for voice assistant tasks - answering questions, controlling smart home stuff, setting timers, quick lookups - you don't need GPT-5. A well-prompted smaller model handles 90% of voice requests perfectly.
The remaining 10% - complex reasoning, nuanced questions - I route to a cloud API. But 90% local is a massive improvement over 100% cloud.
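The routing itself can be very dumb and still work. Here's a minimal sketch of the local-first split - the keyword heuristic and the handler names are invented for illustration, not my actual rules:

```python
# Sketch of local-first routing: handle the easy ~90% locally, escalate only
# requests that look like they need heavier reasoning. The marker list and
# the 25-word cutoff below are illustrative assumptions, not tuned values.

HARD_MARKERS = ("explain why", "compare", "summarize", "write me", "plan")

def looks_hard(request: str) -> bool:
    """Cheap heuristic: long or reasoning-flavored requests go to the cloud."""
    text = request.lower()
    return len(text.split()) > 25 or any(m in text for m in HARD_MARKERS)

def route(request: str) -> str:
    """Return which backend should answer: 'local' or 'cloud'."""
    return "cloud" if looks_hard(request) else "local"

print(route("set a timer for ten minutes"))          # local
print(route("compare heat pumps and gas furnaces"))  # cloud
```

A misroute is cheap in both directions - the worst case is a slightly slow or slightly shallow answer - which is why a crude heuristic is fine here.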
Text-to-speech: the weak link. This is still the least satisfying part of the pipeline. Local TTS has improved, but it's noticeably worse than cloud options. The voices sound slightly robotic. The prosody is off. Emotional expression is limited. I'm using edge-tts as a compromise - it's Microsoft's cloud TTS but it's free and fast - with a fallback to local TTS when offline.
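The edge-tts-with-local-fallback compromise is just an ordered list of backends tried until one succeeds. A minimal engine-agnostic sketch - the stub functions stand in for real wrappers around edge-tts and a local engine:

```python
from typing import Callable, Iterable, Tuple

def speak_with_fallback(text: str,
                        backends: Iterable[Tuple[str, Callable[[str], None]]]) -> str:
    """Try each TTS backend in order (e.g. edge-tts first, local engine second)
    and return the name of the one that succeeded."""
    errors = []
    for name, synth in backends:
        try:
            synth(text)
            return name
        except Exception as exc:  # network down, engine missing, etc.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))

# Stubs for demonstration; in practice the first callable wraps edge-tts
# and the second wraps whatever local engine you've installed.
def fake_edge_tts(text: str) -> None:
    raise ConnectionError("offline")

def fake_local_tts(text: str) -> None:
    pass

print(speak_with_fallback("hello", [("edge-tts", fake_edge_tts),
                                    ("local", fake_local_tts)]))  # local
```

Passing backends as callables keeps the fallback logic separate from any one engine's API, so swapping in a better local TTS later is a one-line change.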
I expect local TTS to catch up within the next year. The architectures are there. The training data is there. It's a matter of compute and fine-tuning.
The integration layer is where I spent the most time and where most people give up. Getting all three components to work together seamlessly - with proper voice activity detection (knowing when I've stopped talking), interrupt handling (being able to cut off the assistant mid-response), and wake word detection - requires careful engineering.
What I've found works:
- Silero VAD for voice activity detection. It's lightweight and accurate.
- A short audio buffer so the system catches the beginning of speech, not just the middle.
- Streaming TTS output so the response starts before the full text is generated.
- Aggressive timeout tuning. The difference between 200ms and 500ms of silence detection is the difference between natural and awkward.
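The first two bullets - VAD plus a pre-roll buffer, ended by a silence timeout - combine into a small state machine. A sketch under stated assumptions: the per-frame speech probabilities here stand in for what Silero VAD emits for each short audio frame, and the specific thresholds are illustrative, not my tuned values:

```python
from collections import deque

FRAME_MS = 30           # per-frame hop; Silero VAD operates on short fixed frames
SPEECH_THRESHOLD = 0.5  # frames above this VAD probability count as speech
SILENCE_MS = 300        # end-of-utterance timeout (the 200-500ms knob from above)
PREROLL_FRAMES = 10     # ~300ms kept so the first syllable isn't clipped

def segment_utterance(frames):
    """frames: iterable of (audio_chunk, speech_prob) pairs, e.g. from Silero VAD.
    Returns the chunks of the first complete utterance, including pre-roll."""
    preroll = deque(maxlen=PREROLL_FRAMES)
    utterance, in_speech, silence_ms = [], False, 0
    for chunk, prob in frames:
        if not in_speech:
            preroll.append(chunk)
            if prob >= SPEECH_THRESHOLD:
                in_speech = True
                utterance.extend(preroll)  # keep the onset, not just the middle
        else:
            utterance.append(chunk)
            silence_ms = silence_ms + FRAME_MS if prob < SPEECH_THRESHOLD else 0
            if silence_ms >= SILENCE_MS:
                break  # speaker stopped: hand the utterance to STT
    return utterance
```

The `SILENCE_MS` constant is exactly the aggressive-timeout knob: raise it and the assistant feels sluggish, lower it and it cuts you off mid-sentence.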
Here's my honest assessment: a local voice assistant in 2026 is about 80% as good as Alexa or Google Home for common tasks, and significantly better for anything that requires custom behavior or integration with your own systems. The 20% gap is mostly in TTS quality and the breadth of pre-built integrations.
That gap is closing fast. If you tried local voice assistants a year ago and gave up, try again. The stack has improved substantially.