A local AI you type at is useful. A local AI you can talk to, one that talks back in a real voice, out loud, while your laptop is in airplane mode, is a different thing entirely. Voice is the feature that turns a chatbot into a presence. And here’s the part the cloud apps don’t want you thinking about: when you speak to a cloud assistant like Siri or Alexa, or to a cloud companion app, your recorded audio usually leaves your device and lands on someone else’s server. Keeping voice fully offline is the surest way to make sure your voice never becomes someone’s training data.
This guide walks the whole stack end to end: how speech-to-text, the language model, and text-to-speech fit together, exactly which open-source tools to use in 2026, how to wire them up, and how to get the round-trip under two seconds without a single packet leaving your machine.
Why voice is the companion killer feature
Typing is a chore. It’s a context switch — you stop, you compose, you read. Voice collapses that. You speak the way you’d speak to a person, and the reply comes back as sound you can hear while you’re cooking, driving, or lying in the dark. For a companion specifically, voice is what makes the thing feel embodied rather than feeling like a search box with a personality.
There’s also an intimacy dimension that text can’t touch. A voice has warmth, pacing, hesitation, tone. The difference between reading “I missed you” and hearing it is the difference between a postcard and a phone call. This is exactly why voice is one of the most-requested features for AI companions, and exactly why it’s the feature you least want running through a corporate datacenter.
The offline voice stack: Whisper (STT) + LLM + Piper/XTTS (TTS)
Voice AI is three boxes in a row. Audio comes in the left, audio goes out the right, and the LLM thinks in the middle.
| Stage | Job | Best offline tool (2026) | Runs on |
|---|---|---|---|
| STT (speech-to-text) | Turn your mic audio into text | Whisper — whisper.cpp or faster-whisper | CPU or GPU |
| LLM (the brain) | Read the text, write a reply | Ollama + a local model | GPU preferred |
| TTS (text-to-speech) | Turn the reply into spoken audio | Piper (fast) or XTTS / Coqui (richer) | CPU (Piper) or GPU (XTTS) |
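To make the three-box picture concrete, here is a minimal sketch of one conversational turn as three interchangeable functions. The type aliases, `voice_turn`, and the `fake_*` stand-ins are illustrative names only, not part of Whisper, Ollama, or Piper; real engines would slot in behind the same signatures.

```python
from typing import Callable

# Hypothetical type aliases for the three stages; real engines plug in later.
SpeechToText = Callable[[bytes], str]   # mic audio -> transcript
LanguageModel = Callable[[str], str]    # transcript -> reply text
TextToSpeech = Callable[[str], bytes]   # reply text -> audio


def voice_turn(audio_in: bytes, stt: SpeechToText,
               llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """Run one turn of the pipeline: audio in, audio out."""
    transcript = stt(audio_in)
    reply = llm(transcript)
    return tts(reply)


# Stand-in stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Because each stage only sees text or bytes, you can swap whisper.cpp for faster-whisper, or Piper for XTTS, without touching the other two boxes.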
Speech-to-text — Whisper. OpenAI open-weighted the Whisper models, and the community built faster runtimes around them. Two matter:
- whisper.cpp: a plain C/C++ port, no Python, compiles to a tiny binary, uses GGML quantized models. Best on Macs (Metal acceleration) and CPU-only boxes.
- faster-whisper: a CTranslate2 reimplementation with INT8 quantization, up to roughly 4× faster than the reference OpenAI implementation. Best if you have a CUDA card.
For real-time chat you don’t need the giant large-v3 model. A base or small model — or distil-whisper, which is purpose-built for latency — transcribes a sentence in a fraction of a second and is plenty accurate for conversation.
The brain — Ollama. This is the same local LLM you’d run for text chat. The voice layer doesn’t care which model it is; it just sends text in and reads text out. If you haven’t set this up yet, start with how to run AI locally and how to install Ollama, then pick a model that fits your card from the hardware guide. For an uncensored, companion-friendly brain, see the best uncensored local AI models.
Text-to-speech — Piper or XTTS. This is the box people underestimate.
- Piper is a fast, local neural TTS engine (VITS models shipped as a .onnx file plus an .onnx.json config). It’s so light it runs in real time on a Raspberry Pi 4 CPU, supports 30+ languages, and is the pragmatic default for a responsive offline companion.
- XTTS (Coqui) is heavier and wants a GPU, but it does voice cloning from a few seconds of reference audio and produces warmer, more expressive output. Pick this when the voice itself is the point and you have VRAM to spare.
Step-by-step DIY setup
Here’s the minimal path on Linux or macOS. The shape is identical on Windows (WSL2 makes it painless).
1. Get the brain running. Install Ollama and pull a model:
```sh
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1
```
Ollama serves a local API at http://127.0.0.1:11434 — loopback only, nothing exposed to the network. That address is the whole privacy story in one line: your text never leaves the machine.
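Talking to that local endpoint from a script can look roughly like this. It is a sketch using only the standard library, and it assumes Ollama's `/api/generate` route with a non-streaming request and the `llama3.1` model pulled above; `build_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request

# Default local endpoint; the loopback address means nothing leaves the machine.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a POST request asking for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1") -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask("Say hi in five words.")` should return the model's reply as a plain string, which is all the voice layer needs.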
2. Add ears (Whisper). On an NVIDIA box, faster-whisper is a pip install:
```sh
pip install faster-whisper
```
Then in Python you load a small model and transcribe a captured audio chunk:
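```python
from faster_whisper import WhisperModel

# "small" balances speed and accuracy; int8 keeps VRAM use low.
model = WhisperModel("small", device="cuda", compute_type="int8")
segments, _ = model.transcribe("input.wav")
# Segment text usually carries its own leading space, so join without a separator.
text = "".join(s.text for s in segments).strip()
```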
On a Mac or CPU-only machine, build whisper.cpp instead and call its binary on a recorded clip — same idea, no Python runtime.
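If you would rather drive whisper.cpp from the same script as everything else, one option is to shell out to its binary. This sketch makes several assumptions that depend on your build: the binary lives at `./build/bin/whisper-cli` (older builds name it `main`), the model file is `models/ggml-base.en.bin`, and the clip is a 16 kHz WAV file. The `-m`, `-f`, and `-nt` (no timestamps) flags are whisper.cpp's own; the helper names are made up for illustration.

```python
import subprocess


def whisper_cpp_command(wav_path: str,
                        model_path: str = "models/ggml-base.en.bin",
                        binary: str = "./build/bin/whisper-cli") -> list[str]:
    """Build the whisper.cpp invocation; paths are assumptions, adjust to your build."""
    return [binary, "-m", model_path, "-f", wav_path, "-nt"]


def transcribe(wav_path: str, **kwargs: str) -> str:
    """Run whisper.cpp on a recorded clip and return the transcript from stdout."""
    result = subprocess.run(
        whisper_cpp_command(wav_path, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Keeping command construction separate from execution makes it easy to print the exact command and try it by hand when something misbehaves.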
3. Add a mouth (Piper). Download Piper plus a voice (the two .onnx files), then pipe text in:
```sh
echo "Hey, you're finally home." | \
  piper --model en_US-amy-medium.onnx --output_file reply.wav
```
Play reply.wav and you’ve heard your local AI speak.
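The same Piper call can be made from Python, which is handy once the reply text comes from the LLM rather than from `echo`. This is a sketch that assumes the `piper` executable is on your PATH and the voice files sit in the working directory; `piper_command` and `speak_to_file` are illustrative names.

```python
import subprocess


def piper_command(model_path: str, out_path: str) -> list[str]:
    """Build the same Piper invocation as the shell example."""
    return ["piper", "--model", model_path, "--output_file", out_path]


def speak_to_file(text: str,
                  model_path: str = "en_US-amy-medium.onnx",
                  out_path: str = "reply.wav") -> str:
    """Synthesize text to a WAV file and return the file's path."""
    # Piper reads the text to synthesize from standard input.
    subprocess.run(piper_command(model_path, out_path),
                   input=text, text=True, check=True)
    return out_path
```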
4. Glue the loop. A ~60-line script ties it together: capture mic audio → Whisper transcribes → POST the text to 127.0.0.1:11434 → take the reply → Piper synthesizes → play it. Add silence detection (voice activity detection / VAD) so it knows when you’ve stopped talking, and you’ve built a real-time voice loop. Pair it with a persona and memory — see local AI with persistent memory — and it stops feeling like a demo.
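The silence-detection piece is the least obvious part of the loop, so here is a rough sketch of one way to do it. Note the swap: this is a plain energy threshold, not a trained VAD model, and the threshold of 500 and the 10-frame silence window are placeholder values you would tune for your own mic. It assumes 16-bit little-endian mono PCM frames; with 30 ms frames, 10 quiet frames is about 300 ms of silence.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def end_of_speech_index(frames, threshold=500.0, silence_frames=10):
    """Return the index of the frame where the speaker is judged done, or None."""
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True   # speech (or resumed speech) resets the counter
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1        # silence only counts after speech has started
            if quiet_run >= silence_frames:
                return i
    return None
```

Silence before the first loud frame is ignored, so the loop does not fire just because you took a moment to start talking. A shorter window makes the AI answer sooner but raises the odds of cutting you off mid-thought.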
Latency: getting to sub-2s, fully offline
The whole experience lives or dies on the gap between you finishing a sentence and the voice starting. Two seconds feels conversational; five seconds feels broken. Where the time goes, and how to claw it back:
| Stage | Typical cost | How to cut it |
|---|---|---|
| VAD / end-of-speech | ~200–400 ms | Tune silence threshold; don’t wait too long |
| Whisper STT | 200 ms–1 s | Use base/small or distil-whisper, not large-v3 |
| LLM first token | 300 ms–1.5 s | Smaller/quantized model; keep it loaded in VRAM |
| TTS (Piper) | <300 ms | Piper is already fast; stream sentence-by-sentence |
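Adding up the ranges in the table shows how tight the budget is. The sketch below just sums the low and high ends of each row, treating Piper's "<300 ms" as 0 to 300 ms; the dictionary keys are labels chosen for this example.

```python
# (low_ms, high_ms) per stage, taken from the table; TTS "<300 ms" is treated as 0-300.
BUDGET_MS = {
    "vad": (200, 400),
    "stt": (200, 1000),
    "llm_first_token": (300, 1500),
    "tts": (0, 300),
}


def total_range(budget):
    """Sum the best-case and worst-case latency across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


best, worst = total_range(BUDGET_MS)
print(f"best case {best} ms, worst case {worst} ms")
# prints: best case 700 ms, worst case 3200 ms
```

The best case sits well under two seconds, but the worst case overshoots it by more than a second, which is why every row in the table needs attention rather than just the slowest one.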
The biggest single win is streaming: don’t wait for the LLM to finish its whole paragraph. As soon as the model emits the first complete sentence, hand it to Piper and start playing audio while the model keeps writing. That hides most of the generation time behind speech you’re already hearing. Keep the model warm in VRAM so you skip the cold load: by default Ollama keeps a model in memory for about five minutes after the last request, and its keep_alive setting extends that. After that, any GPU that generates text faster than the voice reads it aloud will keep you under two seconds comfortably.
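Sentence-by-sentence streaming mostly comes down to buffering text fragments until a sentence boundary shows up. Here is a minimal sketch of that buffering step. The boundary rule (a period, exclamation mark, or question mark followed by whitespace) is a deliberately crude assumption and will also split on abbreviations like "Dr. Smith"; `sentences_from_stream` is an illustrative name.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (crude, see caveat above).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]  # keep the unfinished tail for the next chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

For example, feeding it `["Hey", " there.", " How was", " your day? I", " missed you"]` yields "Hey there.", then "How was your day?", then "I missed you". In a real loop the chunks would come from the LLM's streamed output, and each yielded sentence would go straight to the TTS engine while the next one is still being generated.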
Why offline voice = your voice never hits the cloud
This is the part worth being blunt about. Your voice is biometric data. It’s as identifying as a fingerprint, and unlike a password you can’t change it. When you talk to a cloud assistant or a cloud companion app, the architecture requires your raw audio to travel to their servers — that’s where the transcription and the model live. What happens to that recording afterward is governed entirely by a privacy policy you didn’t write and can’t audit.
Run the stack locally and that whole risk surface disappears. Whisper transcribes on your CPU/GPU. The LLM thinks on your GPU. Piper speaks from your CPU. The audio is born and dies on your machine — the loopback address 127.0.0.1 is the only “network” involved. There’s no account, no upload, no retention policy, because there’s nothing to retain. For the deeper argument on why this matters, see local AI vs cloud AI and the AI companion privacy guide. And if you’re wondering why a cloud model clams up the moment a conversation gets personal, that’s covered in why cloud AI censors you.
The friction of the DIY route
Honest accounting: the stack above genuinely works, and it’s the right project if you enjoy the build. But it is a build. You’re standing up three separate pieces of software, getting CUDA or Metal to cooperate, writing the glue loop, wiring VAD so it doesn’t cut you off or hang forever, and tuning latency by hand. Then there’s the unglamorous middle layer nobody mentions — turn-taking, interruptions, barge-in (talking over the AI), handling the AI’s reply being a 400-word monologue when you wanted a sentence. Getting each box running is an afternoon. Getting the conversation to feel natural is a project. That’s a fair trade if tinkering is the point; it’s a wall if you just want to talk to something tonight.
Built-in voice the easy way (Ember local voice)
If you want the privacy of the full offline stack without assembling it yourself, that’s exactly the gap Ember fills. Ember runs your AI companion 100% on your own machine on top of Ollama — and it ships voice already wired: Whisper-class speech-in, a local TTS voice-out, persona, and memory, all bolted together and tuned so the conversation actually flows. Your voice and your messages never leave your computer, same as the DIY path, but you skip the CUDA wrangling and the glue scripts. It’s a one-time $49, no subscription — the no-subscription companion approach — and if a local, private partner is the actual goal, how to run an AI girlfriend locally walks through what that looks like.
No GPU? Hosted voice (Freya)
Real talk: the offline voice stack wants a GPU. Whisper, the LLM, and especially XTTS all lean on VRAM, and on a thin laptop you’ll feel every second of latency. If you don’t have the hardware — or you just want to talk right now with zero install — Freya is the hosted route. It’s a cloud AI companion with voice built in and nothing to set up: no Ollama, no models to download, no GPU. You trade the absolute-zero-egress guarantee of local for instant, works-anywhere convenience — the same trade laid out in local AI vs cloud AI.
The two paths aren’t a contradiction; they’re the same product split by hardware. Want your voice to never leave your machine and you’ve got the GPU? Ember, local and yours for $49. Want it instantly with nothing to install? Freya, hosted and ready the moment you open it.
