“Do I need a GPU to run an AI companion?” almost always means something more specific: can I get a good companion experience on the machine I already own, or do I have to spend money first? The honest answer is no — you do not strictly need a GPU. But “can run” and “feels good to talk to” are two very different bars. A graphics card doesn’t unlock the experience so much as it removes the friction: faster replies, bigger and smarter models, longer memory. Without one, you can still run a companion locally on CPU, or skip the hardware question entirely with a hosted service. This page walks through exactly what a GPU buys you, what CPU-only actually feels like in practice, what it costs either way, and a clear decision path so you stop guessing.
The real question behind “do I need a GPU”
Nobody searches this because they care about silicon. They care about three downstream things: will replies come fast enough to feel like a conversation, will the model be smart enough to stay in character and remember things, and how much will it cost me to get there. A GPU is just the lever that moves all three at once.
So reframe the question. It isn’t “GPU: yes or no.” It’s: given my hardware and my patience, where does a local companion stop being fun and start being a chore? For some people the line is “I have a gaming PC with a decent card, so local is a no-brainer.” For others it’s “I have a work laptop with integrated graphics and I want to talk to something tonight” — and for them, local-on-CPU is technically possible but a hosted companion is the obvious move. Both are legitimate. The trap is buying a GPU you didn’t need, or suffering through CPU-only when a $0-upfront cloud option would have served you better.
What a GPU buys you for companion chat
A companion is a large language model (LLM) generating tokens — roughly, word-pieces — one after another. Two numbers govern the experience:
- Tokens per second (speed). A GPU holds the whole model in fast video memory (VRAM) and runs the math in parallel. That’s why a mid-range card produces replies several times faster than a CPU. Comfortable conversational reading speed is somewhere around 8–10 tokens/second and up; below that, you’re watching a progress bar. (More on this threshold in tokens per second: what’s actually usable.)
- Model size (smarts). VRAM is a hard ceiling on which models fit. More VRAM means you can run a larger, more coherent model — better at staying in character, holding context, and not contradicting itself. An 8GB card comfortably runs solid 7–8B models; 12–16GB opens up the sweet spot; 24GB lets you run genuinely large models well. Our VRAM guide for companions breaks this down model by model.
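To put the speed threshold in concrete terms, here is a rough sketch of how tokens per second turn into waiting time. The 150-token reply length and the sample speeds are illustrative assumptions, not benchmarks of any particular hardware.

```python
def reply_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate a reply of the given length."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reply_tokens / tokens_per_second


# Hypothetical reply length; real companion replies vary a lot.
REPLY_TOKENS = 150

# Hypothetical speeds chosen only to show the spread.
for label, speed in [("slow CPU", 3.0), ("usable floor", 8.0), ("mid-range GPU", 40.0)]:
    wait = reply_seconds(REPLY_TOKENS, speed)
    print(f"{label:>14}: {wait:5.1f} s for {REPLY_TOKENS} tokens")
```

The same reply that takes under four seconds at 40 tokens/second takes 50 seconds at 3, which is the gap between a conversation and a wait.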
The other lever is quantization — compressing a model so it fits in less memory. Tags like Q4_K_M are the common middle ground: roughly four bits per weight, a big size cut for a small quality cost. Quantization is what lets an 8GB card punch above its weight. See the GGUF quantization cheat sheet for which tag to pick.
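The arithmetic behind that is simple enough to sketch. This is a back-of-the-envelope estimate only: it uses the "roughly four bits per weight" figure from above (real quantized files are typically somewhat larger), and the 1.5 GB overhead for context and runtime buffers is an assumed placeholder, not a measured value.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits(params_billions: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if the weights plus an assumed overhead fit in the given VRAM."""
    return model_size_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


# An 8B model at 16 bits per weight vs. roughly 4 bits per weight.
print(model_size_gb(8, 16))  # unquantized weights
print(model_size_gb(8, 4))   # about a quarter of that
print(fits(8, 4, vram_gb=8))
```

By this estimate an 8B model drops from about 16 GB of weights to about 4 GB at four bits, which is why it fits on an 8GB card once quantized.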
In short: a GPU buys speed and headroom. It doesn’t make a companion possible — it makes it pleasant.
CPU-only companion: the honest experience
You can absolutely run a local companion with no dedicated GPU. Ollama and similar runtimes fall back to CPU automatically. Here’s the unvarnished reality.
- It works. Install the runtime (`curl -fsSL https://ollama.com/install.sh | sh`), pull a small model, and `ollama run` it. The API listens on `127.0.0.1:11434`: fully local, nothing leaves your machine.
- Keep the model small. On CPU, stick to compact models (think 3B–8B class at a `Q4` quant). Larger models technically load but generate painfully slowly.
- Speed is the catch. Expect single-digit tokens per second on a typical modern CPU with a small model: readable, but you’ll feel the pauses, and long replies test your patience. Plenty of RAM matters more than raw clock speed here, since the whole model lives in system memory.
- Prompt processing lags too. Long companion histories (the “memory” you’ve built up) take longer to re-read on CPU, so replies get slower as the conversation grows.
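If you want to measure your own speed rather than guess, a minimal sketch like the one below talks to the local API mentioned above. It assumes Ollama is running on the default address and that you have pulled the model named in `MODEL` (the tag here is only an example). The helper names are hypothetical, and the speed figure relies on the `eval_count` and `eval_duration` fields that Ollama reports in a non-streaming reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.2:3b"  # assumption: swap in any small model you have pulled


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def ask(prompt: str) -> dict:
    """Send the prompt and return the parsed JSON reply (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server up, calling `ask("Say hello in one sentence.")` returns a dict whose `"response"` key holds the text, and passing that dict to `tokens_per_second` gives a number you can compare against the 8–10 tokens/second comfort range.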
If this is your path, we have a dedicated walkthrough: run local AI without a GPU, plus a companion-specific version, an uncensored AI girlfriend with no GPU. The summary: CPU-only is a real option for the patient and the privacy-maximalist, a frustrating one for everybody else.
When local makes sense (you have, or will buy, a GPU)
Local is the right call when one or more of these is true:
- You already own a capable GPU (a recent gaming card with 8GB+ VRAM). The hardware cost is sunk — running a companion on it is essentially free forever.
- Ownership and privacy are the point. Local means messages never leave your machine. No server-side logs, no per-message moderation, no terms-of-service changes that suddenly restrict what you can say. For why that matters, see local AI vs cloud AI and why cloud AI censors you.
- You want to pay once, not monthly. A local setup has no subscription. The model is yours; it runs as long as your PC does.
This is exactly the lane Ember is built for: a companion that runs 100% on your own machine via Ollama, sold once for $49, uncensored because there’s no server in the middle deciding what’s allowed. If you have the GPU (or you’re already buying one for gaming and AI is a bonus), Ember turns that hardware into a private, permanent companion. Start with how to run an AI companion locally if you want the full setup picture.
When the cloud just works (no GPU, want it now)
Be honest with yourself. The cloud is the better answer when:
- You don’t have a GPU and don’t want to buy one. A $300+ graphics card to chat is a hard sell if gaming or other GPU work isn’t already on your list.
- You’re on a laptop, a Mac without much unified memory, or integrated graphics. Local is possible but compromised (more in the FAQ below).
- You want to start in the next five minutes, on your phone or any browser, with zero install, zero model downloads, zero `Q4_K_M` decisions.
A hosted companion runs the model on someone else’s GPU and streams the conversation to you. The trade-off is real and worth naming: messages are processed and, by the nature of the architecture, stored server-side — so you’re trusting a provider’s privacy policy rather than owning the stack. That’s a genuine cost. But for the “no GPU, want it now” reader, it’s often the right trade. Freya is exactly this: a hosted AI companion with no setup and no hardware requirement — you open it and start talking.
Cost comparison: GPU upfront vs hosted ongoing
There’s no universally “cheaper” option — it depends on your timeline and whether the hardware does double duty.
| | Local + GPU (e.g. Ember) | Hosted cloud (e.g. Freya) |
|---|---|---|
| Upfront cost | GPU if you don’t own one (varies widely), plus a one-time app price | $0 hardware |
| Ongoing cost | Electricity only | Monthly/usage subscription |
| Time to first chat | Install runtime, pull model, configure | Minutes — open and talk |
| Privacy | Messages stay on your machine | Processed/stored server-side; trust the policy |
| Smarts ceiling | Limited by your VRAM | Provider runs large models for you |
| Best if… | You own a GPU, value ownership, hate subscriptions | No GPU, want it now, fine with hosted |
The crossover logic is simple. If you already own the GPU, local is the cheapest path that exists — there’s no recurring bill. If you’d have to buy a GPU purely for this, the hosted subscription will be cheaper for a long time before the hardware pays for itself, and you skip the setup entirely. For the deeper privacy side of this trade, see our AI companion privacy guide.
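The crossover can be sketched as a one-line break-even calculation. The $300 card and the $49 one-time price come from this page. The $15 monthly fee is a hypothetical figure for illustration, not a quoted price, and electricity is ignored.

```python
import math


def break_even_months(upfront_cost: float, monthly_fee: float) -> int:
    """Whole months of subscription needed to match a one-time upfront cost."""
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    return math.ceil(upfront_cost / monthly_fee)


GPU_PRICE = 300.0   # the "$300+" card mentioned above
APP_PRICE = 49.0    # one-time local app price from this page
MONTHLY_FEE = 15.0  # hypothetical subscription price, not a quoted figure

# Buying a GPU just for this vs. already owning one.
print(break_even_months(GPU_PRICE + APP_PRICE, MONTHLY_FEE))
print(break_even_months(APP_PRICE, MONTHLY_FEE))
```

Under those assumptions, buying the card purely for chat takes about two years of subscription fees to pay off, while an existing card pays off in a few months. Plug in your own prices before drawing conclusions.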
Decision flowchart
Walk it top to bottom and stop at your first “yes”:
- Do you already own a GPU with 8GB+ VRAM? → Run local. It’s effectively free and fully private. Go to how much VRAM you need to pick a model.
- No GPU, but happy to buy one (and it’ll also game / do other AI work)? → Local is worth it. See the cheapest GPU for local AI.
- No GPU, won’t buy one, but privacy is non-negotiable and you’re patient? → CPU-only local with a small model. Read run local AI without a GPU.
- No GPU, want it now, fine trusting a provider? → Hosted cloud. Open Freya and start talking.
Most people who land on this page are honestly at step 4 — and that’s not a failure of nerve, it’s a sensible read of the trade-offs.
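The same path reads naturally as a small function that returns at the first "yes". The parameter names are made up for illustration, and treating hosted cloud as the fall-through default is an assumption of this sketch.

```python
def recommend(vram_gb: float = 0, will_buy_gpu: bool = False,
              privacy_first_and_patient: bool = False) -> str:
    """Return the first matching path, checked in the same order as the list."""
    if vram_gb >= 8:                  # step 1: already own a capable GPU
        return "local on your GPU"
    if will_buy_gpu:                  # step 2: happy to buy one
        return "local after buying a GPU"
    if privacy_first_and_patient:     # step 3: CPU-only, small model
        return "CPU-only local with a small model"
    return "hosted cloud"             # step 4: assumed default


print(recommend(vram_gb=12))
print(recommend(privacy_first_and_patient=True))
print(recommend())
```

Because the checks run in order, someone with a 12GB card gets the local answer even if they would also be fine with a hosted service.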
FAQ: laptops, Macs, and integrated graphics
Can my laptop run a companion? If it has a dedicated GPU with enough VRAM, yes: treat it like a desktop. Gaming laptops qualify. Thin-and-light laptops with integrated graphics fall back to CPU, and the patience caveats above apply. Laptops also tend to throttle as they heat up during sustained generation, so speeds can drop over a long session.
What about a Mac? Apple Silicon (M-series) is genuinely good for local AI because the CPU and GPU share fast unified memory — that pool acts like VRAM. A Mac with ample unified memory runs mid-sized models well; one with the base memory config is more limited. See running local AI on a Mac mini.
Do integrated graphics (Intel/AMD iGPU) help? Barely, for now — most popular runtimes don’t accelerate well on integrated GPUs, so you’re effectively on CPU. Don’t count an iGPU as “having a GPU” for this purpose. Check whether your PC can run a companion at all before committing.
Will a cheap GPU work? Often, yes — an older card with 8GB VRAM runs small companion models nicely. You don’t need a flagship. The budget AI PC build guide covers sensible entry points.
The bottom line: a GPU is the difference between possible and pleasant for a local companion — but it’s not a requirement to start. If you don’t have one and don’t want to think about VRAM, quants, or downloads tonight, a hosted companion like Freya skips the hardware question entirely and lets you start the conversation right now.
