Most “can my PC run a local AI companion?” answers either drown you in GPU jargon or hand-wave with “you need a good graphics card.” Here’s the honest version: whether your machine can run a private, uncensored AI companion comes down to four numbers you can read in two minutes — your GPU, its VRAM, your system RAM, and your OS. Get those, match them to a tier below, and you’ll know exactly which models will run smoothly, which will crawl, and whether you’re better off going hosted. No benchmarking suite, no trial-and-error downloads.
The 2-minute check: GPU, VRAM, RAM, OS
A local AI companion is just a large language model (LLM) running on your own hardware via a runtime like Ollama. The single biggest factor in whether it runs well is VRAM — the dedicated memory on your graphics card. The model’s weights have to fit somewhere fast, and a GPU’s VRAM is the fast place.
Here are the four things to find:
- GPU — Do you have a discrete graphics card (NVIDIA or AMD), or only integrated graphics? NVIDIA is the smoothest path; AMD works with more setup; Apple Silicon (M-series) is its own strong category.
- VRAM — How many gigabytes of memory the GPU has. This is the number that decides your model size. 8GB, 12GB, 16GB, 24GB are the meaningful breakpoints.
- System RAM — Your regular computer memory. Matters for CPU fallback and for running everything else. 16GB is a floor; 32GB is comfortable.
- OS — Windows, macOS, or Linux. All three run Ollama. Apple Silicon Macs are special because they share memory between CPU and GPU (more on that below).
If you remember one rule: VRAM drives model size. Everything else is secondary.
Reading your specs (Windows/Mac/Linux quick steps)
You don’t need any special software. Here’s how to pull the numbers on each OS.
Windows:
- Press
Ctrl + Shift + Escto open Task Manager → Performance tab → click GPU. The model name is at the top; Dedicated GPU Memory is your VRAM. - For RAM, click Memory in the same tab.
- Or run
dxdiagfrom the Start menu and check the Display tab.
macOS:
- Click the Apple menu → About This Mac. You’ll see the chip (e.g., Apple M2) and Memory.
- On Apple Silicon, that single memory number is both your RAM and your usable VRAM — it’s unified memory. A 16GB M-series Mac can devote a large share of that 16GB to a model. This is why Macs punch above their weight for local AI; see our Mac mini for local AI breakdown.
Linux:
- For NVIDIA: run
nvidia-smi. The top-right number in the table is total VRAM in MiB. - For AMD:
rocm-smior checklspci | grep -i vgafor the card name. - For RAM:
free -h.
Write down: GPU name, VRAM in GB, RAM in GB, OS. Now match them.
Spec tiers and what each can run
Local companion models are typically run quantized — compressed to use less memory at a small quality cost. The common sweet spot is a Q4_K_M quant, which roughly halves memory needs versus full precision while keeping output quality high. The tiers below assume Q4-class quants, which is what you’ll actually use.
| Tier | VRAM | What runs well | Companion experience |
|---|---|---|---|
| Entry | 6–8GB | 7B–8B models (Q4) | Fully usable. Fast, coherent chat. The realistic floor. |
| Comfortable | 12–16GB | 12B–14B, some 22B (Q4) | Noticeably richer, more consistent personality and memory. |
| Enthusiast | 24GB | 27B–32B (Q4), 70B partially | Premium local experience; long context, strong roleplay. |
| No discrete GPU | iGPU only | 1B–3B on CPU, slowly | Possible but sluggish — see the no-GPU path below. |
| Apple Silicon | 16GB+ unified | 8B–14B comfortably | Strong; unified memory does a lot of work. |
The honest minimum for a genuinely good companion is an 8GB GPU running an 8B model — that’s the entry tier, and it’s where most people start. We go deeper on model-per-tier picks in how much VRAM for a local AI companion and the full local AI hardware guide.
What ‘minimum’ actually feels like (slow vs smooth)
“It runs” and “it’s pleasant to use” are different claims. The metric that bridges them is tokens per second (t/s) — how fast the model generates text. A token is roughly three-quarters of a word.
- Below ~5 t/s: Painful. You watch words appear slower than you read. Fine for testing, not for chatting.
- ~8–15 t/s: The usable floor. Reads like someone typing quickly. This is the target for a companion.
- 20+ t/s: Smooth and immersive — text arrives faster than you read it.
When a model fits entirely in VRAM, you land in the smooth range easily. When it spills out of VRAM, speed falls off a cliff — which is the trap the next section covers. We break down the human feel of these numbers in tokens per second: what’s actually usable.
So when someone says a model meets the “minimum specs,” ask: minimum to load, or minimum to enjoy? Those are different. Aim to fully fit your model in VRAM with a couple of gigabytes to spare for context.
Partial offload: the speed cost most pages hide
Here’s the detail that makes or breaks the verdict. When a model is too big for your VRAM, Ollama doesn’t refuse — it offloads the overflow layers to system RAM and runs them on the CPU. This is called partial offload, and it’s much slower than people expect.
The reason: your GPU’s VRAM has enormous bandwidth; system RAM over the PCIe bus does not. Every layer that lives in RAM forces the GPU to wait. A model that flies at 25 t/s fully in VRAM can collapse to 3–5 t/s once a third of it spills to CPU. The slowdown isn’t proportional — it’s punishing, because the CPU portion becomes the bottleneck for the whole generation.
What this means in practice:
- A 14B Q4 model on a 12GB card mostly fits and stays fast.
- The same 14B on an 8GB card partially offloads and slows to a crawl.
- The fix is almost always a smaller model or a more aggressive quant (e.g., Q4 instead of Q5, or step down a model size), not “just run it anyway.”
This is why the tier table is conservative. The goal is full-VRAM fit. If you’re CPU-only with no discrete GPU at all, that’s a different setup entirely — covered in run local AI without a GPU and do I need a GPU for an AI companion?.
If it can’t: the hosted route (Freya)
If your spec check came back thin — integrated graphics, 4GB VRAM, an older laptop, or a Mac with 8GB — running a satisfying companion locally is going to fight you. Partial offload will keep it slow, and buying hardware isn’t always the move.
That’s the case for a hosted companion. Freya runs the whole model in the cloud, so there’s nothing to install, no VRAM math, no quant tags — you open it and talk. Same kind of expressive, uncensored companion experience, just with someone else’s GPU doing the work. The trade-off is honest: it’s not 100% private the way a local model on your own disk is, because the conversation happens on a server. If zero-setup and “want it now” matter more than full local ownership, hosted is the pragmatic answer. (Curious about cloud companion privacy in general? Read are AI girlfriend apps safe?)
If it can: the local setup path (Ember)
If you cleared the entry tier — an 8GB+ GPU or a 16GB Apple Silicon Mac — you can run a private companion that lives entirely on your machine. The setup is short:
- Install the runtime:
curl -fsSL https://ollama.com/install.sh | sh - Pull and run a model (the API serves locally on
127.0.0.1:11434, loopback-only — nothing leaves your machine):ollama run <model> - Point a companion front-end at that local API.
That’s the bones of it. The real payoff is ownership: uncensored, no monthly fee, no server logging your chats, because there’s no server. Ember is built exactly for this — a sold-once companion that runs 100% on your own hardware through Ollama, for the privacy-and-ownership reader. Start with how to run AI locally and the best uncensored local AI models to pick what fits your tier.
Cheap upgrades that change the verdict
Sometimes you’re one small change away from a “yes.” In rough order of value-per-dollar:
- Add system RAM (cheapest win). Going from 16GB → 32GB won’t speed up a fully-in-VRAM model, but it stabilizes partial offload, lets you run bigger context, and stops your OS from choking. Often under $50.
- Use a smaller model or heavier quant. Free. Dropping from a 14B to an 8B, or from Q5 to Q4_K_M, can move you from “crawls” to “smooth” with no purchase at all.
- A used mid-range NVIDIA GPU. The single biggest leap. A used card in the 12–16GB VRAM range jumps you from the entry tier to comfortable. See cheapest GPU for local AI.
- Close the VRAM hogs. Browsers with many tabs and other GPU apps eat VRAM. Closing them can free up enough headroom to fit a model fully.
Run the check, find your bottleneck, and fix the cheapest one first — RAM and model size are free or near-free wins before you ever buy a GPU.
The bottom line: read four numbers, match a tier, and you know your answer. If your hardware clears the entry bar, a fully private local companion like Ember is within reach today; if it doesn’t, Freya gets you the same kind of conversation in the cloud with zero setup while you decide whether to upgrade.
