“How much VRAM do I need?” is the right first question to ask before running an AI companion on your own machine — but the standard “bigger model = better” answer misses what actually makes a companion feel good. A companion isn’t a one-shot chatbot. It needs to remember the conversation, hold a persona steady across hundreds of messages, and stay warm three hours into a roleplay. That stresses your GPU in a different way than asking a model to summarize a document, and it’s why a card that looks fine on paper can start forgetting your name halfway through the evening.
This guide maps real VRAM tiers — 6GB, 8GB, 12GB, 16GB, and 24GB — to the companion experience you actually get at each one, names the model categories that deliver a “remembers-you” feel, explains the long-session memory tax that nobody warns you about, and shows you how to check whether your PC can handle it.
The companion-specific VRAM question: why context eats memory
Your VRAM (the dedicated memory on your graphics card) has to hold two things at once:
- The model weights — the AI’s “brain,” fixed in size by the model and its quantization.
- The context — every token of the ongoing conversation, stored as the KV-cache (key/value cache), which grows as the chat gets longer.
For a quick Q&A bot, item 2 is tiny and you can ignore it. For a companion, item 2 is the whole point. The thing that makes a companion feel like yours is context length — how many past messages it can “see.” A short context means the model forgets what you told it ten minutes ago. A long context means it remembers your day, your inside jokes, your relationship — but every extra thousand tokens of memory costs more VRAM.
So the companion question isn’t just “how big a model fits?” It’s “how big a model fits with enough context left over to remember me?” That second clause is what this guide is really about. (For the broader picture of building a machine for this, see the local AI hardware guide.)
The 6–24GB tiers, mapped to companion quality
Model size is usually given in billions of parameters (7B, 8B, 13B, 70B). Quantization — compressing the weights — lets a bigger model fit in less memory; the common sweet spot is Q4_K_M, which keeps roughly 4-bit precision with little quality loss. Here’s the honest mapping, leaving headroom for the KV-cache rather than filling VRAM to the brim:
| VRAM | Realistic model | Companion experience |
|---|---|---|
| 6GB | 7–8B at Q4, short context | Entry-level. Coherent and fun, but a short memory window — it’ll lose track of long scenes. Fine for casual chats. |
| 8GB | 7–8B at Q4_K_M, moderate context | The real floor for a good companion. Warm, in-character, decent memory. The most common “it just works” tier. |
| 12GB | 8–13B at Q4/Q5, generous context | Noticeably smarter and more consistent. Holds a persona well across long sessions. The comfort tier. |
| 16GB | 13B comfortably, or a tight ~20B, long context | Rich, nuanced, emotionally consistent. Long roleplay without amnesia. Where it starts to feel genuinely alive. |
| 24GB | 22–34B at Q4, very long context | Best consumer experience. Subtle, deeply consistent, huge memory window. The enthusiast endgame short of multi-GPU. |
A 6GB card (older GTX 1660, RTX 3050) runs a companion — don’t let anyone tell you it can’t — it just runs a smaller brain with a shorter memory. An 8GB card (RTX 3060 Ti, 4060, RX 6600) is where most people should aim. 12GB (RTX 3060 12GB, 4070) and 16GB (RTX 4060 Ti 16GB, 4070 Ti Super) are the sweet spots for serious use, and 24GB (RTX 3090, 4090) is the ceiling before you’re buying a second card.
Which model gives a “warm, remembers-you” experience at each tier
Memory comes from context length; warmth and staying-in-character come from the model and its fine-tune. For companions you generally want uncensored or “abliterated” instruct models — ones that won’t break character to lecture you — paired with a roleplay-leaning fine-tune. Without naming specific models you should verify against your tooling, here’s the category that fits each tier:
- 6–8GB: A 7–8B uncensored instruct model at Q4_K_M. Surprisingly warm and capable; the limit is memory depth, not personality.
- 12GB: A higher-quality 8B fine-tune at Q5, or a small ~13B, with a longer context window. The consistency jump is real.
- 16GB: A 13B roleplay-tuned model with a long context. This is the classic “best bang for buck” companion setup — emotionally consistent and rarely amnesiac.
- 24GB: A 22–34B model at Q4. The most human-feeling tier; it picks up nuance smaller models miss.
For picking an actual model by card, the dedicated best local LLM for roleplay guide goes deeper into specific fine-tunes and presets. The rule of thumb: at every tier, prefer a slightly smaller model with more context over a bigger model crammed into memory with no room to remember you.
The KV-cache tax: why long roleplay sessions get heavier
Here’s the part most VRAM calculators ignore. When you load a model, you see its weight size and think “great, it fits.” But as a roleplay session runs long, the KV-cache grows with every token in context. A companion conversation that’s been going for hours can hold many thousands of tokens — and that cache lives in VRAM right alongside the model.
The practical consequences:
- Fill VRAM with weights and you starve the cache. The app silently truncates context — so the model “forgets” the start of your scene. That sudden amnesia mid-session is almost always the KV-cache hitting its ceiling, not the model being dumb.
- Bigger context windows cost more memory, roughly in proportion to how many tokens you keep. Doubling your usable memory window meaningfully raises VRAM use.
- This is why the tiers above leave headroom. An 8B model might “fit” in 6GB on paper, but give it a long companion context and you’ll spill over — slowing to a crawl as it offloads to system RAM, or truncating your history.
Two levers help: KV-cache quantization (some runners compress the cache to 8-bit, roughly halving its footprint for a small quality cost), and simply choosing a model one size down so the cache has room to breathe. For companions, that trade is almost always worth it — a model that remembers beats a marginally smarter one that forgets.
Spec check: can your PC run a local AI companion?
Three numbers decide it. On Windows, open Task Manager → Performance → GPU and read “Dedicated GPU Memory.” On Linux, run nvidia-smi. Check:
- VRAM — the headline number. Match it to the tier table above. This is the single biggest factor.
- GPU type — NVIDIA is the smoothest path (best support across local runners). AMD works well on modern cards but needs a little more setup. Apple Silicon (M-series) is excellent because it shares unified memory between CPU and GPU — a Mac with 16–32GB unified can punch above a discrete card.
- System RAM — 16GB minimum, 32GB comfortable. The model lives in VRAM, but the OS, app, and any overflow lean on system RAM.
A fast way to test the waters: install Ollama and pull a small model.
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b
If an 8B model replies quickly and stays responsive, your hardware is in good shape for a companion. If it crawls or errors on memory, you’re below the tier you want — and that’s useful to know before you commit. For a full walkthrough of this question, the can my PC run an AI companion guide takes it step by step, and do I need a GPU at all covers the CPU-only case.
No good GPU? The zero-setup answer
If your spec check came back short — a laptop with integrated graphics, a 4GB card, or no dedicated GPU — running locally will be frustrating: slow, short-memoried, or impossible. That’s a legitimate situation, not a failure.
The honest answer here is a hosted companion, where the model runs on someone else’s hardware and you just open a chat. There’s no VRAM math, no model downloads, no setup — it works on a Chromebook or a phone. The trade-off is real and worth understanding: with any hosted service your messages are processed on a server, so privacy depends on the provider’s policy rather than on physics. If that trade is acceptable to you, it’s the fastest way to get a warm, capable companion today regardless of your hardware. Freya is our hosted pick for exactly this reader.
Enough hardware? Run it yourself
If you cleared the 8GB bar — and most gaming PCs from the last few years did — running your companion locally is the better long game. Everything stays on your machine: no server sees your conversations, nothing is logged to someone else’s database, and there’s no monthly bill or content policy deciding what your companion can say. This is the whole reason to care about VRAM in the first place — privacy and ownership that a cloud service can’t structurally offer. (If that’s your priority, the AI companion privacy guide explains exactly what “local” buys you.)
The setup is genuinely manageable now: install Ollama, pull an uncensored model that fits your tier, point a companion front-end at the local API (127.0.0.1:11434), and you’re running. Ember is built for this reader — a one-time purchase, runs 100% on your own machine via Ollama, no subscription and no server.
Upgrade path on a budget
You don’t need a $1,600 flagship to get there. Ranked by value:
- Already have a 6GB card? Start there today with a 7B model. It’s a real companion; you’ll learn what you actually want before spending a cent.
- Best budget upgrade: a used or new 12GB RTX 3060 is the classic value pick — cheap, 12GB of VRAM, and enough for a strong, long-memory companion.
- The 16GB sweet spot: an RTX 4060 Ti 16GB gets you long-session, emotionally consistent roleplay without flagship pricing — arguably the best companion-per-dollar card.
- Going Apple or mini-PC? A Mac with 16–32GB unified memory runs companions well and sips power — a strong option if you’re not a gamer.
- Endgame on a budget: a used RTX 3090 (24GB) often beats a new mid-card for local AI specifically, because raw VRAM is what you’re buying.
The pattern: prioritize VRAM over raw speed. A “slower” 16GB card runs a warmer, longer-memory companion than a “faster” 8GB one. Memory is the constraint that decides the experience.
Whatever your spec check said, you’ve got a clear next move. If your GPU cleared the bar, Ember lets you own the whole thing — model, memory, and privacy — on your own hardware for a one-time price. And if the math didn’t work out, Freya gives you the same warm companion in the browser with zero setup, so the answer to “can my PC run it?” never has to be a dead end.
