“How much VRAM do I need for local AI?” is the question that decides whether running a model on your own machine is a five-minute win or a frustrating crawl. The honest answer is that it’s not one number — it depends on the model’s parameter count, the quantization you pick, and how much context you keep open. But it’s also completely predictable once you know the formula. This guide gives you that formula, a model-size-to-memory cheat sheet, the hidden memory costs nobody warns you about, and a tier-by-tier map so you can look at your GPU and know exactly what it can run before you download a single byte.

The short version: VRAM is the constraint that matters, your model has to fit inside it to run fast, and a model that fully fits in VRAM will almost always beat a bigger one that doesn’t.

VRAM vs RAM vs unified memory: what each does for inference

Three kinds of memory show up in this conversation, and conflating them is the #1 reason people buy the wrong machine.

  • VRAM (video memory) is the fast memory soldered onto your dedicated GPU (an NVIDIA RTX card, for example). LLM inference is bottlenecked by memory bandwidth, and VRAM is roughly an order of magnitude faster than system RAM. When the whole model lives in VRAM, the GPU streams weights at full speed and you get snappy, real-time responses. This is the number that defines what you can run well.
  • System RAM is your regular DDR4/DDR5 memory attached to the CPU. It’s plentiful and cheap, but slow for this job. A model running in RAM uses the CPU to do the math, which is dramatically slower than a GPU. RAM is the fallback, not the goal.
  • Unified memory is the Apple Silicon trick (and some newer integrated designs): a single pool of fast memory the CPU and GPU share. A Mac with 32GB or 64GB of unified memory can load surprisingly large models because there’s no separate, tiny VRAM ceiling — the whole pool is usable, and Apple’s bandwidth is far better than typical desktop RAM. It won’t match a high-end discrete GPU on raw speed, but it punches well above a PC with the same nominal memory. (See Mac mini for local AI if you’re weighing that route.)

The rule to internalize: discrete GPU = your VRAM number is the budget. Apple Silicon = your unified memory number is the budget (minus what the OS needs). No GPU = system RAM is the budget, and you pay for it in speed.

The sizing formula: params × bytes-per-param × 1.2

Here’s the core math. To estimate the memory a model needs:

memory (GB) ≈ (billions of params) × (bytes per parameter) × 1.2

The bytes per parameter is set by the quantization — the precision the weights are stored at. Lower precision = smaller file = less memory, with a gentle quality trade-off:

| Quant | Bytes/param | Quality | Notes |
|---|---|---|---|
| FP16 / BF16 (full) | 2.0 | Reference | What the model was trained at; rarely needed locally |
| Q8_0 | ~1.0 | Near-lossless | Great if it fits |
| Q4_K_M | ~0.5 | Excellent | The community default sweet spot |
| Q4_K_S / Q3_K_M | ~0.4 | Good → fair | Squeeze a model into a tighter card |
| Q2_K | ~0.3 | Noticeably degraded | Last resort |

The × 1.2 is overhead — the runtime, a baseline context window, and working buffers. It’s an approximation, not a guarantee, but it keeps you honest.

Cheat-sheet: approximate VRAM to load a model at Q4_K_M (the quant most people should use):

| Model size | Q4_K_M weights | ~VRAM to run comfortably |
|---|---|---|
| 3B | ~2 GB | 4 GB+ |
| 7–8B | ~4.5–5 GB | 8 GB |
| 13–14B | ~8–9 GB | 12 GB |
| 22–24B | ~13–14 GB | 16 GB (tight) / 24 GB (comfortable) |
| 32–34B | ~19–20 GB | 24 GB |
| 70B | ~40–43 GB | 48 GB+ (or dual 24GB cards) |

A worked example: a 7B model at Q4_K_M is 7 × 0.5 × 1.2 ≈ 4.2 GB, which is why 8B-class models are the natural home of an 8GB card — with headroom left for context. For the full quant breakdown, the GGUF quantization cheat sheet goes deeper.
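If you’d rather not do the multiplication by hand, the formula fits in a few lines of Python. This is a minimal sketch, not a tool from any runtime: the function name `estimate_memory_gb` and the lookup table are made up for illustration, and the bytes-per-parameter values are the approximate ones from the quant table above.

```python
# Approximate bytes per parameter, taken from the quant table in this guide.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q4_K_S": 0.4,
    "Q3_K_M": 0.4,
    "Q2_K": 0.3,
}

OVERHEAD = 1.2  # runtime, baseline context window, working buffers


def estimate_memory_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Rough estimate: billions of params x bytes per param x 1.2."""
    return params_billions * BYTES_PER_PARAM[quant] * OVERHEAD


# Print a quick estimate for a few common model sizes at the default quant.
for size in (3, 7, 14, 32, 70):
    print(f"{size}B at Q4_K_M: ~{estimate_memory_gb(size):.1f} GB")
```

Treat the result the same way as the formula itself: an approximation that gets you into the right tier, not a guarantee.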

The hidden tax: KV-cache and long-context memory growth

The cheat-sheet above covers loading the weights. It does not cover the KV-cache — the memory that holds the conversation as you talk. This is the cost everyone forgets, and it’s why a model that loads fine can run out of memory three hours into a chat.

Every token in your context window — the system prompt, the character card, the whole back-and-forth — gets cached so the model doesn’t recompute it each turn. The cache grows with context length, and it can add anywhere from a few hundred megabytes to several gigabytes on top of the weights. A long roleplay or a document you’ve pasted in eats VRAM continuously.

Practical implications:

  • Leave a buffer. Don’t pick a model that fills your VRAM to 99% on load. Leave 1–3GB of headroom for context, more if you run long sessions.
  • Context length is a memory dial. Running an 8K context is cheap; a 32K or 128K context can cost as much memory as the model itself. Set the context to what you actually need.
  • KV-cache quantization helps. Many runtimes can store the cache at lower precision (e.g. Q8 or Q4 KV), roughly halving or quartering its footprint with minimal quality loss — a real lifesaver on smaller cards.
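To put rough numbers on that dial, the usual accounting for the cache is two tensors (keys and values) per layer, each storing one vector per KV head for every token. The sketch below assumes that accounting and ignores any runtime-specific padding. The function name `kv_cache_gib` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128) is an assumed example of an 8B-class model with grouped-query attention, so check your model’s config before trusting the output.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV-cache size in GiB.

    Keys and values (the factor of 2) are stored for every layer,
    every KV head, and every token in the context window.
    bytes_per_value is 2.0 for an FP16 cache, roughly 1.0 for a Q8 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3


# Assumed shape for an 8B-class model; real models differ.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (8_192, 32_768, 131_072):
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context)
    q8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context, bytes_per_value=1.0)
    print(f"{context:>7} tokens: FP16 cache ~{fp16:.1f} GiB, Q8 cache ~{q8:.1f} GiB")
```

Under those assumed dimensions an 8K context costs about 1 GiB at FP16, 32K about 4 GiB, and 128K about 16 GiB, which is how a long context ends up rivaling the weights of a Q4_K_M 8B model.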

If persistent, long-running conversations are your use case, local AI with persistent memory explains how to keep continuity without blowing the cache up.

The ‘largest quant that fully offloads’ rule

This is the single most important rule in local AI, and it’s counterintuitive:

A smaller model (or lower quant) that fits entirely in VRAM beats a bigger one that spills into system RAM — almost always.

When a model doesn’t fit in VRAM, the runtime “offloads” the overflow layers to system RAM and runs them on the CPU. The moment that happens, your speed falls off a cliff, often from a comfortable 30–50 tokens/second down to 2–5, because every token now has to pass through the spilled layers and wait on slow RAM and the CPU. A partial offload is the worst of both worlds: you pay for the GPU and still run at close to CPU speed.

So the decision tree for any given card is:

  1. Pick the biggest model whose Q4_K_M weights + a context buffer fit fully in your VRAM.
  2. If a model you want is slightly too big, drop the quant (Q4_K_S, Q3_K_M) to make it fit fully — a tighter quant in VRAM beats a fatter quant that spills.
  3. Only accept CPU offload if you genuinely need a model your hardware can’t hold and you can tolerate slow output.
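The three steps above can be sketched as a small helper. Everything here is illustrative: the function names are hypothetical, the weight sizes are the upper ends of the cheat-sheet ranges, the 2 GB default buffer is just a middle value from the 1–3GB guidance, and the tighter quant is modeled as a flat 0.4/0.5 scaling of the Q4_K_M size, which is an assumption rather than a measured file size.

```python
# Approximate Q4_K_M weight sizes in GB (upper ends of the cheat-sheet ranges).
Q4_K_M_WEIGHTS_GB = {
    "3B": 2.0, "8B": 5.0, "14B": 9.0, "24B": 14.0, "34B": 20.0, "70B": 43.0,
}

# Assumed shrink factor for Q3_K_M (~0.4 bytes/param) vs Q4_K_M (~0.5).
TIGHTER_QUANT_SCALE = 0.4 / 0.5


def biggest_full_fit(vram_gb: float, context_buffer_gb: float = 2.0):
    """Step 1: largest model whose Q4_K_M weights plus a buffer fit in VRAM."""
    budget = vram_gb - context_buffer_gb
    fitting = [(gb, name) for name, gb in Q4_K_M_WEIGHTS_GB.items() if gb <= budget]
    return max(fitting)[1] if fitting else None


def quant_for(model: str, vram_gb: float, context_buffer_gb: float = 2.0):
    """Steps 2 and 3: the quant that keeps a chosen model fully on the GPU."""
    budget = vram_gb - context_buffer_gb
    weights = Q4_K_M_WEIGHTS_GB[model]
    if weights <= budget:
        return "Q4_K_M"
    if weights * TIGHTER_QUANT_SCALE <= budget:
        return "Q3_K_M"  # tighter quant, still fully in VRAM
    return None  # would spill to system RAM: accept offload or go smaller


for card in (8, 12, 16, 24, 48):
    print(f"{card} GB card -> {biggest_full_fit(card)} at Q4_K_M")
```

Returning `None` instead of a quant is deliberate: it marks the point where you are choosing between CPU offload and a smaller model, which is a decision the helper shouldn’t make for you.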

“Fully offloads” in Ollama/llama.cpp terms means all layers on the GPU. After you ollama run a model, run ollama ps in a second terminal and check that the PROCESSOR column reports 100% GPU. If it shows a CPU/GPU split instead, you’ve either picked too big a model or too long a context.

Hardware tiers: what each VRAM class actually runs

Here’s the map. Find your card, find your ceiling.

| VRAM | Sweet-spot models (Q4_K_M) | Experience |
|---|---|---|
| 8 GB (RTX 3060 Ti, 4060, 3070) | 7–8B class, short-to-medium context | Fast, genuinely usable; the entry point that works |
| 12 GB (RTX 3060 12GB, 4070) | 13–14B class, or 8B with big context | The value tier; noticeably smarter than 8B |
| 16 GB (RTX 4060 Ti 16GB, 4070 Ti Super) | Up to ~22–24B (tight), comfortable 14B | Excellent all-rounder for companions and writing |
| 24 GB (RTX 3090, 4090) | 32–34B class, or 24B with huge context | Flagship local experience; near-frontier feel |
| 48 GB+ / dual-GPU | 70B class | Enthusiast/pro; serious investment |

For card-specific picks: best local LLM for 8GB VRAM, best local LLM for 12GB & 16GB VRAM, and VRAM for a 70B model. If you’re spec’ing a fresh machine, the best budget AI PC build maps these tiers to actual parts.

A note on the 8GB tier: it gets dismissed online, but an 8B model on an 8GB card is a real, fast, capable assistant — and it’s where most people should start. Don’t let forum snobbery talk you into a card you don’t need.

No GPU or a weak GPU: the honest CPU reality

If you have integrated graphics, an old laptop, or a 4GB card, here’s the truth nobody likes to say: you can run local AI on CPU + system RAM, but it will be slow. A 7B model on a modern CPU might give you 3–8 tokens/second — readable, but it’s a typewriter, not a conversation. Anything bigger crawls.

What helps:

  • More RAM, not more cores — you need enough RAM to hold the model (16GB lets you run 7–8B comfortably on CPU).
  • Small models. A 3B model on CPU is genuinely tolerable for quick tasks.
  • Patience. It works; it just isn’t live-chat fast.

The full reality check, with realistic expectations, is in run local AI without a GPU and the do I need a GPU for an AI companion breakdown.

If your hardware can’t deliver a responsive experience and you don’t want to buy a GPU, there’s an honest off-ramp: run it in the cloud instead. A hosted service like Freya puts a capable companion on a remote GPU you never have to think about — zero setup, no VRAM math, works on any laptop or phone. You trade the privacy and one-time-cost advantages of local for instant, frictionless access. That trade-off is laid out plainly in local AI vs cloud AI.

Tokens per second: what speed is actually usable

Memory determines whether you can run a model; bandwidth determines how fast. Here’s a rough feel for tokens/second (tok/s):

| Tok/s | Feel |
|---|---|
| < 5 | Painful for chat; fine for batch/overnight tasks |
| 5–10 | Tolerable; faster than you read but not snappy |
| 15–30 | Comfortable live chat (the target zone) |
| 30–60+ | Excellent; feels instant, great for voice and roleplay |

A model fully loaded on a decent GPU lands in that 30–60 range easily. The same model spilling to CPU drops below 5 — which is exactly why the “fits fully in VRAM” rule matters more than raw model size. For a deeper look at where the usable threshold really sits, see tokens per second: what’s actually usable.
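One way to see why spilling hurts so much is a back-of-envelope ceiling: if generating each token means reading every weight once, throughput can’t exceed memory bandwidth divided by model size. That is a common rule of thumb, not a benchmark, and real speeds land well below it. The function name and both bandwidth figures below are illustrative assumptions, not measurements of any specific card or RAM kit.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical ceiling on generation speed for a bandwidth-bound model.

    Assumes each generated token requires one full pass over the weights.
    """
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.2  # the 7B Q4_K_M estimate from the worked example

# Illustrative round numbers only.
ASSUMED_GPU_BANDWIDTH = 400.0  # GB/s, a mid-range discrete GPU
ASSUMED_RAM_BANDWIDTH = 60.0   # GB/s, dual-channel desktop RAM

print(f"VRAM ceiling: ~{max_tokens_per_second(ASSUMED_GPU_BANDWIDTH, MODEL_GB):.0f} tok/s")
print(f"System RAM ceiling: ~{max_tokens_per_second(ASSUMED_RAM_BANDWIDTH, MODEL_GB):.0f} tok/s")
```

The absolute numbers matter less than the ratio: the same model has a much lower ceiling in system RAM before any CPU compute cost is counted.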

Verdict tree: can your machine run it?

Walk this top to bottom:

  1. Do you have a discrete GPU with 8GB+ VRAM (or an Apple Silicon Mac with 16GB+ unified)? → Yes. You can run local AI well. Match your VRAM to the tier table, pick the biggest Q4_K_M model that fits with a context buffer, install Ollama (curl -fsSL https://ollama.com/install.sh | sh), and ollama run it. This is the privacy-and-ownership path: your data never leaves the machine, you pay once, and nothing is censored or logged by a third party. Ember is built exactly for this: a companion that runs 100% on your own hardware via Ollama, sold once, with no cloud in the loop.

  2. Do you have a weak GPU or 16GB+ system RAM but no real GPU? → You can run small models on CPU (3–8B), accepting slower speeds. If that’s fine for your use, go local. If you want it to feel instant, see the next branch.

  3. No capable hardware, or you just want it to work right now with zero setup? → Freya. A hosted companion runs on a remote GPU, so there’s no VRAM math, no install, and it works on any device. You give up the local-only privacy guarantee, but you get instant access.
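The same tree reads naturally as code. This is only a sketch of the three branches above: the function name `verdict` and its return labels are invented for illustration, and the thresholds are the ones stated in the list.

```python
def verdict(vram_gb: float = 0.0, unified_gb: float = 0.0,
            system_ram_gb: float = 0.0) -> str:
    """Walk the three branches in order and return the first that applies."""
    # Branch 1: discrete GPU with 8GB+ VRAM, or Apple Silicon with 16GB+ unified.
    if vram_gb >= 8 or unified_gb >= 16:
        return "local-gpu"
    # Branch 2: a weak GPU, or 16GB+ system RAM with no real GPU.
    if vram_gb > 0 or system_ram_gb >= 16:
        return "local-cpu-small-models"
    # Branch 3: no capable hardware.
    return "hosted"


print(verdict(vram_gb=12))                    # 12 GB discrete card
print(verdict(unified_gb=16))                 # 16 GB Apple Silicon Mac
print(verdict(vram_gb=4, system_ram_gb=8))    # weak 4 GB GPU
print(verdict(system_ram_gb=8))               # older laptop, no GPU
```

The function only covers the hardware side of the decision. If you land in the first or second branch but want zero setup, the third branch is still open to you.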

For the broader hosted-vs-local decision, local AI vs cloud AI is the companion piece to this guide.


Once you know your VRAM number, the rest is easy: pick the right tier, leave room for context, and keep the whole model on the GPU. If your machine can hold an 8B model, Ember gives you a fully local, uncensored companion you own outright — and if it can’t, or you’d rather skip the hardware entirely, Freya runs the same kind of experience in the cloud with nothing to install.