RAM vs VRAM for Local AI: Which One Actually Matters?

VRAM decides what local AI models you can run; system RAM is the floor, not the lever. The GB-per-billion rule, the 2x-RAM tip, CPU-offload, and Apple unified

If you’re spec’ing a machine to run AI locally, you’ve probably hit the same confusing wall everyone does: the GPU has its own “VRAM,” your motherboard has “system RAM,” and every forum thread seems to give a different answer about which one matters. Here’s the truth, stated plainly up front: for running language models on a GPU, VRAM is the number that decides what’s even possible, and system RAM is the supporting cast. They are not interchangeable, they are not the same speed, and buying 64GB of system RAM does almost nothing for you if your graphics card only has 8GB of VRAM. This guide explains exactly why, with the real numbers and a buying order you can act on.

The one-line answer: VRAM decides feasibility, RAM is the supporting cast

A local LLM runs fast when the entire model lives inside the memory attached directly to the chip doing the math. On a discrete NVIDIA or AMD GPU, that memory is VRAM. The model’s weights have to fit inside VRAM for you to get the snappy, real-time responses people install local AI for in the first place.

System RAM — your regular DDR4 or DDR5 sticks — is plentiful and cheap, but it sits across a slow bus relative to VRAM, and it feeds the CPU, not the GPU. For GPU inference, system RAM mostly handles loading the model file off disk and staging it before it goes to the card. Once the model is resident in VRAM, your DDR5 is largely a spectator.

So the question “does RAM matter for local AI” has a two-part answer: yes, you need enough of it, but past a sane threshold, more RAM doesn’t make your models smarter or faster. VRAM is the lever. RAM is the floor.

	VRAM	System RAM
Attached to	The GPU	The CPU
Relative bandwidth	Very high	~10x lower than VRAM
Role in GPU inference	Holds the model that’s actually running	Loads/stages the file, OS overhead
Decides what you can run?	Yes	Only as a fallback
Worth overspending on?	Yes	No

Why spilling from VRAM to RAM is 5-30x slower (and why models ‘feel broken’)

Here’s the failure mode that confuses newcomers. When a model is slightly too big for your VRAM, the inference engine (Ollama, llama.cpp, etc.) doesn’t refuse to run it — it quietly offloads the leftover layers into system RAM and runs those layers on the CPU. The model “works,” but it crawls.

The reason is bandwidth. LLM inference is memory-bandwidth-bound: generating each token requires streaming the model’s weights through the compute unit, over and over. VRAM on a modern GPU delivers that stream at hundreds of gigabytes to roughly a terabyte per second. Dual-channel desktop system RAM delivers a small fraction of that. So the moment even a handful of layers spill to RAM, your generation speed collapses to the speed of the slowest path — and in practice people see a 5-30x slowdown depending on how many layers spilled and how slow their RAM and CPU are.

This is why a model “feels broken”: it’s not broken, it’s bandwidth-starved. A 7B model that fully fits in VRAM might produce tokens faster than you can read; the same model with 20% of its layers in RAM can drop to a painful word-every-second crawl. A model that fully fits will always beat a bigger one that doesn’t. If you’ve ever wondered why your setup is sluggish, this is usually the culprit — and the fix is choosing a model (or a tighter quantization) that fits entirely in VRAM. For what counts as “fast enough,” see our breakdown of usable tokens per second.

VRAM sizing: the rough GB-per-billion-params rule at Q4

You can predict whether a model fits before downloading a single byte. The dominant factor is the quantization — the precision the weights are stored at. The community default, Q4_K_M, stores weights at roughly half a byte each, which gives a handy rule of thumb:

At Q4, a model needs roughly 0.5-0.6 GB of VRAM per billion parameters, plus overhead for context.

So the quick mental math:

Model size	Approx. VRAM at Q4_K_M	Comfortable on
7-8B	~5-6 GB	8GB card (tight), 12GB (easy)
12-14B	~9-11 GB	12GB card
24B	~14-16 GB	16GB card
32B	~20-22 GB	24GB card
70B	~40-48 GB	dual GPUs / 48GB+

Two things people forget. First, the context window (KV cache) also lives in VRAM and grows with how much conversation history you keep open — budget a couple of extra GB and more for very long contexts. Second, lower quants (Q3, Q2) let you squeeze a bigger model into a smaller card, but quality degrades. The full precision-vs-size tradeoff is laid out in our GGUF quantization cheat sheet, and the complete sizing formula lives in the local AI hardware guide. If you’re sizing specifically for a chat companion, how much VRAM for a local AI companion maps model classes to personality quality.

System RAM: the 2x-VRAM rule of thumb and when 32GB vs 64GB matters

So how much RAM for local AI do you actually need? A reliable rule of thumb is to have at least as much system RAM as VRAM, and ideally 2x — enough to comfortably load the model file off disk, stage it, and leave headroom for your OS and apps while inference runs on the GPU.

In practice:

16GB system RAM — fine for an 8GB GPU running 7-8B models, if you’re not also running a heavy browser and a dozen apps.
32GB system RAM — the sweet spot for most local-AI desktops. Pairs well with a 12-24GB GPU, gives you room to keep other software open, and handles the larger model files staging into VRAM without thrashing.
64GB system RAM — worth it if you do CPU offload of big models, run multiple models, do heavy document/RAG work alongside the model, or keep large datasets in memory. For pure GPU inference of a single model that fits in VRAM, 64GB is comfort, not necessity.

The honest framing for VRAM vs system RAM for an LLM: once you clear the “enough RAM to load it” bar, additional system RAM does not increase model quality or GPU speed. Spend the marginal dollar on VRAM instead.

The CPU-offload exception: when RAM bandwidth (DDR5 vs DDR4) becomes the lever

There’s one scenario where system RAM stops being a spectator and becomes the main event: when you can’t fit the model in VRAM at all and deliberately run partly or entirely on the CPU. This is common for people with weak GPUs or no GPU, and for anyone trying to run a 70B-class model on a single consumer card. (Our guide to running local AI without a GPU covers this path in full.)

In CPU-offload mode, generation speed is dictated by RAM bandwidth, not capacity. This is the one case where DDR5 vs DDR4 actually matters for AI: DDR5’s higher bandwidth (and especially dual-channel, fully-populated configurations) can meaningfully outpace DDR4 when the CPU is doing the token generation. Memory channels matter too — a dual-channel config roughly doubles effective bandwidth over single-channel, which is why a laptop with one stick can feel far slower than the same chip with two.

But set expectations: even the best DDR5 is still an order of magnitude behind VRAM. CPU offload makes large models possible, not fast. It’s a “run it overnight” or “I only send one message at a time” solution, not a real-time chat experience.

Does the CPU itself matter? Mostly no for GPU inference — don’t overspend

This trips up people coming from gaming or video-editing builds, where the CPU is a headline component. For GPU-based LLM inference, the CPU barely matters. The GPU does essentially all the math; the CPU’s job is to load the model, manage the OS, and shuttle data. A modern mid-range CPU is completely sufficient. You do not need a flagship chip, and spending your budget there instead of on VRAM is the single most common build mistake.

The CPU only becomes relevant in two cases: CPU-offload/no-GPU inference (where core count and RAM bandwidth drive speed), and very heavy preprocessing pipelines. For a normal GPU setup, buy a sensible CPU and put the saved money toward a bigger card. The best budget AI PC build guide allocates a build around exactly this priority.

Apple Silicon’s unified memory: where this whole model breaks down (in a good way)

Everything above assumes a discrete GPU with its own walled-off VRAM. Apple Silicon (M-series) throws out that wall. It uses unified memory — a single, fast pool that the CPU and GPU share. There is no tiny separate “VRAM” ceiling: a Mac with 32GB or 64GB of unified memory can dedicate most of that pool to a model.

The practical upshot is that the VRAM-vs-RAM distinction collapses into one number: buy as much unified memory as you can afford. A 64GB Mac can load models that would require a 48GB+ discrete GPU costing far more. Apple’s memory bandwidth is also far higher than typical desktop DDR, so it punches well above a PC with the same nominal RAM — though it still won’t match a high-end NVIDIA card on raw throughput. If a Mac is your route, Mac mini for local AI walks through the sweet-spot configurations.

Buyer takeaways: spend on VRAM first, then RAM, then everything else

If you remember nothing else, remember the priority order:

VRAM first. It’s the constraint that decides which models you can run well. More VRAM = bigger, smarter models at full speed. This is where your money should go.
Enough system RAM second. 32GB is the comfortable default for a GPU build; 16GB is a workable floor; 64GB only pays off for CPU offload, multi-model, or heavy RAG workloads.
A sensible CPU third. Mid-range is plenty for GPU inference. Don’t overspend here.
Fast/dual-channel RAM matters only for CPU offload. DDR5 over DDR4 is a real lever when there’s no GPU in play — otherwise it’s a rounding error.
Apple Silicon collapses the choice into one number: maximize unified memory.

Before you buy anything, sanity-check what your current machine can already handle — our can my PC run an AI companion guide turns your specs into a concrete yes/no.

Once you know your VRAM budget, the next step is putting it to work. If you’ve got a capable GPU, Ember runs a fully local, uncensored AI companion entirely on your own machine over Ollama — your VRAM, your data, nothing leaving the box. And if your hardware isn’t there yet (or you just want it working today with zero setup), Freya delivers the same kind of companion from the cloud — no GPU, no install, no shopping list.

RAM vs VRAM for Local AI: Which One Actually Matters?

The one-line answer: VRAM decides feasibility, RAM is the supporting cast

Why spilling from VRAM to RAM is 5-30x slower (and why models ‘feel broken’)

VRAM sizing: the rough GB-per-billion-params rule at Q4

System RAM: the 2x-VRAM rule of thumb and when 32GB vs 64GB matters

The CPU-offload exception: when RAM bandwidth (DDR5 vs DDR4) becomes the lever

Does the CPU itself matter? Mostly no for GPU inference — don’t overspend

Apple Silicon’s unified memory: where this whole model breaks down (in a good way)

Buyer takeaways: spend on VRAM first, then RAM, then everything else

Ember — own it

Freya — no setup

RAM vs VRAM for Local AI: Which One Actually Matters?

The one-line answer: VRAM decides feasibility, RAM is the supporting cast

Why spilling from VRAM to RAM is 5-30x slower (and why models ‘feel broken’)

VRAM sizing: the rough GB-per-billion-params rule at Q4

System RAM: the 2x-VRAM rule of thumb and when 32GB vs 64GB matters

The CPU-offload exception: when RAM bandwidth (DDR5 vs DDR4) becomes the lever

Does the CPU itself matter? Mostly no for GPU inference — don’t overspend

Apple Silicon’s unified memory: where this whole model breaks down (in a good way)

Buyer takeaways: spend on VRAM first, then RAM, then everything else

Ember — own it

Freya — no setup

Related guides

Local AI Hardware Guide: How Much VRAM & RAM You Need (2026)

Is a Used RTX 3090 Still the Best Value for Local AI in 2026?

Is the RTX 3060 12GB Good Enough for an AI Companion?