If Ollama just threw CUDA error: out of memory or refused a model with a 500 and the words “requires more system memory than available”, the fix is almost never “buy a bigger GPU.” It’s understanding what’s actually eating your VRAM and turning the right knobs. A 7B model that fits comfortably at one setting will overflow your card at another — and the setting that breaks people most often isn’t the model size at all. This guide gives you a VRAM ladder: a fast win first, then three levers you pull in order, then the exact reference numbers so you stop guessing.

What the error actually means: weights + KV cache exceed available VRAM

Your GPU has to hold two big things to run a model, not one.

  1. The model weights. This is the file size you downloaded, give or take. A 7B model at Q4_K_M is roughly 4.4 GB; the same model at Q8_0 is around 7.7 GB. A true 8B (like the llama3.1:8b used as the running example below) lands a bit higher — about 4.9 GB at Q4_K_M and ~8.5 GB at Q8_0 — which matters on an 8 GB card where that half-gig is exactly the margin. This part is fixed once you pick a model and quant.
  2. The KV cache (key/value cache). This grows with your context window (num_ctx). Every token you let the model “remember” costs VRAM, continuously, for the whole session. This is the part nobody budgets for, and it’s why a model that loaded yesterday suddenly OOMs today when you bumped the context.

When weights + KV cache + a bit of CUDA overhead is larger than your free VRAM, you get CUDA out of memory. Ollama will sometimes silently spill layers to system RAM instead of crashing — which “works” but tanks your tokens/sec. So the real goal isn’t just “make it stop crashing,” it’s “make it fit on the GPU at a usable speed.”

Two numbers to keep in your head:

  • Free VRAM, not total. Your desktop, browser, and especially other model sessions are already holding some. Run nvidia-smi to see what’s actually free.
  • KV cache scales with num_ctx. Double the context, roughly double the cache. This is the single biggest lever most people ignore.
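
The budget check above boils down to one inequality. Here's a minimal sketch of it, assuming a 0.5 GB allowance for CUDA overhead (an illustrative placeholder, not a figure Ollama publishes) and hypothetical helper names `vram_needed_gb` and `fits_on_gpu`:

```python
def vram_needed_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 0.5) -> float:
    """Rough VRAM footprint: weights + KV cache + runtime overhead.

    overhead_gb is an assumed placeholder for the CUDA context and scratch
    buffers; it is not a number reported by Ollama.
    """
    return weights_gb + kv_cache_gb + overhead_gb


def fits_on_gpu(free_vram_gb: float, weights_gb: float, kv_cache_gb: float,
                overhead_gb: float = 0.5) -> bool:
    """True when the estimated footprint fits inside *free* (not total) VRAM."""
    return vram_needed_gb(weights_gb, kv_cache_gb, overhead_gb) <= free_vram_gb


# An 8B model at Q4_K_M (about 4.9 GB of weights) on a card with 7 GB free.
# The cache sizes are illustrative: a small context vs. a large one.
print(fits_on_gpu(7.0, 4.9, 1.0))  # True:  4.9 + 1.0 + 0.5 = 6.4 GB
print(fits_on_gpu(7.0, 4.9, 4.0))  # False: 4.9 + 4.0 + 0.5 = 9.4 GB
```

Same model, same card, same quant: only the cache size changed, and that alone flips the answer.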

Quick win: ollama stop to free VRAM stuck from a previous model

Before you change any settings, check whether you’re OOMing because a previous model is still resident in VRAM. Ollama keeps a model loaded after the last request (the keep_alive window, 5 minutes by default) so the next request is instant. If you switch models or re-run with new options, the old one may still be camped on your card.

See what’s loaded right now:

ollama ps

If you see a model sitting there, unload it:

ollama stop <model-name>

Or just confirm with the GPU directly:

nvidia-smi

Look at the ollama / ollama_llama_server process and how much memory it holds. If you stop the model and your free VRAM jumps back up, that was your whole problem — you were trying to load a second model into space the first one never released. On a single 8–12 GB card you generally want one model resident at a time. If this is a recurring headache, see our deeper writeup on why Ollama won’t use your GPU, which covers the related case where the model loads but never touches CUDA at all.
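
If you'd rather read free VRAM as a number than eyeball the full nvidia-smi table, the tool has a CSV query mode. The sketch below is one way to use it, assuming `nvidia-smi` is on your PATH. The helper names are hypothetical and the sample row is made up for illustration:

```python
import subprocess

# With "nounits", nvidia-smi prints bare numbers in MiB, one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.free,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict[str, int]]:
    """Turn nvidia-smi CSV rows into a list of per-GPU MiB figures."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, free, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "free_mib": free, "total_mib": total})
    return gpus


def read_gpu_memory() -> list[dict[str, int]]:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Parsing a made-up sample row so the example runs without a GPU:
print(parse_gpu_memory("5120, 3072, 8192\n"))
```

Call `read_gpu_memory()` before and after `ollama stop` to see exactly how much the unloaded model gave back.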

Lever 1: reduce num_ctx — the KV-cache math most people get wrong

Here’s the mistake: people set a huge context “just in case” — 32k, 128k — and wonder why an 8 GB card chokes on a 7B model that should fit fine. The weights might be 4.4 GB, but a 32k context can add several gigabytes of KV cache on top. The cache is sized for the maximum context you declared, whether or not your actual conversation is that long.

The cache cost depends on the model’s architecture (number of layers, attention-head dimensions, and whether it uses grouped-query attention), so no single per-token figure holds across models, but for any given model the cost is linear in num_ctx. Halving the context roughly halves the cache. If you OOM at num_ctx 16384, try 8192, then 4096. Most chat and roleplay sessions are perfectly happy at 4k–8k.
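
To put rough numbers on that linear relationship, here's a back-of-the-envelope estimate for standard transformer attention: one key tensor and one value tensor per layer, each holding `n_kv_heads * head_dim` values per token. It's a sketch under stated assumptions: a single sequence (no parallel request slots), an f16 cache, and the published Llama 3.1 8B architecture figures (32 layers, 8 KV heads, head dimension 128). What Ollama actually allocates can differ, and `kv_cache_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_element: float = 2.0) -> float:
    """Estimated KV cache size for one sequence.

    The leading 2 covers the key tensor plus the value tensor in each layer.
    bytes_per_element defaults to 2.0 for an f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_element


# Assumed architecture figures for Llama 3.1 8B (grouped-query attention).
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 16384, 32768):
    size = kv_cache_bytes(num_ctx=ctx, **LLAMA31_8B)
    print(f"num_ctx={ctx:>6}: {size / GIB:.2f} GiB")
# num_ctx=  4096: 0.50 GiB
# num_ctx=  8192: 1.00 GiB
# num_ctx= 16384: 2.00 GiB
# num_ctx= 32768: 4.00 GiB
```

That's 128 KiB per token for this model. An older 7B without grouped-query attention (32 KV heads instead of 8) would cost four times as much at the same context, which is why the architecture matters.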

Set it per-session inside ollama run with the /set parameter command:

/set parameter num_ctx 4096

Or pin it in a Modelfile / via the API options field so it’s permanent. (Ollama’s default context is modest — often 2k–4k — so if you raised it and then started OOMing, that’s your culprit. If you actually need a big context window and have the VRAM headroom, we walk through doing it safely in how to increase the Ollama context window.)
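
For the permanent route, a Modelfile takes a line like `PARAMETER num_ctx 4096`, and the REST API takes the same setting under `options`. Below is a minimal sketch of the API version, assuming Ollama is listening on its default local address. The helper names are hypothetical, and the request is only sent if you call `generate` yourself.

```python
import json
import urllib.request

# Default local Ollama address; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Request body for /api/generate with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_generate_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Show the payload without contacting a server:
print(json.dumps(build_generate_payload("llama3.1:8b", "Hello", 4096), indent=2))
```

Setting `num_ctx` per request like this means a client that forgets the option falls back to the server default rather than to whatever huge value you tried last week.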

Rule of thumb: drop num_ctx first. It’s free, it’s reversible, and it usually buys more VRAM than people expect.

Lever 2: quantize the KV cache, then drop to a smaller weight quant

If trimming context isn’t enough, attack the two memory consumers directly.

Quantize the KV cache. By default the cache is stored in f16. You can cut it roughly in half by switching to 8-bit, which is usually free in quality terms. But there’s a catch that trips up almost everyone:

Important: Ollama only quantizes the KV cache when Flash Attention is on, and it leaves Flash Attention OFF by default. You have to set both on the Ollama server (not the client):

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE="q8_0" ollama serve

If you set the cache type without Flash Attention, the cache silently stays f16, you save nothing, and you get no error telling you why. On Linux with systemd, add both variables to the service override; on the desktop app, set them in your environment before launch. There’s also a q4_0 cache type that halves the cache again, but it’s more aggressive and can degrade long-context coherence — try q8_0 first.
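
Where "roughly half" comes from: the quantized cache types store values in blocks of 32 with one f16 scale per block, so they land slightly above a clean half or quarter of f16. Here's a small sketch of the scaling, using GGML's block layouts for q8_0 and q4_0. The 1 GiB starting point is an assumed example, and `scaled_cache_gib` is a hypothetical helper.

```python
# Bytes per cached element:
#   f16  : 2 bytes per value
#   q8_0 : 32 int8 values + one f16 scale  = 34 bytes per 32 values
#   q4_0 : 32 4-bit values + one f16 scale = 18 bytes per 32 values
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def scaled_cache_gib(f16_cache_gib: float, cache_type: str) -> float:
    """Estimate the cache size after switching from f16 to cache_type."""
    return f16_cache_gib * BYTES_PER_ELEMENT[cache_type] / BYTES_PER_ELEMENT["f16"]


# Starting from an assumed 1 GiB f16 cache:
for cache_type in BYTES_PER_ELEMENT:
    print(f"{cache_type:>5}: {scaled_cache_gib(1.0, cache_type):.2f} GiB")
#   f16: 1.00 GiB
#  q8_0: 0.53 GiB
#  q4_0: 0.28 GiB
```

On an 8 GB card, getting back nearly half a gigabyte per gigabyte of cache is often enough to keep an 8k context without touching the weight quant.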

Drop the weight quant. If you’re running Q5 or Q6, step down to Q4_K_M — the sweet spot most people use, with minimal quality loss for a meaningful VRAM saving. If you’re already at Q4_K_M and still over budget, Q3_K_M will squeeze it further, at a more noticeable hit to coherence:

ollama run llama3.1:8b-instruct-q4_K_M

Pull the explicit quant tag (e.g. ...-q4_K_M) so you’re certain which quant you’re loading — the bare llama3.1:8b / latest tag already maps to q4_K_M, but being explicit avoids surprises if a model’s default ever changes. If the quant tag soup is confusing — what K_M means, when Q4 is fine vs. when you’ll feel it — our GGUF quantization cheat sheet lays it all out in one table.

Order of operations: KV cache to q8_0 → weights to Q4_K_M → weights to Q3_K_M. Stop as soon as it fits.

Lever 3: cap GPU layers vs. offload to RAM — and what it costs you

If the model still won’t fit entirely on the GPU, you can split it: keep some layers on the GPU and offload the rest to system RAM/CPU. Ollama does this automatically when a model is too big, but you can also control it explicitly with num_gpu (the number of layers to place on the GPU):

/set parameter num_gpu 20

Lower the number to push more layers onto the CPU; raise it to keep more on the GPU. The trade-off is brutal and worth stating plainly:

  • Fully on GPU: fast. Tens of tokens/sec on a decent card.
  • Partially offloaded: every layer that lives in RAM is computed by the CPU at system-RAM speed, and intermediate results have to cross the PCIe bus between the two halves of the model. Even a few offloaded layers can drop you from snappy to sluggish.
  • Mostly on CPU: you’ll often land in the low single digits of tokens/sec — technically working, painful to actually use.

Offloading is a legitimate way to run a model that’s a little too big, but treat it as a last resort before switching models, not a default. If you’re routinely offloading, the honest read is that the model is too large for your card — pick a smaller one. (For what counts as “usable” speed and where the patience cliff is, see tokens per second: what’s actually usable.)
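
If you do set num_gpu by hand, a rough starting value beats guessing. The sketch below assumes weights and cache are spread evenly across layers (embedding and output tensors are ignored) and reuses an assumed 0.5 GB overhead. This is not how Ollama's own estimator works, `layers_that_fit` is a hypothetical helper, and the 40-layer count is an assumption for illustration.

```python
import math


def layers_that_fit(free_vram_gb: float, weights_gb: float, kv_cache_gb: float,
                    n_layers: int, overhead_gb: float = 0.5) -> int:
    """Rough count of layers that fit on the GPU, as a starting num_gpu.

    Assumes every layer costs the same share of weights and KV cache.
    """
    per_layer_gb = (weights_gb + kv_cache_gb) / n_layers
    usable_gb = free_vram_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# A 12-14B class model (about 8 GB at Q4_K_M) with an assumed 2 GB cache
# and an assumed 40 layers, on a card with 8 GB free:
print(layers_that_fit(8.0, 8.0, 2.0, 40))  # 30
```

Start a couple of layers below the estimate, confirm with nvidia-smi that it loads with headroom, then nudge upward.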

The ‘requires more system memory than available’ 500 error

This one looks different from a raw CUDA OOM, and the cause is usually one of three things.

Ollama’s conservative pre-flight math. Before loading, Ollama estimates how much memory the model + context + cache will need, and refuses up front if it doesn’t think it’ll fit — that’s the 500 with “model requires more system memory than available.” The number it reports is an estimate with a safety margin, so it sometimes refuses a model that would have squeaked in. The fix is the same as above: lower num_ctx, quantize the cache, or drop the weight quant until the estimate fits.

Docker / container RAM limits. If you run Ollama in Docker, the container may see far less RAM than your host has. A --memory flag, a Compose mem_limit, or Docker Desktop’s resource cap will make Ollama think the machine is tiny. Raise the container’s memory allocation (and pass the GPU through with --gpus all) so it can see real hardware.

OLLAMA_GPU_OVERHEAD set too high. This env var reserves a chunk of VRAM per GPU as a buffer, and the value is given in bytes. If it’s set aggressively (or inherited from a config you forgot about), it can eat enough headroom to trip the refusal. Check it, and unset or lower it if it’s not doing you a favor:

echo $OLLAMA_GPU_OVERHEAD
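
Because the value is a raw byte count, it's easy to misread by a factor of a thousand. Here's a small sketch that converts it to GiB. One caveat: it reads the environment of whatever process runs it, which may not match the environment the Ollama server was started with. `gpu_overhead_gib` is a hypothetical helper.

```python
import os


def gpu_overhead_gib(env=None) -> float:
    """Return OLLAMA_GPU_OVERHEAD in GiB, or 0.0 when it is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_GPU_OVERHEAD", "").strip()
    if not raw:
        return 0.0
    return int(raw) / 1024 ** 3


# Example with an explicit mapping instead of the real environment:
print(gpu_overhead_gib({"OLLAMA_GPU_OVERHEAD": "1073741824"}))  # 1.0
print(gpu_overhead_gib({}))                                     # 0.0
```

A reserve of 1 GiB is a big bite on an 8 GB card: it's the whole f16 cache for an 8B model at an 8k context.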

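The “system memory” wording is literal, but on a GPU setup the root cause is usually still the VRAM budget: whatever Ollama estimates won’t fit on the card has to land in system RAM, and the refusal fires when that spillover is more than the RAM it can see. Shrink the VRAM footprint and the spillover shrinks with it, so treat this as the same problem with a more cautious messenger.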

Reference table: approximate VRAM per model size, quant, and context

These are ballpark figures for planning; real usage varies by model architecture and your other GPU load. The two weights columns are roughly the model file as resident in VRAM; add the KV cache for your context on top. The “Comfortable card” column assumes Q4_K_M weights and a modest 4k context unless noted.

| Model size | Q4_K_M weights | Q8_0 weights | ~KV cache @ 8k (f16) | Comfortable card |
|---|---|---|---|---|
| 3B | ~2.0 GB | ~3.4 GB | ~0.5–1 GB | 6 GB+ |
| 7B | ~4.4 GB | ~7.7 GB | ~1–2 GB | 8 GB (tight), 12 GB (easy) |
| 8B | ~4.9 GB | ~8.5 GB | ~1–2 GB | 8 GB (tighter), 12 GB (easy) |
| 12–14B | ~8 GB | ~14 GB | ~2–3 GB | 12–16 GB |
| 24B | ~14 GB | ~25 GB | ~3–4 GB | 24 GB |
| 32B | ~19 GB | ~34 GB | ~4–5 GB | 24 GB (q8 cache + 4k ctx) |
| 70B | ~40 GB | ~75 GB | ~6–10 GB | 48 GB+ / dual-GPU |

Note the 7B vs 8B split: an actual 8B model (the llama3.1:8b in our example) sits about 0.5 GB above a classic 7B at the same quant. On a 12 GB+ card that’s noise; on an 8 GB card it’s often the difference between “fits with an 8k context” and “OOMs unless you trim to 4k or quantize the cache.”

Practical reads from the table: 8 GB cards are happiest with 7–8B at Q4_K_M and a 4k–8k context; 12 GB (e.g. an RTX 3060 12GB) handles 7–8B comfortably and 12–14B with care; 24 GB (e.g. a 3090/4090) opens up 24–32B; 70B wants 48 GB or two cards. For a full hardware-to-model map, see how much VRAM you need for a local AI companion.

When the model genuinely doesn’t fit your card

Sometimes the answer is simply: this model is too big for this GPU, and forcing it via heavy offload or Q3 is a bad trade. When you’ve pulled every lever and it’s still slow or refusing, you have two clean options.

Run a smaller model. A well-chosen 7–8B at Q4_K_M on an 8 GB card, or a 12–14B on 12 GB, will beat a crippled, half-offloaded 32B every time — for both speed and the actual experience, because a model running at 2 tokens/sec is unusable no matter how smart it is. Pick the largest model that fits fully on the GPU with your real context, and stop there.

Or skip the hardware problem entirely. If you don’t have the VRAM and don’t want to buy a card, a fully local setup may not be the right fit today. Running everything on your own machine is the whole point of local AI, and it’s well worth it when the hardware lines up, but it’s better to admit when it doesn’t.

If you do have the GPU and want a private, uncensored companion that runs 100% on your own machine via Ollama — no logs leaving the box, ever — Ember is built for exactly this VRAM ladder and ships tuned defaults so you don’t fight OOM errors. And if your card just can’t carry it, Freya runs the same kind of companion in the cloud with zero setup and no GPU required — so you can pick the side of the VRAM line you’re actually on.