“How much VRAM do I need to run a 70B model?” has a short answer and an honest answer. The short answer: about 40GB+ at a usable 4-bit quant once you account for context. The honest answer is that most charts you’ll find lie to you by quietly leaving out the KV cache — the per-conversation memory tax that grows with context length and can add many gigabytes on top of the weights. This page gives you the real math: a formula you can run in your head, a full chart from 7B to 70B by quantization level, and the reasons a “48GB” rig sometimes still chokes.

No marketing numbers, no “it depends” hand-waving. Just bytes.

The base formula: params × bytes-per-param × 1.2

Every VRAM estimate starts from one idea: a model is a pile of numbers (parameters), and each number occupies a fixed amount of space depending on its quantization (how many bits per weight). Multiply, then add overhead.

VRAM (GB) ≈ (params in billions × bytes per param) × 1.2

The 1.2 is a ~20% fudge factor for the inference overhead that the raw weights don’t capture: the CUDA context, activation buffers, the runtime’s own working memory, and a small KV cache at short context. It is a floor, not a ceiling — long context blows past it (more on that below).

Bytes per param is the whole game. Here’s what the common formats cost:

FormatBits/paramBytes/paramNotes
FP16 / BF16162.0Full-precision “raw” weights
Q8_08~1.0Near-lossless, big
Q6_K6~0.75Excellent quality
Q5_K_M5~0.65Strong middle ground
Q4_K_M4~0.55The default sweet spot
Q3_K_M3~0.45Noticeable quality loss

Note the q4 vram formula subtlety: a “4-bit” GGUF isn’t exactly 0.5 bytes/param. Formats like Q4_K_M keep some tensors at higher precision and add per-block scales, so real-world cost lands around 0.55–0.6 bytes/param. Using 0.55 keeps your estimate honest instead of optimistic. For a deeper breakdown of what each suffix means, see the GGUF quantization cheat sheet.

Quick sanity check on a 70B model at Q4_K_M:

70 × 0.55 × 1.2 ≈ 46 GB  (weights + base overhead)

That’s already past a single 24GB card and brushing a 48GB card — and we haven’t paid the context tax yet.

Full chart by quant from 7B to 70B

This is the vram by model size chart people actually come here for. Figures are approximate VRAM for weights plus base overhead at short context (a few thousand tokens). Add the KV cache from the next section for long chats.

ModelFP16 (16-bit)Q8_0 (8-bit)Q5_K_M (5-bit)Q4_K_M (4-bit)Q3_K_M (3-bit)
7B~16 GB~9 GB~6 GB~5.5 GB~4.5 GB
8B~18 GB~10 GB~7 GB~6 GB~5 GB
13B~28 GB~15 GB~11 GB~9 GB~7.5 GB
34B~70 GB~38 GB~27 GB~23 GB~19 GB
70B~150 GB~78 GB~55 GB~46 GB~38 GB

A note on the FP16 column: those figures are essentially raw weights (params × 2 bytes), with only a slim margin — full FP16 inference at scale actually wants more once you add the same overhead and KV cache the other columns carry. We list FP16 lightly because it’s almost never the deployment target on consumer hardware; the quantized columns are where real local builds live. Treat every number here as approximate.

Read this chart with three rules of thumb:

  • A 7B–8B model fits comfortably in 8GB at Q4, which is why it’s the entry tier. See best local LLM for 8GB VRAM and the wider hardware guide.
  • A 24GB card (3090/4090) is the 34B and “fast 13B” tier, and runs a 70B only with offload pain — covered in best local LLM for a 24GB card.
  • A clean 70B at Q4 wants ~46GB before context — a 48GB card, dual 24GB cards, or an Apple Silicon machine with enough unified memory.

The KV-cache / context tax most pages omit

Here’s what the weight-only charts hide. When the model generates, it stores a key/value cache for every token in the conversation so it doesn’t recompute the whole history each step. This kv cache vram context grows linearly with context length and is separate from the weights.

A workable estimate — and the part most pages get wrong — is that the cache stores keys and values for the model’s KV heads, not its full hidden dimension:

KV cache (GB) ≈ 2 (K and V) × layers × (n_kv_heads × head_dim) × 2 bytes × context_tokens / 1e9

The crucial term is (n_kv_heads × head_dim) — the KV projection dimension — not the full hidden_dim. Modern models use GQA (grouped-query attention), which shares a small number of KV heads across many query heads. That collapses the cache by roughly 8× versus the naive “full hidden_dim” formula older guides print.

Work it for Llama-3-70B (80 layers, 8 KV heads, head_dim 128) at a 32K context:

2 × 80 × (8 × 128) × 2 bytes × 32,768 / 1e9 ≈ 10.7 GB

That ~10.7GB is the realistic fp16 number — and it matches the table below. If you instead plug in the full hidden_dim (8192) and ignore GQA, you’d get ~86GB, which is wrong by about 8× for any GQA model. So if you find a formula that uses hidden_dim directly, treat its output as a pre-GQA upper bound and divide by ~8 for a modern model.

You don’t need to memorize the formula. The takeaways are what matter:

  • KV cache scales with context, not just model size. Double the context window, double the cache.
  • It scales with model dimensions too — but with the KV-head dimension, not the full hidden size. A 70B still has far more layers than a 7B, so each token of context costs more on the big model, just not 8× more than GQA-naive math suggests.
  • At long context it stops being a rounding error. On a 70B, a full 32K-token context adds roughly 10–20GB of KV cache on top of the ~46GB of weights (the ~10.7GB fp16 figure above sits at the bottom of that band; runtime fragmentation and headroom push real usage higher) — which is exactly how a “48GB should be plenty” rig runs out of memory mid-conversation.

Two levers shrink it further. KV-cache quantization (storing the cache at 8-bit or 4-bit instead of 16-bit) can roughly halve or quarter the figures above. And GQA is already baked into the formula — it’s the reason the cache is gigabytes, not tens of gigabytes. If your runtime lets you set a smaller context window, that’s the cheapest fix of all — most chat doesn’t need 32K tokens.

Why 70B needs 40GB+ realistically

Put the pieces together and the headline number explains itself:

Component70B @ Q4_K_M
Weights~38 GB
Inference overhead~6–8 GB
KV cache @ ~8K context~2–4 GB
KV cache @ ~32K context~10–20 GB
Realistic total~46–58 GB

(The ~2–4GB at 8K and ~10–20GB at 32K are GQA-corrected fp16 ranges; the low end is the bare cache, the high end leaves room for runtime overhead and KV fragmentation across model families.)

So when someone says “you need 40GB+ for a 70B,” they’re describing the weights-plus-a-little floor. A genuinely comfortable 70B at real context lives in the 48–64GB band. That’s why the practical hardware answers are a single 48GB workstation card, two 24GB GPUs (2× RTX 3090 is the classic enthusiast build), or an Apple Silicon machine with 64GB+ of unified memory where the GPU can address most of system RAM.

Drop to a smaller, sharper model and the problem evaporates: a strong 34B at Q4 fits in ~24GB, and modern 30B-class models often beat last year’s 70B. Bigger isn’t automatically better — see the best uncensored local models for where the quality actually lives.

Partial offload and the speed cliff

Can’t fit the whole model in VRAM? Runtimes like Ollama and llama.cpp will offload the overflow layers to system RAM and run them on the CPU. It works. It’s also where dreams of a fast 70B on a 12GB card go to die.

The reason is bandwidth. A modern GPU moves data at ~1000 GB/s; dual-channel DDR system RAM is closer to ~50–100 GB/s — an order of magnitude slower. Token generation is bandwidth-bound, so your speed is gated by the slowest tier any layer lives on. The cliff is brutal and non-linear:

FitTypical feel
100% in VRAMFast — tens of tokens/sec
~80% in VRAM, 20% CPUNoticeably slower
~50/50 splitOften a few tokens/sec — painful for chat
Mostly CPUSlideshow — fine for batch, not conversation

For what counts as a usable speed, see tokens per second, explained. The practical rule: offload a little and you’ll tolerate it; offload a lot and you’ll hate it. A 70B half-on-CPU is a tech demo, not a daily driver. Better to run a model that fits than a bigger one that crawls.

Quant choices to shrink the footprint

Quantization is your primary VRAM lever, and the quality cost is smaller than people fear — down to a point:

  • Q8_0 — near-lossless, but ~2× the size of Q4. Only worth it if you have VRAM to burn.
  • Q5_K_M — excellent quality, modestly bigger than Q4. A great pick when you have headroom.
  • Q4_K_M — the default sweet spot. Most users can’t reliably tell it from FP16 in chat, at roughly a quarter of the size.
  • Q3_K_M and below — real, visible degradation: weaker reasoning, more repetition. Use only to squeeze a model that otherwise won’t fit.

The strategic move is to right-size the model to your card, then pick the highest quant that fits with your context budget. A Q5 34B on a 24GB card will usually feel smarter and faster than a Q3 70B clinging on by its fingernails. The GGUF quantization cheat sheet and the budget AI PC build guide walk through matching quant, model, and hardware together.

Quick lookup table

Bookmark this. “What fits?” by card, at Q4_K_M with a sane context window:

Your VRAMComfortable modelNotes
8 GB7B–8B Q4Entry tier; snappy, capable
12 GB8B–13B Q4Room for longer context
16 GB13B Q4–Q5Sweet spot for most users
24 GB34B Q4 / fast 13BThe enthusiast standard
48 GB70B Q4The real 70B entry point
2× 24 GB70B Q4 + contextClassic dual-3090 build
CPU / no GPU7B small-quant, slowPossible, not pleasant — see run without a GPU

The honest bottom line: a real 70B is a 48GB+ commitment, and a comfortable one is dual-GPU or a big unified-memory Mac. That’s hundreds to a couple thousand dollars of hardware before you generate a single token — and a second 3090 plus the PSU and case to feed it isn’t a small line item.

If you want the output of a large model without buying the silicon to run it, the calculus shifts. Freya is our hosted route — a cloud AI companion that runs the heavy model on someone else’s GPUs, so there’s no VRAM math, no quant tuning, and nothing to download; you just open it and talk. When the hardware bill for a 70B is the thing standing between you and the experience, letting the cloud carry the VRAM is the path of least resistance.