If you’ve ever stared at a Hugging Face model page with twenty different .gguf files — Q2_K, Q4_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS — and had no idea which one to download, this page is the answer. Quantization is just compression for model weights: it shrinks a model so it fits in your VRAM, trading a little quality for a lot of size. The whole game is picking the largest quant that still fits entirely on your GPU. Get that right and you’ll get near-original quality at full speed. Get it wrong and you either waste quality you didn’t need to lose, or you spill into system RAM and watch your tokens-per-second collapse.

Here’s the 30-second version, then the details if you want them.

The 30-Second Rule: Largest Quant That Fully Offloads Wins

There is one rule that resolves 90% of quant decisions:

Pick the biggest quant whose file size leaves ~1–2 GB of VRAM headroom for context. If it all fits on the GPU, you’re done.

Why this works: an LLM running on a GPU is fast because every weight lives in VRAM. The moment a model is too big and some layers get offloaded to CPU/system RAM, generation speed falls off a cliff — often from 40+ tokens/sec down to single digits. A smaller quant that runs fully on the GPU will beat a larger quant that spills, both on speed and usually on real-world usefulness.

So the decision isn’t “what’s the highest-quality quant?” It’s “what’s the highest-quality quant that fully offloads?” For most people on a single consumer GPU, the answer lands on Q4_K_M or Q5_K_M, and you can stop reading here if you just need a default. The rest of this page is for sizing it exactly.
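The rule is mechanical enough to script. Here is a minimal sketch, assuming this page's rough bits-per-weight figures, a 10% file overhead, and 1.5 GB of headroom by default; the function name largest_fitting_quant is made up for illustration and is not part of any tool.

```bash
#!/usr/bin/env bash
# Sketch: walk the quant ladder from largest to smallest and print the first
# one whose estimated file size fits in VRAM minus headroom.
# Bits-per-weight values and the 1.1 overhead factor are rough assumptions.
# usage: largest_fitting_quant <params_in_billions> <vram_gb> [headroom_gb]
largest_fitting_quant() {
  local params=$1 vram=$2 headroom=${3:-1.5}
  local entry name bpw
  # Each entry is name:bits_per_weight, ordered largest first.
  for entry in Q8_0:8.5 Q6_K:6.6 Q5_K_M:5.5 Q4_K_M:4.5 IQ4_XS:4.3 Q3_K_M:3.4 Q2_K:2.6; do
    name=${entry%%:*}
    bpw=${entry##*:}
    # awk exits 0 (success) when the estimated size fits the budget.
    if awk -v p="$params" -v b="$bpw" -v v="$vram" -v h="$headroom" \
         'BEGIN { exit !(p * b / 8 * 1.1 <= v - h) }'; then
      echo "$name"
      return 0
    fi
  done
  echo "none"   # nothing fits: pick a smaller model instead
  return 1
}

largest_fitting_quant 8 8     # 8B model on an 8 GB card: prints Q5_K_M
largest_fitting_quant 70 48   # 70B model on a 48 GB card: prints Q4_K_M
```

Treat the answer as a starting point: long contexts need more headroom than the 1.5 GB default, so pass a larger third argument if you plan to run big context windows.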

If you’re still choosing hardware, our local AI hardware guide covers what VRAM actually buys you.

K_M vs K_S vs I-Quants Explained Simply

The cryptic suffixes encode two things: how many bits per weight and which strategy was used.

  • The number (Q4, Q5, Q6, Q8) = the nominal bits per weight; the effective average runs a little higher (about 4.5 for Q4_K_M). Higher = bigger and more faithful to the original model.
  • _K = a K-quant, the modern default. K-quants group weights into super-blocks of 256 and store separate scaling factors per sub-block, spending more bits on the layers that matter most. This is why a 4-bit K-quant beats an old “legacy” 4-bit quant of the same size.
  • _S / _M / _L = Small, Medium, Large variants of the same bit level. _M keeps a few critical tensors (like attention layers) at higher precision; _S squeezes them down to save space. _M is almost always the right pick; _S saves a few hundred MB for a measurable quality hit.
  • IQ (I-quants) = a newer family (IQ2_XS, IQ3_M, IQ4_XS, IQ4_NL…) that is normally built with an importance matrix (imatrix), a calibration pass that figures out which weights matter, to preserve quality at very low bitrates. An IQ4_XS is smaller than Q4_K_M yet, when calibrated well, can match it.
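To make the naming scheme concrete, here is a small sketch that decodes a quant name into the pieces listed above. The function name describe_quant and the "non-K quant" label for names like Q8_0 are illustrative choices, and the patterns only cover the common names mentioned on this page.

```bash
#!/usr/bin/env bash
# Sketch: split a quant name into nominal bits, family, and S/M/L variant.
# usage: describe_quant <quant_name>
describe_quant() {
  local q=$1 family rest bits variant=""
  case $q in
    IQ[1-8]_*)   family="I-quant";     rest=${q#IQ} ;;
    Q[2-8]_K*)   family="K-quant";     rest=${q#Q}  ;;
    Q[2-8]_[01]) family="non-K quant"; rest=${q#Q}  ;;
    *) echo "$q: unrecognized quant name"; return 1 ;;
  esac
  bits=${rest%%_*}   # leading digit = nominal bits per weight
  case $q in
    *_S) variant=", small variant" ;;
    *_M) variant=", medium variant" ;;
    *_L) variant=", large variant" ;;
  esac
  echo "$q: ${bits}-bit ${family}${variant}"
}

describe_quant Q4_K_M   # Q4_K_M: 4-bit K-quant, medium variant
describe_quant IQ4_XS   # IQ4_XS: 4-bit I-quant
describe_quant Q8_0     # Q8_0: 8-bit non-K quant
```

Suffixes such as _XS or _NL are deliberately left undecoded here; the sketch only labels the S/M/L variants described in the list.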

The practical rule of thumb:

| If you… | Use |
| --- | --- |
| Fit a normal K-quant comfortably | K-quants (Q4_K_M, Q5_K_M): simpler, robust |
| Are squeezing a model that barely fits | I-quants (IQ4_XS, IQ3_M): more quality per GB |
| Run on CPU or a weak GPU | Lean toward K-quants; I-quants can be slower to decode on some hardware |

I-quants buy you the most quality per gigabyte, which matters most at the low end (2–3 bit) or when you’re trying to cram a 70B model into limited VRAM. For everyday 4–5 bit use on a GPU that fits the model, plain K-quants are the safe default.

Quality vs Size by Quant Level

This is the table to bookmark. To turn a row into a rough file size, multiply the model’s parameter count in billions by the bits/weight figure and divide by 8 to get GB (the estimator below adds overhead and does the arithmetic for you).

| Quant | ~Bits/weight | Size vs fp16 | Quality | Verdict |
| --- | --- | --- | --- | --- |
| Q2_K | ~2.6 | ~16% | Noticeably degraded | Last resort to make a big model fit at all |
| Q3_K_M | ~3.4 | ~21% | Usable, some slips | Squeeze tier for large models |
| IQ3_M | ~3.5 | ~22% | Better than Q3_K at same size | Best low-bit option if imatrix is good |
| Q4_K_M | ~4.5 | ~28% | Very good (the sweet spot) | Default for most people |
| IQ4_XS | ~4.3 | ~27% | ≈ Q4_K_M, slightly smaller | Great when you need a hair more room |
| Q5_K_M | ~5.5 | ~34% | Near-imperceptible loss | Upgrade if it still fits |
| Q6_K | ~6.6 | ~41% | “Almost lossless” | Diminishing returns begin here |
| Q8_0 | ~8.5 | ~53% | Essentially lossless (perplexity within ~0.01 of fp16) | Only when VRAM is plentiful |
| fp16/bf16 | 16 | 100% | The original | Rarely worth it for inference |

The honest takeaway: the quality difference between Q4_K_M and Q8_0 is small for chat, roleplay, and most writing. The difference between Q4_K_M and Q2_K is large. Spend your VRAM getting to Q4_K_M first; spend what’s left climbing toward Q5/Q6.

Copy-Paste Size Estimator

You don’t need to memorize file sizes. Multiply billions of parameters × bits-per-weight ÷ 8 to get GB, then add ~10–20% overhead, plus room for context (KV cache).

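Save this as quantsize.sh (it needs bash 4 or newer for the associative array; the bash that ships with macOS is older):

#!/usr/bin/env bash
# usage: ./quantsize.sh <params_in_billions>
# prints approximate file size for common quants
params=${1:-8}   # model size in billions of parameters (default: 8)
# approximate bits per weight for each quant
declare -A bpw=( [Q3_K_M]=3.4 [Q4_K_M]=4.5 [IQ4_XS]=4.3 [Q5_K_M]=5.5 [Q6_K]=6.6 [Q8_0]=8.5 )
echo "Model: ${params}B parameters"
printf "%-9s %8s %12s\n" "QUANT" "FILE_GB" "FITS_8GB?"
for q in Q3_K_M Q4_K_M IQ4_XS Q5_K_M Q6_K Q8_0; do
  # file size in GB = params * bits / 8, plus 10% overhead
  gb=$(awk -v p="$params" -v b="${bpw[$q]}" 'BEGIN{printf "%.1f", p*b/8*1.1}')
  # an 8 GB card minus ~1.5 GB of headroom leaves 6.5 GB for weights
  fit=$(awk -v g="$gb" 'BEGIN{print (g<6.5)?"yes":"tight/no"}')
  printf "%-9s %8s %12s\n" "$q" "$gb" "$fit"
done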

Run bash quantsize.sh 8 for an 8B model. The ×1.1 is overhead; you still need 1–2 GB on top for context. A quick mental shortcut: before overhead, a 4-bit (Q4_K_M) model is a little over half its parameter count in GB, so an 8B is ~4.5 GB, a 13B is ~7.5 GB, a 70B is ~40 GB.

Which Quant for Which VRAM

Match the model+quant to your card so it fully offloads. These assume ~1–2 GB reserved for context and your desktop.

| VRAM | Comfortable pick | What it runs well |
| --- | --- | --- |
| 6 GB | 7–8B @ Q4_K_M or IQ4_XS | Solid 8B chat; tight on context |
| 8 GB | 8B @ Q4_K_M / Q5_K_M | The mainstream sweet spot (see best local LLM for 8GB VRAM) |
| 12 GB | 8B @ Q6_K, or 13–14B @ Q4_K_M | Bigger models or higher quant, your call |
| 16 GB | 14B @ Q5_K_M, or 22B @ Q4_K_M | Comfortable headroom |
| 24 GB | 32–34B @ Q4_K_M, or 14B @ Q8_0 | Genuinely capable local models |
| 48 GB+ | 70B @ IQ4_XS/Q4_K_M | Frontier-class local (see VRAM for a 70B model) |
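You can also run the size formula backwards to see roughly how large a model a card can hold at a given quant. This is a rough sketch using the same assumptions as the rest of this page (10% file overhead, 1.5 GB headroom by default); max_params_b is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: largest parameter count (in billions) that fits a card at a given
# bits-per-weight, i.e. the size formula solved for params:
#   params = (vram - headroom) * 8 / (bits_per_weight * 1.1)
# usage: max_params_b <vram_gb> <bits_per_weight> [headroom_gb]
max_params_b() {
  awk -v v="$1" -v b="$2" -v h="${3:-1.5}" \
    'BEGIN { printf "%.1f\n", (v - h) * 8 / (b * 1.1) }'
}

# Example: what a 24 GB card can hold at three common quants.
for entry in Q4_K_M:4.5 Q5_K_M:5.5 Q8_0:8.5; do
  printf "%-7s fits up to ~%sB on 24 GB\n" \
    "${entry%%:*}" "$(max_params_b 24 "${entry##*:}")"
done
```

The results land a little above the table's "comfortable" picks because the table leaves extra slack for context and the desktop.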

Once you’ve picked a model, pulling it is one line with Ollama, whose default tags usually point at a sensible quant:

ollama run llama3.1:8b          # default tag, usually Q4_K_M
ollama run qwen2.5:14b-instruct-q5_K_M   # pin an explicit quant

If Ollama isn’t set up yet, start with how to run AI locally.

When Q8 or fp16 Is Worth It

For chat and roleplay, almost never: the gap over Q5/Q6 is too small to feel, and by the bits-per-weight figures above you’re paying roughly 30–55% more VRAM than Q6_K or Q5_K_M (and nearly double Q4_K_M). Reach for Q8_0 only when:

  • You have VRAM to burn and the model still fits with full context. Free quality is free quality.
  • The task is precision-sensitive: code generation, math, structured/JSON output, function-calling, or long multi-step reasoning where small errors compound.
  • You’re building a dataset or doing fine-tuning prep, where you want the cleanest possible base behavior.

fp16/bf16 is almost never worth it for inference — Q8_0 is within a rounding error of it at half the size. fp16 matters when you’re training or fine-tuning, not when you’re chatting. If you find yourself reaching for Q8 to “fix” a model that feels dumb, the problem is usually the model choice or your sampling settings, not the quant.

Common Quant Mistakes

  • Picking a quant that spills off the GPU. The #1 error. A Q6_K that offloads two layers to CPU is slower and feels worse than a Q4_K_M that runs fully on the card. Fit first.
  • Forgetting the KV cache. Long context eats VRAM separately from weights. A model that fits at 2K context can OOM at 32K. Always leave headroom.
  • Choosing _S to save a few hundred MB. The quality hit isn’t worth it. Use _M.
  • Going below Q4 when you didn’t have to. Q2/Q3 are squeeze tiers for when a model otherwise won’t fit at all — not a default. Prefer a smaller model at Q4 over a bigger model at Q2.
  • Using a non-imatrix I-quant. I-quants depend on a good importance matrix. An IQ4_XS from a careless uploader can underperform; from a trusted one it shines (next section).
  • Assuming bigger quant = smarter. Above Q5/Q6 you’re paying real VRAM for changes you can’t perceive in normal use.
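The KV cache mistake is easy to quantify. The sketch below uses the standard estimate for an unquantized cache: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The 32 / 8 / 128 figures are assumed here as dimensions for a Llama-3.1-8B-style model; read the real values from your model's config, and note that kv_cache_gb is a made-up helper name.

```bash
#!/usr/bin/env bash
# Sketch: estimate KV cache size in GB for a given context length.
# usage: kv_cache_gb <layers> <kv_heads> <head_dim> <context_tokens> [bytes_per_element]
# bytes_per_element defaults to 2 (an fp16 cache); pass 1 for an 8-bit cache.
kv_cache_gb() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="${5:-2}" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * c * b / 1000000000 }'
}

# Assumed dimensions: 32 layers, 8 KV heads, head dimension 128.
kv_cache_gb 32 8 128 2048    # 2K context:  prints 0.27
kv_cache_gb 32 8 128 32768   # 32K context: prints 4.29
```

Under those assumptions the same model needs about 4 GB more VRAM at 32K context than at 2K, which is why a quant that fits comfortably at short context can run out of memory at long context.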

Where to Find Trustworthy Quants

Quantization quality varies by who made the file, especially for I-quants that need a calibration pass. Two uploaders are the community standard on Hugging Face:

  • bartowski — fast, well-labeled repos with the full ladder of K-quants and imatrix I-quants for nearly every major model. The README usually includes a quality/size chart so you can pick at a glance.
  • mradermacher — enormous coverage, including static quants (-GGUF) and imatrix quants (-i1-GGUF). If a model exists, mradermacher has probably quantized it.

For I-quants specifically, look for i1 in the repo name (as in mradermacher’s -i1-GGUF repos) or “imatrix” in the repo name or README; that’s the calibrated version. Before downloading anything, sanity-check the source: a quant is just a binary blob, so provenance matters. Our guide on downloading GGUF models safely from Hugging Face covers what to verify.
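Two cheap local checks are worth running on any downloaded file, sketched below. GGUF files begin with the four ASCII bytes GGUF, and a SHA-256 hash can be compared against the one published on the file's Hugging Face page. The helper names is_gguf and sha256_matches are hypothetical, and neither check says anything about the quality of the quant itself.

```bash
#!/usr/bin/env bash
# Sketch: basic integrity checks for a downloaded .gguf file.

# is_gguf <file>: succeeds if the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\0')" = "GGUF" ]
}

# sha256_matches <file> <expected_hex>: succeeds if the hash matches.
# Uses sha256sum where available, otherwise shasum -a 256 (macOS).
sha256_matches() {
  local actual
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$1" | awk '{ print $1 }')
  else
    actual=$(shasum -a 256 "$1" | awk '{ print $1 }')
  fi
  [ "$actual" = "$2" ]
}
```

Typical use would be something like is_gguf model.gguf && sha256_matches model.gguf <hash-from-the-model-page> && echo "ok", where model.gguf and the hash are placeholders for your own file and the published value.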


Once you’ve internalized the 30-second rule, picking a quant really is a 30-second decision: estimate the size, match it to your VRAM, grab the _M K-quant (or an imatrix I-quant when you’re tight), and run. The harder question is usually the one before the quant — which model, on which hardware, for what. If you’d rather skip the model-shopping and VRAM math entirely and just talk to a private AI that’s already tuned to run on your machine, Ember ships a curated local companion you own outright — and if you don’t have the GPU for any of this, Freya runs the same kind of experience in the cloud with zero setup.