If you’ve ever stared at a Hugging Face model page with twenty different .gguf files — Q2_K, Q4_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS — and had no idea which one to download, this page is the answer. Quantization is just compression for model weights: it shrinks a model so it fits in your VRAM, trading a little quality for a lot of size. The whole game is picking the largest quant that still fits entirely on your GPU. Get that right and you’ll get near-original quality at full speed. Get it wrong and you either waste quality you didn’t need to lose, or you spill into system RAM and watch your tokens-per-second collapse.
Here’s the 30-second version, then the details if you want them.
The 30-Second Rule: Largest Quant That Fully Offloads Wins
There is one rule that resolves 90% of quant decisions:
Pick the biggest quant whose file size leaves ~1–2 GB of VRAM headroom for context. If it all fits on the GPU, you’re done.
Why this works: an LLM running on a GPU is fast because every weight lives in VRAM. The moment a model is too big and some layers get offloaded to CPU/system RAM, generation speed falls off a cliff — often from 40+ tokens/sec down to single digits. A smaller quant that runs fully on the GPU will beat a larger quant that spills, both on speed and usually on real-world usefulness.
So the decision isn’t “what’s the highest-quality quant?” It’s “what’s the highest-quality quant that fully offloads?” For most people on a single consumer GPU, the answer lands on Q4_K_M or Q5_K_M, and you can stop reading here if you just need a default. The rest of this page is for sizing it exactly.
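If you'd rather let the machine apply the rule, here is a minimal sketch of it as a shell function. The bits-per-weight values are the approximate figures from this page; the `pick_quant` name, the ×1.1 overhead factor, and the 1.5 GB default headroom are assumptions for illustration, so treat the output as an estimate, not a guarantee.

```shell
# pick_quant <params_in_billions> <vram_gb> [headroom_gb]
# Prints the largest quant (and its estimated size in GB) that fits in
# VRAM minus headroom. All numbers are rough estimates, not measurements.
pick_quant() {
  awk -v p="$1" -v vram="$2" -v head="${3:-1.5}" 'BEGIN {
    # Quants ordered largest to smallest, with approximate bits per weight.
    n = split("Q8_0 Q6_K Q5_K_M Q4_K_M IQ4_XS Q3_K_M Q2_K", name, " ")
    split("8.5 6.6 5.5 4.5 4.3 3.4 2.6", bpw, " ")
    budget = vram - head
    for (i = 1; i <= n; i++) {
      gb = p * bpw[i] / 8 * 1.1          # weights plus ~10% overhead (assumed)
      if (gb <= budget) { printf "%s %.1f\n", name[i], gb; exit }
    }
    print "none"                         # nothing fits: pick a smaller model
  }'
}

pick_quant 8 8      # 8B model on an 8 GB card
pick_quant 70 24    # 70B model on a 24 GB card
```

The first call lands on Q5_K_M and the second prints `none`, which is the rule telling you to drop to a smaller model instead of a smaller quant.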
If you’re still choosing hardware, our local AI hardware guide covers what VRAM actually buys you.
K_M vs K_S vs I-Quants Explained Simply
The cryptic suffixes encode two things: how many bits per weight and which strategy was used.
- The number (Q4, Q5, Q6, Q8) = the average bits per weight. Higher = bigger and more faithful to the original model.
- _K = a K-quant, the modern default. K-quants group weights into super-blocks of 256 and store separate scaling factors per sub-block, spending more bits on the layers that matter most. This is why a 4-bit K-quant beats an old “legacy” 4-bit quant of the same size.
- _S / _M / _L = Small, Medium, Large variants of the same bit level. _M keeps a few critical tensors (like attention layers) at higher precision; _S squeezes them down to save space. _M is almost always the right pick: _S saves a few hundred MB for a measurable quality hit.
- IQ (I-quants) = a newer family (IQ2_XS, IQ3_M, IQ4_XS, IQ4_NL…) that is typically built with an importance matrix (imatrix), a calibration pass that figures out which weights matter, to preserve quality at very low bitrates. An IQ4_XS is smaller than Q4_K_M yet, when calibrated well, can match it.
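If you're curious why a "4-bit" K-quant averages a bit more than 4 bits per weight, the arithmetic is short. The sketch below adds up the payload bits plus the per-sub-block and per-super-block metadata. The layout numbers in the two calls (sub-block counts, metadata widths, fp16 headers) reflect my understanding of llama.cpp's Q4_K and Q6_K block formats and are assumptions here; the `bpw_kquant` helper is hypothetical.

```shell
# bpw_kquant <weights_per_superblock> <payload_bits_per_weight> \
#            <sub_blocks> <metadata_bits_per_sub_block> <superblock_header_bits>
# Prints the effective bits per weight once scaling metadata is included.
bpw_kquant() {
  awk -v n="$1" -v q="$2" -v sb="$3" -v meta="$4" -v hdr="$5" 'BEGIN {
    total = n * q + sb * meta + hdr   # payload + sub-block metadata + header
    printf "%.4f\n", total / n
  }'
}

# Assumed Q4_K layout: 256 weights at 4 bits, 8 sub-blocks each with a
# 6-bit scale and 6-bit minimum (12 bits), plus two fp16 values (32 bits).
bpw_kquant 256 4 8 12 32    # prints 4.5000

# Assumed Q6_K layout: 256 weights at 6 bits, 16 sub-blocks each with an
# 8-bit scale, plus one fp16 value (16 bits).
bpw_kquant 256 6 16 8 16    # prints 6.5625
```

Mixed variants such as _M average slightly differently because some tensors are stored in a higher-bit type.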
The practical rule of thumb:
| If you… | Use |
|---|---|
| Fit a normal K-quant comfortably | K-quants (Q4_K_M, Q5_K_M): simpler, robust |
| Are squeezing a model that barely fits | I-quants (IQ4_XS, IQ3_M) — more quality per GB |
| Run on CPU or a weak GPU | Lean toward K-quants — I-quants can be slower to decode on some hardware |
I-quants buy you the most quality per gigabyte, which matters most at the low end (2–3 bit) or when you’re trying to cram a 70B model into limited VRAM. For everyday 4–5 bit use on a GPU that fits the model, plain K-quants are the safe default.
Quality vs Size by Quant Level
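This is the table to bookmark. The “Size vs fp16” column is relative to the original 16-bit weights, which take about 2 GB per billion parameters. For a rough file size in GB, multiply the model’s parameter count in billions by the bits per weight and divide by 8 (the estimator below does this for you).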
| Quant | ~Bits/weight | Size vs fp16 | Quality | Verdict |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~16% | Noticeably degraded | Last resort to make a big model fit at all |
| Q3_K_M | ~3.4 | ~21% | Usable, some slips | Squeeze tier for large models |
| IQ3_M | ~3.5 | ~22% | Better than Q3_K at same size | Best low-bit option if imatrix is good |
| Q4_K_M | ~4.5 | ~28% | Very good — the sweet spot | Default for most people |
| IQ4_XS | ~4.3 | ~27% | ≈ Q4_K_M, slightly smaller | Great when you need a hair more room |
| Q5_K_M | ~5.5 | ~34% | Near-imperceptible loss | Upgrade if it still fits |
| Q6_K | ~6.6 | ~41% | “Almost lossless” | Diminishing returns begin here |
| Q8_0 | ~8.5 | ~53% | Essentially lossless (~0.01 perplexity vs fp16) | Only when VRAM is plentiful |
| fp16/bf16 | 16 | 100% | The original | Rarely worth it for inference |
The honest takeaway: the quality difference between Q4_K_M and Q8_0 is small for chat, roleplay, and most writing. The difference between Q4_K_M and Q2_K is large. Spend your VRAM getting to Q4_K_M first; spend what’s left climbing toward Q5/Q6.
Copy-Paste Size Estimator
You don’t need to memorize file sizes. Multiply billions of parameters × bits-per-weight ÷ 8 to get GB, then add ~10–20% overhead, plus room for context (KV cache).
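Save this as quantsize.sh (it needs bash 4 or newer for the associative array):
#!/usr/bin/env bash
# usage: ./quantsize.sh <params_in_billions>
# prints approximate file size for common quants
params=${1:-8}   # default to an 8B model when no argument is given
# approximate bits per weight for each quant
declare -A bpw=( [Q3_K_M]=3.4 [Q4_K_M]=4.5 [IQ4_XS]=4.3 [Q5_K_M]=5.5 [Q6_K]=6.6 [Q8_0]=8.5 )
echo "Model: ${params}B parameters"
printf "%-9s %8s %12s\n" "QUANT" "FILE_GB" "FITS_8GB?"
for q in Q3_K_M Q4_K_M IQ4_XS Q5_K_M Q6_K Q8_0; do
  # file size in GB = params (billions) * bits per weight / 8, plus ~10% overhead
  gb=$(awk -v p="$params" -v b="${bpw[$q]}" 'BEGIN{printf "%.1f", p*b/8*1.1}')
  # an 8 GB card minus ~1.5 GB of headroom leaves about 6.5 GB for weights
  fit=$(awk -v g="$gb" 'BEGIN{print (g<6.5)?"yes":"tight/no"}')
  printf "%-9s %8s %12s\n" "$q" "$gb" "$fit"
done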
Run bash quantsize.sh 8 for an 8B model. The ×1.1 is overhead; you still need 1–2 GB on top for context. A quick mental shortcut: before overhead, a 4-bit (Q4_K_M) model is a little over half its parameter count in GB, so an 8B is ~4.5 GB, a 13B is ~7.5 GB, and a 70B is ~40 GB.
Which Quant for Which VRAM
Match the model+quant to your card so it fully offloads. These assume ~1–2 GB reserved for context and your desktop.
| VRAM | Comfortable pick | What it runs well |
|---|---|---|
| 6 GB | 7–8B @ Q4_K_M or IQ4_XS | Solid 8B chat; tight on context |
| 8 GB | 8B @ Q4_K_M / Q5_K_M | The mainstream sweet spot — see best local LLM for 8GB VRAM |
| 12 GB | 8B @ Q6_K, or 13–14B @ Q4_K_M | Bigger models or higher quant — your call |
| 16 GB | 14B @ Q5_K_M, or 22B @ Q4_K_M | Comfortable headroom |
| 24 GB | 32–34B @ Q4_K_M, or 14B @ Q8_0 | Genuinely capable local models |
| 48 GB+ | 70B @ IQ4_XS/Q4_K_M | Frontier-class local — see VRAM for a 70B model |
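To sanity-check a row for your own card, you can run the size formula in reverse: start from the VRAM budget and solve for the largest parameter count that fits. This is a rough sketch; the `max_params` helper, the ×1.1 overhead, and the 1.5 GB default headroom are assumptions, and the result is a ceiling, not a comfortable pick.

```shell
# max_params <vram_gb> <bits_per_weight> [headroom_gb]
# Prints the largest model size (billions of parameters, rounded down)
# whose weights fit in VRAM minus headroom, assuming ~10% file overhead.
max_params() {
  awk -v vram="$1" -v b="$2" -v head="${3:-1.5}" 'BEGIN {
    budget = vram - head
    if (budget <= 0) { print 0; exit }
    # Inverse of: size_gb = params * bits / 8 * 1.1
    print int(budget * 8 / (b * 1.1))
  }'
}

# Ceiling at Q4_K_M (~4.5 bits per weight) for common card sizes.
for vram in 8 12 16 24 48; do
  printf "%2d GB -> up to ~%sB at Q4_K_M\n" "$vram" "$(max_params "$vram" 4.5)"
done
```

The ceilings come out a little above the table's picks (about 36B for a 24 GB card, for example), and that gap is the extra room long context needs.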
Once you’ve picked a model, pulling it is one line with Ollama, which auto-selects a sensible quant:
ollama run llama3.1:8b # default tag, usually Q4_K_M
ollama run qwen2.5:14b-instruct-q5_K_M # pin an explicit quant
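Since the default tag only "usually" means Q4_K_M, it can be worth confirming what you actually pulled. `ollama show <model>` prints model details, and the small filter below pulls out the quantization field. The `quant_of` helper is hypothetical, and the sample text in the demo is an assumed excerpt of that output, so adjust the match if your Ollama version formats it differently.

```shell
# quant_of: read `ollama show <model>` output on stdin and print the value
# of the first line whose first field is "quantization".
quant_of() {
  awk 'tolower($1) == "quantization" { print $2; exit }'
}

# Real use (needs Ollama installed and the model already pulled):
#   ollama show llama3.1:8b | quant_of

# Demo with a hypothetical excerpt of the output, so it runs anywhere:
printf '  Model\n    architecture    llama\n    parameters      8.0B\n    quantization    Q4_K_M\n' | quant_of
```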
If Ollama isn’t set up yet, start with how to run AI locally.
When Q8 or fp16 Is Worth It
For chat and roleplay, almost never: the gap over Q5/Q6 is too small to feel, and you’re paying roughly 30–55% more VRAM than Q5/Q6 (nearly double Q4_K_M). Reach for Q8_0 only when:
- You have VRAM to burn and the model still fits with full context. Free quality is free quality.
- The task is precision-sensitive: code generation, math, structured/JSON output, function-calling, or long multi-step reasoning where small errors compound.
- You’re building a dataset or doing fine-tuning prep, where you want the cleanest possible base behavior.
fp16/bf16 is almost never worth it for inference — Q8_0 is within a rounding error of it at half the size. fp16 matters when you’re training or fine-tuning, not when you’re chatting. If you find yourself reaching for Q8 to “fix” a model that feels dumb, the problem is usually the model choice or your sampling settings, not the quant.
Common Quant Mistakes
- Picking a quant that spills off the GPU. The #1 error. A Q6_K that offloads two layers to CPU is slower and feels worse than a Q4_K_M that runs fully on the card. Fit first.
- Forgetting the KV cache. Long context eats VRAM separately from weights. A model that fits at 2K context can OOM at 32K. Always leave headroom.
- Choosing _S to save a few hundred MB. The quality hit isn’t worth it. Use _M.
- Going below Q4 when you didn’t have to. Q2/Q3 are squeeze tiers for when a model otherwise won’t fit at all, not a default. Prefer a smaller model at Q4 over a bigger model at Q2.
- Using a non-imatrix I-quant. I-quants depend on a good importance matrix. An IQ4_XS from a careless uploader can underperform; from a trusted one it shines (next section).
- Assuming bigger quant = smarter. Above Q5/Q6 you’re paying real VRAM for changes you can’t perceive in normal use.
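To put a number on the KV cache point, here is a rough sketch of the usual estimate: two tensors (keys and values) per layer, each sized by KV heads × head dimension × context length × bytes per element. The `kv_cache_gib` helper is hypothetical, and the shape used in the example calls (32 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumed 8B-class configuration, so check your model's actual config before trusting the result.

```shell
# kv_cache_gib <layers> <kv_heads> <head_dim> <context_tokens> [bytes_per_element]
# Prints the estimated KV cache size in GiB for an unquantized cache.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v ctx="$4" -v b="${5:-2}" 'BEGIN {
    # 2 = one key tensor plus one value tensor per layer
    bytes = 2 * l * h * d * ctx * b
    printf "%.2f\n", bytes / (1024 * 1024 * 1024)
  }'
}

kv_cache_gib 32 8 128 2048     # prints 0.25
kv_cache_gib 32 8 128 32768    # prints 4.00
```

Under these assumptions the cache grows from a quarter of a GiB at 2K context to 4 GiB at 32K, which is more than enough to push a snug fit off the card.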
Where to Find Trustworthy Quants
Quantization quality varies by who made the file, especially for I-quants that need a calibration pass. Two uploaders are the community standard on Hugging Face:
- bartowski: fast, well-labeled repos with the full ladder of K-quants and imatrix I-quants for nearly every major model. The README usually includes a quality/size chart so you can pick at a glance.
- mradermacher: enormous coverage, including static quants (-GGUF) and imatrix quants (-i1-GGUF). If a model exists, mradermacher has probably quantized it.
For I-quants specifically, look for -i1- in the repo name, or “imatrix” in the repo name or README; that marks the calibrated version. Before downloading anything, sanity-check the source: a quant is just a binary blob, so provenance matters. Our guide on downloading GGUF models safely from Hugging Face covers what to verify.
Once you’ve internalized the 30-second rule, picking a quant really is a 30-second decision: estimate the size, match it to your VRAM, grab the _M K-quant (or an imatrix I-quant when you’re tight), and run. The harder question is usually the one before the quant — which model, on which hardware, for what. If you’d rather skip the model-shopping and VRAM math entirely and just talk to a private AI that’s already tuned to run on your machine, Ember ships a curated local companion you own outright — and if you don’t have the GPU for any of this, Freya runs the same kind of experience in the cloud with zero setup.
