If you’ve spent any time reading about running AI on your own computer, you’ve hit a wall of jargon: 7B, Q4_K_M, GGUF, VRAM, tokens per second, context window, abliterated. None of it is hard once someone explains it in plain English — the problem is that almost nobody does. This glossary fixes that. Each term gets a one-paragraph definition, why it matters in practice, and real numbers you can act on. Read it top to bottom and the rest of the local-AI world stops looking like a code dump and starts looking like a set of dials you understand.
If you’re brand new to the whole idea of running a model on your own hardware, start with local AI for beginners and circle back here for the vocabulary.
Model & parameters (the “B” in 7B, 70B)
A model is the actual AI — a big file full of numbers (called weights) that, together, encode everything the model “knows” about language. When you see a name like Llama 3.1 8B or Qwen3 32B, the B stands for billion parameters. A parameter is a single tunable number inside the model. So a 7B model has roughly 7 billion of them; a 70B model has 70 billion.
More parameters generally means a smarter, more capable, more coherent model — but also a bigger file and more hardware needed to run it. The rough size-to-capability ladder looks like this:
| Size | Rough character | Typical use |
|---|---|---|
| 1B–3B | Fast, simple, makes mistakes | Phones, tiny PCs, quick tasks |
| 7B–9B | The sweet spot for most home setups | Chat, writing, roleplay |
| 12B–32B | Noticeably smarter and more consistent | Serious daily use |
| 70B+ | Near-frontier quality | Power users with big GPUs |
A 7B model is the most common starting point because it’s genuinely useful and runs on modest hardware. Bigger is not automatically better for you — the best model is the largest one that fits comfortably on your machine and runs fast enough to be pleasant. That tradeoff is the whole game.
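There’s also a clever middle path called a Mixture-of-Experts (MoE) model, where only a fraction of the parameters are active for each token. That lets a model with a large total parameter count run at roughly the speed of a much smaller one. The catch is memory: all of the weights still have to be stored, so the file size and memory footprint follow the total parameter count, not the active one. If that intrigues you, see MoE models on low VRAM, explained.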
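To make the MoE tradeoff concrete, here is a rough sketch. The Mixtral 8x7B figures (about 46.7B total parameters, about 12.9B active per token) are approximate published numbers brought in for illustration and are not from this glossary. The function name is hypothetical.

```python
def moe_summary(total_params_b: float, active_params_b: float) -> dict:
    """Summarize an MoE model: memory follows total params, compute follows active ones."""
    return {
        "memory_scales_with_b": total_params_b,
        "compute_scales_with_b": active_params_b,
        "active_fraction": active_params_b / total_params_b,
    }

# Approximate figures for Mixtral 8x7B, used only as an example.
mixtral = moe_summary(total_params_b=46.7, active_params_b=12.9)
print(f"{mixtral['active_fraction']:.0%} of the weights are used per token")
```

Read it this way: you need memory for the larger number, but speed behaves more like the smaller one.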
Quantization & GGUF: shrinking models to fit
Here’s the catch with parameters: in their raw form, each one is typically stored at 16-bit precision, which is 2 bytes per parameter. That makes a 7B model around 14 GB and a 70B model around 140 GB. Most people can’t fit that. Quantization is the trick that makes local AI practical.
Quantization means storing each weight at lower precision — say 4 bits instead of 16 — which shrinks the file dramatically with surprisingly little quality loss. Think of it like saving a photo as a high-quality JPEG instead of a giant RAW file: a fraction of the size, and your eye barely notices.
You’ll see quantization written as tags like Q4_K_M, Q5_K_M, Q6_K, or Q8_0. Read them like this:
- The number is roughly the bits per weight — Q4 = ~4-bit, Q8 = ~8-bit. Lower number = smaller file, slightly lower quality.
- The K means a modern “k-quant” method that’s smarter about which weights get more precision.
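- The S / M / L suffix is the variant within that level (Small, Medium, Large): larger variants keep a few more bits for the most sensitive weights, giving a slightly bigger file and slightly better quality.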
For most people, Q4_K_M is the default recommendation — it cuts a 7B model down to roughly 4–5 GB while keeping nearly all the quality. Drop below Q4 and you start to feel the model getting dumber; go to Q5 or Q6 if you have VRAM to spare and want a little more polish.
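The size arithmetic behind these numbers is simple enough to sketch: parameters times bits per weight, divided by 8 bits per byte. The bits-per-weight values below are approximate assumptions (k-quants land a little above their nominal bit count), and real files vary by model, so treat the output as a ballpark.

```python
# Approximate effective bits per weight for common quant tags.
# These values are assumptions for illustration; real files vary by model.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

def estimate_file_gb(params_billions: float, quant: str) -> float:
    """Estimate file size in GB: parameters x bits per weight / 8 bits per byte."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9

for quant in ("F16", "Q8_0", "Q4_K_M"):
    print(f"7B at {quant}: ~{estimate_file_gb(7, quant):.1f} GB")
```

With these assumed values, a 7B model at Q4_K_M comes out a little over 4 GB, in line with the 4–5 GB range quoted above.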
GGUF is the file format these quantized models ship in. It’s a single-file container (you’ll download something like model-name.Q4_K_M.gguf) used by the most popular local runtimes. When someone says “grab the GGUF,” they mean the ready-to-run quantized file. For the full breakdown of which quant to pick, see the GGUF quantization cheat sheet, and check are GGUF models safe to download from HuggingFace before you grab files from random uploaders.
VRAM & RAM: where the model lives
This is the single most important hardware concept, so it’s worth getting right.
- VRAM is the memory on your graphics card (GPU). It’s fast — exactly the kind of memory AI models love.
- RAM is your computer’s normal system memory, attached to the CPU. It’s slower for this job.
When a model runs on your GPU, its weights get loaded into VRAM. If the whole model fits in VRAM, it runs fast. If it doesn’t fit, the runtime spills the overflow into regular RAM and runs that part on the CPU — which works, but is much slower.
So the practical rule is: your VRAM budget decides which model size and quant you can run well. A rough guide:
| VRAM | Comfortable model range (around Q4) |
|---|---|
| 8 GB | 7B–8B |
| 12 GB | up to ~12B–14B |
| 16 GB | 12B–22B |
| 24 GB | 32B-class models |
| 48 GB+ | 70B territory |
You can absolutely run local AI without a GPU using system RAM and your CPU — it’s just slower. For the deeper distinction, see RAM vs VRAM for local AI, and to size your own rig for a chat companion, how much VRAM you need for a local AI companion walks through it model by model.
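The VRAM rule can be turned into a quick back-of-the-envelope check. This is a minimal sketch, not how any runtime actually decides: the 4.85 bits per weight is an assumed approximation for a Q4-class quant, and `overhead_gb` is a placeholder guess for the context cache and runtime buffers.

```python
def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Return True if the weights plus a fixed overhead fit in the VRAM budget.

    overhead_gb is a placeholder guess for context cache and runtime buffers.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

Q4_BITS = 4.85  # assumed approximate bits per weight for a Q4-class quant

print(fits_in_vram(8, Q4_BITS, vram_gb=8))    # 8B on an 8 GB card
print(fits_in_vram(32, Q4_BITS, vram_gb=24))  # 32B on a 24 GB card
print(fits_in_vram(70, Q4_BITS, vram_gb=24))  # 70B on a 24 GB card
```

If the check fails, the options are the ones described above: a smaller model, a lower quant, or accepting that part of the model runs from system RAM.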
Tokens & tokens/second: how AI reads and how fast it replies
AI models don’t read words. They read tokens. A token is a chunk of text, usually a word or a piece of one. As a rough rule of thumb for English text, 1 token ≈ 0.75 words, so 1,000 tokens is around 750 words; other languages and code often use more tokens for the same amount of text. The model reads your prompt as tokens and generates its reply one token at a time.
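The rule of thumb can be sketched as a tiny estimator. It only counts whitespace-separated words and applies the 0.75 ratio, so it is a guess, not a real tokenizer; the function name is hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; varies by language and tokenizer

def estimate_tokens(text: str) -> int:
    """Estimate a token count from the whitespace-separated word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)

prompt = "Explain what a context window is in two sentences."
print(estimate_tokens(prompt))
```

For exact counts you would need the tokenizer that belongs to the specific model, since each model family splits text differently.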
Tokens per second (tok/s) is therefore the headline speed metric for local AI: how many tokens the model produces each second. The feel maps roughly like this:
- Below ~5 tok/s — sluggish, you watch it crawl.
- ~7–10 tok/s — fine for reading along; this is “usable.”
- 15–40+ tok/s — comfortably faster than you can read; feels snappy.
For conversation, anything at or above your reading speed feels real-time. The full breakdown of what speed is genuinely good enough is in how many tokens per second is actually usable. If your setup feels slow, the usual culprit is the model not actually running on your GPU — see Ollama not using GPU and Ollama slow, how to speed it up.
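To see why those speed bands feel the way they do, it helps to convert between words, tokens, and seconds. This sketch assumes the 0.75 words-per-token rule of thumb and a silent reading speed of 250 words per minute; the reading speed is an assumed typical figure, not something measured here.

```python
WORDS_PER_TOKEN = 0.75       # rule of thumb, not exact
READING_WORDS_PER_MIN = 250  # assumed typical silent reading speed

def seconds_to_generate(reply_words: int, tokens_per_second: float) -> float:
    """How long a reply of a given length takes at a given generation speed."""
    tokens = reply_words / WORDS_PER_TOKEN
    return tokens / tokens_per_second

def reading_speed_tok_s(words_per_min: float = READING_WORDS_PER_MIN) -> float:
    """Convert a reading speed in words per minute to tokens per second."""
    return words_per_min / WORDS_PER_TOKEN / 60

for speed in (5, 10, 20):
    print(f"300-word reply at {speed} tok/s: {seconds_to_generate(300, speed):.0f} s")
print(f"Reading speed: ~{reading_speed_tok_s():.1f} tok/s")
```

Under these assumptions a reader consumes somewhere between 5 and 6 tokens per second, which is why speeds a little above that already feel usable.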
Context window: the model’s short-term memory
The context window is how much text the model can “hold in mind” at once — your prompt, the conversation history, and its own replies all count against it. It’s measured in tokens. A model with an 8K context can keep about 8,000 tokens (~6,000 words) in working memory; 32K and 128K windows are increasingly common.
Why it matters: once a conversation exceeds the context window, the oldest messages fall out of memory — the model literally forgets the start of the chat. For a long roleplay or a document you’re discussing, a small context is the difference between “remembers your character” and “who are you again?”
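The "oldest messages fall out" behavior can be sketched as a sliding window over the chat history. This is a simplified illustration: real runtimes and chat apps differ in how they truncate, and the token count here is a rough word-based estimate, not a real tokenizer. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the 1 token ~ 0.75 words rule of thumb."""
    return round(len(text.split()) / 0.75)

def trim_to_context(messages: list[str], context_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit the window."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop once the window is full.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > context_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = ["first one two", "second one two", "third one two"]
print(trim_to_context(history, context_tokens=8))
```

With a window of 8 estimated tokens, only the two newest messages survive, which is exactly the forgetting described above.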
Two important caveats. First, bigger context costs VRAM: the runtime keeps cached data for every token in the window, so extending it eats memory and trades off against model size. Second, a model’s advertised context is a ceiling, not a guarantee: quality often degrades before the limit, and runtimes such as Ollama typically default to a much smaller window than the model supports. If your local model is forgetting things, you may need to raise the limit yourself; how to increase Ollama’s context window shows exactly how. For chats that need to remember you across sessions, that’s a different feature; see local AI with persistent memory.
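The memory cost of a longer window can be estimated with a standard formula for the attention cache: two vectors (a key and a value) per layer, per token. The model shape below (32 layers, 8 key/value heads, head size 128, 16-bit values) is an assumption chosen to resemble an 8B-class model, so treat the results as illustrative.

```python
def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate context-cache memory in GiB for a given window size."""
    # Factor of 2: one key vector and one value vector per layer, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30

# Assumed shape, roughly an 8B-class model.
for window in (8_192, 32_768, 131_072):
    print(f"{window:>7} tokens: ~{kv_cache_gib(window, 32, 8, 128):.0f} GiB")
```

Under these assumptions an 8K window costs about 1 GiB on top of the weights, while a 128K window costs about 16 GiB, which is why a huge context and a big model rarely fit on the same card.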
Runtime / inference engine: Ollama, llama.cpp, and friends
The model file is just data. You need a program to actually run it — that program is the runtime (also called an inference engine). It loads the weights, manages VRAM, and turns your text into the model’s reply.
- llama.cpp is the foundational C/C++ engine that made fast, quantized, GGUF-based local inference possible. Many other tools are built on top of it.
- Ollama is the most popular beginner-friendly runtime. It wraps llama.cpp in a simple command-line tool and a local server. On Linux, you install it with one line (macOS and Windows use a downloadable installer instead):
curl -fsSL https://ollama.com/install.sh | sh
Then pull and chat with a model in one command:
ollama run llama3.1
Ollama exposes a local API on 127.0.0.1:11434, which other apps connect to. That address is loopback, so by default only your own machine can reach it. Full install walkthrough: how to install Ollama.
- LM Studio and Jan are graphical desktop apps for people who’d rather click than type. KoboldCpp is a favorite for roleplay and creative writing.
There’s no single “best” — it depends on whether you want a GUI, a server, or roleplay tooling. The honest comparison is in Ollama vs LM Studio vs Jan.
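For a sense of what "other apps connect to the local API" looks like, here is a minimal sketch using only Python's standard library. It assumes Ollama is running on the default address and that the `llama3.1` model has been pulled. The helper names are hypothetical, and the `num_ctx` value of 8192 is just an example setting.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_request(prompt: str, model: str = "llama3.1",
                  num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def tokens_per_second(result: dict) -> float:
    """Compute generation speed; eval_duration is reported in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)

def generate(prompt: str) -> dict:
    """Send the prompt to the local server (requires Ollama to be running)."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        return json.loads(response.read())
```

With the server up, `result = generate("Say hi in one sentence.")` should return a dictionary whose `response` field holds the reply text, and `tokens_per_second(result)` gives the speed figure discussed earlier.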
Abliterated / uncensored: removing refusals, explained plainly
Most chat models ship with alignment — built-in guardrails that make them refuse certain requests, often with a canned “I can’t help with that.” That’s useful for some products and frustrating for others, especially mature fiction, security research, or simply asking a blunt medical or legal question without a lecture.
An uncensored model is one trained or tuned to skip those reflexive refusals. Abliterated is a specific, technical method of getting there: researchers identify the internal “refusal direction” in the model’s activations and surgically suppress it — ablating the refusal behavior without fully retraining the model. The result keeps the base model’s intelligence but stops it from saying no by default.
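The core math of "suppressing a direction" is a vector projection, sketched here on toy numbers. This is only the geometric idea: the vectors are made up, no model is involved, and real abliteration has to estimate the direction from a model's activations and apply the change inside the network, which this sketch does not attempt.

```python
import math

def ablate_direction(activation: list[float], direction: list[float]) -> list[float]:
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection: how much of the activation points along the direction.
    along = sum(a * u for a, u in zip(activation, unit))
    return [a - along * u for a, u in zip(activation, unit)]

activation = [2.0, 1.0, -1.0]         # made-up activation vector
refusal_direction = [1.0, 0.0, 0.0]   # made-up direction, for illustration only
cleaned = ablate_direction(activation, refusal_direction)
print(cleaned)
```

Everything perpendicular to the chosen direction is left untouched, which is the sense in which the method is surgical and avoids retraining.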
Two honest points. “Uncensored” means fewer built-in refusals, not “lawless” — you are still responsible for what you do with it, and these models are strictly for adults (18+). And abliteration can slightly dent a model’s sharpness on unrelated tasks, since you’re nudging its internals. The full plain-English mechanism is in abliterated models explained, with curated picks in the best uncensored local AI models. The deeper reason people seek these out is covered in why cloud AI censors you.
Local vs hosted vs cloud: the three ways to run AI, in one sentence each
These terms get muddled, so here they are cleanly:
- Local — the model runs entirely on your own hardware; nothing leaves your machine, you own it, and it works offline.
- Hosted — a company runs the model on their servers and you use it through an app or website with zero setup (most AI products work this way).
- Cloud — effectively the same as hosted; “cloud” just emphasizes that the compute lives in a remote data center rather than your living room.
The tradeoff is simple. Local gives you privacy, ownership, no subscription, and no censorship-by-default — at the cost of needing a capable PC and a little setup. Hosted/cloud gives you instant, no-GPU access on any device — at the cost of your conversations living on someone else’s servers, subject to their policies. The full comparison lives in local AI vs cloud AI.
Putting it together
Here’s the whole glossary as one mental model: a model has a size in billions of parameters (B); quantization (e.g. Q4_K_M) shrinks it into a GGUF file; that file loads into VRAM (or RAM if your GPU is small); a runtime like Ollama runs it, producing tokens per second; the context window sets how much it remembers at once; and abliterated/uncensored variants drop the built-in refusals. Choose local for privacy and ownership, hosted/cloud for convenience.
Now that the words make sense, the natural next step is to actually run something — how to run AI locally takes you from zero to a working chat.
Once the vocabulary clicks, you’ve got two clean roads. If you want to own the whole thing on your own machine — no subscription, no servers, fully private and uncensored — Ember is a one-time-purchase local AI companion that runs on Ollama exactly as described above. If you’d rather skip the hardware and just start talking today, Freya runs the same kind of companion in the cloud with zero setup — pick whichever fits the tradeoff you just learned.
