An 8GB graphics card is the single most common gaming GPU on the planet, and it happens to be the exact size where uncensored local AI stops being a science project and starts being genuinely good. With 8GB of VRAM you can run a 7B–8B model entirely on the GPU, get fast, fluid responses, and never send a single token to a cloud server that logs, trains on, or refuses your conversations. The trick is knowing which models actually fit, which quantization to pull, and how to avoid the one mistake that turns a snappy chatbot into a stuttering mess. This guide is the tested, no-fluff answer.
Why 8GB is the budget sweet spot
The RTX 3060 (in both its 8GB and 12GB variants) and the RTX 4060 have been near the top of the Steam Hardware Survey for years. Tens of millions of people already own one. That matters because the local-AI community optimizes hardest for the hardware most people actually have — which is why the 7B and 8B model class is the most polished, most fine-tuned, and most uncensored corner of the open-weight world.
Below 8GB (4–6GB cards), you’re forced into heavy quantization or 3B models that feel noticeably dumber. Above 8GB, you unlock 12B–14B models, but you’re spending real money for diminishing returns on casual chat and roleplay. At 8GB you get the full 8B experience — coherent, in-character, fast — without a GPU upgrade. It’s the floor where “this is actually good” begins. For the bigger picture on matching models to hardware, see the local AI hardware guide.
The Q4_K_M VRAM math for an 8GB card
Here’s the only formula you need. A model’s GPU memory footprint is roughly:
VRAM ≈ (model file size) + (KV cache for your context length) + ~0.5–1 GB overhead
Quantization is what shrinks the model file. The tag you want on an 8GB card is Q4_K_M, a 4-bit quant that keeps the most important weights at higher precision. It’s the community-default sweet spot of quality vs. size. For a 7B/8B model, Q4_K_M lands around 4.4–4.9 GB on disk and takes roughly the same amount of VRAM once loaded.
| Model size | Quant | File size | Fits in 8GB? |
|---|---|---|---|
| 7B | Q4_K_M | ~4.4 GB | Yes, comfortably |
| 8B | Q4_K_M | ~4.9 GB | Yes |
| 8B | Q5_K_M | ~5.7 GB | Yes, but tighter context |
| 8B | Q6_K | ~6.6 GB | Only with short context |
| 13B | Q4_K_M | ~7.9 GB | No — will spill to CPU |
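The file sizes in the table follow from a simple rule of thumb: parameter count times bits per weight, divided by eight. Here is a minimal Python sketch of that arithmetic. The bits-per-weight figures are rough averages assumed for illustration (real files vary a little by model), and the function name is hypothetical.

```python
# Rough average bits per weight for common GGUF quants.
# These values are approximations assumed for this sketch, not exact specs.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
}


def estimate_file_gb(params_billions: float, quant: str) -> float:
    """Estimate GGUF file size in GB: parameters x bits per weight / 8."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


if __name__ == "__main__":
    # Parameter counts: Mistral 7B is about 7.24B, Llama 3 8B about 8.03B.
    models = [("7B (Mistral)", 7.24), ("8B (Llama 3)", 8.03), ("13B", 13.0)]
    for name, params in models:
        for quant in APPROX_BITS_PER_WEIGHT:
            size = estimate_file_gb(params, quant)
            print(f"{name:14s} {quant:7s} ~{size:.1f} GB")
```

Under these assumptions the estimates land within about 0.1 GB of the table, which is close enough to check a download before pulling it.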
The part people forget is the KV cache: the memory that holds your conversation history. For these 7B/8B models with the default 16-bit cache it costs roughly 128 KB per token, so it is small at 4K (about 0.5 GB), around 1 GB at 8K, about 2 GB at 16K, and close to 4 GB at 32K. So the real math on an 8GB card is: ~5GB model + ~1.5GB context (roughly 12K tokens) + ~1GB overhead ≈ 7.5GB. That fits, with a sliver to spare. Go bigger on any one term and you spill into system RAM, which is the cardinal sin we cover at the end. For a deeper breakdown of every quant tier, the GGUF quantization cheat sheet is the companion read.
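The budget above can be sketched in a few lines of Python. This is an estimate, not a measurement. It assumes a 16-bit KV cache and the attention shape shared by Llama 3 8B and Mistral 7B (32 layers, 8 KV heads, head dimension 128). The function names and the 1 GB overhead default are illustrative choices, and file sizes in GB are treated as GiB for simplicity.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length.

    Defaults assume the Llama 3 8B / Mistral 7B attention shape and a
    16-bit cache. Keys and values are both stored, hence the factor of 2.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


def vram_budget_gib(model_gib: float, context_tokens: int,
                    overhead_gib: float = 1.0) -> float:
    """Model weights + KV cache + a fixed overhead (the overhead is a guess)."""
    return model_gib + kv_cache_gib(context_tokens) + overhead_gib


def fits_on_card(model_gib: float, context_tokens: int,
                 card_gib: float = 8.0, overhead_gib: float = 1.0) -> bool:
    """True if the estimated budget stays within the card's VRAM."""
    return vram_budget_gib(model_gib, context_tokens, overhead_gib) <= card_gib


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384, 32768):
        total = vram_budget_gib(4.9, ctx)
        verdict = "fits" if fits_on_card(4.9, ctx) else "spills"
        print(f"8B Q4_K_M at {ctx:>5} tokens: ~{total:.1f} GiB ({verdict})")
```

Swapping in a different model size or context length shows quickly which term pushes the total over the card's limit. A quantized KV cache would lower bytes_per_value and stretch the usable context accordingly.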
Best uncensored picks for 8GB
These are the three categories worth your VRAM. Each is uncensored in a different way — some by design (Dolphin), some by surgical de-refusal (abliteration), some by roleplay-focused fine-tuning (Stheno/Lumimaid).
Dolphin-Mistral 7B — the reliable uncensored generalist
The Dolphin family, fine-tuned by Eric Hartford, strips the refusal-heavy instruction layer and produces a model that just answers. Dolphin-Mistral 7B is based on Mistral 0.2 and is excellent for uncensored Q&A, creative writing, coding help, and “explain this without a lecture” tasks. The Ollama library reports up to 32K context, though the 2.8 fine-tune itself was trained at 16K sequence lengths, so treat ~16K as the reliable usable window and 32K as a softer ceiling. It’s the safest first pull — it’s on the official Ollama library, so there’s no GGUF hunting.
Stheno / Lumimaid 8B — the roleplay specialists
If your goal is character roleplay, the L3-8B-Stheno line (and the similar Llama-3-Lumimaid 8B) are the community’s gold standard. Per their Hugging Face model cards, Stheno is fine-tuned on a deliberate mix of SFW and NSFW story and roleplay data, which is exactly what makes it hold a character voice over long scenes where a vanilla model breaks character or moralizes. These aren’t in the official curated Ollama library — so you grab the community Q4_K_M GGUF from Hugging Face (or a community Ollama mirror), import it, or run it through a front-end like SillyTavern. See the best local LLM for roleplay guide for the full roleplay-tuned roster.
Llama-3.1-8B-abliterated — the smart, modern all-rounder
When you want Llama 3.1’s intelligence without its (substantial) refusal training, an abliterated build is the answer. Abliteration mathematically suppresses the model’s refusal direction while leaving the rest of its capability intact — so you keep the reasoning and instruction-following, minus the “I can’t help with that.” The widely-pulled mannix/llama3.1-8b-abliterated is the easiest route on Ollama. It’s the best balance of brains and freedom in this weight class, and a great daily driver. More options live in best uncensored local AI models and the Ollama uncensored models list.
Per-model tok/s and context limits
Typical behavior on an 8GB RTX 3060/4060 with the model fully in VRAM, at Q4_K_M and short-to-moderate context. Treat the VRAM figures as ballpark and the speed column as qualitative: your actual tokens/sec varies with quant, context length, and driver version.
| Model | Quant | VRAM use | Context | Speed (typical) | Best for |
|---|---|---|---|---|---|
| Dolphin-Mistral 7B | Q4_K_M | ~4.5 GB | up to 32K (~16K reliable) | Fast, very fluid | Uncensored chat, writing & code |
| Stheno 8B | Q4_K_M | ~5.0 GB | 8K usable | Fast | One-on-one roleplay |
| Lumimaid 8B | Q4_K_M | ~5.0 GB | 8K usable | Fast | Character / story RP |
| Llama-3.1-8B-abliterated | Q4_K_M | ~4.9 GB | 8K–16K | Fast | Smart general all-rounder |
A few honest notes: on a 3060 these 7B/8B models generate text faster than you can read it, comfortably above the ~7–10 tok/s threshold where text feels real-time (see the guide to what’s actually usable tokens per second). The bigger constraint on 8GB isn’t speed, it’s context length: every extra thousand tokens of history costs VRAM, so cap your context to what fits rather than maxing it out and spilling to CPU.
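To put a number on speed for your own card, you can compute tokens per second from the timing fields Ollama includes in its final API response: eval_count (tokens generated) and eval_duration (nanoseconds spent generating). The sketch below assumes you already have that response as a Python dict. The helper names are hypothetical, and the sample values are made up for illustration, not benchmark results.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response: eval_count / eval_duration.

    eval_duration is reported in nanoseconds, so convert to seconds first.
    """
    seconds = response["eval_duration"] / 1e9
    if seconds <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / seconds


def feels_realtime(tok_per_s: float, threshold: float = 10.0) -> bool:
    """Compare against the upper end of the ~7-10 tok/s comfort threshold."""
    return tok_per_s >= threshold


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    sample = {"eval_count": 256, "eval_duration": 6_400_000_000}
    speed = tokens_per_second(sample)
    print(f"{speed:.1f} tok/s, real-time: {feels_realtime(speed)}")
```

Running the same prompt at two context sizes and comparing the results is a quick way to see whether a larger context has started spilling to CPU.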
Exact Ollama pull commands
First install Ollama (one line, works on Linux/Mac; Windows has an installer):
curl -fsSL https://ollama.com/install.sh | sh
Then pull and run. The Ollama API is local-only at 127.0.0.1:11434 — nothing leaves your machine.
# Dolphin-Mistral 7B — uncensored generalist (official library)
ollama run dolphin-mistral
# Llama 3.1 8B abliterated — smart all-rounder
ollama run mannix/llama3.1-8b-abliterated:q4_k_m
For Stheno and Lumimaid, which aren’t in the official curated Ollama library, download the Q4_K_M GGUF from Hugging Face (or pull from a community Ollama mirror) and import it with a tiny Modelfile:
# After downloading e.g. L3-8B-Stheno-v3.2-Q4_K_M-imat.gguf
printf 'FROM ./L3-8B-Stheno-v3.2-Q4_K_M-imat.gguf\n' > Modelfile
ollama create stheno -f Modelfile
ollama run stheno
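Once a model is pulled, the local API mentioned above can also be called from a script. The following is a minimal sketch using only the Python standard library. It assumes Ollama is running on its default address and that the model name you pass has already been pulled. The function names and the 8192-token default are illustrative choices.

```python
import json
import urllib.request

# Default local endpoint; requests to it stay on this machine.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request with a capped context length."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not chunks
        "options": {"num_ctx": num_ctx},  # keep the KV cache within budget
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 8192,
             timeout: float = 300.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the server running, a call such as generate("dolphin-mistral", "Summarize GGUF in one sentence.") returns the reply as a string. Setting num_ctx per request is one way to keep context inside the VRAM budget without editing a Modelfile.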
New to all of this? The step-by-step how to install Ollama and how to run AI locally guides walk it from zero.
Roleplay vs. chat vs. coding: pick by job, not hype
One model rarely wins everything. Match the model to the task:
- Roleplay / companion / story → Stheno or Lumimaid 8B. They hold character, embrace a persona, and don’t moralize mid-scene. Pair them with a front-end like SillyTavern for memory and character cards.
- General uncensored chat & writing → Dolphin-Mistral 7B or Llama-3.1-8B-abliterated. Coherent, answers directly, good at “help me think.”
- Coding → Dolphin-Mistral 7B is the standout here for its size: the 2.8 fine-tune is coding-focused and posts a respectable ~46.9% pass@1 on HumanEval, so it’s genuinely useful for snippets, refactors, and explanations. Just keep expectations honest: any 7B–8B model is a helper, not a senior engineer, and for large, serious codebases a coding-specialized model and more VRAM are still the real answer.
The reason any of this beats a cloud chatbot isn’t just freedom — it’s that the model can’t refuse you and can’t log you, because it runs on your hardware. That’s the whole pitch behind why cloud AI censors you.
Don’t want to do the VRAM math?
Everything above is doable, but it’s also a chunk of an afternoon: install Ollama, learn quant tags, hunt down the right GGUF, write a Modelfile, wire up a front-end, and tune context so you don’t spill to CPU. Plenty of people just want to talk to the thing.
That’s the gap Ember fills. It runs these exact uncensored local models on your own machine through Ollama — handling the model selection, the quant, and the VRAM budgeting for you — so an 8GB RTX 3060 or 4060 owner gets a working, private, uncensored companion without touching a Modelfile. And if you don’t have a capable GPU at all, Freya is our hosted, zero-setup option that needs no VRAM math because there’s no VRAM for you to manage.
Common 8GB mistakes (spilling to CPU)
The number-one mistake on an 8GB card is silent CPU offload. When a model plus its context doesn’t fit in VRAM, Ollama doesn’t crash — it quietly moves some layers to system RAM and runs them on the CPU. Speed falls off a cliff: a model that ran at conversational speed suddenly limps at a few tokens per second.
How to avoid it:
- Watch your VRAM. Run nvidia-smi while the model is loaded. If GPU memory is pinned near 8GB and your CPU is busy during generation, you’re spilling.
- Don’t max the context. A 32K context on an 8GB card is often the thing that pushes you over. Set a realistic context (e.g. 8K): OLLAMA_CONTEXT_LENGTH or a Modelfile line such as PARAMETER num_ctx 8192 keeps the KV cache in check.
- Don’t over-quant upward. Q6_K or Q8 “for quality” on an 8GB card just guarantees a spill. Q4_K_M is the right call here: the quality gap to Q5/Q6 is small, and staying fully on-GPU matters far more than a tiny precision bump.
- Close the VRAM hogs. A browser with 40 tabs and a game launcher can quietly eat 1–2GB. Free it before you load an 8B model.
Get those four right and an 8GB card runs an uncensored 8B model beautifully — fast, fully on-GPU, and completely private.
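The VRAM check can be scripted as well. The sketch below is one possible approach: it reads used and total memory from nvidia-smi's CSV query output and flags a card that is nearly full. The function names and the 512 MiB headroom threshold are arbitrary illustrative choices, and a nearly full card is only a hint that a spill is likely, not proof of one.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (values in MiB) into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def near_full(used_mib: int, total_mib: int, headroom_mib: int = 512) -> bool:
    """True when less than headroom_mib of VRAM is still free."""
    return total_mib - used_mib < headroom_mib


def read_gpu_memory() -> list[tuple[int, int]]:
    """Query nvidia-smi for used and total memory on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Call read_gpu_memory() while a model is loaded and pass each pair to near_full(). For a direct answer, the ollama ps command also reports how a loaded model is split between CPU and GPU.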
If you’d rather skip the tuning entirely and just have a private, uncensored companion that already knows how to run these models on your hardware, that’s exactly what Ember is built to do.
