A 12GB or 16GB GPU is the real sweet spot for running AI locally. It’s the point where you stop squeezing tiny 7B models and start running 12B–14B models that actually feel smart — coherent over long conversations, good at following character, and quick enough to be enjoyable. If you own an RTX 3060 12GB, a 4070, or a 4060 Ti 16GB, this is the tier where local AI stops being a science project and starts being something you’d use every day.

This guide is specific: exact ollama pull commands, real VRAM footprints, honest tokens-per-second expectations, and the one spec NVIDIA buried that quietly throttles one of these cards. We’ll also call out the best uncensored picks for each tier, since that’s why a lot of people go local in the first place.

12GB (3060-12G / 4070) vs 16GB (4060 Ti 16G / 4070 Ti Super / 4080): what each unlocks

The jump from 8GB to 12GB is bigger than the 12-to-16 jump, but both matter. Here’s the practical breakdown:

| Tier | Cards | Comfortable model size | What you unlock |
| --- | --- | --- | --- |
| 12GB | RTX 3060 12G, 4070, 4070 Super | 12B–14B at Q4_K_M | Full 12B/14B models with room for ~8K–16K context |
| 16GB | RTX 4060 Ti 16G, 4070 Ti Super, 4080 | 14B comfortably, 22B-ish tight | Same models at higher quant (Q5/Q6), much bigger context, or a 22B at Q4 |

12GB is enough to run any of the modern 12B–14B models at Q4_K_M with a usable context window. That’s the headline answer to “is 12GB VRAM enough for local AI?” — yes, comfortably, for the model class most people actually want.

16GB doesn’t unlock a dramatically smarter class of model — you’re still in 12B–14B territory for the everyday sweet spot — but it removes compromises. You can run the same model at a higher-quality quantization (Q5_K_M or Q6_K), keep a much larger context window loaded, or stretch to a ~22B model at Q4 when you want maximum coherence. For roleplay and companion use, that extra headroom mostly buys you memory — longer conversations the model can actually keep in its head.

If you’re shopping rather than benchmarking what you own, our local AI hardware guide covers the price-per-VRAM tradeoffs across the whole stack, and if you’re on a smaller card, best local LLM for 8GB VRAM covers the tier below this one.

The 12B–14B sweet spot models

These three are the workhorses of this tier. All run well on 12GB, and 16GB lets you run them at higher quant or with more context.

Qwen3 14B: the strongest reasoner of the group. Excellent at instruction-following, coding, and structured output. At Q4_K_M the weights are roughly 9GB, leaving room for context on a 12GB card. This is the “smartest” pick if you want a capable general assistant.

Gemma 3 12B (QAT): Google’s 12B is a standout because of its QAT (Quantization-Aware Training) build. QAT models are trained to survive 4-bit quantization, so the int4 version keeps far more of the full model’s quality than a naively quantized model would. The QAT 12B lands in the ~8.9GB range and is one of the best-feeling models you can run on a single 12GB card. Strong multilingual support, good vision variants, and pleasantly natural prose. To actually get the QAT build, pull the gemma3:12b-it-qat tag; the bare gemma3:12b tag is the standard quantization, not QAT.

Mistral Nemo 12B: the efficiency champion. At Q4_K_M it’s only about 7GB, the smallest footprint of the three, and it ships with a native 128K context window. It’s a little less “academic” than Qwen3 but writes warmly and conversationally, which makes it a favorite for character work. The slim footprint means even a 12GB card has lots of room left for a long context.

| Model | Pull | ~Q4_K_M size | Notes |
| --- | --- | --- | --- |
| Qwen3 14B | ollama run qwen3:14b | ~9 GB | Best reasoning / coding |
| Gemma 3 12B QAT | ollama run gemma3:12b-it-qat | ~8.9 GB | Best quality-per-bit (QAT) |
| Mistral Nemo 12B | ollama run mistral-nemo:12b | ~7 GB | Smallest, native 128K context |

Best uncensored picks for each tier

The instruction-tuned base models above all refuse some requests by design. If you want a model that won’t lecture you or break character, you want an abliterated or community-finetuned variant. (For how that works under the hood, see abliterated models explained and our roundup of the best uncensored local AI models.)

For 12GB cards, the best uncensored picks are abliterated versions of the same sweet-spot models — an abliterated Qwen3 14B or Gemma 3 12B gives you the same intelligence with the refusal behavior removed, and these are available on Ollama and Hugging Face from established community uploaders. Mistral Nemo 12B is also the base for several popular uncensored roleplay finetunes, and its small footprint plus 128K context makes it a natural fit here.

For 16GB cards (best uncensored local AI for 16GB VRAM), you can do everything above at higher quant, or step up to a finetuned model in the ~20–22B range at Q4 for noticeably more coherent long-form character work. The extra VRAM is best spent on context headroom so an uncensored companion can remember a long conversation, rather than on a marginally larger model.

A note on sourcing: only pull GGUF files from reputable uploaders — see our uncensored models guide for what to look for. When in doubt, prefer well-known community maintainers with a long download history.

Quantization explained simply for these cards

Quantization is compression for model weights. A model trained in 16-bit precision can be squeezed down to ~4 bits per weight, cutting the file (and VRAM use) by roughly 4x with only a small quality loss. The tag you’ll see most is Q4_K_M — a 4-bit “K-quant, medium” format that’s the accepted sweet spot for these cards.
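To make the arithmetic concrete, here is a minimal sketch of how a quant tag translates into gigabytes. The bits-per-weight values are rough assumptions (K-quants average a little more than their nominal bit count, so Q4_K_M saves somewhat less than a full 4x), and the 14.8 billion parameter count is an assumed figure for a 14B-class model, so treat the output as an estimate rather than an exact file size.

```python
# Rough size estimate: parameters x bits-per-weight / 8 bits per byte.
# These bits-per-weight values are approximate assumptions, not exact format specs.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}


def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given quant tag."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


# 14.8 is an assumed parameter count (in billions) for a 14B-class model.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{estimated_size_gb(14.8, quant):.1f} GB")
```

Under these assumptions a 14B-class model comes out near 9 GB at Q4_K_M and a bit over 10 GB at Q5_K_M, which is in line with the footprints quoted in this guide.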

Here’s the rule of thumb for picking a quant on a 12/16GB GPU:

| Quant | Quality | Use it when |
| --- | --- | --- |
| Q4_K_M | Very good | Default. Best balance on 12GB. |
| Q5_K_M | Better | You have 16GB and want a quality bump |
| Q6_K | Near-lossless feel | 16GB + smaller model, want max fidelity |
| Q8_0 | Essentially lossless | Only worth it on small models with VRAM to spare |
| Q3 / Q2 | Degraded | Avoid unless you’re desperate to fit a bigger model |

On 12GB, run your 12B–14B at Q4_K_M and don’t overthink it. On 16GB, you have the freedom to go Q5_K_M or Q6_K on the same model for a real, noticeable quality bump. Gemma 3’s QAT builds are a special case — they’re engineered to hold up at 4-bit, so the int4 version already feels like a higher quant would on other models.

The memory-bandwidth caveat (the 128-bit bus)

Here’s the part most “best model” lists skip, and it’s the single most important thing to understand at this tier: VRAM capacity decides if a model fits; memory bandwidth decides how fast it generates text. Token generation is memory-bound — every token requires reading the entire active model from VRAM, so raw bandwidth largely sets your tokens-per-second.

This is where these three cards diverge sharply, even though two of them have the same 12GB:

| Card | VRAM | Bus | Bandwidth |
| --- | --- | --- | --- |
| RTX 3060 12GB | 12 GB | 192-bit | 360 GB/s |
| RTX 4070 | 12 GB | 192-bit | ~504 GB/s (GDDR6X) |
| RTX 4060 Ti 16GB | 16 GB | 128-bit | 288 GB/s |

The 4060 Ti 16GB has the most VRAM here but the least bandwidth: its 128-bit bus delivers only 288 GB/s, lower even than the older 3060. NVIDIA offset this with a large 32MB L2 cache (up from 4MB on the 3060 Ti), which reduces memory traffic in many workloads. But a model’s weights run to several gigabytes and cannot stay resident in a 32MB cache, so for LLM inference the narrow bus still shows up as slower generation than the bandwidth-rich 4070. So the honest tradeoff is: the 4060 Ti 16GB lets you load bigger models and longer contexts, but generates tokens more slowly, while the 4070 generates faster but caps out at 12GB. Neither is wrong; it depends on whether you value headroom or speed.
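The memory-bound argument can be turned into a back-of-the-envelope ceiling: if every token needs one full read of the weights, bandwidth divided by model size is the most tokens per second a card could ever produce. The sketch below assumes a 9 GB model (roughly a 14B at Q4_K_M) and ignores KV cache reads, compute time, and runtime overhead, so real speeds land below these numbers.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper bound: one full pass over the weights per generated token."""
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures in GB/s, taken from the table above.
CARDS = {
    "RTX 3060 12GB": 360,
    "RTX 4070": 504,
    "RTX 4060 Ti 16GB": 288,
}

MODEL_SIZE_GB = 9.0  # assumption: a 14B-class model at Q4_K_M

for name, bandwidth in CARDS.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

For a 9 GB model the ceilings work out to about 40, 56, and 32 tok/s respectively. The ordering matters more than the exact values: the 4060 Ti 16GB sits lowest despite having the most VRAM.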

Context-window budgeting at 12/16GB

The model weights aren’t the only thing in VRAM. The KV cache — the model’s working memory for the current conversation — grows with your context length, and it can eat several gigabytes on its own. This is why a model that “fits” might still run out of memory mid-chat.

A practical way to budget:

  • 12GB: weights (~7–9GB) + leave ~3–4GB for context. That’s comfortable for 8K–16K tokens on most 12B–14B models. Mistral Nemo’s small footprint gives you the most context room here.
  • 16GB: same weights + ~7GB free → easily 16K–32K+ tokens, which is where companions start feeling like they remember you across a long session.
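The budgets above can be sanity-checked with the standard KV cache size formula: two tensors (keys and values) per layer, per KV head, per token. The architecture numbers in this sketch (40 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions chosen to resemble a 14B-class model with grouped-query attention; check the model card for the real values before relying on the result.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values for every layer and every token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * bytes_per_value * context_tokens / 1e9


def max_context_tokens(free_gb: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """How many tokens of KV cache fit in the VRAM left over after the weights."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return int(free_gb * 1e9 // bytes_per_token)


# Assumed architecture for illustration only; not read from any specific model.
ASSUMED = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for ctx in (8192, 16384, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(context_tokens=ctx, **ASSUMED):.2f} GB")
print("Tokens that fit in 3 GB:", max_context_tokens(3.0, **ASSUMED))
```

With these assumed values, 8K of context costs roughly 1.3 GB and 32K roughly 5.4 GB, which is consistent with the 12GB and 16GB budgets above. Actual usage runs higher because the runtime also allocates compute buffers on top of the cache.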

If your conversations get cut off or the model “forgets” what you said earlier, you’re hitting the context ceiling — raise num_ctx in Ollama if VRAM allows, or move to a 16GB card. For companions specifically, pairing a generous context window with a memory layer is the real unlock — see local AI with persistent memory.
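One way to raise num_ctx is per request through Ollama's local HTTP API, by passing it in the options object. This is a minimal sketch using only the Python standard library; the helper names build_request and generate are hypothetical, the 16384 default is just an example value, and it assumes Ollama is running on its default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context length in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 16384) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("mistral-nemo:12b", "Summarize our chat so far.", num_ctx=16384))
```

The same setting can also be applied inside an interactive ollama run session with /set parameter num_ctx 16384, or baked into a Modelfile with a PARAMETER num_ctx line. Whichever route you use, a larger value grows the KV cache, so raise it in steps and watch VRAM.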

Exact pull commands and tok/s

Install Ollama first if you haven’t:

curl -fsSL https://ollama.com/install.sh | sh

Then pull and run any of these (Ollama’s default tags are 4-bit, typically Q4_K_M):

# The 12B–14B sweet spot
ollama run qwen3:14b              # ~9 GB, strongest reasoning
ollama run gemma3:12b-it-qat      # ~8.9 GB, QAT, best quality-per-bit
ollama run mistral-nemo:12b       # ~7 GB, native 128K context

For Gemma 3, use the gemma3:12b-it-qat tag specifically — the bare gemma3:12b tag is the standard quantization, while -it-qat is the quantization-aware-trained build the section above is about.

16GB owners — pushing past Q4. Ollama’s official qwen3:14b library only ships Q4_K_M, Q8_0, and FP16 tags (there is no official Q5_K_M tag), so you have two honest routes for a quality bump:

# Route A (official): pull the QAT Gemma for higher quality-per-bit at 4-bit
ollama run gemma3:12b-it-qat              # ~8.9 GB

# Route B (community upload): a Q5_K_M GGUF of Qwen3 14B
ollama run dengcao/Qwen3-14B:Q5_K_M       # ~11 GB, community-uploaded

Route B is a community-maintained GGUF, not an official Qwen/Ollama build — vet the uploader before pulling. And it’s only worth it if it still fits alongside your context budget: at ~11GB on a 16GB card, Qwen3 14B at Q5_K_M plus a growing KV cache can get tight, so leave room or keep num_ctx modest.

The Ollama API runs on loopback at 127.0.0.1:11434 — nothing leaves your machine. Rough generation speeds for a 12B–14B at Q4_K_M (single GPU, short context):

| Card | Expected tok/s (12B–14B, Q4) |
| --- | --- |
| RTX 3060 12GB | ~25–35 tok/s |
| RTX 4070 | ~40–55 tok/s |
| RTX 4060 Ti 16GB | ~20–30 tok/s |

These are ballpark figures, not cited benchmarks; your numbers will vary with quant, context length, and driver/runtime. Anything above ~15–20 tok/s is faster than most people read, so all three cards are comfortably usable. (See tokens per second: what’s actually usable for the perception thresholds.)
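Since the figures above are only ballpark, it is worth measuring your own card. Ollama's generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds), and dividing one by the other gives tokens per second. The sketch below is one way to do that; the function names are hypothetical, and the sample values are made up purely to show the arithmetic.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> float:
    """Run one non-streaming generation against a local Ollama server and time it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.loads(resp.read()))


# Made-up response values, only to show the arithmetic: 300 tokens in 10 seconds.
sample = {"eval_count": 300, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")

# Real measurement (requires a running Ollama server with the model pulled):
# print(f"{measure('qwen3:14b', 'Write a short story about a lighthouse.'):.1f} tok/s")
```

Use a prompt that produces a few hundred tokens so the measurement is not dominated by startup, and run it a second time once the model is already loaded.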

Verdict by card → companion experience

For a conversational AI companion — long chats, in-character, uncensored — here’s where each card lands:

  • RTX 3060 12GB: the budget hero. Runs every 12B–14B sweet-spot model at Q4_K_M with usable context and ~25–35 tok/s. The single best value card for getting into local AI. This is “enough” and then some.
  • RTX 4070: the speed pick. Same model class as the 3060 but ~40–55 tok/s thanks to GDDR6X bandwidth. Snappy, responsive companion experience; the only limit is how much context fits in 12GB.
  • RTX 4060 Ti 16GB: the headroom pick. Slower generation (128-bit bus) but the extra 4GB means longer memory, higher quant, or a larger uncensored finetune. Best when you want a companion that remembers a lot and don’t mind a slightly more relaxed pace.

All three clear the bar for a great local companion. If you’d rather not assemble the model + frontend + memory stack yourself, Ember packages a fully-local, uncensored AI companion that runs entirely on your own GPU through Ollama — exactly the 12B–14B sweet spot this guide describes, with persistent memory and zero cloud, zero logging. It’s a one-time purchase, and your card is more than ready for it.