If you want to run a local LLM and you’re shopping on a budget, here’s the short version: buy a used or new RTX 3060 12GB. It’s the cheapest card that runs genuinely useful models — 8B–14B class, the sweet spot for chat and roleplay — without constantly slamming into “out of memory.” Below that you’re stuck with tiny models that feel like a worse cloud. Above it you’re paying real money for marginal speed gains until you hit the next true tier (24GB). This guide explains exactly why, what each budget card can actually run, and how to spend the least money for the most capable private AI.
The budget verdict up front: RTX 3060 12GB
For most people asking “what’s the cheapest GPU to run a local LLM,” the answer is the NVIDIA RTX 3060 12GB. Not the 8GB version — the 12GB one. That extra 4GB of VRAM is the entire reason this card matters, and the 8GB variant should be avoided for AI specifically.
Why this card wins on value:
- 12GB of VRAM fits an 8B model at a comfortable quantization (Q4_K_M/Q5_K_M) entirely on the GPU, with room for a decent context window. It also squeezes in 12B–14B class models at Q4.
- CUDA: it’s NVIDIA, so everything (Ollama, llama.cpp, koboldcpp, every front-end) “just works” with zero driver wrestling.
- It’s cheap and everywhere. It was the most-sold GPU of its generation, so the used market is flooded, and prices are low.
- It sips power relative to the big cards (170W TDP), so it won’t blow up your PSU budget or your electric bill.
It’s not fast in absolute terms — it’s a mid-range card from a few generations back — but “fast enough to be pleasant” is a low bar for chat, and the 3060 clears it. For a full breakdown of how VRAM maps to model sizes, see our local AI hardware guide.
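To sanity-check the "fits in 12GB" claim, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximate averages assumed for illustration (real GGUF files vary by model), and `reserve_gb` is a hypothetical allowance for context and runtime overhead, not a measured value.

```python
# Approximate average bits per weight for two common GGUF quantizations.
# These are assumed ballpark values; actual files differ by architecture.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


def fits(params_billions: float, quant: str, vram_gb: float,
         reserve_gb: float = 2.0) -> bool:
    """True if the weights still leave reserve_gb free for context and overhead.

    reserve_gb is a hypothetical placeholder, not a measured requirement.
    """
    return weights_gb(params_billions, quant) + reserve_gb <= vram_gb


for size in (8, 14, 32):
    need = weights_gb(size, "Q4_K_M")
    print(f"{size}B at Q4_K_M: ~{need:.1f} GB of weights, "
          f"fits in 12 GB: {fits(size, 'Q4_K_M', 12)}, "
          f"fits in 24 GB: {fits(size, 'Q4_K_M', 24)}")
```

Under these assumptions an 8B or 14B model at Q4 leaves spare room on a 12GB card, while a 32B model only fits on a 24GB card, which matches the tiers described here.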
The contenders: 3060 12GB vs Arc B580 vs used 3090
Three cards dominate the budget conversation in 2026. They sit at different price points and solve different problems.
| | RTX 3060 12GB | Intel Arc B580 | Used RTX 3090 |
|---|---|---|---|
| VRAM | 12 GB | 12 GB | 24 GB |
| Memory bus | 192-bit | 192-bit | 384-bit |
| Ecosystem | CUDA (best) | Vulkan/SYCL (improving) | CUDA (best) |
| Power (TDP) | 170W | 190W | 350W |
| New/used | both | new | used only |
| Best for | cheapest reliable entry | newest cheap card | most VRAM per dollar |
RTX 3060 12GB: the safe, cheap default. CUDA support is flawless and the VRAM is enough for the models most people actually want.
Intel Arc B580: the interesting newcomer. Also 12GB, brand new, often cheap, and Intel has put real work into its AI software stack. The catch: outside CUDA you’re relying on Vulkan or SYCL backends, which are good and getting better but still occasionally rougher than NVIDIA’s “it just works.” Ollama and llama.cpp support it, but you’ll do slightly more troubleshooting. If you specifically want the AMD path instead, that’s a separate calculus; see AMD GPU for local LLM.
Used RTX 3090: the value bomb. It’s two generations old, draws a lot of power, and you’re buying it second-hand from someone who mined or gamed on it. But it has 24GB of VRAM, and VRAM is the thing that actually limits what you can run. Measured in usable AI capability per dollar, a used 3090 is frequently the smartest buy on this list, because it jumps you a whole tier. We treat it as its own category below.
Real tok/s on each, for chat and roleplay
Tokens per second (tok/s) is the speed you actually feel. Roughly, anything above ~15 tok/s outpaces most people’s reading speed and feels conversational; below ~8 tok/s it starts to drag. (More on what’s actually usable: tokens per second, usable.)
Exact numbers vary with quantization, context length, driver version, and which inference engine you use, so treat these as honest ballparks for a fully-GPU-resident model — not lab benchmarks:
| Model size (Q4) | RTX 3060 12GB | Arc B580 | Used RTX 3090 |
|---|---|---|---|
| 8B (chat sweet spot) | ~30–45 tok/s | ~25–40 tok/s | ~70–110 tok/s |
| 12B–14B (richer roleplay) | ~18–28 tok/s | ~15–25 tok/s | ~45–70 tok/s |
| 32B (near-cloud quality) | won’t fit fully | won’t fit fully | ~25–35 tok/s |
The takeaways:
- For an 8B model, all three are comfortably “faster than reading.” The 3060 and B580 are roughly in the same league; the 3090 is dramatically quicker.
- For roleplay, where long character cards and growing chat history eat context, the 3090’s headroom matters more than its raw speed. The 12GB cards sit closer to their memory ceiling, so they are the first to slow down as context grows.
- The 3090 is the only card here that runs a 32B model entirely on the GPU at a usable speed. That’s the whole reason to buy one.
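Because these figures are ballparks, it is worth measuring your own card. The sketch below assumes a local Ollama server on its default port and uses the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that its `/api/generate` endpoint returns. The model tag and prompt are placeholders, not recommendations.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def main() -> None:
    # "llama3.1:8b" is an example tag; substitute any model you have pulled.
    result = generate("llama3.1:8b", "Describe a rainy street in three sentences.")
    print(f"{tokens_per_second(result):.1f} tok/s")
```

Calling `main()` with Ollama running prints a single tok/s figure. Run it a few times and with a longer prompt, since the first request includes model load time and speed shifts with context length.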
Why VRAM matters even more for uncensored and long-context
People underestimate this. A model’s parameters are only part of what lives in VRAM. The other part is the KV cache — the memory that holds your conversation context — and it grows with every token of history.
This is exactly where roleplay and uncensored companion use punish a small card:
- Character cards and system prompts can be thousands of tokens before you say a word.
- Long sessions accumulate history; a good companion remembers the conversation, and that memory lives in VRAM.
- Bigger context windows (8K, 16K, 32K) multiply the KV cache size.
When you run out of VRAM, layers spill to system RAM and inference slows to a crawl — your tok/s can fall off a cliff mid-conversation. So the practical question isn’t “does the model fit?” but “does the model plus a long conversation fit?” That buffer is why 12GB is the real floor and why 24GB feels luxurious.
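To put rough numbers on that buffer, here is a minimal sketch of the standard KV cache size estimate. The model shape (32 layers, 8 key/value heads, head dimension 128) is an assumption modeled on a Llama-3-8B-style architecture, and the cache is assumed to be stored in 16-bit precision; other models and cache settings will give different results.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed Llama-3-8B-style shape; not taken from any specific card or app.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (8192, 16384, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.1f} GiB")
# Prints:
#   8192 tokens -> 1.0 GiB
#  16384 tokens -> 2.0 GiB
#  32768 tokens -> 4.0 GiB
```

Under these assumptions the cache costs about 128 KiB per token, so a 32K context adds roughly 4 GiB on top of the weights. That is easy to absorb on a 24GB card and a real squeeze on a 12GB one.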
It’s also why uncensored models don’t change the math much: an abliterated 8B model is the same size as its censored sibling — same VRAM, same speed. You’re not paying a performance tax to remove the refusals, you’re just choosing a different fine-tune. What you do want is enough headroom to run the slightly larger, more coherent uncensored models without choking. For the model picks themselves, see the best GPU for uncensored LLM and our 8GB VRAM model guide if you’re working with less.
New vs used: the used-3090 value play
This is the single highest-leverage decision in a budget build.
A new RTX 3060 12GB is a clean, warrantied, low-power, no-surprises card. It’s the right call if you want zero fuss and you’re happy living in the 8B–14B world.
A used RTX 3090 is the move if you want to punch above your budget. You typically get it for a fraction of the price of a new 24GB card, and it unlocks the 24GB tier: 32B-class models, much longer context, faster everything. The trade-offs are real and worth saying plainly:
- No warranty, and you’re buying someone’s used silicon. Test it immediately under load.
- 350W TDP — you need a beefier power supply (think 750W+) and good case airflow.
- It runs hot and can be loud, especially the blower and some triple-fan models.
- Two generations old, so no newer architectural features — but for LLM inference, raw VRAM and bandwidth are what count, and the 3090 has both.
For most people whose goal is “run the best private AI I can afford,” a used 3090 beats a new mid-range card on capability-per-dollar. If you’d rather build the whole machine around it sensibly, our best budget AI PC build covers PSU, cooling, and pairing.
Power, noise, and total cost
The sticker price isn’t the whole cost. Budget cards differ a lot in what they demand from the rest of your system:
| | RTX 3060 12GB | Arc B580 | Used RTX 3090 |
|---|---|---|---|
| TDP | 170W | 190W | 350W |
| Recommended PSU | 550W | 600W | 750W+ |
| Noise under load | quiet | quiet–moderate | moderate–loud |
| Heat output | low | low | high |
The 3090’s extra ~180W over the 3060 matters in two ways. First, it may force a PSU upgrade, so factor that into the total. Second, if your machine is in your bedroom (and for a private companion, it often is), the noise and heat of a 3090 under sustained load are noticeable in a way the 3060’s simply aren’t. The 3060 and B580 are quiet, cool, “set it and forget it” cards. The 3090 is a workhorse that announces itself.
If you’re running the AI for hours a day, the 3090 also costs more in electricity — not enough to change the verdict, but enough to mention honestly.
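For a sense of scale, the electricity gap can be worked out directly. This sketch treats each card as drawing its full TDP whenever it is in use, which is an upper bound, and the usage hours and price per kWh are hypothetical inputs you should replace with your own.

```python
def yearly_cost(watts: float, hours_per_day: float, price_per_kwh: float) -> float:
    """Yearly electricity cost of a constant load, in your local currency."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * price_per_kwh


# Hypothetical inputs: 4 hours a day at 0.15 per kWh, cards pinned at TDP.
HOURS, PRICE = 4, 0.15

cost_3060 = yearly_cost(170, HOURS, PRICE)
cost_3090 = yearly_cost(350, HOURS, PRICE)
print(f"RTX 3060: {cost_3060:.2f} per year")
print(f"RTX 3090: {cost_3090:.2f} per year")
print(f"Difference: {cost_3090 - cost_3060:.2f} per year")
```

With those inputs the difference comes to about 39 per year, and real usage is likely lower because a card rarely sits at full TDP during chat.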
What each budget GPU lets you run (Ember-capable tiers)
Here’s the part that actually matters for a private AI companion. Apps like Ember run entirely on your own machine via Ollama, so what you can run is what your GPU can hold. Mapping cards to experience:
| GPU | VRAM | Comfortable model tier | What the companion feels like |
|---|---|---|---|
| RTX 3060 8GB (avoid) | 8 GB | up to ~8B, tight | works, but cramped context |
| RTX 3060 12GB | 12 GB | 8B comfortably, 12B–14B at Q4 | the budget sweet spot — coherent, remembers the conversation |
| Arc B580 | 12 GB | same as 3060, slightly more setup | same tier, CUDA-free path |
| Used RTX 3090 | 24 GB | up to 32B, long context | near-cloud quality, deep memory, fast |
All three of the recommended cards (3060 12GB, B580, used 3090) are fully Ember-capable — they each clear the bar for a smooth, private companion with real conversational memory. The 3060 12GB is the floor where it feels good; the 3090 is where it feels premium. If you’re unsure how much VRAM your specific use needs, how much VRAM for a local AI companion and the broader hardware guide go deeper.
Buy advice by budget band
Cut to the chase:
- Rock-bottom budget → Used RTX 3060 12GB. The cheapest path to a genuinely good local AI. Verify it’s the 12GB SKU before you pay. Don’t bother with the 8GB.
- Small budget, want new + warranty → New RTX 3060 12GB or Intel Arc B580. Pick the 3060 if you want zero software friction (CUDA); pick the B580 if it’s cheaper where you are and you don’t mind a little setup.
- A bit more to spend, want the most capability → Used RTX 3090. The value king. Budget for a 750W+ PSU and good airflow, test it under load on day one, and you’ve bought yourself the 24GB tier for budget-card money.
- Already have a 3090 → you’re done. That’s the 24GB sweet spot; run 32B models and enjoy.
Whichever you land on, the through-line is the same: VRAM is the budget, everything else is details. Buy the card that gives you the most VRAM you can reliably afford, install Ollama, and you’ve got a private AI that owes nothing to the cloud.
Once the hardware’s sorted, the next question is what to actually run on it — and if you want a private, uncensored AI companion that lives entirely on that GPU with real memory and no logging, that’s exactly what Ember is built for.
