Picking a GPU for a stock assistant model is easy: buy whatever fits a small quant and move on. Picking one for uncensored local AI is a different problem, and most buying guides get it wrong because they benchmark short, polite prompts. Uncensored and abliterated models are run for long, persistent, in-character roleplay and sensitive conversations — the exact workload that fills VRAM and never lets go. The right card isn’t the fastest one; it’s the one with enough memory to hold a capable model plus thousands of tokens of conversation history without spilling to system RAM and grinding to a crawl.
This guide maps real VRAM tiers to the uncensored model classes you’d actually run, explains why your session balloons past the model’s “base size,” and names the specific cards worth buying in 2026. The short version: a 24GB card is the sweet spot, and a used RTX 3090 is the best dollar-for-dollar pick on the planet for this job. Here’s the reasoning.
Why uncensored use changes the GPU calculus
A safety-tuned cloud-style model is built to refuse, deflect, and keep replies short. The whole point of an abliterated model — a model whose refusal direction has been surgically removed — is the opposite: it stays in character, follows your scene, and writes long. (For the mechanics, see abliterated models explained.) That changes hardware needs in three concrete ways:
- You want bigger, smarter base models. Refusal-removal works best on models that were already coherent. The popular uncensored finetunes live in the 8B–13B and 22B–34B ranges, with 70B as the prestige tier. Bigger models need more VRAM.
- You run them in less aggressive quants. A heavily compressed model loses nuance and “voice”, which is fatal for roleplay. Enthusiasts favor Q5_K_M or Q6_K over the smaller Q4 quants when VRAM allows, trading extra memory (roughly a gigabyte at 8B, several at 34B) for noticeably better prose.
- Context length is the real workload. A throwaway question uses a few hundred tokens. A multi-hour companion session, with a character card, world info, and chat history, can sit at 8K–32K tokens — and that context lives in VRAM the entire time.
That last point is the one cheap guides miss, so let’s make it concrete.
VRAM tiers mapped to uncensored model classes
VRAM is the binding constraint. Everything else (clock speed, CUDA core count) only matters once the model fits. If a model and its context don’t fit in VRAM, the runtime offloads layers to system RAM and your tokens-per-second falls off a cliff: the difference between a snappy companion and one that types a sentence every ten seconds (see what tokens-per-second is actually usable).
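To put rough numbers on that spill, here is a minimal Python sketch of how many transformer layers stay on the GPU when the weights are larger than the available VRAM. It assumes every layer is the same size and that a fixed slice of VRAM is held back for the KV cache and runtime overhead. The function name `layers_on_gpu`, the 2 GB reserve, and the example model are all illustrative assumptions, not measurements. llama.cpp exposes this kind of split through its `--n-gpu-layers` option.

```python
def layers_on_gpu(weights_gb: float, n_layers: int, vram_gb: float,
                  reserved_gb: float = 2.0) -> int:
    """Estimate how many layers stay on the GPU.

    Assumes every layer is the same size and that reserved_gb (a guess)
    is held back for the KV cache and runtime overhead.
    """
    if weights_gb <= 0 or n_layers <= 0:
        raise ValueError("weights_gb and n_layers must be positive")
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical model: 24 GB of weights spread over 48 layers.
for vram in (16.0, 24.0, 48.0):
    on_gpu = layers_on_gpu(24.0, 48, vram)
    print(f"{vram:.0f} GB card: {on_gpu}/48 layers on GPU, {48 - on_gpu} in system RAM")
```

Every layer left in system RAM has to be processed over a much slower path on each token, which is why even a partial spill costs so much speed.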
Here’s the honest mapping for uncensored/roleplay use, assuming you want room for a real conversation, not just a one-line demo:
| VRAM | Comfortable uncensored model class | Quant | Realistic experience |
|---|---|---|---|
| 8 GB | 7B–8B abliterated | Q4_K_M | Works, but tight on context; fine for short chats (8GB picks) |
| 12–16 GB | 8B at long context, or 13B | Q4–Q5 | The practical floor for a real companion (12–16GB picks) |
| 24 GB | 22B–34B, or 8B–13B with huge context | Q5_K_M / Q6 | The sweet spot — smart model and long memory (24GB picks) |
| 32–48 GB | 70B partially, or 34B with massive context | Q4–Q5 | Diminishing returns for most; serious money |
| 2× 24 GB (48 GB) | 70B comfortably | Q4_K_M | Enthusiast ceiling; full 70B “feel” |
Notice the cliff at 24GB. Below it you’re constantly choosing between a smarter model and a longer memory. At 24GB you stop choosing: you get a genuinely capable 22B–34B model and enough headroom that a long session never evicts your character’s memory.
Why long roleplay sessions blow past the base footprint
A model’s download size is not its runtime size. The thing that grows is the KV cache — the running memory of every token in the conversation. It scales roughly linearly with context length, and it sits in VRAM alongside the model weights.
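The linear scaling can be sketched with the standard size formula for an unquantized transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumed Llama-3-8B-style configuration used only for illustration. Other models, and runtimes that quantize the cache, will give different numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of an unquantized KV cache in bytes.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for ctx in (2_048, 8_192, 32_768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

With these assumed values the cache costs 128 KiB per token, so quadrupling the context quadruples the memory. Models without grouped-query attention store far more per token.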
A rough mental model: a 13B model at Q5 is around 9–10GB of weights. Add a long session and the KV cache can add several more gigabytes on top. A 34B model at Q5 is in the low-to-mid 20s of gigabytes for weights alone, so on a 24GB card the top of that range usually means dropping to a Q4 quant to leave room for context, while 22B-class models can keep Q5 alongside a long session. That is why 24GB cards and 32K-token roleplay are the natural pairing, and why anything less means truncating your companion’s memory mid-scene.
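The weight figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below uses ballpark bits-per-weight values for common GGUF quants. These are assumptions, since the real figure varies by model and by quantization tool version, and the helper name `weights_gb` is hypothetical.

```python
# Approximate bits per weight for common GGUF quants (assumed ballpark values).
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6}


def weights_gb(params_billions: float, quant: str) -> float:
    """Rough weight size in GB (1e9 bytes) for a given quant."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billions * bits / 8


for size in (8, 13, 22, 34):
    row = ", ".join(f"{q}: {weights_gb(size, q):.1f} GB" for q in APPROX_BITS_PER_WEIGHT)
    print(f"{size}B -> {row}")
```

Treat the output as a planning estimate. The downloaded file size of the specific quant you intend to run is the more reliable number.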
This is the trap behind “but it fit when I tested it.” It fit for the first three messages. Two hours into a persistent-memory companion (how persistent memory works locally), the context has grown, the KV cache has grown, and a card that was 90% full at message one is now overflowing into system RAM. Buy for the session you’ll actually have, not the demo prompt. That single insight is why the value answer skews one card higher than most guides suggest.
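A quick way to buy for the session is to work out how many tokens, and roughly how many messages, fit in the VRAM left over after the weights. The Python sketch below does that arithmetic. Every number in it (weight size, KV cache cost per token, character card size, tokens per message, overhead) is a hypothetical placeholder to swap for your own model's figures. Some runtimes reserve the cache for the whole configured context window at load time, in which case the same arithmetic tells you the largest window you can configure.

```python
def tokens_until_overflow(vram_gb: float, weights_gb: float,
                          kv_bytes_per_token: int, overhead_gb: float = 1.0) -> int:
    """Context tokens that fit in the VRAM left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token)


def messages_until_overflow(token_limit: int, start_tokens: int,
                            tokens_per_message: int) -> int:
    """Messages that fit after the character card and world info are loaded."""
    return max((token_limit - start_tokens) // tokens_per_message, 0)


# Hypothetical setup: 13 GB of weights, 200 kB of KV cache per token,
# a 2,000-token character card, and 200 tokens per chat message.
for vram in (16.0, 24.0):
    limit = tokens_until_overflow(vram, 13.0, 200_000)
    msgs = messages_until_overflow(limit, 2_000, 200)
    print(f"{vram:.0f} GB card: about {limit} tokens, roughly {msgs} messages")
```

Under these assumed numbers the same model that survives a few dozen messages on a 16GB card lasts several times longer on a 24GB card, which is the gap a short test prompt never reveals.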
Best value GPU for abliterated / roleplay models: used RTX 3090
The used NVIDIA RTX 3090 is, in 2026, the best-value GPU for running uncensored local LLMs — full stop. It has 24GB of GDDR6X, runs every CUDA-accelerated runtime with zero fuss, and trades on the second-hand market for a fraction of a new high-end card. For abliterated 22B–34B models at a good quant with long context, nothing else comes close on price-per-capability. (For the absolute floor on cost, see cheapest GPU for local AI; for this workload, the 3090 is the value pick, not the cheapest entry.)
Why the 3090 specifically and not a newer 16GB card:
- 24GB is the unlock. A new mid-tier card with 16GB is faster on paper but forces the model-or-memory compromise above. The 3090’s extra 8GB is worth more than raw speed for this use case.
- CUDA “just works.” Ollama, llama.cpp, KoboldCpp, and every quant format run on NVIDIA with no driver gymnastics (how to install Ollama).
- Used supply is healthy. It’s a previous flagship; gamers upgrade and dump them. Buy from a reputable source, and prefer cards with good cooling — the 3090 runs warm under sustained inference.
The newer RTX 4090 / 5090-class cards are faster and have more (or equal) VRAM, but you pay a steep premium for tokens-per-second you may not feel in a chat workload. They’re the right call only if you also game at the top tier or want to push toward 70B.
Used 3090 / RX 7900 XTX as 24GB value picks
If you prefer AMD or find a better local deal, the Radeon RX 7900 XTX is the other 24GB value champion. It matches the 3090’s memory and, with ROCm-supported runtimes, runs uncensored models perfectly well.
| | Used RTX 3090 | RX 7900 XTX |
|---|---|---|
| VRAM | 24 GB GDDR6X | 24 GB GDDR6 |
| Software path | CUDA — universal, zero friction | ROCm — supported, slightly more setup |
| New vs used | Best as used | Available new at a fair price |
| Best for | Anyone who wants it to just work | AMD-platform builders, new-card buyers |
Both give you the same headline number that matters — 24GB — which is the actual product you’re buying. The 3090 wins on software simplicity; the 7900 XTX wins if you want a new card with a warranty or you’re already on an AMD platform. (AMD owners: read AMD GPU for local LLM before you buy, because the ROCm setup details genuinely matter.)
NVIDIA vs AMD for this use case
For uncensored local AI specifically, here’s the honest trade:
- NVIDIA (CUDA) is the path of least resistance. Every guide, every runtime, every quant format assumes it. If you want to install Ollama, type `ollama run <model>`, and start chatting with zero troubleshooting, buy NVIDIA. The local LLM ecosystem is CUDA-first and probably will be for years.
- AMD (ROCm) has closed most of the gap. Mainstream runtimes support it, and a 7900 XTX delivers excellent tokens-per-second. The cost is occasional setup friction: driver/ROCm version matching, the rare tool that’s CUDA-only.
If this is your first local-AI build and you value “it works on the first try,” go NVIDIA. If you’re comfortable on Linux and already lean AMD, the 7900 XTX is a legitimately great card that won’t hold you back.
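On either vendor, the context window is something you request from the runtime, and it is worth setting deliberately rather than accepting a default. As a minimal sketch, assuming Ollama is running locally on its default port, the Python below builds a request for its `/api/generate` endpoint with a `num_ctx` option. The model name is a placeholder and the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your install listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 8192) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # requested context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("your-model-name", "Hello", num_ctx=16384)` requires a running server and a model you have already pulled. A larger `num_ctx` asks for more KV cache, so it only helps if the VRAM is there to back it.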
What each GPU unlocks for an uncensored companion
Translating VRAM into the experience you’ll actually have with a private, uncensored companion like Ember:
- 8GB: A real, private companion that runs 100% offline — but with a shorter memory and a smaller brain. A good entry, not the destination.
- 12–16GB: A capable 8B–13B companion that holds a coherent personality across a normal conversation. The practical floor for “this feels good.”
- 24GB (3090 / 7900 XTX): The target. A smart 22B–34B model with a long memory that doesn’t forget your scene mid-session. This is where local uncensored AI stops feeling like a tech demo and starts feeling like the real thing.
- 48GB (dual 24GB): A full 70B companion. Lovely, expensive, and overkill for most people.
The pattern is clear: every tier up buys you either a smarter companion or a longer-memoried one, and 24GB is the first tier where you stop compromising on either.
Buy verdict
- Best value, full stop: a used RTX 3090 (24GB). CUDA simplicity, the right amount of VRAM, and a price the second-hand market keeps reasonable.
- Best new / AMD pick: RX 7900 XTX (24GB) — same memory, new-card warranty, ROCm setup tax.
- First build, want zero friction: any NVIDIA card; stretch for 24GB if the budget allows, accept 16GB if it doesn’t.
- Don’t bother (for chat): paying flagship 4090/5090 prices purely for roleplay — the tokens-per-second gain is real but rarely felt in conversation; put that money into the extra VRAM instead.
Whatever you land on, buy for the session, not the prompt: the long roleplay is where uncensored models earn their keep, and 24GB is what keeps that session from falling apart. For the model side of the decision, pair this guide with the best uncensored local AI models and the broader uncensored local AI guide.
Once your card arrives, the fastest way to put that VRAM to work is a companion built to run entirely on your own hardware — no cloud, no logging, no refusals. Ember installs locally on top of Ollama and turns a 24GB card into a private, uncensored companion that’s yours alone, bought once and run forever.
