A 70B model is the largest thing most people will ever seriously try to run on hardware they own, and for uncensored companions it’s the holy grail: the prose gets richer, the character stays more consistent across a long scene, and the model stops “forgetting” what kind of conversation you’re having. The question is whether you can actually feed it. The honest answer is that the GPU to run a 70B model uncensored at home comes down to one number — VRAM — and a 70B at a usable 4-bit quant wants roughly 40-45GB before you’ve typed a word of context. That rules out a lot of hardware and rules in a surprisingly short list of real builds.
This page is the no-marketing version. Real VRAM math, real cards, real tokens-per-second ranges, and the cheapest path that doesn’t end in a model that types one sentence every ten seconds. If you only want the weight-side math in isolation, the companion piece VRAM requirements for 7B to 70B models has the full per-quant chart; here we’re answering the buying question.
Why 70B is the aspirational ceiling for home uncensored companions
For a stock assistant, 70B is overkill — a sharp 30B-class model answers your questions fine. The reason 70B is the aspirational ceiling for uncensored companions specifically is that the workload is the opposite of a one-shot Q&A. You run an abliterated or finetuned model for long, persistent, in-character sessions, and at that size the model holds tone, remembers the thread, and writes long without collapsing into repetition. Bigger base models also take refusal-removal more gracefully — they were more coherent to begin with, so an abliterated 70B keeps its “voice” where a small one gets dumber.
It’s a ceiling, not a requirement. A great 30B-class uncensored model on a 24GB card beats a 70B that’s crawling on offload every single time. But if you want the best companion experience that consumer silicon can deliver, 70B is the target — and it’s expensive enough that you should know exactly what it costs before you buy.
The VRAM target: ~40-45GB at Q4 and what that rules in/out
The math is simple and unforgiving. A model is parameters × bytes-per-param, plus overhead. At Q4_K_M (the 4-bit sweet spot), real-world cost is about 0.55 bytes/param, not a clean 0.5:
70 × 0.55 × 1.2 ≈ 46 GB (weights + base inference overhead)
Call it ~40-45GB for weights and overhead at short context, climbing toward and past 46GB once you add a real conversation. Then the KV cache — the per-token memory of your chat history — piles on top, adding roughly 2-4GB at 8K context and 10-20GB at 32K on a 70B. A genuinely comfortable 70B at long context lives in the 48-64GB band.
Here’s what that target rules in and out:
| Hardware | VRAM | 70B @ Q4 verdict |
|---|---|---|
| RTX 3060 / 4060 (8-12GB) | 8-12 GB | No — not even close |
| RTX 4080 / 5080 (16GB) | 16 GB | No — wrong tier entirely |
| RTX 3090 / 4090 (24GB) | 24 GB | Tight — only with heavy offload, slow |
| RTX 5090 (32GB) | 32 GB | Workable with low quant; comfy at Q3 |
| 2× RTX 3090 (48GB) | 48 GB | Yes — the classic home 70B build |
| Apple Silicon 64GB+ unified | ~48GB usable | Yes — slower, but it fits |
| Workstation 48GB (A6000-class) | 48 GB | Yes — expensive, single-slot |
The cliff is real: there is no single consumer card with enough VRAM to hold a 70B at Q4 plus long context entirely on-chip. The 5090’s 32GB gets close but still requires either aggressive quantization or a little offload. Everything genuinely comfortable involves either two GPUs or a big unified-memory Mac.
Single 24GB (3090/4090) at Q4: tight, with offload — real tok/s
A 24GB card is the workhorse of local AI, and it’s a phenomenal home for a 30B-class model. For a 70B at Q4, though, you’re ~22GB short on weights alone. The runtime — Ollama or llama.cpp — will happily load it anyway by offloading the overflow layers to system RAM and running them on the CPU. It works. It is also slow, because token generation is bandwidth-bound: a GPU moves data at ~1000 GB/s, dual-channel DDR system RAM at ~50-100 GB/s, and your speed is gated by the slowest tier any layer lives on.
In practice, a 70B at Q4 with maybe half its layers offloaded to a 24GB 3090 or 4090 lands somewhere around 2-5 tokens/second depending on your RAM speed and CPU — readable, but a noticeable wait, and it gets worse as context grows. For what counts as bearable, see tokens-per-second, explained; roughly, below ~5 tok/s a chat starts to feel like a slideshow.
You can claw some speed back by dropping to a Q3 quant so more layers fit on the GPU, but Q3 on a 70B introduces visible quality loss — and a sharp 30B at Q5 will both feel smarter and run many times faster. If you already own a single 24GB card, the honest move for daily use is to run the best 30B-class model it can hold fully in VRAM, not to chase a half-offloaded 70B. (The 3090’s dollar-per-capability for exactly this kind of work is why it’s such a strong used buy — see the used RTX 3090 value case.)
Single 32GB (5090): comfortable, fully in VRAM — real tok/s
The RTX 5090 changes the picture with 32GB of VRAM. That’s not enough for a 70B at Q4 fully in VRAM (you need ~46GB), but it’s enough to either:
- Run a 70B at Q3_K_M (~38GB) with a small offload and stay much faster than a 24GB card — think double-digit tokens/second when the offload is light, dropping as context fills.
- Run a 70B at Q4 with a modest CPU offload that’s far less painful than on 24GB, because a smaller fraction lives off-GPU.
The 5090’s other advantage is raw bandwidth and compute — it’s a genuinely fast card, so even the layers it does hold scream. For a single-card 70B experiment, 32GB is the first tier where it stops being a tech demo. It’s still a compromise (you’re either quantizing harder than ideal or offloading a little), but it’s a usable compromise.
Whether the 5090 is the right buy for this is a value question. It’s the best single consumer GPU for local AI, full stop, but you’re paying flagship money and still not getting a clean Q4 70B. If 70B specifically is the goal, two cheaper cards beat one expensive one on VRAM-per-dollar — which is the next build.
Dual 3090: the headroom build, power and slot-spacing realities
The enthusiast standard for a home 70B is two used RTX 3090s — 24GB + 24GB = 48GB of combined VRAM. Both Ollama and llama.cpp split a model’s layers across multiple GPUs automatically, so a 70B at Q4 (~46GB of weights) fits across the pair with room for context, and it runs fully on GPU — no CPU offload, no bandwidth cliff. Expect comfortably double-digit tokens/second on a dual-3090 rig running a 70B at Q4, which is the “real 70B feel” most people are chasing.
But two 3090s is a real systems-engineering job, not a drop-in:
- Power. Two 3090s can pull ~350W each. You want a 1000W+ PSU (1200W is safer with headroom), and you should consider power-limiting each card — capping a 3090 to ~280W costs only a few percent of inference speed and dramatically cuts heat and draw.
- Slot spacing. 3090s are triple-slot bricks. Two of them on a consumer motherboard usually means they’re physically jammed together, and the top card bakes. You either need a board with well-spaced PCIe slots, a case with vertical mounting, or PCIe riser cables to separate them.
- PCIe lanes. Inference is far less lane-sensitive than training — even running the second card at x4 barely dents tokens/second — so you don’t need a HEDT/Threadripper platform. A normal consumer board with two physical x16 slots is fine.
- Cooling and noise. Two cards under sustained inference is a space heater. Good case airflow is non-negotiable; blower-style 3090s help in tight builds.
Done right, dual-3090 is the cheapest path to a no-compromise 70B at home — and because the second card is also “just more VRAM,” it future-proofs you for bigger models and longer context. The best GPU for uncensored LLMs guide covers the single-card value picks; dual-3090 is the step past it when 70B is non-negotiable.
The Apple unified-memory alternative for 70B
There’s a non-NVIDIA path that quietly solves the VRAM problem: Apple Silicon with large unified memory. On an M-series Mac, the GPU can address most of system RAM, so a 64GB machine can hold a 70B at Q4 plus context, and a 96GB/128GB machine has room to spare — no second card, no PSU upgrade, no riser cables, and near-silent operation.
The trade-off is memory bandwidth. A 3090 pushes ~936 GB/s; even the high-end Mac chips top out lower (and the base/Pro tiers far lower), so a 70B on a Mac generally runs slower than a dual-3090 rig — often in the mid-single-digit to low-double-digit tokens/second range depending on the chip tier. It fits and it’s usable; it’s just not the fastest. The wins are simplicity, power efficiency, and that it’s a finished computer rather than a build. If you’re already a Mac person or want a quiet always-on box, it’s a legitimate 70B host — see Apple Silicon vs NVIDIA for local AI for the full comparison. Just make sure you buy enough unified memory up front; you can’t add more later.
Picking the uncensored 70B-class model to actually run
Hardware is half the answer — you also need a model worth feeding it. For uncensored use you’re choosing among open-weight 70B-class families and their finetunes:
- Llama 3.x 70B and its uncensored/abliterated derivatives are the most common 70B-class base — well-supported, widely finetuned, and the reference point most roleplay and companion finetunes build on.
- Qwen large models are the strong alternative, often sharper on reasoning; the Qwen vs Llama 3.3 comparison covers which suits which workload.
- Abliterated versions remove the refusal direction so the model stays in-character instead of moralizing — the mechanism is explained in abliterated models explained, and broader picks live in the best uncensored local AI models.
Two practical notes. First, prefer Q4_K_M or higher if your VRAM allows — uncensored roleplay is exactly where over-compression shows up as flatter, more repetitive prose, so the headroom of a dual-GPU or large-Mac build pays off in quality, not just speed. Second, don’t assume 70B is automatically smarter than a modern 30B-class model for your use; the gap has narrowed enormously, and a 30B at Q5 you can run fully in VRAM may genuinely be the better companion. Always download GGUFs from reputable sources (how to vet GGUF models).
Verdict: the cheapest honest path to an uncensored 70B at home
| Build | VRAM | 70B @ Q4 experience | Rough cost posture |
|---|---|---|---|
| Single 24GB (3090/4090) | 24 GB | ~2-5 tok/s with offload — tolerable, not great | Lowest, but compromised |
| Dual used RTX 3090 | 48 GB | Fully in VRAM, double-digit tok/s | Best value for real 70B |
| Single RTX 5090 | 32 GB | Q3 fully-ish in VRAM, or light Q4 offload | High; one-card simplicity |
| Apple Silicon 64GB+ | ~48GB usable | Fits, mid-single to low-double-digit tok/s | Premium, silent, simple |
| Workstation 48GB card | 48 GB | Clean Q4, single slot | Most expensive |
The honest verdict: the cheapest path to a real uncensored 70B at home is two used RTX 3090s — about 48GB of VRAM for far less than a single 48GB workstation card or a 64GB+ Mac, running the model fully on GPU at usable speed. Budget for a 1000W+ PSU and proper slot spacing and you have a no-compromise rig. A single 5090 is the best one-card option but still makes you quantize or offload for 70B; a single 24GB card can technically do it but you’ll live with the offload tax; and Apple Silicon is the simplest fits-in-one-box answer if you’re willing to trade speed for silence.
And the most honest verdict of all: most people don’t need 70B. A used 3090 and a strong 30B-class model is a smarter, cheaper, faster companion for the vast majority of users — the place to start before you commit to a two-GPU power bill.
Whichever build you land on, the GPU is only the engine — you still need software that turns it into a private, uncensored companion that never phones home. Ember installs locally on top of Ollama and runs entirely on your own hardware, so once you’ve sorted the VRAM, it’s the fastest way to put a 70B (or a sharp 30B) to work as a companion that’s bought once and yours forever.
