If you want to run real large language models on your own machine — not a 3B toy, but a genuine 32B-class uncensored model that holds a conversation, writes, and reasons — you do not need a $3,000 rig. You need one part chosen well and everything else chosen cheaply around it. For 2026, the best budget AI PC build for running LLMs lands around $900, and the whole thing pivots on a single used component: an NVIDIA RTX 3090 with 24GB of VRAM, bought second-hand for roughly $600–$750.

This is a buyer's guide. Below is an exact parts list, what it actually runs, the tokens-per-second you should expect, and how to scale to dual-3090 for 70B models later. No fluff, no affiliate padding, just the cheapest honest path to a machine that runs serious AI fully offline.

The ~$900 parts list (used 3090 core)

VRAM is the only spec that decides whether a model loads at all, so the GPU eats most of the budget and everything else is “good enough to feed it.” Here’s the core build:

| Part | Pick | Approx. price (used/new) |
| --- | --- | --- |
| GPU | RTX 3090 (24GB), used | $600–$750 (used) |
| CPU | Ryzen 5 5600 / 5700X (AM4) | $90–$130 (new) |
| Motherboard | B550 ATX | $90–$110 (new) |
| RAM | 32GB DDR4-3600 (2×16) | $50–$65 (new) |
| PSU | 850W 80+ Gold | $90–$110 (new) |
| Storage | 1TB NVMe Gen3/Gen4 SSD | $55–$75 (new) |
| Case | Mid-tower, good airflow | $55–$75 (new) |
| Total | | ~$880–$950 |

You can shave another $100+ if you buy a used case, used RAM, or a used PSU — but a used PSU is the one corner I would not cut, because it sits next to a 350W second-hand GPU. The rest is forgiving.

If you want the deep-dive comparison on why this card beats the alternatives at every price point, see our dedicated breakdown of the cheapest GPU for local AI. For the full system-level view across budget tiers, the local AI hardware guide covers the trade-offs end to end.

Why a used 3090 is the value king for 24GB

Here is the uncomfortable truth NVIDIA would rather you not dwell on: VRAM has barely gotten cheaper at the consumer tier. The 3090 shipped with 24GB in 2020. Years later, you still pay a steep premium for 24GB on any new card. That’s exactly why the second-hand 3090 is the value king — the depreciation curve handed budget builders a 24GB card at roughly half its launch price.

What 24GB buys you:

  • A 32B model at Q4_K_M quantization fits comfortably in VRAM with room for a usable context window.
  • You stay entirely on the GPU — no spilling layers into system RAM, which is where tokens-per-second falls off a cliff.
  • The 3090 has the memory bandwidth (around 936 GB/s) that actually drives inference speed. LLM generation is memory-bandwidth-bound, not compute-bound, and this is where the 3090 quietly outclasses many newer, cheaper, narrower cards.
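The "fits comfortably" claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement. It assumes roughly 4.85 bits per weight for Q4_K_M (an approximate average, since the format mixes block types) and an illustrative 32B architecture with 64 layers, 8 key/value heads, and a head dimension of 128. Those architecture numbers are assumptions, so substitute the values from your model's card.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Keys plus values for every layer and every cached token, in GB."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


VRAM_GB = 24  # RTX 3090

# Assumed: ~4.85 bits/weight for Q4_K_M, fp16 KV cache, 8192-token context.
weights = weights_gb(32, 4.85)
cache = kv_cache_gb(layers=64, kv_heads=8, head_dim=128, context_tokens=8192)
total = weights + cache

print(f"weights {weights:.1f} GB + KV cache {cache:.1f} GB "
      f"= {total:.1f} GB of {VRAM_GB} GB")
```

Under these assumptions the total lands a couple of GB under the card's 24GB. That gap is the headroom that compute buffers and the desktop environment eat into, and it is why longer contexts get tight.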

The newer RTX 4090 and 5090 are faster, but they cost two to four times as much for the same or only marginally more VRAM — terrible value if your goal is “run 32B locally for the least money.” The 3090 is the sweet spot. For a full rundown of what fits in 24GB and which card wins for unfiltered models, see the best GPU for uncensored LLMs and our guide to the best local LLM for 24GB VRAM.

One buying caveat: used 3090s were often mining or gaming cards. Buy from a seller who’ll let you test, prefer cards with fresh thermal pads or a reputable repaste, and run a stress test on arrival. A 3090 that throttles slightly under sustained load is usually still fine for inference, because LLM generation leans on memory bandwidth more than raw compute, but you want to confirm the VRAM is healthy.

CPU, RAM, PSU, case, storage picks

For pure LLM inference, the CPU barely matters — once a model is on the GPU, the processor mostly shuffles tokens. A 6-core Ryzen 5 5600 on the AM4 platform is plenty and dirt cheap. AM4 is a deliberate choice: the boards and chips are mature and inexpensive, and you don’t pay the DDR5 tax.

  • RAM: 32GB DDR4 is the floor. It’s not for holding the model (the GPU does that) — it’s for the OS, the model loader, and headroom if you ever offload a few layers. If you plan to dabble with 70B partially on CPU, go 64GB; RAM is cheap.
  • PSU: An 850W 80+ Gold unit comfortably feeds a single 3090 (which can spike hard) with margin to spare. If dual-3090 is in your future, buy 1000W–1200W now and save yourself a re-purchase.
  • Storage: A 1TB NVMe SSD is the realistic minimum — model files are large (a 32B GGUF at Q4 runs ~18–20GB, and you’ll collect several). Fast storage also means faster model load times. Consider 2TB if you’re a collector.
  • Case: Any mid-tower with honest airflow and clearance for a triple-slot 3090. The card is long and hot; prioritize a front mesh panel and room for the GPU over looks.
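To see why 850W is comfortable for one card and why two cards push you toward 1000W–1200W, here is a rough sustained-load estimate. It uses the 350W GPU figure mentioned earlier and the Ryzen 5 5600's 65W TDP. The 75W allowance for the board, RAM, SSD, and fans is an assumption, and the sketch deliberately ignores the short transient spikes a 3090 can produce, so treat the percentages as a floor on how hard the PSU is working.

```python
def sustained_load_watts(gpu_count: int, gpu_watts: int = 350,
                         cpu_watts: int = 65, other_watts: int = 75) -> int:
    """Rough steady-state draw; other_watts is an assumed allowance."""
    return gpu_count * gpu_watts + cpu_watts + other_watts


def load_fraction(psu_watts: int, gpu_count: int) -> float:
    """Share of the PSU's rated output used by the sustained load."""
    return sustained_load_watts(gpu_count) / psu_watts


for psu, gpus in [(850, 1), (850, 2), (1000, 2), (1200, 2)]:
    print(f"{gpus}x 3090 on {psu} W PSU: "
          f"{load_fraction(psu, gpus):.0%} of rated output")
```

With these inputs a single card leaves plenty of margin on 850W, while two cards nearly saturate it before any spike is counted.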

What this build runs: uncensored 32B-class models

This is the payoff. With 24GB on tap, you comfortably run the 32B-class open-weight models — the tier where local AI stops feeling like a demo and starts feeling like a tool. At Q4_K_M quantization, a 32B model fits in VRAM with usable context to spare.

Crucially, you control the model, which means you can run abliterated and uncensored variants that simply refuse nothing — no cloud filter sitting between you and the output, no logging, no terms-of-service kill switch. That’s the entire reason to build local instead of renting an API. For the category overview, see the best uncensored local AI models and our explainer on abliterated models, which covers exactly how the refusal behavior gets removed and what to expect.

Practically, on this rig you can run:

  • 8B–14B models at high speed, for near-instant chat and drafting.
  • 32B models at conversational speed — the headline capability of this build.
  • Quantized larger models with some compromise, if you’re willing to trade context or speed.

It all runs through the local loopback API at 127.0.0.1:11434 — nothing leaves the machine.
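Talking to that loopback API from your own code takes only the standard library. The following is a minimal sketch against Ollama's `/api/generate` endpoint with streaming turned off. The model name is a placeholder for whatever you have pulled, and the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Building the request does not touch the network, so this is safe to run anywhere.
req = build_request("your-model-name", "Say hello in five words.")
print(req.get_method(), req.full_url)
```

Once the server is up and a model is pulled, `generate("your-model-name", "...")` returns the completion as a string. The request only ever goes to 127.0.0.1.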

Tok/s expectations

Honest numbers matter, so here’s the realistic shape of it. On a single 3090, with the model fully in VRAM:

| Model size | Quant | Realistic generation speed |
| --- | --- | --- |
| 7B–8B | Q4_K_M | very fast, well above reading speed |
| 13B–14B | Q4_K_M | fast, comfortably above reading speed |
| 32B-class | Q4_K_M | usable conversational speed (single digits to low-teens tok/s) |
| 70B | Q4 (offloaded) | slow on a single card, patience required |

The number that matters is “is it faster than I read?” Average reading speed is roughly 5–8 tokens/second, so anything above that feels live. A 32B model on a 3090 lands in usable territory for back-and-forth chat; it isn’t instant, but it doesn’t feel like waiting on a page to load either. For a deeper treatment of what counts as “fast enough,” see our guide on usable tokens per second.

The killer rule: the moment a model doesn’t fit in VRAM and spills to system RAM, speed collapses. That’s why 24GB matters and why “just buy more RAM” is not a substitute for VRAM.
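The collapse follows directly from bandwidth. Generating one token means reading roughly every weight once, so the time per token is about the bytes read divided by the bandwidth of wherever those bytes live. The sketch below computes that theoretical ceiling. It uses the 3090's ~936 GB/s, the peak dual-channel DDR4-3600 figure (3600 MT/s × 8 bytes × 2 channels), and assumed model sizes of about 19.4GB for a 32B Q4 model and 42.4GB for a 70B one. It ignores compute, PCIe transfers, and cache reads, so real speeds land below these numbers.

```python
GPU_BW_GBS = 936.0                  # RTX 3090 memory bandwidth
RAM_BW_GBS = 3600 * 8 * 2 / 1000    # dual-channel DDR4-3600 peak, 57.6 GB/s


def seconds_per_token(gpu_gb: float, ram_gb: float = 0.0) -> float:
    """Time to stream every weight once from VRAM and from system RAM."""
    return gpu_gb / GPU_BW_GBS + ram_gb / RAM_BW_GBS


def ceiling_tok_s(gpu_gb: float, ram_gb: float = 0.0) -> float:
    """Upper bound on generation speed implied by memory bandwidth alone."""
    return 1.0 / seconds_per_token(gpu_gb, ram_gb)


# Assumed sizes: ~19.4 GB for 32B at Q4, ~42.4 GB for 70B at Q4.
in_vram = ceiling_tok_s(19.4)                 # 32B fully on the GPU
spilled = ceiling_tok_s(22.0, 42.4 - 22.0)    # 70B with ~20 GB left in RAM

print(f"32B fully in VRAM: ceiling ~{in_vram:.0f} tok/s")
print(f"70B partly in RAM: ceiling ~{spilled:.1f} tok/s")
```

Even as an optimistic bound, the spilled case drops below 3 tok/s: the roughly 20GB sitting in system RAM costs far more time per token than the 22GB on the card.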

Scaling up: dual-3090 for 70B

The cheapest way to run a 70B model locally is two used 3090s. A 70B model at Q4 needs roughly 40–48GB of VRAM to run entirely on-GPU — out of reach for one card, but two 3090s give you 48GB combined, and modern loaders split the model across both cards automatically.
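Here is a rough sketch of what an even layer split looks like on two 24GB cards. Real loaders use their own placement logic and also reserve compute buffers, so this only illustrates the arithmetic. The inputs are assumptions: about 42.4GB of Q4 weights, an 80-layer Llama-style 70B architecture, and about 1.34GB of fp16 KV cache for a 4096-token context.

```python
def split_layers(n_layers: int, n_gpus: int) -> list[int]:
    """Distribute layers as evenly as possible across the GPUs."""
    base, extra = divmod(n_layers, n_gpus)
    return [base + (1 if i < extra else 0) for i in range(n_gpus)]


def per_gpu_gb(weights_gb: float, kv_cache_gb: float,
               n_layers: int, n_gpus: int) -> list[float]:
    """Memory per GPU if weights and KV cache scale evenly with layer count."""
    per_layer = (weights_gb + kv_cache_gb) / n_layers
    return [count * per_layer for count in split_layers(n_layers, n_gpus)]


CARD_VRAM_GB = 24

# Assumed: 70B at Q4 (~42.4 GB), 80 layers, ~1.34 GB KV cache at 4096 tokens.
usage = per_gpu_gb(weights_gb=42.4, kv_cache_gb=1.34, n_layers=80, n_gpus=2)

for i, gb in enumerate(usage):
    print(f"GPU {i}: ~{gb:.1f} GB of {CARD_VRAM_GB} GB")
```

Under these assumptions each card carries just under 22GB. The model fits, but with little room to grow the context.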

What changes when you go dual:

  • PSU: you now want 1000–1200W — this is why buying it up front pays off.
  • Motherboard + case: you need physical clearance and PCIe lanes for two triple-slot cards. A more open case and an ATX board with spaced x16 slots matter here.
  • Cost: roughly +$600–$750 for the second card (the same used-3090 range as the first), landing the full 70B-capable build near $1,600, still a fraction of a single new flagship GPU.

This is the genuinely cheapest path to local 70B, and it’s why so many home AI builders converge on the dual-3090 layout. For the VRAM math behind the 70B tier specifically, see how much VRAM a 70B model needs.

Assembly and first-boot local-AI setup

Assembly is a standard PC build: seat the CPU and RAM, mount the board, install the GPU, connect the PSU’s PCIe cables (use a separate cable for each power connector on a 3090, not a single daisy-chained one), and boot. Install your OS of choice (Linux is excellent for this and squeezes a bit more out of the hardware, but Windows works fine).

Then the AI part is genuinely two commands. On Linux, install Ollama with the official script (on Windows, use the installer from the Ollama site instead):

curl -fsSL https://ollama.com/install.sh | sh

Run a model, which also downloads it the first time. Start with something small to confirm the GPU is being used, then step up to a 32B:

ollama run <model>

That’s it. Ollama serves a local API at http://127.0.0.1:11434. By default it listens only on the loopback address, so other machines on your network cannot reach it unless you deliberately expose it, and your conversations stay on your disk. If you want the full walkthrough with troubleshooting, our guides on how to run AI locally and how to install Ollama cover every step, and Ollama vs LM Studio vs Jan helps you pick a front end if you’d rather have a GUI.

To confirm the model loaded onto the GPU rather than the CPU, check VRAM usage (nvidia-smi) while a prompt is generating — you should see the card filling up and the tokens flowing.
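You can also ask Ollama itself how much of a loaded model sits in VRAM. The sketch below reads the `/api/ps` endpoint and compares the `size` and `size_vram` fields reported for each running model. Those field names reflect my reading of Ollama's API documentation, so check them against your installed version. The helper names and the sample entry are invented for illustration.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model's bytes that are resident in VRAM."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


def running_models(base_url: str = "http://127.0.0.1:11434") -> list:
    """Fetch the currently loaded models (needs Ollama running)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.loads(resp.read()).get("models", [])


def report(models: list) -> list[str]:
    """One human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM" for m in models]


# Invented sample entry, so the sketch runs without a server.
sample = [{"name": "example-model",
           "size": 20_000_000_000, "size_vram": 20_000_000_000}]
print(report(sample))
```

On the real machine, call `report(running_models())` while a prompt is generating. Anything below 100% means some layers have spilled to system RAM.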

From rig to companion

A parts list gets you a machine; it doesn’t get you something to talk to. Once your 3090 box is humming and Ollama is serving an uncensored 32B model on 127.0.0.1:11434, the natural next step is a real interface — persistent memory, a personality that sticks, voice, the works — instead of a bare terminal prompt. That’s exactly the layer Ember provides: a private, uncensored AI companion that runs entirely on the hardware you just built, with nothing routed to the cloud. You bought the rig to own your AI outright — Ember is what turns it into a companion worth keeping.