You don’t need a data center, a subscription, or anyone’s permission to run a capable AI. If you have a reasonably modern computer, you can run one entirely on your own machine — offline, private, and yours. No prompts leave your hardware. Nothing is logged. Nothing is “filtered for safety” by a company you’ve never met.

This guide gets you from zero to a working local AI in about fifteen minutes, then helps you pick the right model for the hardware you actually have.

Why run AI locally at all?

Cloud AI (ChatGPT, Claude, Gemini) is convenient, but you’re renting it, and the rent comes with terms:

  • Your conversations are stored, and on most consumer tiers they can be used to train future models.
  • The model is filtered — not just for genuinely harmful things, but for whatever the provider decides is off-limits this quarter.
  • It can change or vanish — prices rise, models get “updated” into something you liked less, accounts get suspended.

Local AI flips all of that. The weights sit on your disk. The inference happens on your CPU or GPU. Turn off your Wi-Fi and it still works. That’s the whole pitch: private by construction, not by policy.

What you need

Local models come in sizes, measured in billions of parameters (e.g. “7B”, “8B”, “70B”). Bigger is smarter but needs more memory. The single biggest factor is how much RAM (or VRAM, if you have a dedicated GPU) you have.

Your hardware                    | Realistic model size | What it feels like
8 GB RAM, no GPU                 | 1B–3B (quantized)    | Fast, fine for simple tasks
16 GB RAM, or 6–8 GB GPU         | 7B–9B                | The sweet spot: genuinely useful
24 GB+ GPU (e.g. RTX 3090/4090)  | 14B–32B              | Excellent, near-cloud quality
64 GB+ RAM / multi-GPU           | 70B+                 | Frontier-class, slower on CPU

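An NVIDIA GPU or an Apple Silicon Mac (which uses its built-in GPU) makes everything dramatically faster, but you can run small models on a plain laptop CPU. Quantization compresses the model's weights, typically to about 4 bits each (you'll see tags like Q4_K_M). Compared with the original 16-bit weights, that cuts memory to roughly a quarter to a third, or about half of an 8-bit build, for a small quality cost. It's almost always worth it.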
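As a back-of-the-envelope check, weight memory is roughly parameters × bits per weight ÷ 8. The sketch below is an estimate, not a benchmark: the function name estimate_model_gb and the 20% allowance for context and runtime overhead are assumptions made for illustration, and real usage varies with context length and the exact quantization format.

```shell
# Rough memory estimate for a model, in GB.
# Usage: estimate_model_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Assumption: 20% extra on top of the weights for context and runtime overhead.
estimate_model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_model_gb 8 16   # 8B model at 16-bit: prints 19.2
estimate_model_gb 8 4    # same model at 4-bit: prints 4.8
```

By this estimate an 8B model drops from about 19 GB to under 5 GB when quantized to 4-bit, which is why it fits comfortably on a 16 GB machine.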

Step 1 — Install Ollama (the easy path)

Ollama is the simplest way in. It’s free, open-source, and runs on macOS, Linux, and Windows.

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: download the installer from ollama.com and run it.

That’s the whole install. Ollama runs quietly in the background and exposes a local API on 127.0.0.1:11434. Note the address: it’s loopback, so by default only programs on your own machine can reach it.
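Once a model has been downloaded (see Step 2), that local API can be called from any script. The following is a minimal sketch using Ollama's /api/generate endpoint with streaming turned off. The helper names json_escape, build_payload, and ask_local are made up for this example, and the escaping only handles backslashes and double quotes, so treat it as a starting point rather than a robust client.

```shell
OLLAMA_URL="http://127.0.0.1:11434"

# Escape backslashes and double quotes so text is safe inside a JSON string.
# (Newlines and control characters are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON request body for a single, non-streaming completion.
# Usage: build_payload MODEL PROMPT
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send the prompt to the local server and print the raw JSON reply.
# Usage: ask_local MODEL PROMPT
ask_local() {
  curl -sS "$OLLAMA_URL/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2")"
}
```

For example, ask_local llama3.2 "Why is the sky blue?" should return a JSON object whose response field holds the answer. The request never leaves 127.0.0.1.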

Step 2 — Run your first model

Pick a model that fits your hardware from the table above. A safe first pick on almost any machine, including one with 16 GB of RAM, is Llama 3.2, whose default tag is the small 3B build:

ollama run llama3.2

The first run downloads the model (a few gigabytes — this is the one time it touches the internet). Then you get a prompt. Type a question. You’re now talking to an AI that runs 100% on your computer.
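If you would rather fetch the model ahead of time, say on a fast connection before going offline, ollama pull downloads it without starting a chat, and ollama list shows what is already on disk. The sketch below wraps those two commands. The function names model_installed and ensure_model are invented for this example, and it assumes ollama list prints a header row followed by one model per line with the name in the first column.

```shell
# Succeeds if the named model is already on disk.
# Ollama lists untagged names as NAME:latest, so add the tag when missing.
model_installed() {
  name="$1"
  case "$name" in
    *:*) ;;
    *) name="$name:latest" ;;
  esac
  # Skip the header row, then look for an exact match in the first column.
  ollama list | awk -v m="$name" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Download the model only if it is not installed yet.
# Usage: ensure_model llama3.2
ensure_model() {
  model_installed "$1" || ollama pull "$1"
}
```

After ensure_model llama3.2 has succeeded once, ollama run llama3.2 should start without needing a network connection.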

Some solid starting points by size:

  • Small / fast: ollama run llama3.2:3b or ollama run gemma2:2b
  • The sweet spot: ollama run llama3.1:8b or ollama run qwen2.5:7b
  • If you have a big GPU: ollama run qwen2.5:32b
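The choice above can be written down as a small helper. This is only a sketch of the rule of thumb from the hardware table: the function name suggest_model and the exact thresholds are illustrative assumptions, and the tags are simply the ones listed in this guide.

```shell
# Suggest a starting model tag from memory sizes given in whole GB.
# Usage: suggest_model RAM_GB [VRAM_GB]   (VRAM defaults to 0, meaning no GPU)
suggest_model() {
  ram_gb="$1"
  vram_gb="${2:-0}"
  if [ "$vram_gb" -ge 24 ]; then
    echo "qwen2.5:32b"      # big GPU tier
  elif [ "$ram_gb" -ge 16 ] || [ "$vram_gb" -ge 6 ]; then
    echo "llama3.1:8b"      # the sweet spot
  else
    echo "llama3.2:3b"      # small and fast
  fi
}

suggest_model 16        # prints llama3.1:8b
suggest_model 8         # prints llama3.2:3b
```

It can be combined with the run command, for example ollama run "$(suggest_model 16)".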

To leave the chat, type /bye.

Step 3 — Give it a real interface (optional)

The terminal is fine, but most people want a chat window. Two easy options:

  • LM Studio — a polished desktop app; download models with a click, chat in a clean UI. Great for non-terminal folks.
  • Open WebUI — a self-hosted, ChatGPT-style web interface that connects to Ollama. A little more setup, very capable.

Both keep everything local. Nothing changes about the privacy story — they’re just nicer front doors to the same local model.

“Uncensored” vs. “abliterated” models

You’ll quickly notice the default models still refuse things, because they carry the same alignment training as their cloud cousins. If you want a model that doesn’t lecture or refuse, look for community fine-tunes tagged “uncensored” or “abliterated” (a technique that finds the direction in the model’s internal activations associated with refusals and removes it, usually with little loss of competence).

These run exactly the same way: pass the variant’s name to ollama run and it works like any other model. Because it’s on your machine, you set the boundaries, not a corporate policy team. We cover the best current ones in our uncensored models guide.

Common gotchas

  • It’s slow on CPU. Expected. Either use a smaller/quantized model or add a GPU. On Apple Silicon, Ollama uses the GPU automatically.
  • “Out of memory.” The model is too big for your RAM/VRAM. Drop to a smaller size or a heavier quantization (Q4 instead of Q8).
  • First response lags. The model is loading into memory; subsequent replies are faster.
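To put a number on "slow", the JSON reply from Ollama's /api/generate endpoint includes timing fields: eval_count (tokens generated), eval_duration, and load_duration, with durations in nanoseconds. The helper below is a small sketch that turns the first two into tokens per second. The name tokens_per_sec is made up for this example, and the commented pipeline assumes jq is installed.

```shell
# Generation speed in tokens per second.
# Usage: tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n * 1000000000 / ns }'
}

# Example with real numbers from a running server (assumes jq is installed):
#   curl -sS http://127.0.0.1:11434/api/generate \
#     -d '{"model":"llama3.2","prompt":"Hello","stream":false}' |
#     jq -r '"\(.eval_count) \(.eval_duration)"'
# then pass the two printed values to tokens_per_sec.

tokens_per_sec 120 4000000000   # 120 tokens in 4 seconds: prints 30.0
```

A large load_duration on the first request and a small one afterwards is the loading lag described above. Running ollama run with the --verbose flag prints similar timing statistics after each reply.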

Where this leads

Running a model in a terminal is the foundation. The next step most people want is an AI that remembers them, talks, and feels like a presence rather than a command line — without giving up the local, private nature that made you leave the cloud in the first place.

That’s a harder build than ollama run, and it’s exactly the gap a few local-first apps now fill.