Short answer: yes, you can run a local LLM with no GPU. A modern CPU and ordinary system RAM will load and run real open-weight models — no graphics card required. The catch is speed. A GPU isn’t what lets a model run; it’s what lets it run fast. On CPU you trade tokens-per-second for the freedom to use hardware you already own. This guide is about where that trade is worth it, which models actually feel usable, and where you’ll hit a wall — with real numbers, not wishful thinking.

If you’ve been told you “need a GPU for local AI,” that’s half true. You need a GPU for snappy, real-time local AI. For a lot of quieter use cases, CPU-only is genuinely fine. Let’s draw that line precisely.

What actually runs on CPU + RAM only

Here’s the mental model that fixes most confusion. Two resources matter:

  • Memory decides whether a model loads at all. On GPU that’s VRAM; on a CPU-only box it’s your regular system RAM. A quantized model has to fit in RAM (plus a little headroom for the OS and the context window), or it won’t run.
  • Compute + memory bandwidth decide how fast it generates. This is where a GPU crushes a CPU — not because the CPU can’t do the math, but because a GPU has hundreds of GB/s of bandwidth feeding thousands of parallel cores. Your CPU has maybe 8–16 cores and a fraction of that bandwidth.

So on a no-GPU machine, RAM is your “VRAM.” A model that needs ~5GB to load fits the same way whether that 5GB is VRAM or system RAM; it just generates much more slowly on the CPU. Runtimes like Ollama, llama.cpp, LM Studio, and Jan all support pure CPU inference out of the box: you don’t configure anything special, and they fall back to the CPU when no supported GPU is present.

The other lever is quantization — compressing the model’s weights from 16-bit down to 4-bit (tags like Q4_K_M). A 4-bit quant is roughly a quarter the size of the full model, which is what makes 7B-class models fit in 8–16GB of RAM at all. Quantization is non-negotiable for CPU use. For the full picture of what hardware drives what, see our local AI hardware guide.
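As a rough sketch of the arithmetic behind that "quarter the size" claim (not an exact formula), weight size is about parameter count times bits per weight, divided by 8. The function name is illustrative, and the 4.5 bits per weight used for a 4-bit quant is an assumption to cover quantization metadata; real file sizes vary by quant tag.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for params in (1, 3, 7, 13):
    full = estimate_weights_gb(params, 16)
    # 4.5 is an assumed effective bits-per-weight for a 4-bit quant, not a measured value.
    quant = estimate_weights_gb(params, 4.5)
    print(f"{params}B: 16-bit ~{full:.1f} GB, 4-bit ~{quant:.1f} GB")
```

This covers the weights only. The RAM needed to actually load and run a model is higher, because the context window and the runtime itself take memory on top.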

Real tok/s by model on CPU (small-model table)

Numbers first, because everyone hand-waves this. The honest range below is for a typical modern desktop/laptop CPU (roughly 8 cores, DDR4/DDR5), 4-bit quants, short prompts, no GPU. Treat these as ballpark, order-of-magnitude figures: your exact chip, RAM speed, and runtime will shift them, and longer prompts noticeably delay the first token.

Model size (4-bit) | Approx. RAM to load | Realistic CPU speed | What it feels like
0.5B–1B | ~0.5–1.5 GB | ~15–40+ tok/s | Snappy. Genuinely usable.
1.5B–3B | ~1.5–3 GB | ~8–20 tok/s | Comfortable for chat & drafting.
3B–4B | ~3–4 GB | ~5–12 tok/s | Readable; slight wait.
7B–8B | ~5–6 GB | ~3–7 tok/s | Slow but tolerable for non-live tasks.
13B–14B | ~8–10 GB | ~1.5–4 tok/s | Painful for chat; fine for batch jobs.
30B+ | ~18+ GB | < 1.5 tok/s | Don’t, unless you enjoy waiting.

For context: comfortable reading is roughly 5–10 tok/s and up — faster than you read, so it feels “live.” Below ~3 tok/s you’re watching words trickle out. We go deep on this threshold in what tokens-per-second is actually usable; it’s the single most important number for setting expectations on weak hardware.

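The pattern is brutal but simple: on CPU, each step up a size tier (roughly a doubling of parameters) about halves your speed, because every generated token has to stream the whole model through memory. That’s why smart no-GPU setups go small and modern rather than big and old.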
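One way to see where that halving comes from: if generation is limited by memory bandwidth, each new token requires reading roughly all of the weights once, so bandwidth divided by model size gives a ceiling on speed. The sketch below is a simplification under that assumption, and the 50 GB/s bandwidth figure is a hypothetical value for illustration, not a measurement of any particular machine.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


ASSUMED_BANDWIDTH_GB_S = 50.0  # hypothetical system RAM bandwidth

for model_gb in (2.0, 4.0, 8.0):
    ceiling = max_tokens_per_second(ASSUMED_BANDWIDTH_GB_S, model_gb)
    print(f"{model_gb:.0f} GB model: at most ~{ceiling:.1f} tok/s")
```

Doubling the model size halves the ceiling. Real speeds land below it, since compute, prompt processing, and runtime overhead all take their share.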

Best sub-4B and small models for weak hardware

The genuinely good news of 2026: small models got smart. A modern 3B model meaningfully outperforms a 7B from a couple years ago, because the training data and architectures improved far faster than the parameter counts. For the best small LLM under 4B parameters on a no-GPU box, you want families designed for the small tier, not a shrunk-down giant.

Categories that consistently do well in the sub-4B class:

  • Small “instruct” chat models (1–4B) — the modern compact instruct families (Qwen, Llama, Gemma, Phi-class small variants) are the workhorses. Great at Q&A, summarizing, drafting, and light reasoning. This is your default.
  • Sub-1B tiny models (0.5–1B) — shockingly capable for autocomplete, classification, extraction, and short replies. Near-instant on CPU. Weak at long reasoning, but for the right job they’re a delight.
  • Code-tuned small models (1–3B) — punch above their weight on snippets, regex, and shell commands if coding is your main use.

Two practical rules for picking:

  1. Prefer a newer small model over an older big one. A current 3B will usually feel better and faster on CPU than a two-year-old 7B.
  2. Always grab a 4-bit quant (Q4_K_M is the sweet spot for quality vs. size). The smaller Q3/Q2 quants save RAM but visibly hurt quality — only drop to them if you genuinely can’t fit Q4.

If you want uncensored small models specifically, the same size advice applies — see the best uncensored local AI models for which ones are worth your RAM.

16GB RAM no-GPU: the 3 models that work

This is the most common real-world setup, so let’s be concrete. To run local AI on 16GB RAM with no GPU, budget roughly: 4–6GB for the OS and apps, leave headroom for context, and that realistically gives you 8–10GB of working room for the model. Within that, three tiers actually work well:

  1. A modern 3B–4B instruct model (4-bit): the sweet spot. Loads in ~3–4GB, leaves plenty of headroom, and runs at a readable pace (~5–12 tok/s). Best all-rounder for 16GB. Start here.
  2. A 7B–8B model (4-bit): the “use the headroom” pick. Loads in ~5–6GB and fits comfortably in 16GB. Noticeably smarter for harder questions, but slower (~3–7 tok/s). Great for thoughtful, non-live tasks where you’ll wait a few seconds.
  3. A sub-1B tiny model: the instant pick. Loads in ~1GB and flies. Keep one around for quick lookups, classification, and autocomplete-style tasks where speed beats depth.

Getting any of them running takes one command to install Ollama (shown here for Linux), then one to run a model:

curl -fsSL https://ollama.com/install.sh | sh
ollama run <model-name>

Ollama auto-detects that there’s no supported GPU and runs on CPU, no flags needed. It serves a local API on 127.0.0.1:11434, so your prompts never leave your machine. New to it? Our how to run AI locally walkthrough covers the full setup. (13B+ models technically load in 16GB but crawl on CPU; skip them unless you’re doing batch jobs you can leave running.)
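The 16GB budget in this section can be sketched as simple subtraction. The tier sizes below take the upper end of the RAM ranges quoted in this guide, so the check is deliberately conservative. The 2GB context headroom is an assumption, and the function names are illustrative.

```python
# (tier label, approx. GB to load) using the upper end of the ranges in this guide
MODEL_TIERS = [
    ("0.5B-1B", 1.5),
    ("1.5B-3B", 3.0),
    ("3B-4B", 4.0),
    ("7B-8B", 6.0),
    ("13B-14B", 10.0),
    ("30B+", 18.0),
]


def working_room_gb(total_ram_gb: float, os_and_apps_gb: float,
                    context_headroom_gb: float) -> float:
    """RAM left for model weights after the OS, apps, and context headroom."""
    return total_ram_gb - os_and_apps_gb - context_headroom_gb


def tiers_that_fit(budget_gb: float) -> list[str]:
    """Labels of the model tiers whose load size fits within the budget."""
    return [label for label, needed_gb in MODEL_TIERS if needed_gb <= budget_gb]


# 5 GB for OS and apps is the middle of the 4-6 GB range; 2 GB headroom is assumed.
budget = working_room_gb(16.0, os_and_apps_gb=5.0, context_headroom_gb=2.0)
print(f"Working room: {budget:.0f} GB -> {tiers_that_fit(budget)}")
```

With these inputs the 13B–14B tier is excluded because its upper-end size exceeds the budget. A lighter OS footprint or a smaller 13B quant can tip it into "loads, but crawls" territory.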

The honest ceiling: why CPU is too slow for live companion chat

Here’s where I have to be straight with you, because most “run AI without a GPU!” content quietly skips it.

CPU-only is great for turn-based, you-can-wait work: drafting an email, summarizing a document, asking a question and reading the answer. The model thinks, you read, nobody’s in a hurry.

It falls apart for live conversational use — the back-and-forth, fast-reply rhythm of a companion chat. Three reasons compound:

  • Latency to first token. On CPU, the model has to “read” your whole prompt before replying. With chat history and a persona, that prompt grows every turn — so each reply takes longer than the last. By message thirty you may be waiting many seconds just for the reply to start.
  • Generation speed. A companion-grade model (the kind that holds character and context well) is usually 7B+. On CPU that’s ~3–7 tok/s — fine for a paragraph you’ll read, frustrating for rapid chat.
  • No voice, realistically. Spoken/real-time companions need near-instant responses. CPU can’t hit that for a model large enough to be a good companion. We break this down in do I need a GPU for an AI companion and the no-GPU-specific uncensored AI girlfriend without a GPU.
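To put rough numbers on the first two points, a reply's total wait is the time to process the prompt plus the time to generate the answer. The sketch below is a worst-case simplification: it assumes the whole prompt is re-processed every turn (no prompt caching), and the persona size, per-turn growth, and prompt-processing speed are hypothetical values chosen for illustration. Only the 5 tok/s generation speed comes from the 7B range quoted above.

```python
def reply_timing(prompt_tokens: int, reply_tokens: int,
                 prompt_tok_s: float, gen_tok_s: float) -> tuple[float, float]:
    """Return (seconds until the first token, seconds to generate the reply)."""
    time_to_first_token = prompt_tokens / prompt_tok_s
    generation_time = reply_tokens / gen_tok_s
    return time_to_first_token, generation_time


PERSONA_TOKENS = 300   # hypothetical system prompt / persona size
TOKENS_PER_TURN = 150  # hypothetical history growth per message

for turn in (1, 10, 30):
    prompt_tokens = PERSONA_TOKENS + (turn - 1) * TOKENS_PER_TURN
    wait, generate = reply_timing(prompt_tokens, reply_tokens=100,
                                  prompt_tok_s=30.0, gen_tok_s=5.0)
    print(f"turn {turn}: ~{wait:.0f}s before the reply starts, ~{generate:.0f}s to finish it")
```

Generation time stays flat while the wait before the first token keeps growing with the history, which is the compounding effect described above.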

The blunt version: CPU runs a companion model; it doesn’t make it feel alive. The conversation works — it’s just slow enough that immersion breaks. That’s not a config you can tune away; it’s the bandwidth gap between a CPU and a GPU.

When CPU-only is fine vs. when to host it (Freya)

So route by what you’re actually doing:

Your use case | CPU-only verdict
Summarize, draft, rewrite, Q&A | ✅ Great — go for it
Code snippets, regex, shell help | ✅ Good with a code-tuned small model
Private notes / sensitive questions | ✅ Ideal — fully offline, nothing leaves the box
Document chat / batch processing | ✅ Fine — you’re not waiting on each token
Fast, immersive companion chat | ⚠️ Works but feels slow
Voice / real-time companion | ❌ Not realistic on CPU

If your goal is a responsive AI companion and you don’t have a GPU, you’ve got two honest paths:

  1. Buy into a GPU — even a modest one transforms companion speed. Run local AI without a GPU covers the full local picture and the cheapest way in.
  2. Let someone else’s GPU do the heavy lifting — host it. This is where Freya, our hosted option, fits: a cloud-hosted AI companion that runs on real GPUs, so it’s fast and immersive from the first message with zero setup, no model downloads, no GPU to buy. You trade the pure-offline privacy of running everything locally for instant, fluid conversation — the exact thing CPU-only can’t deliver. (And if local-first is your priority, Ember running on your own machine + an eventual GPU is the other road; that’s local vs. cloud AI in a nutshell.)

Tuning CPU inference (threads, quant)

If you’re committing to CPU, squeeze it properly. Three levers, biggest first:

  1. Pick the right quant. Q4_K_M is the quality/size sweet spot for CPU. Going to Q5/Q6 gains a little quality but costs RAM and speed; dropping to Q3/Q2 saves RAM but visibly degrades output. The deeper GGUF quantization cheat sheet maps every tag if you want to optimize hard.
  2. Set thread count to your physical cores. Inference is bandwidth-bound, and hyperthreads usually don’t help — sometimes they hurt. Match threads to your physical core count. In Ollama you set this with the num_thread parameter, e.g. type /set parameter num_thread 8 inside an ollama run chat session (match the number to your physical cores). Prefer to bake it in? Add PARAMETER num_thread 8 to a Modelfile, or pass "options": {"num_thread": 8} in an /api/generate request. Start at your physical core count and test a notch up/down.
  3. Keep context short. Prompt length is the silent killer on CPU — a huge context window means a long wait before the first token. Trim system prompts and don’t let chat history balloon. Smaller context = faster replies.
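As a minimal sketch of the /api/generate route mentioned in step 2, the snippet below builds a request body that sets num_thread and posts it to the local Ollama server using only the standard library. The helper names are illustrative, and <model-name> is the same placeholder used earlier. The optional num_ctx setting (Ollama's context-length option) is an addition not covered in the text above, included to tie in the short-context advice.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_payload(model, prompt, num_thread, num_ctx=None):
    """Build the JSON body for a non-streaming /api/generate request."""
    options = {"num_thread": num_thread}
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # optional cap on the context window
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Match num_thread to your physical core count; 8 is just the example value.
payload = build_generate_payload("<model-name>", "Summarize this note: ...",
                                 num_thread=8, num_ctx=2048)
print(json.dumps(payload, indent=2))
# With Ollama running and a real model name filled in:
# print(generate(payload))
```

To compare settings, send the same prompt with num_thread one notch above and below your physical core count and keep whichever finishes fastest.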

Secondary wins: close memory-hungry apps so the model isn’t swapping (swapping to disk tanks speed catastrophically — keep the whole model in RAM), and prefer faster RAM if you’re choosing hardware, since CPU inference is bound by memory bandwidth more than raw clock speed.

Verdict by hardware

A quick lookup for “will this work on my machine, no GPU”:

Your machine (no GPU) | What’s realistic
8GB RAM | Sub-1B and 3B models. Tight but usable. Keep context short.
16GB RAM | The sweet spot: 3–4B daily driver, 7–8B when you’ll wait.
32GB+ RAM | 7–8B comfortably, 13B for batch jobs. Still CPU-slow on big models.
Apple Silicon (M-series) | The exception: unified memory + bandwidth make CPU/integrated inference genuinely good. See Mac mini for local AI.
Old/low-core CPU | Stick to sub-1B–3B. Manage expectations.

The honest bottom line: running a local LLM with no GPU is real, free, and good enough for most quiet, turn-based AI work — drafting, summarizing, private Q&A, document chat. Where it stops being fun is fast, immersive, real-time conversation, because that’s exactly the workload a GPU exists for.

If what you actually want is a companion that replies instantly and never makes you wait — and you don’t want to buy a graphics card to get there — Freya, our hosted option, runs on hosted GPUs so it’s fast out of the box, no downloads and no hardware to fuss with. Skip the tuning, skip the wait, and just talk.