If Ollama suddenly feels like it’s typing through molasses, the cause is almost never mysterious — and it’s almost never “the model is just slow.” Slowness in local AI is a layer problem: either part of the model is running on your CPU instead of your GPU, the context window is bloated, the model is reloading from disk on every prompt, or your hardware is quietly throttling under heat or a power cap. The good news is that each of these has a concrete, observable signature and a specific fix. This guide walks them in the order that pays off fastest, starting with a single command that explains most slowness in about three seconds.
First, diagnose: ollama ps and the CPU/GPU split
Before you change a single setting, run this while a model is loaded:
ollama ps
You’ll get output like a table with a PROCESSOR column. That column is the whole game. It will say one of three things:
- 100% GPU: the entire model is in VRAM. This is what fast looks like.
- 100% CPU: nothing made it onto the GPU. Expect single-digit tokens per second.
- 48%/52% CPU/GPU (or any split): the model is partially offloaded, and this is the most common cause of “it used to be fast and now it crawls.”
A partial split is brutal because of how transformers work: every token has to flow through every layer in order. If even a few layers live in system RAM, the GPU finishes its share and then waits on the slow CPU layers for every single token. You don’t get the average of CPU and GPU speed — you get something much closer to the CPU speed, dragged down by the constant handoff across the PCIe bus.
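To put rough numbers on that, here is a minimal sketch that models per-token time as the sum of the time spent in CPU layers and GPU layers. The 4 and 40 tok/s figures are illustrative assumptions, and the model ignores PCIe handoff cost, so real splits tend to be somewhat worse than this.

```python
def effective_tokens_per_second(cpu_fraction: float, cpu_tps: float, gpu_tps: float) -> float:
    """Estimate speed when a fraction of the layers runs on the CPU.

    Per-token time is the time in CPU layers plus the time in GPU layers,
    so the two speeds combine like a weighted harmonic mean, not an average.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    seconds_per_token = cpu_fraction / cpu_tps + (1.0 - cpu_fraction) / gpu_tps
    return 1.0 / seconds_per_token


# Assumed speeds: 4 tok/s fully on CPU, 40 tok/s fully on GPU.
for split in (0.0, 0.1, 0.25, 0.5, 1.0):
    tps = effective_tokens_per_second(split, cpu_tps=4.0, gpu_tps=40.0)
    print(f"{split:>4.0%} on CPU -> {tps:5.1f} tok/s")
```

Under these assumptions a 50/50 split lands near 7 tok/s rather than the 22 tok/s a simple average would suggest, and even 10% of layers on the CPU costs roughly half the speed.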
So the diagnostic flow is simple. If ollama ps shows full GPU and it’s still slow, the problem is context size or thermals (covered below). If it shows any CPU percentage at all, that is your bottleneck, and the next two sections fix it.
If ollama ps shows 100% CPU when you have a perfectly good GPU installed, that’s a different class of problem — a driver, CUDA, or detection failure rather than a tuning issue. Our walkthrough on why Ollama won’t use your GPU covers that case end to end.
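If you want to script that triage, a small sketch like the one below can map the PROCESSOR cell to a diagnosis. It assumes the cell text matches the three forms described above; the exact wording may differ between Ollama versions, so treat the patterns as a starting point.

```python
import re


def classify_processor(processor: str) -> str:
    """Map a PROCESSOR cell from `ollama ps` to a diagnosis label."""
    text = processor.strip()
    if text == "100% GPU":
        return "full-gpu"
    if text == "100% CPU":
        return "cpu-only"
    # Split form as described above, e.g. "48%/52% CPU/GPU".
    match = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if match:
        return f"partial-offload ({match.group(1)}% of the model on CPU)"
    return "unrecognized"


print(classify_processor("48%/52% CPU/GPU"))
```

Anything other than "full-gpu" means the sections on offload, context, and quant size are where to look next.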
The cold-start tax: KEEP_ALIVE and preloading
A surprising amount of perceived slowness isn’t generation speed at all — it’s the load. By default Ollama unloads a model from VRAM after about five minutes of inactivity. The next prompt then has to re-read several gigabytes from disk back into the GPU before a single token appears. On a spinning disk or a busy SSD, that’s a multi-second stall that feels like the model “thinking” when it’s really just loading.
If you use one model regularly, keep it resident. Set the keep-alive duration via an environment variable:
OLLAMA_KEEP_ALIVE=30m
Set -1 to keep it loaded indefinitely (until you run something else), or 0 to unload immediately after each request. You can also pass keep_alive per request through the API. For a companion or assistant you talk to throughout the day, a long keep-alive is the single highest-impact comfort fix — it turns every follow-up from “wait, then respond” into “respond.”
You can also preload a model so the very first message of the day is instant. A bare call with an empty prompt loads the weights without generating:
ollama run llama3.1 ""
The trade-off is honest: a resident model holds VRAM you can’t use for anything else. On an 8 GB card that’s a real constraint; on a 24 GB card it’s usually free real estate.
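The same preload can be done over the API, which is handy if a script or app starts your day for you. This is a rough sketch that assumes Ollama is listening on its default local port; build_preload_request and send are hypothetical helper names, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def build_preload_request(model: str, keep_alive: str = "30m") -> dict:
    """Payload with no prompt: asks Ollama to load the model and keep it resident."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(payload: dict) -> dict:
    """POST the payload to a running Ollama server and return the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling send(build_preload_request("llama3.1")) against a running server should load the weights without generating anything. Adding a "prompt" key to the same payload turns it into a normal request that also carries a per-request keep_alive.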
Partial GPU offload: the single biggest speed killer
This is the one that wrecks the most setups, so it gets its own section. When ollama ps shows a CPU/GPU split, it means the model didn’t fit in VRAM, so Ollama spilled the overflow layers into system RAM rather than failing. It’s trying to be helpful, and the result is slow.
What pushes a model over the edge:
- The model is genuinely too big for your VRAM at its current quant (a 24B model at Q4 needs roughly 14–16 GB just for weights).
- The context window grew. A large num_ctx reserves a big KV cache in VRAM, leaving less room for the weights and forcing layers onto the CPU. People raise context for a long chat, then wonder why it got slow; this is why.
- Something else is eating VRAM: a browser with hardware acceleration, a second model, a game, a desktop compositor.
How to force full-GPU:
- Free up VRAM. Close the browser/game, run nvidia-smi (NVIDIA) or check your GPU monitor, and unload other models with ollama stop <model>.
- Shrink the footprint so the whole thing fits: either a smaller context (next section) or a smaller/more-aggressive quant (section after that).
- Set the offload layers explicitly. Ollama auto-detects how many layers to push to GPU, but you can override it with num_gpu (the number of layers to offload). A Modelfile parameter or an API option of num_gpu set to a high number forces maximum offload; if you set it too high it’ll error or fall back, so tune down until ollama ps reads 100% GPU.
The target is unambiguous: you want ollama ps to say 100% GPU. Everything else in this guide is fine-tuning; this is the difference between 4 tok/s and 40.
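For the explicit-offload step, the override can travel with an API request as an option. The sketch below only builds the request body; build_generate_payload is a hypothetical helper, and 99 is an arbitrary "more layers than the model has" value rather than a recommended setting.

```python
def build_generate_payload(model: str, prompt: str, num_gpu: int) -> dict:
    """Request body for /api/generate that overrides the GPU layer count."""
    if num_gpu < 0:
        raise ValueError("num_gpu must be zero or positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # number of layers to offload
    }


payload = build_generate_payload("llama3.1", "Say hi.", num_gpu=99)
print(payload["options"])
```

After sending a request like this, check ollama ps again. If the load fails or still shows a split, step num_gpu down and repeat.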
Trim the context: num_ctx and KV-cache type
The context window (num_ctx) is how many tokens of conversation the model holds in working memory. Bigger context = bigger KV cache = more VRAM consumed and more compute per token. Many people inflate it reflexively, not realizing it’s a direct tax on both speed and the GPU-fit problem above.
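A back-of-the-envelope calculation shows how quickly that cache grows. The sketch assumes a standard transformer cache (one key and one value vector per layer, per KV head, per position) and an 8B Llama-style shape of 32 layers, 8 KV heads, and a head dimension of 128; other models will differ, and Ollama's real allocation includes extra buffers.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: float = 2.0) -> float:
    """Keys plus values for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed shape for an 8B Llama-style model, with the cache stored as f16.
for num_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"num_ctx={num_ctx:>6}: ~{gib:.2f} GiB of f16 KV cache")
```

With those assumptions the cache scales linearly: about a quarter of a GiB at 2048 tokens, 1 GiB at 8192, and 4 GiB at 32768, all of it VRAM the weights can no longer use.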
Right-size it to what you actually use:
| Use case | Reasonable num_ctx |
|---|---|
| Quick Q&A, short chats | 2048–4096 |
| Ongoing companion / roleplay | 8192 |
| Long documents, RAG, code review | 16384–32768 (if VRAM allows) |
Set it in a Modelfile (PARAMETER num_ctx 8192) or per request via the API. If raising context dropped you into a CPU/GPU split, lowering it back is often the fastest path to full-GPU speed. The full trade-off — including when a bigger window is genuinely worth the VRAM — is in our guide on setting the Ollama context window.
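If you manage several context sizes, generating the Modelfile keeps them consistent. This is a minimal sketch: render_modelfile is a hypothetical helper, and llama3.1 simply reuses the model name from the preload example earlier.

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that pins num_ctx for a derived model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


print(render_modelfile("llama3.1", 8192))
```

Save the output as a file named Modelfile and build it with ollama create, for example ollama create my-chat -f Modelfile (my-chat is a placeholder name). The per-request alternative is an "options": {"num_ctx": 8192} entry in the API payload.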
A second lever: KV-cache quantization. Recent Ollama supports storing the KV cache in lower precision (q8_0 or q4_0) instead of full f16, which roughly halves or quarters the cache’s VRAM cost — letting you keep a long context and stay fully on the GPU. Enable it with the cache-type environment variable (e.g. OLLAMA_KV_CACHE_TYPE=q8_0), which generally needs flash attention turned on as well. q8_0 is a near-free quality trade; q4_0 saves more but can dull long-context recall, so test it on your own chats.
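The "halves or quarters" figure can be sanity-checked from how the quantized formats pack values. The byte counts below are assumptions based on the ggml block layouts (32 values plus a 2-byte scale per block), so treat the result as an estimate.

```python
# Bytes per stored value (assumed from the ggml block layouts):
# f16 is 2 bytes; q8_0 packs 32 values into 34 bytes; q4_0 packs 32 into 18.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def cache_size_ratio(cache_type: str) -> float:
    """Size of the KV cache relative to the f16 default."""
    return BYTES_PER_VALUE[cache_type] / BYTES_PER_VALUE["f16"]


for cache_type in BYTES_PER_VALUE:
    print(f"{cache_type}: {cache_size_ratio(cache_type):.0%} of the f16 cache")
```

Under those assumptions q8_0 comes out slightly above half and q4_0 slightly above a quarter, because each block carries its own scale factor.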
Pick the right quant: Q4_K_M vs higher quants vs speed
Quantization is how the model’s weights get compressed to fit in less memory. The tag tells you the trade. Q4_K_M means 4-bit weights with a “medium” mixed-precision scheme — and it is the default sweet spot for local use for a reason: it cuts the model to roughly a quarter of its full size while keeping quality close to the original.
Going to a higher quant (Q5_K_M, Q6_K, Q8_0) buys marginal quality at a steep cost: more VRAM, and — critically — the higher VRAM use is what tips you back into a partial-offload split. A model that runs 100% on GPU at Q4_K_M but spills to CPU at Q6_K will be dramatically slower at the “better” quant. On constrained hardware, the faster quant is usually the better experience.
| Quant | Relative size | When to use |
|---|---|---|
| Q4_K_M | ~25% of full | Default. Best balance of quality, size, and speed. |
| Q5_K_M / Q6_K | ~30–40% | Only if it still fits 100% in VRAM and you want a quality nudge. |
| Q8_0 | ~50% | Rarely worth it locally; near-lossless but heavy. |
| Q3 / Q2 | <25% | Last resort to fit a bigger model; noticeable quality loss. |
The honest rule: pick the largest quant that still shows 100% GPU in ollama ps, and not one tag higher. If you want the full decoder ring on what every letter and number means, see our GGUF quantization cheat sheet.
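That rule can be roughed out before you download anything. The sketch below uses approximate bits-per-weight figures for each quant (rounded assumptions, which vary a little by model) and a reserved_gb allowance you choose for KV cache and runtime overhead; ollama ps remains the ground truth.

```python
# Approximate bits per weight for common GGUF quants (assumed, rounded figures).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(params: float, quant: str) -> float:
    """Estimated weight size in GB (10^9 bytes)."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def largest_quant_that_fits(params: float, vram_gb: float, reserved_gb: float):
    """Pick the biggest quant whose weights fit after reserving VRAM for the
    KV cache and runtime overhead. Returns None if even Q4_K_M would spill."""
    budget = vram_gb - reserved_gb
    fitting = [q for q in BITS_PER_WEIGHT if weights_gb(params, q) <= budget]
    return max(fitting, key=BITS_PER_WEIGHT.get) if fitting else None


print(largest_quant_that_fits(8e9, vram_gb=8, reserved_gb=2))
print(largest_quant_that_fits(24e9, vram_gb=24, reserved_gb=2))
print(largest_quant_that_fits(24e9, vram_gb=16, reserved_gb=2))
```

A None result means the model will not fit fully on the card even at Q4_K_M with that reservation, which points toward a smaller context, a lower quant, or a smaller model.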
Enable flash attention and other runtime flags
Flash attention is a more memory-efficient attention algorithm that speeds up processing and cuts the working memory attention needs, especially at longer contexts. It isn’t always on by default, so turn it on:
OLLAMA_FLASH_ATTENTION=1
On supported GPUs this is close to a free win — faster prompt processing and lower VRAM use, which (again) helps you stay fully offloaded. It’s also the prerequisite for the KV-cache quantization mentioned earlier, so the two pair naturally.
A few other knobs worth knowing:
- OLLAMA_NUM_PARALLEL: how many requests Ollama serves at once. More parallelism splits your context budget across slots and can slow each individual chat. For a single-user setup, keeping this low (1–2) gives each conversation the full resources.
- num_batch: the prompt-processing batch size. Larger can speed up ingesting long prompts at some VRAM cost.
- num_thread: only matters for the CPU portion; if you’re fully on GPU it’s irrelevant. If you’re stuck partly on CPU, matching it to your physical core count can help slightly.
Set environment variables where Ollama actually reads them — for the systemd service on Linux that means an override in the service file, not just your shell, or the change silently won’t apply.
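For the systemd case, the override is a short [Service] drop-in with one Environment line per variable. Here is a sketch that renders one; render_systemd_override is a hypothetical helper, and the variable values are just the examples used in this guide.

```python
def render_systemd_override(env: dict) -> str:
    """Build a [Service] drop-in that sets environment variables for Ollama."""
    lines = ["[Service]"]
    for name, value in env.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


print(render_systemd_override({
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_KEEP_ALIVE": "30m",
}))
```

Assuming the service is named ollama.service, as the standard Linux installer sets it up, paste the output into the editor opened by systemctl edit ollama.service, then run systemctl daemon-reload and systemctl restart ollama so the new values are picked up.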
Background causes: thermal throttling, power limits, parallel requests
If ollama ps says 100% GPU, your context is sane, and it’s still slower than it was last week, look outside Ollama:
- Thermal throttling. A GPU that hits its temperature ceiling clocks itself down to survive. A laptop with a dust-clogged fan, or a desktop in a hot room, can lose a large chunk of performance silently. Watch temps with nvidia-smi during a long generation; if clocks drop as temperature climbs, that’s throttling. Fix the airflow.
- Power limits. Laptops on battery or in a “quiet/eco” power profile cap the GPU hard. Plug in and switch to a performance profile. Some desktops ship with a conservative power limit you can raise.
- Parallel requests. If something else is hitting your Ollama endpoint (a second app, an Open WebUI tab mid-generation, a background script), they share the same GPU and each one slows. Check ollama ps for multiple loaded models, and keep OLLAMA_NUM_PARALLEL modest.
- VRAM contention from other apps. A game or a GPU-accelerated video call started after Ollama loaded can squeeze it into a spill. Close them and reload the model.
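The throttling check can be automated on NVIDIA cards by sampling temperature and SM clock while a long generation runs. This is a sketch under assumptions: it reads only the first GPU, and the 15% clock-drop threshold is a made-up starting point, not a vendor figure.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]


def parse_sample(line: str) -> tuple:
    """Turn one CSV line such as '83, 1320' into (temp_c, sm_clock_mhz)."""
    temp, clock = (float(field) for field in line.split(","))
    return temp, clock


def read_sample() -> tuple:
    """Query the first GPU once; requires an NVIDIA driver with nvidia-smi."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(output.splitlines()[0])


def looks_throttled(samples: list, min_clock_drop: float = 0.15) -> bool:
    """True if temperature rose from the first to the last sample while the
    SM clock fell by at least min_clock_drop (a hypothetical threshold)."""
    (first_temp, first_clock), (last_temp, last_clock) = samples[0], samples[-1]
    return last_temp > first_temp and last_clock < first_clock * (1 - min_clock_drop)
```

Collect samples with read_sample() every few seconds once generation is underway (so the first sample is not an idle clock), then pass the list to looks_throttled.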
When the hardware is simply the ceiling
Sometimes nothing is misconfigured — the machine just can’t do more. If a model is fully on GPU, context is trimmed, flash attention is on, and you’re still under your comfort threshold, you’ve hit the physical ceiling. (Worth knowing where that threshold is: for back-and-forth chat, usable speed plateaus around 7–10 tokens per second, roughly reading pace — you don’t need 60.)
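To see where you actually sit against that threshold, you can compute tokens per second from the timing fields Ollama returns with a non-streaming /api/generate reply. The sample values below are illustrative, not a real measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming /api/generate reply.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


# Illustrative numbers only: 120 tokens generated in 15 seconds.
sample = {"eval_count": 120, "eval_duration": 15_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

From the terminal, running a model with ollama run and the --verbose flag should report a comparable eval rate after each reply.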
At that point you have two clean options:
- Run a smaller model. Dropping from a 24B to an 8B–14B class model, or to a MoE model that activates few parameters per token, often quadruples your speed while staying genuinely useful for chat. The right pick is mostly a VRAM question — our best local model by VRAM guides match models to cards. And if you have no usable GPU at all, running local AI on CPU is possible but will always be the slow path.
- Let someone else’s hardware do the work. If you want the experience now, without buying a GPU or babysitting offload settings, a hosted companion runs on datacenter-class GPUs and is fast out of the box.
That’s the real fork at the bottom of every speed problem: tune and own your stack, or skip the hardware entirely. If you’d rather keep everything private and on your own machine — and squeeze every token out of it — Ember runs a fully local companion on Ollama, with these tuning choices handled for you. If you just want fast, zero-setup conversation and don’t have the GPU for it, Freya runs in the cloud and never asks you to read an ollama ps table.
