If you installed Ollama expecting fast, GPU-accelerated answers and instead got a slow trickle of tokens with your fans barely spinning up, the culprit is almost always one of five concrete problems, and every one of them is fixable in a few minutes once you know where to look. “Ollama not using GPU” is one of the most common local-AI complaints, but it’s rarely a mysterious bug. It’s usually a missing driver, a WSL2 misconfiguration, a missing Docker flag, a quiet VRAM overflow, or a service running with a stale environment. This guide walks through each cause in the order you should check it, with the exact commands to confirm what’s actually happening before you change a single setting.

Step 0: Confirm the symptom before you touch anything

Do not start editing config files on a hunch. First, prove that Ollama is actually running on the CPU. Load a model and ask Ollama directly:

ollama run llama3.1 "hello"

Then, in a second terminal:

ollama ps

Look at the PROCESSOR column. You’ll see one of three things:

| PROCESSOR value | What it means |
| --- | --- |
| 100% GPU | Fully offloaded: nothing to fix |
| 48%/52% CPU/GPU | Split: part of the model spilled to CPU (usually VRAM pressure) |
| 100% CPU | Not touching the GPU at all (driver, runtime, or detection problem) |
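
As a quick illustration of the triage above, here is a minimal shell sketch that maps a PROCESSOR value to the part of this guide to read next. The function name next_step is hypothetical (it is not part of Ollama), and it assumes you paste in the value exactly as ollama ps prints it.

```shell
# next_step is a hypothetical helper, not an Ollama command.
# It takes one PROCESSOR value copied from `ollama ps` and suggests where to look.
next_step() {
  case "$1" in
    "100% GPU") echo "fully on GPU: nothing to fix" ;;
    "100% CPU") echo "not using GPU: check driver, runtime, or detection" ;;
    *CPU/GPU)   echo "split: check VRAM (Cause 4)" ;;
    *)          echo "unrecognized value: read the ollama ps output by hand" ;;
  esac
}

# Usage (value copied from the PROCESSOR column):
# next_step "48%/52% CPU/GPU"
```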

If you’re on NVIDIA, confirm the GPU is alive and visible to the OS:

nvidia-smi

A healthy output shows your card, driver version, and CUDA version. While a generation is running, nvidia-smi should list an ollama process (the exact process name varies by version and platform) and show rising VRAM use. If nvidia-smi itself errors out (“command not found” or “couldn’t communicate with the driver”), your problem is upstream of Ollama entirely, so go straight to Cause 1.
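
If you would rather watch a single number than read the whole table, nvidia-smi can print just used and total memory. The sketch below assumes that query output format ("used, total" in MiB, one line per GPU); vram_percent is a hypothetical helper name.

```shell
# vram_percent is a hypothetical helper. It expects one line such as "3012, 12288"
# (used MiB, total MiB) and prints the whole-number percentage in use.
vram_percent() (
  used=${1%%,*}     # everything before the first comma
  total=${1##*, }   # everything after the last ", "
  echo $(( used * 100 / total ))
)

# Usage on a single-GPU machine, run while a generation is in progress:
# vram_percent "$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits)"
#
# To list the processes holding VRAM:
# nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

A percentage that stays flat while tokens are streaming is a strong hint the model is running on the CPU.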

The split-second habit of running ollama ps and nvidia-smi first will save you from “fixing” the wrong thing. 100% CPU points at drivers, runtime, or detection. A CPU/GPU split points at VRAM. Those are different roads.

Cause 1: Missing or wrong GPU driver and CUDA/ROCm runtime

Ollama detects your GPU at startup by probing the vendor runtime. No working driver, no acceleration — it silently falls back to CPU rather than crashing.

NVIDIA: You need a current proprietary driver and a CUDA-capable runtime. Run nvidia-smi — if it prints a table, your driver is loaded. If it doesn’t, install or reinstall the official NVIDIA driver for your card (on Linux, your distro’s packaged driver or NVIDIA’s .run installer; on Windows, the Game Ready or Studio driver). Modern Ollama bundles the CUDA libraries it needs, so you generally don’t have to install the full CUDA Toolkit yourself — but you do need a driver new enough for the CUDA version Ollama ships. After updating the driver, restart the Ollama service so it re-probes the hardware.

AMD: Ollama supports many AMD cards through ROCm. On Linux, install ROCm and confirm the GPU appears in rocminfo (rocm-smi is the closer analogue of nvidia-smi if you want live usage). Not every Radeon is supported. Older or consumer cards sometimes need the HSA_OVERRIDE_GFX_VERSION environment variable set to a supported GFX target so ROCm will accept the card. AMD support is genuinely more finicky than NVIDIA; if you’re choosing hardware or troubleshooting a Radeon, our AMD GPU local LLM guide covers the override values and which cards actually work.

If restarting the service doesn’t pick up the GPU, a clean reinstall of Ollama after the driver is in place forces a fresh hardware probe. If you’re still early in setup, the Ollama install guide covers the supported install path end to end (curl -fsSL https://ollama.com/install.sh | sh on Linux; macOS and Windows use the downloadable installer).

Cause 2: The WSL2 trap — never install the Linux driver inside WSL2

This one bites a huge number of Windows users, and it’s counterintuitive. If you run Ollama inside WSL2 (Windows Subsystem for Linux) and your model lands on CPU, the instinct is to apt install the NVIDIA Linux driver inside the Linux distro. Do not do this. It breaks GPU passthrough.

WSL2 does not use a Linux GPU driver. NVIDIA and Microsoft built a passthrough where the Windows host driver is projected into the Linux environment automatically. The correct setup is:

  1. Install the normal NVIDIA driver on Windows (the host) — not inside WSL2.
  2. Make sure you’re on WSL2, not WSL1 (wsl --list --verbose should show version 2).
  3. Inside the WSL2 distro, run nvidia-smi. If passthrough is working, it prints the table — using the projected host driver. You did not install anything Linux-side to make that happen.

Installing a Linux driver inside WSL2 overwrites the projected libraries and is the classic reason Ollama reports 100% CPU in WSL even though the GPU works fine on the Windows side. If you already did it, the cleanest fix is to remove the Linux-side driver packages, update the Windows host driver, and restart WSL with wsl --shutdown. If WSL keeps fighting you, running Ollama natively on Windows (the standard installer) sidesteps the whole projection layer.
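
The checks above can be sketched in shell. This is a rough sketch under stated assumptions: that a WSL2 kernel identifies itself with "microsoft" and "WSL2" in /proc/version, that the projected libraries live in /usr/lib/wsl/lib, and that a Linux-side driver on an Ubuntu or Debian distro would show up as nvidia-driver or libnvidia-compute packages. Both function names are hypothetical.

```shell
# is_wsl2_kernel: succeeds if a kernel version string looks like WSL2.
is_wsl2_kernel() {
  case "$1" in
    *[Mm]icrosoft*WSL2*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_wsl_gpu: run inside the distro. Package names are an Ubuntu/Debian assumption.
check_wsl_gpu() {
  # Libraries projected from the Windows host driver should be listed here.
  ls /usr/lib/wsl/lib/ 2>/dev/null | grep -i -E 'libcuda|nvidia-smi' \
    || echo "no projected NVIDIA libraries found: update the Windows host driver"
  # Any match here means a Linux-side driver was installed inside WSL2.
  if dpkg -l 2>/dev/null | grep -i -E '^ii +(nvidia-driver|libnvidia-compute)'; then
    echo "Linux-side NVIDIA driver packages are installed: remove them"
  fi
}

# Usage:
# is_wsl2_kernel "$(cat /proc/version)" && check_wsl_gpu
```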

Cause 3: Docker missing the NVIDIA Container Toolkit or --gpus all

Running Ollama in Docker adds an extra wall between the container and your GPU. By default, a container sees no GPU at all. Two things must be true.

First, the host needs the NVIDIA Container Toolkit installed and configured for your container runtime. Without it, you’ll hit the unmistakable error:

could not select device driver "" with capabilities: [[gpu]]

That message means Docker has no idea how to hand a GPU to the container — install the toolkit on the host and restart the Docker daemon.

Second, you have to actually request the GPU when you launch the container:

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

The --gpus all flag (or the equivalent deploy.resources.reservations.devices block in docker-compose.yml) is what exposes the card. Omit it and the container runs CPU-only no matter how healthy the host GPU is. After the container is up, run docker exec -it <container> nvidia-smi — if that fails inside the container, fix the toolkit/flag before blaming Ollama.
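
For Compose users, the equivalent of --gpus all can be sketched as follows. The shell function simply prints a minimal Compose file; print_ollama_compose is a hypothetical name, and the service and volume names mirror the docker run command above rather than anything Ollama requires.

```shell
# print_ollama_compose is a hypothetical helper that writes a minimal Compose file
# to stdout. The devices block is what requests the GPU, like --gpus all.
print_ollama_compose() {
  cat <<'YAML'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
YAML
}

# Usage:
# print_ollama_compose > docker-compose.yml
# docker compose up -d
# docker compose exec ollama nvidia-smi
```

The host still needs the NVIDIA Container Toolkit; the Compose block only makes the request.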

Cause 4: Silent VRAM overflow forcing CPU fallback

This is the subtle one, and it’s where ollama ps showing a CPU/GPU split (not 100% CPU) usually leads. Ollama tries to fit the model and its working memory entirely in VRAM. If it can’t, it offloads as many layers as fit and runs the rest on the CPU — quietly. Performance falls off a cliff and nothing errors out.

Two things compete for your VRAM:

  • The model weights. A 7–8B model at Q4_K_M is roughly 4–5 GB; a 13–14B is ~8–9 GB; a 32–34B is ~18–22 GB at the same quant. (See the GGUF quantization cheat sheet for how quant tags map to size and quality.)
  • The KV cache (context). This grows with num_ctx — the context window — and it grows linearly with how many tokens you let the model hold. Crank num_ctx from 4K to 32K and you can add several gigabytes of VRAM demand on top of the weights. That’s the hidden trigger: a model that fit fine at default context spills to CPU the moment you raise the context window.

The fix is to do the math. Roughly: VRAM needed ≈ model size + KV cache + ~1–2 GB overhead, and you want that comfortably under your card’s capacity (the OS and other apps eat some VRAM too). If you’re flirting with the limit, lower num_ctx, choose a smaller quant, or pick a smaller model. Our VRAM sizing guide breaks the budget down by card, and increasing the Ollama context window covers how to raise context without tipping over the edge. If you’re getting hard out-of-memory crashes rather than a silent split, the dedicated CUDA out of memory fix is the right page.
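
To make the KV cache part of that math concrete, here is a back-of-the-envelope sketch. It assumes the textbook formula for a transformer KV cache (2 tensors × layers × KV heads × head dimension × context length × bytes per value), 16-bit cache values, and a single request slot. The example figures (32 layers, 8 KV heads, head dimension 128) are the published Llama 3.1 8B architecture; kv_cache_mib is a hypothetical helper, and Ollama's real allocation adds compute buffers on top, so treat the result as a floor rather than an exact number.

```shell
# kv_cache_mib is a hypothetical helper: estimated KV cache size in MiB.
# Args: layers, kv_heads, head_dim, n_ctx, [bytes_per_value, default 2 = 16-bit]
kv_cache_mib() (
  layers=$1; kv_heads=$2; head_dim=$3; n_ctx=$4
  bytes_per_value=${5:-2}
  # The leading 2 accounts for storing both keys and values.
  echo $(( 2 * layers * kv_heads * head_dim * n_ctx * bytes_per_value / 1024 / 1024 ))
)

# Usage with Llama 3.1 8B figures:
# kv_cache_mib 32 8 128 4096    # 4K context
# kv_cache_mib 32 8 128 32768   # 32K context
```

Under these assumptions the 4K case comes to 512 MiB and the 32K case to 4096 MiB, which is the "several gigabytes" that can tip a model that previously fit into a CPU/GPU split.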

Cause 5: Ollama-as-a-service running with a stale environment

Ollama usually runs as a background service, not as a process you started in your shell. That matters because environment variables you export in a terminal never reach the service. You export an OLLAMA_* variable or point PATH at a new CUDA install, restart nothing, and wonder why nothing changed. The service is still running with the environment it had when it started.

Linux (systemd): Edit the service environment properly:

sudo systemctl edit ollama.service

Add your variables under a [Service] block (ollama serve --help lists the variables your installed version reads):

[Service]
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_SCHED_SPREAD=1"

Then reload and restart so the new env actually takes effect:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Windows: Ollama runs from the tray app and reads your user and system environment variables when it launches. Set them in System Properties → Environment Variables, then fully quit Ollama from the tray and relaunch it. A set OLLAMA_... in a single cmd window only affects that window and won’t reach the background server, which is a very common reason a “fix” appears to do nothing on Windows.

After changing service env, re-run ollama ps to confirm the new behavior — don’t assume.
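
On systemd, you can also confirm the service actually carries a variable before testing with a model. The sketch below assumes the usual output shape of systemctl show with --property=Environment (a single Environment= line with space-separated NAME=value pairs) and simple values without spaces; service_env_has is a hypothetical helper.

```shell
# service_env_has is a hypothetical helper. It reads the output of
#   systemctl show ollama --property=Environment
# on stdin and succeeds if the named variable is set on the service.
service_env_has() {
  tr ' ' '\n' | sed 's/^Environment=//' | grep -q "^$1="
}

# Usage:
# systemctl show ollama --property=Environment | service_env_has OLLAMA_SCHED_SPREAD \
#   && echo "service has it" || echo "service does not have it"
```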

Force-GPU levers (use after the basics are clean)

Once drivers, runtime, and the service environment are correct, these levers push more of the model onto the GPU:

  • num_gpu: the number of model layers to offload to the GPU. Set it high (e.g. 999) to tell Ollama “offload everything you possibly can.” It is a model option, so set it with PARAMETER num_gpu in a Modelfile (per model) or in the options of an API call (per request). If you rely on an OLLAMA_NUM_GPU environment variable on the service instead, confirm with ollama serve --help that your version actually reads it, because not every release lists it. Lower num_gpu deliberately only if you’re intentionally sharing the GPU.
  • OLLAMA_SCHED_SPREAD=1: tells Ollama to always spread a model across all available GPUs instead of packing it onto as few cards as possible. Only relevant if you have more than one card. Ollama already splits a model across GPUs when it won’t fit on one, so this mainly evens out the load.
  • Trim context: drop num_ctx to shrink the KV cache and free VRAM for more weight layers. Often the cheapest way to get from a CPU/GPU split back to 100% GPU.
  • Smaller quant: moving from Q5/Q6 down to Q4_K_M shrinks the weights with modest quality loss, which can be the difference between fitting and spilling.

A word of caution: forcing num_gpu higher than your VRAM can hold doesn’t make it fit — it’ll either OOM or split anyway. These levers help you win back GPU you were leaving on the table, not conjure VRAM you don’t have.
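
To try these levers on a single request without touching the service, you can pass num_gpu and num_ctx in the options of an API call. The sketch below builds the JSON body for Ollama's /api/generate endpoint; generate_payload is a hypothetical helper, the prompt is a placeholder, and the values are illustrative rather than recommendations.

```shell
# generate_payload is a hypothetical helper that prints a JSON request body.
# Args: model name, num_ctx (context window), num_gpu (layers to offload)
generate_payload() {
  printf '{"model":"%s","prompt":"hello","stream":false,"options":{"num_ctx":%s,"num_gpu":%s}}\n' "$1" "$2" "$3"
}

# Usage, assuming the server listens on the default port:
# curl -s http://localhost:11434/api/generate -d "$(generate_payload llama3.1 4096 999)"
# ollama ps   # then check the PROCESSOR column again
```

Testing per request first lets you find a num_ctx that keeps the model at 100% GPU before you bake it into a Modelfile.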

When it genuinely won’t fit

Sometimes the honest answer is the model is too big for your card. A 24–34B model simply will not run well on 8 GB of VRAM, no matter how you tune it. You have two realistic paths.

Pick a model that matches your VRAM. This is the right move for most people and it’s not a compromise — a well-chosen 8–14B model at Q4_K_M is fast and genuinely capable. Match the model to your card with the best local LLM for 8GB VRAM or 12–16GB VRAM guides, and sanity-check expected speed with our tokens-per-second usability breakdown. If you’re choosing hardware from scratch, the local AI hardware guide lays out what each tier of GPU can actually run.

Or skip the GPU problem entirely. If you don’t have a capable GPU — or you just want a companion working right now without driver archaeology — a hosted setup runs the model on someone else’s hardware and needs nothing local. There’s a real privacy trade-off to that, which we cover honestly in local AI vs cloud AI, but it’s the fastest way from “Ollama won’t use my GPU” to actually using AI today.


If you want to keep everything on your own machine and just need the right model for the GPU you have, Ember is a one-time, fully local companion built on Ollama — it picks a model that fits your VRAM so you skip the offload guesswork entirely. And if you’d rather not fight drivers at all, Freya runs the whole thing in the cloud with zero setup, so a slow-token CPU fallback is never your problem.