If you want to run a real AI model on your own computer — no account, no monthly fee, no servers reading your chats — Ollama is where almost everyone starts. It’s a small, free, open-source program that downloads open-weight language models and runs them locally, exposing a simple API on your machine that other apps can talk to. This guide walks you through installing it the right way on Windows, macOS, and Linux, pulling your first model, fixing the handful of things that commonly break, and giving the model a personality once it’s running. No prior command-line experience assumed.
What Ollama is and why it is the local-AI foundation
Ollama is best understood as a model runner plus a tiny local server. You tell it which model you want, it downloads the weights (already quantized and packaged), loads them onto your GPU or CPU, and serves responses at http://127.0.0.1:11434 — a loopback address, meaning the traffic never leaves your machine. Nothing is sent to a company’s cloud. That single design choice is why Ollama has become the default foundation for private, local AI.
Three reasons it won the “getting started” race:
- It handles the hard parts. Quantization, model formats (GGUF), GPU offload, and memory management are all automatic. You don’t compile anything.
- It speaks a standard API. Because it serves an OpenAI-compatible endpoint locally, dozens of front-ends, scripts, and companion apps can plug straight into it.
- It’s genuinely cross-platform. The same
ollama runcommand works the same way on all three operating systems.
If you’re still deciding whether local AI is right for you at all, the broader case is laid out in how to run AI locally and why cloud AI censors you. This page assumes you’ve decided — let’s install it.
Install on Windows (and the common gotchas)
On Windows the simplest path is the official installer:
- Go to ollama.com/download and grab the Windows installer (
OllamaSetup.exe). - Run it. It installs Ollama as a background service and adds the
ollamacommand to your PATH automatically. - Open a fresh PowerShell or Command Prompt window (a new one — so it picks up the updated PATH) and confirm it works:
ollama --version
After install, Ollama runs quietly in the system tray. You don’t need to “start” anything — the local server is already listening on port 11434.
Common Windows gotchas:
ollamanot recognized. You opened a terminal that was already running before install. Close it and open a new one so the PATH refreshes.- No GPU acceleration. Ollama supports NVIDIA and modern AMD GPUs on Windows. If models feel slow, make sure your GPU drivers are current — old drivers are the number-one cause of CPU-only fallback.
- WSL confusion. You do not need WSL (Windows Subsystem for Linux) for the native Windows build. Only use the Linux instructions below if you specifically want Ollama inside a WSL distro — and if you do, that’s a separate install from your Windows one.
- Antivirus prompts. Some security suites flag the first launch. Allowing it is fine; the binary is signed and open-source.
Install on macOS (Apple Silicon notes)
On a Mac, download the macOS app from ollama.com/download, unzip it, and drag Ollama into your Applications folder. Launch it once and it’ll offer to install the command-line tool. After that, open Terminal and check:
ollama --version
Apple Silicon (M1/M2/M3/M4) is excellent for local AI — arguably the best value in the consumer space. The reason is unified memory: the CPU and GPU share one big pool of RAM, so a 16 GB or 32 GB Mac can hold surprisingly large models on the “GPU” without a discrete graphics card. Ollama uses Apple’s Metal backend automatically; there’s nothing to configure.
A few Mac notes:
- Pick your model by total RAM, not a separate VRAM number. On Apple Silicon they’re the same pool. A 16 GB Mac comfortably runs 7B–8B models; 32 GB opens up the low-to-mid teens of billions of parameters.
- Intel Macs work but run on CPU only and will be noticeably slower. Apple Silicon is strongly preferred for anything beyond small models.
- If you’re shopping hardware around a Mac, Mac mini for local AI covers the sweet spots.
Install on Linux
Linux is the cleanest install of all — one command:
curl -fsSL https://ollama.com/install.sh | sh
The script detects your distribution, installs the binary, sets up a systemd service so Ollama starts on boot, and configures NVIDIA or AMD GPU support if it finds a supported card. When it finishes, verify:
ollama --version
Because it runs as a service, you can manage it like any other:
systemctl status ollama # is it running?
systemctl restart ollama # restart after a config change
journalctl -u ollama -f # watch the logs live
Linux GPU notes: for NVIDIA you need the proprietary driver and CUDA libraries present; the installer handles most of it, but if the server falls back to CPU, missing or mismatched NVIDIA drivers are almost always the cause. AMD users on ROCm should check AMD GPU local LLM for current support details.
Pull your first model and chat
Installation done — now grab a model. The single command that downloads and opens a chat is:
ollama run llama3.2
The first time, Ollama downloads the model (a few gigabytes), then drops you into an interactive prompt. Type a message, press Enter, and you’re talking to an AI running entirely on your own hardware. Type /bye to exit.
A couple of essentials:
- Match the model to your memory. Model size — and the quantization tag like
Q4_K_M— determines how much VRAM/RAM it needs. A good rule of thumb: an 8B model at 4-bit quantization wants roughly 6–8 GB free. Undersized hardware doesn’t crash; it spills to CPU and slows down. See best local LLM for 8GB VRAM and the GGUF quantization cheat sheet to choose wisely. - Manage what you’ve downloaded with a few more commands:
ollama list # show installed models
ollama pull mistral # download without chatting
ollama rm llama3.2 # delete a model to free disk
The default library leans toward “safe,” assistant-style models. If you want models without the corporate guardrails, Ollama uncensored models walks through the abliterated and uncensored families and how to pull them.
Troubleshooting: PATH, port 11434, GPU not detected
Three issues account for the vast majority of “it doesn’t work” reports.
1. command not found / not recognized (PATH). The terminal you’re using started before Ollama touched your PATH. Close every terminal window and open a fresh one. On macOS, make sure you let the app install the command-line tool when it prompted. On Linux, the systemd install puts the binary in a standard location — re-run the install script if it’s genuinely missing.
2. Port 11434 already in use / connection refused. Ollama’s server lives on 127.0.0.1:11434. If an app can’t connect, first confirm the server is actually up:
curl http://127.0.0.1:11434
You should get Ollama is running. If the port is held by a stale process, restart the service (systemctl restart ollama on Linux, or quit and relaunch the tray/menu-bar app on Windows/macOS). If you genuinely need a different port, set the OLLAMA_HOST environment variable before starting it.
3. GPU not detected (everything runs on CPU). Symptoms: responses crawl out a few words at a time. Causes, in order of likelihood: outdated GPU drivers, a model too large to fit in VRAM (so it offloads layers to CPU), or an unsupported card. Update drivers first, then try a smaller or more aggressively quantized model. If you’re unsure whether your machine is even up to the job, can my PC run an AI companion gives a quick reality check.
Now give it a personality: Modelfile basics
Out of the box every model behaves like a generic assistant. A Modelfile changes that. It’s a tiny text file — think Dockerfile, but for an AI’s personality — that layers a system prompt and settings on top of an existing model.
Create a file named Modelfile (no extension):
FROM llama3.2
SYSTEM """
You are Mira, a warm, direct conversation partner.
You speak casually, remember context, and never lecture.
"""
PARAMETER temperature 0.8
Then build and run your custom variant:
ollama create mira -f Modelfile
ollama run mira
FROM picks the base model, SYSTEM sets the persistent persona and rules, and PARAMETER tunes behavior (temperature controls creativity; higher is more playful). This is the foundation of every local AI character, roleplay setup, and companion built on Ollama — the model stays the same, the Modelfile gives it a self. For deeper persona work, best local LLM for roleplay goes further.
Beyond the terminal: a companion layer on top (Ember)
You now have a fully private AI running locally, a model that fits your hardware, and a Modelfile that gives it character. For tinkerers, that’s the whole game — and Ollama is a fantastic place to keep building.
But a Modelfile and a terminal aren’t a relationship. There’s no lasting memory between sessions, no real interface, no voice, no continuity. Wiring all of that together by hand — persistent memory, a chat UI, an uncensored persona that actually stays in character — is a project in itself.
That’s the gap Ember fills. It runs on the exact Ollama foundation you just installed — 100% local, on your own machine, nothing in the cloud — but wraps it in a real companion: persistent memory, a proper interface, and an uncensored personality that’s yours, paid once, never rented. If you’ve got Ollama working and want the finished experience instead of the DIY one, Ember is the natural next step.
