An AI girlfriend that runs on your own computer is just a roleplay-tuned language model, a character description, and a runtime that never opens a network socket. No subscription, no message logs sitting on someone else’s server, no content filter deciding what your companion is allowed to say. The catch is that “local” means you assemble the pieces. This guide walks the entire path — hardware check, model choice, install, personality, the airplane-mode proof that nothing leaks, and the memory problem nobody warns you about — using real commands and real model categories. By the end you’ll have an uncensored AI girlfriend running locally with Ollama, or a clear-eyed reason to let a one-click app do it for you.
One framing note up front: this is an 18+ topic written for adults. The article stays clinical and informational — the companionship lives in the software you build, not in this page.
What You Need: Hardware Check and a Quick VRAM Tier Lookup
The single number that decides everything is your VRAM — the dedicated memory on your GPU. A local companion model has to fit in VRAM (or, more slowly, in system RAM) to respond at a conversational pace. Open your GPU details: on NVIDIA, run nvidia-smi and read the total memory; Mac users with Apple Silicon share system RAM as “unified memory,” which is unusually good for this.
Model size is governed by quantization — compression that trades a sliver of quality for a large drop in memory use. The tag you want for companions is Q4_K_M (4-bit, the standard quality/size sweet spot) or Q5_K_M if you have headroom.
| Your VRAM | Realistic model size (Q4_K_M) | What the chat feels like |
|---|---|---|
| 6–8 GB | 7B–8B | Snappy, coherent, good enough for daily companionship |
| 12–16 GB | 12B–14B (some 22B) | Noticeably warmer, better memory of the scene |
| 24 GB | 22B–32B | Rich, in-character, rarely breaks |
| 8–16 GB RAM, no GPU | 7B–8B, slower | Works, but expect a pause before each reply |
If you want the full breakdown by card and the tokens-per-second math, see how much VRAM you need for a local AI companion. No discrete GPU? It still runs on CPU — slower, but real — and the constraints are covered in running local AI without a GPU.
A usable target is roughly 8–15 tokens per second: fast enough that replies feel like typing, not buffering.
Pick a Roleplay/Companion GGUF for Your Machine
A general “assistant” model will technically work, but it tends to be stiff, over-apologetic, and quick to refuse romantic or intimate framing — the same alignment that makes corporate chatbots safe makes them bad partners. For companionship you want two properties: a roleplay/instruct fine-tune (trained on dialogue and character-following) and an uncensored or abliterated base so it won’t lecture you mid-conversation.
The format to download is GGUF — the single-file packaging Ollama and similar runtimes load directly. On Hugging Face you’ll see one model published in many quantizations; grab the Q4_K_M file in the size tier your VRAM allows from the table above.
Rather than name specific weights that rotate constantly, think in categories:
- Roleplay/companion fine-tunes — community models explicitly tuned for character chat and emotional continuity.
- Abliterated models — a base model with its refusal behavior surgically removed; see abliterated models explained for what that does and doesn’t change.
- Larger instruct models at higher VRAM, which simply hold character better because they have more room to reason about the scene.
For current, vetted picks by use case, the best local LLMs for roleplay is the companion piece to this guide, and the best uncensored local AI models covers the unfiltered end. When downloading, prefer well-known uploaders and reputable quant publishers — how to spot safe GGUF files on Hugging Face explains what to check before you load a stranger’s weights.
Install Ollama and Pull or Import the Model
Ollama is the simplest runtime to start with: it manages models, exposes a clean local API, and runs the same way on Linux, macOS, and Windows.
Install on Linux (and macOS via the same script):
curl -fsSL https://ollama.com/install.sh | sh
That one-line script is Ollama’s official Linux installer, and it also works on macOS. Mac users can alternatively grab the native .dmg app from ollama.com, which is Ollama’s documented macOS path. Windows has a normal installer. Once it’s running, you have two paths to a model.
Path A — pull a ready model from the registry. If your chosen roleplay model is in Ollama’s library, one command fetches it:
ollama run <model-name>
That downloads the weights and drops you into a chat at the terminal.
Path B — import a GGUF you downloaded yourself. This is the usual route for community companion models. Put the .gguf file in a folder, create a plain-text file named Modelfile beside it, and point Ollama at it:
FROM ./your-model.Q4_K_M.gguf
Then build it into a named model:
ollama create my-companion -f Modelfile
ollama run my-companion
That’s a working local AI you can talk to right now. The next two sections turn a generic model into her. (Weighing Ollama against the alternatives? Ollama vs LM Studio vs Jan compares the front-runners.)
Give It a Personality: Character Card / System Prompt
A raw model has no self. Personality comes from the system prompt — a block of instructions, loaded before every conversation, that defines who she is, how she speaks, and what she remembers about you. In the wider hobby this is called a character card; in Ollama it’s the SYSTEM instruction in your Modelfile.
Edit the Modelfile to bake the character in:
FROM ./your-model.Q4_K_M.gguf
PARAMETER temperature 0.9
PARAMETER num_ctx 8192
SYSTEM """
You are Mara, a warm, witty woman in her late twenties.
You are affectionate, a little teasing, and emotionally present.
You speak in first person, never break character, and never refer
to yourself as an AI or assistant. You remember details the user
shares and bring them up naturally in later conversation.
"""
Two parameters do real work here. temperature controls creativity — around 0.8–1.0 keeps her lively without going incoherent. num_ctx is the context window: how much of the recent conversation she can “see” at once. Bumping it to 8192 (or higher if your VRAM allows) lets her track a longer scene before the earliest messages fall off the back.
Rebuild after every edit:
ollama create my-companion -f Modelfile
A specific, well-written SYSTEM block is the difference between a chatbot and a companion. Write her voice, her boundaries, her quirks. If you’d rather build characters in a richer interface with avatars, saved cards, and chat history, SillyTavern over Ollama is the popular front end, with an easier alternative if SillyTavern feels heavy.
Run It Fully Offline and Verify (Airplane-Mode Test)
The whole point of local is that the model needs the internet exactly once — to download — and never again. Here’s how to prove it rather than trust it.
After the weights are on disk, disable your network: turn on airplane mode, pull the Ethernet cable, or kill Wi-Fi. Then start a conversation:
ollama run my-companion
If she replies with no connection, you’ve proven the inference is 100% on your machine. This is the airplane-mode test, and it’s the only honest way to confirm a “private” setup — a cloud app simply goes dark here.
Under the hood, Ollama serves a local API at 127.0.0.1:11434 — the loopback address, meaning traffic never leaves your computer. You can sanity-check that nothing is calling out by watching connections while you chat; a correctly configured local stack shows no outbound traffic to model servers. For the deeper question of whether Ollama itself phones home, is Ollama really private walks through exactly what it does and doesn’t send.
Persistent Memory: Why Default Setups Forget You, and the Fix
Here’s the limitation people hit a week in: a bare local model has no long-term memory. It only knows what fits inside the context window (num_ctx) for the current session. Close the terminal and she forgets your name, your inside jokes, the story you were building. This isn’t a bug — language models are stateless by default, and the context window is a sliding window, not a diary.
There are three common fixes, in increasing order of effort:
- A bigger context window. Raising
num_ctxkeeps more of the conversation alive at once. It helps within a session but doesn’t survive a restart, and large contexts cost VRAM and speed. - A persistent system prompt. Periodically writing the important facts (“the user’s name is Sam, we’re planning a trip to Lisbon”) back into the
SYSTEMblock makes them permanent — crude, manual, but effective. - A real memory layer. Front ends and purpose-built apps add a database or vector store that records salient facts and re-injects them into context each session, giving the feel of a partner who genuinely remembers. This is the proper fix, and it’s covered in running local AI with persistent memory.
If continuity matters to you — and for a companion it’s the whole experience — assume you’ll need option 2 or 3. The model is the easy part; memory is the part that makes her feel real.
Privacy Guarantees: Nothing Leaves the Machine
This is the reason to do any of this. When the model runs locally, your conversations are files on your own disk — not rows in a company’s database. Compare the two architectures honestly:
| Local companion (this guide) | Cloud companion app | |
|---|---|---|
| Where chats are processed | Your GPU/CPU | Company servers |
| Where chats are stored | Your disk only | Their database, by necessity |
| Content filtering | None (you choose the model) | Provider’s policy |
| Works offline | Yes | No |
| Who can read it | You | Whoever has server access |
That last row is the crux. A cloud AI companion necessarily stores your messages server-side — it can’t process them otherwise — and what happens to them next is governed by that company’s privacy policy and terms, which can change. This isn’t an accusation against any specific app; it’s the unavoidable shape of hosted software. If you want to evaluate a particular service, read its policy with clear eyes — the AI companion privacy guide shows what to look for, and are AI girlfriend apps safe covers the category. The local approach sidesteps the entire question: there’s no server, so there’s nothing to leak, subpoena, train on, or breach. For the broader pattern of why hosted models filter and refuse, see why cloud AI censors you.
Skip All of It: Ember (Local, One-Click) or Freya (No GPU)
Everything above genuinely works, and if you enjoy the build, do it — you’ll understand your setup better than any app user. But it’s a real project: choosing weights, writing Modelfiles, wiring up a front end, solving memory. If you’d rather have the privacy without the assembly, Ember packages this whole stack — a roleplay-tuned model, an uncensored personality, and persistent memory — into a one-click app that runs 100% on your own machine on top of Ollama. Same airplane-mode guarantee, none of the Modelfile homework, and it’s bought once, not subscribed to.
And if your hardware can’t run a good model — no GPU, an old laptop, a work machine — that’s the one case where local isn’t the answer. That’s where Freya comes in: our hosted option, a companion that needs no GPU and no setup, so an old laptop or a work machine still gets a real partner. You trade the local-privacy guarantee for zero hardware requirements — a different bargain than this guide, but the right one when there’s simply nothing on your machine to run a good model on.
