Most local AI setups have the memory of a goldfish. You spend an evening getting a model talking just the way you like, you tell it your name, your projects, the things you’re working through — and the next time you launch it, you’re a stranger again. It greets you like the first day, every day. That’s not a flaw in your model. It’s the default behavior of every language model on Earth: they don’t remember anything you don’t paste back in.
This guide explains exactly why that happens, the real difference between a model’s context window and actual memory, and the practical ways to fix it — from do-it-yourself plumbing to apps that ship persistent memory out of the box. The goal is a local AI that remembers conversations the way a person would: recalling your name weeks later, picking up a thread from last month, building on what it already knows about you.
Why most local setups forget you
When you run ollama run llama3.1:8b and start chatting, the model is stateless. It has no hard drive of its own, no diary, no concept of “yesterday.” Each time you send a message, the entire conversation so far is fed back into the model as one long block of text. The model reads that block, predicts the next words, and then forgets everything the instant it finishes.
The illusion of memory within a single session comes from your chat tool quietly re-sending the whole transcript on every turn. That’s why a long chat feels coherent — the tool is re-feeding the history each time. But close the terminal, or start a new session, and that transcript is gone. There’s no automatic “save my relationship with this user” step. Per-session context is all you get by default, and it evaporates when the session ends.
This catches almost everyone who follows a basic how to run AI locally walkthrough. The model is excellent. The forgetting is structural, not a quality problem.
The difference between a context window and real memory
These two get conflated constantly, and the distinction is the whole point of this article.
A context window is the model’s short-term working memory — the maximum amount of text it can “see” at once, measured in tokens (roughly ¾ of a word each). An 8B model might have an 8K, 32K, or 128K-token window. Everything in that window is visible to the model right now. Everything outside it does not exist as far as the model is concerned.
Real memory is different: it’s information that persists across sessions and gets selectively pulled back into the context window when relevant. The model itself still has no memory — but a layer around it stores facts, summaries, and past conversations on disk, and re-injects the right pieces at the right time.
| Context window | Persistent memory | |
|---|---|---|
| Lives where | In RAM/VRAM, only during inference | On disk, across restarts |
| Lifespan | One session, then gone | Indefinite |
| Size limit | Hard token cap (e.g. 32K) | Effectively unlimited |
| Cost of growth | More VRAM, slower replies | Cheap storage + retrieval logic |
| What it is | The model’s eyes | The model’s notebook |
The trap: people try to fake memory by cramming everything into a giant context window. It works until it doesn’t. Bigger windows eat VRAM, slow every reply down, and the model still loses focus when its window fills with thousands of lines of old chat. You can’t brute-force a notebook by making the eyes bigger. Persistent memory means storing things outside the window and retrieving them on demand.
DIY approaches: RAG, Mem0, and summary memory
If you want to build memory yourself on top of Ollama, there are three established patterns, from simplest to most capable.
1. Summary memory (the easiest). After each session, you ask the model to write a short summary of what was discussed and what it learned about you. You save that summary to a text file. Next session, you paste the summary in (or your tool does) as part of the system prompt. It’s crude but genuinely effective — a few hundred words of “here’s what I know about this person” goes a long way. The downside: summaries lose detail and drift over time.
2. RAG (Retrieval-Augmented Generation). This is the serious approach. You store every past message — or chunked notes — in a vector database (e.g. Chroma, Qdrant, or a SQLite-based store). Each chunk gets an embedding, a numeric fingerprint of its meaning. When you send a new message, the system embeds that too, searches the database for the most semantically similar past chunks, and injects only those into the context window. The model “remembers” your dog’s name from three weeks ago because the retrieval step found the relevant chunk and pasted it back in. RAG is the same machinery people use to chat with documents locally — pointed at your conversation history instead of PDFs.
3. Mem0 and dedicated memory layers. Tools like Mem0 sit between you and the model and automate the whole loop: they extract salient facts (“user is vegetarian,” “user’s project is called Atlas”), store them, deduplicate them, decide what’s worth keeping, and retrieve them automatically. It’s RAG plus opinionated fact-extraction and lifecycle management. It runs locally and plugs into the Ollama API on 127.0.0.1:11434.
A bare-bones local memory stack looks like this conceptually:
your message
→ embed it
→ search vector DB for relevant past memories
→ prepend those memories + a running summary to the prompt
→ send to ollama (127.0.0.1:11434)
→ after the reply, extract new facts → store them back
Why this is fiddly to get right
Reading that diagram, it sounds tidy. Building it is not. The hard parts are the ones nobody warns you about:
- What’s worth remembering? Store everything and your retrieval fills with noise; the model gets “your name is Sam” alongside fifty irrelevant fragments. Store too little and it forgets the thing that mattered.
- Retrieval quality. Semantic search returns the closest chunks, not necessarily the right ones. Tune the similarity threshold wrong and it either pulls junk or misses the obvious.
- Summary drift. Each re-summarization of a summary loses fidelity, like a photocopy of a photocopy. After a month your “memory” is a blurry caricature.
- Context budget. Every memory you inject costs tokens that could go to the actual conversation. You’re constantly trading recall against window space.
- Conflicting facts. You said you liked X in March and disliked it in June. Which wins? Naive systems keep both and confuse the model.
- Embedding + DB plumbing. You’re now running an embedding model, a vector store, and glue code — and keeping them in sync as your history grows.
None of this is impossible. People build it for fun, and it’s a great weekend project if you enjoy the wiring. But it’s real engineering, it breaks in subtle ways, and “my AI forgot me again” usually traces to one of the six issues above. This is the honest gap between a model that can remember and a system that reliably does.
What companion apps do differently
A purpose-built AI companion app with memory solves all of that for you — because remembering you is the entire product, not a bolt-on. Instead of you tuning retrieval thresholds, the app ships a tested memory pipeline: it decides what to store, summarizes intelligently, retrieves the right facts, resolves conflicts, and manages the context budget so replies stay fast and coherent.
The practical difference is night and day. With a DIY stack you’re a database administrator for your own chats. With a good companion app, you just talk — and weeks later it still knows your name, your history, the running jokes, the things you told it once. That’s the feature people actually want when they search for an ai girlfriend that remembers you: not a vector database, but the felt sense of continuity. The plumbing is real; you just shouldn’t have to be the plumber.
Memory + no-filter: the combination cloud apps can’t give you privately
Here’s where it gets interesting, and where the local angle stops being merely technical.
Persistent memory is intimate by definition. To remember you usefully, the system has to store the real details — your name, your moods, your private thoughts, the unfiltered conversations. On a cloud companion app, all of that lives on someone else’s server. Cloud companion apps necessarily store messages server-side to function across devices, and you’re trusting their retention and privacy policies with the most personal log you’ll ever generate. (It’s worth knowing whether Character.AI or Replika can read your chats before you pour your inner life into one.)
Now stack the second thing people want: no filter. Cloud companions are heavily moderated — partly by genuine policy, partly because payment processors and app stores force it. So the cloud offer is: we’ll remember everything about you, store it on our servers, and still refuse half of what you actually want to talk about. You get the surveillance without the freedom.
Local flips both at once. A model running on your machine can carry deep persistent memory and run uncensored — because the memory file never leaves your disk and no remote policy team is in the loop. Memory and no-filter only combine privately when the whole thing is local. That’s the core argument in local AI vs cloud AI, and it’s the entire reason the local approach exists. (For the privacy mechanics of any companion setup, the AI companion privacy guide goes deeper.)
Local memory you own (Ember) vs hosted memory, no setup (Freya)
Two honest paths, depending on what you’ve got and what you value:
| Local & owned | Hosted & instant | |
|---|---|---|
| Memory lives | On your machine, encrypted, yours | On the provider’s servers |
| Setup | Install once, runs on your hardware | Sign up, start talking |
| Needs a GPU | Helps a lot (see hardware notes) | No — works on any device |
| Privacy ceiling | Maximum — nothing leaves your disk | Bounded by their policy |
| No-filter + memory | Both, fully private | Depends on the service |
| Best for | Owners, self-hosters, max privacy | ”I want it now, no GPU” |
Ember is the local route: a sold-once companion that runs 100% on your own machine via Ollama, with persistent memory stored locally — the freedom and privacy of the DIY stack, minus the weekend of wiring. If you’re building toward this yourself, how to run an AI girlfriend locally covers the full local path.
Freya is the hosted route: cloud-based, zero setup, persistent memory built in, nothing to install — for the reader who doesn’t have a GPU or simply wants to start talking tonight.
Same core feature — an AI that remembers you — two delivery models. Pick by your hardware and your privacy bar.
Setting up basic persistent memory yourself
If you do want to roll your own, here’s the minimum viable version on top of a working Ollama install:
-
Start with a summary file. Create
memory.txt. After each session, prompt your model: “Summarize what you learned about me and what we discussed, as concise bullet points.” Append it. -
Inject it on launch. Put the contents of
memory.txtinto your system prompt every new session. Many front ends (Open WebUI, SillyTavern) have a persistent system-prompt or “character notes” field for exactly this. -
Graduate to retrieval when the file gets big. Once
memory.txtis too long to inject wholesale, switch to a vector store. A lightweight option ischromadbplus a local embedding model:
ollama pull nomic-embed-text
Embed each memory chunk, store it in Chroma, and at query time retrieve the top few matches to prepend — instead of pasting the whole file.
- Let a memory layer automate it. When the manual loop gets tedious, drop in Mem0 (or a similar library) pointed at your local Ollama endpoint. It handles fact extraction, dedup, and retrieval so you stop babysitting it.
Start at step 1 — a plain summary file genuinely surprises people with how much continuity it buys for ten minutes of effort. Climb the ladder only when you outgrow each rung.
Building memory yourself is a satisfying project, and now you know exactly where the sharp edges are. But if what you actually want is to talk to something that remembers you — privately, without filters, without becoming the database admin for your own conversations — that’s a solved problem, whether you’d rather own it on your own machine (Ember) or skip the hardware entirely and start tonight in the cloud (Freya).
