META: Your local model forgets mid-chat because Ollama caps context at 4096 tokens. Here is how to raise it with num_ctx, and what it costs in VRAM.


You’re deep into a good conversation with your local model. It knew your name, the project you were venting about, the joke from earlier. And then, suddenly, it doesn’t. It asks a question you answered five minutes ago. It contradicts something it said a paragraph up. Nothing changed in your prompt, nothing crashed, and the model is the same one you’ve been running for weeks. So what happened?

In almost every case, the answer is the same: Ollama capped your context window — on most consumer cards at 4096 tokens — and your conversation just scrolled past the edge. The good news is this is a settings problem, not a model problem, and you can fix it in about a minute once you know which knob to turn. This guide walks through exactly why it happens, the three ways to increase the Ollama context window, what it costs you in VRAM and speed, and the limit of what a bigger window can actually do.

The symptom: your companion forgets earlier in the same conversation

The failure mode is specific and worth naming, because it’s easy to misdiagnose as “the model is dumb.”

  • The model remembers the start of a long chat but forgets the middle, or forgets the start once the chat gets long.
  • It re-asks for details you already gave (your name, the file you’re editing, the character it’s playing).
  • It loses the thread of a roleplay or a coding session right when things get good — usually after a few thousand words of back-and-forth.
  • Restarting the conversation “fixes” it temporarily, because a fresh chat is short again.

This is not the model being low-quality. A 7B or 8B model with a properly sized window will hold a coherent thread for a long time. What you’re seeing is the oldest tokens falling out of the window — the model literally cannot see them anymore, so from its perspective they were never said. It’s the difference between forgetting and never having known.

The cause: Ollama’s default context window is sized to your VRAM

Here’s the part that trips up nearly everyone. Most modern open-weight models advertise large context windows — 8K, 32K, 128K tokens. But by default, Ollama does not necessarily use the model’s full context. Recent versions size the default window to the VRAM Ollama detects: roughly 4096 tokens on cards under 24GB, 32768 on 24–48GB cards, and 262144 (256K) at 48GB and up. So if you’re on a typical 8–16GB consumer GPU, you’re almost certainly capped at 4096 tokens (num_ctx = 4096) regardless of the model’s true limit — which is exactly where most people running a companion sit. (For reference, before this tiered behavior Ollama’s flat default was 4096, and earlier still it was 2048.)
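To make the tiers concrete, here is a minimal sketch of that lookup in Python. It only mirrors the numbers quoted above; the function name is made up for illustration, and the exact boundary handling (24 counts as the middle tier, 48 as the top tier) is an assumption, not Ollama's actual code.

```python
def default_num_ctx(vram_gb: float) -> int:
    """Approximate the VRAM-tiered default context length described above.

    Assumption: boundaries are 'under 24', '24 up to 48', and '48 and up'.
    """
    if vram_gb < 24:
        return 4096
    if vram_gb < 48:
        return 32768
    return 262144


for vram in (8, 16, 24, 48):
    print(f"{vram} GB -> default num_ctx {default_num_ctx(vram)}")
```

If your card is in the first tier, everything else in this guide applies to you.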

Why a default at all? Because context costs memory (more on that below), and Ollama is trying to keep an out-of-the-box ollama run from blowing past your VRAM and dumping the model onto the CPU. It’s a conservative safety default, not a recommendation.

A token is roughly ¾ of a word in English. So 4096 tokens is somewhere around 3,000 words — and that budget has to cover the system prompt, the persona, every message you’ve sent, and every reply the model has generated. A detailed companion persona can eat 500–1000 tokens before you’ve said a single word. You can see how a chatty session burns through 3,000 words fast. Once you cross the line, Ollama drops the oldest turns to make room, and the model “forgets.”
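The budget arithmetic is easy to play with. The sketch below uses the ¾-word rule of thumb instead of a real tokenizer, and a simple "drop the oldest turns until it fits" policy; both are simplifying assumptions for illustration, and Ollama's own truncation may differ in detail. The function names are hypothetical.

```python
TOKENS_PER_WORD = 4 / 3  # rule of thumb: one token is about 3/4 of an English word


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def visible_turns(system_prompt: str, turns: list, num_ctx: int) -> list:
    """Return the newest turns that still fit after reserving the system prompt."""
    budget = num_ctx - estimate_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break  # this turn and everything older falls out of the window
        budget -= cost
        kept.append(turn)
    return list(reversed(kept))


persona = "word " * 600  # about 800 tokens of persona
turns = [f"turn {i} " + "word " * 298 for i in range(12)]  # about 400 tokens each
kept = visible_turns(persona, turns, num_ctx=4096)
print(f"{len(kept)} of {len(turns)} turns still visible at num_ctx=4096")
```

With these made-up sizes, the four oldest turns are already invisible to the model, which is exactly the "forgetting" described above.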

You can check what a model declares, and which parameters are baked into it, with:

ollama show llama3.1

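Look at the parameters and the model’s declared context length. If the model says it supports 131072 but you’re on an 8–16GB card and never raised num_ctx, you’re still running at 4096. On recent Ollama versions, ollama ps also lists the context length actually allocated to a loaded model, which is the quickest way to confirm what you’re running.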
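If you'd rather check this from a script, the same information is available from Ollama's /api/show endpoint. The sketch below only parses a response that has already been fetched. It assumes the response carries a model_info map with an "<architecture>.context_length" key and a plain-text parameters field, which is how the API reference describes it; verify against your version. The sample dictionary is hand-written for illustration, not real output.

```python
from typing import Optional


def declared_context(show: dict) -> Optional[int]:
    """Return the model's declared context length from model_info, if present."""
    info = show.get("model_info", {})
    arch = info.get("general.architecture")
    return info.get(f"{arch}.context_length") if arch else None


def baked_num_ctx(show: dict) -> Optional[int]:
    """Return num_ctx from the 'parameters' text, or None if the model never sets it."""
    for line in show.get("parameters", "").splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "num_ctx":
            return int(parts[1])
    return None


# Hand-written sample in the assumed shape of a show response.
sample = {
    "parameters": 'stop    "<|eot_id|>"',
    "model_info": {"general.architecture": "llama", "llama.context_length": 131072},
}

print("declared:", declared_context(sample))
print("baked-in num_ctx:", baked_num_ctx(sample))  # None means the server default applies
```

A large declared value next to a missing num_ctx is the telltale combination: the model could go long, but nothing is asking it to.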

Three ways to raise it: API num_ctx, Modelfile PARAMETER, or OLLAMA_CONTEXT_LENGTH

There are three places to set the context window. They differ in scope — per-request, per-model, or global — and which one you want depends on how you run Ollama.

Method | Scope | Persists? | Best for
num_ctx in the API request | That single request | No | Apps and scripts that talk to the API
PARAMETER num_ctx in a Modelfile | A custom model you bake | Yes | A persona/model you reuse constantly
OLLAMA_CONTEXT_LENGTH env var | Every model on the server | Yes (until you change it) | Setting one sane default for everything

1. The API num_ctx option (per request). If you’re calling Ollama from code or a front-end, pass num_ctx in the options block. This overrides the default for that call only:

curl http://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "hello"}],
  "options": { "num_ctx": 16384 }
}'

Most well-built front-ends (Open WebUI, SillyTavern, your own Python script) expose this as a context-length or num_ctx setting. If your tool has a context slider, this is the lever it’s pulling under the hood.
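For the "your own Python script" case, here is a minimal standard-library sketch of the same request as the curl call above. The helper names are made up for illustration; the URL, the options.num_ctx field, and the non-streaming response shape follow Ollama's chat API as I understand it, so double-check them against your installed version.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"


def build_chat_request(model: str, messages: list, num_ctx: int) -> urllib.request.Request:
    """Build a POST request that asks for a specific context window on this call only."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, messages: list, num_ctx: int = 16384) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, messages, num_ctx)) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage, with an Ollama server running locally:
#   print(chat("llama3.1", [{"role": "user", "content": "hello"}]))
```

Keeping num_ctx as an explicit argument makes it obvious in your own code which window each call is asking for.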

2. A Modelfile PARAMETER (per model). If you want a model that always runs at a bigger window — say, your custom companion persona — bake it in with a Modelfile:

FROM llama3.1
PARAMETER num_ctx 16384
SYSTEM "You are..."

Then build it:

ollama create my-companion -f Modelfile

Now ollama run my-companion always uses 16K context, no per-request flag needed. This pairs naturally with building a persona — see the Ollama setup and Modelfile guide for the SYSTEM, TEMPLATE, and parameter blocks.

3. The OLLAMA_CONTEXT_LENGTH environment variable (global default). Newer Ollama versions read this env var to set the default context for the whole server. Set it before starting Ollama:

OLLAMA_CONTEXT_LENGTH=8192 ollama serve

On a systemd setup you’d add it as an Environment= line in the service override. This is the cleanest fix if you want everything to start at, say, 8K without touching each model or request. (If you’re following an older guide that mentions OLLAMA_NUM_CTX, treat it as the same idea under a different name; the variable current Ollama documents is OLLAMA_CONTEXT_LENGTH, so set that one.)

Order of precedence: a per-request num_ctx beats a Modelfile PARAMETER, which beats the global env var, which beats Ollama’s built-in default. So if you set the env var to 8192 but your front-end is still quietly sending num_ctx: 4096, the front-end wins and you’ll think the fix didn’t work. When in doubt, check the tool that’s actually sending the request.
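The precedence rule fits in a few lines. This is a toy model of the ordering described above, not Ollama's code; the function name is hypothetical and the built-in default of 4096 assumes a card in the under-24GB tier.

```python
from typing import Optional

BUILT_IN_DEFAULT = 4096  # assumption: the default on a card under 24GB


def effective_num_ctx(
    request: Optional[int] = None,
    modelfile: Optional[int] = None,
    env: Optional[int] = None,
    built_in: int = BUILT_IN_DEFAULT,
) -> int:
    """Pick the first value that is set: request, then Modelfile, then env var."""
    for value in (request, modelfile, env):
        if value is not None:
            return value
    return built_in


# The front-end quietly sends 4096, so the 8192 env var never takes effect.
print(effective_num_ctx(request=4096, env=8192))
# No per-request value: the Modelfile PARAMETER wins over the env var.
print(effective_num_ctx(modelfile=16384, env=8192))
```

The first call is the trap from the paragraph above: the env var is set correctly and still loses.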

The cost: KV-cache VRAM grows with context, and the speed cliff when it spills to RAM

Context isn’t free, and this is the whole reason Ollama caps it by default. Every token in the window has to be held in the KV cache (key–value cache) — the model’s working memory of the conversation. The KV cache grows linearly with context length. Double num_ctx, and you roughly double the cache’s memory footprint. That memory lives in VRAM, right next to the model weights.

So your total VRAM bill is roughly: model weights + KV cache + a bit of overhead. Bump num_ctx from 4096 to 32768 and the cache can swell from a few hundred megabytes to several gigabytes, depending on the model’s size and architecture.
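You can put rough numbers on this. A common back-of-the-envelope estimate multiplies keys and values (2) by layers, KV heads, head dimension, bytes per value, and tokens. The sketch below applies it to Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and assumes a 16-bit cache. It ignores runtime overhead and any cache quantization, so treat the result as a lower bound rather than what Ollama will actually allocate.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: 16-bit (f16) cache entries
) -> int:
    """Estimate KV-cache size: keys plus values, for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


GIB = 1024 ** 3
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 16384, 32768):
    size = kv_cache_bytes(ctx, **LLAMA_31_8B)
    print(f"num_ctx={ctx:>6}: ~{size / GIB:.2f} GiB of KV cache")
```

Under these assumptions each token costs 128 KiB, so 4096 tokens is about half a gibibyte and 32768 is about four: the same linear growth described above.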

The failure here is brutal and sudden. As long as everything fits in VRAM, the GPU is fast. The moment the model weights plus the KV cache exceed your VRAM, Ollama offloads layers to system RAM and runs them on the CPU. That’s the speed cliff: you can drop from a snappy 40+ tokens/second to a painful 2–3 tokens/second, where each reply crawls out word by word. (For what counts as usable, see how many tokens per second you actually need.)

If you push context too far you’ll often get a hard stop instead — a CUDA out-of-memory error on the GPU. If you’ve hit that wall, the VRAM-for-companions guide covers how to size your model and context so they fit back in VRAM. The key insight: a bigger context window can quietly turn a fast GPU model into a slow CPU one, so raise it deliberately, not all the way to the maximum “because it’s there.”

How big is too big: matching context to your VRAM

There’s no single magic number: it depends on the model size, the quantization, and your card. But here’s a sane starting frame. These are conservative, real-world ballparks assuming a Q4-class quantized model in the size range listed for each tier (for example, a 7B–8B at Q4_K_M on an 8 GB card) and leaving headroom so you don’t fall off the speed cliff:

Your VRAM | Comfortable model size | Reasonable num_ctx to start
8 GB | 7B–8B (Q4) | 8192
12 GB | 8B–13B (Q4) | 16384
16 GB | 13B–14B (Q4/Q5) | 16384–32768
24 GB | 24B–32B (Q4) | 32768
48 GB+ | 70B (Q4) | 32768+

Treat these as opening bids. The honest way to tune it: set a value, start a long conversation, and watch your VRAM usage (nvidia-smi on NVIDIA, or your system monitor). If the GPU stays loaded and tokens/second stays high, you have room to go bigger. The instant tokens/second tanks, you’ve spilled into RAM — back off. Smaller quantization (lower-precision weights) buys you headroom for either a bigger model or a bigger window; the VRAM-for-companions guide breaks the budget down model by model.
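Before you start the trial-and-error loop, a quick estimate can tell you roughly where to begin. This sketch picks the largest power-of-two window whose KV cache fits next to the weights. Every input is an assumption you should replace with your own numbers: the 4.9 GiB weight size (roughly an 8B model at Q4_K_M), the 128 KiB per token of cache (an 8B Llama-style model with a 16-bit cache), and the 1.5 GiB reserved for runtime overhead and your desktop. The function name is hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def max_num_ctx(
    vram_gib: float,
    weights_gib: float,
    kv_bytes_per_token: int,
    overhead_gib: float = 1.5,  # assumption: runtime buffers plus desktop usage
    candidates: tuple = (2048, 4096, 8192, 16384, 32768, 65536, 131072),
) -> Optional[int]:
    """Return the largest candidate window whose KV cache fits in the leftover VRAM."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    fitting = [ctx for ctx in candidates if ctx * kv_bytes_per_token <= free_bytes]
    return max(fitting) if fitting else None


# Assumed example: an 8B model at Q4 (about 4.9 GiB) with 128 KiB of cache per token.
for vram in (8, 12, 16):
    print(f"{vram} GB card -> start around num_ctx={max_num_ctx(vram, 4.9, 128 * 1024)}")
```

A result of None means the model itself barely fits, so the fix is a smaller model or quantization rather than a smaller window. Whatever number comes out, confirm it the honest way: watch VRAM and tokens/second during a long chat.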

Context window vs true long-term memory — why a bigger window isn’t a memory system

Now the part most “increase your context” guides skip, and the most important thing to understand: a bigger context window is not memory. It’s a bigger short-term buffer.

Everything inside the window is something the model has to attend to on every single turn. That’s why context costs VRAM and slows generation: it’s not stored knowledge, it’s a transcript held in the KV cache that the model looks back over each time it speaks. The moment a conversation exceeds the window, the oldest tokens fall out and are gone. And critically: the window does not survive a restart. Close the app, reboot the machine, start a new chat tomorrow, and the window is empty again. The model has no idea who you are.

Real long-term memory is a different architecture entirely. It means persisting facts outside the model — in a database or a vector store — and retrieving the relevant ones to inject into the context window when they matter. That’s the difference between a model that can hold today’s conversation and a companion that remembers your name, your history, and last week’s conversation a month from now. We go deep on this distinction in local AI with persistent memory.
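To show the shape of that architecture, here is a deliberately tiny sketch: facts live in SQLite, and the relevant ones are pulled back into the system prompt on each turn. Everything in it is hypothetical, including the table layout and function names, and the keyword-overlap scoring is a crude stand-in for the vector search a real memory layer would use.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the fact store. Pass a file path instead of ':memory:' to survive restarts."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
    return db


def remember(db: sqlite3.Connection, fact: str) -> None:
    """Persist one fact outside the model."""
    db.execute("INSERT INTO facts (text) VALUES (?)", (fact,))
    db.commit()


def _words(text: str) -> set:
    return {w.strip(".,!?").lower() for w in text.split()}


def recall(db: sqlite3.Connection, message: str, limit: int = 3) -> list:
    """Return stored facts sharing the most words with the message (toy retrieval)."""
    query = _words(message)
    scored = []
    for (text,) in db.execute("SELECT text FROM facts ORDER BY id"):
        overlap = len(query & _words(text))
        if overlap:
            scored.append((overlap, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]


def build_system_prompt(persona: str, facts: list) -> str:
    """Inject the retrieved facts into the context window for this turn."""
    if not facts:
        return persona
    return persona + "\n\nKnown facts about the user:\n" + "\n".join(f"- {f}" for f in facts)


db = open_store()
remember(db, "The user's name is Sam.")
remember(db, "Sam is building a garden sensor project.")
remember(db, "Sam prefers short replies.")
facts = recall(db, "How is my garden project going?")
print(build_system_prompt("You are a friendly companion.", facts))
```

The point of the design is that the window only ever carries the handful of facts that matter right now, so memory can grow without num_ctx growing with it.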

So raising num_ctx is the right fix for “it forgets mid-conversation.” It is not a fix for “it doesn’t remember me between sessions.” Those are two different problems, and conflating them is why people crank context to 128K, melt their VRAM, and still get a companion with amnesia every morning.

How a managed companion handles memory so you never tune this by hand

Here’s the trade-off with raw Ollama: it gives you total control, and total control means you own the tuning. You pick num_ctx, you balance it against VRAM, you build the memory layer that survives restarts, you watch the speed cliff. That’s genuinely satisfying if you like owning your stack — and if that’s you, the full local-AI walkthrough is the place to start.

But a lot of people just want the companion to remember, not to manage a KV cache. A purpose-built local companion handles all of this for you: it sizes the context window to your hardware, manages what stays in the window versus what gets persisted to a real memory store, and surfaces the right facts on each turn — so it remembers you across sessions without you ever editing a Modelfile.

If you’d rather skip the num_ctx math and have a private, fully-local companion that handles its own memory on your own machine, Ember is built exactly for that — it runs 100% locally on Ollama and manages the context-versus-memory split so you don’t have to.