Ollama Modelfile Guide: Bake a Persistent Persona Into Your Model

Learn how to write an Ollama Modelfile with a custom SYSTEM prompt, set PARAMETER values, and run ollama create to bake a persistent persona into any local

A Modelfile is the closest thing Ollama has to a recipe card. It lets you take any base model you already pulled, staple a personality and a set of behavior knobs onto it, and save the result as a new named model you can run forever. No code, no fine-tuning, no GPU rental — just a plain text file and one command. If you’ve ever gotten tired of pasting the same long system prompt into a chat every single session, this is the fix. By the end of this guide you’ll have written a complete ollama modelfile custom system prompt, tuned its sampling parameters, and created a reusable companion model that boots up already in character.

This assumes you already have Ollama running. If you don’t, start with how to install Ollama and pull at least one base model first.

What a Modelfile is and what `ollama create` does

A Modelfile is a small declarative text file — think of it as a Dockerfile for language models. It names a base model, then layers configuration on top: a system prompt, sampling parameters, a chat template, even a stop sequence. You write it once, run a single build command, and Ollama produces a brand-new model in your local library that carries all of that baggage automatically.

The build command is:

ollama create my-persona -f Modelfile

my-persona is whatever you want to call the result. -f Modelfile points at your recipe file. Ollama reads the FROM line, copies the base model’s weights (it does not duplicate gigabytes on disk — it references the existing layers), bakes in your settings, and registers my-persona so you can run it like any other model:

ollama run my-persona

The key thing to understand: you are not retraining or fine-tuning the model. The weights are untouched. A Modelfile is a configuration wrapper. That’s why builds are instant and why this is the right tool for “I want a consistent persona,” not “I want the model to learn new facts.” Fine-tuning is a different, far heavier process. For 95% of persona work, the Modelfile is all you need.

FROM: choosing your base model

Every Modelfile starts with FROM. This is the single most consequential line, because the base model determines the personality ceiling — its writing quality, its context length, and crucially whether it will stay in character or break with a refusal.

FROM llama3.1:8b

You can point FROM at any model already in your library (run ollama list to see them) or any tag on the Ollama registry. Pick the base by what you have for VRAM and what you want the model to do:

Use case	Sensible base category	Why
General assistant, 8GB GPU	An 8B instruct model (Q4_K_M)	Fits comfortably, fast, coherent
Roleplay / companion, 12-24GB	A 12-24B model tuned for chat	Better prose, longer attention span
Refusal-free persona	An uncensored or abliterated model	Won’t drop character to moralize

That last row matters more than people expect. A persona system prompt asking the model to be a flirty, blunt, or emotionally open companion will collide with the safety alignment baked into mainstream instruct models — the model breaks character to lecture you. If that’s your goal, start from a model designed for it; see the best uncensored local AI models and the explainer on abliterated models for what those terms actually mean. The quantization tag (Q4_K_M, Q5_K_M, etc.) trades VRAM for fidelity — the GGUF quantization cheat sheet breaks down which tag to pick for your card.

SYSTEM: writing a persona/system prompt that actually sticks

The SYSTEM block is where the persona lives. Whatever you put here is injected at the top of every conversation, before the user’s first message, on every single run. This is the entire reason to build a Modelfile.

SYSTEM """
You are Mara, a sharp, warm conversational partner.
You speak casually, use contractions, and never narrate
your own reasoning. You remember the user prefers short
replies. You do not break character to give disclaimers.
"""

A few rules separate a persona that holds from one that dissolves after three messages:

Write in second person, present tense. “You are Mara. You speak casually.” Models follow “you are X” far more reliably than “this is a character named X.”
Specify behavior, not just identity. “You are friendly” is weak. “You use contractions, ask follow-up questions, and keep replies under four sentences unless asked for detail” is enforceable.
State what NOT to do. Models drift toward their default assistant voice. Explicitly forbid the failure mode: “Do not start replies with ‘As an AI.’ Do not add safety disclaimers.”
Keep it tight. A 150-300 word system prompt that’s all signal beats a 1,000-word essay the model half-ignores. Every sentence competes for attention with the actual conversation.
Pin format preferences. If you hate bullet-point answers or emoji, say so here once instead of correcting it forever.

The honest limitation: the system prompt sets the opening conditions. As a conversation grows long, the persona can fade as the model’s attention spreads across more tokens — which is partly a context-window problem we’ll address with num_ctx next.

PARAMETER: temperature, num_ctx, repeat_penalty and what each changes

PARAMETER lines set sampling and runtime knobs. These are the difference between a persona that feels alive and one that feels like a broken record. The three that matter most for companion and creative use:

PARAMETER temperature 0.8
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.15

Parameter	What it controls	Practical guidance
`temperature`	Randomness / creativity	0.2-0.4 for factual/coding, 0.7-0.9 for conversation and roleplay, 1.0+ gets erratic
`num_ctx`	Context window size (tokens it can “see”)	Default is often only 2048. Raise to 8192+ so the model remembers earlier in the chat — costs more VRAM
`repeat_penalty`	Penalizes reusing recent tokens	~1.1-1.2 stops the “I love that. I love that. I love that.” loop; too high makes it dodge natural repetition

A few others worth knowing: top_p (nucleus sampling, leave near 0.9), top_k (caps candidate tokens), num_predict (max tokens per reply), and stop (sequences that end generation). For a chatty persona, the trio above plus top_p covers most needs.

num_ctx deserves special attention for companions: the default ceiling is the number-one reason a model “forgets” what you said ten messages ago. Bumping it is the single highest-leverage change — but it raises memory use and can slow generation. The full tradeoff is in how to increase the Ollama context window, and if responses crawl after you raise it, see why Ollama is slow and how to speed it up.

A complete example: a custom companion persona, start to finish

Here’s a full, working Modelfile. Save it as a file literally named Modelfile (no extension):

FROM llama3.1:8b

SYSTEM """
You are Mara, a warm, witty companion talking with someone
you genuinely like. You speak casually with contractions,
keep replies to two or three sentences unless asked for more,
and ask the occasional follow-up question. You remember details
the user shares within this conversation and refer back to them.
You never narrate your reasoning, never begin with 'As an AI,'
and never break character to add disclaimers.
"""

PARAMETER temperature 0.85
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.15
PARAMETER top_p 0.9

Build and run it:

ollama create mara -f Modelfile
ollama run mara

That’s it — mara now boots straight into character with the right temperature and a roomy context window, every time, no copy-pasting. Want to inspect what got baked in? Run ollama show mara --modelfile to print the resolved recipe. To tweak the persona, edit the file and re-run ollama create — it overwrites cleanly. If you’d rather drive this model from code instead of the terminal, the Ollama Python API tutorial shows how to point your script at the loopback endpoint on 127.0.0.1:11434.

Because a Modelfile is plain text, it’s trivially shareable and version-controllable — that’s a real advantage over fiddling with settings in a GUI.

Version it like code. Drop the Modelfile into a git repo. Every persona tweak becomes a diff you can review and roll back. This is genuinely the cleanest way to evolve a character over months.
Tag versions in Ollama. Build to distinct names — mara:v1, mara:v2 — so you can A/B two personalities side by side and keep the one you like.
Push to a registry. ollama push can publish your model to the Ollama registry (you’ll need a namespaced name like yourname/mara). The Modelfile references the base by tag, so what you share is small unless you’re bundling custom weights.
Share the recipe, not the gigabytes. For most personas the honest, lightweight move is to share the Modelfile text itself. Anyone with the same base model pulled can rebuild your exact persona in one command.

Limits: a Modelfile persona vs real persistent memory

This is the part most guides skip, and it’s the most important. A Modelfile gives you a persistent persona — the character is permanent. It does not give you persistent memory — the relationship is not.

Here’s the distinction. Your SYSTEM prompt is static. It says the same thing at the start of every chat. So Mara will always be witty and warm — but she has no idea what you talked about yesterday. The moment you close the session, everything the model “learned” about you in that conversation is gone. Start a new chat and you’re strangers again. Worse, within a single long conversation, once you exceed num_ctx, the oldest messages silently fall out of the window — so even mid-session memory is bounded by how much context you allocated.

You can fake durability by hand-editing facts into the SYSTEM block (“the user’s name is Alex, has a dog named Pixel”), but that’s manual, brittle, and doesn’t scale past a handful of details. Real persistent memory — where the companion actually accumulates and recalls history across sessions — requires a layer outside the Modelfile: a database of past conversations, retrieval of relevant snippets, and re-injection at the right moment. That architecture is its own topic, covered in local AI with persistent memory. A Modelfile is the right tool for who the model is; it is the wrong tool for what the model remembers about you.

The shortcut: a tuned companion that handles persona + memory for you

If your goal is a genuinely consistent local AI companion — one that stays in character and remembers you across sessions — a Modelfile gets you the first half and stops. Wiring up the second half (a memory store, retrieval, sensible context management, an uncensored base that won’t break character) is doable, but it’s real engineering on top of Ollama, and it’s exactly the part this guide can’t compress into a PARAMETER line.

That’s the gap Ember closes: a one-time-purchase companion that runs 100% on your own machine through Ollama, with the persona and persistent memory already built and tuned — so you get the local-privacy ownership without hand-rolling the whole stack. If you’d rather keep building it yourself, the Modelfile above is a perfectly good starting point — just pair it with the persistent memory guide and budget a weekend.

Ollama Modelfile Guide: Bake a Persistent Persona Into Your Model

What a Modelfile is and what ollama create does

FROM: choosing your base model

SYSTEM: writing a persona/system prompt that actually sticks

PARAMETER: temperature, num_ctx, repeat_penalty and what each changes

A complete example: a custom companion persona, start to finish

Sharing and versioning your custom model

Limits: a Modelfile persona vs real persistent memory

The shortcut: a tuned companion that handles persona + memory for you

Don't want to assemble it yourself?

Related guides

Ollama Not Using Your GPU? The Complete Fix Guide (2026)

How to Run AI Locally: The Complete Beginner's Guide (2026)

Ollama CUDA Out of Memory: How to Fix It (VRAM Ladder)

What a Modelfile is and what `ollama create` does