If you’ve spent any time in the local-AI roleplay scene, you’ve heard the same two words over and over: SillyTavern and Ollama. Together they’re the de facto “power user” stack for running an uncensored AI companion entirely on your own machine — no cloud, no subscription, no message logs sitting on someone else’s server. SillyTavern is the front-end (the chat interface, character system, and prompt engine); Ollama is the back-end that actually runs the language model on your GPU or CPU. This guide walks the whole thing end to end: install, model pull, connection, character cards, samplers, and the connection errors that trip up almost everyone. It’s accurate to how the stack actually behaves in 2026 — real commands, real settings, no hand-waving.

It’s also honest about the catch. This stack is genuinely powerful and genuinely fiddly. By the end you’ll know exactly what you’re signing up for, and whether a one-click alternative makes more sense for you.

The Stack: What SillyTavern + Ollama Actually Is

People say “SillyTavern” as if it’s an AI. It isn’t. SillyTavern is a front-end — a self-hosted web UI that runs in your browser on localhost. It has no model of its own. What it does have is the best roleplay tooling in the open-source world: character cards, persistent personas, lorebooks (world info), group chats, prompt templates, and fine-grained sampler control.

The actual intelligence comes from a back-end — a program that loads a large language model and answers prompts. Several back-ends exist (KoboldCpp, LM Studio, text-generation-webui, TabbyAPI), but Ollama is the most popular because it’s the simplest: one install, then ollama run <model> and you’re done. Ollama also exposes a local HTTP API at 127.0.0.1:11434 that SillyTavern talks to.

So the mental model is:

| Layer | Job | Where it runs |
| --- | --- | --- |
| SillyTavern | Chat UI, character cards, lorebooks, samplers | Your browser, served from localhost:8000 |
| Ollama | Loads the model, generates text | Background service, API on 127.0.0.1:11434 |
| The model (e.g. a 7B–13B GGUF) | The actual “personality” / reasoning | Loaded into your VRAM by Ollama |

Everything stays on your machine. Nothing leaves the loopback interface unless you deliberately configure it to. That’s the whole appeal — and it’s the opposite of how a cloud companion app works, where your messages necessarily live on the company’s servers.

If the conceptual layer is new to you, our guide on how to run AI locally covers the fundamentals before you dive into the roleplay-specific stack here.

Install Ollama and Pull a Roleplay Model

Start with the back-end, because SillyTavern is useless without a model to talk to.

Install Ollama (Linux / macOS):

curl -fsSL https://ollama.com/install.sh | sh

On Windows and macOS you can also grab the desktop installer from ollama.com. Once installed, Ollama runs as a background service and exposes its API at 127.0.0.1:11434. Confirm the CLI is installed (the command also prints a warning if it can’t reach a running server):

ollama --version
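To check that the background service itself is answering, and not only that the CLI exists, a minimal Python sketch can request the API root, which Ollama answers with a short plain-text banner. The helper name `ollama_is_up`, the timeout, and the default URL (the stock port) are illustrative assumptions.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url (hypothetical helper)."""
    try:
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as "not running".
        return False
    # The API root replies with the banner "Ollama is running".
    return "Ollama is running" in body
```

Calling `ollama_is_up()` with no arguments probes the default address; pass a different `base_url` if you have moved Ollama to another port.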

Pull a model. This is the single most important decision in the whole setup, because the model is the personality. Stock instruct models (the default llama3.1, qwen2.5, etc.) are aligned and will refuse or sanitize a lot of roleplay. For companion/RP work people reach for uncensored or “abliterated” fine-tunes — community models trained to stay in character and not lecture you. Categories to look for:

  • 8B-class models — fast, fit comfortably in 8GB VRAM, good enough for casual chat.
  • 12B–13B-class models — the sweet spot for coherent, in-character roleplay on a 12–16GB card.
  • Mixtral / larger MoE — heavier, typically needs 24GB+ VRAM at usable quants, but noticeably smarter.

Pull one like this:

ollama pull <model-name>
ollama list      # confirm it downloaded

Pay attention to the quantization tag (e.g. Q4_K_M). Lower quant = smaller and faster but slightly dumber; Q4_K_M is the standard “good balance” pick, Q5_K_M or Q6_K if you have VRAM to spare. Our best local LLM for roleplay roundup names specific models by VRAM tier, and the best uncensored local AI models guide explains the uncensored/abliterated landscape so you don’t pick a model that breaks character every third message.

Match the model to your hardware. Run something too big and Ollama spills the overflow into system RAM, which drags your tokens-per-second down to a crawl.
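A back-of-the-envelope way to check the fit before downloading is parameter count times bits per weight. The sketch below is a rough estimate, not a measurement: the bits-per-weight figures are approximate values for common GGUF quant types, and the 1.5 GB allowance for KV cache and buffers is an assumption that grows with context size.

```python
# Approximate bits per weight for common GGUF quant types (assumed, rounded values).
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * bits / 8 bytes, divided by 1e9 bytes per GB.
    return params_billion * bits / 8


def fits_in_vram(params_billion: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed allowance for KV cache and buffers fit."""
    return estimate_weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size, quant, vram in [(8, "Q4_K_M", 8), (12, "Q4_K_M", 12), (13, "Q6_K", 8)]:
        print(size, quant, vram, fits_in_vram(size, quant, vram))
```

Treat a "fits" result as a starting point; long contexts and other GPU workloads eat into the same memory.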

Install and Launch SillyTavern (Per-OS)

SillyTavern needs Node.js 20 or newer and Git. Install both first (node --version should report 20+).

Then clone the release branch — don’t download a random zip, and don’t use the staging branch unless you want bleeding-edge breakage:

git clone https://github.com/SillyTavern/SillyTavern -b release

Windows: open the SillyTavern folder and double-click Start.bat. The first launch installs Node dependencies automatically, then opens the UI in your browser.

macOS / Linux: run the launch script (recent release branches ship start.sh already executable, so the chmod step is only needed if the script somehow lost its exec bit):

cd SillyTavern
chmod +x start.sh   # only if it isn't already executable
./start.sh

Either way, SillyTavern serves itself at http://localhost:8000 and usually opens a browser tab automatically. Leave the terminal/console window open — that’s the running server. Close it and SillyTavern stops.

To update later, run git pull in the folder (or use the bundled update script). Because it’s a git checkout, updates are clean and you keep all your characters and settings.

Connect SillyTavern to Your Local Ollama Backend

This is the step everyone overcomplicates. With both pieces running — Ollama service active, model pulled, SillyTavern open in the browser — do this:

  1. Click the plug icon (API Connections) in the top toolbar.
  2. Set API to Text Completion.
  3. Set API Type to Ollama.
  4. Set API URL (Server URL) to:
    http://127.0.0.1:11434/
  5. Click Connect.

If the connection succeeds, SillyTavern will detect the model(s) Ollama has pulled (downloaded) and let you pick one from a dropdown. Under the hood it queries Ollama’s /api/tags endpoint, which lists every model on disk — so a model shows up as soon as it’s pulled, whether or not you’ve ever run it. (You do not need to ollama run it first to load it into VRAM; Ollama loads it on the first request.) Select the model you pulled earlier. A green light / “Connected” status means you’re done — pick a character and start chatting.

Two things people miss: Ollama must already be running before you hit Connect, and the model must already be pulled (SillyTavern lists what Ollama has — it doesn’t download for you). If the dropdown is empty, that’s your culprit nine times out of ten.
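If the dropdown stays empty, you can ask Ollama directly what it would report. The sketch below queries the same /api/tags endpoint mentioned above and pulls out the model names; the helper names are hypothetical, and the response shape (a `models` list whose items carry a `name`) is an assumption based on the Ollama API as commonly documented.

```python
import json
import urllib.request


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response."""
    # "models" may be missing or null when nothing has been pulled yet.
    return [m["name"] for m in (tags_payload.get("models") or [])]


def list_pulled_models(base_url: str = "http://127.0.0.1:11434") -> list[str]:
    """Ask a running Ollama server which models it has on disk."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

With Ollama running, `list_pulled_models()` should return the same names that `ollama list` prints. An empty list means there is nothing for SillyTavern to show.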

Character Cards, Lorebooks, and Sampler Settings

This is where SillyTavern earns its reputation — and where it gets deep.

Character cards are portable persona files (usually a PNG with embedded JSON, the “v2 card” format). They contain the character’s name, description, personality, example dialogue, and a first message. You import a card, and the model role-plays that persona. You can write your own in the built-in editor or import community cards. Treat the card’s description and example messages as the model’s most important instructions — a vague card produces a vague character.
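To see what a card actually contains, you can pull the JSON back out of the PNG. This sketch assumes the layout community cards commonly use: a PNG `tEXt` chunk with the keyword `chara` whose value is base64-encoded JSON. It is an illustration of the file format, not SillyTavern’s own importer, and it skips CRC validation for brevity.

```python
import base64
import json
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_card_from_png(png_bytes: bytes) -> dict:
    """Return the card JSON embedded in a PNG (assumes a tEXt chunk keyed 'chara')."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            if keyword == b"chara":
                return json.loads(base64.b64decode(text))
        pos += 8 + length + 4
    raise ValueError("no 'chara' text chunk found")
```

Usage would look like `card = read_card_from_png(open("card.png", "rb").read())` (the filename is a placeholder). In a v2 card the persona fields typically sit under `card["data"]`, for example `name`, `description`, `personality`, `first_mes`, and `mes_example`.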

Lorebooks (World Info) are SillyTavern’s killer feature for long-term consistency. A lorebook is a set of keyword-triggered entries — mention “the Northern Keep” and the matching entry gets injected into the prompt automatically. This lets you build a whole world without burning context window on lore the model doesn’t currently need. It’s also how people give companions persistent backstory and memory beyond the chat log. (For true cross-session memory, see local AI with persistent memory.)
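The keyword-trigger idea can be sketched in a few lines. This is a simplified illustration, not SillyTavern’s implementation: the `LoreEntry` class, the case-insensitive substring match, and placing lore ahead of the chat history are all assumptions, and the real feature has more options (scan depth, insertion order, token budgets).

```python
from dataclasses import dataclass


@dataclass
class LoreEntry:
    keys: list[str]   # trigger keywords for this entry
    content: str      # text injected into the prompt when a key matches


def triggered_entries(recent_text: str, entries: list[LoreEntry]) -> list[str]:
    """Return the content of every entry whose keyword appears in recent_text."""
    haystack = recent_text.lower()
    return [e.content for e in entries
            if any(key.lower() in haystack for key in e.keys)]


def build_prompt(lore: list[str], chat_history: str) -> str:
    """Put triggered lore ahead of the chat history (placement is an assumption)."""
    return "\n".join(lore + [chat_history])


if __name__ == "__main__":
    book = [
        LoreEntry(["Northern Keep"], "The Northern Keep is a ruined border fortress."),
        LoreEntry(["Silver Road"], "The Silver Road links the coast to the capital."),
    ]
    message = "We ride for the Northern Keep at dawn."
    print(build_prompt(triggered_entries(message, book), message))
```

Only the matching entry is added, which is why untriggered lore costs no context.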

Sampler settings control how the model picks each next token. The ones that matter most:

| Setting | What it does | Sane starting point |
| --- | --- | --- |
| Temperature | Creativity / randomness | 0.7–1.2 for RP |
| Top-P | Nucleus sampling cutoff | 0.9–0.95 |
| Top-K | Caps candidate tokens | 40 (or 0 to disable) |
| Repetition penalty | Fights looping/repeating | 1.1–1.15 |
| Context size | How much history the model sees | Match the model’s trained context |

Higher temperature = wilder and more creative but less coherent; lower = safer and more repetitive. Repetition penalty is the dial you’ll reach for when the model keeps saying the same phrase. SillyTavern ships preset sampler profiles — start from one of those rather than tuning from scratch, and set context size to what the model actually supports, not higher (overshooting produces garbage or errors).
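To make the dials concrete, here is a toy version of how temperature, Top-K, and Top-P narrow the choice of the next token. It runs on a tiny hand-written logit table and is an illustration of the general technique, not the code Ollama runs; repetition penalty is left out, and the function name and defaults are assumptions.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float = 1.0,
                 top_k: int = 40, top_p: float = 0.95, rng=random) -> str:
    """Pick one token from a toy logit table (illustrative sketch)."""
    if temperature <= 0:
        # Degenerate case: always take the single most likely token.
        return max(logits, key=logits.get)

    # Temperature rescales logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # Top-K keeps only the K most likely tokens (0 disables the cap).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-P keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for _, tok in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]
```

With logits such as `{"the": 5.0, "a": 2.0, "zebra": -3.0}`, a low temperature or a tight Top-P almost always yields the top token, while a high temperature with Top-K disabled lets unlikely tokens through. That is the "wilder but less coherent" trade-off described above.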

Troubleshooting the Common Connection Errors

Almost every failure is one of these:

  • “Connection refused” / can’t reach 11434. Ollama isn’t running. Start it (ollama serve, or just launch the app) and run ollama list to confirm. On Linux, check the service is up before blaming SillyTavern.
  • Connected, but the model dropdown is empty. You haven’t pulled a model yet, or you pulled it under a different user/Ollama instance. Run ollama list — if it’s empty, ollama pull <model> first.
  • address already in use on port 11434. Ollama is already running (often as a background service). That’s fine — you don’t need a second instance. Just point SillyTavern at the existing one.
  • SillyTavern UI won’t load at localhost:8000. The server terminal closed, or Node failed to install deps. Re-run Start.bat / ./start.sh and watch the console for the actual error. Node version below 20 is a frequent root cause.
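  • Replies are painfully slow or cut off. The model is too big for your VRAM and spilling into system RAM, or your context size is set higher than the model supports. Run ollama ps to see how the loaded model is split between CPU and GPU, then drop to a smaller quant (e.g. Q4_K_M) or a smaller model, and lower context. If replies simply stop mid-sentence, also check SillyTavern’s Response (tokens) limit.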
  • Model ignores the character / breaks persona. That’s an alignment problem with the model, not a bug — switch to an uncensored or abliterated model built for roleplay.
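The first few checks in the list above can be scripted. The sketch below probes the two default ports with plain TCP connections and maps the result onto the checklist; the helper names and messages are hypothetical, and it assumes the stock ports (11434 for Ollama, 8000 for SillyTavern).

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: nothing usable is listening.
        return False


def diagnose(ollama_up: bool, tavern_up: bool, model_count: int) -> str:
    """Map three observations onto the checklist (hypothetical helper)."""
    if not ollama_up:
        return "Ollama is not listening on 11434: start the service or app."
    if model_count == 0:
        return "Ollama is up but has no models: run ollama pull <model>."
    if not tavern_up:
        return "SillyTavern is not listening on 8000: re-run Start.bat or ./start.sh."
    return "Both services are up and a model is available: try Connect again."


if __name__ == "__main__":
    ollama_up = port_open("127.0.0.1", 11434)
    tavern_up = port_open("127.0.0.1", 8000)
    # model_count is supplied by hand here, e.g. by counting the rows in `ollama list`.
    print(diagnose(ollama_up, tavern_up, model_count=1))
```

An open port only shows that something is listening, so a slow or persona-breaking model still needs the manual fixes from the list.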

Why This Is Powerful but Painful

Let’s be straight about the trade-off. The SillyTavern + Ollama stack is, hands down, the most capable local companion stack in existence. You get total control: any model, any character, any sampler, full lorebook worldbuilding, group chats, prompt-template surgery — and zero data leaves your machine. No subscription, no content filter you didn’t choose, no company reading your chats. For a tinkerer, it’s heaven.

It’s also a lot of moving parts. You’re maintaining a Node front-end and a model back-end. You’re choosing models, matching quantization to VRAM, tuning samplers, hunting down character cards, and debugging connection errors that assume you already know the difference between Text Completion and Chat Completion. Nothing here is hard individually — but it’s an evening of setup before your first good conversation, and it breaks in small ways (a bad update, a wrong port, a too-big model) that you have to diagnose yourself.

That’s the honest deal: maximum power, maximum fiddliness. If that sounds fun, you’ll love it. If you just want to talk to a private, uncensored companion tonight without becoming a sysadmin, there’s a shorter path.

The One-Click Alternative for Non-Tinkerers

If reading the last two sections made you tired, you’re the reason simpler tools exist. Ember is the same core idea — an uncensored AI companion that runs 100% locally on your own machine, your data never leaving it — but without the SillyTavern + Ollama assembly project. It bundles the model, the runtime, and the interface into a single install, so you skip the git clones, the port debugging, the sampler spreadsheets, and the empty-dropdown errors. It’s a one-time purchase, not a subscription, and like the stack above it keeps every conversation on your hardware. (More on that route in how to run an AI girlfriend locally.)

If you love tinkering, build the SillyTavern + Ollama stack — it’s genuinely the most powerful option and this guide gets you there. If you just want the private, local experience without the work, Ember is the one-click way in.