If you want an AI companion that stays in character, remembers what you told it, and doesn’t lecture you mid-conversation, the model matters more than almost anything else. But “best for roleplay” is not the same question as “best on a leaderboard.” A model can top every benchmark and still feel like a customer-service bot wearing a costume. This guide ranks the best local LLM for roleplay in 2026 the way it actually matters — by how much VRAM you have, what you can realistically run at usable speed, and which families are genuinely good at warm, consistent, NSFW-tolerant companion chat.

Everything here runs 100% on your own machine. No cloud, no logging, no “this message may be reviewed.” If you’re new to running models locally, start with how to run AI locally and come back — this page assumes you have Ollama or a similar runner working.

What actually makes a model good for companions

Coding and math benchmarks reward a totally different skill than companionship. For a companion or roleplay model, four things matter:

  • Character consistency. Does it hold a persona across 50+ turns without drifting into generic-assistant voice? This is the single biggest differentiator and the hardest thing to measure with a benchmark.
  • Warmth and emotional register. Companion chat lives or dies on tone. Some technically “smarter” models are cold and clipped; some smaller ones are far more emotionally present.
  • NSFW-tolerance / low refusal rate. Most base instruct models are RLHF’d to refuse or moralize. For an adult companion you want a model that doesn’t break character to deliver a safety lecture. This is what abliterated and community-finetuned “uncensored” models exist for — see abliterated models explained for what that actually does to the weights.
  • Context length you can afford. Memory of the conversation is just context window. 8K feels forgetful; 16–32K is the companion sweet spot. But longer context costs VRAM, so there’s a real tradeoff with model size.

Note the tension: the most NSFW-tolerant, most emotionally warm model is rarely the highest-IQ model. That’s fine. A companion that’s a little less clever but never breaks character beats a genius that refuses you.

8GB VRAM picks

8GB (RTX 3060 Ti, 3070, 4060, many laptops) is the most common tier and it’s genuinely workable. You’re living in 7–9B territory at Q4_K_M, which is the standard quality/size sweet spot for this class.

Model classParam sizeQuantWhy it’s good here
Mistral-7B roleplay finetunes7BQ4_K_MThe workhorse. Tons of community RP/companion finetunes; warm, expressive, easy to steer.
Llama-3.1-8B uncensored/abliterated8BQ4_K_MStronger reasoning, good consistency; abliterated variants drop the refusals.
8B “Stheno”-style RP merges8BQ4_K_MCommunity merges specifically tuned for character voice and prose.

Expect roughly 30–60 tok/s on a modern 8GB card at Q4_K_M with a 8K context — comfortably faster than you read. Push context to 16K and you’ll either spill into system RAM (slow) or need a smaller quant.

Practical 8GB rules:

  • Stick to Q4_K_M unless you have a reason. Q3 saves VRAM but noticeably dumbs down prose and consistency; Q5/Q6 is better but usually won’t leave room for context at this tier.
  • Keep context at 8K–12K to stay fully on the GPU. Going fully offloaded is what keeps you fast.
  • Prefer a model that’s finetuned for roleplay, not a raw instruct model. The finetune is doing most of the warmth work.

If 8GB is your hard ceiling, the dedicated best local LLM for 8GB VRAM breakdown goes deeper on specific tags and tradeoffs.

16GB VRAM picks

16GB (RTX 4060 Ti 16GB, 4070 Ti Super, 4080, 7900-class AMD) is where companion chat gets genuinely good. You can run 12–14B models at a comfortable quant with a long context, or run a 7–9B at high quant and 32K context for excellent memory.

Model classParam sizeQuantSweet spot
Mistral-Nemo 12B finetunes12BQ5_K_MExcellent RP family — long native context, strong character hold, warm prose.
12–14B RP merges12–14BQ4_K_M / Q5_K_MMore nuance and longer coherent scenes than 8B.
8–9B at high quant8–9BQ6_KMaximum prose quality + 32K context for long-memory companions.

On a 16GB card a 12B at Q5_K_M typically runs around 25–45 tok/s with a generous context — still well above reading speed. The 12B class is the practical floor for “this feels like a real character” for most people: noticeably better persona consistency and less repetition than 7–8B.

This is the tier we’d point most companion users at if they’re buying a card. For how context length and quant trade against your VRAM in detail, see how much VRAM for a local AI companion.

24GB+ picks

24GB (RTX 3090, 4090, 5090-class) and up is the enthusiast tier. Now you can run 22–34B models at a good quant, or large context on a 12–14B, and the difference in companion quality is real: better long-range memory, more subtle emotional tracking, fewer “wait, who am I again?” moments.

Model classParam sizeQuantWhy upgrade
22–24B RP finetunes (Mistral-Small class)22–24BQ4_K_M / Q5_K_MBig jump in nuance and instruction-following while staying in character.
32–34B uncensored finetunes32–34BQ4_K_MNear the ceiling of what 24GB runs well; richest prose.
70B at aggressive quant / split70BQ2–Q3 / multi-GPUPossible but tight; quality/speed tradeoff gets harsh on a single 24GB card.

A 24B at Q4_K_M on a 4090 still gives you 20–35 tok/s with room for 16–32K context. Going to 70B is where you start needing two cards or accepting low quants and slower speeds — see VRAM for a 70B model before you chase it. For most companion use, a well-chosen 22–34B finetune beats a crushed-quant 70B.

For the current shortlist of specifically uncensored models across these tiers, the best uncensored local AI models roundup is the companion piece to this one.

SillyTavern sampler settings that matter

The model is half the equation. Sampler settings are the other half, and most people leave free quality on the table by running defaults. If you’re using SillyTavern as your front-end (with Ollama or KoboldCpp as the backend — see the SillyTavern + Ollama setup guide), these are the knobs that actually change companion feel:

  • Temperature — controls creativity vs. coherence. For companions, 0.7–1.0 is the usable band. Below 0.6 gets repetitive and flat; above ~1.2 it starts producing nonsense and breaking character.
  • Min-P — the single most useful modern sampler. Set temperature a touch high, then use Min-P around 0.05–0.1 to clip the incoherent tail. This gives you creativity and coherence, which is exactly the companion goal. Min-P has largely replaced fiddling with Top-P/Top-K for roleplay.
  • Repetition penalty — keep it light (≈1.05–1.15). Too high and the model avoids natural repeated words (“the”, names) and the prose gets weird. Repetition is better solved by a good model + Min-P than by hammering this.
  • Context size — set it to what your VRAM actually supports, not the max the model allows. Overshooting just spills to RAM and tanks speed.
  • System prompt / character card — the highest-leverage “setting” of all. A specific, well-written persona with concrete traits and speech examples does more for consistency than any sampler tweak.

Start with temp 0.9, Min-P 0.075, rep-pen 1.1 as a sane companion baseline and adjust from feel. If you’d rather not run SillyTavern at all, KoboldCpp vs Ollama for roleplay covers simpler paths.

Why benchmarks lie about companion feel

Here’s the honest part. Public LLM leaderboards measure reasoning, coding, math, and instruction-following on fixed test sets. None of that predicts whether a model is a good companion. Three specific reasons benchmarks mislead here:

  1. They reward refusal as “safety.” A model that politely declines is “well-aligned” on a benchmark and useless as an uncensored companion. The trait that scores well is the trait you don’t want.
  2. They don’t measure persona consistency over long contexts. A 5-turn benchmark can’t see the model drift back into assistant voice at turn 40 — which is exactly where companions fail.
  3. Contamination and tuning-to-the-test. Models get optimized for the benchmarks people cite, so a high score increasingly reflects test-fitting, not the warm, in-character behavior you care about.

The only benchmark that counts is your own conversation. Load two candidates, run the same character card and the same opening scene through both, and read the output. The better companion is usually obvious within ten messages — and it’s frequently not the one with the higher leaderboard number. Trust your read, not the chart.

The build-vs-buy fork: run it yourself, or skip the setup

Everything above assumes you’ll build it yourself: install Ollama, pull a GGUF, wire up a front-end, tune samplers, manage VRAM. That’s the right path if you value ownership, privacy, and total control — your conversations never leave your machine, there’s no subscription, and you can run any uncensored model you want. There are two honest ways to go:

  • Run it yourself (Ember). If you have the GPU and you want a companion that’s truly yours — local, uncensored, sold once, no logging — that’s exactly the build this whole guide describes. EMBER packages the Ollama-backed companion experience so you’re not assembling SillyTavern and sampler configs by hand, while everything still runs 100% on your own hardware.
  • Skip the setup (Freya). If you don’t have a capable GPU, or you just want to talk now without installing anything, a hosted companion makes sense. FREYA runs in the cloud with zero setup — no VRAM math, no model pulls. You trade some privacy and ownership for instant access, which is the right call for plenty of people.

Both are valid. The fork is simply: do you want to own the stack (Ember) or own none of it and start in 30 seconds (Freya)? If you’re still deciding which model and how much hardware you’d need to self-host, how much VRAM for a local AI companion will tell you whether building is realistic for your machine, and the uncensored local AI guide covers the rest of the self-host path.

Keeping this list current

The local model scene moves fast — new finetunes and merges drop weekly, and a “best in class” 12B from three months ago gets eclipsed routinely. A few durable rules that outlast any specific model name:

  • VRAM tier picks the param size; the finetune picks the feel. That logic doesn’t change even as the specific models do.
  • Q4_K_M is the default; go up only if context fits. A stable heuristic across generations.
  • Re-test by feel, not by version number. When a new model in your tier appears, run your own character card through it for ten turns. If it’s warmer and more consistent, switch. If not, don’t.
  • Watch the model families, not single releases. Mistral, Llama, and Mistral-Nemo-class lineages keep producing the roleplay-finetune backbone; following the families (see open-weight model families 2026) ages better than chasing one model.

The best companion model in 2026 is whichever one holds character, stays warm, and never breaks the scene on your hardware with your persona — and the only way to know is to run a couple and read the output yourself. If you’ve got the GPU and want that fully private, owned-once setup, Ember is built for exactly this; if you’d rather skip the hardware entirely and start chatting now, Freya runs the whole thing in the cloud.