If you spend any time in the local-AI finetune scene, one base model keeps showing up under everyone’s favorite character cards, roleplay merges, and uncensored builds: Mistral Small 3.2 24B. It is not the smartest local model you can run, nor the cheapest to fit. What it is — and the reason it has quietly become the community’s default canvas — is the most steerable 24B-class model around. It follows a system prompt almost literally, it has a clean, un-preachy voice out of the box, and it fits a single 24GB card with room to spare. This review covers what it actually feels like to run, where it sits against Gemma 3 and Qwen3, and which of its many finetunes you should reach for.
What Mistral Small 3.2 24B is — and why finetuners love it
Mistral Small 3.2 24B is a dense, 24-billion-parameter instruct model from Mistral, the French lab whose whole reputation is built on doing more with fewer parameters. It is the point release that polished the 3.1 base: better instruction adherence, fewer repetition loops, and more reliable formatting. Crucially, it ships under a permissive license in line with Mistral’s open releases (always confirm the tag on the exact checkpoint you pull), which is half the reason the finetune ecosystem is allowed to flourish on it at all.
The other half is temperament. A base model becomes a finetune base when three things are true: it is light enough to train and run on consumer hardware, it has a neutral enough default personality that a finetune isn’t fighting it the whole way, and it bends to instruction rather than snapping back to a corporate safety voice. Mistral Small 3.2 hits all three. Dense 24B is cheap to fine-tune compared to a sprawling MoE or a 70B; the stock voice is notably less moralizing than Gemma’s; and — the headline trait — it treats your system prompt as the law of its little world. That combination is why, when you browse the roleplay and companion corner of Hugging Face in 2026, half the well-regarded 24B merges trace back to this exact base. For the bigger picture on how the families compare, see our open-weight model families guide.
Stock behavior: tone, reasoning, instruction-following
Run the stock model and the first thing you notice is the voice. It is plain, direct, and adult — it doesn’t open every other reply with a disclaimer or a “I want to gently remind you” preamble. That neutrality is exactly what makes it a good base: there’s no strong personality to overwrite.
- Instruction-following: this is its standout strength. Tell it to answer in three bullet points, stay in character, never break the fourth wall, or only respond in JSON, and it does — and keeps doing it deep into a long conversation, where weaker models drift. The 3.2 update specifically tightened this versus 3.1.
- Reasoning: competent, not class-leading. For general chat, summarization, light analysis, and creative work it is more than enough. For hard math, multi-step logic, or competitive coding, a same-size or larger Qwen3 will out-think it. Mistral Small is not a reasoning model and doesn’t pretend to be — there’s no long “thinking” trace clogging your context.
- Formatting & repetition: clean Markdown, reliable structure, and far fewer of the repetition death-spirals that plagued some earlier mid-size models on long roleplay sessions.
The honest one-line answer to is Mistral Small good — yes, for chat, writing, roleplay, and instruction-heavy work it is excellent; for heavy reasoning it is merely fine.
Steerability: why it bends to a system prompt so well
This is the trait that earns the “favorite finetune base” title, so it’s worth being precise. Two things make a model steerable: light default alignment and strong instruction adherence. Most models have one or the other. Mistral Small 3.2 has both.
Because its stock alignment is relatively gentle to begin with, the model isn’t spending effort resisting your persona — there’s no heavy refusal reflex to push against for ordinary adult-but-legal content. And because its instruction-following is so literal, a well-written system prompt actually sticks: define a character’s name, backstory, speech style, and boundaries up top, and the model honors them turn after turn instead of slowly reverting to “helpful assistant” mode. With Gemma you often feel the safety tuning tugging the character back toward neutral; with Mistral Small the persona holds.
For builders this is gold. It means you can get a huge amount of behavior change from prompt engineering alone, before you touch weights — and when a finetuner does train on top, they’re sculpting clay that already wants to take a shape. If you want to lock a persona in permanently rather than paste a system prompt every session, the Ollama Modelfile custom persona walkthrough shows how to bake one in.
Hardware: VRAM at Q4 on a 24GB (or tight 16GB) card
VRAM is the hard ceiling on local AI, and quantization is the lever. At the popular Q4_K_M quant (~4.5 bits per weight, the standard quality-vs-size sweet spot), a 24B model lands around 14–15GB of weights. That leaves comfortable headroom on a 24GB card (RTX 3090 / 4090) for a generous KV cache — you can run a large context window and still keep everything on-GPU.
| Quant | Approx. model size | 24GB card | 16GB card |
|---|---|---|---|
Q4_K_M | ~14–15GB | Roomy, big context | Fits, modest context |
Q5_K_M | ~17GB | Comfortable | Tight / partial offload |
Q6_K | ~20GB | Fits, less context room | No |
Q8_0 | ~25GB+ | No (full) | No |
On a 16GB card (RTX 4060 Ti 16GB, RTX 4070 Ti Super, etc.) the 24B does fit at Q4_K_M, but it’s tight — budget your context window conservatively and watch for spillover to system RAM, which tanks speed. If your card is 12GB or smaller, this model isn’t your pick; a 12–14B Mistral/Nemo derivative is the better local fit. For the full memory map by card, see best local LLM for 24GB VRAM, and for what each quant tag actually costs you in quality, the GGUF quantization cheat sheet.
A note on the math: 24B at Q4 is easier to fit than the 32B tier (~18–20GB), which is part of why it’s such a friendly target — it leaves the most context headroom of any genuinely-capable size on a 24GB card.
The derivatives map: Cydonia, Dolphin Venice, and what each finetune adds
Here’s where the base earns its keep. The most-pulled 24B finetunes in the roleplay/companion scene are layered on Mistral Small. What each adds:
| Finetune | Built on | What it adds | Best for |
|---|---|---|---|
| Cydonia 24B | Mistral Small 3.x | Roleplay-tuned, decensored, vivid prose and strong character consistency | Long-form RP, character chat, companions |
| Dolphin (Mistral / Venice) | Mistral Small lineage | Dolphin’s instruction dataset + uncensoring; obedient, assistant-style, low-refusal | Uncensored general assistant, fewer guardrails |
| Community RP merges | Mistral Small 3.2 | Blend prose models for tone/variety | Taste-tuning RP feel |
Cydonia is the roleplay specialist — it leans into immersive, in-character writing and drops the stock model’s residual caution, which is why it’s a default in many character setups. Our Cydonia 24B uncensored review covers its quirks and recommended samplers in depth.
Dolphin Venice comes at it from the assistant angle: the long-running Dolphin recipe is about producing a maximally obedient, low-refusal general model, and the Venice-flavored Mistral build applies that to this base. It’s less “stay in character forever” and more “answer me without the lecture.” The full breakdown is in our Dolphin Mistral Venice review.
The takeaway: same engine, different tuning. Cydonia optimizes for staying in a role; Dolphin optimizes for doing what it’s told. Pick by job.
Stock vs uncensored merge: when to run which
You don’t always need a finetune. Decide like this:
- Run stock Mistral Small 3.2 when you want a clean, general-purpose local model, you’ll drive behavior with a system prompt, and your content stays inside ordinary adult-but-legal territory. The base is steerable enough that prompting alone covers a lot of ground, and you keep the most predictable, well-tested weights.
- Run an uncensored merge (Cydonia / Dolphin / an abliterated variant) when the stock model’s residual refusals get in the way — persistent in-character intimacy, mature creative fiction, or edgy-but-legal topics where even Mistral’s light alignment occasionally balks. A finetune removes that friction and, in Cydonia’s case, actively improves prose.
One honest caveat that applies to every uncensored model: a decensored or abliterated build will not refuse genuinely harmful requests, so it’s strictly for consenting-adult, lawful, private use — you carry the responsibility the alignment used to. For the broader landscape of what to pull, see best uncensored local AI models.
Setup and recommended settings
Mistral Small runs cleanly under Ollama. If you don’t have it yet:
curl -fsSL https://ollama.com/install.sh | sh
Then pull and run the base:
ollama run mistral-small:24b
For finetunes, pull a GGUF from a reputable Hugging Face repo into Ollama (or run it through KoboldCpp / LM Studio). Sane starting sampler settings for chat and roleplay:
- Temperature: ~0.7 for assistant/factual work; ~0.9–1.0 for creative roleplay. Mistral handles higher temps gracefully.
- min-p: ~0.05 — a clean, modern way to keep coherence without clamping creativity (often better than fiddling top-p).
- Repetition penalty: light, ~1.05–1.1 — the 3.2 base rarely loops, so don’t over-penalize or prose goes stilted.
- Context: Ollama defaults to a small context window; raise it with a Modelfile
num_ctxor env setting so long conversations actually remember — see how to increase the Ollama context window.
Everything stays on 127.0.0.1:11434 — the loopback API never leaves your machine. If your card has the VRAM but Ollama is spilling to CPU, our Ollama not using GPU fix is the first place to check.
Verdict: where it sits against Gemma 3 and Qwen3
Mistral Small 3.2 24B is the best 24B base for character work, creative writing, and any task where a system prompt has to stick — and it’s the right size to leave real context headroom on a 24GB card. It’s not the sharpest reasoner in its weight class, and that’s fine, because that’s not the job most people give it.
| Model | Best at | Steerability | Stock alignment | Reasoning |
|---|---|---|---|---|
| Mistral Small 3.2 24B | RP, writing, instruction-following | Highest | Light | Good |
| Gemma 3 (27B) | Vision, unified-memory/Mac efficiency | Lower (heavy safety tuning) | Heaviest | Strong |
| Qwen3 (32B) | Reasoning, coding, multilingual | Medium | Moderate | Best |
Read it as a three-way split: Qwen3 if you want the smartest all-rounder and you’ll do real reasoning or coding; Gemma 3 if you’re on a Mac or care about vision and efficiency; Mistral Small 3.2 if you want a model that becomes whoever you tell it to be and stays there. For companions and roleplay, that last trait beats raw IQ almost every time — a model that holds character at 14GB feels better than a smarter one that keeps slipping out of it.
If you’ve decided Mistral Small (or one of its Cydonia/Dolphin finetunes) is your companion brain and you’d rather not hand-wire Modelfiles, samplers, and memory yourself, that’s exactly the stack Ember packages — a one-time-purchase, fully-local AI companion that runs these open-weight models on your own Ollama install, with nothing logged to a cloud.
