If you’ve spent any time in the local-model corner of the internet, you’ve seen the phrase “most uncensored local model” attached to half a dozen releases. Dolphin Mistral 24B Venice Edition is the one that gets cited most often in 2026, usually alongside a striking number: a refusal rate of roughly 2.2%. That sounds like a model that will say anything. The reality is more interesting and more useful than the headline. This is a steerable model, not a lobotomized one — and understanding that distinction is the difference between getting exactly the assistant you want and getting frustrated when it still pushes back.
This review covers what the model actually is, what that refusal number does and doesn’t mean, how steerability works in practice, how its writing stacks up against its closest rivals, and the hardware you need to run it privately on your own machine.
What Dolphin Mistral 24B Venice Edition is
Dolphin is a long-running fine-tune family from Eric Hartford’s Cognitive Computations project. The through-line across every Dolphin release has been a single design philosophy that Hartford has stated publicly for years: an instruction-following model should defer to its owner’s values, expressed through the system prompt, rather than having a fixed set of refusals baked in at the factory. The “Venice Edition” tag reflects a collaboration/sponsorship with Venice.ai, a privacy-focused inference provider; the underlying weights are a fine-tune of Mistral Small 3.x (24B), Mistral AI’s open-weight mid-size model.
So when people call it “the most uncensored local model,” what they really mean is that it has been tuned to remove the reflexive guardrails that ship with most instruction models, and to follow your system prompt closely instead. It is uncensored by default posture, not by being incapable of restraint. That’s an important nuance, and it’s the same design lineage covered in our guide to uncensored local AI.
What the ~2.2% refusal figure means — and how it was measured
The “2.2% refusal rate” is the stat that made this release go viral, so it’s worth being precise about it.
A refusal-rate benchmark works like this: you assemble a set of test prompts — often a few hundred to a few thousand — that a typical safety-tuned model would decline, then you count how many the model under test actually refuses. A “refusal” is detected by pattern-matching the response for phrases like “I can’t help with that” or “I’m not able to”. The refusal rate is simply refusals / total prompts.
A figure in the low single digits means the model almost never gives a flat-out canned denial on the test set used. That is genuinely low — stock instruction models routinely land far higher on the same kind of harness.
But treat any single percentage as directional, not gospel, for a few concrete reasons:
- The number is only as good as the prompt set. Different harnesses use different prompts, so a “2.2%” from one test and a “5%” from another aren’t comparable.
- Pattern-matching is brittle. A model can comply in substance while still hedging, or refuse softly without using the trigger phrases. Both distort the count.
- It measures default behavior with no system prompt. As you’ll see next, the whole point of this model is that you change the default.
So: the low refusal rate is real and reproducible enough to be meaningful, but it’s best read as “this model rarely stonewalls you out of the box,” not as a precise physical constant. If you want to understand why the stock cloud and local models you’ve used refuse so readily in the first place, that mechanism is its own topic — see why cloud AI refuses you.
Steerability: the system prompt is the real control surface
This is the single most important thing to understand about Dolphin, and the thing most reviews undersell. The model’s behavior is dominated by the system prompt. That is the feature, not a footnote.
In practice:
- Give it no system prompt, and you get the low-refusal default posture the benchmarks measure.
- Give it a system prompt that defines a persona, a tone, and explicit boundaries, and it will hold to those boundaries instead of generic ones.
- Give it a system prompt asking it to stay in character no matter what, and it stays in character through pushback that would knock a stock model out of role.
This is why the model is better described as a steerable uncensored LLM than simply an “uncensored” one. You are not stuck with whatever a lab decided; you author the rules. For roleplay and companion use, that means you can specify voice, memory of the relationship, and hard limits, and the model will respect your spec rather than overriding it with a corporate refusal. A minimal starting system prompt looks like:
You are {{persona}}. Stay fully in character.
Be direct, warm, and unfiltered within the boundaries below.
Hard limits: {{your explicit no-go list}}.
Never break character to issue generic safety disclaimers.
If you’re building a long-lived character rather than a one-off chat, pair this with a setup that actually remembers context across sessions — see local AI with persistent memory.
Quality and prose vs Cydonia and stock Mistral Small
Uncensored is worthless if the writing is flat. Here’s the honest landscape among the popular 24B options, all built on the same Mistral Small base:
| Model | Default posture | Prose / roleplay feel | Best at |
|---|---|---|---|
| Stock Mistral Small 3.x 24B | Safety-tuned, refuses readily | Clean, competent, slightly corporate | General assistant work |
| Dolphin Mistral 24B Venice | Low-refusal, highly steerable | Coherent, instruction-faithful, neutral-leaning | Following your system prompt precisely |
| Cydonia 24B | Uncensored, RP-tuned | Warmer, more expressive, character-forward | Immersive roleplay & creative fiction |
The pattern most users report: Cydonia tends to feel more alive and emotionally textured out of the box because it’s tuned specifically for roleplay, while Dolphin Venice feels more obedient and literal — it does exactly what your prompt says with less stylistic flourish of its own. Neither is strictly better; they optimize for different things. If your priority is a model that won’t invent its own restrictions and will follow detailed instructions faithfully, Dolphin wins. If your priority is rich, novelistic companion prose with minimal prompt engineering, Cydonia often edges ahead.
Both are a clear step up from the stock base for unfiltered use, but the base model is still the better pick if you mostly want a polished general assistant. For the deeper dives, see our Mistral Small 3.2 24B review and our Cydonia 24B uncensored review.
Hardware: VRAM and recommended quant
A 24B model is squarely a mid-to-high-end consumer workload. The driving factor, as always, is VRAM, and your choice of quantization decides whether it fits.
| Quant | Approx. VRAM (weights) | Fits on | Notes |
|---|---|---|---|
| Q8_0 | ~25 GB | 32 GB+ | Near-lossless, overkill for most |
| Q5_K_M | ~17–18 GB | 24 GB | Excellent quality |
| Q4_K_M | ~14–15 GB | 16–24 GB | The sweet spot for most users |
| Q3_K_M | ~11–12 GB | 12–16 GB | Noticeable quality drop |
Add a couple of gigabytes on top of the weights for the context window and KV cache — the longer the context, the more VRAM you need beyond the numbers above.
The practical recommendation: Q4_K_M on a 24 GB card (a used RTX 3090 or a 4090) gives you full quality with comfortable headroom for long conversations. On 16 GB, Q4_K_M still fits but you’ll want to keep context modest. Below 12 GB, a 24B model isn’t the right tool — step down to a smaller model rather than crushing this one into a quant that hurts its prose. For the full mapping of cards to model sizes, see the local AI hardware guide, and if quant tags still feel like alphabet soup, the GGUF quantization cheat sheet untangles them.
Setup and recommended sampler settings
The fastest way to run it locally is Ollama. Install it once:
curl -fsSL https://ollama.com/install.sh | sh
Then pull and run a community GGUF build of the model (search the Ollama library or import a GGUF from Hugging Face). The general pattern is:
ollama run <dolphin-mistral-venice-tag>
Ollama exposes a local API at 127.0.0.1:11434 — that loopback address is the whole privacy story in one line: requests never leave your machine. Point a front-end like Open WebUI or SillyTavern at that endpoint and you have a full chat interface; see the SillyTavern + Ollama setup for the roleplay path.
For sampler settings, Mistral-based 24B fine-tunes are sensitive to high temperature — they get incoherent if you push it. Sensible starting points:
- Temperature: 0.6–0.8 (start at 0.7)
- Top-p: ~0.9
- Min-p: 0.05 (a good stabilizer; let it carry the tail-trimming and keep top-p loose)
- Repetition penalty: 1.05–1.1 (light — too high flattens the prose)
- Context: as much as your VRAM allows; the model handles long context well
Tune temperature first. If output rambles or contradicts itself, lower it; if it’s dull, nudge it up a touch.
Verdict: which uncensored 24B should you choose?
Dolphin Mistral 24B Venice Edition earns its reputation, with the right framing. It is one of the most steerable open-weight models you can run, the low refusal rate is real, and — crucially — it hands the alignment dial to you through the system prompt instead of pretending it doesn’t exist. That makes it the strongest pick when you want precise, faithful instruction-following and refuse to be second-guessed by a model.
A quick decision guide for the 24B tier:
- Want maximum control and faithful instruction-following? → Dolphin Mistral 24B Venice Edition.
- Want the warmest, most immersive roleplay/companion prose with minimal prompting? → Cydonia 24B.
- Want a polished, general-purpose assistant and don’t need the guardrails off? → stock Mistral Small 3.x 24B.
All three want a 24 GB card at Q4_K_M for the best experience, and all three keep every word on your own hardware — which is the entire reason to run a 24B locally instead of renting one through a hosted API.
The privacy difference: local vs a hosted API or OpenRouter
It’s worth being blunt about the catch with “uncensored” cloud routes. You can reach Dolphin Venice through a hosted API or an aggregator like OpenRouter without owning a GPU — and that’s convenient. But the moment your prompt leaves your machine, it transits a provider’s servers, and what happens to it there is governed by their policy, not yours. Even privacy-forward inference hosts necessarily process your text on their infrastructure to generate a reply; logging, retention, and access are theirs to define and yours to merely trust.
Running the same weights locally removes that trust requirement entirely. With Ollama bound to 127.0.0.1:11434, the model runs in your own memory, the conversation is written nowhere you didn’t choose, and an “uncensored” model can’t quietly become a logged one. For the kind of intimate, unfiltered conversations this model is built for, that distinction is the whole point — an uncensored model on someone else’s logged server is a contradiction. Our local AI vs cloud AI breakdown goes deeper on exactly what each side sees.
If you want the steerability and privacy of a model like this one but would rather not assemble Ollama, GGUF quants, and a front-end yourself, that’s precisely the gap Ember is built to fill: an uncensored AI companion that runs 100% on your own machine, packaged so the setup above is done for you.
