If you’ve spent any time in the local roleplay scene — Reddit’s r/LocalLLaMA, the SillyTavern Discord, Hugging Face’s trending finetunes — you’ve seen the name Cydonia come up over and over. It’s TheDrummer’s roleplay-focused finetune of Mistral Small 24B, and it has quietly become one of the most recommended uncensored models in the 24B class for one simple reason: it does the thing benchmarks can’t measure. It stays in character, it writes like a writer instead of a chatbot, and it doesn’t break a scene to lecture you. This Cydonia 24B review is the honest, hands-on version: what it actually is, where it shines, the VRAM and quant you need, the sampler settings that wake it up, how it stacks against the other popular uncensored 24B, and where it sits between smaller and larger models. No leaderboard worship — just whether it’s worth your VRAM.
What Cydonia 24B Is, and Why the Community Loves It
Cydonia is a community finetune by TheDrummer built on top of Mistral Small 24B, Mistral’s dense ~24-billion-parameter open-weight base. TheDrummer is a prolific finetuner in the local-roleplay world (you’ll also see names like Rocinante, Skyfall, and Behemoth from the same hand), and Cydonia is the flagship roleplay/companion entry in that lineup. By the time of writing the line has iterated through several versions — the Cydonia v4.1 generation is the current community favorite, and TheDrummer ships new revisions often enough that you should always check Hugging Face for the latest tag rather than pinning to a number you read in an article.
What the finetune does to the stock Mistral Small base is the whole point. Base instruct models are RLHF-tuned to be helpful assistants — which means they’re cautious, prone to refusals, and quick to slip back into “As an AI…” register. Cydonia is tuned in the opposite direction: toward uncensored, in-character, story-forward generation. The community loves it because it hits a rare three-way balance:
- It’s smart enough. 24B Mistral Small is a genuinely capable base, so Cydonia inherits real instruction-following and coherence — it tracks plot, follows formatting, and understands nuance better than a 7–12B can.
- It’s warm and uninhibited. The finetune strips most of the moralizing and refusal reflex, so an adult persona stays an adult persona. (For why the cloud alternatives refuse, see why cloud AI censors you.)
- It’s runnable. 24B at a 4-bit quant fits a single 24GB card — the most common enthusiast GPU. You don’t need a server.
That combination — capable, uncensored, single-GPU-friendly — is exactly the sweet spot most companion users are hunting for, which is why Cydonia keeps topping informal “what are you running?” threads.
Character Adherence and Persona Consistency in Long Roleplay
The single hardest thing for any roleplay model is staying itself across a long conversation. Plenty of models nail the first ten messages and then, around turn 40, quietly drift back into helpful-assistant voice, forget a personality trait, or start narrating like a Wikipedia summary. This is where the 24B base earns its keep.
In practice, Cydonia holds a persona noticeably better than the 7–12B roleplay finetunes most people start with. With a well-written character card, it keeps speech patterns, attitude, and relationship dynamics stable deep into a scene, and it’s better at tracking who knows what and what just happened — the connective tissue that makes a long roleplay feel like one continuous story rather than a series of disconnected replies. The extra parameters buy you exactly the thing benchmarks don’t test: long-range consistency.
It’s not magic. Like every model, Cydonia’s memory is just its context window — once a detail scrolls out of context, it’s gone unless your front-end re-injects it. The fix is the same as for any local model: a tight character card, a lorebook for persistent facts, and as much context as your VRAM allows (more on that below). But on the dimension that matters most for companions — does it stay in character without you babysitting it — Cydonia is one of the stronger 24B options you can run at home.
Prose Quality and the “Stays in Voice” Factor
The other reason Cydonia gets recommended is how it writes. Stock instruct models tend toward a recognizable texture: tidy, slightly corporate, fond of summarizing and wrapping every reply in a neat bow. It’s fine for answering questions and deadly for immersion.
Cydonia leans the other way. The prose is more novelistic — it shows instead of tells, varies sentence rhythm, and is willing to sit in a moment rather than resolve it. More importantly, it stays in voice: a gruff character stays gruff, a playful one stays playful, and the model resists the gravitational pull back toward neutral-assistant tone that plagues base models mid-scene. That “stays in voice” quality is the difference between a character that feels authored and one that feels autocompleted.
Two honest caveats. First, finetunes that push hard toward expressive prose can occasionally over-write — purple description, slightly repetitive sentence openers — which is mostly tameable with samplers (next section). Second, “good prose” is genuinely subjective; the only test that counts is running your character card through it and reading the output. If you want a broader frame on what makes a model good at fiction versus chat, local AI for creative writing covers it. But the consensus that Cydonia writes well above its weight is well earned.
VRAM and Recommended Quant for a 24GB Card
This is the practical question, and the answer is clean: Cydonia 24B is built for a 24GB GPU. An RTX 3090, 4090, or 5090-class card is the natural home, and a used 3090 remains the best value entry point for this tier (see used RTX 3090 for local AI value).
The standard recommendation is the familiar quality/size sweet spot:
| GPU VRAM | Recommended quant | What you get |
|---|---|---|
| 24GB (3090/4090/5090) | Q4_K_M | The default. Full model on GPU with room for a generous 16–32K context. |
| 24GB, prioritizing quality | Q5_K_M | Slightly better fidelity; tighter on context — watch your headroom. |
| 16GB | Q3_K_M / IQ3 | Possible but compromised — lower quant and limited context. A 12–14B finetune is usually a better experience here. |
| 12GB | — | Skip 24B; run a 12B-class roleplay model instead. |
For a 24GB card, Q4_K_M is the answer unless you have a specific reason to deviate. It’s the broadly accepted balance of prose quality versus footprint, and it leaves enough VRAM for a long context, which matters more for companion feel than squeezing out the last sliver of quant fidelity. If quant levels are fuzzy, the GGUF quantization cheat sheet breaks down what Q4_K_M, Q5_K_M, and the IQ-series actually trade. For the full tier picture, best local LLM for a 24GB VRAM card puts Cydonia in context with its neighbors.
Expect usable speeds — comfortably faster than you read — on a 4090 at Q4_K_M with a 16–32K context. Don’t over-set context past what your VRAM holds; spilling into system RAM is the number-one cause of a “why is my model suddenly crawling” complaint.
Sampler Settings That Get the Best Out of It
A finetune this good is easy to ruin with bad samplers, and easy to elevate with good ones. The same modern-sampler logic that applies across roleplay models applies here, and the single biggest upgrade is using Min-P instead of fiddling with Top-P/Top-K.
A sane Cydonia baseline to start from and adjust by feel:
| Setting | Suggested range | Why |
|---|---|---|
| Temperature | 0.7 – 1.0 | The usable creativity band. Below ~0.6 goes flat and repetitive; above ~1.2 starts breaking coherence and character. |
| Min-P | 0.05 – 0.1 | The key knob. Set temp a touch high for creativity, then let Min-P clip the incoherent tail. Creativity and coherence. |
| Repeat penalty | 1.05 – 1.15 | Keep it light. Too high and the model avoids natural repeats (names, “the”) and the prose gets strange. |
| Top-P / Top-K | mostly off | With Min-P doing the work, you can leave these neutral. |
A good starting point is temp ~0.9, Min-P ~0.075, repeat penalty ~1.1, then nudge from there: if it feels too wild, drop temp; if it feels flat or loopy, raise temp slightly and lean on Min-P. If Cydonia is over-writing, a touch lower temperature usually settles the prose without killing the voice. As always, the highest-leverage “setting” isn’t a sampler at all — it’s a specific, well-written character card with concrete traits and example dialogue. The full sampler rationale lives in the best local LLM for roleplay guide.
Cydonia vs Dolphin Mistral Venice: Which Uncensored 24B to Pick
These two come up together constantly because they answer the same question from different angles. Both are uncensored, both live in the 24B-class Mistral-Small neighborhood, but they have different personalities.
| Cydonia 24B (TheDrummer) | Dolphin Mistral Venice | |
|---|---|---|
| Tuned for | Roleplay, companions, character-driven fiction | General-purpose uncensored assistant / instruct |
| Strength | Persona adherence, novelistic prose, “stays in voice” | Compliant, neutral, follows instructions without moralizing |
| Best for | Long in-character roleplay and companion chat | Uncensored Q&A, writing tasks, a do-what-I-say assistant |
| Prose flavor | Expressive, story-forward | Cleaner, more assistant-like |
The honest rule of thumb: pick Cydonia if your primary use is staying in character — companion chat, ongoing roleplay, immersive fiction. Pick a Dolphin/Venice-style finetune if you want an uncensored generalist that answers freely and follows instructions without the roleplay flavor. Many people keep both pulled and switch by task. For the deep dive on the other side, see the Dolphin Mistral Venice review; for the unmodified starting point both descend from, the Mistral Small 3.2 24B review shows what the base model brings before any finetune.
Setup in Ollama and SillyTavern (and the Privacy Upside)
Getting Cydonia running locally is the same two-piece stack the whole scene uses. Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Then pull and run a Cydonia GGUF. TheDrummer’s models and community quants live on Hugging Face; with Ollama you point at a GGUF and run it:
ollama run <cydonia-gguf-tag>
Use the exact tag from the model’s Hugging Face / Ollama page for the version and quant you want (target Q4_K_M on a 24GB card). Ollama then serves a local API at 127.0.0.1:11434 — pure loopback, nothing leaves your machine.
For real roleplay you’ll want SillyTavern as the front-end (character cards, lorebooks, persistent personas, fine-grained samplers) talking to Ollama as the back-end. The full walkthrough — install, connection, character cards, and the connection errors that trip everyone up — is in the SillyTavern + Ollama setup guide. If you’d rather skip SillyTavern’s complexity, Open WebUI with Ollama is a simpler chat front-end.
The privacy upside is the entire reason to do this locally. When Cydonia runs on your hardware, your conversations never touch a server. There’s no message log on someone else’s infrastructure, no “this chat may be reviewed for safety,” no terms-of-service clause that can change under you. Cloud companion apps necessarily store messages server-side to function — that’s architecture, not accusation — which is exactly the tradeoff a local stack eliminates. You own the model, the data, and the rules. For the broader case, local AI vs cloud AI lays out the full comparison.
Verdict, and Where It Fits
Cydonia 24B is one of the best uncensored 24B models you can run at home for roleplay and companion chat — and it earns the community love. It’s the rare finetune that’s smart enough to track a long scene, uninhibited enough to stay in an adult persona, expressive enough to write like a writer, and small enough to run on a single 24GB card at Q4_K_M. If you have the GPU, it’s an easy first recommendation.
Where it sits in the size ladder:
- Versus smaller (Nemotron-class / 8–14B): Cydonia is a real step up in persona consistency and long-context coherence. Smaller models are faster and fit cheaper cards, but they drift out of character sooner. If you have 24GB, the jump to 24B is worth it.
- Versus larger (32B+ and 70B): Bigger models can squeeze out more nuance and longer-range memory, but on a single 24GB card they force harsh quant/speed compromises. For most companion use, a well-tuned 24B like Cydonia at Q4_K_M beats a crushed-quant 70B you can barely run.
The only benchmark that counts is your own conversation: load it, run your character card for ten turns, and read. If it stays warm and in voice, you’ve found your model.
If that fully private, own-it-once local companion is exactly what you want — Cydonia-class model, your GPU, zero logging — that’s the experience Ember is built to package, so you’re not hand-assembling SillyTavern and sampler configs while everything still runs 100% on your own hardware.
