If you’ve been pasting your codebase into a cloud chatbot and waiting for the next “we updated our terms” email, there’s a better path: a coding model that runs entirely on your own GPU, sees your whole repo, and never phones home. The question isn’t whether you can replace cloud autocomplete locally — you can — it’s which model your hardware can actually hold. This guide ranks the best local LLM for coding strictly by the number that decides everything: your VRAM. No hand-waving, no “it depends” — just what fits in 8GB, 12GB, and 24GB, how to wire it into your editor, and the one configuration mistake that quietly breaks every local coding agent.

Why coding models are different from chat models

A general chat model and a coding model can share the same architecture and still behave completely differently in your editor, because code work demands three capabilities that casual chat never exercises.

Fill-in-the-middle (FIM). Autocomplete isn’t “continue this text.” Your cursor sits between existing code — there’s a prefix above and a suffix below — and the model has to write the middle that connects them. Models trained with FIM objectives (the Qwen-Coder and Codestral families both advertise this) take special <|fim_prefix|> / <|fim_suffix|> tokens and produce far better inline suggestions. A model without FIM training can still chat about code, but it makes a mediocre autocomplete engine.

Tool-calling. Modern coding agents (Aider, Cline, Continue’s agent mode) don’t just emit text — they call functions: read a file, run a shell command, apply a diff. The model has to emit well-formed structured calls reliably, turn after turn. A model that’s strong at writing a function in isolation but flaky at tool-calling will stall the moment you let it drive.

Multi-file context. Real refactors span files. The model needs a large enough context window to hold several files plus the diff plus your instructions — and it needs to actually use the far end of that window, not just the first few hundred lines. This is where the num_ctx gotcha below bites hardest.

Keep these three in mind and the tier choices make sense: small models are for FIM autocomplete, mid models are for real edits, and big models are for agentic multi-file work.

8GB tier: small Qwen-Coder and distills for autocomplete

With 8GB of VRAM (RTX 3050/3060 8GB, RTX 4060, many laptops) you are not running a Copilot replacement — and that’s fine, because the highest-value local coding task at this tier is fast inline autocomplete, which small models do well.

Target a coding-specialized model in the ~1.5B–7B range at a Q4_K_M quant. The small Qwen2.5-Coder sizes are the standout here precisely because they were trained with fill-in-the-middle, so they shine as the autocomplete provider in Continue while you keep a bigger model (or a cloud key) for chat. A 1.5B–3B coder loads in well under 8GB and returns suggestions fast enough to feel instant, which matters more than raw quality for inline completion — a suggestion that arrives after you’ve already typed the line is worthless.

ollama run qwen2.5-coder:7b
# or a smaller distill for snappier autocomplete:
ollama run qwen2.5-coder:1.5b

Be realistic about the ceiling: 7B coders write a clean utility function or a regex, explain an error, and scaffold a file. They are not dependable agents — tool-calling at this size is hit-or-miss, and they lose the thread on multi-file refactors. Use them as a smart autocomplete plus a rubber-duck, not as something you hand the keyboard to. For the full picture of what 8GB can and can’t do, see our best local LLM for 8GB VRAM guide.

12GB tier: Codestral 22B-class for real refactors

12GB of VRAM — the RTX 3060 12GB and RTX 4070 sit here — is the first tier where a local model genuinely helps with editing existing code, not just emitting snippets.

The headline model is Codestral 22B, Mistral’s dedicated code model, which was built around FIM and a long context and explicitly targets the autocomplete-plus-refactor workflow. A 22B model at Q4_K_M is roughly 13GB on disk — slightly over 12GB — so the honest move at exactly 12GB is to run it with a modest context window and accept partial GPU offload, or step down to a 14B coder that fits comfortably with room for context.

ollama run codestral
# 14B alternative that fits 12GB with headroom for context:
ollama run qwen2.5-coder:14b

Either way, this is the tier where a local model starts doing the unglamorous work that actually saves time: rename a symbol across a file, extract a function, translate a snippet between languages, write tests for an existing module. Tool-calling becomes usable but still benefits from short, well-scoped tasks. If you’re shopping hardware or models around this size, our best local LLM for 12-16GB VRAM breakdown covers the trade-offs in depth.

24GB tier: Qwen Coder 32B / Devstral as the local Copilot replacement

24GB of VRAM — RTX 3090, RTX 4090, and the used-3090 builds that are the sweet spot for serious local AI — is where “local coding” stops being a compromise. At this tier you can hold a 32B-class coding model at a good quant with a real context window, which is exactly what an agent needs.

Two models define this tier:

ModelSizeBest atNotes
Qwen2.5-Coder 32B32B denseGeneral coding, FIM, broad language coverageThe strongest all-round open coder; fits 24GB at Q4_K_M
Devstral24BAgentic, tool-driven multi-file editingMistral’s agent-tuned coder, designed for harnesses like Cline/OpenHands

This is the codestral vs devstral question people actually mean to ask, but the sharper comparison is Codestral vs Devstral: Codestral (22B) is tuned for autocomplete and FIM, while Devstral is explicitly tuned for agentic software engineering — running inside a tool loop, reading files, and applying patches across a repo. If you want inline completion, lean Codestral-family; if you want to hand an agent a task and let it work, Devstral was built for that. And if you want the single best general local coder, Qwen2.5-Coder 32B is the default recommendation and the closest thing to a drop-in local copilot model.

ollama run qwen2.5-coder:32b
ollama run devstral

A 32B at Q4_K_M lands around 19–20GB, leaving just enough for a meaningful context window on a 24GB card — which is why this tier, not the one below it, is where local agents become trustworthy. For the wider 24GB landscape (including non-coding use), see best local LLM for 24GB VRAM, and for a deeper look at the base model the coder is built on, our Qwen3 32B review.

What SWE-bench scores actually tell you (and what they don’t)

You’ll see coding models ranked by SWE-bench — a benchmark of real GitHub issues where the model must produce a patch that makes the repo’s tests pass. It’s the best public proxy we have for “can this thing do real software engineering,” and a higher score genuinely correlates with better agentic behavior.

But read the scores with three caveats:

  • The harness is half the score. SWE-bench results depend heavily on the scaffolding around the model — retrieval, retry logic, how files are fed in. The same weights score very differently under different agents. A headline number is “model + harness,” not the model alone.
  • Contamination is real. Public benchmarks leak into training data over time. A model trained after a benchmark was published may have effectively seen the answers, inflating its score relative to how it handles your novel codebase.
  • It measures one job. SWE-bench rewards fixing isolated, test-covered Python issues. It says little about your TypeScript monorepo, your autocomplete latency, or how well the model holds your house style across a large refactor.

Use SWE-bench to separate “serious coder” from “chat model that knows some code,” then trust your own eval: point the model at a real task from your repo and watch what it does. For why a fast-but-weaker model can still beat a slow-but-stronger one in practice, see tokens per second: what’s actually usable.

Wiring it up: Continue, Aider, and Cline against a local endpoint

Once a model is running in Ollama it’s serving an OpenAI-compatible API on the loopback address 127.0.0.1:11434. Every major local coding tool can point at it.

Continue (VS Code / JetBrains extension) is the most flexible — it lets you assign different models to different roles, which is the right pattern: a small fast coder for autocomplete, a big one for chat/edit. In config.json:

{
  "models": [
    { "title": "Qwen Coder 32B", "provider": "ollama", "model": "qwen2.5-coder:32b" }
  ],
  "tabAutocompleteModel": { "provider": "ollama", "model": "qwen2.5-coder:1.5b" }
}

Aider (terminal pair-programmer) talks to Ollama directly:

export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama/qwen2.5-coder:32b

Cline (autonomous VS Code agent) is where Devstral earns its keep — in Cline’s settings pick the Ollama provider and select your model. Cline leans hard on tool-calling and multi-file edits, so give it a 24GB-tier model; smaller models stall in the agent loop.

The golden rule across all three: match the tool to the model size. Autocomplete tools want a tiny fast model; agentic tools want the biggest, most tool-reliable model your VRAM allows.

The num_ctx gotcha that makes coding agents silently fail

This is the single most common reason a local coding setup “works in chat but falls apart on real code,” so it deserves its own section.

Ollama defaults to a context window of 2048 tokens unless you tell it otherwise. That’s fine for a quick question and catastrophic for coding agents. An agent like Aider or Cline stuffs system instructions, several files, and a diff into the prompt — easily 8,000+ tokens. When that exceeds num_ctx, Ollama doesn’t error. It silently truncates the oldest tokens, which are usually your instructions and the top of your files. The model then “forgets” what it was asked, edits the wrong thing, or hallucinates code that doesn’t match the (now-invisible) file — and you have no idea why.

Raise it. You can set it per-request, or bake it into a Modelfile:

FROM qwen2.5-coder:32b
PARAMETER num_ctx 16384
ollama create qwen-coder-16k -f Modelfile

Pick a num_ctx your VRAM can actually back — context costs memory on top of the weights, which is exactly why the 24GB tier matters for agentic work. If raising it triggers an out-of-memory error, you’ve found your ceiling and need a smaller quant or a longer-context-friendly model. Full walkthrough in how to increase the Ollama context window.

Verdict per tier — and when local coding still loses to cloud

Your VRAMBest pickWhat you actually get
8GBQwen2.5-Coder 1.5B–7BFast FIM autocomplete + rubber-duck. Not an agent.
12GBCodestral 22B / Qwen2.5-Coder 14BReal single-file refactors, tests, translations.
24GBQwen2.5-Coder 32B (general) / Devstral (agentic)A genuine local Copilot replacement that drives agents.

Be honest about where local still loses. The biggest frontier cloud models remain ahead on the hardest, sprawling, many-file reasoning tasks, and on very long single-context windows that no 24GB card can hold. If your job is “understand this 200-file legacy system and re-architect it,” cloud is still stronger. But for the daily 80% — autocomplete, scoped refactors, writing tests, explaining errors, scaffolding — a 24GB local model is fast, free per token, fully private, and never refuses or changes its terms on you. For the full cost-and-privacy calculus, see local AI vs cloud AI; if you’re still deciding whether to invest, is local AI worth it? lays out the math.

The same machine that runs your local coder can run a private, uncensored AI companion that lives entirely on your hardware — same Ollama backend, zero cloud, no subscription. If you want that experience packaged and ready to run instead of wired together by hand, Ember is the local-first companion built for exactly the people who’d rather own their AI than rent it.