If you’ve ever pasted a screenshot of a contract, a medical bill, or a tax form into ChatGPT to “just pull the numbers out,” you’ve quietly shipped that document to someone else’s servers. For a meme, who cares. For anything with a name, an account number, or a diagnosis on it, that’s a real exposure — and once it’s uploaded, you can’t un-upload it. The good news in 2026 is that a local vision-language model (VLM) running on your own GPU can read images and answer questions about them entirely offline. No upload, no retention, no terms-of-service roulette.

This guide ranks the best local vision models for OCR and document Q&A by what they actually run on — from a 2 GB model that fits on almost anything to 11B-class models that rival cloud OCR for accuracy. We’ll keep the model names, the tools, and the VRAM numbers concrete so you can pick one and have it reading your documents tonight.

Why run a vision model locally: private OCR with no upload

Cloud OCR is everywhere — Google Lens, the “analyze this image” button in every chatbot, the document scanners baked into note apps. The catch is architectural, not malicious: a hosted service that reads your image has to receive your image. It lands on their infrastructure, and what happens next is governed by a privacy policy you didn’t write. Many services state they may retain uploads to “improve services,” and some explicitly use submitted content for training unless you opt out. (For the broader pattern, see our AI data privacy guide.)

A local VLM flips that. The model weights live on your disk, inference happens on your GPU, and the image never leaves 127.0.0.1. That matters most for exactly the documents people most want OCR’d:

  • Financial — bank statements, invoices, pay stubs, tax forms
  • Legal — contracts, NDAs, court filings (lawyers, specifically, have a duty here)
  • Medical — lab results, prescriptions, insurance EOBs
  • Personal — passports, IDs, anything with your address on it

Run it locally and the privacy question simply evaporates. There’s no log to subpoena, no breach surface, no training pipeline. That’s the whole pitch, and it’s a strong one.

How VLMs differ from text models (and from cloud OCR)

A normal LLM only reads text. A vision-language model bolts a vision encoder (usually a CLIP- or SigLIP-style image model) onto a language model, plus a small “projector” that translates image features into tokens the LLM understands. The upshot: you hand it a picture and a question, and it answers in natural language.

That’s the key difference from classic OCR engines like Tesseract. Traditional OCR transcribes — it spits out raw text and stops. A VLM understands: you can ask “what’s the total on this invoice?” or “is there a late-payment clause?” and get a reasoned answer, not a wall of transcribed characters. It’s OCR and comprehension in one pass.

Cloud OCR (Lens, GPT-4o vision)Tesseract (classic OCR)Local VLM
PrivacyImage uploadedFully localFully local
OutputText or answersRaw text onlyText + reasoning
Document Q&AYesNoYes
Handwriting / layoutStrongWeakVaries by model
CostPer-call / subscriptionFreeFree after hardware

The trade-off is honest: a frontier cloud VLM is still the accuracy leader on messy handwriting and dense multi-column layouts. But for printed documents, screenshots, forms, and tables, the better local VLMs are now genuinely good — and they’re yours.

Tiny tier (~2 GB): Moondream 2 for fast captions and simple OCR

Moondream 2 is the featherweight champion. At roughly 2B parameters it quantizes down to about 1.5–2 GB, which means it runs on integrated graphics, an old 4 GB card, or even CPU-only in a pinch (see running local AI without a GPU). It’s purpose-built to be small and fast.

What it’s great at: quick image captions, reading short bits of clear printed text, answering simple “what’s in this picture?” questions, and acting as a cheap visual filter in a pipeline. What it’s not: a dense-document workhorse. Hand it a two-column invoice with fine print and it’ll miss things a bigger model catches.

Ollama passes the image as a path inside the prompt — there’s no --image flag. Point it at a file like this:

ollama run moondream "./bill.png Read the text in this image"

(Cleanest of all: just run ollama run moondream, then drag-and-drop the image file into the prompt once it’s interactive.)

If your bar is “I want a private model that captions screenshots and pulls a phone number off a flyer without melting my laptop,” Moondream is the answer and you can stop reading. For documents that matter, size up.

Mid tier: MiniCPM-V and LLaVA 1.6 for richer document understanding

This is the sweet spot for most people doing real document Q&A, and it fits comfortably on an 8 GB VRAM card.

MiniCPM-V (the 8B-class “2.6” line and successors) punches dramatically above its size. It was designed with OCR and document understanding as a priority, handles high-resolution images and dense text well, and is often the single best recommendation for someone who wants to ask questions of scanned PDFs and screenshots on modest hardware. Quantized to Q4_K_M the GGUF file is around 5–6 GB.

LLaVA 1.6 (also branded LLaVA-NeXT) is the well-known, broadly-supported open VLM. It comes in 7B and 13B sizes, improved meaningfully over the original LLaVA on text recognition and reasoning, and is the most frictionless to get running because tooling support is mature. The 7B at Q4 is roughly 4–5 GB; the 13B is about 8 GB.

ModelParams~VRAM (Q4_K_M)Best for
Moondream 2~2B~2 GBCaptions, simple OCR
LLaVA 1.6 7B7B~4–5 GBGeneral VLM, easy setup
MiniCPM-V (2.6)~8B~5–6 GBDense docs / OCR
LLaVA 1.6 13B13B~8 GBBetter reasoning

A sizing caveat that trips people up: the numbers above are the GGUF file size on disk. Real VRAM use runs roughly 1–2 GB higher once the model loads its KV cache and the image gets tokenized — a vision encoder turns a high-res scan into a lot of tokens. So if your card sits exactly at a tier’s listed minimum (e.g. MiniCPM-V on an 8 GB card), expect it to be tight; size up one tier if you want headroom.

For most “I have an 8–12 GB card and want to chat with my documents” readers, MiniCPM-V is the pick. It’s the best balance of OCR strength, document comprehension, and modest footprint. (If you want to wire that into a searchable knowledge base rather than one-off questions, that’s a local RAG pipeline.)

Big tier: Llama 3.2 Vision 11B and Qwen3-VL for the best accuracy

When accuracy is the priority and you have 12–24 GB of VRAM, two families lead.

Llama 3.2 Vision 11B is Meta’s open multimodal model and the most “production-grade” feeling of the local options. It’s strong on document reasoning, chart and diagram interpretation, and following complex instructions about an image. Quantized it wants roughly 8–10 GB, putting it within reach of a 12 GB card and very comfortable on 24 GB cards like a used RTX 3090. It runs cleanly in Ollama (again, image path goes inside the prompt — no flag):

ollama run llama3.2-vision "./contract.png Summarize this contract and flag any deadlines"

Qwen3-VL is the current accuracy frontier in open weights. The Qwen-VL line has been a standout for OCR, multilingual text, and structured-document parsing (tables, forms, receipts) for a while, and the Qwen3 generation pushes that further. If your documents are non-English, contain dense tables, or need careful layout-aware extraction, this is the family to reach for. And it’s now first-class in Ollama: there’s an official qwen3-vl library model with tags from 2b all the way up to 235b, so you can match it to your VRAM:

ollama run qwen3-vl:8b

One requirement to note: Qwen3-VL needs Ollama 0.12.7 or newer. If you’re on an older build, run ollama --version and upgrade before pulling — which is the perfect lead-in to the one gotcha worth understanding about brand-new VLMs.

The new-architecture gotcha: when a VLM outruns your runtime

Here’s the trap that wastes an afternoon: new VLM architectures often ship before the runtimes catch up. A model can have GGUF files uploaded to Hugging Face that still fail to load — or load the text half and silently ignore the vision encoder — because the inference engine doesn’t yet support that model’s specific vision tower.

This isn’t specific to any one model; it’s the general pattern whenever a fresh architecture lands. You grab a community GGUF, point an older runtime at it, and either get an error or a model that won’t actually look at images. It’s not a “you did it wrong” problem; it’s a runtime-support lag. The fixes, in order:

  • Upgrade the runtime first. Most “it won’t load” reports are just an old build. Ollama, for example, gained official Qwen3-VL support in 0.12.7 — so ollama --version and an update solve a surprising number of these before you touch anything else.
  • Use llama.cpp directly for anything truly bleeding-edge — its multimodal support (via the mmproj projector file) often lands new VLM architectures before higher-level wrappers ship them. You load the main GGUF plus its matching mmproj file.
  • Or use LM Studio — it tracks bleeding-edge VLM support and gives you a GUI to load the model + projector without hand-rolling flags. (See Ollama vs LM Studio vs Jan for how the tools differ.)
  • Check the model card’s date and the issue tracker. If it says “requires llama.cpp build ≥ X” or “requires Ollama ≥ X,” believe it.

Rule of thumb: mature models (LLaVA, MiniCPM-V, Llama 3.2 Vision, and now Qwen3-VL on current Ollama) → Ollama is fine. If you ever hit a model your installed runtime is too old for, upgrade first, then fall back to llama.cpp or LM Studio. Don’t fight a runtime on a model it doesn’t support yet — just update it, or use the one that does.

Setup and which tool to run each in

The base install is the same as any local LLM. If you don’t have Ollama yet:

curl -fsSL https://ollama.com/install.sh | sh

Then pull a vision model and pass an image as a path inside the prompt. The loopback API stays on 127.0.0.1:11434, so nothing touches the network:

ollama run minicpm-v "./invoice.png What is the invoice total and due date?"

Match the model to the tool:

ModelRecommended runtimeWhy
Moondream 2OllamaSimple, fully supported
LLaVA 1.6OllamaMature, one-command
MiniCPM-VOllamaSupported, the daily driver
Llama 3.2 VisionOllamaFirst-class support
Qwen3-VLOllama (0.12.7+) or LM StudioOfficial library model; upgrade Ollama first

For a friendlier experience than the terminal, point Open WebUI at Ollama and you get a drag-and-drop image box in a chat window. If you’d rather automate extraction over a folder of files, the Ollama Python API lets you loop a VLM over documents and dump structured output to JSON.

Verdict by VRAM tier + the privacy payoff

Pick by the card you actually own:

Your VRAMBest pickRunner-up
≤4 GB / iGPU / CPUMoondream 2LLaVA 1.6 7B (Q4)
8 GBMiniCPM-VLLaVA 1.6 7B
12 GBLlama 3.2 Vision 11BMiniCPM-V
16–24 GBQwen3-VLLlama 3.2 Vision 11B

If you only remember one line: MiniCPM-V is the best all-around local document-OCR model for normal hardware, and Qwen3-VL is the accuracy ceiling — now a one-command pull in current Ollama. Not sure how much GPU you need before committing? Our VRAM guide for local AI breaks down the tiers in plain terms.

The real payoff isn’t the benchmark — it’s that none of this leaves your machine. When the document is a medical record or a signed contract, “the file stayed on my SSD” is worth more than a few accuracy points you’d buy by uploading it to a stranger. That’s the same principle behind chatting with your documents locally: the most private system is the one where the data never travels.

If you want that ownership without assembling the pieces yourself, Ember packages a 100%-local AI that runs on your own machine via Ollama — bought once, no cloud, no logs. And if you’d rather skip the GPU entirely and just have a private assistant ready in your browser, Freya runs it hosted with zero setup. Either way, your documents stay yours.