Chat With Your Documents 100% Offline (Ollama + AnythingLLM RAG)

Chat with your PDFs, contracts, and notes 100% offline using Ollama + AnythingLLM: a private local RAG setup where nothing leaves your machine.

If you’ve ever pasted a contract, a medical record, or a signed NDA into ChatGPT to “just summarize this,” you’ve already handed that document to someone else’s servers. For most people that’s a quiet risk they never think about. For lawyers, doctors, accountants, founders under NDA, and anyone holding sensitive files, it’s a genuine liability. The good news: you can get the same “ask my documents anything” experience — fully offline, on your own machine, with nothing leaving your computer. This guide shows you exactly how, using Ollama plus AnythingLLM to build a private local RAG (retrieval-augmented generation) setup in about fifteen minutes.

This is the practical answer to how to chat with your documents locally and privately: no API keys, no uploads, no monthly bill, no terms-of-service roulette over who gets to read your files.

Why cloud doc-chat is a real data risk

When you upload a document to a cloud AI tool, three things happen that you can’t see or control:

The file leaves your device and lands in a third party’s storage, where it’s subject to their retention policy, their breach exposure, and their staff access controls — not yours.
The text may be retained to maintain conversation history, power features, or — depending on the product and your settings — improve their models. Policies vary by provider and change over time, so the only safe assumption for sensitive material is that a copy now exists off your machine.
You may be breaking your own obligations. An NDA, attorney-client privilege, HIPAA-style health rules, or a GDPR data-processing agreement often forbids shipping the underlying data to an unvetted subprocessor. “I pasted it into a chatbot” is not a defense you want to give a regulator or a client.

None of this requires a villain or a breach. It’s just architecture: a cloud companion or assistant has to receive and process your input server-side to function, and it may store that input as well. The only way to remove that risk category entirely is to never send the document in the first place. That’s the whole pitch for local. If you want the deeper comparison, see local AI vs cloud AI and our breakdown of whether Ollama is really private.

The local RAG stack: Ollama + AnythingLLM

RAG sounds technical, but the idea is simple. Instead of stuffing an entire 80-page PDF into the model’s prompt (which often overflows the context window and slows every answer), the tool:

Chunks your documents into small passages.
Embeds each chunk into a vector — a numeric fingerprint of its meaning — using a local embedding model.
When you ask a question, it retrieves the handful of most relevant chunks and feeds only those to the language model, which writes a grounded answer and can cite the source.
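To make the three steps concrete, here is a minimal sketch of the same loop in plain Python. It is not how AnythingLLM is implemented: the function names are made up for illustration, and `embed` is a toy word-count stand-in for a real embedding model such as nomic-embed-text, which returns dense vectors that capture meaning rather than shared vocabulary.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 12) -> list[str]:
    """Step 1: split text into passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Step 2 (toy version): represent text as lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Step 3: return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


DOCUMENT = (
    "This agreement begins on the first of March. "
    "Either party may terminate with sixty days written notice. "
    "Invoices are payable within thirty days of receipt. "
    "Confidential information must not be disclosed to third parties."
)

chunks = chunk_words(DOCUMENT, size=9)
context = retrieve("How many days of notice to terminate?", chunks, top_k=1)

# Only the retrieved passage goes to the language model, not the whole file.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

The point of the sketch is the shape of the pipeline: the model only ever sees the retrieved passage, which is why retrieval quality matters so much later on.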

You need two pieces, and both run on your own hardware:

Component	Job	Why this one
Ollama	Runs the local LLM (and can serve embeddings)	Dead-simple model management, exposes a loopback API at `127.0.0.1:11434`
AnythingLLM	The doc-chat front end + RAG engine	Desktop app for macOS/Windows/Linux, no account required, stores model, documents, vectors, and chats locally by default

Per AnythingLLM’s own description, everything — “the model, documents, chats” — is stored locally on your desktop, and it’s “private by default.” It plugs straight into Ollama as the LLM provider.

Don’t want to run Ollama separately? LM Studio ships its own document-chat panel and is a fine all-in-one alternative if you prefer a single app. The privacy principle is identical: the files never leave your machine.

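First, get the engine running. Install Ollama (the one-line script below is the Linux route; on macOS and Windows, use the installer from ollama.com):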

curl -fsSL https://ollama.com/install.sh | sh

Pull a capable general model and a small embedding model:

ollama pull llama3.1:8b
ollama pull nomic-embed-text
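If you’d like to confirm both pulls landed before moving on, a short script can ask the local server what it has. This is a sketch using only the Python standard library against Ollama’s `/api/tags` endpoint on the default loopback address; the helper names (`model_names`, `missing_models`, `fetch_tags`) are hypothetical, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default loopback address


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def missing_models(installed: list[str], wanted: list[str]) -> list[str]:
    """Return the wanted models that are not installed.

    A bare name such as 'nomic-embed-text' matches any installed tag of
    that model (for example 'nomic-embed-text:latest').
    """
    def present(name: str) -> bool:
        if ":" in name:
            return name in installed
        return any(i.split(":")[0] == name for i in installed)

    return [w for w in wanted if not present(w)]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models it has on disk."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        installed = model_names(fetch_tags())
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        print("Ollama is not answering on", OLLAMA_URL)
    else:
        gaps = missing_models(installed, ["llama3.1:8b", "nomic-embed-text"])
        print("Missing models:", ", ".join(gaps) if gaps else "none")
```

Note that the request goes to 127.0.0.1, so this check itself never leaves your machine.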

If you’ve never set up a local model before, our how to run AI locally primer covers the basics and hardware expectations.

Step-by-step: load PDFs and notes, then ask questions

Install AnythingLLM Desktop from anythingllm.com (macOS, Windows, or Linux). No sign-up.
Point it at Ollama. In Settings → LLM Preference, choose Ollama as the provider. It should auto-detect the endpoint at http://127.0.0.1:11434. Select the model you pulled (e.g. llama3.1:8b).
Set the embedder. In Settings → Embedding Preference, choose Ollama and select nomic-embed-text. (AnythingLLM also ships a built-in local embedder if you’d rather not run one through Ollama — either way it stays on-device.)
Create a Workspace. Think of a workspace as one project or one matter — “Q3 Contracts,” “Lab Results,” “Acme NDA.” Keeping sensitive files in their own workspace prevents cross-contamination between topics.
Drop in your files. Click the upload area and add PDFs, .docx, .txt, Markdown notes, even pasted text. AnythingLLM chunks and embeds them locally. Larger PDFs take a moment the first time while vectors are built.
Ask. Type a real question: “What’s the termination notice period in this contract?” or “Summarize every action item assigned to me across these meeting notes.” The model answers from your documents and shows which chunks it pulled, so you can verify rather than trust blindly.

That’s it. You now have a private, searchable brain over your own files.

Verifying nothing leaves your machine

Don’t take anyone’s word for it — confirm it. This is the part most guides skip, and it’s the part that actually earns your trust.

Pull the plug. Disconnect Wi-Fi and unplug Ethernet entirely, then ask your documents a question. If it still answers, retrieval and inference are both happening on your hardware, full stop. This is the most honest test there is, though it only covers the offline case; the two checks below cover what happens while you’re connected.

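Check what’s listening. By default, Ollama binds to loopback only (setting the OLLAMA_HOST environment variable to something like 0.0.0.0 changes that). On Linux you can confirm it’s local with the command below; on macOS, which has no ss, lsof -nP -iTCP:11434 -sTCP:LISTEN shows the same thing: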

ss -tlnp | grep 11434

You should see it bound to 127.0.0.1 (loopback), not 0.0.0.0 (every interface). Loopback traffic never touches your network card or your router.
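If you’re not sure how to read the address column, the rule is mechanical enough to write down. The sketch below classifies a "Local Address:Port" value the way you would by eye; `bind_scope` is a made-up helper name, and it assumes the address is formatted the way `ss` and `lsof` usually print it.

```python
import ipaddress


def bind_scope(local_address: str) -> str:
    """Classify the 'Local Address:Port' column of a listening socket."""
    # Drop the port, then the brackets that wrap IPv6 addresses.
    host = local_address.rsplit(":", 1)[0].strip("[]")
    if host in ("*", ""):
        return "all interfaces"
    # Strip an interface suffix such as '%lo' before parsing.
    ip = ipaddress.ip_address(host.split("%")[0])
    if ip.is_loopback:
        return "loopback only"
    if ip.is_unspecified:  # 0.0.0.0 or ::
        return "all interfaces"
    return "specific interface"


for addr in ("127.0.0.1:11434", "[::1]:11434", "0.0.0.0:11434", "*:11434"):
    print(f"{addr:18} -> {bind_scope(addr)}")
```

Anything other than "loopback only" means other devices on your network may be able to reach the Ollama API.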

Inspect outbound connections while you chat, if you want to be thorough:

# Linux/macOS — watch for any connection out of localhost during a query
lsof -i -nP | grep -i -E 'ollama|anythingllm'

Healthy local RAG shows connections to 127.0.0.1/localhost and nothing reaching out to a cloud host mid-answer. (AnythingLLM may check for updates or fetch a model on first run — that’s setup, not your document data. The files themselves stay put.)

Best models for document Q&A

Document Q&A rewards instruction-following and faithfulness over raw creativity. You want a model that sticks to the retrieved text instead of confidently making things up. VRAM drives how large a model you can run, so match the model to your hardware:

Your VRAM	Practical pick	Notes
8 GB	An 8B-class instruct model at `Q4_K_M`	Solid for summaries and Q&A. See best local LLM for 8GB VRAM
12–16 GB	8B–14B instruct at `Q4_K_M`/`Q5_K_M`	More headroom, longer context. 12–16GB guide
24 GB	A 24B–32B-class model	Noticeably better at multi-document reasoning. 24GB guide
No GPU / Mac	Smaller quantized models on CPU or Apple Silicon	Slower but workable — see run local AI without a GPU

A few principles that matter more than the exact model name:

Embedding model quality matters as much as the LLM. nomic-embed-text is a strong, lightweight default; if the right passage never gets retrieved, a bigger generator can’t compensate.
Quantization tags like Q4_K_M trade a little accuracy for a lot less memory. Q4_K_M is the everyday sweet spot; bump to Q5/Q6 if you have the VRAM and want crisper answers. Our GGUF quantization cheat sheet explains the tradeoffs.
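Context window matters for long documents. A model with a generous context can hold more retrieved chunks at once. Ollama’s default context length is typically much smaller than the model’s maximum, so you may need to raise it (the num_ctx option) before extra chunks actually fit.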
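The memory side of the quantization trade-off is simple arithmetic: parameters times bits per weight, divided by eight, gives bytes. Here is a rough sketch of that calculation. The bits-per-weight figures and the 1.5 GB headroom for context and runtime buffers are assumptions for illustration, not measured values, and real GGUF files vary by model.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(vram_gb: float, params_billions: float, bits_per_weight: float,
         headroom_gb: float = 1.5) -> bool:
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weights_gb(params_billions, bits_per_weight) + headroom_gb <= vram_gb


# Assumed average bits per weight for common quantization tags.
ASSUMED_BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "F16": 16.0}

for tag, bits in ASSUMED_BITS.items():
    size = weights_gb(8, bits)
    verdict = "fits" if fits(8, 8, bits) else "does not fit"
    print(f"8B at {tag}: ~{size:.1f} GB of weights, {verdict} in 8 GB")
```

Treat the result as a first filter rather than a guarantee: longer contexts need more memory on top of the weights.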

Limits of local RAG — and how to get better answers

Local RAG is powerful but not magic. Knowing the failure modes lets you fix them:

Garbage in, garbage out on PDFs. Scanned/image-only PDFs have no extractable text. Run OCR first (e.g. with a tool such as OCRmyPDF that adds a text layer) or your retriever finds nothing.
Chunking can split context. If an answer straddles a chunk boundary, each chunk holds only part of it and neither may be retrieved. Increasing chunk overlap in AnythingLLM’s settings helps the model see across boundaries.
Retrieval, not the model, is usually the bottleneck. When answers are weak, it’s often that the right chunk wasn’t retrieved. Ask more specific questions, raise the number of retrieved snippets (“top-k”), and try a stronger embedding model before you blame the LLM.
Hallucination still happens. Local models can invent. Mitigate it by always asking for citations, keeping workspaces tightly scoped, and verifying critical facts against the cited chunk.
Speed scales with hardware. Big models on modest GPUs feel sluggish. If tokens crawl, drop a model size or a quantization level — usable beats impressive.
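The chunk-boundary problem from the list above is easy to see in miniature. This sketch uses a simple character-window splitter, which is an illustrative assumption rather than AnythingLLM’s actual splitter, and shows a fact that is cut in half without overlap but survives intact with it.

```python
def chunk_chars(text: str, size: int, overlap: int = 0) -> list[str]:
    """Slide a window of `size` characters, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


TEXT = "Either party may terminate this agreement with sixty days written notice."
FACT = "sixty days written notice"

plain = chunk_chars(TEXT, size=60)
overlapped = chunk_chars(TEXT, size=60, overlap=30)

# Without overlap the fact is split across two chunks.
print(any(FACT in c for c in plain))       # False
# With overlap, one chunk contains the whole fact.
print(any(FACT in c for c in overlapped))  # True
```

The cost of overlap is more chunks to embed and store, so raise it in moderation.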

Use cases for the privacy-conscious professional

This setup shines exactly where the cloud is riskiest:

Lawyers querying discovery dumps, contracts, and case files without breaching privilege or an NDA.
Doctors and therapists summarizing notes or literature without exposing patient data.
Accountants and finance teams parsing statements, returns, and client records.
Founders and execs working through term sheets, cap tables, and board materials under confidentiality.
Researchers and journalists interrogating source documents and embargoed material.
Anyone who simply doesn’t want their personal paperwork — leases, tax docs, medical bills — sitting on a stranger’s server.

In every one of these, “private” isn’t a nice-to-have; it’s the requirement. Local RAG meets it by design, not by promise. If you want the broader argument for keeping AI on-device, our piece on why cloud AI censors you covers the control angle too.

The all-in-one private setup — or hosted-private if you’d rather not build it

If you want this private-by-design philosophy beyond documents — a full local AI companion that runs entirely on your own machine, with your conversations never leaving it — that’s exactly what Ember is built for: a one-time purchase, 100% local, you own it. And if you genuinely don’t have the hardware (or just want it to work the moment you sign in), Freya gives you a hosted, zero-setup option without wiring up Ollama yourself. Either way, you get capable AI that respects the same principle you just set up here: your data is yours.
