Most people meet retrieval-augmented generation the lazy way: they paste a whole document into ChatGPT and ask a question. That works until the document is 80 pages, or confidential, or both. A local RAG fixes both problems at once. You keep an index of your own files on your own disk, retrieve only the few passages that actually answer the question, and hand those to a model running entirely on your machine via Ollama. No upload, no token bill, no terms-of-service roulette over who gets to read your contracts.
This is the from-scratch, you-control-every-piece version. If you’d rather click a button, there’s a no-code path at the end. But if you want to understand what’s happening — and be able to debug it, swap models, and trust it with sensitive material — building it yourself with Ollama and nomic-embed-text takes an afternoon and teaches you the whole stack. Here’s how to build a local RAG with Ollama that lets you chat with your own documents completely offline.
What RAG actually is (and why it beats stuffing the context window)
RAG stands for retrieval-augmented generation. Strip the jargon and it’s three steps: find the relevant bits of your documents, paste those bits into the prompt, then let the model answer using them.
The naive alternative is to dump everything into the model’s context window — the working memory it reads before answering. People assume a big context window makes RAG unnecessary. It doesn’t, for three concrete reasons:
- It doesn’t fit. A folder of PDFs is easily hundreds of thousands of tokens. Even large local context windows choke, and on consumer hardware a bigger context costs real VRAM and slows generation to a crawl.
- “Lost in the middle” is real. Models tend to use information at the start and end of a long prompt more reliably than material in the middle. Bury the one relevant sentence in 60 pages of filler and the model often misses it.
- It’s wasteful. You re-process the entire corpus on every single question. RAG processes each document once at index time, then only touches the handful of passages that matter per query.
RAG flips the model’s job from “read everything and remember” to “read these five paragraphs and answer.” That’s a far easier task, which is exactly why a modest local model with good retrieval often beats a giant model drowning in context.
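To put rough numbers on that difference, here is a back-of-the-envelope sketch. Every figure in it is an illustrative assumption rather than a measurement: a 300,000-token corpus, 800-character chunks, five retrieved chunks, and the common rule of thumb of about four characters per token.

```python
# Back-of-the-envelope comparison: tokens the model must read per question.
# All numbers are illustrative assumptions, not measurements.

CORPUS_TOKENS = 300_000   # assumed size of a folder of PDFs
CHUNK_CHARS = 800         # assumed chunk size in characters
CHARS_PER_TOKEN = 4       # rough rule of thumb for English text
TOP_K = 5                 # chunks retrieved per question


def tokens_per_query_stuffing(corpus_tokens: int) -> int:
    """Context stuffing re-reads the whole corpus on every question."""
    return corpus_tokens


def tokens_per_query_rag(chunk_chars: int, k: int, chars_per_token: int) -> int:
    """RAG reads only the k retrieved chunks (prompt overhead ignored)."""
    return (chunk_chars // chars_per_token) * k


stuffing = tokens_per_query_stuffing(CORPUS_TOKENS)
rag = tokens_per_query_rag(CHUNK_CHARS, TOP_K, CHARS_PER_TOKEN)
print(f"stuffing: {stuffing} tokens per question")
print(f"RAG:      {rag} tokens per question")
print(f"ratio:    {stuffing // rag}x")
```

Under these assumptions the model reads a few hundred times less text per question. The exact ratio will vary with your corpus and chunk settings.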
The three pieces: embedding model, vector store, generation model — all local
A RAG system is three components. The whole point of building it locally is that every one of them runs on your machine — no API keys anywhere in the pipeline.
| Piece | Job | Local choice |
|---|---|---|
| Embedding model | Turns text into vectors — numeric fingerprints of meaning, so similar text lands near similar text | nomic-embed-text via Ollama |
| Vector store | Stores those vectors and finds the nearest ones to your question | Chroma or FAISS, both pip-installable |
| Generation model | Reads the retrieved chunks and writes the answer | Any Ollama chat model (Llama, Qwen, Mistral, Gemma) |
The flow at query time: your question gets embedded by the same model that embedded your documents, the vector store returns the closest chunks by similarity, and those chunks plus your question go to the generation model as a grounded prompt. Index once, query forever.
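The query-time flow can be sketched without any of the real components. The toy example below uses hand-written three-dimensional vectors in place of a real embedding model and a plain list in place of a vector store. The names cosine and top_k are illustrative, not part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made "embeddings" standing in for a real model's output.
index = [
    ("The lease ends on 30 June.",        [0.9, 0.1, 0.0]),
    ("Rent is due on the first.",         [0.7, 0.3, 0.1]),
    ("The office dog is called Biscuit.", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded question

context = top_k(query_vec, index, k=2)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: When does the lease end?"
print(prompt)
```

A real vector store does the same ranking, only over many more vectors with many more dimensions, and with indexing tricks to keep it fast.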
First, make sure Ollama is installed and running. If you haven’t yet:
curl -fsSL https://ollama.com/install.sh | sh
Full walkthrough in how to install Ollama, and a wider tour of the ecosystem in how to run AI locally. Ollama serves a loopback API at 127.0.0.1:11434 — that local-only endpoint is the backbone of everything below.
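Before going further, it can help to confirm that the local endpoint is actually answering. The sketch below asks Ollama's /api/tags endpoint for the list of installed models using only the standard library. The helper names installed_models and model_names are made up for this example, and the three-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://127.0.0.1:11434"


def model_names(tags: dict) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in tags.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[list[str]]:
    """Return installed model names, or None if Ollama is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers connection refused, timeouts, and HTTP errors
        return None
```

Calling installed_models() returns None when nothing is listening on the port, and otherwise a list you can check for the models this guide pulls.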
Step 1: chunk and embed your documents with nomic-embed-text
Pull the embedding model and a generation model:
ollama pull nomic-embed-text
ollama pull llama3.1:8b
nomic-embed-text is a small, fast, open embedding model that runs comfortably even on machines without a beefy GPU, which makes it a sensible default for a local RAG pipeline. The 8B chat model is a fine starting generator; size it to your hardware (see the local AI hardware guide and best local LLM for 12–16GB VRAM if you’re not sure what fits).
Chunking matters more than people expect. You split each document into passages of roughly 500–1,000 characters with a small overlap (say 100 characters) so a sentence cut at a boundary still appears whole in one chunk. Too-large chunks dilute relevance; too-small chunks lose context. Start around 800 characters with 100 overlap and tune from there.
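One simple way to implement that is a sliding character window. The sketch below is a minimal version under that assumption. The function name chunk_text is hypothetical, and the defaults mirror the 800/100 starting point suggested above.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters, where each window
    repeats the last `overlap` characters of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():  # skip windows that are only whitespace
            chunks.append(piece)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

A fixed window can still cut mid-word or mid-sentence. Splitting on paragraph or sentence boundaries first is a common refinement, but the plain version is enough to get a pipeline working.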
Here’s the embedding call against the local Ollama API:
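import requests

def embed(text: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    r = requests.post(
        "http://127.0.0.1:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    r.raise_for_status()  # fail loudly if Ollama is down or the model is missing
    return r.json()["embedding"]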
Loop that over every chunk of every document. The output is one vector per chunk — that’s your searchable knowledge base, sitting in memory or on disk, never on a server.
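That loop might look like the sketch below. It assumes the documents have already been split, takes the embedding function as a parameter so it can be tried without Ollama running, and returns parallel lists named chunks and sources to match the Chroma snippet in Step 2. The names embed_corpus and fake_embed are illustrative only.

```python
from typing import Callable


def embed_corpus(
    docs: dict[str, list[str]],
    embed_fn: Callable[[str], list[float]],
) -> tuple[list[str], list[str], list[list[float]]]:
    """Flatten {source: [chunk, ...]} into three parallel lists:
    chunk texts, the source of each chunk, and one vector per chunk."""
    chunks, sources, vectors = [], [], []
    for source, doc_chunks in docs.items():
        for chunk in doc_chunks:
            chunks.append(chunk)
            sources.append(source)
            vectors.append(embed_fn(chunk))
    return chunks, sources, vectors


def fake_embed(text: str) -> list[float]:
    """Stand-in embedder so the sketch runs offline; swap in the real embed()."""
    return [float(len(text)), float(text.count(" "))]


docs = {
    "lease.txt": ["Rent is due monthly.", "The lease ends in June."],
    "nda.txt": ["Keep this secret."],
}
chunks, sources, vectors = embed_corpus(docs, fake_embed)
print(sources)
```

Keeping the source next to each chunk from the very start is what makes citations possible later.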
Step 2: store and retrieve with a local vector database
You need somewhere to keep those vectors and a fast way to find the closest ones to a query. Two solid local options:
- Chroma (pip install chromadb): an embedded vector database that persists to a local folder. Easiest to start with; it handles storage, IDs, and metadata for you.
- FAISS (pip install faiss-cpu): a raw similarity-search library from Meta. More manual, extremely fast, great when you want full control.
A minimal Chroma setup that stores chunks and their embeddings:
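import chromadb

# chunks and sources are assumed to be parallel lists from Step 1:
# chunks[i] is a passage, sources[i] is the file it came from.
client = chromadb.PersistentClient(path="./rag_store")  # persists to a local folder
col = client.get_or_create_collection("docs")
col.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=[embed(c) for c in chunks],
    documents=chunks,
    metadatas=[{"source": sources[i]} for i in range(len(chunks))],
)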
Note the metadatas field carrying source — that’s what lets you cite answers later. Storing the filename (and ideally a page number) per chunk is the difference between a trustworthy system and a black box.
Retrieval is one call. Embed the question with the same model, then ask for the nearest chunks:
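def retrieve(question: str, k: int = 5):
    """Return the k nearest chunks as (text, metadata) pairs."""
    res = col.query(query_embeddings=[embed(question)], n_results=k)
    # query() returns one result list per query embedding; we sent one, so take [0]
    return list(zip(res["documents"][0], res["metadatas"][0]))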
Start with k=5. Too few and you miss relevant context; too many and you reintroduce the “lost in the middle” problem you were trying to avoid.
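One optional refinement is to drop weak matches instead of always passing exactly k chunks. The sketch below assumes the query result also carries a distances list alongside documents and metadatas (Chroma includes it by default, as far as I know), where a smaller distance means a closer match. The helper name filter_hits and the cutoff value are hypothetical, and a sensible cutoff depends on the distance metric your collection uses.

```python
def filter_hits(res: dict, max_distance: float) -> list[tuple[str, dict, float]]:
    """Keep only results whose distance is at or below max_distance.

    `res` is shaped like a vector-store query result for a single query:
    each key maps to a list containing one inner list of results.
    """
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    dists = res["distances"][0]
    return [
        (doc, meta, dist)
        for doc, meta, dist in zip(docs, metas, dists)
        if dist <= max_distance
    ]


# Hand-written result in the assumed shape, so the sketch runs without a database.
fake_result = {
    "documents": [["The lease ends on 30 June.", "The office dog is called Biscuit."]],
    "metadatas": [[{"source": "lease.pdf"}, {"source": "handbook.pdf"}]],
    "distances": [[0.21, 1.37]],
}
kept = filter_hits(fake_result, max_distance=0.5)
print(kept)
```

If the filter leaves nothing, that is a useful signal in itself: the corpus probably does not contain the answer.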
Step 3: feed the retrieved chunks to a local model via Ollama
Now assemble a grounded prompt: the retrieved chunks as context, your question, and a tight instruction. Send it to a chat model on the same local API:
def answer(question: str) -> str:
    hits = retrieve(question)
    # Label each chunk with its source so the model can cite it.
    context = "\n\n".join(
        f"[{meta['source']}]\n{doc}" for doc, meta in hits
    )
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer isn't in the context, say you don't know. "
        "Cite the [source] for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    r = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    )
    r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return r.json()["response"]
That’s a working offline chat-with-your-documents loop in about 40 lines. Run a document folder through Steps 1–2 once, then call answer() as many times as you like. Everything (embeddings, vectors, generation) happens at 127.0.0.1.
If your generator runs out of room with five chunks stuffed in, you may need to raise Ollama’s context length or use a model with a larger native window. And if indexing a big corpus feels slow, that’s usually the embedding pass running on CPU; a GPU speeds it up dramatically.
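One way to raise the context length per request is the options field of the generate call, where num_ctx sets the window size in tokens. The sketch below only builds the request body. The helper name generate_payload is made up, and 8192 is an illustrative value, since what actually fits depends on the model and your memory.

```python
def generate_payload(prompt: str, model: str = "llama3.1:8b", num_ctx: int = 8192) -> dict:
    """Build a request body for Ollama's /api/generate with a larger context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_ctx is the context window in tokens; 8192 is an assumed example value
        "options": {"num_ctx": num_ctx},
    }


payload = generate_payload("Context:\n...\n\nQuestion: When does the lease end?")
print(payload["options"])
```

Passing this dictionary as the json argument of the existing requests.post call applies the setting for that one request. A larger window costs more memory, so raise it only as far as you need.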
Citations and avoiding hallucinated answers
The single biggest advantage of RAG over a bare chatbot is that you can ground and verify every answer. Three habits make a local RAG genuinely trustworthy:
- Carry the source through. Because each chunk kept its source metadata, your prompt can demand a [filename] citation per claim. An answer you can click back to a real page is an answer you can defend.
- Instruct the model to abstain. The prompt above says “if the answer isn’t in the context, say you don’t know.” This one line dramatically cuts confident fabrication. A local model with no retrieved support will happily invent; telling it that silence is acceptable is how you stop it.
- Show the retrieved chunks. Print the passages you fed in alongside the answer. If the answer cites something that isn’t in the chunks, you’ve caught a hallucination instantly. This is impossible with a cloud black box and trivial when you own the pipeline.
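The last habit can be partly automated. The sketch below is a minimal check, assuming the model cites sources in square brackets as the prompt in Step 3 asks: it extracts every bracketed label from the answer and reports any that do not match a retrieved source. The function name uncited_sources is hypothetical.

```python
import re


def uncited_sources(answer: str, retrieved_sources: set[str]) -> set[str]:
    """Return bracketed citations in the answer that match no retrieved source."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - retrieved_sources


reply = "The lease ends in June [lease.pdf]. Rent rises every year [budget.xlsx]."
retrieved = {"lease.pdf", "nda.pdf"}
suspicious = uncited_sources(reply, retrieved)
print(suspicious)
```

This only verifies the citation labels, not whether the cited chunk really supports the claim, so it complements reading the retrieved passages rather than replacing it.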
RAG doesn’t make a model incapable of hallucinating — nothing does. But grounding answers in retrieved, cited text and giving the model permission to say “I don’t know” turns it from a plausible-sounding guesser into something you can audit.
The privacy payoff: nothing ever leaves your machine
Here’s why people building this care about the local part. In a cloud doc-chat tool, your file lands in a third party’s storage, subject to their retention policy and breach exposure, because the service has to receive your text, and usually store it, in order to answer you. For most documents that’s an invisible risk; for an NDA, a medical record, a client file, or anything under a data-processing agreement, it’s a real liability.
A local RAG removes the risk category entirely, because the document never leaves your device. The embedding model, the vector store, and the generation model all run on 127.0.0.1. You can pull your network cable out and the whole system still works — the strongest possible proof that nothing is phoning home. (Curious how airtight Ollama really is? See is Ollama really private and the broader AI data privacy guide.)
This is precisely why offline RAG appeals to professionals who can’t legally paste files into a public chatbot, and why the private-document-chat DIY approach beats any cloud tier no matter how many “we don’t train on your data” promises it makes. With local, there’s no promise to trust — there’s no transmission in the first place. For the bigger picture, see local AI vs cloud AI.
No-code shortcut: when to use a ready-made tool instead
Building it yourself is the best way to understand RAG and to get exactly the chunking, retrieval, and prompting you want. But you don’t always need the control. If you just want to drop a folder of PDFs and start asking questions, a ready-made desktop app wraps all three pieces — embedding, vector store, generation — behind a UI and still runs everything locally on top of Ollama.
Use the no-code route when you want results today, when non-technical teammates need access, or when you don’t care to tune the pipeline. Use the DIY build in this guide when you need custom chunking, want to embed it in your own software, or simply want to know exactly what’s happening to your data. Our companion walkthrough, chat with your documents locally, shows the point-and-click version with AnythingLLM in about fifteen minutes — same local-RAG idea, zero code.
Either way, the foundation is the same: a capable local model that owns its own memory and never ships your words to anyone else’s servers. If you want that always-on private intelligence as a finished product — an uncensored companion that runs 100% on your own machine via Ollama, with nothing leaving your computer — that’s exactly what Ember is built to be.
