If you’ve gotten a model running with ollama run llama3 in your terminal, the natural next question is: how do I call this from my own code? Good news — Ollama exposes a local HTTP server the moment it starts, so any program on your machine can talk to it. No API keys, no rate limits, no internet round-trip. This guide takes you from “what’s even running” to a working, streaming, conversation-aware Python chat script you can paste and run today. Everything stays on 127.0.0.1 — your prompts never leave the box.

The two ways to talk to Ollama: REST API vs the ollama Python package

When Ollama is running, it listens on http://127.0.0.1:11434 (also reachable as localhost:11434). That’s the foundation for everything else. You have two practical ways to reach it from Python:

  1. The raw REST API. Ollama serves JSON over HTTP. You can hit it with requests, httpx, or even curl. This is the lowest-common-denominator approach — useful when you want zero dependencies or you’re calling from a language that has no official client.
  2. The official ollama Python package. A thin, well-typed wrapper around that same REST API. It handles the HTTP plumbing, JSON encoding, and streaming for you. For most Python projects, this is what you want.

Install the package with pip:

pip install ollama

A quick sanity check that the server is even up — this one line proves the loopback API is alive:

curl http://127.0.0.1:11434/api/tags

If that returns a JSON list of your installed models, you’re in business. If it errors, jump to the troubleshooting section below. (And if you haven’t installed Ollama itself yet, start with how to install Ollama — the package talks to the server, it doesn’t replace it.)
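The same check can be made from Python with only the standard library. This is a minimal sketch, assuming the default address used in this guide and assuming that /api/tags returns an object with a models list whose entries carry a name field. The helper names model_names and installed_models are illustrative, not part of any Ollama client.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:11434"  # default Ollama address used in this guide


def model_names(payload):
    """Pull model names out of an /api/tags response body (assumed shape)."""
    return [m["name"] for m in payload.get("models", [])]


def installed_models(base_url=BASE_URL, timeout=3.0):
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as r:
            return model_names(json.load(r))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None


if __name__ == "__main__":
    names = installed_models()
    print("server not reachable" if names is None else names)
```

Returning None rather than raising keeps the check easy to drop in at the top of a script before any model call is made.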

Here’s the same “hello” in both styles so you can see the relationship:

# Style 1: raw REST with requests
import requests

r = requests.post("http://127.0.0.1:11434/api/chat", json={
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hi in five words."}],
    "stream": False,
})
print(r.json()["message"]["content"])

# Style 2: the ollama package (same result, less boilerplate)
import ollama

resp = ollama.chat(model="llama3", messages=[
    {"role": "user", "content": "Say hi in five words."},
])
print(resp["message"]["content"])

Both hit the exact same endpoint. The package just saves you from writing the HTTP layer yourself.

/api/chat vs /api/generate — which to use

Ollama has two text endpoints, and beginners often pick the wrong one.

| Endpoint | Input shape | Use it for |
| --- | --- | --- |
| /api/chat | a messages list (role + content) | conversations, assistants, anything multi-turn, anything using a system prompt |
| /api/generate | a single prompt string | one-shot completions, raw text generation, fill-in tasks, embeddings-adjacent work |

Rule of thumb: use /api/chat unless you have a specific reason not to. Chat applies the model’s chat template automatically (the special tokens that tell an instruct-tuned model where the system, user, and assistant turns begin). generate gives you a more raw completion and is handy when you want full control over the prompt string, but you lose the automatic turn formatting.
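To make "chat template" concrete, here is a toy sketch of what applying one amounts to: flattening the messages list into a single prompt string with turn markers. The <|role|> and <|end|> markers and the render_chat function are invented for illustration. Real models each define their own markers, and /api/chat applies the right ones for you.

```python
def render_chat(messages):
    """Flatten a messages list into one prompt string with turn markers.

    The <|role|> / <|end|> markers are made up for illustration; they are
    not the tokens any particular model uses.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    # Leave an open assistant turn so the model knows it should answer next.
    parts.append("<|assistant|>\n")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Say hi."},
]
print(render_chat(messages))
```

With /api/generate you would be responsible for building a string like this yourself, which is why chat is the safer default.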

In the Python package those map cleanly to ollama.chat(...) and ollama.generate(...):

# generate: one prompt in, one completion out
import ollama

out = ollama.generate(model="llama3", prompt="Write a haiku about loopback addresses.")
print(out["response"])

Note that the response key differs: chat returns response["message"]["content"], while generate returns response["response"]. Mixing them up is one of the most common beginner bugs, and it surfaces as a KeyError.

The messages array and how to maintain a conversation

The model is stateless. Ollama does not remember your last message between calls — it only knows what you send it. To hold a conversation, you keep the running history and resend it every turn. That history is the messages list.

Each message is a dict with a role and content. Three roles cover ordinary conversation:

  • system: instructions that shape behavior (“You are a terse assistant.”). Goes first.
  • user: what the human said.
  • assistant: what the model said previously. You append the model’s own replies here so it has context for the next turn.

import ollama

messages = [
    {"role": "system", "content": "You are a concise, friendly assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

resp = ollama.chat(model="llama3", messages=messages)
reply = resp["message"]["content"]

# Append the model's reply so the NEXT turn has context
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "And its population?"})

resp = ollama.chat(model="llama3", messages=messages)
print(resp["message"]["content"])  # answers "its" correctly — it remembers Paris

The list grows with every exchange. That’s the entire trick to memory in a chat loop. (If you want memory that survives a restart — saved to disk and reloaded — that’s a separate design problem; see persistent memory for local AI.) Just be aware the whole list counts against the model’s context window, so very long conversations eventually need trimming or summarizing; if you hit limits, look at increasing the Ollama context window.
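One simple way to trim is to keep the system prompt and only the most recent messages. The sketch below is one possible approach, not something Ollama provides: trim_history and max_turns are hypothetical names, and it counts messages rather than tokens, so it only approximates staying inside the context window.

```python
def trim_history(messages, max_turns=10):
    """Keep system messages plus the most recent `max_turns` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_turns:] if max_turns > 0 else []
    # Drop a leading assistant reply whose question was cut off by the slice,
    # so the kept history still starts with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent
```

In a chat loop you could call messages = trim_history(messages) just before each ollama.chat call. Summarizing the dropped turns into a single message is the heavier alternative when older context still matters.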

Streaming responses vs waiting for the full reply

By default a chat call blocks until the model finishes, then hands you the complete reply. That’s fine for short answers, but for a 300-word response it feels like the app froze for ten seconds. Streaming fixes this: you get tokens as they’re generated, exactly like the word-by-word typing effect in ChatGPT.

Set stream=True and iterate over the result. Each chunk carries a slice of text:

import ollama

stream = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain TCP loopback in two sentences."}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()  # newline at the end

end="" keeps the tokens on one line; flush=True forces them to the terminal immediately instead of buffering. When streaming, you reconstruct the full reply by concatenating each chunk’s content — useful if you also need to append it to your messages history.
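If you are on the raw REST route instead of the package, the stream arrives as newline-delimited JSON: one object per line, with the last one marked "done": true. The sketch below assumes that wire shape and uses requests as in the earlier REST example. The names iter_content and stream_chat are illustrative helpers, not library functions.

```python
import json


def iter_content(lines):
    """Yield text slices from a newline-delimited JSON chat stream."""
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


def stream_chat(model, messages, url="http://127.0.0.1:11434/api/chat"):
    """Print a reply as it streams over raw REST and return the full text."""
    import requests  # third-party; imported here so iter_content stays dependency-free

    payload = {"model": model, "messages": messages, "stream": True}
    pieces = []
    with requests.post(url, json=payload, stream=True, timeout=120) as r:
        r.raise_for_status()
        for text in iter_content(r.iter_lines()):
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)
```

The returned string is the concatenated reply, ready to append to your messages history as the assistant turn.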

When to use which:

  • Stream for anything a human watches in real time (chat UIs, CLIs). It dramatically improves perceived speed even though total generation time is identical.
  • Don’t stream (stream=False) when you only need the final string — batch jobs, JSON extraction, automated pipelines, anything where you parse the whole output at once.

A complete copy-paste chat script

Here’s a real terminal chatbot that ties it all together: a system prompt, a growing message history, streaming output, and a clean exit. Save it as chat.py and run python chat.py.

import ollama

MODEL = "llama3"  # swap for any model you've pulled: qwen3, mistral, gemma3...

messages = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
]

print(f"Chatting with {MODEL}. Type 'exit' to quit.\n")

while True:
    user_input = input("you: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break
    if not user_input:
        continue

    messages.append({"role": "user", "content": user_input})

    print("ai: ", end="", flush=True)
    full_reply = ""
    stream = ollama.chat(model=MODEL, messages=messages, stream=True)
    for chunk in stream:
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full_reply += token
    print("\n")

    # Persist the assistant turn so context carries forward
    messages.append({"role": "assistant", "content": full_reply})

That’s a fully working local AI chat in about 25 lines, with conversation memory and live streaming, talking to a model running entirely on your own hardware. Change MODEL to anything you’ve pulled with ollama pull.

The OpenAI-compatible endpoint: reuse code written for the OpenAI SDK

This is the feature that saves the most time. Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1. That means code written for the official openai Python SDK works against your local model with a two-line change: point base_url at Ollama and pass any non-empty api_key (the value is ignored locally, but the SDK requires one).

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello from the OpenAI SDK!"}],
)
print(resp.choices[0].message.content)

The messages format is identical, streaming works (stream=True and iterate over chunk.choices[0].delta.content), and most existing OpenAI tutorials, LangChain integrations, and sample apps run unchanged. This is the fastest way to port an existing cloud app to local inference — and the fastest way to prove out an idea against ChatGPT-style code before deciding where to host it.
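One detail worth handling when streaming through the OpenAI SDK: a chunk's delta.content can be None (typically on the final chunk), so naive string concatenation can fail. Here is a small sketch of a tolerant collector. join_deltas and fake_chunk are hypothetical helpers, and the fake chunks only mimic the attribute layout described above so the example runs without a server.

```python
from types import SimpleNamespace


def join_deltas(chunks, echo=False):
    """Concatenate delta.content across streamed chunks, skipping empty ones."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content may be None or ""
            if echo:
                print(text, end="", flush=True)
            pieces.append(text)
    return "".join(pieces)


def fake_chunk(text):
    """Stand-in for an SDK chunk so the helper can be tried offline."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]
print(join_deltas(demo))  # Hello
```

Against a real server you would pass the object returned by client.chat.completions.create(..., stream=True) in place of demo, with echo=True to print as it arrives.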

Coverage is good but not 100%. The chat and completions endpoints are solid; some newer or niche OpenAI parameters may be ignored or unsupported. For Python-native projects, the dedicated ollama package exposes more Ollama-specific options (like per-call model parameters and Modelfile features). Use the OpenAI shim for portability, the native client for full control.
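As an example of per-call model parameters, the native API accepts an options object alongside the messages. The sketch below builds such a request body. build_chat_payload is a hypothetical helper, and temperature and num_ctx are used here on the assumption that your model and Ollama version honor them.

```python
def build_chat_payload(model, messages, *, temperature=None, num_ctx=None, stream=False):
    """Build a /api/chat request body, adding per-call options only when set."""
    options = {}
    if temperature is not None:
        options["temperature"] = temperature
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window size for this call

    payload = {"model": model, "messages": messages, "stream": stream}
    if options:
        payload["options"] = options
    return payload
```

You could send the result with requests.post(url, json=payload), or pass the same dictionary of options to the package as ollama.chat(model=..., messages=..., options={...}).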

Common errors and where to fix them

Connection refused / Max retries exceeded on port 11434

The Ollama server isn’t running, or nothing is listening on that port. Start it with ollama serve; on macOS and Windows, running ollama run <model> once also launches it. On a default install it auto-starts as a background service. If it’s running but you still can’t connect, you may have a host/port or firewall mismatch; there’s a full walkthrough in fixing Ollama connection refused on 11434.

model not found / model "x" not found, try pulling it first

You referenced a model you haven’t downloaded. List what you actually have:

ollama list

Then pull the one you want:

ollama pull llama3

Model names are exact — llama3 and llama3:8b and llama3:70b are different tags. Use the precise string from ollama list in your model= argument.

Hangs forever / extremely slow first token

The first call after starting loads the model into memory, which can take several seconds (longer for big models on modest hardware). If every call is slow, the model may be too big for your VRAM and spilling to system RAM; see why Ollama is slow and how to speed it up.

KeyError: 'message' or 'response'

You’re reading the wrong key. chat returns ["message"]["content"]; generate returns ["response"]. The OpenAI shim returns .choices[0].message.content.
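If a script calls both endpoints, a small accessor can take the guesswork out of which key to read. This is a sketch for plain dict responses such as r.json() from the REST examples. extract_text is a hypothetical helper, and it has not been checked against the package's own response objects.

```python
def extract_text(resp):
    """Return the reply text from a chat-style or generate-style response dict."""
    if "message" in resp:  # /api/chat shape
        return resp["message"]["content"]
    if "response" in resp:  # /api/generate shape
        return resp["response"]
    raise KeyError("expected a 'message' or 'response' key, got: " + ", ".join(sorted(resp)))
```

The explicit KeyError lists the keys that were actually present, which usually points straight at an error body or the wrong endpoint.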

Pointing the same code at a hosted endpoint when you have no GPU

Here’s the quiet superpower of writing against the OpenAI-compatible interface: the same code runs anywhere that speaks that protocol. Local Ollama is just one such endpoint. If your machine has no capable GPU — or you want your script to run on a laptop, a cheap server, or in CI — you can swap the base_url (and a real key) and point at a hosted model instead, with zero changes to your prompt logic, your messages array, or your streaming loop.

import os  # only needed for the hosted variant below

from openai import OpenAI

# Local, private, free: runs on your hardware
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Same code, different endpoint: a hosted provider when you have no GPU
# client = OpenAI(base_url="https://your-host/v1", api_key=os.environ["API_KEY"])

That’s the whole point of building against a stable interface: you develop and test locally where it’s private and costs nothing, then choose your deployment target later. If running models locally isn’t realistic for your hardware, running local AI without a GPU covers the honest trade-offs (CPU inference is slow; small models help) — and sometimes the right answer is simply to use a hosted endpoint that’s already set up for you.
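One way to make that swap without editing code is to read the endpoint from the environment and fall back to local Ollama. This is a sketch under assumptions: the variable names LLM_BASE_URL and LLM_API_KEY are invented for this example, and endpoint_config is a hypothetical helper.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def endpoint_config(env=None):
    """Pick base_url and api_key from the environment, defaulting to local Ollama.

    LLM_BASE_URL and LLM_API_KEY are made-up names for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_BASE_URL),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # placeholder key for local use
    }
```

With that in place, client = OpenAI(**endpoint_config()) talks to local Ollama by default and to a hosted provider whenever both variables are set, for example in CI.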

If you want a companion experience rather than a developer toolkit — the same kind of always-available AI chat, but with zero setup, no pip install, and no GPU to feed — Freya runs entirely in the cloud and is ready the moment you open it. Build local when you want ownership; reach for hosted when you just want it to work today.