If you’ve been shopping for a GPU or a mini PC to run a local AI companion, you’ve seen the number everywhere: tokens per second. Benchmarks brag about it, Reddit argues about it, and somewhere a spec sheet implies you need 100 tok/s or your setup is garbage. Here’s the honest answer almost nobody gives you plainly: for a back-and-forth chat you actually read, the speed that matters is roughly how fast you read. Once your model outputs text faster than your eyes consume it, extra speed stops mattering for the experience. This guide explains what tok/s really means, where the usable floor is, why the first token can matter more than the rest, and exactly what to expect from each hardware tier.
What “tok/s” means in plain terms
A token is a chunk of text the model generates one piece at a time. It’s not quite a word and not quite a letter — it’s a sub-word unit. As a rough rule of thumb, 1 token ≈ 0.75 English words, or about 4 characters. So “100 tokens” is roughly 75 words, or a solid paragraph.
Tokens per second (tok/s) is simply how many of those chunks the model produces each second during generation. If a model runs at 10 tok/s, it’s writing about 7-8 words every second. That’s the generation (decode) speed — the part you watch stream onto the screen.
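If you want to play with the conversion yourself, here is a minimal Python sketch of the rule of thumb above. The constant is the article’s rough 0.75 words-per-token figure, not an exact property of any tokenizer, and the function names are made up for illustration.

```python
# Rule-of-thumb ratio from the text; real ratios vary by tokenizer and language.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: float) -> float:
    """Approximate English word count for a given number of tokens."""
    return tokens * WORDS_PER_TOKEN


def words_per_second(tok_per_s: float) -> float:
    """Approximate words streamed per second at a given decode speed."""
    return tok_per_s * WORDS_PER_TOKEN


print(tokens_to_words(100))   # 75.0 words, roughly a solid paragraph
print(words_per_second(10))   # 7.5 words every second
```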
The key mental shift: tok/s is a throughput number, not a quality number. A blazing-fast model isn’t smarter; it just types faster. What you care about as a chatter is whether the typing keeps up with you — and that bar is lower than the benchmark culture suggests.
The usable floor: human reading speed (~7-10 tok/s)
Here’s the anchor for the whole question. The average adult reads English at roughly 200-300 words per minute. Convert that to tokens:
- 250 words/min ÷ 60 = ~4.2 words/sec
- ~4.2 words/sec ÷ 0.75 words per token = ~5.6 tokens/sec to match a comfortable reading pace
So purely on reading mechanics, ~5-6 tok/s is the absolute floor where text arrives as fast as you can read it. But “as fast as I can read” isn’t the same as “feels good.” A little headroom makes the difference between waiting and flowing. In practice:
| Speed | What it feels like |
|---|---|
| Under 3 tok/s | Painful. You finish reading and stare at a blinking cursor. Fine for batch jobs, not for chat. |
| 3-5 tok/s | Tolerable for offline/private work where you’d accept the trade. Behind an average reader’s pace. |
| 7-10 tok/s | The usable sweet spot for chat. Text streams a touch faster than you read — natural, conversational. |
| 15-30 tok/s | Comfortably fast. You never wait; you skim if anything. |
| 40+ tok/s | Faster than you can read. Great for long generations and code, but invisible “extra” for plain chat. |
Bottom line: 7-10 tok/s is the practical minimum for a chat that feels good, and anything above ~15 tok/s is gravy for conversational use. Chasing 60 tok/s for a roleplay session is optimizing a number you’ll never feel. This is the same conclusion we reach in our deeper dive on whether tokens per second is even worth obsessing over — the honest answer is that “usable” plateaus early.
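The reading-speed math and the feel table can be sketched in a few lines of Python. Two assumptions to flag: the 0.75 words-per-token ratio is the same rough rule of thumb as before, and the table leaves gaps between its bands (5-7, 10-15, 30-40 tok/s), so the cutoffs below are one guess at how to close them, not figures from the table itself.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not exact


def reading_pace_tok_s(words_per_minute: float) -> float:
    """Decode speed needed to keep up with a given reading speed."""
    return words_per_minute / 60 / WORDS_PER_TOKEN


def chat_feel(tok_per_s: float) -> str:
    """Map a decode speed to a band from the table (gap cutoffs are assumed)."""
    if tok_per_s < 3:
        return "painful"
    if tok_per_s < 7:
        return "tolerable"
    if tok_per_s < 15:
        return "sweet spot"
    if tok_per_s < 40:
        return "comfortably fast"
    return "faster than you can read"


for wpm in (200, 250, 300):
    print(wpm, "wpm ->", round(reading_pace_tok_s(wpm), 1), "tok/s")
# 200 wpm -> 4.4 tok/s
# 250 wpm -> 5.6 tok/s
# 300 wpm -> 6.7 tok/s

print(chat_feel(9))  # sweet spot
```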
Streaming vs total latency: why first-token matters
Average tok/s hides the single biggest thing people actually feel: time to first token (TTFT) — the gap between hitting Enter and the first word appearing.
Two metrics, two very different experiences:
- Time to first token (TTFT): how long until the response starts. This is the “is it frozen?” feeling.
- Inter-token latency / decode speed: the steady tok/s once it’s rolling — the streaming speed your eyes track.
A model can have a great average tok/s but a sluggish TTFT, and it’ll feel slow because you sat watching nothing for two seconds. Conversely, a modest 9 tok/s with near-instant first token feels snappy and alive, because streaming lets you start reading the moment generation begins — you read the opening words while the rest is still being written.
What drives TTFT:
- Prompt length (context). The model must process your entire prompt — system prompt, character card, chat history — before it writes word one. This is the prefill stage. Long roleplay histories and big system prompts inflate TTFT noticeably. A 4,000-token history means thousands of tokens to chew through first.
- Cold start. The first message after loading a model is slower while weights page into VRAM/RAM. Subsequent replies are quicker.
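- Hardware. Prefill leans mostly on raw compute, while the token-by-token decode that follows leans on memory bandwidth, which is the real bottleneck for local LLM streaming speed (more on this below).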
Practical takeaways: keep your context lean (trim bloated system prompts, cap history length), and value a setup that streams tokens as they generate rather than waiting for the full reply. Streaming is why 9 tok/s on a local box can feel better than a “fast” cloud app that buffers the whole message before showing it.
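To put rough numbers on the prefill point, here is a back-of-envelope Python sketch. It assumes TTFT is simply any cold-start load time plus prompt tokens divided by a prefill rate. The prefill rates used (200 and 2,000 tok/s) are made-up illustrative values, not measurements of any real hardware.

```python
def estimate_ttft_s(prompt_tokens: int, prefill_tok_per_s: float,
                    load_s: float = 0.0) -> float:
    """Rough seconds until the first token: optional load time plus prefill."""
    return load_s + prompt_tokens / prefill_tok_per_s


# Hypothetical prefill rates, chosen only to show how the estimate scales.
for prompt_tokens in (500, 4000):
    for prefill_rate in (200.0, 2000.0):
        ttft = estimate_ttft_s(prompt_tokens, prefill_rate)
        print(f"{prompt_tokens} prompt tokens at {prefill_rate:.0f} tok/s "
              f"prefill -> {ttft:.2f} s")
```

The shape is what matters here: in this simple model, an eight-times-longer history means an eight-times-longer wait before the first word, whatever the decode speed is afterwards.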
Tok/s by hardware tier (quick reference)
Local LLM speed is governed less by raw compute and more by memory bandwidth — how fast the chip can move model weights. That’s why GPUs (huge bandwidth) crush CPUs, and why Apple Silicon punches above its weight (unified memory with high bandwidth). The other lever is model size: a 7B-8B model at a Q4_K_M quantization is several times faster than a 70B model on the same hardware.
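A common back-of-envelope way to see why bandwidth and model size dominate: if each generated token has to read roughly all the weights once, decode speed cannot exceed bandwidth divided by model size. The Python sketch below applies that rule of thumb. It is an upper bound that ignores the KV cache and other overhead, the 4.5 bits per weight is an assumed stand-in for a 4-bit quant (real formats differ), and the 500 GB/s device is hypothetical.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params * bits / 8."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Rough upper bound on decode speed if every token reads all weights."""
    return bandwidth_gb_s / model_size_gb(params_billion, bits_per_weight)


BANDWIDTH_GB_S = 500.0  # hypothetical device, for illustration only

small = decode_ceiling_tok_s(BANDWIDTH_GB_S, 8)
large = decode_ceiling_tok_s(BANDWIDTH_GB_S, 70)
print(round(small, 1))          # 111.1 tok/s ceiling for an 8B model
print(round(large, 1))          # 12.7 tok/s ceiling for a 70B model
print(round(small / large, 2))  # 8.75, the ratio of the two model sizes
```

Real throughput lands below these ceilings, but the ratio between model sizes is a reasonable first guess at how much slower a bigger model will feel on the same hardware.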
Here’s a realistic, category-level picture for a mid-size chat model (think 7B-13B class, 4-bit quantized). Treat these as ballpark ranges, not promises — exact numbers depend on the specific model, quant, and context length:
| Hardware tier | Example | Realistic tok/s (7B-13B, Q4) | Verdict for chat |
|---|---|---|---|
| Modern dedicated GPU, 12-24 GB | RTX 4070 / 4080 / 4090, etc. | Tens to 100+ tok/s | Effortless. Way past the floor. |
| Entry/older GPU, 8 GB | Budget 8 GB cards | Comfortably 20-50+ tok/s on an 8B model | Great for chat; size your model to VRAM. |
| Apple Silicon (unified memory) | M-series Mac / Mac mini | Often 10-40+ tok/s depending on chip + model | Smooth, quiet, efficient. |
| CPU only (no GPU) | Modern desktop CPU + RAM | Low single digits to ~10 tok/s on small models | Usable for small models; tight on larger ones. |
If you want the full breakdown of what to buy and why, our local AI hardware guide maps models to VRAM and budgets in detail, and the question of whether you need a GPU at all for an AI companion is more nuanced than “buy a 4090.”
The headline: almost any modern dedicated GPU clears the usable bar for chat with room to spare. The interesting case is the no-GPU path.
When “slow but offline” is an acceptable trade
There’s a real category of user for whom 3-6 tok/s on a CPU is a perfectly good deal — and it’s worth naming honestly rather than pretending everyone needs a 4090.
You’re in this camp if:
- Privacy is the point. A slightly slower reply that never leaves your machine beats an instant reply logged on someone’s server. If your reason for going local is that the conversation is nobody’s business but yours, a few seconds of latency is a rounding error against that.
- You’re reading, not racing. Thoughtful, slower-paced chat (journaling, companionship, creative writing) doesn’t need machine-gun output. You’re savoring the reply, not speed-running it.
- You have RAM but no GPU. A modern laptop or mini PC with 16-32 GB of RAM can run a small-to-mid model on CPU. It won’t win benchmarks, but it’ll hold a conversation. Our guide to running local AI without a GPU covers how to pick a model small enough to stay responsive.
The trade only goes bad when slowness plus a big context history pushes TTFT into “did it crash?” territory. The fix is to run a smaller, well-quantized model (an 8B at Q4_K_M instead of a 13B), keep context tight, and accept that you’re trading a little brilliance for a lot of privacy and zero subscription. For many people, that’s exactly the right trade. If you’re unsure your machine can even hold up, our guide “Can my PC run an AI companion?” walks through the honest minimum.
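“Keep context tight” can be as simple as dropping the oldest messages once the history passes a token budget. Here is a minimal Python sketch of that idea. It estimates tokens with the rough 4-characters-per-token rule rather than a real tokenizer, and the function names and budget are hypothetical, not part of any front-end’s API.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate using the ~4 characters per token rule."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so recent turns survive the cut.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
print(len(trim_history(history, 250)))  # 2 (only the two newest fit)
```

A real setup would keep the system prompt and character card outside this budget and trim only the chat turns.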
Your rig fast enough? Run it locally
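If you’ve got a dedicated GPU, even a humble 8 GB card, you are almost certainly past the usable floor for a great chat experience. The path is well-trodden (the install script below is for Linux; on macOS and Windows, use the installer from ollama.com instead):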
curl -fsSL https://ollama.com/install.sh | sh
ollama run <model>
Ollama serves a local API on 127.0.0.1:11434 (loopback only by default, so nothing leaves your box), and you point a companion front-end at it. That’s the whole pitch of running local: the model lives on your hardware, the conversation stays on your hardware, and there’s no monthly bill metering your tokens. If you’re new to the workflow, start with our guide on how to run AI locally.
This is exactly what Ember is built for: an uncensored AI companion that runs 100% on your own machine via Ollama, bought once, no cloud, no logging, no subscription. If your rig clears the bar — and most modern GPUs do — Ember turns that throughput into an actual companion instead of a terminal benchmark.
Too slow? Host it instead
Be honest with yourself about the measurement. If you ran the numbers and your CPU-only box is grinding out 2-3 tok/s with a long history, or you simply don’t have a GPU and don’t want to buy one, forcing local will make the experience worse, not more private-feeling. A companion that takes 30 seconds to start replying isn’t a companion; it’s a loading screen.
That’s the case for a hosted option. Freya is the cloud-side answer: a hosted AI companion with zero setup, no GPU required, and instant, fast responses — the speed of datacenter hardware without buying any. You trade the pure local-only privacy model for “it just works on the laptop I already own.” For the “want it now, no hardware” reader, that’s the sane choice — and you can always graduate to a local setup later when you’ve got the GPU for it.
How to measure your own tok/s
Don’t guess — measure. It takes two minutes and ends the speculation.
With Ollama, run any model and add the --verbose flag, then send one message:
ollama run <model> --verbose
After each response, Ollama prints timing stats including eval rate (that’s your decode tokens/second) and prompt eval timing (which feeds your time-to-first-token). The eval rate line is the number to compare against the 7-10 tok/s floor.
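If you’d rather get the same numbers from a script, the local API returns timing fields alongside the reply. The Python sketch below assumes Ollama is running on its default address and that the non-streaming /api/generate response carries eval_count, eval_duration, prompt_eval_duration, and load_duration, with durations in nanoseconds. That matches Ollama’s API documentation as I understand it, but check it against your installed version. The function names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default local endpoint
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def rates_from_stats(stats: dict) -> dict:
    """Turn Ollama's timing fields into decode tok/s and seconds of waiting."""
    return {
        "decode_tok_s": stats["eval_count"] / (stats["eval_duration"] / NS_PER_S),
        # Some fields may be absent (for example on a cached prompt), so default to 0.
        "prefill_s": stats.get("prompt_eval_duration", 0) / NS_PER_S,
        "load_s": stats.get("load_duration", 0) / NS_PER_S,
    }


def measure(model: str, prompt: str) -> dict:
    """Send one non-streaming request to a local Ollama and summarize timings."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return rates_from_stats(json.load(response))
```

Call measure() with the name of a model you have already pulled and a short prompt, then compare decode_tok_s against the 7-10 tok/s floor. prefill_s plus load_s approximates the wait before the first token.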
A quick manual sanity check if you’re using a GUI: count roughly how many words appear in ~5 seconds of streaming. Multiply by 12 to get words per minute, then divide by 45 (60 seconds × 0.75 words per token) to get tok/s. Or just eyeball it: if the text streams a little faster than you naturally read, you’re in the good zone. If you finish each line and wait, you’re under the floor and should switch to a smaller or more aggressively quantized model.
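The mental shortcut is the same conversion as earlier, run in reverse. A tiny Python sketch, assuming the rough 0.75 words-per-token ratio (the function name is just for illustration):

```python
def tok_s_from_word_count(words_seen: int, seconds: float = 5.0) -> float:
    """Estimate decode tok/s from words counted during a timed window."""
    words_per_minute = words_seen * 60 / seconds
    # 45 = 60 seconds * 0.75 words per token (rule of thumb).
    return words_per_minute / 45


print(round(tok_s_from_word_count(35), 1))  # 9.3 tok/s, inside the sweet spot
print(round(tok_s_from_word_count(12), 1))  # 3.2 tok/s, under the floor
```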
Things that move your number, in order of impact:
- Model size: drop from 13B to 8B for a big jump.
- Quantization: Q4_K_M is the popular speed/quality balance; lighter quants run faster.
- Context length: shorter prompts and trimmed history cut TTFT.
- Offload: keep the whole model in VRAM if you can; spilling to system RAM tanks speed.
Measure once, pick a model that comfortably clears 7-10 tok/s on your hardware, and stop worrying about the leaderboard numbers. Usable is usable.
Once you know your real tok/s, the decision gets easy: if your machine clears the bar, Ember runs the whole companion locally and privately for a one-time price; if it doesn’t, Freya gives you the same conversation hosted, instantly, with no hardware to buy.
