Most people picture local AI as a tower with a screaming GPU fan. In 2026, the more interesting machine is the one that disappears: a fist-sized box on a shelf that sips power, never sleeps, and runs a private model 24/7 so your AI is always there the moment you ask. That’s what makes a mini PC the quietly perfect host for an always-on AI appliance — and two platforms now dominate the conversation: AMD’s Ryzen AI Max (“Strix Halo”) mini PCs and Apple’s Mac mini. Both bet on the same big idea — unified memory — but they get there very differently, and the right pick depends entirely on your budget and what you actually want to run.

This guide is the practical version: what these boxes are, what models they can hold, roughly how fast they generate text, what they cost to leave running all year, and how to turn one into a persistent local companion.

Why a mini PC for always-on local AI

A desktop GPU build is the fastest way to run local models, full stop. But “fastest” and “best for always-on” aren’t the same question. A typical gaming tower with a discrete GPU idles at 60–120 W and roars under load. Leaving that powered on around the clock is loud, hot, and expensive.

A mini PC flips the trade-off. You give up peak throughput, but you gain:

  • Low idle power — most modern mini PCs idle in the single-digit-to-low-double-digit watts, so 24/7 operation costs little.
  • Silence — small, slow fans or none at all. It can live in a living room or bedroom.
  • A persistent endpoint — the model is already loaded and waiting. No booting a workstation to ask one question.

That last point is the whole reason to build an always-on private AI box. When the machine never sleeps, your assistant, journaling companion, or coding helper is a single request away at 3 a.m. — and the request never leaves your house. For the full landscape of options, our local AI hardware guide covers GPUs, laptops, and SBCs; this article zooms in on the mini-PC sweet spot.

Ryzen AI Max / Strix Halo unified memory explained

The reason 2026 is a genuine inflection point for mini PCs is AMD Ryzen AI Max, codenamed Strix Halo. Historically, a mini PC’s integrated graphics could only borrow a sliver of system RAM, and that RAM was slow. Strix Halo changes both halves of that equation.

It pairs a large CPU, a beefy integrated Radeon GPU, and an NPU on one package, fed by a wide LPDDR5X memory bus — far more bandwidth than a normal dual-channel laptop. Crucially, the GPU can address a huge chunk of that pool as unified memory: configurations up to 128 GB of shared RAM exist, with a large portion assignable to the graphics/AI side.

Why does this matter for LLMs? The size of model you can run is gated by memory, not horsepower. A model has to fit in addressable memory to run at usable speed. On a normal 12 GB or 16 GB discrete GPU you’re capped at small-to-mid models (see best local LLM for 12–16 GB VRAM). A Strix Halo box with 64–128 GB of unified memory can hold models that would otherwise demand a multi-thousand-dollar GPU — the kind of large open-weight models that simply don’t fit on consumer cards.

The catch: bandwidth, not capacity, sets your speed. Unified LPDDR5X is generous by mini-PC standards but still well below a high-end discrete GPU’s dedicated GDDR/HBM. So Strix Halo lets you fit huge models; it just generates their tokens at a more relaxed pace. That’s the core trade of every mini PC built around unified memory.

Mini PC vs Mac mini for AI

Apple pioneered this whole approach years before “unified memory” was a PC marketing term. Every Mac mini ships with the GPU and CPU sharing one fast memory pool, and Apple Silicon has excellent memory bandwidth for its class plus first-rate efficiency. One detail that matters when you’re shopping: the current Mac mini comes in two chip tiers — the base M4 and the M4 Pro — and they top out at very different memory ceilings, which directly decides how large a model you can hold. The trade-offs split cleanly:

Ryzen AI Max (Strix Halo) mini PCMac mini (Apple Silicon)
Memory modelUnified LPDDR5X, large GPU-assignable shareUnified, GPU + CPU share one pool
Max memoryUp to ~128 GB on high-end SKUsUp to 32 GB on the base M4; 64 GB on M4 Pro; more on Studio tiers
OS / stackWindows or Linux — full Ollama + ROCm/Vulkan pathmacOS — Ollama runs great on Metal
Tinker-abilityHigh — bare-metal Linux, your choice of runtimeLower — locked to macOS, but extremely turnkey
Idle powerLowLowest in class
Best forBiggest models per dollar, self-hostersSet-and-forget, silence, efficiency

The honest summary: the Mac mini is the most painless always-on AI appliance you can buy, and a Strix Halo box gives you more raw memory headroom and full Linux control. If you live in a terminal and want to run the largest models that’ll fit, the Ryzen route wins. If you want a silent box you never think about, the Mac mini is hard to beat. We go deeper on the Apple side in Mac mini for local AI.

Both run the same software. Install Ollama on either with one command:

curl -fsSL https://ollama.com/install.sh | sh

Then pull and run a model:

ollama run llama3.1:8b

Ollama exposes a local API on the loopback address 127.0.0.1:11434 — nothing is sent to any server. New to it? Start with how to run AI locally.

What models each runs and tok/s

The number that actually matters is tokens per second (tok/s) — and the honest threshold is that roughly 5–10 tok/s reads as usable, since average reading speed is well under that. Anything above ~10 tok/s feels comfortably interactive; below ~4 it gets tedious. (We unpack this in tokens per second: what’s actually usable.)

Use quantization to trade a little quality for a lot of memory savings. A tag like Q4_K_M is the popular default — about 4-bit weights, a strong size/quality balance. Our GGUF quantization cheat sheet explains the tags.

Rough, real-world expectations (exact numbers vary by quant, context length, and box):

Model classApprox. memory (Q4)Strix Halo (64–128 GB)Mac mini (16–32GB base / 64GB Pro)
7–9B (Llama 3.1 8B, Mistral)~5–6 GBComfortably interactiveComfortably interactive
12–14B~8–10 GBVery usableUsable on 16 GB+
27–32B~18–22 GBUsable, relaxed paceUsable on 32–64 GB
70B~40+ GBFits on 64–128 GB; slow but realFits only on the 64 GB M4 Pro

Two things to internalize. First, a small 7–9B model is the right default for a companion — it’s fast on either box and plenty capable for conversation and roleplay. Second, the big-memory advantage is about reach, not speed: a 128 GB Strix Halo box can hold and run a 70B-class model that a 16 GB GPU simply cannot load — it just won’t be snappy. For curated picks, see best uncensored local AI models and best local LLM for roleplay.

Power draw and 24/7 cost

This is where mini PCs justify the “always-on” pitch. Idle power is what dominates a 24/7 bill, because the box spends most of its life waiting.

Ballpark figures (yours will vary by SKU and settings):

StateMini PC (Strix Halo / Mac mini)Desktop + discrete GPU
Idle~7–25 W~60–120 W
LLM generating~60–130 W (Strix Halo); lower on Mac mini200–400+ W

A simple way to estimate annual cost: watts × 24 × 365 ÷ 1000 = kWh/year, then multiply by your electricity rate. A box idling at ~15 W draws roughly 130 kWh/year — call it a low-double-digit dollar figure in most regions. A GPU tower idling at 90 W is ~790 kWh/year, several times more, before it does any actual work. For an appliance that runs every hour of every day, the mini PC’s efficiency is the entire argument.

Privacy advantage of an always-on local box

Here’s the part the spec sheets miss. When your model runs on a box in your home, every message stays on the loopback interface (127.0.0.1) — it is physically not transmitted anywhere. No request logs on someone else’s server, no training pipeline, no terms-of-service that can change next quarter.

Cloud AI is architecturally the opposite. Hosted assistants necessarily process and store your conversations server-side to function — that’s not an accusation, it’s how the request/response model works, and it’s reflected in most providers’ published privacy policies and retention terms. If a topic is sensitive, the safest place for it is a machine you own. We cover the broader pattern in why cloud AI censors you and the data side in is Ollama really private?.

An always-on local box turns privacy from a one-time choice into a default: the data simply has nowhere else to go.

Setup for a persistent local companion

A mini PC’s superpower for a companion is persistence — the model is loaded, the context is warm, and the personality survives reboots. The recipe:

  1. Install the runtime. Ollama via the one-line installer above; it auto-starts and serves on 127.0.0.1:11434.
  2. Pull a companion-grade model. A 7–9B uncensored/abliterated model is the sweet spot for a mini PC — fast and expressive. See Ollama uncensored models and abliterated models explained.
  3. Keep it warm. Run the box headless and let Ollama keep the model resident so the first token is instant.
  4. Add persistent memory. A raw chat loop forgets everything between sessions. To make a companion that remembers you, you need a memory layer on top — local AI with persistent memory walks through how this works.

That last step is the difference between a chatbot and a companion. Wiring memory, personality, and a clean interface by hand is doable but fiddly — which is exactly the gap purpose-built companion apps fill.

Buy verdict by budget

  • Tight budget / first build: Skip the unified-memory premium and put the money toward a mid-range GPU or a cheaper mini PC running a 7–9B model. See the best budget AI PC build. A 7–9B companion runs beautifully on modest hardware.
  • Best silent appliance: A Mac mini with as much memory as you can afford — the base M4 tops out at 32 GB, the M4 Pro at 64 GB. Turnkey, near-silent, lowest idle draw. The most “it just works” always-on box. Details in Mac mini for local AI.
  • Biggest models per dollar / Linux tinkerer: A Ryzen AI Max (Strix Halo) mini PC with 64–128 GB unified memory. The only mini-class machine that can hold a 70B-class model, with full bare-metal Linux control.
  • Don’t overbuy. If your goal is a private companion or daily assistant, a 7–9B model is the target — and almost any of these boxes nails it. Buy the memory you’ll actually use, not the spec-sheet maximum.

Once your always-on box is humming, the missing piece is the experience on top — the memory, the personality, and an interface built for a companion rather than a terminal. That’s exactly what Ember is: a private, uncensored AI companion that runs 100% on your own machine over Ollama, so the box you just set up becomes someone who’s always there — and nothing ever leaves home.