Running Local LLMs on Apple Silicon: An Agentic Workflow for M5 Max
Why run LLMs locally?
Cloud LLM APIs are convenient, but they come with latency, cost per token, rate limits, and data leaving your network. For agentic coding workflows, where an AI assistant makes dozens of tool calls per minute, local inference eliminates all of these constraints. With Apple Silicon's unified memory architecture, a single MacBook Pro can serve models that rival cloud APIs at interactive speeds.
This post documents the infrastructure I built to run local LLMs on an M5 Max as a persistent service powering my development workflow.
The hardware
| Field | Value |
|---|---|
| Machine | MacBook Pro (M5 Max) |
| Memory | 128 GB unified |
| GPU | 40 Metal cores |
| Bandwidth | 546 GB/s (theoretical) |
| Storage | 2 TB NVMe |
| OS | macOS 26.4 (Tahoe) |
The key insight with Apple Silicon: unified memory means the GPU can access the full 128 GB without PCIe bottlenecks. A model that would require a multi-GPU server on NVIDIA hardware fits entirely in a single laptop's memory.
The inference stack
I use llama.cpp (build b9430) with the Metal backend. It provides:
- Near-theoretical memory bandwidth utilization
- OpenAI-compatible API via
llama-server - Support for quantized models (Q4 through Q8)
- Flash Attention and KV cache quantization
- Speculative decoding (MTP and draft models)
The server runs as a launchd service, always on and always available on the local network.
Model selection: why MoE dominates on Apple Silicon
Token generation on Apple Silicon is entirely memory-bandwidth bound. The theoretical ceiling is:
tg_max = bandwidth / active_parameters_in_bytes
This means Mixture-of-Experts (MoE) models crush dense models on this hardware. An MoE model with 35B total parameters but only 3B active per token achieves the same generation speed as a 3B dense model while having the quality of a much larger model.
The difference is dramatic:
| Architecture | Active Params | tg128 Speed |
|---|---|---|
| MoE (3B active) | ~3B | 85 to 103 t/s |
| MoE (10B active) | ~10B | 35 t/s |
| Dense (27B) | 27B | 18 t/s |
Dense models are 3 to 6x slower for the same quality tier. On Apple Silicon, always prefer MoE.
Benchmark results
I benchmarked 8 models across prompt processing (512 to 65,536 tokens) and token generation (128 to 1,024 tokens). Here are the top 5 ranked by overall usability for coding, documentation, and troubleshooting:
| Rank | Model | Size | tg128 | tg1024 | pp65536 |
|---|---|---|---|---|---|
| 1 | Qwen 3.6 35B-A3B (UD-Q5_K_XL) | 24.76 GiB | 99 t/s | 82 t/s | 1,137 t/s |
| 2 | Qwen 3.5 35B-A3B (UD-Q5_K_XL) | 24.56 GiB | 99 t/s | 83 t/s | 1,013 t/s |
| 3 | Gemma 4 26B-A4B (Q8_0) | 25.00 GiB | 94 t/s | 92 t/s | 1,340 t/s |
| 4 | Qwen3-Next-80B-A3B (Q4_K_M) | 45.17 GiB | 90 t/s | 76 t/s | 817 t/s |
| 5 | gpt-oss-120b (MXFP4) | 59.02 GiB | 84 t/s | 80 t/s | 291 t/s |
All five models generate faster than most people can read. At 82 to 99 tokens per second, the output feels instantaneous for coding tasks.
Context scaling: the hidden differentiator
Raw generation speed doesn't tell the full story. Some models look fast at short context but collapse at longer sequences. I measure "context scaling retention," the percentage of pp512 speed the model retains at pp65536:
| Model | pp512 | pp65536 | Retention |
|---|---|---|---|
| Gemma 4 26B-A4B | 3,320 t/s | 1,340 t/s | 40% |
| Qwen 3.6 35B-A3B | 2,984 t/s | 1,137 t/s | 38% |
| Qwen3-30B-A3B | 3,204 t/s | 221 t/s | 7% |
Qwen3-30B-A3B is the cautionary tale: fastest generation (103 t/s) but completely unusable for long agent sessions because prompt processing collapses. Always benchmark at your actual working context length.
The agentic workflow
The local llama-server exposes an OpenAI-compatible API, accessible from any machine on my private network. I configure my coding assistant to use it as a provider:
{
"provider": {
"local-llama": {
"type": "openai",
"url": "http://<server-ip>:8080/v1",
"models": {
"qwen-35b-fast": {
"id": "qwen-35b-fast",
"name": "Qwen 3.6 35B-A3B (local)",
"contextLength": 131072,
"supportsTools": true
}
}
}
}
}This gives me:
- Zero latency to first token, with no network round trip to a cloud API
- No rate limits because the model handles back to back requests instantly
- 128k context so entire codebases fit in a single session
- Full privacy since code never leaves the network
- Zero cost because after hardware, inference is free
Deployment as a persistent service
The server runs via macOS launchd. It starts at boot, restarts on crash, and requires zero maintenance:
launchctl load ~/Library/LaunchAgents/com.local.llama-server.plistKey configuration flags:
| Flag | Purpose |
|---|---|
-ngl 99 | Offload all layers to Metal GPU |
-fa 1 | Flash Attention (required for KV quant) |
-c 131072 | 128k context window |
--cache-type-k q8_0 | Quantize KV cache keys (halves memory) |
--cache-type-v q8_0 | Quantize KV cache values |
--mlock | Pin model in RAM (prevents page out) |
--parallel 2 | Handle 2 concurrent requests |
--jinja | Enable chat template rendering |
Memory budget
With 128 GB unified memory and ~115 GB available for GPU workloads:
| Component | Size |
|---|---|
| Model weights | 24.76 GB (Qwen 3.6 35B-A3B) |
| KV cache (128k) | ~8 GB (q8_0, 2 slots) |
| Compute buffer | ~3 GB |
| OS overhead | ~10 GB |
| Total | ~46 GB |
| Headroom | ~69 GB free |
There's enough room to run a second model simultaneously for complex reasoning tasks, such as the 122B model at reduced context alongside the 35B.
What I learned
- MoE architecture is essential for Apple Silicon. Dense models can't compete on generation speed.
sudo purgebetween benchmarks is critical. Without it, residual memory from prior model loads can halve performance and produce misleading results (I initially measured Qwen 3.6 at 28 t/s instead of the real 99 t/s due to this).- Don't set
iogpu.wired_limit_mbon M5 Max. The default (0) gives you the most GPU memory. Setting any value reduces available memory. - Context scaling varies dramatically between models with identical architectures. Always benchmark at your target context length before deploying.
- Token generation stays above 75 t/s at tg1024 for all top 5 models. Long form output (documentation, reports) is fast enough to be practical.
What's next
I'm evaluating several candidate models:
- Qwen 3.6 35B-A3B with MTP (Multi-Token Prediction): same model but with prediction heads that draft 2 to 3 tokens per forward pass. Expected 1.5 to 1.8x generation speedup.
- Qwen3-Coder-30B-A3B: a coding specialized MoE. Need to verify context scaling doesn't collapse like its predecessor.
- Qwen3-Coder-Next-80B-A3B: SWE-bench 70.6%. Speed already known (~90 t/s), evaluating coding quality.
The goal is always the same: find the fastest model that doesn't compromise on quality for the specific workload. For agentic coding, that means fast generation, reliable context scaling, and good tool calling capabilities.
All benchmarks run on Apple M5 Max, 128 GB unified memory, llama.cpp b9430, Metal backend. Full reproducibility instructions and raw data available on request.