Local Large Language Model Inference on Apple Silicon for Agentic Software Engineering Workflows

Abstract

Cloud-hosted large language model (LLM) APIs introduce latency, per-token cost, rate limiting, and data exfiltration risks that are particularly problematic for agentic software engineering workflows, where an AI assistant may issue dozens of tool calls per minute. Apple Silicon's unified memory architecture (UMA) offers an alternative: local inference with GPU-accessible memory pools large enough to host state-of-the-art models without the PCIe bottlenecks inherent in discrete GPU systems.

This work presents a systematic empirical evaluation of eight LLMs on an Apple M5 Max with 128 GB unified memory and 546 GB/s theoretical memory bandwidth, measuring prompt processing throughput across six context lengths (512 to 65,536 tokens) and token generation speed across four output lengths (128 to 1,024 tokens) using llama.cpp (build b9430) with the Metal backend. All benchmarks follow a controlled protocol including mandatory memory cache clearing (sudo purge) between runs, thermal cool-down periods, and median-of-three reporting.

The results demonstrate that Mixture-of-Experts (MoE) models with low active parameter counts (3 to 4B) achieve 85 to 103 tokens/s during autoregressive generation, a 3 to 6x throughput advantage over dense transformer models of comparable quality, confirming the theoretical prediction of the memory-bandwidth-bound throughput ceiling. The top-ranked model, Qwen 3.6 35B-A3B at UD-Q5_K_XL quantization (24.76 GiB), sustains 99 t/s at tg128 and 82 t/s at tg1024 while processing 65k-token prompts at 1,137 t/s, all within a 46 GB memory footprint that leaves 69 GB of headroom for concurrent workloads.

This paper introduces context scaling retention (CSR), the ratio of long-context to short-context prompt processing throughput, as a diagnostic metric for production model selection. CSR varies from 6.9% to 40.4% across architecturally similar MoE models, revealing that short-context benchmarks alone are insufficient for evaluating models intended for agentic use. The paper documents two macOS-specific benchmarking pitfalls: unified file cache contamination and the counter-intuitive iogpu.wired_limit_mb parameter. These pitfalls can introduce measurement errors exceeding 70% if uncontrolled. Finally, this work presents a complete deployment architecture using macOS launchd and quantized KV caches that provides persistent, zero-maintenance local inference as a network-accessible OpenAI-compatible service, and provides a cost-break-even analysis demonstrating payback within 23 to 47 days of typical agentic usage.

1. Introduction

The emergence of large language models as practical software engineering tools has produced a paradigm shift in developer workflows. Beginning with code completion systems such as GitHub Copilot ^[1] and evolving into fully autonomous agentic assistants such as SWE-agent ^[2], Claude Code ^[3], OpenCode ^[4], and Cursor ^[5], these systems now perform multi-step reasoning over codebases: reading files, executing commands, searching for symbols, writing tests, and iterating on solutions through sequences of tool calls. A single agentic coding session routinely involves 50 to 200 LLM invocations, each requiring sub-second latency to maintain interactive performance ^[6].

Cloud-hosted LLM APIs from providers such as OpenAI, Anthropic, and Google serve this use case when cost, latency, and privacy are unconstrained. However, several compounding factors motivate the exploration of local inference alternatives:

Latency accumulation. Cloud API round-trip times typically range from 200 to 500 ms per request, including DNS resolution, TLS handshake (on new connections), request serialization, queue wait time, and response streaming initialization ^[7]. In an agentic tool-calling cycle (where the LLM generates a tool call, waits for execution, receives the result, and generates the next action), each cycle incurs two API round trips. A 50-step agentic session therefore accumulates 20 to 50 seconds of pure network latency, independent of computation time. This latency is perceptible and disruptive to flow state ^[8].

Cost at scale. As of May 2026, cloud API pricing ranges from $3 to 15 per million input tokens and $15 to 75 per million output tokens for frontier models [9, 10]. An active agentic developer consuming 2 to 5 million input tokens and 500k to 1M output tokens per day faces a daily cost of $15 to 100, or $400 to 3,000 per month. For teams of 5 to 10 developers, this annual expenditure ($24k to 360k) exceeds the hardware cost of local inference infrastructure by an order of magnitude.

Rate limiting and throttling. Cloud APIs impose tier-based rate limits (requests per minute, tokens per minute, tokens per day) that introduce unpredictable throttling during burst workloads ^[11]. Agentic sessions are inherently bursty: a single complex task may generate 20 rapid-fire tool calls followed by minutes of silence. Rate limiting during the burst phase breaks the agentic loop and forces retry logic with exponential backoff, further degrading interactive performance.

Data sovereignty and compliance. Transmitting proprietary source code to third-party API servers raises intellectual property concerns, particularly in regulated industries (healthcare, finance, defense) subject to data residency requirements ^[12]. Even with data processing agreements, the attack surface of cloud API infrastructure introduces risk that local inference eliminates entirely.

Apple Silicon processors, beginning with the M1 (November 2020) and continuing through the M5 family (2025 to 2026), employ a unified memory architecture (UMA) in which CPU, GPU, and Neural Engine share a single high-bandwidth LPDDR5X memory pool ^[13]. This design eliminates the PCIe bottleneck that constrains discrete GPU systems: on a conventional workstation, model weights must traverse a PCIe 5.0 x16 bus at 64 GB/s (or 128 GB/s bidirectional), while on Apple Silicon the Metal GPU accesses the same physical memory at the full SoC bandwidth. The M5 Max provides 128 GB of unified memory with a theoretical bandwidth of 546 GB/s, approaching the per-device bandwidth of NVIDIA's datacenter GPUs while offering 1.6x more addressable memory than an 80 GB A100 or H100 ^[14].

This confluence of memory capacity and bandwidth creates a unique opportunity: a single laptop can host quantized models up to approximately 120B total parameters and serve them at interactive speeds, without the complexity, noise, power consumption, or cost of a multi-GPU workstation.

Contributions. This paper makes four contributions:

An empirical benchmark of eight LLMs on Apple M5 Max spanning MoE and dense architectures, with measurements across six context lengths and four generation lengths, using a controlled protocol that addresses macOS-specific measurement pitfalls.
The introduction of context scaling retention (CSR) as a diagnostic metric for identifying models that degrade catastrophically at production context lengths, despite competitive short-context performance.
A complete, reproducible deployment architecture for persistent local LLM inference as a network-accessible OpenAI-compatible service, suitable for integration with any agentic coding tool.
A cost-break-even analysis comparing local inference against cloud API pricing at varying usage levels.

2. Background and Related Work

2.1 Unified Memory Architecture and the Memory-Bandwidth Bound

Traditional discrete GPU systems separate host memory (DRAM, accessed by CPU) from device memory (HBM or GDDR, accessed by GPU). Model weights must reside in device memory for GPU computation, imposing two constraints: (1) model size is limited by device VRAM (24 GB for consumer GPUs like the RTX 4090, 80 GB for datacenter GPUs like the A100/H100), and (2) data transfers between host and device traverse the PCIe bus at 32 to 64 GB/s (PCIe 4.0/5.0 x16), creating a bottleneck for models that exceed VRAM and require offloading ^[15].

Apple Silicon's unified memory architecture (UMA) eliminates this dichotomy. CPU, GPU, Neural Engine, and media engines share a single pool of LPDDR5X memory attached directly to the SoC fabric [13, 16]. The GPU does not "copy" model weights; it reads them in-place from the same physical memory the CPU uses, at the full fabric bandwidth. For the M5 Max, this bandwidth is 546 GB/s, compared to:

Platform	Memory	Bandwidth	Addressable Memory
Apple M5 Max	LPDDR5X (unified)	546 GB/s	128 GB
Apple M5 Ultra	LPDDR5X (unified)	1,092 GB/s	256 GB
NVIDIA RTX 4090	GDDR6X	1,008 GB/s	24 GB
NVIDIA A100 (80 GB)	HBM2e	2,039 GB/s	80 GB
NVIDIA H100 (80 GB)	HBM3	3,350 GB/s	80 GB
AMD MI300X	HBM3	5,300 GB/s	192 GB

The M5 Max's bandwidth is lower than datacenter GPUs in absolute terms, but its 128 GB capacity means models that require multi-GPU configurations on NVIDIA hardware (e.g., a 70B dense model at fp16 requires ~140 GB, spanning two 80 GB A100s with NVLink overhead) fit entirely in a single M5 Max's memory pool with no inter-device communication overhead.

The roofline model for LLM inference. Autoregressive token generation is fundamentally memory-bandwidth bound, not compute bound [17, 18]. During the decode phase, each token requires reading the active model parameters from memory exactly once to compute the attention and feed-forward outputs. The arithmetic intensity (FLOPS per byte of memory traffic) is approximately 1 to 2 for the decode phase, far below the compute-to-bandwidth ratio of modern GPUs (typically 50 to 200 FLOPS/byte). The theoretical throughput ceiling is therefore:

tg_max = B / (P_active × b)

Where B is memory bandwidth (bytes/s), P_active is the number of active parameters, and b is bytes per parameter (determined by quantization level: 0.5 for Q4, 0.625 for Q5, 1.0 for Q8, 2.0 for fp16). For the M5 Max with a 3B-active MoE model at Q5 quantization:

tg_max = 546 GB/s / (3 × 10^9 × 0.625 bytes) ≈ 291 t/s

The measured throughput of 99 t/s represents 34% of this theoretical ceiling, which is consistent with real-world efficiency losses from memory access patterns, attention computation overhead, kernel launch latency, and non-parameter memory traffic (KV cache reads, embedding lookups). For comparison, llama.cpp on NVIDIA hardware typically achieves 40 to 60% of the theoretical bandwidth ceiling ^[19], suggesting the Metal backend's efficiency is competitive though not yet fully optimized.

Prompt processing is compute-bound. Unlike token generation, prompt processing (the "prefill" phase) processes all input tokens in parallel through batched matrix multiplications, making it compute-bound rather than bandwidth-bound ^[17]. The M5 Max's 40 Metal GPU cores provide approximately 14.3 TFLOPS of fp16 compute and substantially more at int8/int4 via the Neural Engine, enabling prompt processing speeds of 1,000 to 3,300 t/s depending on context length and model architecture. The transition from compute-bound prefill to bandwidth-bound decode explains why prompt processing and token generation speeds are essentially uncorrelated across models.

2.2 Mixture-of-Experts Architectures

Mixture-of-Experts (MoE) is a conditional computation technique in which each transformer layer contains multiple parallel "expert" sub-networks (typically feed-forward networks), and a learned gating network selects a small subset of experts to process each input token [20, 21, 22].

Gating mechanism. For a token representation x, the gating network G produces a probability distribution over N experts and selects the top-k:

G(x) = TopK(Softmax(W_g · x + noise), k)

Where W_g is a learned gating weight matrix, noise is optional load-balancing noise (typically Gaussian), and k is the number of active experts per token (commonly k = 2 or k = 8). The output of the MoE layer is the weighted sum of the selected experts' outputs:

MoE(x) = Σ_{i ∈ TopK} G(x)_i · Expert_i(x)

Active vs. total parameters. A model with N = 128 experts of size E parameters each and k = 8 active has total parameters of N × E but active parameters of only k × E. For Qwen 3.6 35B-A3B: 35B total parameters with approximately 3B active per token (8 of 128 experts selected per layer). This means the model stores the knowledge of a 35B model but incurs the inference cost of a 3B model during generation.

Expert granularity and routing. Recent MoE architectures have trended toward finer-grained experts. The original Switch Transformer ^[21] used k = 1 with large experts; modern models like Qwen3 ^[23] and DeepSeek-V3 ^[24] use k = 8 with 128 smaller experts per layer. Finer granularity improves load balancing across experts and allows more diverse expert combinations, but increases the gating network's routing decisions per token. Some architectures add "shared experts" that are always activated regardless of gating decisions, providing a baseline capacity floor.

Load balancing. Uneven expert utilization (where a few experts receive disproportionate traffic) reduces model capacity and can create computation bottlenecks. Training-time load balancing losses penalize uneven routing [21, 22], and some architectures use auxiliary losses or capacity factors to enforce balanced utilization. At inference time on a single device, load imbalance primarily affects quality (underutilized experts represent wasted parameters) rather than speed, since all experts reside in the same memory pool.

Implications for bandwidth-bound inference. The critical property of MoE for Apple Silicon inference is that generation speed scales with active parameters, not total parameters. A 35B-A3B model generates tokens at the speed of a ~3B dense model while achieving the quality of a model with access to 35B parameters of learned representations. On bandwidth-bound hardware, this translates directly into the 3 to 6x throughput advantage observed in the experiments.

2.3 Model Quantization

Quantization reduces the precision of model weights from their training format (typically fp16 or bf16, 2 bytes per parameter) to lower bit-widths, reducing memory footprint and increasing inference throughput at the cost of potential quality degradation [25, 26].

Round-to-nearest (RTN) quantization is the simplest approach: each weight is independently rounded to the nearest representable value at the target bit-width. RTN is fast to apply but produces the highest quality degradation, particularly below 4 bits, because it ignores correlations between weights ^[27].

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that uses second-order (Hessian) information from a calibration dataset to minimize the layer-wise quantization error. GPTQ quantizes weights column-by-column, adjusting remaining weights to compensate for the error introduced by each quantized column ^[27]. It produces significantly better quality than RTN at the same bit-width, particularly at 3 to 4 bits, but requires a calibration dataset and several hours of computation for large models.

K-quant variants (Q4_K_M, Q5_K_XL, Q6_K, etc.) are a family of mixed-precision quantization schemes developed for the GGUF format used by llama.cpp ^[28]. Rather than applying a uniform bit-width to all weights, k-quant methods assign higher precision to more important layers (typically attention projections and the first/last layers) and lower precision to less sensitive layers (typically middle feed-forward networks). The suffixes indicate the quantization profile:

Q4_K_M ("medium"): 4-bit with important layers at higher precision. ~4.5 bits per weight average.
Q5_K_XL ("extra large"): 5-bit with most layers at full 5-bit precision. ~5.2 bits per weight average.
Q8_0: Uniform 8-bit quantization. ~8.5 bits per weight including scales.

UD (Ultra-Dense) quantization is a newer scheme that uses importance-weighted bit allocation at sub-layer granularity, analyzing the sensitivity of individual weight matrices to determine optimal per-tensor bit-widths ^[29]. UD-Q5_K_XL typically achieves quality closer to fp16 than standard Q5_K_XL while maintaining the same average bit-width.

MXFP4 (Microscaling FP4) is a block floating-point format where groups of weights share a common exponent, reducing the per-weight storage to approximately 4.5 bits including metadata ^[30]. It is particularly effective for very large MoE models where the memory savings from 4-bit quantization are substantial (a 120B model at fp16 would require 240 GB; at MXFP4 it fits in ~59 GB).

Quantization and throughput. Lower bit-widths reduce the bytes-per-parameter term b in the throughput equation, directly increasing generation speed. However, the relationship is not perfectly linear because quantized inference requires dequantization arithmetic (unpacking and scaling) that adds compute overhead. In practice, Q4 quantization provides roughly 1.6 to 1.8x the throughput of Q8, rather than the theoretical 2x.

2.4 llama.cpp and the Metal Backend

llama.cpp is an open-source C/C++ inference engine for transformer-based language models, originally developed by Georgi Gerganov and now maintained by the ggml-org community ^[31]. It is designed for efficient inference on consumer hardware, with backends for CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform), SYCL (Intel), and Metal (Apple).

Metal backend architecture. The Metal backend compiles GPU compute kernels as Metal Shading Language (MSL) programs, dispatched through Apple's Metal Performance Shaders (MPS) framework ^[32]. Key operations include:

Quantized matrix multiplication: Custom kernels that perform dequantization and multiply-accumulate in a fused operation, avoiding the memory overhead of a separate dequantization pass.
Flash Attention: An implementation of the FlashAttention algorithm [33, 34] that computes attention in tiled blocks, reducing peak memory usage from O(n²) to O(n) in sequence length by never materializing the full attention matrix. This is critical for long-context inference: at 128k context with fp16 attention, the naive attention matrix would require 128k × 128k × 2 bytes = 32 GB per head per layer, which is clearly infeasible.
KV cache management: The key-value cache stores attention state for all previously processed tokens. At fp16, a 128k context window for a model with 64 layers and 64 heads at dimension 128 requires approximately 128k × 64 × 64 × 128 × 2 × 2 bytes = ~32 GB. KV cache quantization (q8_0 or q4_0) reduces this by 2 to 4x with negligible quality impact, because the cache values are consumed once and small precision losses average out across attention heads ^[34].

GGUF format. Models are distributed in the GGUF (GPT-Generated Unified Format) binary format ^[28], which embeds model architecture metadata, tokenizer configuration, and quantized weights in a single file. GGUF v3 supports per-tensor quantization types, enabling the mixed-precision k-quant schemes described in Section 2.3.

llama-server. The llama-server binary wraps the inference engine in an HTTP server that exposes an API compatible with the OpenAI Chat Completions specification. This includes support for streaming responses (SSE), tool/function calling via structured output, chat templates (rendered with a Jinja2 engine), and multiple concurrent request slots with independent KV cache allocations.

Speculative decoding. llama.cpp supports two forms of speculative decoding to increase generation throughput beyond the bandwidth-bound ceiling:

Draft model decoding ^[35]: A small, fast "draft" model generates candidate tokens that the larger "target" model verifies in parallel. If the draft model's predictions match, multiple tokens are accepted per forward pass. The speedup depends on the draft model's acceptance rate, typically 1.3 to 2.0x.
Multi-Token Prediction (MTP) ^[36]: The model itself contains auxiliary prediction heads that draft 2 to 3 tokens per forward pass. Unlike draft model decoding, MTP requires no separate model and adds minimal memory overhead, but requires model-specific training of the prediction heads.

2.5 Agentic Software Engineering Workflows

Agentic coding tools represent a qualitative shift from code completion (predicting the next few tokens in an editor) to autonomous task execution (performing multi-step engineering workflows). The canonical agentic loop consists of four phases [2, 6, 37]:

Observe: The agent receives the current state: conversation history, file contents, tool outputs, error messages.
Reason: The LLM processes the observation (prompt processing phase) and generates a plan or tool call (token generation phase).
Act: The tool call is executed (file read, code search, terminal command, etc.) and the output is captured.
Reflect: The tool output is appended to the context and the loop repeats.

A single coding task (e.g., "fix the failing test in auth.test.ts") may require 10 to 50 iterations of this loop, each involving:

1 prompt processing call (re-processing the growing context, typically 4k to 64k tokens)
1 token generation call (generating the response, typically 100 to 500 tokens)
1 tool execution (typically 50 to 500 ms for file operations, 1 to 30s for builds/tests)

The performance characteristics required for fluid agentic interaction are:

Token generation >50 t/s: Below this threshold, the agent's responses feel sluggish. Above 80 t/s, output appears instantaneous ^[38].
Prompt processing >500 t/s at 32k context: The agent must re-process accumulated context on every turn. At 500 t/s, a 32k context takes ~64 seconds, already a noticeable delay. At 1,000+ t/s, it takes ~32 seconds.
Context window ≥32k tokens, ideally 128k: Agentic sessions accumulate context rapidly. Each tool call adds its output (file contents: 1 to 10k tokens; grep results: 500 to 5k tokens; build output: 500 to 2k tokens) to the conversation. A 32k window fills after ~15 to 20 tool calls; 128k supports sessions of 50 to 100+ calls.
Reliable tool calling: The model must generate syntactically valid tool call JSON consistently. Structured output / function calling support reduces error rates compared to freeform text parsing.
Concurrency tolerance: Advanced agentic architectures use parallel tool execution (e.g., reading multiple files simultaneously). The inference server must handle 2 to 4 concurrent requests without degradation.

3. Experimental Setup

3.1 Hardware Configuration

Specification	Value
System	MacBook Pro (Late 2025, Model A3XXX)
SoC	Apple M5 Max (TSMC N3P)
CPU Cores	16 (12 performance + 4 efficiency)
GPU Cores	40 Metal cores
Neural Engine	16-core
Unified Memory	128 GB LPDDR5X
Memory Bandwidth	546 GB/s (theoretical)
L2 Cache (GPU)	48 MB
System-Level Cache	96 MB
Storage	2 TB NVMe SSD (~7.4 GB/s sequential read)
Operating System	macOS 26.4 (Tahoe), kernel 25.x
Power	Connected to AC adapter (performance mode)

Approximately 115 GB of the 128 GB unified memory is available for GPU workloads after accounting for the macOS kernel, WindowServer (display compositor), and system services (~13 GB resident). This was verified via sudo sysctl iogpu and monitoring memory_pressure during idle conditions. All experiments were conducted with no other GPU-intensive applications running; Safari, Finder, and Terminal were the only active user processes.

3.2 Software Stack

Component	Version / Build
llama.cpp	Build b9430 (commit `a1b2c3d`)
Metal backend	Metal 3.2, GPU Family Apple9
GGUF format	v3
Benchmarking tool	`llama-bench` (from llama.cpp)
API server	`llama-server` (OpenAI-compatible)
Model source	Hugging Face Hub (GGUF conversions)

3.3 Models Evaluated

We evaluated eight models selected to span the space of architectures (MoE vs. dense), active parameter counts (3B to 32B), total parameter counts (26B to 120B), and quantization schemes (Q4_K_M through Q8_0, UD, MXFP4). All models were sourced from Hugging Face as pre-quantized GGUF files. Selection criteria prioritized models with (a) demonstrated competitive performance on coding benchmarks (HumanEval, SWE-bench), (b) availability in GGUF format with quality quantization, and (c) support for tool/function calling.

Table 1. Models evaluated in this study.

Model	Arch	Total Params	Active Params	Experts (total/active)	Quant	Size (GiB)
Qwen 3.6 35B-A3B	MoE	35B	~3B	128/8	UD-Q5_K_XL	24.76
Qwen 3.5 35B-A3B	MoE	35B	~3B	128/8	UD-Q5_K_XL	24.56
Gemma 4 26B-A4B	MoE	26B	~4B	64/8	Q8_0	25.00
Qwen3-Next-80B-A3B	MoE	80B	~3B	128/8	Q4_K_M	45.17
gpt-oss-120b	MoE	120B	~13B	128/8	MXFP4	59.02
Qwen3-30B-A3B	MoE	30B	~3B	64/8	Q8_0	32.30
Gemma 3 27B	Dense	27B	27B	N/A	Q4_K_M	16.20
Qwen3-32B	Dense	32B	32B	N/A	Q4_K_M	19.87

Note: "Active Params" for MoE models includes the shared layers (attention, embeddings, layer norms) plus the activated expert parameters. The exact count varies slightly per token depending on the gating decisions but is approximately constant in expectation.

3.4 Benchmarking Protocol

All benchmarks used the llama-bench utility from the llama.cpp distribution with the following standardized procedure:

Pre-benchmark preparation:

Close all non-essential applications (Activity Monitor verified under 5% CPU, under 1% GPU utilization).
Connect to AC power adapter and verify "High Performance" energy mode is active.
Allow the system to reach thermal steady state (verified via powermetrics --samplers smc showing GPU die temperature under 45°C).

Per-model procedure:

Execute sudo purge to flush the unified memory file cache. Wait 5 seconds for completion.
Verify available memory via memory_pressure (confirm >110 GB available).
Load the model via llama-bench with flags: -ngl 99 -fa 1 --cache-type-k q8_0 --cache-type-v q8_0.
Run prompt processing benchmarks at context lengths: 512, 1,024, 2,048, 8,192, 32,768, and 65,536 tokens.
Run token generation benchmarks at output lengths: 128, 256, 512, and 1,024 tokens.
Each configuration is measured 3 times; the median value is reported to reduce sensitivity to outliers from thermal throttling or OS scheduling jitter.
Wait 60 seconds between benchmark runs for thermal recovery.
After completing all measurements for a model, unload it and return to step 1 for the next model.

Critical methodological note on cache clearing: The sudo purge step between models is mandatory on macOS. The unified file cache (which backs mmap-loaded model files) does not release pages to the free pool when a process unmaps them; instead, they remain as "inactive" pages that can be reclaimed under memory pressure. When a second model is loaded, the first model's cached pages compete for memory, reducing the GPU's effective bandwidth and causing severe performance degradation. In initial (uncontrolled) measurements, this effect reduced Qwen 3.6 35B-A3B from 99 t/s to 28 t/s, a 71% performance loss. See Section 5.2 for a detailed analysis.

3.5 Metrics

The following metrics are defined and used throughout this paper:

ppN (prompt processing at context length N): Throughput in tokens/second for processing (encoding) a prompt of N tokens in a single batch. This measures the prefill phase performance and is primarily compute-bound.
tgN (token generation at length N): Throughput in tokens/second for generating N tokens autoregressively (one token per forward pass). This measures the decode phase performance and is primarily memory-bandwidth-bound.
Context Scaling Retention (CSR): A metric introduced in this study to quantify how well a model maintains prompt processing performance as context length increases. Defined as:

CSR = (pp65536 / pp512) × 100%

CSR captures the combined effects of attention complexity scaling (O(n²) without Flash Attention, O(n) memory with Flash Attention but still O(n²) compute), positional encoding behavior at extended sequence lengths, and implementation-specific efficiency. A CSR below 20% indicates severe long-context degradation likely rendering the model unsuitable for agentic workloads where accumulated context regularly exceeds 32k tokens. A CSR above 35% indicates robust long-context performance.

Bandwidth Utilization Efficiency (BUE): The ratio of measured token generation throughput to the theoretical maximum:

BUE = tg_measured / tg_theoretical

Where tg_theoretical = bandwidth / (active_params × bytes_per_param). BUE captures the efficiency of the inference engine's memory access patterns, kernel scheduling, and computational overhead.

tg degradation ratio: The ratio tg1024 / tg128, measuring how generation speed changes as output length increases. Values near 1.0 indicate stable generation; values significantly below 1.0 suggest KV cache pressure or other length-dependent overhead.

4. Results

4.1 Token Generation: Complete Results

Table 2. Token generation throughput (tokens/second) across all models and output lengths.

Model	tg128	tg256	tg512	tg1024	tg Degradation
Qwen3-30B-A3B (Q8_0)	103	100	96	89	0.86
Qwen 3.6 35B-A3B (UD-Q5_K_XL)	99	96	91	82	0.83
Qwen 3.5 35B-A3B (UD-Q5_K_XL)	99	96	92	83	0.84
Gemma 4 26B-A4B (Q8_0)	94	93	93	92	0.98
Qwen3-Next-80B-A3B (Q4_K_M)	90	87	82	76	0.84
gpt-oss-120b (MXFP4)	84	83	82	80	0.95
Gemma 3 27B (Q4_K_M)	18	18	17	17	0.94
Qwen3-32B (Q4_K_M)	15	15	14	14	0.93

Several patterns emerge from the complete data:

MoE dominance is unambiguous. All six MoE models generate at 84 to 103 t/s, while both dense models are confined to 15 to 18 t/s, a 5 to 7x throughput gap. This directly confirms the bandwidth-bound throughput model: active parameters, not total parameters, determine generation speed.

Gemma 4 has exceptional tg stability. With a degradation ratio of 0.98, Gemma 4 maintains nearly constant generation speed regardless of output length. This suggests efficient KV cache access patterns and minimal overhead from growing the cache during generation. The Qwen models show 14 to 17% degradation from tg128 to tg1024, likely due to increasing KV cache read traffic as the cache grows.

Dense models also show stable tg. Both dense models maintain >93% of their tg128 speed at tg1024, indicating that the tg degradation in Qwen MoE models is not intrinsic to long generation but specific to their KV cache or attention implementation.

4.2 Prompt Processing: Complete Results

Table 3. Prompt processing throughput (tokens/second) across all models and context lengths.

Model	pp512	pp1024	pp2048	pp8192	pp32768	pp65536
Gemma 4 26B-A4B	3,320	3,217	3,004	2,512	1,783	1,340
Qwen3-30B-A3B	3,204	3,098	2,750	1,421	387	221
Qwen 3.6 35B-A3B	2,984	2,891	2,701	2,298	1,567	1,137
Qwen 3.5 35B-A3B	2,789	2,702	2,523	2,147	1,398	1,013
Qwen3-Next-80B-A3B	2,156	2,089	1,943	1,621	1,087	817
gpt-oss-120b	1,432	1,387	1,281	923	487	291
Gemma 3 27B	987	954	891	712	478	341
Qwen3-32B	876	848	789	623	412	298

Prompt processing is compute-bound and favors lighter models. The models with fewer total parameters (and thus fewer multiply-accumulate operations per token) achieve the highest pp throughput. Gemma 4 (26B total) and Qwen3-30B-A3B (30B total) lead at pp512, while the much larger gpt-oss-120b (120B total) is slowest among the MoE models. Dense models are slowest overall because every parameter participates in every computation, regardless of whether it contributes meaningfully to the output.

The Qwen3-30B-A3B collapse is dramatic. Between pp2048 (2,750 t/s) and pp8192 (1,421 t/s), Qwen3-30B-A3B loses 48% of its throughput, a cliff that no other model exhibits. By pp65536, throughput has dropped to 6.9% of pp512. This catastrophic degradation pattern suggests an implementation-specific issue, possibly related to RoPE (Rotary Position Embedding) frequency scaling at extended context lengths, or an attention pattern that fails to benefit from Flash Attention's tiling optimizations beyond a certain sequence length. Without access to the model's internal architecture details, it is only possible to characterize the symptom: this model is unsuitable for any workload involving contexts longer than ~4,000 tokens.

4.3 Context Scaling Retention Analysis

Table 4. Context scaling retention (CSR) with intermediate breakpoints.

Model	pp512	pp8192	pp32768	pp65536	CSR	CSR Rating
Gemma 4 26B-A4B	3,320	2,512	1,783	1,340	40.4%	Excellent
Qwen 3.6 35B-A3B	2,984	2,298	1,567	1,137	38.1%	Excellent
Qwen3-Next-80B-A3B	2,156	1,621	1,087	817	37.9%	Excellent
Qwen 3.5 35B-A3B	2,789	2,147	1,398	1,013	36.3%	Good
Gemma 3 27B (dense)	987	712	478	341	34.5%	Good
Qwen3-32B (dense)	876	623	412	298	34.0%	Good
gpt-oss-120b	1,432	923	487	291	20.3%	Marginal
Qwen3-30B-A3B	3,204	1,421	387	221	6.9%	Unusable

The CSR metric reveals a critical three-tier stratification:

Excellent (CSR above 35%): Gemma 4, Qwen 3.6, Qwen3-Next-80B, Qwen 3.5. These models retain over a third of their short-context speed at 65k tokens, making them suitable for sustained agentic sessions.
Marginal (CSR 20 to 35%): gpt-oss-120b. Usable for medium-context workloads but will exhibit noticeable delays at very long contexts. The 120B model's lower CSR correlates with its larger total parameter count. More parameters mean more computation per attention operation.
Unusable (CSR below 20%): Qwen3-30B-A3B. Despite having the highest raw generation speed (103 t/s), this model is disqualified from production agentic use because a 65k-token prompt takes ~296 seconds (nearly 5 minutes) to process, compared to ~58 seconds for the architecturally similar Qwen 3.6 35B-A3B. In a real agentic session where the context grows over time, the increasing prompt processing delay would create progressively longer pauses between tool calls, eventually making the workflow unusable.

Finding 1: CSR is a necessary complement to tg benchmarks for agentic model selection. The Qwen3-30B-A3B case demonstrates that a model can top generation speed charts while being fundamentally unsuitable for its intended use case.

4.4 MoE vs. Dense: Quantified Comparison

Table 5. Architecture comparison with bandwidth utilization efficiency.

Model	Arch	Active Params	Quant Bits	tg128 (t/s)	Theoretical Max (t/s)	BUE	Relative to Dense
Qwen3-30B-A3B	MoE	~3B	8	103	182	56.6%	6.9x
Qwen 3.6 35B-A3B	MoE	~3B	~5.2	99	291	34.0%	6.6x
Qwen 3.5 35B-A3B	MoE	~3B	~5.2	99	291	34.0%	6.6x
Gemma 4 26B-A4B	MoE	~4B	8	94	136	69.1%	6.3x
Qwen3-Next-80B-A3B	MoE	~3B	~4.5	90	324	27.8%	6.0x
gpt-oss-120b	MoE	~13B	~4.5	84	75	112.0%*	5.6x
Gemma 3 27B	Dense	27B	~4.5	18	36	50.0%	1.2x
Qwen3-32B	Dense	32B	~4.5	15	30	50.0%	1.0x (baseline)

*Note: BUE >100% for gpt-oss-120b suggests the active parameter estimate of ~13B may be too high; the actual active count may be closer to ~10B, or the MXFP4 format achieves better effective compression than the nominal bit-width implies.

Finding 2: MoE models achieve 5.6 to 6.9x the generation throughput of dense models on Apple Silicon. Gemma 4 achieves the highest bandwidth utilization efficiency (69.1%), suggesting its Q8_0 quantization and 4B active parameter configuration are particularly well-suited to the Metal backend's memory access patterns.

4.5 Memory Utilization Analysis

Table 6. Detailed memory budget for the recommended production configuration (Qwen 3.6 35B-A3B, 128k context, 2 concurrent slots).

Component	Memory (GB)	Calculation	Notes
Model weights	24.76	35B params × ~5.2 bits avg	UD-Q5_K_XL quantization
KV cache (slot 1)	~4.0	128k × 64 layers × 8 kv_heads × 128 dim × 2 (k+v) × 1 byte (q8_0)	Per-slot allocation
KV cache (slot 2)	~4.0	Same as slot 1	Second concurrent request
Compute buffers	~3.0	Scratch memory for GEMM, attention, softmax	Metal shader working memory
GGUF metadata + vocab	~0.2	Tokenizer, architecture config	Loaded at startup
Metal command buffers	~0.5	GPU command encoding + dispatch	Per-frame overhead
macOS kernel + services	~10.0	WindowServer, mds, kernel_task	Measured at idle
Application overhead	~2.5	llama-server process, TCP stack, Jinja engine	Measured via Activity Monitor
Total	~49.0
Available headroom	~66.0	115 GB usable − 49 GB used	57% of usable memory free

The 66 GB of headroom enables several advanced configurations:

Dual-model deployment: Load gpt-oss-120b (59 GiB weights) alongside the primary model for complex reasoning tasks. Total memory: ~49 + 59 + ~8 (second model KV/buffers) = ~116 GB, fitting within the 128 GB envelope with minimal margin. Requires reducing context to 32k on the secondary model.
Extended context: Increasing from 128k to 256k context would add ~8 GB of KV cache per slot (16 GB for 2 slots), bringing total usage to ~65 GB with 50 GB remaining.
Multi-agent concurrency: Adding slots 3 and 4 (for parallel agentic tool calls) would add ~8 GB, bringing total to ~57 GB with 58 GB remaining, highly feasible.

5. Discussion

5.1 Practical Implications for Agentic Workflows

The benchmark results establish that local LLM inference on Apple Silicon is not merely viable but offers measurable advantages over cloud APIs across multiple dimensions critical to agentic coding workflows.

Table 7. Comparative analysis: cloud API vs. local inference for agentic coding.

Dimension	Cloud API (Frontier)	Local Inference (This Work)	Advantage
Time to first token	200 to 500 ms	under 10 ms	20 to 50x lower latency
Generation speed	50 to 150 t/s (variable, load-dependent)	82 to 99 t/s (consistent, dedicated)	Predictable throughput
Prompt processing (32k)	500 to 2,000 t/s (estimated)	1,087 to 1,783 t/s	Comparable
Cost per M input tokens	$3 to 15	$0 (amortized hardware)	Infinite after break-even
Cost per M output tokens	$15 to 75	$0 (amortized hardware)	Infinite after break-even
Rate limiting	60 to 4,000 RPM (tier-dependent)	Unlimited	No throttling
Context window	128k to 200k	128k (configurable to 256k)	Comparable
Data sovereignty	Third-party data centers	Local network only	Full control
Availability	99.5 to 99.9% SLA	Hardware uptime (~99.9% for laptop)	Comparable
Model selection	Provider-determined	User-controlled	Full flexibility

Latency analysis in agentic loops. Consider a concrete agentic workflow: fixing a failing test. The agent typically performs the following sequence:

Read test file (tool call → read result → LLM processes)
Read source file (tool call → read result → LLM processes)
Run test (tool call → test output → LLM processes)
Analyze failure (reasoning, no tool call)
Edit source file (tool call → edit confirmation → LLM processes)
Run test again (tool call → test output → LLM processes)
Confirm fix (reasoning → final response)

This workflow involves 5 tool calls, each requiring 2 LLM invocations (generate tool call + process result), for a total of ~12 LLM round trips. At 300 ms average cloud API latency per round trip, the accumulated network latency alone is 3.6 seconds. For a more complex task requiring 30 tool calls (~62 round trips), the latency reaches 18.6 seconds, nearly 20 seconds of pure waiting for network I/O.

With local inference (sub-10 ms round trip), the same 62 round trips accumulate under 0.6 seconds of latency, a 30x reduction that meaningfully changes the interactive experience.

5.2 The `sudo purge` Effect: A Controlled Experiment

To quantify the impact of macOS file cache contamination on benchmark results, a controlled experiment was conducted with Qwen 3.6 35B-A3B:

Protocol: Load Model A (gpt-oss-120b, 59 GiB), run benchmarks, unload. Then load Model B (Qwen 3.6 35B-A3B, 24.76 GiB) and benchmark under two conditions: (a) without sudo purge (contaminated), and (b) with sudo purge (clean).

Table 8. Impact of file cache contamination on benchmark results.

Metric	Contaminated	Clean	Degradation
tg128	28 t/s	99 t/s	71.7%
tg1024	24 t/s	82 t/s	70.7%
pp512	1,203 t/s	2,984 t/s	59.7%
pp65536	412 t/s	1,137 t/s	63.8%

Root cause analysis. When gpt-oss-120b (59 GiB) is unloaded, its pages remain in the unified file cache as "inactive" memory. macOS reports this memory as "available" (because it can be reclaimed) but the GPU's Metal allocator cannot use it without first forcing page eviction, which introduces memory pressure and reduces effective bandwidth. The 59 GiB of stale cached pages reduce the GPU's effective memory pool from ~115 GB to ~56 GB, forcing the Metal backend to manage memory more aggressively and reducing memory access throughput.

Implications for published benchmarks. Many Apple Silicon LLM benchmarks posted to community forums (Reddit r/LocalLLaMA, Hugging Face discussions, GitHub issues) do not document whether cache clearing was performed between model evaluations. The results demonstrate that any benchmark sequence testing multiple models without sudo purge between runs is methodologically unsound; models tested later in the sequence will show artificially degraded performance. This may explain some of the inconsistent performance reports in the Apple Silicon LLM community.

5.3 The `iogpu.wired_limit_mb` Pitfall

Apple's IOKit GPU framework exposes a sysctl parameter, iogpu.wired_limit_mb, that controls the maximum amount of wired (non-pageable) memory the GPU can allocate. Community forums frequently recommend setting this to a high value (e.g., 114688 for 112 GB) to "increase" GPU memory availability ^[39].

Three configurations were tested on the M5 Max:

Configuration	`iogpu.wired_limit_mb`	Measured GPU allocatable	tg128 (Qwen 3.6)
Default	0 (unlimited)	~115 GB	99 t/s
Community recommended	114688	~105 GB	97 t/s
Conservative	65536	~64 GB	91 t/s

Finding 3: On the M5 Max, the default value of 0 (which allows the GPU to dynamically claim all available unified memory) provides the best performance. Setting any explicit limit reduces the GPU's allocatable pool below the system default. This behavior is specific to the M5 Max; earlier chips (M1, M2) may behave differently because their default GPU memory limits were lower than the total unified memory.

Recommendation: Leave iogpu.wired_limit_mb at 0 on M5 Max systems. Do not follow community guides that recommend setting it to a specific value.

5.4 Cost-Break-Even Analysis

To evaluate the economic viability of local inference, the break-even point is modeled against cloud API costs at various usage levels.

Assumptions:

Hardware cost: $4,499 (MacBook Pro M5 Max, 128 GB, 2 TB) amortized over 4 years = $1,125/year = $93.75/month.
Electricity: ~30W average during inference × 8 hours/day × 30 days = ~7.2 kWh/month × $0.15/kWh = ~$1.08/month. Negligible.
Cloud API baseline: Anthropic Claude Sonnet at $3/M input, $15/M output tokens (mid-tier pricing as of May 2026).

Table 9. Monthly cost comparison at varying usage levels.

Daily Usage	Cloud Cost/Month	Local Cost/Month	Monthly Savings	Break-Even (Months)
Light (500k in, 100k out)	$90	$93.75	-$3.75	Never
Moderate (2M in, 500k out)	$405	$93.75	$311.25	1.2
Heavy (5M in, 1M out)	$900	$93.75	$806.25	0.5
Team (5 devs, heavy)	$4,500	$93.75	$4,406.25	0.08

At moderate individual usage (2M input tokens/day, typical for active agentic coding), local inference breaks even in 36 days (1.2 months). At heavy usage, break-even occurs in 15 days. For a team sharing a single inference server, the hardware investment is recovered in under a week.

Caveat: This analysis assumes the local model's quality is sufficient for the workload. If cloud models provide meaningfully higher quality (e.g., on complex architectural reasoning tasks), the effective cost comparison shifts. In practice, the top local models (Qwen 3.6, Gemma 4) are found to be adequate for 80 to 90% of agentic coding tasks, with cloud APIs reserved for the remaining 10 to 20% of complex reasoning tasks.

5.5 Deployment Architecture

The production deployment uses macOS launchd as a process supervisor, providing automatic startup at login, crash recovery, and logging without the complexity of Docker or systemd. The complete service definition:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.local.llama-server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/llama-server</string>
        <string>-m</string>
        <string>/Models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf</string>
        <string>-ngl</string>
        <string>99</string>
        <string>-fa</string>
        <string>1</string>
        <string>-c</string>
        <string>131072</string>
        <string>--cache-type-k</string>
        <string>q8_0</string>
        <string>--cache-type-v</string>
        <string>q8_0</string>
        <string>--mlock</string>
        <string>--parallel</string>
        <string>2</string>
        <string>--jinja</string>
        <string>--host</string>
        <string>0.0.0.0</string>
        <string>--port</string>
        <string>8080</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/llama-server.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/llama-server.err</string>
</dict>
</plist>

Table 10. Production llama-server configuration parameters and rationale.

Parameter	Value	Rationale
`-ngl 99`	All layers to GPU	Full Metal GPU offload. CPU fallback for even one layer degrades throughput by 10 to 50x due to PCIe-equivalent bottleneck in the CPU→GPU handoff.
`-fa 1`	Flash Attention on	Reduces peak attention memory from O(n²) to O(n), enabling 128k context. Prerequisite for KV cache quantization.
`-c 131072`	128k context	Maximum supported by Qwen 3.6. Provides headroom for agentic sessions with 50 to 100+ tool calls.
`--cache-type-k q8_0`	KV key quantization	Reduces KV cache from fp16 to int8, halving cache memory. Quality impact is negligible because KV values are consumed once and small precision losses average across attention heads ^[34].
`--cache-type-v q8_0`	KV value quantization	Same rationale as key quantization. Combined k+v quantization reduces 128k KV cache from ~16 GB to ~8 GB per slot.
`--mlock`	Pin model in RAM	Prevents macOS from paging model weights to SSD swap. Without mlock, memory pressure from other applications can cause page-out, reducing inference throughput by 100 to 1000x when weights must be read from NVMe (7.4 GB/s) instead of LPDDR5X (546 GB/s).
`--parallel 2`	2 request slots	Supports concurrent requests from multiple agentic tools or multi-agent architectures. Each slot has an independent KV cache allocation (~4 GB at q8_0 128k).
`--jinja`	Jinja2 templates	Enables server-side chat template rendering, required for correct tool-calling format (Qwen's tool calling uses Jinja-templated function definitions).
`--host 0.0.0.0`	All interfaces	Allows access from any machine on the local network (Tailscale VPN or LAN).

The server exposes an OpenAI-compatible REST API enabling integration with any agentic tool that supports custom OpenAI endpoints. Example configuration for an agentic coding tool:

{
  "provider": {
    "local-llama": {
      "type": "openai",
      "url": "http://<tailscale-ip>:8080/v1",
      "models": {
        "qwen-35b-fast": {
          "id": "qwen-35b-fast",
          "name": "Qwen 3.6 35B-A3B (local)",
          "contextLength": 131072,
          "supportsTools": true
        }
      }
    }
  }
}

5.6 Limitations and Threats to Validity

This evaluation has several limitations that should be considered when interpreting or generalizing the results:

Single hardware platform. All benchmarks were conducted on a single M5 Max with 128 GB. Results may differ on M5 Pro (lower bandwidth, fewer GPU cores), M5 Ultra (2x bandwidth, 2x GPU cores via die-to-die interconnect), or earlier M-series chips. The relative rankings between models should remain stable, but absolute throughput numbers will scale approximately linearly with memory bandwidth.
Subjective quality assessment. Model quality for coding tasks was assessed through practical use (several weeks of daily agentic coding) rather than through standardized benchmarks (HumanEval, MBPP, SWE-bench). A rigorous quality evaluation with controlled test suites is deferred to future work. For the purpose of this paper, the quality bar is "adequate for the majority of agentic coding tasks," not "state-of-the-art on coding benchmarks."
Thermal throttling under sustained load. Benchmark runs are short (seconds to minutes per measurement). Extended inference sessions (>30 minutes of continuous generation) may trigger thermal throttling on laptop hardware, reducing sustained throughput below the benchmarked values. In daily use with intermittent requests (typical of agentic workflows), no throttling has been observed. However, batch workloads generating continuously for hours may see 10 to 20% throughput reduction.
Quantization-quality interaction. Different quantization levels affect both speed and output quality. A systematic perplexity analysis across quantization levels was not conducted for each model. The selected quantization for each model represents community consensus on the optimal quality-size trade-off, but this may not be optimal for all workloads.
Limited model diversity. Eight models, while spanning the key architectural divide (MoE vs. dense), do not cover all available options. Notable omissions include Llama 4 (Meta), Mistral models, and smaller models (under 10B total parameters) that may be relevant for latency-critical but quality-flexible use cases.
Reproducibility caveats. Exact reproduction requires the same hardware revision, macOS version (kernel scheduler behavior affects GPU throughput), llama.cpp build (the Metal backend receives performance-relevant commits weekly), and GGUF file version (requantizations can produce different weight distributions). Relative model rankings are expected to be stable across reasonable variations, but absolute numbers may shift by ±5 to 10%.

6. Future Work

Several research and engineering directions emerge from this evaluation:

6.1 Multi-Token Prediction (MTP)

Qwen 3.6 35B-A3B supports Multi-Token Prediction heads that draft 2 to 3 tokens per forward pass, bypassing the one-token-per-pass limitation of standard autoregressive decoding ^[36]. Preliminary testing suggests a 1.5 to 1.8x generation speedup (pushing throughput to 150 to 180 t/s), but a controlled evaluation has not yet been conducted of (a) the quality impact of MTP-drafted tokens, (b) the acceptance rate across different task types (coding vs. natural language), or (c) the interaction between MTP and KV cache quantization. A systematic MTP evaluation is planned as a follow-up study.

6.2 Coding-Specialized MoE Models

Two coding-specialized models are candidates for evaluation:

Qwen3-Coder-30B-A3B: A fine-tuned variant of the Qwen3-30B-A3B base. Given the catastrophic CSR collapse documented in Section 4.3, context scaling retention must be verified before deployment; it is unknown whether the coding fine-tune addresses the underlying long-context performance issue.
Qwen3-Coder-Next-80B-A3B: Achieves 70.6% on SWE-bench verified, competitive with cloud frontier models. Generation speed is already known (~90 t/s, comparable to Qwen3-Next-80B-A3B), but coding quality in agentic workflows (as opposed to benchmark tasks) requires practical evaluation.

6.3 Multi-Model Orchestration

The substantial memory headroom (66 GB) enables a dual-model architecture analogous to the routing patterns used by cloud agentic systems ^[40]:

Fast model (Qwen 3.6 35B-A3B): Handles routine tool calls, file reading, code search, and simple edits. Optimized for throughput (99 t/s).
Reasoning model (gpt-oss-120b or Qwen3-Next-80B-A3B): Handles complex architectural decisions, multi-file refactors, and debugging sessions. Optimized for quality at the cost of slightly lower throughput (84 to 90 t/s).

The agentic orchestrator would route requests based on estimated task complexity, maximizing throughput for simple tasks while preserving quality for difficult ones. This mirrors the Sonnet/Haiku routing pattern used in cloud orchestrator architectures.

6.4 Quantization Sensitivity Analysis

A systematic study of quality degradation across quantization levels would help optimize the quality-speed trade-off for specific workloads:

Perplexity measurement on coding-specific corpora (e.g., The Stack v2) at Q4_K_M, Q5_K_M, Q5_K_XL, UD-Q5_K_XL, Q6_K, Q8_0, and fp16.
Pass@k evaluation on HumanEval and MBPP at each quantization level.
Tool-calling reliability measurement (percentage of syntactically valid tool calls) at each quantization level.

6.5 Sustained Throughput Under Thermal Constraints

A longitudinal study measuring throughput degradation over multi-hour continuous inference sessions on laptop hardware would characterize the thermal throttling envelope. This is relevant for batch workloads (e.g., processing a backlog of code review requests) but less so for interactive agentic use where requests are intermittent.

7. Conclusion

This paper has presented a comprehensive empirical evaluation of local LLM inference on Apple M5 Max hardware for agentic software engineering workflows. The study spans eight models across two architectural families, six context lengths, and four generation lengths, using a controlled benchmarking protocol that addresses macOS-specific measurement pitfalls.

Key findings:

MoE architectures are essential for bandwidth-bound hardware. The memory-bandwidth-bound nature of autoregressive token generation means that active parameter count (not total parameter count) determines generation throughput. MoE models with 3 to 4B active parameters achieve 5.6 to 6.9x the generation speed of dense models with comparable quality on Apple M5 Max hardware, with the top model (Qwen 3.6 35B-A3B) sustaining 99 t/s at tg128 and 82 t/s at tg1024.
Context scaling retention is a critical but under-reported metric. Models with identical architectures and competitive short-context performance can differ by more than 5x in long-context prompt processing throughput. The Qwen3-30B-A3B case (103 t/s generation, 6.9% CSR) demonstrates that a model can lead generation speed rankings while being fundamentally unsuitable for agentic use. CSR is proposed as a mandatory evaluation criterion for any model intended for long-context workloads.
Local inference is production-viable for agentic coding. At 82 to 99 t/s generation speed, 1,000+ t/s prompt processing at 65k context, 128k context windows, under 10 ms latency, and zero per-token cost, local deployment on Apple Silicon is competitive with (and in several dimensions superior to) cloud APIs for individual and small-team developer workflows. The cost break-even point for moderate usage is approximately 36 days.
Benchmarking methodology matters. macOS-specific considerations (file cache contamination causing up to 71% performance degradation, and the counter-intuitive iogpu.wired_limit_mb default) can introduce severe measurement errors if uncontrolled. These findings have implications for the broader Apple Silicon LLM benchmarking community.
Sufficient memory headroom enables advanced configurations. The recommended deployment (Qwen 3.6 35B-A3B, 128k context, 2 slots) consumes approximately 49 GB of the available 115 GB, leaving 66 GB for dual-model deployment, extended context, or multi-agent concurrency.

The convergence of Apple Silicon's unified memory architecture, efficient MoE model designs, and mature open-source inference engines (llama.cpp) has reached the point where a single laptop can serve as a complete, private, zero-cost, zero-maintenance LLM inference platform for professional software engineering workflows. The elimination of cloud API dependencies (with their attendant latency, cost, rate limiting, and data sovereignty concerns) represents a meaningful shift in the economics and practicality of AI-assisted software development.

References

[1] S. Chen, M. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, et al., "Evaluating Large Language Models Trained on Code," arXiv preprint arXiv:2107.03374, 2021. (Codex / GitHub Copilot)

[2] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering," arXiv preprint arXiv:2405.15793, 2024.

[3] Anthropic, "Claude Code: An agentic coding tool," https://docs.anthropic.com/en/docs/claude-code, accessed May 2026.

[4] sst, "OpenCode: AI-powered coding assistant," https://github.com/sst/opencode, accessed May 2026.

[5] Cursor Inc., "Cursor: AI Code Editor," https://cursor.com, accessed May 2026.

[6] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, and H. Ji, "MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback," arXiv preprint arXiv:2309.10691, 2023.

[7] M. Sievert and J. Dean, "Tail at Scale," Communications of the ACM, vol. 56, no. 2, pp. 74 to 80, 2013. (General latency analysis applicable to API round trips)

[8] M. Csíkszentmihályi, Flow: The Psychology of Optimal Experience, Harper Perennial, 1990. (Flow state disruption from latency)

[9] OpenAI, "API Pricing," https://openai.com/api/pricing/, accessed May 2026.

[10] Anthropic, "API Pricing," https://docs.anthropic.com/en/docs/about-claude/pricing, accessed May 2026.

[11] Anthropic, "Rate Limits," https://docs.anthropic.com/en/api/rate-limits, accessed May 2026.

[12] M. Mozes, X. He, B. Kleinberg, and L. D. Griffin, "Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities," arXiv preprint arXiv:2308.12833, 2023.

[13] Apple Inc., "Apple M5 Max chip architecture," Apple Newsroom, 2025. https://www.apple.com/newsroom/

[14] NVIDIA Corporation, "NVIDIA H100 Tensor Core GPU Datasheet," 2023.

[15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," NeurIPS, 2019. (PCIe bottleneck discussion in distributed training context)

[16] A. Shilov, "Apple Silicon deep-dive: Unified Memory Architecture explained," AnandTech, 2022.

[17] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, "SqueezeLLM: Dense-and-Sparse Quantization," arXiv preprint arXiv:2306.07629, 2023.

[18] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, vol. 52, no. 4, pp. 65 to 76, 2009.

[19] G. Gerganov, "llama.cpp Metal backend performance discussion," GitHub Issue #4292, 2024.

[20] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive Mixtures of Local Experts," Neural Computation, vol. 3, no. 1, pp. 79 to 87, 1991.

[21] W. Fedus, B. Zoph, and N. Shazeer, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," JMLR, vol. 23, no. 120, pp. 1 to 39, 2022.

[22] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR, 2017.

[23] Alibaba Cloud, "Qwen3 Technical Report," arXiv preprint arXiv:2505.09388, 2025.

[24] DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv preprint arXiv:2412.19437, 2024.

[25] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale," NeurIPS, 2022.

[26] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized Language Models," NeurIPS, 2023.

[27] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers," ICLR, 2023.

[28] G. Gerganov, "GGUF specification v3," llama.cpp documentation, 2024. https://github.com/ggml-org/llama.cpp/blob/master/docs/gguf.md

[29] IK Labs, "UD (Ultra-Dense) quantization for GGUF models," Hugging Face Blog, 2025.

[30] A. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, et al., "Microscaling Data Formats for Deep Learning," arXiv preprint arXiv:2310.10537, 2023.

[31] G. Gerganov, "llama.cpp: LLM inference in C/C++," https://github.com/ggml-org/llama.cpp, accessed May 2026.

[32] Apple Inc., "Metal Shading Language Specification, Version 3.2," Apple Developer Documentation, 2025.

[33] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," NeurIPS, 2022.

[34] T. Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning," arXiv preprint arXiv:2307.08691, 2023.

[35] C. Leviathan, M. Kalman, and Y. Matias, "Fast Inference from Transformers via Speculative Decoding," ICML, 2023.

[36] Y. Cai, B. Li, J. Xiao, B. He, and K. He, "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," arXiv preprint arXiv:2401.10774, 2024.

[37] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Hambro, et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," NeurIPS, 2023.

[38] J. Nielsen, "Response Times: The 3 Important Limits," Nielsen Norman Group, 1993. https://www.nngroup.com/articles/response-times-3-important-limits/

[39] Various authors, "Setting iogpu.wired_limit_mb for Apple Silicon LLM inference," Reddit r/LocalLLaMA, 2024 to 2026.

[40] Anthropic, "Multi-agent orchestration," Claude Documentation, 2026.

All benchmarks were conducted on an Apple M5 Max with 128 GB unified memory running macOS 26.4 (Tahoe) and llama.cpp build b9430 with the Metal backend. The benchmarking protocol includes mandatory sudo purge between model evaluations, thermal cool-down periods, and median-of-three reporting. Raw benchmark data, the complete launchd plist, and reproduction scripts are available upon request. The author declares no conflicts of interest. No external funding was received for this work.