
Understanding --cache-ram in llama.cpp: Prompt Caching, Eviction Errors, and Apple Silicon

Tags: LLM, llama.cpp, Apple Silicon, Infrastructure, KV Cache

This post documents a debugging session that started with mysterious KV cache eviction errors on my llama-server and ended with a clearer understanding of what --cache-ram actually does, how it interacts with KV cache quantization on Apple Silicon, and why most online guides get it wrong.

The Problem

I run a llama-server on an M5 Max (128 GB unified memory) serving Qwen 3.6 35B-A3B with 128k context, 2 parallel slots, q8_0 KV cache quantization, and Flash Attention. The setup is documented in my previous benchmarking post, with the launchd plist configured for persistent inference on a local network.

The server was throwing a relentless error during long-context agentic sessions:

failed to find free space in the KV cache, retrying with smaller batch size

Requests would fail, slots would report busy, and conversations would get cut short after 15 to 20 tool calls. The math seemed straightforward: the model is approximately 25 GB, each KV cache slot is approximately 4 GB at q8_0 128k context, total is approximately 33 GB. With 128 GB available, I should have 95 GB of headroom. Where was the memory going?
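The back-of-the-envelope budget above can be written out explicitly. This is a minimal sketch that uses only the approximate figures quoted in this post; the function and variable names are illustrative and not part of llama.cpp.

```python
# Approximate figures from the post, in GB.
TOTAL_GB = 128
MODEL_GB = 25
KV_PER_SLOT_GB = 4   # q8_0 KV cache at 128k context
SLOTS = 2


def naive_budget(total_gb, model_gb, kv_per_slot_gb, slots):
    """Return (used, headroom) counting only weights and the active KV cache."""
    used = model_gb + kv_per_slot_gb * slots
    return used, total_gb - used


used, headroom = naive_budget(TOTAL_GB, MODEL_GB, KV_PER_SLOT_GB, SLOTS)
print(f"used ~{used} GB, headroom ~{headroom} GB")
```

The budget deliberately counts nothing except weights and active KV cache, which is exactly why it looks so comfortable.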

What Most Guides Say (And Get Wrong)

Many online guides describe --cache-ram as forcing the KV cache into system RAM instead of VRAM. This is incorrect.

I found articles on craftrigs.com, various Medium posts, and Reddit discussions all saying the same thing: set --cache-ram to a huge value to give yourself more KV cache space. The misconception is so widespread that I believed it myself initially. The reasoning seemed to follow: "Your GPU has limited VRAM. Use system RAM for the cache instead."

That explanation does not survive a look at what the flag actually does.

--cache-ram does NOT control where the active KV cache lives. The active KV cache, the one used during token generation, always stays in GPU memory: Metal memory on Apple Silicon, VRAM on NVIDIA hardware. The parameter does not enable offloading.

The misconception leads people to set --cache-ram to 32 GB or 64 GB in the belief that it buys more KV cache space. It does not. Combined with the default settings on a memory-constrained system, it creates exactly the problem I was seeing: KV cache eviction errors on a machine that should have plenty of memory.

What --cache-ram Actually Does

Introduced in PR #16391 (October 2025), --cache-ram controls host-memory prompt caching. It is separate from the active KV cache entirely.

Here is how it works: when a request finishes and a slot goes idle, the server saves that slot's KV state to host RAM. When a new request arrives with a matching prefix (same system prompt, same conversation history prefix), the server restores the cached state instead of re-processing the prompt from scratch. This provides up to 93% time-to-first-token (TTFT) reduction for cached prefixes.
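The prefix-matching idea can be illustrated with a few lines of Python. This is a conceptual sketch only, assuming the cache is keyed on token sequences; it is not llama.cpp's implementation, and the token IDs and function names are made up for the example.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading token IDs the two sequences share."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_request(cached_tokens, new_tokens):
    """Split an incoming prompt into a restorable prefix and a remainder to prefill."""
    reused = common_prefix_len(cached_tokens, new_tokens)
    return {"restore": reused, "prefill": len(new_tokens) - reused}


# Hypothetical token IDs: a shared system prompt followed by different user turns.
system_prompt = [101, 7, 7, 42, 9]
first_request = system_prompt + [500, 501]
second_request = system_prompt + [600, 601, 602]

print(plan_request(first_request, second_request))
```

Only the tokens after the first divergence need to be prefilled, which is why a stable system prompt matters so much for hit rates.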

The default value is 8192 MiB, which equals 8 GiB. This has been enabled by default since October 2025.

The critical distinction: the active KV cache (the one used during inference) is separate and always lives in GPU/Metal memory. The prompt cache is a host-RAM feature for prefix reuse optimization.

For agentic workflows with consistent system prompts, this is extremely valuable. Every request shares the same system prompt prefix; if the prefix is cached in host RAM, prefill time drops from ~60 seconds (re-processing 128k tokens) to ~200 milliseconds (restoring from cache).

Why 8 GiB Caused Problems

On my M5 Max with 128 GB: model weights approximately 25 GB, 2 KV cache slots approximately 8 GB, system overhead approximately 13 GB, total approximately 46 GB used, approximately 82 GB free. Plenty of room, on paper.

But --cache-ram 8192 allocates an additional 8 GB of host RAM for prompt caching. On Apple Silicon, host RAM and GPU memory are the same unified pool. There is no separate system RAM and VRAM.

The prompt cache competes directly with the Metal allocator for unified memory. Under certain memory pressure conditions (other applications running, file cache growth, macOS WindowServer activity), this competition causes the Metal backend to fail KV cache allocations.

The eviction errors were not because the active KV cache was too large. They occurred because the prompt cache's 8 GB allocation reduced the available unified memory below what the Metal allocator needed for KV cache operations. When the Metal backend tried to allocate KV cache space for an incoming request, the unified memory allocator could not find a large enough free block, and the failure surfaced as the eviction error.

This is unique to Apple Silicon's unified memory architecture. On NVIDIA hardware, host RAM and VRAM are physically separate. When --cache-ram consumes 8 GB of host RAM on an RTX 4090 with 24 GB VRAM, the GPU's VRAM allocations are completely unaffected. But on Apple Silicon, host RAM and GPU memory are the same resource, so the prompt cache's allocation directly starves the active KV cache.
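The difference between the two architectures can be made concrete with a deliberately simplified model. The sketch below assumes everything on a unified-memory machine draws from one pool, while on a discrete GPU only weights and active KV cache occupy VRAM; the function name is hypothetical and the figures are the approximate ones from this post.

```python
def gpu_pool_usage(weights_gb, kv_gb, overhead_gb, prompt_cache_gb, unified):
    """GB drawn from the pool that the GPU allocator has to work within.

    unified=True  : host and GPU share one pool (Apple Silicon).
    unified=False : OS overhead and prompt cache live in separate host RAM.
    """
    usage = weights_gb + kv_gb
    if unified:
        # System overhead and the prompt cache come out of the same pool.
        usage += overhead_gb + prompt_cache_gb
    return usage


for cache_gb in (0, 8):
    shared = gpu_pool_usage(25, 8, 13, cache_gb, unified=True)
    discrete = gpu_pool_usage(25, 8, 13, cache_gb, unified=False)
    print(f"--cache-ram {cache_gb} GB: unified pool {shared} GB, discrete VRAM {discrete} GB")
```

Changing the prompt cache size moves the unified figure but leaves the discrete figure untouched, which is the whole asymmetry in one line.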

The Fix

Three options, ordered by preference:

Option 1: Reduce prompt cache size. Set --cache-ram 2048 or --cache-ram 4096. You still get prompt caching benefits, with less memory pressure on the unified allocator. For agentic workflows with system prompts, even 2 to 4 GB of prompt cache is enough to cache 2 to 4 full conversation prefixes.

Option 2: Disable prompt caching entirely. Set --cache-ram 0. You lose the TTFT benefit from cached prefixes but eliminate the memory competition. On my server, I chose this path because the Qwen 3.6 model is already fast at prefill (1,137 t/s at 128k context), and memory stability mattered more than the prompt caching optimization.

Option 3: Add --kv-unified. This flag is required for prompt caching to work properly anyway. Without it, the server disables idle slot caching and the only notice is a warning in the log. If you want to keep --cache-ram enabled, always pair it with --kv-unified.
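To get a feel for what a given --cache-ram limit can hold, the per-token KV footprint can be backed out of the slot size quoted earlier. This is a rough sketch under two assumptions that are mine rather than the source's: that cached state costs about the same per token as the active q8_0 cache, and that the post's "4 GB per 128k slot" can be treated as 4 GiB. The helper names are hypothetical.

```python
MIB_PER_GIB = 1024


def kv_mib_per_token(slot_gib, slot_tokens):
    """Approximate KV footprint per token, derived from one slot's size."""
    return slot_gib * MIB_PER_GIB / slot_tokens


def cacheable_tokens(cache_ram_mib, mib_per_token):
    """Rough number of tokens of KV state that fit under a --cache-ram limit."""
    return int(cache_ram_mib / mib_per_token)


# Post's figure: about 4 GB for one 128k-token slot at q8_0.
per_token = kv_mib_per_token(slot_gib=4, slot_tokens=131072)

for limit_mib in (2048, 4096, 8192):
    print(limit_mib, "MiB ->", cacheable_tokens(limit_mib, per_token), "tokens")
```

How many prefixes that represents depends entirely on how long each cached prefix is, so it is worth plugging in your own typical prompt length.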

The following is a reference launchd plist (the same configuration used in the benchmarking post, updated with --cache-ram 0 and --kv-unified):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.local.llama-server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/llama-server</string>
        <string>-m</string>
        <string>/Models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf</string>
        <string>-ngl</string>
        <string>99</string>
        <string>-fa</string>
        <string>1</string>
        <string>-c</string>
        <string>131072</string>
        <string>--cache-type-k</string>
        <string>q8_0</string>
        <string>--cache-type-v</string>
        <string>q8_0</string>
        <string>--cache-ram</string>
        <string>0</string>
        <string>--kv-unified</string>
        <string>--mlock</string>
        <string>--parallel</string>
        <string>2</string>
        <string>--jinja</string>
        <string>--host</string>
        <string>0.0.0.0</string>
        <string>--port</string>
        <string>8080</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/llama-server.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/llama-server.err</string>
</dict>
</plist>
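If you would rather not hand-edit XML, the same job definition can be generated with Python's standard plistlib module. This is an optional sketch that mirrors the reference plist; the paths and label are the ones used in this post, and keeping the flags in a single list is simply a convenience, not something launchd requires.

```python
import plistlib

# Flags mirror the reference plist; edit them in one place.
args = [
    "/usr/local/bin/llama-server",
    "-m", "/Models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
    "-ngl", "99",
    "-fa", "1",
    "-c", "131072",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--cache-ram", "0",
    "--kv-unified",
    "--mlock",
    "--parallel", "2",
    "--jinja",
    "--host", "0.0.0.0",
    "--port", "8080",
]

job = {
    "Label": "com.local.llama-server",
    "ProgramArguments": args,
    "RunAtLoad": True,
    "KeepAlive": True,
    "StandardOutPath": "/tmp/llama-server.log",
    "StandardErrorPath": "/tmp/llama-server.err",
}

# sort_keys=False keeps the key order used in the hand-written file.
xml_bytes = plistlib.dumps(job, fmt=plistlib.FMT_XML, sort_keys=False)

# Round-trip to confirm the output parses back to the same structure.
assert plistlib.loads(xml_bytes) == job
print(xml_bytes.decode("utf-8"))
```

Redirect the output to a .plist file in your LaunchAgents directory once the printed XML looks right.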

Plist Structure Breakdown

The launchd plist is a configuration file that tells macOS how to manage the llama-server process. Here is what each element does:

Label (com.local.llama-server): A unique identifier for the service. Use reverse-DNS notation to avoid conflicts with other services. This is how launchd tracks the process across reboots.

ProgramArguments: The command-line invocation of llama-server. Each array element is one argument, including the binary path, model path, and all flags. The array order matches the order arguments appear on the command line.

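  • /usr/local/bin/llama-server — The binary path. Install with brew install llama.cpp or build from source. Homebrew on Apple Silicon installs under /opt/homebrew/bin by default, so adjust the path to match where your binary actually lives.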
  • -m /Models/... — Path to the GGUF model file. Use absolute paths; launchd does not inherit your shell's $HOME.
  • -ngl 99 — Offload all layers to the GPU. Setting to 99 tells llama.cpp to offload as many layers as possible; if the model fits entirely in memory, all layers go to GPU.
  • -fa 1 — Enable Flash Attention. Reduces peak memory from O(n²) to O(n) and is a prerequisite for KV cache quantization.
  • -c 131072 — Set context window to 128k tokens. Controls the size of the active KV cache pool.
  • --cache-type-k q8_0 — Quantize the key cache to 8-bit. Roughly halves key-cache memory relative to the f16 default, with negligible quality impact.
  • --cache-type-v q8_0 — Quantize the value cache to 8-bit. Same rationale as the key cache.
  • --cache-ram 0 — Disable host-memory prompt caching. Set to a value like 4096 to enable with a 4 GB limit.
  • --kv-unified — Use a single shared KV buffer across all slots. Required for prompt caching to work.
  • --mlock — Pin model weights in memory so macOS cannot page them to disk. Without this, page-outs cause 100x to 1000x throughput drops.
  • --parallel 2 — Serve 2 concurrent requests. Without --kv-unified each slot has its own KV cache allocation; with it, the slots share one buffer.
  • --jinja — Enable Jinja2 chat template rendering, required for tool-calling formats.
  • --host 0.0.0.0 — Bind to all network interfaces, allowing access from other machines.
  • --port 8080 — Listen on port 8080.

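RunAtLoad (true): Start the service as soon as launchd loads the job, which for a per-user agent happens at login. If set to false, the job is loaded but stays idle until you start it manually, for example with launchctl start com.local.llama-server.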

KeepAlive (true): Restart the process whenever it exits, including after a crash. Without this, a crash would require manual intervention to restore the service.

StandardOutPath and StandardErrorPath: Redirect stdout and stderr to files instead of the terminal. Use these to inspect runtime errors and server logs. The files are created automatically if they do not exist.
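The "halves KV cache memory" figure given for the q8_0 cache types in the flag list follows from the storage layout. The sketch below assumes ggml's q8_0 block format (32 signed 8-bit values plus one 16-bit scale per block) and compares it with plain f16; the constant and function names are illustrative.

```python
# ggml's q8_0 format stores 32 int8 values plus one 16-bit scale per block.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2   # 34 bytes per 32 elements
F16_BYTES_PER_ELEM = 2


def bytes_per_elem_q8_0():
    """Average storage cost of one cached element in q8_0."""
    return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_ELEMS


ratio = bytes_per_elem_q8_0() / F16_BYTES_PER_ELEM
print(f"q8_0: {bytes_per_elem_q8_0()} bytes/element, {ratio:.1%} of f16")
```

The per-block scale is why the saving is slightly less than an exact half.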

Interaction with Other Flags

| Flag | Controls | Memory pool / effect |
| --- | --- | --- |
| --cache-type-k q8_0 | Active KV cache key data type | GPU/Metal memory |
| --cache-type-v q8_0 | Active KV cache value data type | GPU/Metal memory |
| --cache-ram N | Prompt cache size limit | Host RAM (unified on Apple Silicon) |
| --cache-reuse N | Minimum chunk size for KV shifting | GPU/Metal memory |
| --kv-unified | Single shared KV buffer | GPU/Metal memory |
| -c 131072 | Context window size | Determines active KV cache size |
| --parallel N | Number of request slots | Multiplies active KV cache size |
| -b 512 | Logical batch size | Shapes prefill throughput |
| --ubatch-size 1024 | Physical micro-batch for GPU | Controls matmul efficiency |

Understanding these flags separately prevents the common mistake of thinking --cache-ram affects the active KV cache size or placement.

Known Issues

Several issues with --cache-ram and prompt caching have been reported on the llama.cpp repository:

Metal plus KV offload crash (issue #23578, May 2026): Setting --cache-ram with certain quantization types causes a GGML_ASSERT(buf_dst) failed crash during KV cache updates. Workaround: set --cache-ram 0 or add --no-kv-offload.

Slow cache updates (fixed in build b8185): Prompt cache updates could take 80 or more seconds, defeating the purpose of the optimization. This was a regression introduced in October 2025 and fixed by mid-May 2026.

--kv-unified plus --cache-reuse conflict (issue #23493): Using both flags together can cancel newly allocated tasks. If you need --cache-reuse for multi-turn caching, test thoroughly before deploying.

Memory-constrained devices: On phones, tablets, or 16 GB laptops, the default 8 GB --cache-ram allocation can starve the active KV cache. Always set --cache-ram 0 on memory-constrained systems unless you are running a very small model with short context.

The --cache-reuse Optimization

Hannecke's tuning guide for llama-server on Apple Silicon highlights --cache-reuse as the single most underused flag for agent operators, and it is worth understanding alongside --cache-ram because they solve different problems and can work together.

--cache-reuse collapses repeated prefill by reusing KV cache slices already in the active GPU cache. It uses a minimum chunk size threshold (256 tokens is the common starting point) and detects when an incoming prompt shares a chunk with a recently processed one. Instead of re-evaluating the shared prefix, it shifts the cached KV slice forward and only prefills the new portion. This is fundamentally different from --cache-ram, which stores idle slot states in host RAM. --cache-reuse works within the active GPU memory pool.

For agent loops with a stable system prompt, the win is significant. A classification agent with a 2,000-token system prompt that runs 100 requests per minute would otherwise prefill 200,000 tokens per minute that are byte-for-byte identical. With --cache-reuse active, the prefill cost collapses to the per-request user message and any tool output appended after the system prompt.
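The arithmetic in the classification-agent example is easy to reproduce. This is an idealized sketch that assumes every request after warm-up hits the cache; the 2,000-token system prompt and 100 requests per minute come from the paragraph above, while the 50-token user message is a made-up value for illustration.

```python
def prefill_tokens_per_minute(system_tokens, new_tokens, requests_per_minute, reuse):
    """Tokens that must be prefilled each minute.

    With reuse, only the tokens appended after the shared system prompt
    are evaluated; without it, the whole prompt is re-evaluated every time.
    """
    per_request = new_tokens if reuse else system_tokens + new_tokens
    return per_request * requests_per_minute


without_reuse = prefill_tokens_per_minute(2000, 50, 100, reuse=False)
with_reuse = prefill_tokens_per_minute(2000, 50, 100, reuse=True)
print(f"without reuse: {without_reuse} tokens/min, with reuse: {with_reuse} tokens/min")
```

The gap between the two numbers is the 200,000 identical system-prompt tokens per minute.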

llama-server \
  -m model.gguf \
  --cache-reuse 256 \
  --cache-ram 4096 \
  --kv-unified

This combination applies both optimizations: --cache-reuse collapses repeated prefill within the active KV cache, while --cache-ram carries the cache across a wider window via host RAM, re-injecting prefixes at slot switch.

A caveat applies to sliding-window-attention models. Gemma-class architectures with shared-KV layers currently log "cache reuse is not supported" even with -fa and --swa-full enabled (llama.cpp issue #21468). For hybrid Qwen3.5 and Qwen3.6 models the prefix-caching path on the attention layers behaves like any other model, but DeltaNet recurrent state has its own restoration semantics and is an active area of iteration. Verify cache-reuse hit rates empirically on your pinned commit before budgeting on them.

Measuring Prompt Cache Effectiveness

The llama-server chat-completions API provides direct measurement of cache effectiveness. Every response includes a timings object with prompt_n (tokens actually prefilled) and cache_n (tokens reused from cache). For an agent with a stable system prompt, cache_n should approach the system-prompt length on every request after the first. If it does not, you have a prefix invalidation problem, not a cache configuration problem.

In streaming mode with return_progress enabled, the same data appears in prompt_progress with fields total, cache, processed, and time_ms.

# Example: check if cache is actually firing
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true,
    "return_progress": true
  }' | grep -E '"cache_n"|"prompt_n"'

You can also look for this log line in the llama-server output:

prompt cache is enabled, size limit: 8192 MiB

For agentic workflows where every request shares the same system prompt, the cache is extremely effective. You will observe 200 to 500 millisecond TTFT for cached requests versus 30 to 120 seconds for uncached full prefill.
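If you log responses, a small helper can turn the timings fields into a hit rate. This is a minimal sketch that relies only on the prompt_n and cache_n fields described above; the sample response body is hypothetical and the function name is mine.

```python
import json


def cache_hit_rate(timings):
    """Fraction of prompt tokens served from cache for one response."""
    cached = timings.get("cache_n", 0)
    prefilled = timings.get("prompt_n", 0)
    total = cached + prefilled
    return cached / total if total else 0.0


# Hypothetical response body; only the two fields named in the post are used.
sample = json.loads('{"timings": {"prompt_n": 120, "cache_n": 1880}}')
rate = cache_hit_rate(sample["timings"])
print(f"{rate:.0%} of prompt tokens came from cache")
```

A rate that stays near zero on repeated requests points at a prompt prefix that changes between calls, for example a timestamp near the top of the system prompt.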

Context Size Distribution: Static vs. Shared Pool

The way --ctx-size is distributed across slots determines how eviction errors manifest. This is the single most common cause of agents that suddenly stop working.

When you set --parallel explicitly, --ctx-size is split evenly across slots. If you set --ctx-size 131072 with --parallel 2, each slot gets a fixed 65k context. A long request in slot 1 cannot borrow from slot 2. When slot 1's KV cache fills, you get an eviction error even if slot 2 has unused capacity.

When you use auto slots or enable --kv-unified, the context is a single shared pool. A single shared KV buffer covers the full --ctx-size, allocated to requests as they arrive. Heterogeneous request lengths are tolerated more gracefully, because a short request next to a long one borrows from the pool rather than wasting an isolated slot allocation. The total budget is still --ctx-size; it is just not pre-sliced.

For production, set --parallel explicitly so capacity planning stays predictable, then make a deliberate choice between the static partition (predictable per-slot ceiling) and --kv-unified (more flexible under heterogeneous load). Treating either mode as universal leads to the kind of confusion I experienced: a static partition can raise the same eviction error even when total memory is ample, because a slot fills before its request completes.
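The difference between the two modes can be shown with a toy capacity check. This is a simplified model of the behavior described in this section, not llama.cpp's allocator; the function names and the request sizes are hypothetical.

```python
def fits_static(request_tokens, ctx_size, parallel):
    """Static partition: each request must fit in its own ctx_size / parallel slice."""
    per_slot = ctx_size // parallel
    return len(request_tokens) <= parallel and all(t <= per_slot for t in request_tokens)


def fits_unified(request_tokens, ctx_size, parallel):
    """Shared pool: concurrent requests only need to fit in ctx_size together."""
    return len(request_tokens) <= parallel and sum(request_tokens) <= ctx_size


# Hypothetical concurrent requests: one long agent session, one short query.
requests = [90_000, 20_000]

print(fits_static(requests, 131072, 2))    # False: 90,000 exceeds the 65,536-token slice
print(fits_unified(requests, 131072, 2))   # True: 110,000 fits in the shared 131,072
```

The same two requests fail in one mode and succeed in the other without any change in total memory.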

Beyond --cache-ram: Other Apple Silicon Tuning Levers

--cache-ram is one flag in a connected system. For agent workloads on Apple Silicon, the most impactful tuning parameters are:

  • --ubatch-size: The physical batch the GPU actually sees. On Apple Silicon, raising this to 1024 or 2048 often improves prefill throughput substantially, because the matmul units run closer to peak with larger micro-batches. Keep it no larger than --batch-size, since each micro-batch is a slice of the logical batch.
  • --cache-reuse 256: Collapses repeated prefill for stable system prompts. The single biggest lever for agent operators.
  • --kv-unified: Required for prompt caching to work properly. Enables the shared KV pool.
  • --parallel N: Sets the slot count. Controls the active KV cache multiplier.

For a complete walkthrough of these parameters and the KV-cache math behind them, see Michael Hannecke's Tuning llama-server on Apple Silicon, which covers --ubatch-size, --batch-size, --ctx-size slot budgeting, and --cache-reuse in detail.

Summary: Lessons Learned

--cache-ram is prompt caching for prefix reuse, not KV cache placement. It does not move the active KV cache to system RAM; it caches idle KV states in host RAM for fast restoration.

On Apple Silicon, prompt cache competes with the active KV cache for the same unified memory pool. The default 8 GiB allocation is too aggressive for systems already running large models with long context windows and concurrent request slots.

Always pair --cache-ram with --kv-unified for correct behavior. Without the flag, idle slot caching is disabled and the only notice is a warning in the log.

When debugging KV cache eviction errors, check total memory consumption including the prompt cache, not just model weights plus KV cache math. The prompt cache's host-RAM allocation reduces the available unified memory pool on Apple Silicon.

For agent workloads, --cache-reuse 256 is the single most impactful optimization flag. It collapses repeated prefill within the active GPU cache, which is fundamentally different from --cache-ram's host-RAM approach. Use both together for maximum effect on stable system prompts.

For most production deployments on Apple Silicon, set --cache-ram 0 unless your workload has extremely consistent prompts and TTFT optimization is critical. Memory stability is more valuable than the prompt caching speedup. The eviction errors I experienced were resolved by disabling --cache-ram entirely, but adding --cache-reuse may provide the TTFT benefit without the memory competition.

Measure cache effectiveness using timings.cache_n from the chat-completions API response. If cache_n does not approach the system-prompt length on subsequent requests, you have a prefix invalidation problem rather than a configuration issue.