The LLM Field Manual: A Complete Guide to Language Model Terminology

Tags: LLM, AI, Quantization, Fine-Tuning, Reference

I am not an expert in machine learning or natural language processing. This is a working reference I built for myself while experimenting with local LLMs on Apple Silicon. Treat it as field notes from a practitioner, not a textbook. If something is wrong or out of date, I would genuinely appreciate the correction.

What is an LLM?

A Large Language Model (LLM) is a neural network trained on massive amounts of text to predict the next token in a sequence. The "large" refers to the number of trainable parameters, typically ranging from a few billion to hundreds of billions. These parameters encode patterns in language: grammar, facts, reasoning strategies, coding ability.

LLMs are autoregressive text generators. Given a prompt, they produce one token at a time, feeding each generated token back into the model to produce the next. Everything else (chat interfaces, tool calling, code completion) is built on top of this mechanism.
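
To make the feedback loop concrete, here is a minimal sketch in Python. The `TOY_MODEL` table and the `next_token_probs` helper are hypothetical stand-ins for a real network; a real LLM conditions on the whole sequence, not just the last token.

```python
import random

# Hypothetical stand-in for a trained model: next-token probabilities
# keyed by the most recent token only.
TOY_MODEL = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights)[0]
        if token == "<eos>":
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens

print(generate(["<bos>"]))
```

Chat, tool calling, and code completion all reduce to running this loop with different prompts and stopping rules.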

The Transformer architecture

Nearly all modern LLMs use the Transformer architecture (2017). Know these components:

Term | What it does
Attention | Lets the model weigh how much each token should influence the next prediction. In "self-attention" every token attends to the other tokens in the same sequence. Decoder-only LLMs apply a causal mask, so each token sees only itself and earlier tokens.
Multi-Head Attention | Runs multiple attention computations in parallel, each learning different aspects of the input (syntax, semantics, position).
Feed-Forward Network | A small MLP (two or three dense layers) applied to each token independently after attention. Holds most of the parameters and is widely thought to store much of the model's learned knowledge.
Layer | One transformer block: attention + feed-forward. Models stack many layers (32, 64, 128+) for depth.
Embedding | The numerical representation of a token. Text becomes embeddings, flows through layers, then maps back to token probabilities.
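
As a rough illustration of the attention row, the sketch below computes single-head causal self-attention in plain Python. It is a simplification: the input vectors are used directly as queries, keys, and values, whereas a real layer applies learned projections first.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """x is a list of token vectors. Assumption: no learned Q/K/V
    projections, so x itself serves as queries, keys, and values."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Causal mask: token i may only look at positions 0..i.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible vectors.
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

print(causal_self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

The first token can only attend to itself, so its output equals its input. The second token mixes both vectors, weighted toward the one it is most similar to.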

Attention variants

The attention variant determines KV cache size and inference speed. Know the differences:

Variant | What it means for you
MHA (Multi-Head Attention) | The original. Each head gets its own Key, Value, and Query projections. Full quality, highest memory cost. Found in older models like Llama 1.
MQA (Multi-Query Attention) | All heads share one Key/Value projection, keep separate Queries. Cuts KV cache by a factor equal to the number of heads (8x or more, e.g., 32x with 32 heads). Slight quality tradeoff.
GQA (Grouped-Query Attention) | The current standard. Heads share Key/Value projections in groups (e.g., 32 query heads, 8 KV groups = 4x cache reduction, minimal quality loss). Used by Llama 3, Qwen, Gemma, and most modern models.

Bottom line: with 32 query heads, GQA with 8 KV groups needs roughly 4x less memory for its KV cache than MHA. This is one reason newer models handle much longer context windows without running out of memory.
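
The cache sizes can be estimated with simple arithmetic. The sketch below assumes a hypothetical 32-layer model with 32 query heads of dimension 128, an 8k context, and fp16 cache entries. None of these numbers come from a specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Factor 2 covers keys and values; one vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical model shape (assumption, for illustration only).
shape = dict(n_layers=32, head_dim=128, n_tokens=8192)
mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 8 KV groups
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # a single shared KV head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

Under these assumptions MHA needs 4 GiB, GQA with 8 groups needs 1 GiB, and MQA needs 0.125 GiB.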

Positional encoding

Transformers have no built-in sense of token order. Positional encoding adds position information so the model distinguishes "the cat sat" from "sat the cat."

Method | When you'll see it
Absolute positional encoding | Fixed position vectors per token. Capped at training length. Used by the original Transformer and GPT-2. Legacy.
RoPE (Rotary Position Embedding) | Rotates query/key vectors by position. The dominant method today (Llama, Qwen, Gemma). Supports extension beyond training length.
ALiBi (Attention with Linear Biases) | Adds a linear distance penalty. No learned parameters. Naturally extrapolates to longer sequences.
RoPE scaling | Extends context beyond training length. Linear scaling (simple, lossy), NTK-aware (better), YaRN (best, combines methods).

When a model advertises "128k context" but was trained on 8k, it typically relies on RoPE scaling. Quality at extended lengths depends on the scaling method and whether the model was fine-tuned at the longer length.
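
The key property of RoPE is that the query/key dot product depends only on the distance between two positions, not on where they sit in the sequence. A minimal sketch, assuming the common base of 10000 and rotation of adjacent pairs (real implementations differ in how they pair dimensions):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same distance (3 positions apart), very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two scores agree up to floating-point error. RoPE scaling methods work by adjusting the rotation angles so that longer distances map back into the range seen during training.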

Flash Attention

Flash Attention computes attention in tiles that fit in fast on-chip SRAM instead of materializing the full attention matrix. Faster, less memory, same output.

Term | What to know
Flash Attention | Tiled attention. Drops memory from O(n²) to O(n), runs 2 to 4x faster. Required for efficient long-context inference.
Flash Attention 2 | Better parallelism and work partitioning. Standard in most inference engines today.
Flash Attention 3 | Hopper (H100) GPU optimizations. Adds FP8 and async computation.
PagedAttention | Used by vLLM. Manages KV cache like virtual memory pages. Not related to Flash Attention despite frequent confusion.

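In practice: enable Flash Attention in llama.cpp with the -fa flag. Recent builds accept -fa on|off|auto and default to auto, so check llama-server --help for your version. Flash Attention is required for a quantized V cache and recommended for any context above 4k tokens.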

# Enable Flash Attention on a server
llama-server \
  -m model.gguf \
  -fa

Parameters and model sizing

Parameter count is the primary size metric. Parameters are the weights and biases learned during training.

Notation | Meaning | Example
B | Billion parameters | Llama 3.1 70B = 70 billion parameters
M | Million parameters | Rare in modern LLMs
Active params | Parameters used per token (MoE only) | Qwen3 35B-A3B uses 3B of its 35B per token

Quick math: each parameter in fp16 takes 2 bytes. A 70B model needs ~140 GB just for weights before quantization.
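
The same arithmetic as a small helper, useful for checking whether a model can fit before downloading it. The 4.5 bits-per-weight figure is an assumed stand-in for a typical 4-bit quantization.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))   # fp16 baseline
print(weight_memory_gb(70e9, 4.5))  # assumed ~4.5 bits per weight
```

This covers weights only. The KV cache and runtime buffers come on top.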

Tokens and tokenization

Models process tokens (subword units), not raw text. A tokenizer handles the conversion.

Term | What to know
Token | A subword unit. "unhappiness" might become ["un", "happiness"]. Common words stay whole; rare words split.
Vocabulary | The full token set. Typical sizes: 32k to 256k tokens.
BPE | Byte Pair Encoding. The dominant tokenization algorithm. Iteratively merges the most frequent adjacent symbol pairs.
Context length | Maximum tokens the model handles in one session (prompt + generation combined). Common: 8k, 32k, 128k, 1M.
Context window | Same as context length.

Rule of thumb: 1 token ≈ 3/4 of an English word. 128k tokens ≈ 96,000 words ≈ 300 pages.
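
To show what "merging frequent pairs" means, here is a toy BPE trainer. It is a sketch only: it works on characters of a tiny word list, while production tokenizers operate on bytes, use pre-tokenization rules, and train on huge corpora.

```python
from collections import Counter

def merge(symbols, pair):
    """Replace every adjacent occurrence of pair with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, n_merges):
    corpus = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge(symbols, best) for symbols in corpus]
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], n_merges=2)
print(merges)
print(corpus)
```

After two merges the frequent string "low" is a single token, while the rarer suffixes are still split into characters.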

Dense vs. Mixture of Experts (MoE)

Two architectures to know:

Dense models activate every parameter for every token. A 70B dense model reads all 70 billion parameters per token. High quality, slow inference.

MoE models have many parameters but only activate a subset ("experts") per token. A router picks which experts fire.

Term | What to know
Expert | One of several parallel feed-forward networks inside a layer. Many exist, few activate per token.
Router | Decides which experts process each token.
Active parameters | Parameters used per token. The number that determines inference speed.
Total parameters | Full count including all experts. Determines download size and memory.
Top-k routing | Activating k most relevant experts per token. Common: top-2, top-4, top-8.

Reading the name: XB-AYB = X billion total, Y billion active per token. Qwen3 35B-A3B has 35B total but activates 3B per token, giving 3B-class speed with much higher quality.

Why this matters on Apple Silicon: token generation is memory-bandwidth bound. An MoE model reads only its active parameters from memory for each token, so a 35B-A3B model approaches the generation speed of a 3B dense model while still needing memory for all 35B parameters. On Apple Silicon, which pairs large unified memory with 400+ GB/s of bandwidth on higher-end chips, this makes MoE a very good fit for local inference.
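
A back-of-the-envelope sketch of that bandwidth argument: if every active weight must be read once per generated token, bandwidth divided by bytes read gives an upper bound on tokens per second. The 4.5 bits per weight is an assumption, and the bound ignores KV cache reads, compute, and other overhead, so real numbers will be lower.

```python
def max_decode_tps(bandwidth_gb_s, active_params, bits_per_weight):
    """Upper bound on generation speed when weight reads dominate."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: 400 GB/s memory bandwidth, ~4.5 bits per weight.
dense = max_decode_tps(400, 35e9, 4.5)  # dense 35B reads everything
moe = max_decode_tps(400, 3e9, 4.5)     # MoE reads only ~3B active params

print(f"dense 35B ceiling:  {dense:.0f} t/s")
print(f"35B-A3B ceiling:    {moe:.0f} t/s")
```

The ratio between the two ceilings is simply total over active parameters, about 11.7x here.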

Quantization

Quantization reduces weight precision from the original 16 bits per parameter to fewer bits. The result is a smaller model that needs less memory and generates faster, with quality loss that is minimal at higher bit widths and grows as you go lower.

Precision formats

Format | Bits | Memory per 1B params | When to use
fp32 | 32 | 4 GB | Training only. Never for inference.
fp16 / bf16 | 16 | 2 GB | Baseline "unquantized" inference format.
Q8_0 | 8 | ~1 GB | Nearly lossless. Use when memory allows.
Q6_K | ~6.5 | ~0.8 GB | High quality, minimal loss.
Q5_K_M | ~5.5 | ~0.7 GB | Good quality/size balance.
Q4_K_M | ~4.5 | ~0.56 GB | The sweet spot for most users. Start here.
Q4_0 | 4 | ~0.5 GB | Basic 4-bit. Noticeable loss on small models.
Q3_K_M | ~3.5 | ~0.44 GB | Aggressive. Quality degrades.
Q2_K | ~2.5 | ~0.31 GB | Extreme. Significant quality loss. Last resort.
MXFP4 | 4 | ~0.5 GB | Microscaling FP4. Newer format with better numerics than Q4_0.
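
For intuition about what a block format does, here is a toy symmetric 8-bit block quantizer. It is modeled loosely on my understanding of Q8_0 (blocks of 32 weights sharing one fp16 scale); treat the layout as an assumption and not as the actual llama.cpp implementation.

```python
import random

BLOCK = 32  # weights per block (assumed, as in Q8_0)

def quantize_block(values):
    """One shared scale per block plus one signed 8-bit integer per weight."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return scale, [round(v / scale) for v in values]

def dequantize_block(scale, ints):
    return [scale * q for q in ints]

rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(BLOCK)]

scale, ints = quantize_block(weights)
restored = dequantize_block(scale, ints)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 32 int8 values plus one 16-bit scale per block.
bits_per_weight = (8 * BLOCK + 16) / BLOCK

print(f"max round-trip error: {max_err:.6f}")
print(f"effective bits per weight: {bits_per_weight}")
```

The per-block scale is why effective bits per weight end up slightly above the nominal figure (8.5 here for a nominal 8-bit format).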

Quantization methods and formats

Method | What to know
GGUF | The file format for llama.cpp. Single file with embedded metadata (tokenizer, config). The standard for local inference. Use this.
GPTQ | Post-training quantization with calibration data. Accurate 4-bit. Used with GPU servers (vLLM, TGI).
AWQ | Activation-Aware Weight Quantization. Faster to apply than GPTQ, often slightly better.
EXL2 | Variable bit-rate quantization for ExLlamaV2. Allocates more bits to important layers.
GGML | Predecessor to GGUF. Deprecated. Convert any GGML files you find.
UD (Unsloth Dynamic) | Unsloth's importance-weighted quantization. UD-Q5_K_XL allocates extra bits to critical layers.

How to read a quantization name

Q<bits>_K_<size> breaks down as:

  • Q = quantized
  • <bits> = approximate bits per weight (4, 5, 6, 8)
  • K = K-quant method (importance-based mixed precision)
  • <size> = S (small), M (medium), L (large), XL (extra-large). Larger = more bits on important layers.

Q5_K_M = 5-bit, K-quant method, medium variant.
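
The naming scheme is regular enough to parse mechanically. The helper below is a hypothetical sketch that handles only the patterns discussed in this post (K-quants, legacy `_0`/`_1` formats, and the `UD-` prefix), not every name found in the wild.

```python
import re

# Assumed pattern: optional "UD-" prefix, Q<bits>, then either
# K with an optional size suffix, or a legacy _0 / _1 variant.
PATTERN = re.compile(
    r"^(?:(?P<prefix>UD)-)?Q(?P<bits>\d)_"
    r"(?:(?P<k>K)(?:_(?P<size>XL|S|M|L))?|(?P<legacy>[01]))$"
)

def parse_quant(name):
    m = PATTERN.match(name)
    if not m:
        raise ValueError(f"unrecognized quantization name: {name}")
    return {
        "bits": int(m["bits"]),
        "k_quant": m["k"] is not None,
        "size": m["size"],            # None when no size suffix is present
        "unsloth_dynamic": m["prefix"] is not None,
    }

for name in ["Q5_K_M", "Q4_0", "Q6_K", "UD-Q5_K_XL"]:
    print(name, parse_quant(name))
```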

GGUF vs. MLX: model formats for Apple Silicon

Two ecosystems for running models on Mac: GGUF (llama.cpp) and MLX (Apple's framework). Same goal, different approaches.

Aspect | GGUF (llama.cpp) | MLX
Developer | Georgi Gerganov + community | Apple
Backend | C/C++ with Metal compute shaders | Python/C++ with Metal, native Apple framework
Model format | Single .gguf file with embedded metadata | Directory of .safetensors + JSON config
Quantization | K-quants (Q4_K_M, Q5_K_XL, etc.) | Linear quantization (4-bit, 8-bit)
Strengths | Mature, battle-tested, huge model library, strong raw performance on Apple Silicon, speculative decoding, advanced KV cache | Pythonic API, easy to modify, native Apple integration, better for research and fine-tuning
Weaknesses | C++ codebase harder to hack on | Younger ecosystem, fewer pre-quantized models, slower inference in my testing (results vary by model and version, so benchmark both)
Best for | Production inference, serving, maximum throughput | Research, prototyping, fine-tuning on Mac
Server mode | llama-server (OpenAI-compatible API) | mlx_lm.server (OpenAI-compatible API)
Model source | Hugging Face (search "GGUF") | Hugging Face (search "MLX")

Decision rule: for inference speed and model serving, use llama.cpp with GGUF. For experimentation, fine-tuning, or custom pipelines in Python, use MLX.

Start a llama.cpp server with an OpenAI-compatible API:

# Basic server (auto-detects Apple Silicon Metal backend)
llama-server \
  -m model.gguf \
  --port 8080
 
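# With all the recommended flags for production use:
# -ngl 99 offloads all layers to the GPU, -fa enables Flash Attention,
# --jinja applies the model's chat template, -c sets the context size,
# -ctk/-ctv quantize the KV cache.
# Note: --host 0.0.0.0 exposes the server to your whole network.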
llama-server -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa \
  --jinja \
  -c 32768 \
  -ctk q8_0 \
  -ctv q8_0

Test the API with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Many power users keep both: llama.cpp for daily model serving, MLX for trying architectures or quick fine-tunes.

Fine-tuning

Fine-tuning continues training a pretrained model on a specific dataset to change its behavior.

Methods

Method | What to know
Full fine-tuning | Updates all parameters. Same memory as training from scratch. Best quality, most expensive.
LoRA | Low-Rank Adaptation. Freezes original weights, trains small adapter matrices (0.1% to 1% of parameters). Dramatically cheaper. Start here.
QLoRA | Loads base model in 4-bit, trains LoRA on top. The QLoRA paper fine-tuned a 65B model on a single 48 GB GPU; 24 GB of VRAM is enough for roughly 30B-class models.
DoRA | Weight-Decomposed LoRA. Separates magnitude from direction. Often a drop-in improvement over LoRA.
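
The "0.1% to 1% of parameters" figure for LoRA follows directly from the shapes of the adapter matrices. A quick sketch for one weight matrix, with a hypothetical hidden size of 4096 and rank 16:

```python
def lora_params(d_in, d_out, rank):
    # The adapter is two thin matrices: A (rank x d_in) and B (d_out x rank).
    # The frozen base weight W is d_out x d_in.
    return rank * d_in + d_out * rank

d = 4096                      # hypothetical hidden size
full = d * d                  # parameters in the frozen weight matrix
adapter = lora_params(d, d, rank=16)
fraction = adapter / full

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.2%}")
```

At rank 16 the adapter trains about 0.78% as many parameters as the matrix it modifies. Lower ranks or adapting fewer matrices pushes the fraction down further.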

Alignment and preference tuning

After pretraining, models are aligned to be helpful, harmless, and honest:

Method | What to know
SFT | Supervised Fine-Tuning on curated instruction/response pairs. First alignment step.
RLHF | Reinforcement Learning from Human Feedback. Trains a reward model, then optimizes against it. Made ChatGPT possible.
DPO | Direct Preference Optimization. RLHF results without the reward model. Simpler, more stable.
ORPO | Odds Ratio Preference Optimization. Combines SFT + preference alignment in one step.
GRPO | Group Relative Policy Optimization. Samples multiple responses, uses relative quality. Used by DeepSeek.

Model stages

Stage | What it means
Base model | Raw pretrained model. Good at text completion, bad at following instructions.
Instruct model | Fine-tuned to follow instructions. What most people mean by "the model."
Chat model | Further tuned for multi-turn conversation with a specific chat template.
Censored/uncensored | Whether the model refuses certain request categories.

Distillation

Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's output probabilities, which carry richer information than simple correct/incorrect labels.

Term | What to know
Teacher model | The large, high-quality source. Often a frontier model (GPT-4, Claude).
Student model | The smaller model learning to replicate the teacher.
Logit distillation | Student matches the teacher's full probability distribution, not just the top prediction. Transfers "soft" knowledge about alternatives the teacher considered.
On-policy distillation | Student generates responses, teacher scores them. Often more effective than training on teacher outputs directly.
Synthetic data distillation | Teacher generates a training dataset, student trains on it with standard SFT. The most common approach.

When a model card says "distilled from" or "trained on synthetic data from" a larger model, this is what happened. Many Llama and Qwen variants are distilled from larger models in the same family or from proprietary models.

Chat templates and prompt formatting

LLMs require specific chat templates that structure conversations into roles (system, user, assistant). Wrong template = garbage output, even from a good model.

Term | What to know
Chat template | The exact text format for multi-turn conversation. Defines delimiters for system prompts, user messages, assistant responses.
ChatML | Uses <|im_start|> and <|im_end|> tags. Used by Qwen and many community models.
Llama format | Llama 3 uses <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>. Llama 2 used a different [INST] ... [/INST] format, so check the version.
Jinja template | Template engine in the tokenizer config that auto-formats messages. Most modern models include one.
System prompt | Sets behavior, personality, or constraints at conversation start. Not supported by all models.
Special tokens | Tokens like <eos>, <pad>, or template delimiters with special meaning. Not generated as regular text.
BOS / EOS | Beginning/End of Sequence tokens. Mark text boundaries.

In practice: always use the --jinja flag in llama.cpp for automatic template detection. Manual template specification is error-prone. Let the model's embedded metadata handle it.

# Auto-detect the chat template from model metadata
llama-server \
  -m model.gguf \
  --jinja
 
# Override the system prompt
llama-cli -m model.gguf \
  --jinja \
  -sys "You are a helpful coding assistant." \
  -p "Write a Python function to merge two sorted lists."
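
To see what a chat template actually produces, here is a minimal hand-written ChatML formatter. It is a sketch for illustration: real templates also handle tools, thinking tags, and model-specific defaults, which is why the embedded Jinja template is the safer choice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

The model only ever sees this flat string (as tokens). Roles exist purely as delimiters in the text.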

Reasoning and thinking models

Reasoning models produce an internal chain of thought before answering. More tokens spent thinking, better answers on hard problems.

Term | What to know
Chain-of-Thought (CoT) | Model shows reasoning step by step before the final answer. Major accuracy gains on math, logic, and coding.
Reasoning/thinking tokens | Tokens generated during internal reasoning. Some models (DeepSeek R1, QwQ) show them; others hide them.
Thinking budget | Configurable limit on thinking tokens. Higher = better quality, more latency and cost.
o1-style reasoning | Model develops reasoning strategies via reinforcement learning during training (not taught explicit patterns). Named after OpenAI's o1.
Hybrid thinking | Toggle between fast (no thinking) and slow (extended reasoning) per query. Qwen3 supports /think and /no_think modes.

When to use thinking: complex coding, math, multi-step reasoning. When to skip it: simple questions, creative writing, straightforward lookups. The thinking overhead is wasted on easy tasks.

Tool calling and agentic use

Tool calling lets the model invoke external functions by generating structured output (usually JSON). The host executes the function and returns results.

Term | What to know
Tool calling / function calling | Model outputs JSON requesting a function call with arguments. Host executes, returns result to model.
Tool use loop | Model requests tool → host executes → result fed back → model continues. Can repeat many times per response.
MCP (Model Context Protocol) | Anthropic's open standard for connecting LLMs to tools and data sources. Standardized tool, resource, and prompt exposure.
Agentic workflow | LLM as autonomous agent: plan, act via tools, observe results, iterate. Multi-step task execution beyond single-turn chat.
ReAct | Reasoning + Acting pattern. Model alternates between reasoning about what to do and acting via tool calls.
Structured output | Constraining output to a schema (JSON, XML) for reliable machine parsing. Essential for tool calling.

Check before relying on it: not all models handle tool calling well. It requires specific training. Look for explicit "tool calling" or "function calling" in the model card.
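
The tool use loop is easier to follow in code. The sketch below replaces the LLM with a hypothetical `fake_model` function and uses a made-up `get_weather` tool; the message shapes are simplified and do not match any particular API exactly.

```python
import json

def get_weather(city):
    # Hypothetical tool returning a canned result.
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM: requests the tool once, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        data = json.loads(last["content"])
        return {"content": f"It is {data['temp_c']} C in {data['city']}."}
    return {"tool_call": {"name": "get_weather",
                          "arguments": json.dumps({"city": "Paris"})}}

def run(messages, model=fake_model, max_steps=5):
    for _ in range(max_steps):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]          # final answer, loop ends
        args = json.loads(call["arguments"])  # model emitted JSON arguments
        result = TOOLS[call["name"]](**args)  # host executes the function
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("too many tool steps")

answer = run([{"role": "user", "content": "Weather in Paris?"}])
print(answer)
```

The step limit matters in practice: a model that keeps requesting tools would otherwise loop forever.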

RAG (Retrieval-Augmented Generation)

RAG augments the model's prompt with retrieved documents before generation. Reduces hallucination, enables access to private or current data, no fine-tuning required.

Term | What to know
RAG | Retrieve relevant context, inject it into the prompt, then generate. The standard pattern for private/current data.
Embedding model | Converts text to dense vectors. Similar texts produce similar vectors. Separate from the generative LLM.
Vector database | Stores and searches embeddings by similarity. Qdrant, Chroma, Pinecone, pgvector.
Chunking | Splitting documents into pieces for embedding. Chunk size and overlap significantly affect retrieval quality.
Semantic search | Finding documents by meaning, not keywords. Uses embedding similarity.
Hybrid search | Combines semantic search (embeddings) with keyword search (BM25). Usually outperforms either alone.
Reranking | Second-pass model that rescores retrieved documents. Improves precision after initial retrieval.
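
A stripped-down version of the retrieve-then-generate pattern looks like this. The `embed` function here is a toy word-count vector standing in for a real embedding model, so it matches on shared words and not on meaning.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": lowercase word counts (assumption for illustration).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The KV cache grows linearly with context length.",
    "LoRA trains small adapter matrices.",
]
print(build_prompt("how does the kv cache grow", chunks))
```

Swapping `embed` for a real embedding model and the sorted list for a vector database gives the production version of the same idea.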

Multimodal and vision models

VLMs (Vision Language Models) process both text and images. Some handle audio, video, or other modalities.

Term | What to know
VLM | LLM that accepts images alongside text. Describes images, answers questions, reads text in photos.
Vision encoder | Processes images into embeddings the language model understands. Usually SigLIP or ViT based.
Image tokens | Images convert to token sequences (often hundreds per image). These consume context length like text tokens.
OCR capability | Many VLMs read text in images (receipts, screenshots, documents) without a separate OCR system.
Examples | Qwen-VL, LLaVA, Gemma with vision, Llama with vision adapters. Most major families now have vision variants.

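For local inference, VLMs work with llama.cpp (--mmproj for vision adapter) and MLX. Depending on your llama.cpp build, image input may be handled by llama-mtmd-cli rather than llama-cli. Expect higher memory usage than text-only models due to the vision encoder.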

# Run a vision model (e.g., Gemma or Qwen-VL)
llama-cli -m vision-model.gguf \
  --mmproj vision-encoder.gguf \
  --jinja \
  --image photo.jpg \
  -p "What is in this image?"

Licenses

Model licenses determine what you can do with the weights. Check before deploying.

License | What it allows
Apache 2.0 | Permissive. Any use including commercial, provided you keep the license and notices. Includes an explicit patent grant. Mistral, some Qwen.
MIT | Similar to Apache 2.0 in practice, but without an explicit patent grant. Permissive.
Llama Community License | Free for most uses. Companies with 700M+ MAU need separate license. Commercial OK for most orgs.
Qwen License | Generally permissive for research and commercial. Some restrictions on training competing models. Varies by version.
Gemma Terms of Use | Commercial OK with redistribution restrictions.
CC-BY-NC | Research and personal only. No commercial use.
Research only | Strictly research. No commercial use.

"Open weights" does not mean "open source." Always check the model card on Hugging Face before commercial deployment.

Inference terminology

Inference = running a trained model to generate output. Know these metrics and concepts.

Speed metrics

Term | What it measures
t/s | Tokens per second. The primary speed measurement.
pp (prompt processing) | Speed of processing the input prompt. Also called "prefill."
tg (token generation) | Speed of generating new tokens. Much slower than pp in tokens per second because generation is sequential.
pp512, pp65536 | Prompt processing speed at 512 or 65,536 tokens. Measures context scaling.
tg128, tg1024 | Token generation speed at 128 or 1,024 output tokens.
TTFT | Time To First Token. Delay before generation starts. Driven by prompt processing speed.
Throughput | Total t/s across all concurrent users (server metric).
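
These metrics combine into a simple latency estimate for a single request. The speeds below are made-up round numbers, not measurements, and the sketch ignores the fact that both speeds drop as context grows.

```python
def request_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Return (time to first token, total time) in seconds."""
    ttft = prompt_tokens / pp_tps            # prefill phase
    total = ttft + output_tokens / tg_tps    # plus sequential generation
    return ttft, total

# Hypothetical: 8,000-token prompt, 500-token answer,
# 1,000 t/s prompt processing, 50 t/s generation.
ttft, total = request_seconds(8000, 500, pp_tps=1000, tg_tps=50)
print(f"TTFT: {ttft:.1f}s, total: {total:.1f}s")
```

Long prompts make pp the number to watch; long answers make tg dominate.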

Speculative decoding and MTP

Standard inference generates one token per forward pass (sequential bottleneck). These techniques draft multiple tokens to break through it.

Technique | How it works
Speculative decoding (draft model) | Small, fast model drafts candidate tokens. Large model verifies them all in one pass. Accepted tokens cost almost nothing extra. Typical speedup: 1.5 to 2.5x.
MTP (Multi-Token Prediction) | Built-in prediction heads draft 2 to 3 tokens per pass. No separate model needed. Expected speedup: 1.5 to 1.8x.
Eagle | Trains a lightweight draft head on top of the target model's hidden states. Faster than a separate draft model.
Lookahead decoding | Uses the model's own n-gram patterns to speculate without any draft model.
Medusa | Adds parallel prediction heads, each predicting a different future token position.

Key property: with correct verification, speculative decoding does not change output quality (rejected drafts are discarded). It costs extra memory for the draft model or heads, and the speedup depends on how often drafts are accepted. On memory-rich Apple Silicon it is usually worth trying.

# Speculative decoding with a draft model (1.5 to 2.5x speedup)
llama-server -m large-model.gguf \
  -md small-draft-model.gguf \
  --draft-max 16 \
  --draft-min 4 \
  -ngl 99 \
  -fa \
  --jinja
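
Why drafts cannot change the result is easiest to see in the greedy case. In the sketch below both "models" are hypothetical functions over integers: the target always counts up by one, and the draft sometimes gets it wrong. A real engine verifies all draft positions in one batched forward pass instead of one call per token.

```python
def speculative_step(target_next, draft_next, tokens, k=4):
    """One round of greedy speculative decoding."""
    # 1. The cheap draft model proposes k tokens.
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))
    # 2. The target checks each proposal in order.
    accepted = []
    for tok in draft:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return tokens + accepted
        accepted.append(tok)
    # 3. Every draft matched, so the target adds one bonus token.
    accepted.append(target_next(tokens + accepted))
    return tokens + accepted

def target_next(tokens):
    return tokens[-1] + 1

def draft_next(tokens):
    # Hypothetical flaw: the draft skips a number after multiples of 3.
    return tokens[-1] + (2 if tokens[-1] % 3 == 0 else 1)

def generate(n_tokens):
    seq = [1]
    while len(seq) < n_tokens:
        seq = speculative_step(target_next, draft_next, seq)
    return seq[:n_tokens]

print(generate(12))
```

The output is exactly what the target would have produced alone. The draft only affects how many target passes are needed.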

Sampling parameters

The model outputs a probability distribution over its vocabulary. These parameters control token selection:

Parameter | What it does
Temperature | Controls randomness. 0 = greedy (always pick most likely). 1 = proportional sampling. Higher = more creative/chaotic.
Top-p (nucleus) | Only consider tokens whose cumulative probability reaches p. 0.9 = smallest set covering 90% probability.
Top-k | Only consider the k most likely tokens. 40 = choose from top 40.
Repetition penalty | Penalizes already-used tokens. Above 1.0 increases the penalty.
Min-p | Only consider tokens with probability ≥ min-p × highest token's probability. Simpler alternative to top-p.

# Control sampling in llama-cli
llama-cli -m model.gguf \
  --jinja \
  --temp 0.7 \
  --top-p 0.9 \
  --top-k 40 \
  --min-p 0.05 \
  --repeat-penalty 1.1 \
  -p "Explain quantum computing in simple terms."
 
# Greedy decoding (deterministic, best for code/factual tasks)
llama-cli -m model.gguf \
  --jinja \
  --temp 0 \
  -p "Write a function that reverses a linked list."
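
The filters in the table can be sketched in a few lines. This is an illustrative implementation with one fixed order (temperature, top-k, top-p, min-p); inference engines let you reorder samplers and differ in details, and the repetition penalty is left out.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Return {token: probability} for the tokens that survive filtering."""
    if temperature <= 0:
        best = max(logits, key=logits.get)   # greedy: keep the top token only
        return {best: 1.0}
    m = max(logits.values())
    exps = {t: math.exp((l - m) / temperature) for t, l in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda item: -item[1])
    if top_k > 0:
        ranked = ranked[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for t, p in ranked:
            kept.append((t, p))
            cumulative += p
            if cumulative >= top_p:          # smallest set reaching top_p
                break
        ranked = kept
    if min_p > 0.0:
        floor = min_p * ranked[0][1]         # relative to the top token
        ranked = [(t, p) for t, p in ranked if p >= floor]
    norm = sum(p for _, p in ranked)
    return {t: p / norm for t, p in ranked}  # renormalize the survivors

logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -3.0}
print(filter_probs(logits, temperature=0.7, top_k=40, top_p=0.9, min_p=0.05))
```

The final token is then drawn at random from the returned distribution.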

Memory concepts

Term | What to know
KV cache | Stores the key and value vectors of previous tokens so attention does not recompute them for every new token. Grows linearly with context length.
KV cache quantization | Compress the cache (e.g., fp16 → q8_0 or q4_0). Enables longer context.
VRAM | GPU memory. The primary bottleneck on NVIDIA hardware.
Unified memory | Apple Silicon: CPU and GPU share the same memory pool. There is no separate VRAM limit, so the model only has to fit in total system memory.
Offloading | Moving model layers to CPU RAM or disk when GPU memory is insufficient. Slower, but runs larger models.
ngl (n-gpu-layers) | Number of layers on GPU. Set to 99 (or any value above the layer count) to offload everything to GPU.

# Full GPU offload with KV cache quantization for long context
llama-server -m model.gguf \
  -ngl 99 \
  -fa \
  -c 131072 \
  -ctk q8_0 \
  -ctv q8_0
 
# Partial offload when the model doesn't fully fit in memory
llama-server -m huge-model.gguf \
  -ngl 20 \
  -fa \
  -c 8192

Local inference software

Tool | When to use it
llama.cpp | Default choice. C/C++ with Metal (Apple) and CUDA (NVIDIA). GGUF format. Best performance.
Ollama | Getting started quickly. User-friendly llama.cpp wrapper. Manages downloads, serves API.
vLLM | High-throughput GPU serving. PagedAttention, production-grade. Primarily NVIDIA, with support for AMD and other backends.
ExLlamaV2 | Fast NVIDIA inference with EXL2 quantization. Excellent speed at low bit rates.
MLX | Apple Silicon native. Python-first. Growing ecosystem. Best for research/fine-tuning on Mac.
TGI | Hugging Face's production server. GPTQ, AWQ support.
koboldcpp | llama.cpp fork for creative writing/roleplay. Extra samplers and UI features.

Model naming conventions

Official names

Format: Organization ModelFamily-Version-Size-Variant

Component | Examples | Meaning
Organization | Qwen, Meta, Google, Mistral | Who made it
Family | Llama, Gemma, Qwen, Mistral | Model series
Size | 7B, 70B, 405B | Parameter count (billions)
Variant | Instruct, Chat, Code, Math | Specialization
Version | 3.1, 3.5, 4 | Release version

Reading examples:

  • Llama-3.1-70B-Instruct = Meta, Llama v3.1, 70B params, instruction-tuned
  • Qwen3-235B-A22B = Qwen v3, 235B total / 22B active (MoE)
  • Gemma-4-27B-IT = Google Gemma v4, 27B params, Instruction Tuned

Community quantization names

Format: ModelName-Quant.gguf

Example: Qwen3-35B-A3B-UD-Q5_K_XL.gguf

Decoded: Qwen3, 35B total / 3B active, Unsloth Dynamic quantization, Q5 K-quant extra-large variant, GGUF format.

The Hugging Face ecosystem

Hugging Face is the central hub for open-weight models. Know these conventions:

Term | What to know
Model card | The README for a model. Architecture, training data, benchmarks, license, usage. Read this before downloading.
safetensors | Standard weight file format. Safer and faster than PyTorch .bin. Used by MLX and most frameworks.
Model repo | Git repo containing weights, config, tokenizer, metadata.
Spaces | Hosted demo apps. Try models before downloading.
Transformers library | Hugging Face's Python library. Reference implementation, not speed-optimized.
GGUF on HF | Community members (Unsloth, bartowski, others) upload pre-quantized GGUF files. Search "GGUF" in model names.

Download a GGUF model from Hugging Face:

# Install the CLI if you haven't
pip install huggingface-hub
 
# Download a specific quantization
# (newer huggingface_hub releases expose the same command as `hf download`)
huggingface-cli download \
  unsloth/Qwen3-30B-A3B-GGUF \
  Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
  --local-dir ./models

Benchmarks and evaluation

Benchmark | What it tests
MMLU | 57 academic subjects, elementary to professional. Broad knowledge.
HumanEval | Code generation from docstrings. Pass@1 success rate.
SWE-bench | Fixes real GitHub issues. A widely used measure of real-world coding ability.
MATH | Competition-level math. Tests reasoning.
GSM8K | Grade school multi-step arithmetic. Easier than MATH.
ARC | Grade school science requiring reasoning.
HellaSwag | Commonsense sentence completion.
TruthfulQA | Truthful answers vs. common misconceptions.
Perplexity | How "surprised" the model is by text. Lower = better. Raw quality metric, doesn't predict usefulness directly.
Chatbot Arena / Elo | Human preference from blind A/B tests, reported as Elo-style ratings. Arguably the most ecologically valid benchmark.
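
Of the metrics above, perplexity is the one you can compute yourself from token log-probabilities. A minimal sketch, assuming natural-log probabilities for each token of the evaluated text:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that is always torn between 4 equally likely tokens...
uniform = [math.log(1 / 4)] * 10
# ...versus one that assigns 90% to the correct token every time.
confident = [math.log(0.9)] * 10

print(perplexity(uniform))
print(perplexity(confident))
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.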

Putting it all together

When someone says "I'm running Qwen3 35B-A3B UD-Q5_K_XL on llama.cpp with Flash Attention and q8_0 KV cache at 128k context," decode it:

  • Qwen3: Qwen family, version 3
  • 35B-A3B: 35 billion total, 3 billion active per token (MoE with GQA)
  • UD-Q5_K_XL: Unsloth Dynamic quantization, ~5.5 bits per weight, extra-large variant, GGUF format
  • llama.cpp: inference engine, C/C++ with Metal backend on Apple Silicon
  • Flash Attention: tiled attention for memory efficiency, enabled with -fa
  • q8_0 KV cache: attention cache compressed to 8-bit, roughly halving its memory footprint
  • 128k context: up to 128,000 tokens per session, using RoPE for position encoding

That setup: ~25 GB in memory, ~99 t/s on an M5 Max, entire codebase fits in context. Once you know the terminology, you can evaluate any model configuration at a glance.

The full command:

llama-server \
  -m Qwen3-35B-A3B-UD-Q5_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa \
  --jinja \
  -c 131072 \
  -ctk q8_0 \
  -ctv q8_0

Terminology current as of mid-2026. The field moves fast. When in doubt, check the model card on Hugging Face and the docs for your inference engine.