The LLM Field Manual: A Complete Guide to Language Model Terminology

I am not an expert in machine learning or natural language processing. This is a working reference I built for myself while experimenting with local LLMs on Apple Silicon. Treat it as field notes from a practitioner, not a textbook. If something is wrong or out of date, I would genuinely appreciate the correction.

What is an LLM?

A Large Language Model (LLM) is a neural network trained on massive amounts of text to predict the next token in a sequence. The "large" refers to the number of trainable parameters, typically ranging from a few billion to hundreds of billions. These parameters encode patterns in language: grammar, facts, reasoning strategies, coding ability.

LLMs are autoregressive text generators. Given a prompt, they produce one token at a time, feeding each generated token back into the model to produce the next. Everything else (chat interfaces, tool calling, code completion) is built on top of this mechanism.
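
A minimal sketch of that loop, assuming a toy stand-in for the model. `TOY_MODEL` and `toy_next_token_probs` are hypothetical: a lookup table keyed on the last token, where a real LLM would run a full forward pass over the whole sequence.

```python
import random

# Hypothetical stand-in for a trained model: maps the last token to a
# probability distribution over possible next tokens.
TOY_MODEL = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    """Return next-token probabilities given the sequence so far."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights)[0]
        if next_token == "<eos>":
            break                  # the model signalled it is done
        tokens.append(next_token)  # feed the new token back in
    return tokens

print(generate(["<bos>"]))
```

The only part that changes for a real model is `toy_next_token_probs`; the loop around it stays the same.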

The Transformer architecture

Nearly all modern LLMs use the Transformer architecture (2017). Know these components:

Term	What it does
Attention	Lets the model weigh how much each token should influence the next prediction. "Self-attention" means tokens attend to other tokens in the same sequence; in decoder-only LLMs it is causal, so each token sees only itself and earlier tokens.
Multi-Head Attention	Runs multiple attention computations in parallel, each learning different aspects of the input (syntax, semantics, position).
Feed-Forward Network	A small MLP (two or three dense layers) applied to each token independently after attention. Holds most of the parameters and is thought to store much of the model's learned knowledge.
Layer	One transformer block: attention + feed-forward. Models stack many layers (32, 64, 128+) for depth.
Embedding	The numerical representation of a token. Text becomes embeddings, flows through layers, then maps back to token probabilities.

Attention variants

The attention variant determines KV cache size and inference speed. Know the differences:

Variant	What it means for you
MHA (Multi-Head Attention)	The original. Each head gets its own Key, Value, and Query projections. Full quality, highest memory cost. Found in older models like Llama 1.
MQA (Multi-Query Attention)	All heads share one Key/Value projection, keep separate Queries. Cuts the KV cache by a factor equal to the number of heads (32x for 32 heads). Slight quality tradeoff.
GQA (Grouped-Query Attention)	The current standard. Heads share Key/Value projections in groups (e.g., 32 query heads, 8 KV groups = 4x cache reduction, minimal quality loss). Used by Llama 3, Qwen, Gemma, and most modern models.

Bottom line: GQA with 32 query heads and 8 KV groups needs roughly 4x less KV cache memory than MHA with the same head count. This is one reason newer models handle much longer context windows without running out of memory.
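
To put numbers on that, here is a rough KV cache estimate. The model shape (32 layers, 32 query heads, head dimension 128, 32k context, fp16 cache) is an assumed illustration, not any specific model's config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for Keys and Values: one K and one V vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GIB = 1024 ** 3

# Assumed shape: 32 layers, head_dim 128, 32k context, fp16 (2 bytes per element).
mha = kv_cache_bytes(32, 32, 128, 32768)  # MHA: one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 32768)   # GQA: 8 KV groups
mqa = kv_cache_bytes(32, 1, 128, 32768)   # MQA: a single shared KV head

print(f"MHA {mha / GIB:.1f} GiB, GQA {gqa / GIB:.1f} GiB, MQA {mqa / GIB:.2f} GiB")
```

The ratio between variants depends only on the number of KV heads, which is why the attention variant matters more as context grows.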

Positional encoding

Transformers have no built-in sense of token order. Positional encoding adds position information so the model distinguishes "the cat sat" from "sat the cat."

Method	When you'll see it
Absolute positional encoding	One vector per position, either fixed sinusoids (original Transformer) or learned (GPT-2), added to the token embeddings. Capped at training length. Legacy.
RoPE (Rotary Position Embedding)	Rotates query/key vectors by position. The dominant method today (Llama, Qwen, Gemma). Supports extension beyond training length.
ALiBi (Attention with Linear Biases)	Adds a linear distance penalty. No learned parameters. Naturally extrapolates to longer sequences.
RoPE scaling	Extends context beyond training length. Linear scaling (simple, lossy), NTK-aware (better), YaRN (best, combines methods).

When a model advertises "128k context" but was trained on 8k, it uses RoPE scaling. Quality at extended lengths depends on the scaling method and whether the model was fine-tuned at the longer length.
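
A small sketch of the RoPE rotation, assuming the textbook formulation with base 10000 and adjacent-pair rotation (real implementations differ in how they pair dimensions). It shows the property that makes RoPE useful: the query/key dot product depends only on the distance between positions, not on the absolute positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)  # equal up to floating-point error
```

In this sketch, linear RoPE scaling would amount to calling `rope(vec, pos / factor)`, squeezing longer sequences into the position range seen during training.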

Flash Attention

Flash Attention computes attention in tiles that fit in fast on-chip SRAM instead of materializing the full attention matrix. Faster, less memory, same output.

Term	What to know
Flash Attention	Tiled attention. Drops memory from O(n²) to O(n), runs 2 to 4x faster. Required for efficient long-context inference.
Flash Attention 2	Better parallelism and work partitioning. Standard in most inference engines today.
Flash Attention 3	Hopper (H100) GPU optimizations. Adds FP8 and async computation.
PagedAttention	Used by vLLM. Manages KV cache like virtual memory pages. Not related to Flash Attention despite frequent confusion.

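In practice: enable Flash Attention in llama.cpp with the -fa flag. Newer builds expect an explicit value (-fa on, off, or auto) and default to auto, so check llama-server --help for your version. Flash Attention is required for quantizing the V half of the KV cache and recommended for any context above 4k tokens.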

# Enable Flash Attention on a server
llama-server \
  -m model.gguf \
  -fa
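
The "same output" claim rests on an online softmax: scores are processed tile by tile while a running maximum and running sum are kept, so the full score row never has to exist at once. Below is a pure-Python sketch for a single query vector. It omits the usual 1/sqrt(d) scaling and everything GPU-specific, and it is not how any real kernel is written.

```python
import math

def attention_naive(q, keys, values):
    """Materialize all scores, softmax, then take the weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Process keys/values tile by tile with a running max and running sum."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_naive(q, keys, values))
print(attention_tiled(q, keys, values, tile=2))
```

Both functions return the same vector up to floating-point error; the tiled version only ever holds one tile of scores at a time.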

Parameters and model sizing

Parameter count is the primary size metric. Parameters are the weights and biases learned during training.

Notation	Meaning	Example
B	Billion parameters	Llama 3.1 70B = 70 billion parameters
M	Million parameters	Rare in modern LLMs
Active params	Parameters used per token (MoE only)	Qwen3 35B-A3B uses 3B of its 35B per token

Quick math: each parameter in fp16 takes 2 bytes. A 70B model needs ~140 GB just for weights before quantization.

Tokens and tokenization

Models process tokens (subword units), not raw text. A tokenizer handles the conversion.

Term	What to know
Token	A subword unit. "unhappiness" might become ["un", "happiness"]. Common words stay whole; rare words split.
Vocabulary	The full token set. Typical sizes: 32k to 152k tokens.
BPE	Byte Pair Encoding. The dominant tokenization algorithm. Starts from characters or bytes and iteratively merges the most frequent adjacent pair into a new token.
Context length	Maximum tokens the model handles in one session (prompt + generation combined). Common: 8k, 32k, 128k, 1M.
Context window	Same as context length.

Rule of thumb: 1 token ≈ 3/4 of an English word. 128k tokens ≈ 96,000 words ≈ 300 pages.
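
To make the BPE row concrete, here is a toy training loop. The corpus and its word counts are made up, and real tokenizers add byte-level handling, pre-tokenization, and much larger merge tables.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): word -> count, each word split into characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(list(corpus))
```

After a few merges the frequent suffix "est" becomes a single symbol, which is how common fragments end up as whole tokens while rare words stay split.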

Dense vs. Mixture of Experts (MoE)

Two architectures to know:

Dense models activate every parameter for every token. A 70B dense model reads all 70 billion parameters per token. High quality, slow inference.

MoE models have many parameters but only activate a subset ("experts") per token. A router picks which experts fire.

Term	What to know
Expert	One of several parallel feed-forward networks inside an MoE layer. Many exist, few activate per token.
Router	Decides which experts process each token.
Active parameters	Parameters used per token. The number that determines inference speed.
Total parameters	Full count including all experts. Determines download size and memory.
Top-k routing	Activating k most relevant experts per token. Common: top-2, top-4, top-8.
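
A rough sketch of top-k routing for one token. The experts here are hypothetical scalar functions standing in for feed-forward blocks, and the choice to softmax first and renormalize over the selected experts is an assumption; real models differ on that detail.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts: simple scalar functions standing in for FFN blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y = moe_forward(3.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(y)
```

Only two of the four experts are evaluated for this token, which is the whole point: compute and memory traffic scale with the active experts, not the total.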

Reading the name: XB-AYB = X billion total, Y billion active per token. Qwen3 35B-A3B has 35B total but activates 3B per token, giving 3B-class speed with much higher quality.

Why this matters on Apple Silicon: token generation is memory-bandwidth bound. An MoE model reads only its active parameters from memory for each token, so a 35B-A3B model generates tokens at close to the speed of a 3B dense model, provided all 35B parameters fit in memory. Apple Silicon pairs a large unified memory pool with high bandwidth (400+ GB/s on higher-end chips), which makes MoE a strong fit for local inference.
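
A back-of-envelope version of the bandwidth argument. The figures (400 GB/s, about 4.5 bits per weight) are assumptions for illustration, and the result is a ceiling: it ignores KV cache reads, compute, and framework overhead, so measured speeds come in lower.

```python
def max_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound: every active weight is read from memory once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed figures: 400 GB/s memory bandwidth, ~4.5 bits per weight.
dense_70b = max_tokens_per_second(70, 4.5, 400)
moe_3b_active = max_tokens_per_second(3, 4.5, 400)

print(f"70B dense ceiling:   {dense_70b:.0f} t/s")
print(f"3B active ceiling:   {moe_3b_active:.0f} t/s")
```

The ratio between the two ceilings is just the ratio of active parameters, which is why the "A3B" part of a model name is the number to look at for speed.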

Quantization

Quantization reduces weight precision from the original 16 bits per parameter to fewer bits. The result is a smaller model that needs less memory and runs faster, with little quality loss at 8 bits and progressively more as you go below about 4 bits.

Precision formats

Format	Bits	Memory per 1B params	When to use
fp32	32	4 GB	Training only. Never for inference.
fp16 / bf16	16	2 GB	Baseline "unquantized" inference format.
Q8_0	8	~1 GB	Nearly lossless. Use when memory allows.
Q6_K	~6.5	~0.8 GB	High quality, minimal loss.
Q5_K_M	~5.5	~0.7 GB	Good quality/size balance.
Q4_K_M	~4.5	~0.56 GB	The sweet spot for most users. Start here.
Q4_0	4	~0.5 GB	Basic 4-bit. Noticeable loss on small models.
Q3_K_M	~3.5	~0.44 GB	Aggressive. Quality degrades.
Q2_K	~2.5	~0.31 GB	Extreme. Significant quality loss. Last resort.
MXFP4	4	~0.5 GB	Microscaling FP4. Newer format with better numerics than Q4_0.
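
The "Memory per 1B params" column is just bits divided by 8. A quick sketch for sizing any model, using the approximate bit widths from the table and treating 1 GB as 1e9 bytes (weights only, no KV cache or runtime overhead):

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Bit widths are the approximate values from the table above.
for name, bits in [("fp16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5), ("Q2_K", 2.5)]:
    print(f"70B at {name}: ~{weights_gb(70, bits):.0f} GB")
```

Add the KV cache and a few GB of headroom on top of this before deciding whether a model fits.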

Quantization methods and formats

Method	What to know
GGUF	The file format for llama.cpp. Single file with embedded metadata (tokenizer, config). The standard for local inference. Use this.
GPTQ	Post-training quantization with calibration data. Accurate 4-bit. Used with GPU servers (vLLM, TGI).
AWQ	Activation-Aware Weight Quantization. Faster to apply than GPTQ, often slightly better.
EXL2	Variable bit-rate quantization for ExLlamaV2. Allocates more bits to important layers.
GGML	Predecessor to GGUF. Deprecated. Convert any GGML files you find.
UD (Unsloth Dynamic)	Unsloth's importance-weighted quantization. UD-Q5_K_XL allocates extra bits to critical layers.

How to read a quantization name

Q<bits>_K_<size> breaks down as:

Q = quantized
<bits> = approximate bits per weight (4, 5, 6, 8)
K = K-quant method (importance-based mixed precision)
<size> = S (small), M (medium), L (large), XL (extra-large). Larger = more bits on important layers.

Q5_K_M = 5-bit, K-quant method, medium variant.
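
The naming scheme can be sketched as a small parser. The regex is my own reading of the convention, not anything llama.cpp ships, and it deliberately covers K-quant names only (legacy names like Q8_0 or Q4_0 return None).

```python
import re

# Optional "UD-" prefix, then Q<bits>_K, then an optional _<size> suffix.
KQUANT = re.compile(r"^(?P<prefix>UD-)?Q(?P<bits>\d)_K(?:_(?P<size>S|M|L|XL))?$")
SIZES = {"S": "small", "M": "medium", "L": "large", "XL": "extra-large"}

def parse_kquant(name):
    m = KQUANT.match(name)
    if m is None:
        return None  # not a K-quant name (e.g. Q8_0, MXFP4)
    return {
        "bits": int(m.group("bits")),
        "size": SIZES.get(m.group("size")),  # None for names like Q6_K
        "ud_prefix": m.group("prefix") is not None,
    }

for name in ["Q5_K_M", "UD-Q5_K_XL", "Q6_K", "Q8_0"]:
    print(name, "->", parse_kquant(name))
```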

GGUF vs. MLX: model formats for Apple Silicon

Two ecosystems for running models on Mac: GGUF (llama.cpp) and MLX (Apple's framework). Same goal, different approaches.

Aspect	GGUF (llama.cpp)	MLX
Developer	Georgi Gerganov + community	Apple
Backend	C/C++ with Metal compute shaders	Python/C++ with Metal, native Apple framework
Model format	Single `.gguf` file with embedded metadata	Directory of `.safetensors` + JSON config
Quantization	K-quants (Q4_K_M, Q5_K_XL, etc.)	Linear quantization (4-bit, 8-bit)
Strengths	Mature, battle-tested, huge model library, best raw perf on Apple Silicon, speculative decoding, advanced KV cache	Pythonic API, easy to modify, native Apple integration, better for research and fine-tuning
Weaknesses	C++ codebase harder to hack on	Younger ecosystem, fewer pre-quantized models, generally slower inference
Best for	Production inference, serving, maximum throughput	Research, prototyping, fine-tuning on Mac
Server mode	`llama-server` (OpenAI-compatible API)	`mlx_lm.server` (OpenAI-compatible API)
Model source	Hugging Face (search "GGUF")	Hugging Face (search "MLX")

Decision rule: for inference speed and model serving, use llama.cpp with GGUF. For experimentation, fine-tuning, or custom pipelines in Python, use MLX.

Start a llama.cpp server with an OpenAI-compatible API:

# Basic server (auto-detects Apple Silicon Metal backend)
llama-server \
  -m model.gguf \
  --port 8080
 
# With all the recommended flags for production use.
# Note: --host 0.0.0.0 listens on every network interface; use 127.0.0.1 to keep the API local.
llama-server -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa \
  --jinja \
  -c 32768 \
  -ctk q8_0 \
  -ctv q8_0

Test the API with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Many power users keep both: llama.cpp for daily model serving, MLX for trying architectures or quick fine-tunes.

Fine-tuning

Fine-tuning continues training a pretrained model on a specific dataset to change its behavior.

Methods

Method	What to know
Full fine-tuning	Updates all parameters. Same memory as training from scratch. Best quality, most expensive.
LoRA	Low-Rank Adaptation. Freezes original weights, trains small adapter matrices (0.1% to 1% of parameters). Dramatically cheaper. Start here.
QLoRA	Loads base model in 4-bit, trains LoRA on top. The QLoRA paper reports fine-tuning a 65B model on a single 48 GB GPU and a 33B model on 24 GB.
DoRA	Weight-Decomposed LoRA. Separates magnitude from direction. Drop-in upgrade over LoRA.
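
Where the "0.1% to 1%" figure for LoRA comes from: a rank-r adapter on a d_out x d_in weight matrix trains two thin matrices instead of the full one. The hidden size and rank below are assumed values for illustration.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in a LoRA adapter for one weight matrix.

    The update to a d_out x d_in matrix is factored as B @ A, where
    A is rank x d_in and B is d_out x rank.
    """
    return rank * d_in + d_out * rank

d = 4096  # assumed hidden size of one square projection matrix
full = d * d
adapter = lora_params(d, d, rank=16)
fraction = adapter / full

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.2%}")
```

Doubling the rank doubles the adapter size, so rank is the main knob for trading capacity against memory.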

Alignment and preference tuning

After pretraining, models are aligned to be helpful, harmless, and honest:

Method	What to know
SFT	Supervised Fine-Tuning on curated instruction/response pairs. First alignment step.
RLHF	Reinforcement Learning from Human Feedback. Trains a reward model, then optimizes against it. Made ChatGPT possible.
DPO	Direct Preference Optimization. Trains directly on preferred/rejected response pairs, with no separate reward model or RL loop. Simpler and more stable than RLHF.
ORPO	Odds Ratio Preference Optimization. Combines SFT + preference alignment in one step.
GRPO	Group Relative Policy Optimization. Samples multiple responses, uses relative quality. Used by DeepSeek.
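
For a feel of what DPO optimizes, here is the per-pair loss written out as I understand it from the DPO paper. Inputs are summed log-probabilities of whole responses under the policy being trained and under a frozen reference model; the numbers in the example are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Made-up log-probabilities for illustration.
untrained = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # policy == reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy now favors "chosen"
print(untrained, improved)
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, with no reward model in sight.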

Model stages

Stage	What it means
Base model	Raw pretrained model. Good at text completion, bad at following instructions.
Instruct model	Fine-tuned to follow instructions. What most people mean by "the model."
Chat model	Further tuned for multi-turn conversation with a specific chat template.
Censored/uncensored	Whether the model refuses certain request categories.

Distillation

Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's output probabilities, which carry richer information than simple correct/incorrect labels.

Term	What to know
Teacher model	The large, high-quality source. Often a frontier model (GPT-4, Claude).
Student model	The smaller model learning to replicate the teacher.
Logit distillation	Student matches the teacher's full probability distribution, not just the top prediction. Transfers "soft" knowledge about alternatives the teacher considered.
On-policy distillation	Student generates responses, teacher scores them. Often more effective than training on teacher outputs directly, because the student is corrected on its own outputs.
Synthetic data distillation	Teacher generates a training dataset, student trains on it with standard SFT. The most common approach.

When a model card says "distilled from" or "trained on synthetic data from" a larger model, this is what happened. Many Llama and Qwen variants are distilled from larger models in the same family or from proprietary models.
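
Logit distillation in miniature: the student is penalized by the KL divergence between the teacher's and its own softened distributions. The logits and temperature are made-up values, and some implementations additionally scale the loss by the temperature squared, which is left out here.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 2.5, 0.5]  # made-up logits over a 3-token vocabulary
print(distillation_loss(teacher, [4.0, 2.5, 0.5]))  # student matches the teacher
print(distillation_loss(teacher, [4.0, 0.0, 0.0]))  # right top token, wrong alternatives
```

The second student picks the same top token as the teacher but still pays a penalty, which is the "soft knowledge" a hard label would not carry.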

Chat templates and prompt formatting

LLMs require specific chat templates that structure conversations into roles (system, user, assistant). Wrong template = garbage output, even from a good model.

Term	What to know
Chat template	The exact text format for multi-turn conversation. Defines delimiters for system prompts, user messages, assistant responses.
ChatML	Uses `<|im_start|>` and `<|im_end|>` tags. Used by Qwen and many community models.
Llama format	Uses `<|begin_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`. Varies slightly between versions.
Jinja template	Template engine in the tokenizer config that auto-formats messages. Most modern models include one.
System prompt	Sets behavior, personality, or constraints at conversation start. Not supported by all models.
Special tokens	Tokens like `<eos>`, `<pad>`, or template delimiters with special meaning. Not generated as regular text.
BOS / EOS	Beginning/End of Sequence tokens. Mark text boundaries.

In practice: always use the --jinja flag in llama.cpp for automatic template detection. Manual template specification is error-prone. Let the model's embedded metadata handle it.

# Auto-detect the chat template from model metadata
llama-server \
  -m model.gguf \
  --jinja
 
# Override the system prompt
llama-cli -m model.gguf \
  --jinja \
  -sys "You are a helpful coding assistant." \
  -p "Write a Python function to merge two sorted lists."
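
For intuition about what a chat template actually produces, here is a hand-rolled ChatML renderer. It is a simplified sketch of the format, not a replacement for the model's own Jinja template, which may add default system prompts, tool sections, or thinking tags.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to merge two sorted lists."},
])
print(prompt)
```

This string is what the model sees as raw text. Feed a ChatML-trained model a different layout and it has no reliable way to tell where one turn ends and the next begins.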

Reasoning and thinking models

Reasoning models produce an internal chain of thought before answering. More tokens spent thinking, better answers on hard problems.

Term	What to know
Chain-of-Thought (CoT)	Model shows reasoning step by step before the final answer. Major accuracy gains on math, logic, and coding.
Reasoning/thinking tokens	Tokens generated during internal reasoning. Some models (DeepSeek R1, QwQ) show them; others hide them.
Thinking budget	Configurable limit on thinking tokens. Higher = better quality, more latency and cost.
o1-style reasoning	Model develops reasoning strategies via reinforcement learning during training (not taught explicit patterns). Named after OpenAI's o1.
Hybrid thinking	Toggle between fast (no thinking) and slow (extended reasoning) per query. Qwen3 supports `/think` and `/no_think` modes.

When to use thinking: complex coding, math, multi-step reasoning. When to skip it: simple questions, creative writing, straightforward lookups. The thinking overhead is wasted on easy tasks.

Tool calling and agentic use

Tool calling lets the model invoke external functions by generating structured output (usually JSON). The host executes the function and returns results.

Term	What to know
Tool calling / function calling	Model outputs JSON requesting a function call with arguments. Host executes, returns result to model.
Tool use loop	Model requests tool → host executes → result fed back → model continues. Can repeat many times per response.
MCP (Model Context Protocol)	Anthropic's open standard for connecting LLMs to tools and data sources. Standardized tool, resource, and prompt exposure.
Agentic workflow	LLM as autonomous agent: plan, act via tools, observe results, iterate. Multi-step task execution beyond single-turn chat.
ReAct	Reasoning + Acting pattern. Model alternates between reasoning about what to do and acting via tool calls.
Structured output	Constraining output to a schema (JSON, XML) for reliable machine parsing. Essential for tool calling.

Check before relying on it: not all models handle tool calling well. It requires specific training. Look for explicit "tool calling" or "function calling" in the model card.
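
The tool use loop from the table, sketched end to end. Everything here is a stand-in: `fake_model` is a hypothetical function that behaves like an LLM with tool calling, `get_weather` is a made-up tool, and the message shape is a simplification of what OpenAI-compatible APIs use.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would call an external API."""
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM: requests a tool once, then answers in plain text."""
    if messages[-1]["role"] == "tool":
        data = json.loads(messages[-1]["content"])
        return {"content": f"It is {data['temp_c']} C in {data['city']}."}
    return {"tool_call": {"name": "get_weather", "arguments": '{"city": "Oslo"}'}}

def run_tool_loop(user_text, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]           # final answer, loop ends
        args = json.loads(call["arguments"])  # the model emits JSON text
        result = TOOLS[call["name"]](**args)  # the host runs the function
        messages.append({"role": "assistant", "content": None, "tool_call": call})
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("tool loop did not finish")

print(run_tool_loop("What is the weather in Oslo?"))
```

The `max_steps` cap matters in practice: a model that keeps requesting tools would otherwise loop forever.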

RAG (Retrieval-Augmented Generation)

RAG augments the model's prompt with retrieved documents before generation. Reduces hallucination, enables access to private or current data, no fine-tuning required.

Term	What to know
RAG	Retrieve relevant context, inject it into the prompt, then generate. The standard pattern for private/current data.
Embedding model	Converts text to dense vectors. Similar texts produce similar vectors. Separate from the generative LLM.
Vector database	Stores and searches embeddings by similarity. Qdrant, Chroma, Pinecone, pgvector.
Chunking	Splitting documents into pieces for embedding. Chunk size and overlap significantly affect retrieval quality.
Semantic search	Finding documents by meaning, not keywords. Uses embedding similarity.
Hybrid search	Combines semantic search (embeddings) with keyword search (BM25). Usually outperforms either alone.
Reranking	Second-pass model that rescores retrieved documents. Improves precision after initial retrieval.
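
A toy version of the retrieve-then-generate pattern. The "embedding" is just word counts so the example runs with no dependencies; a real pipeline would use an embedding model and a vector database, but the shape is the same. The chunks and prompt wording are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=1):
    """Rank chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The KV cache grows linearly with context length.",
    "LoRA trains small adapter matrices on frozen weights.",
    "BM25 is a keyword ranking function.",
]
prompt = build_prompt("how does the kv cache grow with context", chunks)
print(prompt)
```

Swapping `embed` for a real embedding model turns the keyword overlap here into semantic search; the retrieve and prompt-building steps stay as they are.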

Multimodal and vision models

VLMs (Vision Language Models) process both text and images. Some handle audio, video, or other modalities.

Term	What to know
VLM	LLM that accepts images alongside text. Describes images, answers questions, reads text in photos.
Vision encoder	Processes images into embeddings the language model understands. Usually SigLIP or ViT based.
Image tokens	Images convert to token sequences (often hundreds per image). These consume context length like text tokens.
OCR capability	Many VLMs read text in images (receipts, screenshots, documents) without a separate OCR system.
Examples	Qwen-VL, LLaVA, Gemma with vision, Llama with vision adapters. Most major families now have vision variants.

For local inference, VLMs work with llama.cpp (--mmproj for vision adapter) and MLX. Expect higher memory usage than text-only models due to the vision encoder.

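# Run a vision model (e.g., Gemma or Qwen-VL)
# Multimodal support has moved between binaries across llama.cpp releases;
# if your llama-cli rejects --image, try llama-mtmd-cli, which takes the
# same -m, --mmproj, --image, and -p flags.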
llama-cli -m vision-model.gguf \
  --mmproj vision-encoder.gguf \
  --jinja \
  --image photo.jpg \
  -p "What is in this image?"

Licenses

Model licenses determine what you can do with the weights. Check before deploying.

License	What it allows
Apache 2.0	Permissive. Any use including commercial. Requires preserving the license and attribution notices; includes an explicit patent grant. Mistral, some Qwen.
MIT	Permissive like Apache 2.0, but shorter and without an explicit patent grant.
Llama Community License	Free for most uses. Companies with 700M+ MAU need separate license. Commercial OK for most orgs.
Qwen License	Generally permissive for research and commercial. Some restrictions on training competing models. Varies by version.
Gemma Terms of Use	Commercial OK with redistribution restrictions.
CC-BY-NC	Research and personal only. No commercial use.
Research only	Strictly research. No commercial use.

"Open weights" does not mean "open source." Always check the model card on Hugging Face before commercial deployment.

Inference terminology

Inference = running a trained model to generate output. Know these metrics and concepts.

Speed metrics

Term	What it measures
t/s	Tokens per second. The primary speed measurement.
pp (prompt processing)	Speed of processing the input prompt. Also called "prefill."
tg (token generation)	Speed of generating new tokens. Much lower t/s than pp, because tokens are generated one at a time while the prompt is processed in parallel.
pp512, pp65536	Prompt processing speed at 512 or 65,536 tokens. Measures context scaling.
tg128, tg1024	Token generation speed at 128 or 1,024 output tokens.
TTFT	Time To First Token. Delay before generation starts. Driven by prompt processing speed.
Throughput	Total t/s across all concurrent users (server metric).

Speculative decoding and MTP

Standard inference generates one token per forward pass (sequential bottleneck). These techniques draft multiple tokens to break through it.

Technique	How it works
Speculative decoding (draft model)	Small, fast model drafts candidate tokens. Large model verifies them all in one pass. Accepted tokens are free. Typical speedup: 1.5 to 2.5x.
MTP (Multi-Token Prediction)	Built-in prediction heads draft 2 to 3 tokens per pass. No separate model needed. Expected speedup: 1.5 to 1.8x.
Eagle	Trains a lightweight draft head on top of the target model's hidden states. Faster than a separate draft model.
Lookahead decoding	Uses the model's own n-gram patterns to speculate without any draft model.
Medusa	Adds parallel prediction heads, each predicting a different future token position.

Key property: speculative decoding never changes output quality (rejected drafts are discarded). Costs extra memory for the draft model or heads. On memory-rich Apple Silicon, almost always worth enabling.

# Speculative decoding with a draft model (1.5 to 2.5x speedup)
llama-server -m large-model.gguf \
  -md small-draft-model.gguf \
  --draft-max 16 \
  --draft-min 4 \
  -ngl 99 \
  -fa \
  --jinja
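
Why the output never changes, in the greedy case: the target model checks every drafted token and takes over at the first disagreement. The sketch below uses hypothetical toy "models" (functions from a token list to the next token) and verifies position by position, where a real engine would score all drafted positions in one batched forward pass. Sampling-based acceptance is more involved and not shown.

```python
def speculative_step(target_next, draft_next, tokens, n_draft=4):
    """One round of greedy speculative decoding.

    Returns the tokens accepted this round (always at least one).
    """
    # 1. The draft model proposes n_draft tokens sequentially (cheap).
    drafted, ctx = [], list(tokens)
    for _ in range(n_draft):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. The target model checks each drafted position in order.
    accepted, ctx = [], list(tokens)
    for t in drafted:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # bonus token when every draft matched
    return accepted

def target(toks):
    """Toy target model: always counts up by one."""
    return toks[-1] + 1

def sloppy_draft(toks):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if toks[-1] == 3 else toks[-1] + 1

print(speculative_step(target, sloppy_draft, [1]))
print(speculative_step(target, target, [1]))
```

In both runs the accepted tokens are exactly what the target model would have produced on its own; the draft only changes how many arrive per round.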

Sampling parameters

The model outputs a probability distribution over its vocabulary. These parameters control token selection:

Parameter	What it does
Temperature	Controls randomness. 0 = greedy (always pick most likely). 1 = sample from the model's unmodified distribution. Higher = more creative/chaotic.
Top-p (nucleus)	Only consider tokens whose cumulative probability reaches p. 0.9 = smallest set covering 90% probability.
Top-k	Only consider the k most likely tokens. 40 = choose from top 40.
Repetition penalty	Penalizes already-used tokens. 1.0 = off; values above 1.0 penalize repeats more strongly.
Min-p	Only consider tokens with probability ≥ min-p × highest token's probability. Simpler alternative to top-p.

# Control sampling in llama-cli
llama-cli -m model.gguf \
  --jinja \
  --temp 0.7 \
  --top-p 0.9 \
  --top-k 40 \
  --min-p 0.05 \
  --repeat-penalty 1.1 \
  -p "Explain quantum computing in simple terms."
 
# Greedy decoding (deterministic, best for code/factual tasks)
llama-cli -m model.gguf \
  --jinja \
  --temp 0 \
  -p "Write a function that reverses a linked list."
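
What those flags do to the distribution, as a sketch. The order in which the filters are applied here (temperature, then top-k, min-p, top-p) is an assumption made for readability; inference engines use their own sampler order, and the order changes the results. Repetition penalty is left out.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=random):
    """Pick a token index from raw logits using common sampling filters."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    # (probability, token index) pairs, most likely first.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        probs = probs[:top_k]
    if min_p > 0:
        cutoff = min_p * probs[0][0]
        probs = [(p, i) for p, i in probs if p >= cutoff]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept
    weights, indices = zip(*probs)
    return rng.choices(indices, weights=weights)[0]  # renormalizes implicitly

logits = [1.0, 3.0, 2.0]  # made-up scores for a 3-token vocabulary
print(sample(logits, temperature=0))
print(sample(logits, temperature=0.7, top_k=2, top_p=0.9, min_p=0.05))
```

Each filter only shrinks the candidate set; the final draw is always a weighted choice among whatever survives.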

Memory concepts

Term	What to know
KV cache	Stores the Key and Value vectors of every token processed so far, so attention does not recompute them for each new token. Grows linearly with context length.
KV cache quantization	Compress the cache (e.g., fp16 → q8_0 or q4_0). Enables longer context.
VRAM	GPU memory. The primary bottleneck on NVIDIA hardware.
Unified memory	Apple Silicon: CPU and GPU share the same memory pool. Eliminates the VRAM wall.
Offloading	Splitting the model between GPU memory and CPU RAM (or disk) when GPU memory is insufficient. Slower, but runs larger models. Note that llama.cpp uses "offload" for layers placed on the GPU.
ngl (n-gpu-layers)	Number of layers on GPU. Set to 99 to offload everything to GPU.

# Full GPU offload with KV cache quantization for long context
llama-server -m model.gguf \
  -ngl 99 \
  -fa \
  -c 131072 \
  -ctk q8_0 \
  -ctv q8_0
 
# Partial offload when the model doesn't fully fit in memory
llama-server -m huge-model.gguf \
  -ngl 20 \
  -fa \
  -c 8192

Local inference software

Tool	When to use it
llama.cpp	Default choice. C/C++ with Metal (Apple) and CUDA (NVIDIA). GGUF format. Best performance.
Ollama	Getting started quickly. User-friendly llama.cpp wrapper. Manages downloads, serves API.
vLLM	High-throughput GPU serving. PagedAttention, production-grade. NVIDIA-first, with AMD ROCm and other backends also supported.
ExLlamaV2	Fast NVIDIA inference with EXL2 quantization. Excellent speed at low bit rates.
MLX	Apple Silicon native. Python-first. Growing ecosystem. Best for research/fine-tuning on Mac.
TGI	Hugging Face's production server. GPTQ, AWQ support.
koboldcpp	llama.cpp fork for creative writing/roleplay. Extra samplers and UI features.

Model naming conventions

Official names

Format: Organization ModelFamily-Version-Size-Variant

Component	Examples	Meaning
Organization	Qwen, Meta, Google, Mistral	Who made it
Family	Llama, Gemma, Qwen, Mistral	Model series
Size	7B, 70B, 405B	Parameter count (billions)
Variant	Instruct, Chat, Code, Math	Specialization
Version	3.1, 3.5, 4	Release version

Reading examples:

Llama-3.1-70B-Instruct = Meta, Llama v3.1, 70B params, instruction-tuned
Qwen3-235B-A22B = Qwen v3, 235B total / 22B active (MoE)
Gemma-4-27B-IT = Google Gemma v4, 27B params, Instruction Tuned

Community quantization names

Format: ModelName-Quant.gguf

Example: Qwen3-35B-A3B-UD-Q5_K_XL.gguf

Decoded: Qwen3, 35B total / 3B active, Unsloth Dynamic quantization, Q5 K-quant extra-large variant, GGUF format.

The Hugging Face ecosystem

Hugging Face is the central hub for open-weight models. Know these conventions:

Term	What to know
Model card	The README for a model. Architecture, training data, benchmarks, license, usage. Read this before downloading.
safetensors	Standard weight file format. Safer and faster than PyTorch `.bin`. Used by MLX and most frameworks.
Model repo	Git repo containing weights, config, tokenizer, metadata.
Spaces	Hosted demo apps. Try models before downloading.
Transformers library	Hugging Face's Python library. Reference implementation, not speed-optimized.
GGUF on HF	Community members (Unsloth, bartowski, others) upload pre-quantized GGUF files. Search "GGUF" in model names.

Download a GGUF model from Hugging Face:

# Install the CLI if you haven't
pip install huggingface-hub
 
# Download a specific quantization
# (newer huggingface_hub releases also provide this command as: hf download)
huggingface-cli download \
  unsloth/Qwen3-30B-A3B-GGUF \
  Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
  --local-dir ./models

Benchmarks and evaluation

Benchmark	What it tests
MMLU	57 academic subjects, elementary to professional. Broad knowledge.
HumanEval	Code generation from docstrings. Pass@1 success rate.
SWE-bench	Fixes real GitHub issues. The gold standard for coding ability.
MATH	Competition-level math. Tests reasoning.
GSM8K	Grade school multi-step arithmetic. Easier than MATH.
ARC	Grade school science requiring reasoning.
HellaSwag	Commonsense sentence completion.
TruthfulQA	Truthful answers vs. common misconceptions.
Perplexity	How "surprised" the model is by text. Lower = better. Raw quality metric, doesn't predict usefulness directly.
Chatbot Arena / Elo	Human preference from blind A/B tests, summarized as Elo-style ratings. Most ecologically valid benchmark.
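
Perplexity is the one metric in this table you can compute by hand: it is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. The probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Made-up probabilities the model gave to each actual next token.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # as unsure as a 4-way coin flip
print(perplexity([0.9, 0.8, 0.95, 0.7]))     # mostly confident and correct
```

A perplexity of N reads roughly as "the model was as uncertain as if choosing uniformly among N tokens," which is why it is handy for comparing quantizations of the same model.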

Putting it all together

When someone says "I'm running Qwen3 35B-A3B UD-Q5_K_XL on llama.cpp with Flash Attention and q8_0 KV cache at 128k context," decode it:

Qwen3: Qwen family, version 3
35B-A3B: 35 billion total, 3 billion active per token (MoE with GQA)
UD-Q5_K_XL: Unsloth Dynamic quantization, ~5.5 bits per weight, extra-large variant, GGUF format
llama.cpp: inference engine, C/C++ with Metal backend on Apple Silicon
Flash Attention: tiled attention for memory efficiency, enabled with -fa
q8_0 KV cache: attention cache compressed to 8-bit, roughly halving its memory footprint relative to fp16
128k context: up to 131,072 tokens per session (prompt + generation), using RoPE for position encoding

That setup: ~25 GB in memory, ~99 t/s on an M5 Max, and enough context to hold a sizable codebase. Once you know the terminology, you can evaluate any model configuration at a glance.

The full command:

llama-server \
  -m Qwen3-35B-A3B-UD-Q5_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa \
  --jinja \
  -c 131072 \
  -ctk q8_0 \
  -ctv q8_0

Terminology current as of mid-2026. The field moves fast. When in doubt, check the model card on Hugging Face and the docs for your inference engine.