GLM 5.2 1M Context — KV Cache Optimization & OOM Prevention

The VRAM Budget: Where Your Memory Goes

Total VRAM = model weights + KV cache + CUDA overhead. For GLM 5.2 FP8 on 8×H200 (8 × 141 GB = 1,128 GB total), with the two KV cache rows being alternatives rather than additive:

Component	VRAM	% of 1,128 GB
Model weights (FP8)	~372 GB	33%
KV cache (32K context, FP8, seqs=16)	~128 GB	11%
KV cache (1M context, FP8, seqs=4)	~512 GB	45%
CUDA overhead & buffers	~50 GB	4%

The hard truth: 1M context at FP8 with 8 concurrent requests needs ~1 TB VRAM for KV cache alone, before counting ~372 GB of weights. That exceeds the 1,128 GB of a single 8×H200 node, so you can't combine 1M context with high concurrency on one node. Reduce --max-num-seqs to 1-4 for 1M context workloads, or keep 32K context with 16 concurrent requests for throughput.
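The budget arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the approximate table figures (~372 GB weights, ~50 GB overhead), 141 GB per H200, and a per-sequence KV cost at 1M context of 512 GB / 4 = 128 GB. The names are hypothetical, and scaling linearly with the number of sequences is an assumption.

```python
# Rough VRAM budget check for 1M-context serving on 8×H200.
# Figures are the approximate values from the table above; the names and
# the linear per-sequence scaling are illustrative assumptions.

H200_GB = 141                 # memory per GPU
NUM_GPUS = 8
WEIGHTS_GB = 372              # FP8 model weights (approx.)
OVERHEAD_GB = 50              # CUDA overhead and buffers (approx.)
KV_GB_PER_SEQ_1M = 512 / 4    # 1M-context KV cache per sequence, from the seqs=4 row


def budget_1m(num_seqs: int) -> dict:
    """Return required vs. available VRAM (GB) for num_seqs concurrent 1M requests."""
    kv_gb = KV_GB_PER_SEQ_1M * num_seqs
    required = WEIGHTS_GB + kv_gb + OVERHEAD_GB
    available = H200_GB * NUM_GPUS
    return {
        "kv_gb": kv_gb,
        "required_gb": required,
        "available_gb": available,
        "fits": required <= available,
    }


for seqs in (1, 4, 8):
    b = budget_1m(seqs)
    print(f"seqs={seqs}: KV {b['kv_gb']:.0f} GB, "
          f"need {b['required_gb']:.0f} of {b['available_gb']} GB, fits={b['fits']}")
```

The check compares against raw VRAM. The --gpu-memory-utilization cap shrinks the usable pool further, so treat a narrow "fits" as a warning rather than a green light.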

KV Cache Sizing: How the Flags Interact

# Conservative (production starting point)
--max-model-len 32000 --max-num-seqs 16 --gpu-memory-utilization 0.85

# 1M context (low concurrency)
--max-model-len 1048576 --max-num-seqs 1 --gpu-memory-utilization 0.92

# Balanced (high throughput, moderate context)
--max-model-len 65536 --max-num-seqs 8 --gpu-memory-utilization 0.88

The tuning rule: max-model-len × max-num-seqs is your worst-case KV cache demand, in tokens. Double one and you must halve the other (or add GPUs). --gpu-memory-utilization then caps how much VRAM is available to meet that demand. The manual includes a KV cache calculator that gives exact VRAM for any combination of these three flags.
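To make the rule concrete, the sketch below multiplies out the three example configurations. It only illustrates the rule of thumb: the configuration labels and the helper name are hypothetical, and the result is a token count, not a VRAM figure.

```python
# Compare the worst-case KV token budget of the three example configurations.
# max_model_len * max_num_seqs is the rule of thumb from the text;
# the labels and helper name are illustrative.

CONFIGS = {
    "conservative": {"max_model_len": 32000, "max_num_seqs": 16},
    "1m_context": {"max_model_len": 1048576, "max_num_seqs": 1},
    "balanced": {"max_model_len": 65536, "max_num_seqs": 8},
}


def kv_token_budget(max_model_len: int, max_num_seqs: int) -> int:
    """Worst-case number of tokens the KV cache must hold at once."""
    return max_model_len * max_num_seqs


budgets = {name: kv_token_budget(**cfg) for name, cfg in CONFIGS.items()}
for name, tokens in budgets.items():
    print(f"{name}: {tokens:,} tokens")

# Doubling context while halving concurrency leaves the budget unchanged.
same = kv_token_budget(65536, 8) == kv_token_budget(131072, 4)
print("double context, halve seqs, same budget:", same)
```

Even with a single sequence, the 1M configuration carries roughly twice the token budget of the other two, which is why it is paired with the highest utilization setting.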

OOM prevention methodology: Start at --gpu-memory-utilization 0.85. If stable for 1 hour under representative load, increase to 0.86. Repeat in 0.01 increments. The failure mode isn't graceful: vLLM crashes with "CUDA out of memory." Each 0.01 increase adds ~11 GB to the total pool on 8×H200 (0.01 × 1,128 GB). Don't exceed 0.92; CUDA needs breathing room.
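The stepping schedule can be written out ahead of time so each step is a deliberate change. The sketch below is a minimal illustration, assuming 8 × 141 GB of total VRAM. The start value, step, and ceiling come from the text, while the helper name is hypothetical. It only lists the steps; the one-hour stability check between them is left to the operator.

```python
# Generate the utilization steps for progressive tuning on 8×H200.
# Start, step, and ceiling come from the text; the helper name is hypothetical.

TOTAL_VRAM_GB = 8 * 141  # 8×H200


def utilization_schedule(start=0.85, ceiling=0.92, step=0.01):
    """Yield (utilization, pool_gb) pairs from start up to and including ceiling."""
    steps = round((ceiling - start) / step)
    for i in range(steps + 1):
        util = round(start + i * step, 2)  # avoid float drift such as 0.8600000001
        yield util, util * TOTAL_VRAM_GB


for util, pool_gb in utilization_schedule():
    print(f"--gpu-memory-utilization {util:.2f}  ->  ~{pool_gb:.0f} GB pool")
```

Recording the last stable value alongside the workload that was running makes it easier to roll back after a crash.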

IndexShare Attention: How GLM 5.2 Handles 1M Context

Standard attention's KV cache grows linearly with sequence length. IndexShare is GLM 5.2's attention mechanism that shares key-value indices across layers, reducing redundant storage.

What this means practically: at 1M context, IndexShare reduces KV cache size by ~30% vs standard multi-head attention. It's automatic — no flag to set — but it only applies within GLM 5.2's architecture. Don't compare GLM 5.2's 1M context VRAM requirements to other models without accounting for this.
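When comparing against other models, the ~30% figure can be applied in either direction. The sketch below is illustrative arithmetic only: the 0.30 factor is the approximate number quoted above, the function names are hypothetical, and it assumes that the ~128 GB per 1M-context sequence (512 GB / 4 from the VRAM table) already reflects IndexShare.

```python
# Illustrative arithmetic for the ~30% KV cache reduction quoted above.
# The 0.30 factor is the text's approximate figure; function names are hypothetical.

INDEXSHARE_REDUCTION = 0.30


def with_indexshare(standard_kv_gb: float) -> float:
    """KV cache size after the ~30% reduction, given a standard-attention estimate."""
    return standard_kv_gb * (1 - INDEXSHARE_REDUCTION)


def standard_equivalent(indexshare_kv_gb: float) -> float:
    """Standard-attention size that would shrink to the given IndexShare figure."""
    return indexshare_kv_gb / (1 - INDEXSHARE_REDUCTION)


# Assumption: ~128 GB per 1M-context sequence already includes IndexShare.
baseline_gb = standard_equivalent(128)
print(f"standard-attention equivalent: ~{baseline_gb:.0f} GB per sequence")
```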

Hardware Sizing for 1M Context

Real configurations, real numbers — from our testing and vLLM's memory profiler:

Hardware	Max Context	Concurrent Seqs	Quantization	Est. Tok/s per User
8×H200 (141GB each)	1M	2-4	FP8	40-80
8×H100 (80GB each)	512K	2-4	FP8	30-60
4×H100 (80GB each)	128K	4-8	FP8	25-50
Mac Studio M3 Ultra (256GB unified)	32K	1	Q4_K_M GGUF	5-10
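The table can also be used as a lookup. The sketch below is a minimal illustration that transcribes the four rows and filters them by required context and concurrency. The class and function names are hypothetical. It reads "K" and "M" as powers of two and uses the upper end of each concurrency range, both of which are assumptions. It is conservative in that it does not model extra concurrency at shorter contexts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareRow:
    name: str
    max_context: int   # tokens
    max_seqs: int      # upper end of the table's concurrency range
    quantization: str


# Rows transcribed from the table above; context sizes read as powers of two (assumption).
ROWS = [
    HardwareRow("8×H200", 1_048_576, 4, "FP8"),
    HardwareRow("8×H100", 524_288, 4, "FP8"),
    HardwareRow("4×H100", 131_072, 8, "FP8"),
    HardwareRow("Mac Studio M3 Ultra", 32_768, 1, "Q4_K_M GGUF"),
]


def candidates(context_tokens: int, seqs: int) -> list[str]:
    """Names of table rows that cover the requested context and concurrency."""
    return [
        row.name
        for row in ROWS
        if row.max_context >= context_tokens and row.max_seqs >= seqs
    ]


print(candidates(200_000, 2))     # rows covering 200K tokens with 2 sequences
print(candidates(1_048_576, 8))   # 1M context with 8 sequences
```

An empty result for 1M context with 8 sequences matches the earlier point that high concurrency at 1M does not fit on a single node.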

The manual has sizing tables for 10+ GPU configurations →

This Page Covers the Math. The Manual Gives You the Playbook.

This Page (Free)	Production Manual Ch.2
VRAM formula with IndexShare dimensions	Full memory profile breakdown: weights + KV cache + activations + overhead, per GPU tier
Basic KV cache sizing table (FP16/FP8)	Hardware sizing tables for 10+ GPU configurations — 4×H100 through 8×H200
"Here's what IndexShare does"	KV cache benchmarking methodology, prefix caching strategy, multi-tenant cache sharing
OOM basics — reduce context, reduce users	Complete OOM prevention workflow: progressive tuning, downgrade paths for every GPU tier
You calculated: "I need 652 GB but only have 320 GB."	Downgrade strategy: which parameters to sacrifice first, in what order, with what impact

You've done the math. You know your VRAM budget doesn't add up. The manual tells you exactly what to trade off — and in what order — to make it work without guesswork.

Get the Production Manual — $29

30-day money-back guarantee. If the tuning methodology doesn't work in your environment, full refund.
