Free Guide — Pillar 1

1M Context. Without the OOM.

The KV cache for a 1M-token context can consume hundreds of GB on its own. Here's the VRAM math, IndexShare mechanics, and the tuning methodology that keeps GLM 5.2 serving instead of crashing.

The VRAM Budget: Where Your Memory Goes

Total VRAM = model weights + KV cache + CUDA overhead. For GLM 5.2 FP8 on 8×H200:

| Component | VRAM | % of 1,128 GB (8×141 GB) |
|---|---|---|
| Model weights (FP8) | ~372 GB | 33% |
| KV cache (32K context, FP8, seqs=16) | ~128 GB | 11% |
| KV cache (1M context, FP8, seqs=4) | ~512 GB | 45% |
| CUDA overhead & buffers | ~50 GB | 4% |

The hard truth: 1M context at FP8 with 8 concurrent requests needs ~1 TB VRAM for KV cache alone. You can't do 1M context on a single 8×H200 node with high concurrency. Reduce --max-num-seqs to 1-4 for 1M context workloads, or keep 32K context with 16 concurrent requests for throughput.
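The budget arithmetic above can be checked in a few lines. This is a minimal sketch using this guide's estimates (~372 GB weights, ~50 GB overhead, 141 GB per H200); the function names are illustrative, and subtracting the overhead inside the utilization cap is a conservative assumption, not a description of how vLLM accounts for memory internally.

```python
# Rough VRAM budget check using the guide's estimates (not measured values).

def kv_budget_gb(num_gpus, gb_per_gpu, utilization, weights_gb, overhead_gb):
    """GB left for KV cache after weights and overhead, under the utilization cap."""
    usable = num_gpus * gb_per_gpu * utilization
    return usable - weights_gb - overhead_gb


def fits(kv_needed_gb, budget_gb):
    """True if the estimated KV cache requirement fits in the remaining budget."""
    return kv_needed_gb <= budget_gb


if __name__ == "__main__":
    budget = kv_budget_gb(8, 141, 0.92, weights_gb=372, overhead_gb=50)
    # KV cache estimates for 1M context at 4 and 8 concurrent sequences.
    for seqs, kv_gb in [(4, 512), (8, 1024)]:
        verdict = "fits" if fits(kv_gb, budget) else "does not fit"
        print(f"1M context, {seqs} seqs: needs ~{kv_gb} GB, "
              f"budget ~{budget:.0f} GB -> {verdict}")
```

Swap in your own GPU count, per-GPU memory, and weight size to see how much room is left before touching any serving flags.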

KV Cache Sizing: How the Flags Interact

# Conservative (production starting point)
--max-model-len 32000 --max-num-seqs 16 --gpu-memory-utilization 0.85

# 1M context (low concurrency)
--max-model-len 1048576 --max-num-seqs 1 --gpu-memory-utilization 0.92

# Balanced (high throughput, moderate context)
--max-model-len 65536 --max-num-seqs 8 --gpu-memory-utilization 0.88

The tuning rule: max-model-len × max-num-seqs = your total KV cache pressure. Double one, you must halve the other (or add GPUs). The manual includes a KV cache calculator that gives exact VRAM for any combination of these three flags.
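To make the tuning rule concrete, here is a small sketch that computes the context × concurrency product for the three presets above and shows the "double one, halve the other" trade. The helper names and the idea of a fixed token budget are illustrative assumptions. The product is a worst-case figure that treats every sequence as if it filled the full context.

```python
# Worst-case KV cache pressure, measured in tokens (context length x sequences).

PRESETS = {
    "conservative": (32000, 16),
    "1m_context": (1048576, 1),
    "balanced": (65536, 8),
}


def kv_pressure(max_model_len, max_num_seqs):
    """Total tokens the KV cache must hold if every sequence is full length."""
    return max_model_len * max_num_seqs


def max_seqs_for(max_model_len, token_budget):
    """Largest --max-num-seqs that stays within a fixed token budget."""
    return token_budget // max_model_len


if __name__ == "__main__":
    for name, (ctx, seqs) in PRESETS.items():
        print(f"{name}: {kv_pressure(ctx, seqs):,} tokens")

    # Hold the 'balanced' pressure constant and double the context length.
    budget = kv_pressure(65536, 8)
    print("seqs at 128K context:", max_seqs_for(131072, budget))
```

Holding the budget at the balanced preset's level, doubling the context to 131072 leaves room for half as many sequences.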

OOM prevention methodology: Start at --gpu-memory-utilization 0.85. If the server stays stable for 1 hour, increase to 0.86, then repeat in 0.01 increments. The failure mode isn't graceful: vLLM crashes with "CUDA out of memory." Each 0.01 increase adds ~11 GB to the total pool on 8×H200 (0.01 × 8 × 141 GB). Don't exceed 0.92, because CUDA needs breathing room.
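The stepping procedure can be written down as a schedule. The sketch below only generates the values to try and the memory each step adds for a given node size. It does not launch or monitor vLLM, and the function names are hypothetical.

```python
# Generate the progressive --gpu-memory-utilization schedule described above.

def utilization_schedule(start=0.85, ceiling=0.92, step=0.01):
    """Values to try in order, one per stable soak period (e.g. 1 hour each)."""
    n_steps = int(round((ceiling - start) / step))
    # Round each value to 2 decimals to avoid floating-point drift.
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


def pool_gain_gb(total_vram_gb, step=0.01):
    """Extra GB made available across the node by one utilization step."""
    return total_vram_gb * step


if __name__ == "__main__":
    total = 8 * 141  # 8 GPUs at 141 GB each
    print(f"each step adds ~{pool_gain_gb(total):.1f} GB")
    for u in utilization_schedule():
        print(f"--gpu-memory-utilization {u:.2f}")
```

Move to the next value only after the current one has survived a full soak period under realistic load.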

IndexShare Attention: How GLM 5.2 Handles 1M Context

Standard attention's KV cache grows linearly with sequence length. IndexShare is GLM 5.2's attention mechanism that shares key-value indices across layers, reducing redundant storage.

What this means practically: at 1M context, IndexShare reduces KV cache size by ~30% vs standard multi-head attention. It's automatic — no flag to set — but it only applies within GLM 5.2's architecture. Don't compare GLM 5.2's 1M context VRAM requirements to other models without accounting for this.
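When comparing against a model that uses standard attention, the ~30% figure can be applied as a simple scaling factor. The sketch below assumes the KV cache estimates in this guide already include the IndexShare saving and that the reduction is a flat 30%; both are assumptions for illustration, and the function names are hypothetical.

```python
# Convert between IndexShare and standard-attention KV cache estimates,
# assuming a flat fractional reduction (the guide quotes ~30% at 1M context).

def standard_attention_equivalent_gb(kv_gb_with_indexshare, reduction=0.30):
    """Estimated KV cache size without the IndexShare saving."""
    return kv_gb_with_indexshare / (1.0 - reduction)


def indexshare_gb(kv_gb_standard, reduction=0.30):
    """Estimated KV cache size after applying the IndexShare saving."""
    return kv_gb_standard * (1.0 - reduction)


if __name__ == "__main__":
    baseline = standard_attention_equivalent_gb(512)
    print(f"512 GB with IndexShare ~ {baseline:.0f} GB without it")
```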

Hardware Sizing for 1M Context

Real configurations, real numbers — from our testing and vLLM's memory profiler:

| Hardware | Max Context | Concurrent Seqs | Quantization | Est. Tok/s per User |
|---|---|---|---|---|
| 8×H200 (141GB each) | 1M | 2-4 | FP8 | 40-80 |
| 8×H100 (80GB each) | 512K | 2-4 | FP8 | 30-60 |
| 4×H100 (80GB each) | 128K | 4-8 | FP8 | 25-50 |
| Mac Studio M3 Ultra (256GB unified) | 32K | 1 | Q4_K_M GGUF | 5-10 |

The manual has sizing tables for 10+ GPU configurations.

This Page Covers the Math. The Manual Gives You the Playbook.

| This Page (Free) | Production Manual Ch.2 |
|---|---|
| VRAM formula with IndexShare dimensions | Full memory profile breakdown: weights + KV cache + activations + overhead, per GPU tier |
| Basic KV cache sizing table (FP16/FP8) | Hardware sizing tables for 10+ GPU configurations, 4×H100 through 8×H200 |
| "Here's what IndexShare does" | KV cache benchmarking methodology, prefix caching strategy, multi-tenant cache sharing |
| OOM basics: reduce context, reduce users | Complete OOM prevention workflow: progressive tuning, downgrade paths for every GPU tier |
| You calculated: "I need 652 GB but only have 320 GB." | Downgrade strategy: which parameters to sacrifice first, in what order, with what impact |

You've done the math. You know your VRAM budget doesn't add up. The manual tells you exactly what to trade off — and in what order — to make it work without guesswork.

Get the Production Manual — $29

30-day money-back guarantee. If the tuning methodology doesn't work in your environment, full refund.