Install GLM 5.2 — vLLM Setup & First Inference in 10 Minutes

Prerequisites

You need

8×H200 (recommended) or 4×H100 (minimum for FP8)
CUDA 12.4+ and NVIDIA driver 550+
~750 GB free disk (for FP8 weights)
Python 3.10+ and pip

You don't need

Docker (vLLM runs natively or containerized — both work)
Paid API keys — it's your hardware, your model
Model fine-tuning experience

Step 1: Install vLLM

pip install vllm>=0.23.0

vLLM 0.23.0+ includes native GLM 5.2 support with FP8 KV cache and IndexShare attention. Verify:

python -c "import vllm; print(vllm.__version__)"

GLM 5.2 requires --trust-remote-code — the model's custom architecture isn't in upstream transformers yet. vLLM handles this automatically when you pass the flag, but it's a security consideration for some environments. The manual covers the trust boundary tradeoff in detail.

Step 2: Start the vLLM Server

vllm serve zai-org/GLM-5.2-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32000 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

What each flag does:

Flag	What it does
`--tensor-parallel-size 8`	Splits model across 8 GPUs. Required for 744B model.
`--gpu-memory-utilization 0.90`	Use 90% of GPU VRAM. Start at 0.85 and increase in 0.01 steps if stable.
`--max-model-len 32000`	Max context length. Start conservative — every doubling balloons KV cache.
`--kv-cache-dtype fp8`	Halves KV cache VRAM vs FP16. H100/H200 native support.
`--enable-prefix-caching`	Caches shared prefixes (system prompts). Saves VRAM on repeated queries.

Step 3: Verify It's Working

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.2-FP8",
    "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
    "max_tokens": 100
  }'

Expected: JSON response with generated text, within 200ms TTFT on 8×H200.

Cold start note: first load compiles CUDA graphs — allow 5-10 minutes. Subsequent restarts are faster.

Troubleshooting: If It Didn't Work

OOM at startup — "CUDA out of memory"

Drop --gpu-memory-utilization to 0.85. Reduce --max-model-len to 8192. Check that all GPUs are visible: nvidia-smi. If one GPU is missing, check NCCL configuration.

"trust_remote_code" error

GLM 5.2's architecture requires custom code. Always pass --trust-remote-code. In production, review the model code first — the manual includes a security review checklist for this specific concern.

Model re-downloads on every restart

vLLM caches models in ~/.cache/huggingface/hub/. If your container doesn't persist this volume, mount it. The manual's docker-compose includes volume persistence for all model caches.

NCCL timeout on multi-GPU

Multi-node setups: set NCCL_SOCKET_IFNAME to your high-speed interface (e.g., ib0 for InfiniBand). Single-node multi-GPU: try NCCL_P2P_DISABLE=1 as a diagnostic (but don't leave it on in production — it kills throughput).

Still stuck? The production manual covers 20+ verified error fixes — OOM, NCCL hangs, tokenizer mismatches, FP8 compatibility, driver update safety, and slow cold start diagnostics. Get the manual →

GLM 5.2 is Serving. Now What?

You just solved 4 common errors. The manual covers 20 — each with symptom, root cause, verified fix, and prevention. Plus the production stack that keeps it running.

This Page (Free)	Production Manual Ch.1 + Ch.7
One-shot `vllm serve` command	Production docker-compose.yml with health checks, resource limits, GPU affinity, persistent volumes
4 common errors (OOM, trust_remote_code, re-download, NCCL timeout)	20 verified errors — each with symptom, root cause, verified fix, and prevention
Basic nginx mention	Full nginx config: TLS termination, rate limiting, JWT auth, access logging in JSON
Not covered	Systemd unit with reboot test, pre-deployment checklist, log rotation, zero-downtime updates
Not covered	Multi-node NCCL debug flowcharts, GPU memory leak detection scripts, driver update safe procedure

You got it running in 10 minutes. Keeping it running at 3am when NCCL hangs and the disk is full — that's what the manual is for.

Get the Production Manual — $29

30-day money-back guarantee. Full refund if the error fixes don't solve your problem.

Install GLM 5.2 & Run Your First Inference