Free Guide
Install GLM 5.2 & Run Your First Inference
10 minutes from zero to working GLM 5.2 endpoint. vLLM + FP8 on 8×H200.
Prerequisites
You need
- 8×H200 (recommended) or 4×H100 (minimum for FP8)
- CUDA 12.4+ and NVIDIA driver 550+
- ~750 GB free disk (for FP8 weights)
- Python 3.10+ and pip
You don't need
- Docker (vLLM runs natively or containerized — both work)
- Paid API keys — it's your hardware, your model
- Model fine-tuning experience
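The items under "You need" can be checked before downloading anything. The following is a minimal preflight sketch, not part of vLLM: the function names and thresholds are illustrative, it assumes `nvidia-smi` is on the PATH, and it checks free space only on the filesystem you point it at.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)        # from the prerequisites list
MIN_FREE_DISK_GB = 750      # approximate FP8 weight size from the list


def python_ok(version=sys.version_info):
    """True if the interpreter meets the Python 3.10+ requirement."""
    return tuple(version[:2]) >= MIN_PYTHON


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def visible_gpus():
    """GPU lines reported by `nvidia-smi -L`; empty list if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line for line in out.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print(f"Free disk: {free_disk_gb():.0f} GB (want ~{MIN_FREE_DISK_GB})")
    print("GPUs visible:", len(visible_gpus()))
```

Pass the directory that will hold the model cache to `free_disk_gb` if it lives on a different volume than the current directory.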
Step 1: Install vLLM
pip install "vllm>=0.23.0"
vLLM 0.23.0+ includes native GLM 5.2 support with FP8 KV cache and IndexShare attention. Verify:
python -c "import vllm; print(vllm.__version__)"
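If you want the version check to fail loudly in a setup script, it can be written as a comparison rather than a print. This is a sketch with a deliberately simplified parser: `parse_release` and `vllm_meets_minimum` are hypothetical helper names, and pre-release suffixes such as `rc1` are ignored rather than ordered correctly.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 23, 0)  # minimum version stated in this guide


def parse_release(text):
    """Turn '0.23.1' or '0.23.0+cu124' into (0, 23, 1) / (0, 23, 0).

    Simplified: keeps the leading digits of the first three dot-separated parts.
    """
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def vllm_meets_minimum(installed=None):
    """True if vLLM is installed at MIN_VLLM or later, False otherwise."""
    if installed is None:
        try:
            installed = version("vllm")
        except PackageNotFoundError:
            return False
    return parse_release(installed) >= MIN_VLLM
```

Calling `vllm_meets_minimum()` with no argument reads the installed package metadata; passing a string is useful for testing the comparison itself.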
A note on --trust-remote-code: the model's custom architecture isn't in upstream transformers yet, so the serve command in Step 2 passes this flag to let vLLM load the code shipped with the model. That code runs on your machine, which is a security consideration for some environments. The manual covers the trust boundary tradeoff in detail.
Step 2: Start the vLLM Server
vllm serve zai-org/GLM-5.2-FP8 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.90 \
--max-model-len 32000 \
--max-num-seqs 8 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000
What each flag does:
| Flag | What it does |
|---|---|
| --tensor-parallel-size 8 | Splits the model across 8 GPUs. Required for the 744B model. |
| --gpu-memory-utilization 0.90 | Uses 90% of GPU VRAM. Start at 0.85 and increase in 0.01 steps if stable. |
| --max-model-len 32000 | Max context length. Start conservative: per-sequence KV cache grows linearly with context, so every doubling doubles it. |
| --kv-cache-dtype fp8 | Halves KV cache VRAM vs FP16. H100/H200 native support. |
| --enable-prefix-caching | Caches shared prefixes (system prompts) and reuses their KV cache on repeated queries instead of recomputing it. |
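The --max-model-len and --kv-cache-dtype rows can be made concrete with a little arithmetic. The sketch below uses the textbook formula for a standard attention KV cache (keys plus values, per layer, per KV head). The layer count, head count, and head dimension are hypothetical placeholders, not GLM 5.2's real configuration, and the model's own attention variant may store less per token; only the scaling behavior is the point.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size in GiB for `tokens` cached tokens (keys + values)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 2**30


# Hypothetical dimensions for illustration only, NOT GLM 5.2's real config.
DIMS = dict(layers=64, kv_heads=8, head_dim=128)

for context in (8192, 16384, 32768):
    fp16 = kv_cache_gib(context, bytes_per_elem=2, **DIMS)
    fp8 = kv_cache_gib(context, bytes_per_elem=1, **DIMS)
    print(f"{context:>6} tokens: FP16 {fp16:.2f} GiB, FP8 {fp8:.2f} GiB")
```

With these placeholder dimensions, one 8192-token sequence needs 2 GiB at FP16 and 1 GiB at FP8, and each doubling of context doubles both figures. Multiply by the number of concurrent sequences to estimate the total.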
Step 3: Verify It's Working
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-5.2-FP8",
"messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
"max_tokens": 100
}'
Expected: a JSON response with generated text, with time to first token (TTFT) under 200 ms on 8×H200.
Cold start note: first load compiles CUDA graphs — allow 5-10 minutes. Subsequent restarts are faster.
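Because the first load can take several minutes, a script that sends the test request too early will just see a refused connection. One way to handle that, sketched here with only the standard library, is to poll until the server answers and then send the same request as the curl example. It assumes the server exposes a `/health` route that returns 200 once ready, which is how vLLM's OpenAI-compatible server behaves to my knowledge, and the function names and timeouts are illustrative.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # host/port from the serve command


def build_chat_request(prompt, model="zai-org/GLM-5.2-FP8", max_tokens=100):
    """Request body matching the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def wait_until_ready(base_url=BASE_URL, timeout_s=900, interval_s=10):
    """Poll GET /health until it returns 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(interval_s)
    return False


def chat(prompt, base_url=BASE_URL):
    """Send one chat completion and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

A typical use is `if wait_until_ready(): print(chat("Explain MoE routing in one sentence."))`. The 900-second default leaves headroom over the 5-10 minute cold start.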
Troubleshooting: If It Didn't Work
OOM at startup — "CUDA out of memory"
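Drop --gpu-memory-utilization to 0.85. Reduce --max-model-len to 8192. Check that all GPUs are visible with nvidia-smi. If a GPU is missing from that output, resolve the driver or hardware problem first; NCCL configuration only comes into play once every GPU is listed.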
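The GPU visibility check can be scripted so it also flags GPUs that already have memory in use, for example from a previous server process that did not exit cleanly. This is a sketch: the `--query-gpu` fields are standard nvidia-smi options, while the function names and the 10% "busy" threshold are arbitrary choices for illustration.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse lines like '0, 1024, 143771' into dicts (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used, "total_mib": total})
    return gpus


def report(gpus, expected=8, busy_fraction=0.10):
    """Return human-readable warnings for missing or already-busy GPUs."""
    warnings = []
    if len(gpus) < expected:
        warnings.append(f"only {len(gpus)} of {expected} GPUs visible")
    for gpu in gpus:
        if gpu["used_mib"] > busy_fraction * gpu["total_mib"]:
            warnings.append(
                f"GPU {gpu['index']} already has {gpu['used_mib']} MiB in use"
            )
    return warnings


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU memory figures."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Running `report(query_gpus())` on the serving host returns an empty list when all eight GPUs are present and idle.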
"trust_remote_code" error
GLM 5.2's architecture requires custom code. Always pass --trust-remote-code. In production, review the model code first — the manual includes a security review checklist for this specific concern.
Model re-downloads on every restart
vLLM caches models in ~/.cache/huggingface/hub/. If your container doesn't persist this volume, mount it. The manual's docker-compose includes volume persistence for all model caches.
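To confirm which directory needs to be mounted, it helps to resolve the cache path the same way the Hugging Face libraries do. The sketch below is a simplified approximation of that lookup: it honors `HF_HUB_CACHE` and `HF_HOME` and falls back to the default path named above, but ignores other variables such as `XDG_CACHE_HOME`. The helper names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=os.environ):
    """Resolve the Hugging Face hub cache directory from environment variables."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def model_cache_path(repo_id, env=os.environ):
    """Folder where a model repo is stored, e.g. models--org--name."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


path = model_cache_path("zai-org/GLM-5.2-FP8")
print(path, "(present)" if path.exists() else "(not downloaded yet)")
```

If this prints "(not downloaded yet)" inside a freshly restarted container, the cache directory is not on a persistent volume.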
NCCL timeout on multi-GPU
Multi-node setups: set NCCL_SOCKET_IFNAME to your high-speed interface (e.g., ib0 for InfiniBand). Single-node multi-GPU: try NCCL_P2P_DISABLE=1 as a diagnostic (but don't leave it on in production — it kills throughput).
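For a one-off diagnostic run, it is safer to set these variables on a single launch than to export them in a shell profile where they might linger. The sketch below builds such an environment. `nccl_env` is a hypothetical helper, and adding `NCCL_DEBUG=INFO` is my suggestion for getting verbose NCCL logs, not something the steps above require.

```python
import os


def nccl_env(base=None, interface=None, disable_p2p=False, debug=True):
    """Copy of the environment with NCCL diagnostic variables applied."""
    env = dict(os.environ if base is None else base)
    if debug:
        env["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
    if interface:
        env["NCCL_SOCKET_IFNAME"] = interface
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"       # diagnostic only; hurts throughput
    return env
```

Pass the result as `env=` to `subprocess.run` with the same `vllm serve` arguments as in Step 2; the variables disappear when that process exits.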
GLM 5.2 is Serving. Now What?
You just solved 4 common errors. The manual covers 20 — each with symptom, root cause, verified fix, and prevention. Plus the production stack that keeps it running.
| This Page (Free) | Production Manual Ch.1 + Ch.7 |
|---|---|
| One-shot vllm serve command | Production docker-compose.yml with health checks, resource limits, GPU affinity, persistent volumes |
| 4 common errors (OOM, trust_remote_code, re-download, NCCL timeout) | 20 verified errors — each with symptom, root cause, verified fix, and prevention |
| Basic nginx mention | Full nginx config: TLS termination, rate limiting, JWT auth, access logging in JSON |
| Not covered | Systemd unit with reboot test, pre-deployment checklist, log rotation, zero-downtime updates |
| Not covered | Multi-node NCCL debug flowcharts, GPU memory leak detection scripts, driver update safe procedure |
You got it running in 10 minutes. Keeping it running at 3am when NCCL hangs and the disk is full — that's what the manual is for.
Get the Production Manual: $29
30-day money-back guarantee. Full refund if the error fixes don't solve your problem.