Free Guide

Install GLM 5.2 & Run Your First Inference

10 minutes from zero to working GLM 5.2 endpoint. vLLM + FP8 on 8×H200.

Prerequisites

You need

  • 8×H200 (recommended) or 4×H100 (minimum for FP8)
  • CUDA 12.4+ and NVIDIA driver 550+
  • ~750 GB free disk (for FP8 weights)
  • Python 3.10+ and pip

You don't need

  • Docker (vLLM runs natively or containerized — both work)
  • Paid API keys — it's your hardware, your model
  • Model fine-tuning experience

Step 1: Install vLLM

pip install vllm>=0.23.0

vLLM 0.23.0+ includes native GLM 5.2 support with FP8 KV cache and IndexShare attention. Verify:

python -c "import vllm; print(vllm.__version__)"
GLM 5.2 requires --trust-remote-code — the model's custom architecture isn't in upstream transformers yet. vLLM handles this automatically when you pass the flag, but it's a security consideration for some environments. The manual covers the trust boundary tradeoff in detail.

Step 2: Start the vLLM Server

vllm serve zai-org/GLM-5.2-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32000 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

What each flag does:

FlagWhat it does
--tensor-parallel-size 8Splits model across 8 GPUs. Required for 744B model.
--gpu-memory-utilization 0.90Use 90% of GPU VRAM. Start at 0.85 and increase in 0.01 steps if stable.
--max-model-len 32000Max context length. Start conservative — every doubling balloons KV cache.
--kv-cache-dtype fp8Halves KV cache VRAM vs FP16. H100/H200 native support.
--enable-prefix-cachingCaches shared prefixes (system prompts). Saves VRAM on repeated queries.

Step 3: Verify It's Working

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.2-FP8",
    "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
    "max_tokens": 100
  }'

Expected: JSON response with generated text, within 200ms TTFT on 8×H200.

Cold start note: first load compiles CUDA graphs — allow 5-10 minutes. Subsequent restarts are faster.

Troubleshooting: If It Didn't Work

OOM at startup — "CUDA out of memory"

Drop --gpu-memory-utilization to 0.85. Reduce --max-model-len to 8192. Check that all GPUs are visible: nvidia-smi. If one GPU is missing, check NCCL configuration.

"trust_remote_code" error

GLM 5.2's architecture requires custom code. Always pass --trust-remote-code. In production, review the model code first — the manual includes a security review checklist for this specific concern.

Model re-downloads on every restart

vLLM caches models in ~/.cache/huggingface/hub/. If your container doesn't persist this volume, mount it. The manual's docker-compose includes volume persistence for all model caches.

NCCL timeout on multi-GPU

Multi-node setups: set NCCL_SOCKET_IFNAME to your high-speed interface (e.g., ib0 for InfiniBand). Single-node multi-GPU: try NCCL_P2P_DISABLE=1 as a diagnostic (but don't leave it on in production — it kills throughput).

Still stuck? The production manual covers 20+ verified error fixes — OOM, NCCL hangs, tokenizer mismatches, FP8 compatibility, driver update safety, and slow cold start diagnostics. Get the manual →

GLM 5.2 is Serving. Now What?

You just solved 4 common errors. The manual covers 20 — each with symptom, root cause, verified fix, and prevention. Plus the production stack that keeps it running.

This Page (Free)Production Manual Ch.1 + Ch.7
One-shot vllm serve commandProduction docker-compose.yml with health checks, resource limits, GPU affinity, persistent volumes
4 common errors (OOM, trust_remote_code, re-download, NCCL timeout)20 verified errors — each with symptom, root cause, verified fix, and prevention
Basic nginx mentionFull nginx config: TLS termination, rate limiting, JWT auth, access logging in JSON
Not coveredSystemd unit with reboot test, pre-deployment checklist, log rotation, zero-downtime updates
Not coveredMulti-node NCCL debug flowcharts, GPU memory leak detection scripts, driver update safe procedure

You got it running in 10 minutes. Keeping it running at 3am when NCCL hangs and the disk is full — that's what the manual is for.

Get the Production Manual — $29

30-day money-back guarantee. Full refund if the error fixes don't solve your problem.