Free Guide — Pillar 2

744B MoE. 8 GPUs. One Endpoint.

Tensor parallelism isn't enough for MoE. Here's how to configure vLLM and SGLang for GLM 5.2's expert routing — with real numbers from 8×H200 and 4×H100.

MoE Architecture: Why It Changes Deployment

GLM 5.2 has 744B total parameters but only ~40B are active per forward pass. Each token is routed to a subset of "experts." This means your VRAM holds the full model, but compute is roughly equivalent to a 40B dense model per token.

The deployment implication: you're paying VRAM for 744B but getting latency of ~40B. That's the tradeoff. MoE means heavier GPU demand but better inference speed at a given quality level.

GPU Sizing: What You Actually Need

QuantizationDiskVRAM (total)Minimum GPUsExample Config
FP16~1.4 TB~744 GB8×H200Production serving, <50 tok/s per user
FP8~744 GB~372 GB4×H100 (80GB each)Production, <100 tok/s per user
Q4_K_M GGUF~220 GB~220 GB + CPU offload4×RTX 4090 (24GB each) + 128GB RAMExperimentation only, 2-5 tok/s
Reality check: 4×RTX 4090 can load GLM 5.2 with heavy quantization and CPU offloading. You'll get 2-5 tok/s — not usable for interactive chat, but enough to test prompts, verify output quality, and experiment before committing to cloud GPU spend. This is your fast path to "does GLM 5.2 work for my use case" without $50/hr H100 rental.

vLLM Configuration for GLM 5.2

Every flag that matters for production MoE deployment:

vllm serve zai-org/GLM-5.2-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32000 \
  --max-num-seqs 16 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-batched-tokens 65536 \
  --host 0.0.0.0 \
  --port 8000
FlagWhy for MoE
--tensor-parallel-size 8Split each expert across all GPUs. For 744B, you need at least 4.
--max-num-seqs 16Concurrent requests. Higher = better throughput, higher VRAM. MoE models can handle more concurrency than dense at same batch size because only ~40B active params per token.
--max-num-batched-tokens 65536Max tokens in flight. For 1M context, this needs tuning — the manual covers the tradeoff curve.
--enable-prefix-cachingMoE system prompts are long (coding tasks, security audits). This saves repeated encoding.

SGLang Configuration

SGLang's RadixAttention provides additional cache efficiency for MoE models. Equivalent config:

python -m sglang.launch_server \
  --model-path zai-org/GLM-5.2-FP8 \
  --trust-remote-code \
  --tp-size 8 \
  --mem-fraction-static 0.90 \
  --context-length 32000 \
  --enable-mixed-chunk \
  --host 0.0.0.0 \
  --port 8000

--enable-mixed-chunk is SGLang's key MoE optimization — mixes prefill and decode chunks to keep all GPUs busy during expert routing. For long-context workloads, this can improve throughput by 15-30% vs standard vLLM.

NCCL Tuning for Multi-GPU

Multi-GPU inference lives or dies on NCCL configuration. The defaults are wrong for most setups.

# Single-node 8-GPU (typical H200 box)
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=NVL

# Multi-node (rare for inference, common for experimentation)
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5
Don't guess with NCCL. NCCL_P2P_DISABLE=1 is a common troubleshooting crutch — it works, but disables GPU-direct communication and cuts throughput by 40-60%. The manual includes NCCL diagnostics and per-hardware tuning matrices for 4/8 GPU setups.

llama.cpp GGUF: The Budget Option

For developers who want to test GLM 5.2 without $50/hr cloud GPU rental:

# Download Q4_K_M GGUF (~220 GB)
# Convert from HF: see manual for conversion script

./llama.cpp/build/bin/llama-server \
  -m glm52-Q4_K_M.gguf \
  --n-gpu-layers 48 \
  --tensor-split 24,24,24,24 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8000

4×RTX 4090 (24GB each, 96GB total) + 128GB system RAM: model loads, runs at 2-5 tok/s. This is your proof-of-concept rig — not production, but enough to validate whether GLM 5.2 works for your use case before renting H100s.

This Page Shows You the Options. The Manual Tunes Them.

This Page (Free)Production Manual Ch.3
GPU sizing table (VRAM per setup)Every vLLM/SGLang flag explained — what it does, why this value, what happens if you change it
vLLM vs SGLang decision matrixFull SGLang RadixAttention config with chunked prefill tuning for GLM 5.2
Single-node NCCL environment variablesMulti-node NCCL tuning matrices for 8 hardware setups — including RoCE v2 fabric config
Basic llama.cpp GGUF commandComplete conversion workflow (HF→GGUF), quality impact by use case, budget hardware reality check
Expert parallelism — conceptual overviewEP vs TP decision tree, hybrid EP+TP for 2+ nodes, expert-to-GPU mapping strategy
Your single-node config works. Your multi-node doesn't.NCCL_MIN_NCHANNELS tuning that prevents 40% throughput loss on MoE all-to-all — discovered through testing

The single-node commands on this page work today. When you scale to 2+ nodes, you'll hit the NCCL wall. The manual has the exact tuning matrix that fixes it — not generic advice, specific environment variables per hardware setup.

Get the Production Manual — $29

30-day money-back guarantee. If the NCCL configs don't work on your hardware, full refund.