Free Guide — Pillar 2
744B MoE. 8 GPUs. One Endpoint.
Tensor parallelism isn't enough for MoE. Here's how to configure vLLM and SGLang for GLM 5.2's expert routing — with real numbers from 8×H200 and 4×H100.
MoE Architecture: Why It Changes Deployment
GLM 5.2 has 744B total parameters but only ~40B are active per forward pass. Each token is routed to a subset of "experts." This means your VRAM holds the full model, but compute is roughly equivalent to a 40B dense model per token.
The deployment implication: you're paying VRAM for 744B but getting latency of ~40B. That's the tradeoff. MoE means heavier GPU demand but better inference speed at a given quality level.
GPU Sizing: What You Actually Need
| Quantization | Disk | VRAM (total) | Minimum GPUs | Example Config |
|---|---|---|---|---|
| FP16 | ~1.4 TB | ~744 GB | 8×H200 | Production serving, <50 tok/s per user |
| FP8 | ~744 GB | ~372 GB | 4×H100 (80GB each) | Production, <100 tok/s per user |
| Q4_K_M GGUF | ~220 GB | ~220 GB + CPU offload | 4×RTX 4090 (24GB each) + 128GB RAM | Experimentation only, 2-5 tok/s |
vLLM Configuration for GLM 5.2
Every flag that matters for production MoE deployment:
vllm serve zai-org/GLM-5.2-FP8 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.90 \
--max-model-len 32000 \
--max-num-seqs 16 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--max-num-batched-tokens 65536 \
--host 0.0.0.0 \
--port 8000
| Flag | Why for MoE |
|---|---|
--tensor-parallel-size 8 | Split each expert across all GPUs. For 744B, you need at least 4. |
--max-num-seqs 16 | Concurrent requests. Higher = better throughput, higher VRAM. MoE models can handle more concurrency than dense at same batch size because only ~40B active params per token. |
--max-num-batched-tokens 65536 | Max tokens in flight. For 1M context, this needs tuning — the manual covers the tradeoff curve. |
--enable-prefix-caching | MoE system prompts are long (coding tasks, security audits). This saves repeated encoding. |
SGLang Configuration
SGLang's RadixAttention provides additional cache efficiency for MoE models. Equivalent config:
python -m sglang.launch_server \
--model-path zai-org/GLM-5.2-FP8 \
--trust-remote-code \
--tp-size 8 \
--mem-fraction-static 0.90 \
--context-length 32000 \
--enable-mixed-chunk \
--host 0.0.0.0 \
--port 8000
--enable-mixed-chunk is SGLang's key MoE optimization — mixes prefill and decode chunks to keep all GPUs busy during expert routing. For long-context workloads, this can improve throughput by 15-30% vs standard vLLM.
NCCL Tuning for Multi-GPU
Multi-GPU inference lives or dies on NCCL configuration. The defaults are wrong for most setups.
# Single-node 8-GPU (typical H200 box)
export NCCL_SOCKET_IFNAME=lo
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=NVL
# Multi-node (rare for inference, common for experimentation)
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5
NCCL_P2P_DISABLE=1 is a common troubleshooting crutch — it works, but disables GPU-direct communication and cuts throughput by 40-60%. The manual includes NCCL diagnostics and per-hardware tuning matrices for 4/8 GPU setups.
llama.cpp GGUF: The Budget Option
For developers who want to test GLM 5.2 without $50/hr cloud GPU rental:
# Download Q4_K_M GGUF (~220 GB)
# Convert from HF: see manual for conversion script
./llama.cpp/build/bin/llama-server \
-m glm52-Q4_K_M.gguf \
--n-gpu-layers 48 \
--tensor-split 24,24,24,24 \
--ctx-size 8192 \
--host 0.0.0.0 \
--port 8000
4×RTX 4090 (24GB each, 96GB total) + 128GB system RAM: model loads, runs at 2-5 tok/s. This is your proof-of-concept rig — not production, but enough to validate whether GLM 5.2 works for your use case before renting H100s.
This Page Shows You the Options. The Manual Tunes Them.
| This Page (Free) | Production Manual Ch.3 |
|---|---|
| GPU sizing table (VRAM per setup) | Every vLLM/SGLang flag explained — what it does, why this value, what happens if you change it |
| vLLM vs SGLang decision matrix | Full SGLang RadixAttention config with chunked prefill tuning for GLM 5.2 |
| Single-node NCCL environment variables | Multi-node NCCL tuning matrices for 8 hardware setups — including RoCE v2 fabric config |
| Basic llama.cpp GGUF command | Complete conversion workflow (HF→GGUF), quality impact by use case, budget hardware reality check |
| Expert parallelism — conceptual overview | EP vs TP decision tree, hybrid EP+TP for 2+ nodes, expert-to-GPU mapping strategy |
| Your single-node config works. Your multi-node doesn't. | NCCL_MIN_NCHANNELS tuning that prevents 40% throughput loss on MoE all-to-all — discovered through testing |
The single-node commands on this page work today. When you scale to 2+ nodes, you'll hit the NCCL wall. The manual has the exact tuning matrix that fixes it — not generic advice, specific environment variables per hardware setup.
Get the Production Manual — $2930-day money-back guarantee. If the NCCL configs don't work on your hardware, full refund.