A Single vllm serve Isn't Production.

The production stack has three layers in the request path: nginx reverse proxy → vLLM/SGLang serving → GPU backend, with systemd and Docker underneath as the infrastructure layer. TLS termination, health checks, auto-restart, and log rotation are the things that keep your model serving at 3am when you're asleep.

The Production Topology

| Layer | Component | What It Does |
| --- | --- | --- |
| Edge | nginx | TLS termination, rate limiting, API key validation, request buffering, access logging |
| App | vLLM / SGLang | Model inference, KV cache management, batching, OpenAI-compatible API endpoint |
| GPU | NVIDIA Driver + CUDA | FP8 kernel execution, NCCL multi-GPU communication, VRAM allocation |
| Infra | systemd + Docker | Auto-restart on crash, container lifecycle, log rotation, resource limits |

Docker Compose — The Skeleton

This gets you past "it works on my machine." One file, two services, GPU passthrough. The manual adds health checks, resource limits, and multi-node config.

# docker-compose.yml: production skeleton
services:
  vllm:
    image: vllm/vllm-openai:latest   # pin a specific tag for real deployments
    runtime: nvidia
    restart: unless-stopped          # Docker restarts the container if vLLM crashes
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model glm-5-2/fp8
      --tensor-parallel-size 8
      --max-model-len 131072
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
    ports:
      # Loopback only, so outside clients must go through nginx
      - "127.0.0.1:8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host   # shares the host's /dev/shm; shm_size has no effect in this mode

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - vllm

This skeleton runs. But it's missing health checks (Docker will mark vLLM "running" before the model loads), resource limits (OOM kills the wrong container), and the systemd unit that survives a reboot. The manual covers all three.
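To make the health-check gap concrete, here is a minimal sketch of a single-shot probe. It assumes vLLM answers `GET /health` with HTTP 200 once it is ready to serve, and that the API is reachable at `localhost:8000` as in the compose file above. The function name `vllm_is_healthy` and the injectable `opener` parameter are illustrative, not part of vLLM or Docker.

```python
import http.client
import urllib.error
import urllib.request


def vllm_is_healthy(url="http://localhost:8000/health", timeout=5.0,
                    opener=urllib.request.urlopen):
    """Return True only if `url` answers HTTP 200 within `timeout` seconds.

    `opener` is injectable so the probe can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed reply, or an HTTP error
        # status (urlopen raises HTTPError for 4xx/5xx): treat as not ready.
        return False
```

A wrapper that exits 0 when this returns True and 1 otherwise follows the convention container health checks expect, so a probe like this could back a health check in the compose file.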

Nginx — The Gateway

Your model speaks an OpenAI-compatible API. Clients shouldn't talk to it directly. nginx adds the missing production layer.

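# nginx.conf: minimal production gateway
# Mounted as the main config file, so it needs its own events {} and http {} blocks.
events {}

http {
    # Rate limit: 60 requests per minute per IP.
    # limit_req_zone is only valid at http level, not inside server {}.
    limit_req_zone $binary_remote_addr zone=glm_limit:10m rate=60r/m;

    upstream glm_backend {
        server vllm:8000 max_fails=3 fail_timeout=30s;
        keepalive 32;
    }

    server {
        listen 443 ssl;
        server_name glm-api.yourdomain.com;

        ssl_certificate     /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;

        limit_req zone=glm_limit burst=20 nodelay;

        # Cap request bodies at 2 MB: a coarse guard against oversized prompts.
        # The 128K token limit itself is enforced by vLLM via --max-model-len.
        client_max_body_size 2m;

        location /v1/ {
            proxy_pass http://glm_backend;
            proxy_set_header Host $host;
            proxy_read_timeout 300s;

            # Needed for the upstream keepalive pool to be used
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # API key validation (JWT example; the manual has the full auth setup)
            auth_request /auth;
        }

        location /health {
            proxy_pass http://glm_backend/health;
            access_log off;
        }

        location = /auth {
            internal;
            # auth-service is a placeholder and is not defined in the compose file;
            # nginx will refuse to start until this hostname resolves.
            proxy_pass http://auth-service/validate;
            proxy_pass_request_body off;
            proxy_set_header Content-Length "";
        }
    }
}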

Rate limiting, TLS, and an auth endpoint: the three things that separate a demo from a production API. The manual adds JWT token generation, Cloudflare Tunnel as an alternative, and multi-user queue management for burst traffic.

Systemd — Survive a Reboot

Without a process manager, your model dies when the SSH session drops. systemd is the simplest answer.

# /etc/systemd/system/glm52.service
[Unit]
Description=GLM 5.2 Production Stack
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/glm52
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
ExecReload=/usr/bin/docker compose restart
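# Retries the compose command if it fails. A container that crashes later is
# not restarted by this oneshot unit; that needs a Docker restart policy.
Restart=on-failure
RestartSec=30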

[Install]
WantedBy=multi-user.target

systemctl enable glm52 and your model comes back after a power outage. The manual covers the pre-deployment checklist: reboot test, cold start timing (8×H200 loads FP8 weights in ~110 seconds), and zero-downtime model updates via rolling restart.
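Cold start timing can be measured with a small polling loop. The sketch below assumes you supply an `is_ready` callable, for example one that requests the `/health` endpoint and returns True on HTTP 200. The name `time_until_ready` and the default timeout and interval are illustrative choices, not values from the manual.

```python
import time


def time_until_ready(is_ready, timeout_s=600.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready()` until it returns True; return elapsed seconds.

    Raises TimeoutError if the service is still not ready after `timeout_s`.
    `clock` and `sleep` are injectable so the loop can be tested instantly.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service not ready after {timeout_s:.0f}s")
        sleep(interval_s)
```

Start the timer right after `systemctl start glm52` or a reboot, and compare the result against your expected load time. The polling interval bounds the measurement error, so 2 seconds is plenty of precision for a load measured in minutes.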

What the Manual Adds (Ch.1)

| This Page (Free) | Production Manual Ch.1 |
| --- | --- |
| Docker Compose skeleton | Full production compose with health checks, resource limits, GPU affinity |
| Basic nginx config | JWT auth + Cloudflare Tunnel + API key rotation + access log parsing |
| systemd unit template | Pre-deployment checklist: reboot test, cold start timing, rollback procedure |
| Single-node topology | Multi-node architecture with load balancer + shared KV cache strategy |
| Not covered | Zero-downtime model updates, rolling restart, canary deployment pattern |

This Page Shows the Pattern. The Manual Ships the Configs.

Every config on this page is a skeleton — enough to understand the architecture, not enough to run production traffic. Ch.1 gives you the complete docker-compose.yml with every flag validated on 8×H200, the nginx config that handles 100+ concurrent users, and the systemd unit that survived 50 reboot cycles in testing.

Get the Production Manual — $29

30-day money-back guarantee. If the configs don't work in your environment, full refund.