GLM-5.2-FP8 on HGX-H200: SGLang Docker configuration with 262k context and 70 t/s

Why it matters

The configuration gives two concrete findings for H200. Setting moe-a2a-backend to deepep lowers throughput (50 t/s instead of 70 t/s), and raising mem-fraction-static above 0.83 causes out-of-memory (OOM) errors. Existing vLLM recipes for H200 do not work for this model because of the FP8 KV cache on the DSv3 (DeepSeek-V3) architecture, which makes SGLang the practical alternative for now.
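The two findings above can be encoded as guard rails when assembling an SGLang launch command. The following is a minimal sketch, not the author's Docker configuration: the model path is a hypothetical placeholder, tp_size=8 assumes a single 8-GPU HGX-H200 node, and "262k" is assumed to mean 262,144 tokens. The thresholds are the values reported for this specific setup, not general SGLang limits.

```python
import warnings
from dataclasses import dataclass
from typing import List, Optional

# Values reported for this H200 setup; not general SGLang limits.
MAX_MEM_FRACTION_STATIC = 0.83  # above this, the report saw OOM
CONTEXT_LENGTH = 262_144  # assumption: "262k" read as 262,144 tokens


@dataclass
class LaunchConfig:
    model_path: str  # hypothetical placeholder, not the author's path
    tp_size: int = 8  # assumption: one HGX-H200 node with 8 GPUs
    context_length: int = CONTEXT_LENGTH
    mem_fraction_static: float = 0.83
    moe_a2a_backend: Optional[str] = None  # None: flag omitted entirely


def build_args(cfg: LaunchConfig) -> List[str]:
    """Build an sglang.launch_server argument list, enforcing the reported limits."""
    if cfg.mem_fraction_static > MAX_MEM_FRACTION_STATIC:
        # The report observed OOM above 0.83, so refuse such configs outright.
        raise ValueError(
            f"mem-fraction-static {cfg.mem_fraction_static} > {MAX_MEM_FRACTION_STATIC}"
        )
    if cfg.moe_a2a_backend == "deepep":
        # Not a failure, only slower: the report measured ~50 t/s vs ~70 t/s.
        warnings.warn("moe-a2a-backend deepep was reported slower on H200")
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", cfg.model_path,
        "--tp-size", str(cfg.tp_size),
        "--context-length", str(cfg.context_length),
        "--mem-fraction-static", str(cfg.mem_fraction_static),
    ]
    if cfg.moe_a2a_backend is not None:
        # Only pass the backend flag when explicitly requested.
        args += ["--moe-a2a-backend", cfg.moe_a2a_backend]
    return args
```

Leaving moe_a2a_backend unset keeps SGLang's own default instead of hard-coding a backend value; the OOM limit is a hard error because it breaks the server, while deepep only triggers a warning because it merely reduces throughput.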

Lumeric editorial team

Source: reddit.com

70 t/s @ 262k context

GLM-5.2-FP8 on HGX-H200 with SGLang

Inference, Infra, Open Source, Developer Tooling
