
Any cloud vendor offering this model? I would like to try it.


z.ai itself, or Novita for now, but others will probably follow soon.

https://openrouter.ai/z-ai/glm-4.7-flash/providers


Note: I strongly recommend against using Novita. Their main gig is serving quantized versions of models so they can offer them cheaper and at better latency, but if you run an eval against other providers vs Novita you can spot the quality degradation. This is nowhere marked or displayed in their offering.

Tolerating this is very bad form from OpenRouter, since they default-select the lowest price, meaning people who jump into OpenRouter without knowing about this fuckery get facepalm'd by the perceived model quality.
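
If you want to verify it yourself, a rough sketch: pin the request to a single provider with fallbacks disabled and diff the answers against another provider. (The provider slugs and the exact provider-routing fields here are from memory, check OpenRouter's docs.)

  curl https://openrouter.ai/api/v1/chat/completions \
    -H "Authorization: Bearer $OPENROUTER_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "z-ai/glm-4.7-flash",
      "provider": {"order": ["novita"], "allow_fallbacks": false},
      "messages": [{"role": "user", "content": "your eval prompt here"}]
    }'

Run the same prompt with a different slug in "order" and compare the outputs.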


Interesting, it costs less than a tenth of what Haiku costs.


GLM itself is quite inexpensive. A one-year sub to their coding plan is only $29 and it works with a bunch of tools. I use it heavily as an "I don't want to spend my Anthropic credits" day-to-day model (mostly via Crush).


We don't have a lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.

ssh admin.hotaisle.app

Yes, this should be made easier so you can just get a VM with it pre-installed. Working on that.


Unless you use Docker, it's going to be time consuming if vLLM isn't already provided and built against the ROCm dependencies.

It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully, to get it running on an MI325x.


Agreed, the OOB experience kind of sucks.

Here is the magic (assuming a 4x)...

  # pull and start the ROCm vLLM nightly dev container with the GPU devices and host networking
  docker run -it --rm \
  --pull=always \
  --ipc=host \
  --network=host \
  --privileged \
  --cap-add=CAP_SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/mem \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /home/hotaisle:/mnt/data \
  -v /root/.cache:/mnt/model \
  rocm/vllm-dev:nightly
  
  # inside the container: point the HF cache at the host mount so downloaded weights persist
  mv /root/.cache /root/.cache.foo
  ln -s /mnt/model /root/.cache
  
  # serve GLM-4.7 FP8 across 4 GPUs with AITER kernels enabled and MTP speculative decoding
  VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --quantization fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --mm-encoder-tp-mode data
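
Once it's up, you can sanity-check the OpenAI-compatible endpoint (assuming vLLM's default port 8000 and that the served model name is just the HF repo id):

  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "zai-org/GLM-4.7-FP8",
      "messages": [{"role": "user", "content": "hello"}],
      "max_tokens": 64
    }'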


Speculative decoding isn’t needed at all, right? Why include the final bits about it?



GLM 4.7 supports it, and in my experience with Claude Code an 80-plus percent acceptance rate on the speculative tokens is realistic, so it's a significant speedup.
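
Back-of-the-envelope: with --speculative-config.num_speculative_tokens 1 as in the command above, an ~80% acceptance rate means each decode step yields about 1 + 0.8 = 1.8 tokens on average, so roughly a 1.8x ceiling on decode throughput before the MTP draft overhead.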



The model literally came out less than a couple of hours ago; it's going to take people a while to tool it up for their inference platforms.


Sometimes model developers coordinate with inference platforms so the releases land in sync.



