Post

Qwen3.6-27B on an M1 Max: when the laptop config matches the 3090

Benchmarked Qwen3.6-27B Q4_K_M on an M1 Max via llama-cpp-turboquant. The same prod config that runs on my 3090 was already optimal — but llama-cli's interactive auto-mode nearly ate my disk.

Qwen3.6-27B on an M1 Max: when the laptop config matches the 3090

I have a llama-swap config running Qwen3.6-27B on my RTX 3090 with cache-type-k: turbo3 and a 262k context window. I wanted to know how the same config behaves on an M1 Max with 32 GB of unified memory — the laptop I actually carry around. The short answer: the prod yaml from the 3090 was already the right config for the M1, decode is about 3-4× slower (memory-bandwidth limited, exactly as the math predicts), and llama-cli will gleefully fill your disk with > characters if you don’t pass the right flag.

The setup

The fork is llama-cpp-turboquant, a llama.cpp fork that adds Metal kernels for a custom KV cache type called turbo3 (~4 bpw KV with sparse-V dequant) plus a TQ4_1S weight quantization. On Apple Silicon Metal is enabled by default — building is two cmake commands:

1
2
3
cd ~/Github/llama-cpp-turboquant
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

The model: unsloth/Qwen3.6-27B-GGUF, Q4_K_M variant, 16 GB on disk. The repo also ships everything from BF16 splits down to UD-IQ2_XXS, plus three mmproj-*.gguf files for vision input — Qwen3.6 is multimodal. For this run I stuck with text.

One detail I learned the hard way: Qwen3.6-27B isn’t a stock transformer. It’s a hybrid of 48 Gated DeltaNet layers and 16 Gated Attention layers (24 query heads, 4 KV heads, head_dim 256). DeltaNet layers are recurrent — they have no KV cache. So the KV math is not 2 × 64 × kv_heads × head_dim like a generic 27B; it’s 2 × 16 × kv_heads × head_dim. At fp16 that’s about 32 KB per token instead of 131 KB. A 32k context fits in roughly 1 GB of fp16 KV. With turbo3 it’s a quarter of that.

The bench matrix

I wrote a small bench.sh that runs llama-bench -p 512 -n 128 -r 3 -ngl 99 across seven configurations, captures ps snapshots into per-config telemetry files, and ends with a real-inference run via llama-cli. The seven configs varied the KV cache type, flash attention, ubatch size, and the TURBO_SPARSE_V env var.

Here’s what came back, ranked by decode speed:

Configpp512 t/stg128 t/sPeak RSS
prod-turbo3 (3090 yaml)114.45 ±5.8910.26 ±0.0213.93 GB
baseline-fp16117.87 ±3.9110.02 ±0.6515.92 GB
turbo3-sparse-off113.50 ±2.6810.00 ±0.3214.27 GB
fa-off112.95 ±10.3110.07 ±0.1015.96 GB
kv-q8103.66 ±6.2110.03 ±0.1615.91 GB
kv-q4113.70 ±13.089.44 ±0.1816.00 GB
ub-1024109.09 ±10.019.70 ±0.4615.93 GB

A few things stand out:

  • prod-turbo3 wins both ways: top decode at 10.26 t/s and 2 GB less RAM than fp16. The same yaml I run on the 3090 was already the right call for the M1 Max.
  • TURBO_SPARSE_V=0 costs ~2.6% on decode (10.26 → 10.00). Tiny but real, and free if you keep it on.
  • q4_0 KV is a trap. Same RSS as fp16, slowest tg in the matrix. If you want to save KV memory, use turbo3 or q8_0, not q4_0.
  • q8_0 KV has the slowest prompt eval (103 vs 117 for fp16). Decode is unaffected. Pay for that asymmetry only if you actually need the memory.
  • Flash attention barely matters at 512 ctx. -fa 0 and -fa 1 are within noise. It would matter much more at long context, which this bench didn’t stress.

The compare with the 3090

The decode ceiling on either machine is memory_bandwidth / model_size. For Q4_K_M Qwen3.6-27B at 16 GB:

  • M1 Max (32-core GPU bin): 400 GB/s ÷ 16 GB ≈ 25 t/s theoretical, ~10 t/s real
  • RTX 3090: 936 GB/s ÷ 16 GB ≈ 58 t/s theoretical, ~35-40 t/s real

The ratio between the two real numbers (~3.5×) almost exactly matches the bandwidth ratio (~2.3×) plus the M1’s CPU/GPU dispatch overhead. There is no Apple-specific tuning that breaks past memory bandwidth for dense decoding on Q4. If you want faster, the only levers are smaller weights (lower quant, more aggressive packing) or smaller models — and neither of those is going to land another 3.5× without quality damage.

The good news: the M1 Max is at ~3% sustained CPU during decode. The GPU is doing all the work. Plug it in, plug headphones in, and it’s a perfectly usable assistant.

Where it almost went sideways

Two things cost me real time on this run.

1. llama-cli auto-conversation mode. I kicked off the real-inference run with -no-cnv --no-display-prompt -p "Write a 100-line Python script…" -n 256 and went to grab coffee. Came back to a 213 MB log file containing roughly six million > lines. The process was still alive, generating nothing.

What was happening: when llama.cpp detects a chat template in the GGUF (Qwen3.6 has one), llama-cli auto-enables conversation mode. After the -n 256 tokens are generated, the CLI loops on stdin waiting for the next user turn. With stdin closed (background process), the readline loop just spits the prompt character > into the log forever. -no-cnv alone does not override this on this fork.

The fix is two flags, not one:

1
llama-cli ... -st --simple-io < /dev/null

-st (--single-turn) actually exits after one generation. --simple-io keeps it from doing terminal magic in non-tty contexts. Both together. With both, the run completes cleanly and prints a built-in summary line:

1
[ Prompt: 61.6 t/s | Generation: 10.6 t/s ]

2. The watchdog agent. I tried to set up a monitor agent to poll the bench every 45 seconds and write a status snapshot to /tmp/bench_status.md. It self-terminated after one snapshot, decided its job was done, and returned a summary that read like an essay. The lesson, again, is that long-running side processes belong in plain shell loops, not in agent abstractions that have their own ideas about when their work is finished.

The replacement was a single ps sampler:

1
2
3
4
( while kill -0 $PID 2>/dev/null; do
    echo "$(date +%H:%M:%S) $(ps -o rss=,pcpu= -p $PID 2>/dev/null)"
    sleep 2
  done ) > inference.ps.txt &

Boring. Reliable. Six bytes of state instead of a multi-paragraph status report.

TurboQuant knobs worth knowing

The fork exposes a handful of env vars on top of the standard llama.cpp surface. None of them needed touching on the M1 Max — defaults are right — but they’re worth knowing for diagnostic toggling:

VarDefaultWhat it does
TURBO_SPARSE_VonSkip V dequant on negligible attention weights
TURBO_FORCE_4MAGauto4-mag LUT vs 8-LUT for turbo3 attention
TURBO_FLASHoffTwo-pass fused asymmetric attention (corrupt on Apple10)
TURBO_LAYER_ADAPTIVEautoPer-layer KV quant strategy 0–7

On startup, the Metal init logs tell you which of these kicked in. On the M1 Max it prints:

1
2
3
turbo3 using 4-mag LUT (pre-M5 hardware)
turbo3 sparse V dequant enabled (opt-out: TURBO_SPARSE_V=0)
GPU family: MTLGPUFamilyApple7

That Apple7 line is what you want — it means TURBO_FLASH (which is broken on Apple10 / M5 Max) is irrelevant here. M1, M1 Pro, and M1 Max are all Apple7. M2 is Apple8. The fork’s commit log has a Metal patch from a few weeks ago that disabled TURBO_FLASH by default on Apple10 because of corrupt outputs; on Apple7 it was always-off.

Takeaway

For dense decoding on quantized weights, memory bandwidth is the only number that matters at the laptop tier. The M1 Max landed at 10.6 t/s on Qwen3.6-27B Q4_K_M — about 3.5× slower than my 3090, almost exactly what 400 GB/s vs 936 GB/s predicts. Same yaml, no Apple-specific tuning, turbo3 KV cache wins on both speed and memory.

The two non-obvious wins were operational, not architectural: pass -st --simple-io to llama-cli whenever you run it non-interactively against a chat-template model, and trust a while kill -0 shell loop over any monitoring abstraction that thinks for itself.


Written with Claude Opus 4.7 (claude-opus-4-7) via Claude Code.

This post is licensed under CC BY 4.0 by the author.