Qwen3.6-27B on an M1 Max: when the laptop config matches the 3090
Benchmarked Qwen3.6-27B Q4_K_M on an M1 Max via llama-cpp-turboquant. The same prod config that runs on my 3090 was already optimal — but llama-cli's interactive auto-mode nearly ate my disk.
I have a llama-swap config running Qwen3.6-27B on my RTX 3090 with cache-type-k: turbo3 and a 262k context window. I wanted to know how the same config behaves on an M1 Max with 32 GB of unified memory — the laptop I actually carry around. The short answer: the prod yaml from the 3090 was already the right config for the M1, decode is about 3-4× slower (memory-bandwidth limited, exactly as the math predicts), and llama-cli will gleefully fill your disk with > characters if you don’t pass the right flag.
The setup
The fork is llama-cpp-turboquant, a llama.cpp fork that adds Metal kernels for a custom KV cache type called turbo3 (~4 bpw KV with sparse-V dequant) plus a TQ4_1S weight quantization. On Apple Silicon Metal is enabled by default — building is two cmake commands:
1
2
3
cd ~/Github/llama-cpp-turboquant
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
The model: unsloth/Qwen3.6-27B-GGUF, Q4_K_M variant, 16 GB on disk. The repo also ships everything from BF16 splits down to UD-IQ2_XXS, plus three mmproj-*.gguf files for vision input — Qwen3.6 is multimodal. For this run I stuck with text.
One detail I learned the hard way: Qwen3.6-27B isn’t a stock transformer. It’s a hybrid of 48 Gated DeltaNet layers and 16 Gated Attention layers (24 query heads, 4 KV heads, head_dim 256). DeltaNet layers are recurrent — they have no KV cache. So the KV math is not 2 × 64 × kv_heads × head_dim like a generic 27B; it’s 2 × 16 × kv_heads × head_dim. At fp16 that’s about 32 KB per token instead of 131 KB. A 32k context fits in roughly 1 GB of fp16 KV. With turbo3 it’s a quarter of that.
The bench matrix
I wrote a small bench.sh that runs llama-bench -p 512 -n 128 -r 3 -ngl 99 across seven configurations, captures ps snapshots into per-config telemetry files, and ends with a real-inference run via llama-cli. The seven configs varied the KV cache type, flash attention, ubatch size, and the TURBO_SPARSE_V env var.
Here’s what came back, ranked by decode speed:
| Config | pp512 t/s | tg128 t/s | Peak RSS |
|---|---|---|---|
| prod-turbo3 (3090 yaml) | 114.45 ±5.89 | 10.26 ±0.02 | 13.93 GB |
| baseline-fp16 | 117.87 ±3.91 | 10.02 ±0.65 | 15.92 GB |
| turbo3-sparse-off | 113.50 ±2.68 | 10.00 ±0.32 | 14.27 GB |
| fa-off | 112.95 ±10.31 | 10.07 ±0.10 | 15.96 GB |
| kv-q8 | 103.66 ±6.21 | 10.03 ±0.16 | 15.91 GB |
| kv-q4 | 113.70 ±13.08 | 9.44 ±0.18 | 16.00 GB |
| ub-1024 | 109.09 ±10.01 | 9.70 ±0.46 | 15.93 GB |
A few things stand out:
- prod-turbo3 wins both ways: top decode at 10.26 t/s and 2 GB less RAM than fp16. The same yaml I run on the 3090 was already the right call for the M1 Max.
TURBO_SPARSE_V=0costs ~2.6% on decode (10.26 → 10.00). Tiny but real, and free if you keep it on.q4_0KV is a trap. Same RSS as fp16, slowest tg in the matrix. If you want to save KV memory, useturbo3orq8_0, notq4_0.q8_0KV has the slowest prompt eval (103 vs 117 for fp16). Decode is unaffected. Pay for that asymmetry only if you actually need the memory.- Flash attention barely matters at 512 ctx.
-fa 0and-fa 1are within noise. It would matter much more at long context, which this bench didn’t stress.
The compare with the 3090
The decode ceiling on either machine is memory_bandwidth / model_size. For Q4_K_M Qwen3.6-27B at 16 GB:
- M1 Max (32-core GPU bin): 400 GB/s ÷ 16 GB ≈ 25 t/s theoretical, ~10 t/s real
- RTX 3090: 936 GB/s ÷ 16 GB ≈ 58 t/s theoretical, ~35-40 t/s real
The ratio between the two real numbers (~3.5×) almost exactly matches the bandwidth ratio (~2.3×) plus the M1’s CPU/GPU dispatch overhead. There is no Apple-specific tuning that breaks past memory bandwidth for dense decoding on Q4. If you want faster, the only levers are smaller weights (lower quant, more aggressive packing) or smaller models — and neither of those is going to land another 3.5× without quality damage.
The good news: the M1 Max is at ~3% sustained CPU during decode. The GPU is doing all the work. Plug it in, plug headphones in, and it’s a perfectly usable assistant.
Where it almost went sideways
Two things cost me real time on this run.
1. llama-cli auto-conversation mode. I kicked off the real-inference run with -no-cnv --no-display-prompt -p "Write a 100-line Python script…" -n 256 and went to grab coffee. Came back to a 213 MB log file containing roughly six million > lines. The process was still alive, generating nothing.
What was happening: when llama.cpp detects a chat template in the GGUF (Qwen3.6 has one), llama-cli auto-enables conversation mode. After the -n 256 tokens are generated, the CLI loops on stdin waiting for the next user turn. With stdin closed (background process), the readline loop just spits the prompt character > into the log forever. -no-cnv alone does not override this on this fork.
The fix is two flags, not one:
1
llama-cli ... -st --simple-io < /dev/null
-st (--single-turn) actually exits after one generation. --simple-io keeps it from doing terminal magic in non-tty contexts. Both together. With both, the run completes cleanly and prints a built-in summary line:
1
[ Prompt: 61.6 t/s | Generation: 10.6 t/s ]
2. The watchdog agent. I tried to set up a monitor agent to poll the bench every 45 seconds and write a status snapshot to /tmp/bench_status.md. It self-terminated after one snapshot, decided its job was done, and returned a summary that read like an essay. The lesson, again, is that long-running side processes belong in plain shell loops, not in agent abstractions that have their own ideas about when their work is finished.
The replacement was a single ps sampler:
1
2
3
4
( while kill -0 $PID 2>/dev/null; do
echo "$(date +%H:%M:%S) $(ps -o rss=,pcpu= -p $PID 2>/dev/null)"
sleep 2
done ) > inference.ps.txt &
Boring. Reliable. Six bytes of state instead of a multi-paragraph status report.
TurboQuant knobs worth knowing
The fork exposes a handful of env vars on top of the standard llama.cpp surface. None of them needed touching on the M1 Max — defaults are right — but they’re worth knowing for diagnostic toggling:
| Var | Default | What it does |
|---|---|---|
TURBO_SPARSE_V | on | Skip V dequant on negligible attention weights |
TURBO_FORCE_4MAG | auto | 4-mag LUT vs 8-LUT for turbo3 attention |
TURBO_FLASH | off | Two-pass fused asymmetric attention (corrupt on Apple10) |
TURBO_LAYER_ADAPTIVE | auto | Per-layer KV quant strategy 0–7 |
On startup, the Metal init logs tell you which of these kicked in. On the M1 Max it prints:
1
2
3
turbo3 using 4-mag LUT (pre-M5 hardware)
turbo3 sparse V dequant enabled (opt-out: TURBO_SPARSE_V=0)
GPU family: MTLGPUFamilyApple7
That Apple7 line is what you want — it means TURBO_FLASH (which is broken on Apple10 / M5 Max) is irrelevant here. M1, M1 Pro, and M1 Max are all Apple7. M2 is Apple8. The fork’s commit log has a Metal patch from a few weeks ago that disabled TURBO_FLASH by default on Apple10 because of corrupt outputs; on Apple7 it was always-off.
Takeaway
For dense decoding on quantized weights, memory bandwidth is the only number that matters at the laptop tier. The M1 Max landed at 10.6 t/s on Qwen3.6-27B Q4_K_M — about 3.5× slower than my 3090, almost exactly what 400 GB/s vs 936 GB/s predicts. Same yaml, no Apple-specific tuning, turbo3 KV cache wins on both speed and memory.
The two non-obvious wins were operational, not architectural: pass -st --simple-io to llama-cli whenever you run it non-interactively against a chat-template model, and trust a while kill -0 shell loop over any monitoring abstraction that thinks for itself.
Written with Claude Opus 4.7 (claude-opus-4-7) via Claude Code.