Qwen3.6-27B on an M1 Max: when the laptop config matches the 3090

Benchmarked Qwen3.6-27B Q4_K_M on an M1 Max via llama-cpp-turboquant. The same prod config that runs on my 3090 was already optimal — but llama-cli's interactive auto-mode nearly ate my disk.

Posted Apr 29, 2026

By Jean Brito

7 min read

I have a llama-swap config running Qwen3.6-27B on my RTX 3090 with cache-type-k: turbo3 and a 262k context window. I wanted to know how the same config behaves on an M1 Max with 32 GB of unified memory — the laptop I actually carry around. The short answer: the prod yaml from the 3090 was already the right config for the M1, decode is about 3-4× slower (memory-bandwidth limited, exactly as the math predicts), and llama-cli will gleefully fill your disk with > characters if you don’t pass the right flag.

The setup

The fork is llama-cpp-turboquant, a llama.cpp fork that adds Metal kernels for a custom KV cache type called turbo3 (~4 bpw KV with sparse-V dequant) plus a TQ4_1S weight quantization. On Apple Silicon Metal is enabled by default — building is two cmake commands:

  
cd ~/Github/llama-cpp-turboquant
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

The model: unsloth/Qwen3.6-27B-GGUF, Q4_K_M variant, 16 GB on disk. The repo also ships everything from BF16 splits down to UD-IQ2_XXS, plus three mmproj-*.gguf files for vision input — Qwen3.6 is multimodal. For this run I stuck with text.

One detail I learned the hard way: Qwen3.6-27B isn’t a stock transformer. It’s a hybrid of 48 Gated DeltaNet layers and 16 Gated Attention layers (24 query heads, 4 KV heads, head_dim 256). DeltaNet layers are recurrent — they have no KV cache. So the KV math is not 2 × 64 × kv_heads × head_dim like a generic 27B; it’s 2 × 16 × kv_heads × head_dim. At fp16 that’s about 32 KB per token instead of 131 KB. A 32k context fits in roughly 1 GB of fp16 KV. With turbo3 it’s a quarter of that.

The bench matrix

I wrote a small bench.sh that runs llama-bench -p 512 -n 128 -r 3 -ngl 99 across seven configurations, captures ps snapshots into per-config telemetry files, and ends with a real-inference run via llama-cli. The seven configs varied the KV cache type, flash attention, ubatch size, and the TURBO_SPARSE_V env var.

Here’s what came back, ranked by decode speed:

Config	pp512 t/s	tg128 t/s	Peak RSS
prod-turbo3 (3090 yaml)	114.45 ±5.89	10.26 ±0.02	13.93 GB
baseline-fp16	117.87 ±3.91	10.02 ±0.65	15.92 GB
turbo3-sparse-off	113.50 ±2.68	10.00 ±0.32	14.27 GB
fa-off	112.95 ±10.31	10.07 ±0.10	15.96 GB
kv-q8	103.66 ±6.21	10.03 ±0.16	15.91 GB
kv-q4	113.70 ±13.08	9.44 ±0.18	16.00 GB
ub-1024	109.09 ±10.01	9.70 ±0.46	15.93 GB

A few things stand out:

prod-turbo3 wins both ways: top decode at 10.26 t/s and 2 GB less RAM than fp16. The same yaml I run on the 3090 was already the right call for the M1 Max.
TURBO_SPARSE_V=0 costs ~2.6% on decode (10.26 → 10.00). Tiny but real, and free if you keep it on.
q4_0 KV is a trap. Same RSS as fp16, slowest tg in the matrix. If you want to save KV memory, use turbo3 or q8_0, not q4_0.
q8_0 KV has the slowest prompt eval (103 vs 117 for fp16). Decode is unaffected. Pay for that asymmetry only if you actually need the memory.
Flash attention barely matters at 512 ctx. -fa 0 and -fa 1 are within noise. It would matter much more at long context, which this bench didn’t stress.

The compare with the 3090

The decode ceiling on either machine is memory_bandwidth / model_size. For Q4_K_M Qwen3.6-27B at 16 GB:

M1 Max (32-core GPU bin): 400 GB/s ÷ 16 GB ≈ 25 t/s theoretical, ~10 t/s real
RTX 3090: 936 GB/s ÷ 16 GB ≈ 58 t/s theoretical, ~35-40 t/s real

The ratio between the two real numbers (~3.5×) almost exactly matches the bandwidth ratio (~2.3×) plus the M1’s CPU/GPU dispatch overhead. There is no Apple-specific tuning that breaks past memory bandwidth for dense decoding on Q4. If you want faster, the only levers are smaller weights (lower quant, more aggressive packing) or smaller models — and neither of those is going to land another 3.5× without quality damage.

The good news: the M1 Max is at ~3% sustained CPU during decode. The GPU is doing all the work. Plug it in, plug headphones in, and it’s a perfectly usable assistant.

Where it almost went sideways

Two things cost me real time on this run.

1. llama-cli auto-conversation mode. I kicked off the real-inference run with -no-cnv --no-display-prompt -p "Write a 100-line Python script…" -n 256 and went to grab coffee. Came back to a 213 MB log file containing roughly six million > lines. The process was still alive, generating nothing.

What was happening: when llama.cpp detects a chat template in the GGUF (Qwen3.6 has one), llama-cli auto-enables conversation mode. After the -n 256 tokens are generated, the CLI loops on stdin waiting for the next user turn. With stdin closed (background process), the readline loop just spits the prompt character > into the log forever. -no-cnv alone does not override this on this fork.

The fix is two flags, not one:

llama-cli ... -st --simple-io < /dev/null

-st (--single-turn) actually exits after one generation. --simple-io keeps it from doing terminal magic in non-tty contexts. Both together. With both, the run completes cleanly and prints a built-in summary line:

[ Prompt: 61.6 t/s | Generation: 10.6 t/s ]

2. The watchdog agent. I tried to set up a monitor agent to poll the bench every 45 seconds and write a status snapshot to /tmp/bench_status.md. It self-terminated after one snapshot, decided its job was done, and returned a summary that read like an essay. The lesson, again, is that long-running side processes belong in plain shell loops, not in agent abstractions that have their own ideas about when their work is finished.

The replacement was a single ps sampler:

  
( while kill -0 $PID 2>/dev/null; do
    echo "$(date +%H:%M:%S) $(ps -o rss=,pcpu= -p $PID 2>/dev/null)"
    sleep 2
  done ) > inference.ps.txt &

Boring. Reliable. Six bytes of state instead of a multi-paragraph status report.

TurboQuant knobs worth knowing

The fork exposes a handful of env vars on top of the standard llama.cpp surface. None of them needed touching on the M1 Max — defaults are right — but they’re worth knowing for diagnostic toggling:

Var	Default	What it does
`TURBO_SPARSE_V`	on	Skip V dequant on negligible attention weights
`TURBO_FORCE_4MAG`	auto	4-mag LUT vs 8-LUT for turbo3 attention
`TURBO_FLASH`	off	Two-pass fused asymmetric attention (corrupt on Apple10)
`TURBO_LAYER_ADAPTIVE`	auto	Per-layer KV quant strategy 0–7

On startup, the Metal init logs tell you which of these kicked in. On the M1 Max it prints:

turbo3 using 4-mag LUT (pre-M5 hardware)
turbo3 sparse V dequant enabled (opt-out: TURBO_SPARSE_V=0)
GPU family: MTLGPUFamilyApple7

That Apple7 line is what you want — it means TURBO_FLASH (which is broken on Apple10 / M5 Max) is irrelevant here. M1, M1 Pro, and M1 Max are all Apple7. M2 is Apple8. The fork’s commit log has a Metal patch from a few weeks ago that disabled TURBO_FLASH by default on Apple10 because of corrupt outputs; on Apple7 it was always-off.

Takeaway

For dense decoding on quantized weights, memory bandwidth is the only number that matters at the laptop tier. The M1 Max landed at 10.6 t/s on Qwen3.6-27B Q4_K_M — about 3.5× slower than my 3090, almost exactly what 400 GB/s vs 936 GB/s predicts. Same yaml, no Apple-specific tuning, turbo3 KV cache wins on both speed and memory.

The two non-obvious wins were operational, not architectural: pass -st --simple-io to llama-cli whenever you run it non-interactively against a chat-template model, and trust a while kill -0 shell loop over any monitoring abstraction that thinks for itself.

Written with Claude Opus 4.7 (claude-opus-4-7) via Claude Code.

AI, Hardware

This post is licensed under CC BY 4.0 by the author.