Qwen 3.6 Dense vs MOE on Local Stack: what MTP actually delivers

Practical comparison between Qwen 3.6 Dense and MOE on an RTX 3090, focused on real throughput by scenario and the practical impact of Multi-Token Prediction in local inference flow.

Instead of discussing “speedup marketing”, I ran the numbers on my actual local lab setup (a dedicated llama-server per port, fixed context, 1 slot, same stack) to choose a daily configuration between:

  • Dense 27B (Qwen3.6-27B-UD-Q4_K_XL)
  • MOE 35B (Qwen3.6-35B-A3B-UD-Q3_K_M)

With and without MTP.

Setup held constant for comparability

  • --batch-size 2048
  • --ubatch-size 512
  • --cache-type-k q4_0
  • --cache-type-v q4_0
  • --fit on
  • --split-mode none
  • --main-gpu 0
  • --flash-attn on
  • --cont-batching
  • --parallel 1
  • --timeout 900

Differences between runs:

  • Dense: --ctx-size 131072
  • MOE: --ctx-size 65536
  • MTP on: --spec-type mtp --spec-draft-n-max N (N = 1 or 2; see the spec-draft-n-max section and the commands below)
  • MTP off: --spec-type none
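
To make it easier to see that only the context size and the speculative flags change between runs, here is a minimal Python sketch that assembles each command line from the shared flags listed above. The helper name `build_command` is hypothetical and the paths are the same placeholders used in the post; this is an illustration of the configuration, not the script the benchmarks were run with.

```python
# Flags shared by every run, copied from the setup list in the post.
SHARED_FLAGS = [
    "--batch-size", "2048",
    "--ubatch-size", "512",
    "--cache-type-k", "q4_0",
    "--cache-type-v", "q4_0",
    "--fit", "on",
    "--split-mode", "none",
    "--main-gpu", "0",
    "--flash-attn", "on",
    "--cont-batching",
    "--parallel", "1",
    "--timeout", "900",
]


def build_command(binary, model_path, ctx_size, spec_flags):
    """Return the argv list for one llama-server run (hypothetical helper)."""
    head = [binary, "--model", model_path, "--ctx-size", str(ctx_size)]
    return head + SHARED_FLAGS + list(spec_flags)


# Placeholder paths, exactly as they appear in the post.
BINARY = "/path/to/llama-server"
MOE_MODEL = "/path/to/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"

moe_mtp = build_command(
    BINARY, MOE_MODEL, 65536,
    ["--spec-type", "mtp", "--spec-draft-n-max", "1"],
)
moe_plain = build_command(BINARY, MOE_MODEL, 65536, ["--spec-type", "none"])

if __name__ == "__main__":
    print(" ".join(moe_mtp))
    print(" ".join(moe_plain))
```

Building both variants from one list keeps the two MOE runs identical except for the trailing speculative flags, which is the property the comparison relies on.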

Numerical result

Short scenario

Model                  Throughput short (tok/s)   VRAM
Dense 27B (MTP)        60                         21.9 GiB
Dense 27B (non-MTP)    36.2                       21.9 GiB
MOE 35B (MTP)          117.9                      18.7 GiB
MOE 35B (non-MTP)      125.5                      19.0 GiB

Medium scenario

Model                  Throughput medium (tok/s)  VRAM
Dense 27B (MTP)        49                         21.9 GiB
Dense 27B (non-MTP)    34.2                       21.9 GiB
MOE 35B (MTP)          129.3                      18.7 GiB
MOE 35B (non-MTP)      93.6                       19.0 GiB

Long scenario

Model                  Throughput long (tok/s)    VRAM
Dense 27B (MTP)        49.0                       21.9 GiB
Dense 27B (non-MTP)    34.0                       21.9 GiB
MOE 35B (MTP)          116.1                      18.7 GiB
MOE 35B (non-MTP)      85.9                       19.0 GiB

Comparability notes:

  • Dense 27B (non-MTP) was re-collected in an isolated campaign (--spec-type none, 2 runs per scenario, 131072 ctx).
  • Even so, long is not apples-to-apples with MOE: contexts differ (Dense 27B 131072 vs MOE 65536).
  • The MTP gain for MOE remains robust in the available scenarios; MOE numbers did not change in this round.
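
With only a couple of runs per scenario, it helps to report the spread alongside the mean. The sketch below shows one way to do that, under two assumptions that are not confirmed by the post: that each server response carries a `timings` object with `predicted_n` and `predicted_ms` fields, and that the first run may be discarded as warmup. The responses in the example are made up for illustration and are not measurements from this lab.

```python
from statistics import mean


def tokens_per_second(response):
    """Generation throughput from one response's "timings" object.

    Assumes the fields predicted_n (tokens generated) and predicted_ms
    (generation time in milliseconds) are present.
    """
    timings = response["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def summarize(runs, warmup=0):
    """Mean and min-max spread of tok/s, skipping the first `warmup` runs."""
    kept = runs[warmup:]
    if not kept:
        raise ValueError("no runs left after discarding warmup")
    return {"mean": mean(kept), "spread": max(kept) - min(kept), "n": len(kept)}


# Made-up responses for illustration only; NOT data from the post.
fake_responses = [
    {"timings": {"predicted_n": 256, "predicted_ms": 2560.0}},
    {"timings": {"predicted_n": 256, "predicted_ms": 2000.0}},
    {"timings": {"predicted_n": 256, "predicted_ms": 2048.0}},
]

runs = [tokens_per_second(r) for r in fake_responses]
summary = summarize(runs, warmup=1)

if __name__ == "__main__":
    print(runs)
    print(summary)
```

If the spread is of the same order as the difference between two configurations, that difference should not be treated as a real effect.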

Honest interpretation of numbers

  • For MOE 35B, MTP improves both the medium and long scenarios:
    • medium: 129.3 vs 93.6 (~+38%)
    • long: 116.1 vs 85.9 (~+35%)
  • In short and medium, variability depends more on warmup jitter than on spec-type, so these values should be read with caution. In short, MOE was slightly slower with MTP (117.9 vs 125.5, ~-6%).
  • The isolated architecture gain (Dense 27B vs MOE 35B) is not apples-to-apples because ctx-size and operational limits differ (131072 vs 65536).
  • The objective of this round was to lock in the practical cost/benefit: MTP remains the practical differentiator for MOE, with --spec-draft-n-max 1 as the stable default.
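
The percentages quoted above can be re-derived from the result tables. This short Python sketch does so for every model and scenario; the numbers are copied from the tables, while the function names are only illustrative.

```python
# Throughput in tok/s copied from the result tables: (with MTP, without MTP).
RESULTS = {
    ("Dense 27B", "short"): (60.0, 36.2),
    ("Dense 27B", "medium"): (49.0, 34.2),
    ("Dense 27B", "long"): (49.0, 34.0),
    ("MOE 35B", "short"): (117.9, 125.5),
    ("MOE 35B", "medium"): (129.3, 93.6),
    ("MOE 35B", "long"): (116.1, 85.9),
}


def mtp_gain_percent(with_mtp, without_mtp):
    """Relative throughput change from enabling MTP, in percent."""
    return (with_mtp / without_mtp - 1.0) * 100.0


def gain_table(results):
    """Map each (model, scenario) to its MTP gain, rounded to one decimal."""
    return {key: round(mtp_gain_percent(*pair), 1) for key, pair in results.items()}


if __name__ == "__main__":
    for (model, scenario), gain in gain_table(RESULTS).items():
        print(f"{model:10s} {scenario:7s} {gain:+.1f}%")
```

On the table values this gives roughly +38% and +35% for MOE medium and long, about -6% for MOE short, and roughly +66%, +43% and +44% for Dense 27B short, medium and long. The caveats about jitter and differing context sizes still apply.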

spec-draft-n-max: 1 or 2?

Both --spec-draft-n-max 2 and 1 were tested on MOE with the MTP build:

  • nmax=2 produced peaks in some samples, but also triggered a recurring draft truncation warning:
    • draft size 2 exceeds max 1, truncating
  • nmax=1 removes this truncation behavior, simplifies operation, and still keeps a strong gain versus non-MTP with better acceptance stability.

Operational conclusion: for daily validation, spec-draft-n-max 1 is the cleanest configuration.
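
To check how often the truncation warning shows up in a captured server log, a small counter like the one below is enough. It is a sketch that assumes the log contains the warning text exactly as quoted above; the sample log is made up for illustration, and how the log is captured (file, journal, terminal redirect) is left open.

```python
import re

# Warning text as quoted in the post; the two sizes are captured as groups.
TRUNCATION_RE = re.compile(r"draft size (\d+) exceeds max (\d+), truncating")


def count_truncations(log_text):
    """Count truncation warnings, keyed by (draft size, max)."""
    counts = {}
    for match in TRUNCATION_RE.finditer(log_text):
        key = (int(match.group(1)), int(match.group(2)))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up log excerpt for illustration only.
sample_log = "\n".join([
    "draft size 2 exceeds max 1, truncating",
    "some unrelated line",
    "draft size 2 exceeds max 1, truncating",
])

if __name__ == "__main__":
    print(count_truncations(sample_log))
```

A count of zero after switching to nmax=1 would confirm that the truncation path is no longer being hit.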

Commands used (summary)

# Dense 27B with MTP
/path/to/llama-server \
  --model /path/to/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --ctx-size 131072 \
  --batch-size 2048 --ubatch-size 512 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --fit on --split-mode none --main-gpu 0 --flash-attn on \
  --cont-batching --parallel 1 --timeout 900 \
  --spec-type mtp --spec-draft-n-max 2

# MOE 35B with MTP
/path/to/llama-server \
  --model /path/to/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf \
  --ctx-size 65536 \
  --batch-size 2048 --ubatch-size 512 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --fit on --split-mode none --main-gpu 0 --flash-attn on \
  --cont-batching --parallel 1 --timeout 900 \
  --spec-type mtp --spec-draft-n-max 1

# MOE 35B without MTP
/path/to/llama-server \
  --model /path/to/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf \
  --ctx-size 65536 \
  --batch-size 2048 --ubatch-size 512 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --fit on --split-mode none --main-gpu 0 --flash-attn on \
  --cont-batching --parallel 1 --timeout 900 \
  --spec-type none

Practical decision

  • If the priority is continuous local production use: MOE + MTP, with --spec-draft-n-max 1, was the best cost/benefit balance.
  • Dense 27B with MTP remains strong as fallback and for larger context usage, with stable operation.
  • MOE without MTP is for technical comparison only, not the default.

Written with GPT-5.5 High

This post is licensed under CC BY 4.0 by the author.