Post

1.5× Faster Agentic Coding with MTP on Qwen 3.6 27B

Benchmarking Multi-Token Prediction (MTP) on Qwen 3.6 27B via llama.cpp on an RTX 3090 — 1.5× speedup in agentic tool-call chains.

1.5× Faster Agentic Coding with MTP on Qwen 3.6 27B

A recent llama.cpp PR added support for Qwen 3.6’s built-in Multi-Token Prediction (MTP) heads. Instead of generating one token at a time, the model drafts and verifies up to 3 tokens per step using its native speculative decoding layers — no external draft model needed.

I benchmarked it against my daily-driver setup (Heretic Q4_K_M, turbo3 KV cache) on an RTX 3090, simulating realistic agentic coding workloads with tool-call chains.

The Setup

A custom llama.cpp build with CUDA on an RTX 3090 24 GB:

ComponentMTP configBaseline config
Buildllama.cpp PR #22673 (b100)llama.cpp turbo3 fork
ModelQwen3.6-27B-MTP-Q4_K_M (16 GB)Heretic IQ4_XS (13 GB)
KV cacheq8_0turbo3
Context32K262K
GPURTX 3090 (24 GB), single GPURTX 3090, single GPU
Parallel slots11

The MTP model needs a special GGUF with the draft head tensors preserved — standard quants won’t work. I used brittlewis12’s Q4_K_M quant, which loaded 866 tensors (vs ~814 for standard).

Key launch flags:

1
2
3
4
5
6
7
llama-server -m Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --reasoning off \
  -np 1 -c 32768 -ngl 99 \
  --main-gpu 0 --split-mode none \
  --port 8082

Qwen 3.6 27B is a hybrid model — only 16 of 65 layers use traditional KV cache. The other 48 use linear attention with a fixed 898 MiB recurrent state. This means KV memory is ~4× smaller than a standard dense model, which gives more room for speculative decoding overhead.

The Benchmark: 3-Turn Tool Call Chain

I designed a realistic agentic scenario — exploring a codebase, reading files, searching for patterns, and running tests, all using function calling:

Turn 1: “Explore the project structure” → list_directory Turn 2: “Read the main entry point” → read_file
Turn 3: “Search for functions and run tests” → search_code

Each turn appends real tool results back into context, simulating what actually happens in agent loops.

Results

TurnContext (tokens)MTP (tok/s)Baseline (tok/s)SpeedupMTP Draft Accept
T157447.731.41.52×69%
T263548.932.91.49×100%
T379961.429.72.07×94%

Every response was a proper tool call — correct function names and arguments. The draft acceptance rate averaged 88% across all turns.

Single-Request Baselines

Generation lengthMTP (tok/s)Baseline (tok/s)Speedup
Short (~60 tok)49.7341.46×
Long (~340 tok)37.7341.11×

MTP shines on short generations — exactly what agentic tool calling produces (30-40 tokens per tool call, multiple turns).

Why the speedup isn’t 2.5×

The Reddit post claiming 2.5× was from an M2 Max Mac with Apple Silicon — different hardware, different memory bandwidth profile. On CUDA, the speedup depends heavily on:

  1. Draft acceptance rate: The MTP draft head is a lightweight predictor. It’s most accurate at the beginning of a generation (100% accept), but degrades to ~65% as context grows.
  2. The --spec-draft-n-max value: The sweet spot is 3-4. Higher values draft more tokens but lower acceptance rates (more speculations to verify).
  3. KV cache precision: q8_0 vs turbo3 vs f16 all trade off between memory and quality. The MTP build uses q8_0 — switching to a more efficient format would free VRAM for larger context.

Caveats

  • Vision crashes with MTP — reported on the PR thread. If you need multimodal, MTP isn’t an option yet.
  • Single slot only — no parallel processing (-np 1 required).
  • MTP-specific GGUF required — standard quants don’t include the draft head tensors and won’t load with --spec-type mtp.
  • VRAM pressure — the MTP model uses ~22 GB on the 3090, leaving almost no headroom. A smaller quant (IQ4_XS) would give space for longer context.
  • Model differences — the MTP quant I tested isn’t the same fine-tune as my daily Heretic. A “Heretic + MTP” quant would combine both optimizations.

Is It Worth It?

For agentic coding — yes. Tool calls are short (30-40 tokens) and MTP delivers the best speedup at that range. A consistent 1.5× means a 3-turn conversation finishes in the time a 2-turn one normally takes. Over hundreds of agent iterations in a session, that adds up.

The tradeoff is giving up vision and losing ~4 GB of VRAM headroom. For a dedicated coding agent, that’s a worthwhile exchange.

If you want to try it yourself:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Build llama.cpp with MTP support
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target llama-server

# Download an MTP quant
# Options:
# - brittlewis12 (Q4_K_M, tested here)
# - RDson (IQ4_KS, good VRAM efficiency)
# - llmfan46 (Heretic v2 with native MTP preserved)

# Run
./build/bin/llama-server -m Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  -np 1 -c 32768 -ngl 99 --port 8082

Written with DeepSeek V4 Pro

This post is licensed under CC BY 4.0 by the author.