1.5× Faster Agentic Coding with MTP on Qwen 3.6 27B
Benchmarking Multi-Token Prediction (MTP) on Qwen 3.6 27B via llama.cpp on an RTX 3090 — 1.5× speedup in agentic tool-call chains.
A recent llama.cpp PR added support for Qwen 3.6’s built-in Multi-Token Prediction (MTP) heads. Instead of generating one token at a time, the model drafts and verifies up to 3 tokens per step using its native speculative decoding layers — no external draft model needed.
I benchmarked it against my daily-driver setup (Heretic Q4_K_M, turbo3 KV cache) on an RTX 3090, simulating realistic agentic coding workloads with tool-call chains.
The Setup
A custom llama.cpp build with CUDA on an RTX 3090 24 GB:
| Component | MTP config | Baseline config |
|---|---|---|
| Build | llama.cpp PR #22673 (b100) | llama.cpp turbo3 fork |
| Model | Qwen3.6-27B-MTP-Q4_K_M (16 GB) | Heretic IQ4_XS (13 GB) |
| KV cache | q8_0 | turbo3 |
| Context | 32K | 262K |
| GPU | RTX 3090 (24 GB), single GPU | RTX 3090, single GPU |
| Parallel slots | 1 | 1 |
The MTP model needs a special GGUF with the draft head tensors preserved — standard quants won’t work. I used brittlewis12’s Q4_K_M quant, which loaded 866 tensors (vs ~814 for standard).
Key launch flags:
1
2
3
4
5
6
7
llama-server -m Qwen3.6-27B-MTP-Q4_K_M.gguf \
--spec-type mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--reasoning off \
-np 1 -c 32768 -ngl 99 \
--main-gpu 0 --split-mode none \
--port 8082
Qwen 3.6 27B is a hybrid model — only 16 of 65 layers use traditional KV cache. The other 48 use linear attention with a fixed 898 MiB recurrent state. This means KV memory is ~4× smaller than a standard dense model, which gives more room for speculative decoding overhead.
The Benchmark: 3-Turn Tool Call Chain
I designed a realistic agentic scenario — exploring a codebase, reading files, searching for patterns, and running tests, all using function calling:
Turn 1: “Explore the project structure” → list_directory Turn 2: “Read the main entry point” → read_file
Turn 3: “Search for functions and run tests” → search_code
Each turn appends real tool results back into context, simulating what actually happens in agent loops.
Results
| Turn | Context (tokens) | MTP (tok/s) | Baseline (tok/s) | Speedup | MTP Draft Accept |
|---|---|---|---|---|---|
| T1 | 574 | 47.7 | 31.4 | 1.52× | 69% |
| T2 | 635 | 48.9 | 32.9 | 1.49× | 100% |
| T3 | 799 | 61.4 | 29.7 | 2.07× | 94% |
Every response was a proper tool call — correct function names and arguments. The draft acceptance rate averaged 88% across all turns.
Single-Request Baselines
| Generation length | MTP (tok/s) | Baseline (tok/s) | Speedup |
|---|---|---|---|
| Short (~60 tok) | 49.7 | 34 | 1.46× |
| Long (~340 tok) | 37.7 | 34 | 1.11× |
MTP shines on short generations — exactly what agentic tool calling produces (30-40 tokens per tool call, multiple turns).
Why the speedup isn’t 2.5×
The Reddit post claiming 2.5× was from an M2 Max Mac with Apple Silicon — different hardware, different memory bandwidth profile. On CUDA, the speedup depends heavily on:
- Draft acceptance rate: The MTP draft head is a lightweight predictor. It’s most accurate at the beginning of a generation (100% accept), but degrades to ~65% as context grows.
- The
--spec-draft-n-maxvalue: The sweet spot is 3-4. Higher values draft more tokens but lower acceptance rates (more speculations to verify). - KV cache precision: q8_0 vs turbo3 vs f16 all trade off between memory and quality. The MTP build uses q8_0 — switching to a more efficient format would free VRAM for larger context.
Caveats
- Vision crashes with MTP — reported on the PR thread. If you need multimodal, MTP isn’t an option yet.
- Single slot only — no parallel processing (
-np 1required). - MTP-specific GGUF required — standard quants don’t include the draft head tensors and won’t load with
--spec-type mtp. - VRAM pressure — the MTP model uses ~22 GB on the 3090, leaving almost no headroom. A smaller quant (IQ4_XS) would give space for longer context.
- Model differences — the MTP quant I tested isn’t the same fine-tune as my daily Heretic. A “Heretic + MTP” quant would combine both optimizations.
Is It Worth It?
For agentic coding — yes. Tool calls are short (30-40 tokens) and MTP delivers the best speedup at that range. A consistent 1.5× means a 3-turn conversation finishes in the time a 2-turn one normally takes. Over hundreds of agent iterations in a session, that adds up.
The tradeoff is giving up vision and losing ~4 GB of VRAM headroom. For a dedicated coding agent, that’s a worthwhile exchange.
If you want to try it yourself:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Build llama.cpp with MTP support
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target llama-server
# Download an MTP quant
# Options:
# - brittlewis12 (Q4_K_M, tested here)
# - RDson (IQ4_KS, good VRAM efficiency)
# - llmfan46 (Heretic v2 with native MTP preserved)
# Run
./build/bin/llama-server -m Qwen3.6-27B-MTP-Q4_K_M.gguf \
--spec-type mtp --spec-draft-n-max 3 \
-np 1 -c 32768 -ngl 99 --port 8082
Written with DeepSeek V4 Pro